01 Mar 2016
I am mentoring up to two projects this summer with Google Summer of Code (GSoC16), following on from the success of last year’s projects. They are described in section 5 of this page on the sponsoring organization’s web site. Thanks to the INCF for their support.
A brief summary of the projects, reproduced from the INCF page:
1. Improving the graphical exploration and analysis environment, Fovea.
Fovea is a user-extensible graphical diagnostic tool for working with complex data and models. It is a library of Python classes built over components of PyDSTool and Matplotlib. Fovea’s present capabilities and applications to neural data and models were developed significantly in GSoC 2015 (see previous blog posts), but there are several directions remaining to develop and explore to maximize the utility and accessibility of this package to less specialized users.
A primary capability of Fovea is to assist the creation of organizing “layers” of graphic data in 2D or 3D subplots that are associated with specific calculations. These layers can be dynamically updated as parameters are changed or as time advances during a time series, and they can be grouped and hierarchically constructed to ease navigation of complex data. Examples of layer usage include displaying:
- Feature representations, such as characteristic threshold crossings of time series, clusters or principal component projections, and domains of an approximation’s error bound satisfaction
- Augmenting metadata, such as vectors showing velocity or acceleration at a specified position or time point during a simulation
- Other diagnostic quantities that can visually guide parameter fitting of a model or algorithm.
In addition, GUI buttons, dialog boxes, and command-line interfacing can provide additional interactivity with the graphical views.
Aims: Depending on student interest and skills, there are four possible directions this project could usefully take:

Redevelop the existing prototype “application” into an actual Django or Flask web-based app running over a local server (maybe using Bokeh);

Extend the development of the existing core tools to work better for dynamical systems applications (especially biophysical models of neural excitability);

Continue to streamline the workflow for constructing new applications by improving core design and adding convenient utilities;

Integrate existing functionality with the literate modeling prototype discussed in Project 2 (maybe in collaboration with a student working directly on that project, if both projects run).
As part of any direction of this project, there might also be opportunity to create innovative new forms of plug-and-play UI components that will assist in visualization and diagnostics of neural (or similar) data.
Skills: Required: Python, Matplotlib, GUI design, documentation tools, working knowledge of calculus and geometry or statistics and data science
Desirable: git; general experience with either model simulation or data science techniques; agile development methodology; technical writing; Django or similar, where applicable; dynamical systems theory, where applicable.
Keywords: Python, data science, GUI, diagnostics
2. “Literate modeling” capabilities for data analysis and model development
A basic workflow and loosely structured package for “literate modeling” has recently been explored. This prototype reuses several PyDSTool classes and data structures, but is intended to work mostly standalone within other computational or analytical frameworks. It is a part-graphical, part-command-line tool to explore and test model behavior and parameter variations against quantitative objective data or qualitative features while working inside a core computational framework (e.g. a simulator), and could work well as an integration with the Fovea package mentioned in Project 1.
“Literate modeling” is a natural extension to “literate programming” and reproducible research practices that is intended to create a rich audit trail of the model development process itself (in addition to recording the provenance of code versions and parameter choices for simulation runs once a model is developed). Literate modeling aims to add structured, interpretive metadata to version-controlled models or analysis workflows through specially structured “active documents” (similar to the Jupyter notebook). Examples of such metadata include validatory regression tests, exploratory hypotheses (as sets of expected features and data-driven constraints), and data-driven contexts for parameter search or optimization. There is presently no other software tool that aims to provide this advanced support for hypothesis generation and testing in a computational setting!
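The prototype’s actual data structures aren’t shown here, but the flavor of the structured metadata described above can be sketched in plain Python. Everything below — the class names, fields, and audit method — is hypothetical and illustrative, not part of PyDSTool or the prototype:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the kind of structured metadata a "literate
# modeling" record might attach to a version-controlled model.  All
# names here are illustrative, not an actual API.

@dataclass
class Hypothesis:
    name: str
    expected_features: dict              # e.g. {"spike_count": 3}
    constraints: dict = field(default_factory=dict)

    def test(self, measured: dict) -> bool:
        # A hypothesis passes if every expected feature matches.
        return all(measured.get(k) == v
                   for k, v in self.expected_features.items())

@dataclass
class ModelRecord:
    model_version: str                   # e.g. a git commit hash
    parameters: dict
    hypotheses: list

    def audit(self, measured: dict) -> dict:
        # Produce a pass/fail audit trail for this model version.
        return {h.name: h.test(measured) for h in self.hypotheses}

record = ModelRecord(
    model_version="abc123",
    parameters={"gNa": 120.0, "gK": 36.0},
    hypotheses=[Hypothesis("fires three spikes", {"spike_count": 3})],
)
print(record.audit({"spike_count": 3}))   # {'fires three spikes': True}
```

The point is that each simulation run gets tested against explicit, version-controlled expectations rather than an analyst’s memory of what the model “should” do.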
Aims: Refine, improve, or redesign existing prototype classes, database structure, and user workflow for the core functions of literate modeling. Add other core functionality to support other workflows. Option to design a browser-based interface to this system using Django or similar technology, or to collaborate on integration with Fovea (see Project 1). Document the tools, including creating a tutorial for the website. Test according to predefined examples that will include manipulations of the Hodgkin–Huxley neural model and other simple dynamical systems.
Skills: Required: Python, git, documentation tools, automatic code generation principles, agile development methodology
Desirable: Pweave/Ptangle or similar; Django, Flask, or similar; SQL; XML; general experience with either model simulation or data science techniques; technical writing; working knowledge of calculus and geometry or statistics and data science
Keywords: Python, data science, literate modeling, productivity
Inquiries and Applications for GSoC 2016
Please contact INCF at gsoc@incf.org so they can give you access to our discussion group for INCF mentors and students. Please prepare a resume and a 1–2 page statement about your specific interests in the work and why you’d make a good fit with the project. You can share that within the discussion group once you’re logged in.
28 Feb 2016
In 2011, I proudly developed and taught a new graduate course at Georgia State University named “Introduction to Modeling for the Life Sciences”. I took a bunch of students from six different academic departments and threw them into the deep end of studying the application of mathematical modeling in the life sciences. And I don’t mean the typical basic stuff you usually show newbies like elementary statistics, linear regression models, or remedial calculus. I wanted to discuss big questions such as:
- How can a non-mathematician tell a possibly good model from a bad one?
- How can a mathematician make sure they are building a reasonable model that will satisfy scientific domain experts?
More specific questions we dug into were:
- How can we assess the modeling methods of the Blue Brain Project?
- What kinds of models of the brain can there even be? And why are there so many?
In reference to the promotional materials I used to attract students past my intimidating-looking course description and reading list, we also considered more cryptic questions such as:
What is the connection between (a) the ability of an F-16 fighter jet to maneuver much more nimbly than an F-4, (b) the possibility that small genetic modifications lead to large differences in phylogenetic outcome within a couple of generations, and (c) the ability of a neural circuit controlling limb motor patterns to switch an animal rapidly between locomotive gaits?
Now, of course the goal was not for my students to really understand these technical papers in depth. But most of these trainee scientists and philosophers had future plans to be involved in science in some way, and as such they needed some preparation in the emerging ways that computing is reshaping scientific understanding. Is there a middle ground for understanding such literature without abandoning it for purely popular-science journalism about the topic?
I recently discovered a dead link on the class page for our crowning achievement: a collaboratively-written document produced by the students, with my help, that summarized their high-level understanding of a pretty complex mathematical model.
My students spent about three weeks dissecting Carson Chow’s influential 2008 paper with Kevin Hall, which uses principles of qualitative dynamical systems in the form of analysis of fixed-point stability and phase-plane diagrams. Nonetheless, the assumptions and derivation of the differential equations are, of course, not at all trivial. I invite you to take a look first – it’s open access.
The document produced by my students was used as a handout to accompany a public lecture given by my friend and colleague, Carson Chow, visiting from the NIH. I’d invited him to speak at GSU on “The Dynamics of Obesity”.
I thought I’d share this handout with you by embedding the pages here.
Now that I’m expending more effort at the interface of technical and non-technical users and readers, this seemed highly relevant. I think it’s a good example of using a mixture of diagrams, math, and prose to explain the technical basis of the work as well as provide adequate context to the assumptions and literature surrounding it.
(Link to the original, high-resolution PDF.)
I’ll let the value of the handout speak for itself. I think the students did a great job of expressing complex ideas without dumbing them down, by organizing information into small, interrelated chunks and making mathematically meaningful diagrammatic representations.
The students graduated from the class with a lot of valuable, broad insights into the world of real-world math modeling. Not everyone has the time or interest to learn dynamical systems modeling to a high degree of technical proficiency. Few articles in popular science magazines dare to delve this deeply into (relatively) complex mathematics when it has clear application in experimental science, but in the right hands I don’t think it’s so difficult to present a fair overview of the ideas.
04 Sep 2015
Today’s post is the second that is guest-contributed by my Google Summer of Code student, Alex Kuefler. He has just finished his project on further developing Fovea, the tool for interactively investigating and analyzing complex data and models. The previous post described setting up Fovea and using it to investigate a PCA of multielectrode spike data. As usual, the latest version of the project source code, including the code for this post’s example, can be found on github.
Introduction
In my previous post, I explored how Fovea can elucidate the process of picking the best number of dimensions to express multivariate data. As a corollary, I provided a bit of information about Principal Component Analysis (PCA) and showed off the geometry of this projection-based dimensionality reduction method. With these basics in place, we can turn to a fun application of PCA (and visual diagnostics) to a real problem in neuroscience. Namely, how do we determine which brain cells emit which spikes if we record a sample containing different signals?
Background
A multielectrode array is a tiny bed of pin-like sensors driven into the soft jelly of the brain. Each pin (electrode) records electrical changes near individual neurons as a numerical value that dips and spikes as the cell hyper- and depolarizes. Given that the array can contain tens or hundreds of electrodes, multielectrode arrays collect matrix data describing voltage at many different points.
Such direct recording of many cells is sometimes likened to lowering a series of microphones into a stadium during a sports game. Whereas low-resolution brain scans (e.g., fMRI, EEG) record the roar of the entire crowd, multielectrode arrays pick out the voices of individual neurons. But even though we can aim our recorders at individuals, we’re still bound to pick up surrounding speakers. Neuroscientists run into a problem identifying Alice the neuron, when Bob and Carol are shouting nearby.
Given waveforms recorded from electrodes at various points in a neural population, spike sorting is the process of determining which action potentials were generated by which neurons. Going from a neural waveform to a set of sorted spikes involves three basic stages:
Spike Detection
Given a continuous waveform, we need to parse which peaks and troughs can really be counted as action potentials.
Feature Extraction
We discriminate between Alice’s and Bob’s voices based on features like pitch, timbre, and word choice. Similarly, we must decide what qualities make one cell’s spikes different from another’s.
Classification
After detecting all the action potentials and building a profile of each one’s features, we need to leverage this information to infer which spikes (instances) were generated by which cells (classes).
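The three stages above can be sketched end-to-end in plain numpy. This is not Fovea’s or spikesort.py’s implementation; the synthetic waveform, threshold value, window size, and the toy classification rule are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic waveform: two spike shapes dropped into noise at known times.
t = np.linspace(0, 1, 20)
template_a = np.exp(-((t - 0.5) ** 2) / 0.005)        # narrow, tall peak
template_b = 0.8 * np.exp(-((t - 0.5) ** 2) / 0.02)   # broad, lower peak
wave = rng.normal(0, 0.05, 2000)
for pos, tmpl in [(200, template_a), (700, template_b),
                  (1200, template_a), (1600, template_b)]:
    wave[pos:pos + 20] += tmpl

# 1. Spike detection: indices where the waveform crosses a threshold
#    upward, merging crossings closer together than one window width.
thresh = 0.4
above = wave > thresh
crossings = np.where(above[1:] & ~above[:-1])[0] + 1
crossings = crossings[np.concatenate(([True], np.diff(crossings) > 20))]

# 2. Feature extraction: cut a fixed window around each crossing, then
#    project the centered windows onto their first two principal components.
windows = np.array([wave[c - 5:c + 15] for c in crossings])
centered = windows - windows.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt[:2].T      # each spike reduced to 2 PC weights

# 3. Classification: a toy rule splitting clusters on the first PC score.
labels = (scores[:, 0] > 0).astype(int)
print(len(crossings), labels.tolist())
```

Real pipelines replace each stage with something far more careful (matched filters, cluster validation, and so on), but the data flow — waveform, to windows, to low-dimensional features, to labels — is the same one the rest of this post walks through interactively.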
Experts have a wealth of filters, classifiers, and numerical techniques for crunching through each step of a spike sort. But I’m not an expert. I like to gain insight about scary algorithms by fiddling with their knobs and seeing what happens. Fortunately, spike sorting can be visualized vividly at every step and Fovea provides the tools to do it.
Setting Up Fovea
The previous post walks through some commands for filling layers with data and slapping them over GUI subplots. The pca_disc.py code described there is an example of working with the vanilla diagnosticGUI (a global, singleton instance simply imported from graphics.py), but for a more complicated spike-sorting program, we’ll want a more personalized GUI. Fovea now supports the option of initializing a new diagnosticGUI to be subclassed by a user’s objects.
class spikesorter(graphics.diagnosticGUI):
    def __init__(self, title):
        plotter = graphics.plotter2D()
        graphics.diagnosticGUI.__init__(self, plotter)
Instead of creating a ControlSys class to be juggled alongside the diagnosticGUI (as in pca_disc), spikesorter is a child of diagnosticGUI and can make use of all the plotting methods and attributes we’ve seen before.
In the initialization, we load in a sample waveform as a PyDSTool Pointset using importPointset() and convert it to a trajectory using numeric_to_traj(). The sample data included in the Fovea repository is a simulated neural waveform (described in Martinez, Pedreira, Ison, & Quiroga) and looks more or less like something an electrode would read out of a brain cell. For our purposes, all we need to know is that our waveform is composed from four different sources: there are two single neurons (Alice and Bob) in the immediate area of the recording channel, a cluster of nearby neurons that can’t be resolved as single cells but whose spikes are still picked up by the electrode (multi-unit activity), and a bunch of extracellular noise to fuzzy things up. The end result is a waveform looking something like this:
Here the y-axis represents voltage and the x-axis is time.
Looking at this plot, we want to determine which peaks came from Alice, which came from Bob, and which are multi-unit activity from the electrode’s periphery, all while barring noise from further consideration.
Spike Detection
Finding true positives means admitting false positives. Such is life. When setting a threshold (i.e., the minimum acceptable height for a peak in the waveform to be considered an action potential) we invariably exclude some interesting spikes from our analysis. Or we err in the opposite direction and admit noise that’s tall enough to ride. Spike detection echoes a recurring theme of the last blog post: data analysis is replete with subjective judgments, but visual diagnostics can help inform us before we make decisions.
Using Fovea’s new callbacks, we can create a threshold line to be translated up and down the waveform, admitting (or dismissing) candidate spikes as it moves along. Pressing “l” will activate line creation mode, at which point a line_GUI object can be drawn on the axes by clicking and dragging. Once created, we can produce a perfectly horizontal threshold that extends the breadth of the data by pressing “m”.
In spikesort.py, this line won’t start detecting spikes until it’s been renamed “thresh”. Since this line_GUI is our currently selected object (having just created it), we can update its name with the following call:
ssort.selected_object.update(name = 'thresh')
At this point, pressing the “up” and “down” arrow keys will translate the threshold up and down the waveform, marking the points where it is crossed along the way.
This threshold line is best thought of as a “knob” on our analysis. We turn the knob in this or that direction and see how the tweaks propagate to our final product.
To implement special behaviors when context objects (like line_GUIs) are translated across the axes, we make use of Fovea’s new “user functions”. When we create a custom diagnosticGUI object (here called spikesorter) through subclassing, we have the option of overriding diagnosticGUI methods by defining functions of the same name. user_update_func is an example of an empty method defined in diagnosticGUI for no other reason than to be overridden. It is called every time a context object is updated. So by defining user_update_func for spikesorter, we can patch in a bit of extra behavior:
def user_update_func(self):
    # We only want the thresholding behavior to occur when 'thresh' is
    # the updated context object.
    if self.selected_object.name == 'thresh':
        # Ensure the threshold line is horizontal by checking for 0 slope.
        if self.selected_object.m != 0:
            print("Make 'thresh' a horizontal threshold by pressing 'm'.")
            return

        # The numeric value of the threshold is 'thresh's y-intercept.
        cutoff = self.selected_object.b
        traj_samp = self.traj.sample()['x']
        r = traj_samp > cutoff
        above_thresh = np.where(r == True)[0]

        spike = []
        spikes = []
        crosses = []

        # Recover list of lists of points on waveform above the threshold line.
        last_i = above_thresh[0] - 1
        for i in above_thresh:
            if i - 1 != last_i:
                crosses.append(i)
                spikes.append(spike)
                spike = []
            spike.append(i)
            last_i = i

        self.traj_samp = traj_samp
        self.crosses = crosses
        self.spikes = spikes

        self.plotter.addData([self.crosses, [cutoff]*len(self.crosses)],
                             layer='thresh_crosses', style='r*',
                             name='crossovers', force=True)
        self.show()
Once the threshold is positioned where we want it, the “d” key will detect spikes by locating each local maximum to the right of a crossover. These maxima are the peaks of the action potentials and each is centered in its own box_GUI object created by spikesort.py.
Unlike the “l” hot key, “d” is specific to our application. By writing a key handler method and attaching it to the canvas with mpl_connect, it is easy to patch in new hot keys like this one. However, you must be careful not to reuse function names from Fovea, as they will be overridden. The following method is named ssort_key_on (as opposed to key_on, used in diagnosticGUI) so that new key presses will be added to the old library of hot keys, rather than replacing it:
def ssort_key_on(self, ev):
    self._key = k = ev.key  # keep record of last keypress
    fig_struct, fig = self.plotter._resolveFig(None)

    if k == 'd':
        # Draw bounding boxes around spikes found in user_update_func.
        spikes = self.spikes
        self.X = self.compute_bbox(spikes)
        self.default_colors = {}

        # Draw the spike profiles in the second subplot.
        if len(self.X.shape) == 1:
            self.default_colors['spike0'] = 'k'
            self.addDataPoints([list(range(0, len(self.X))), self.X],
                               layer='detected',
                               style=self.default_colors['spike0']+'-',
                               name='spike0', force=True)
        else:
            c = 0
            for spike in self.X:
                name = 'spike' + str(c)
                self.default_colors[name] = 'k'
                self.addDataPoints([list(range(0, len(spike))), spike],
                                   layer='detected',
                                   style=self.default_colors[name]+'-',
                                   name=name, force=True)
                c += 1

        self.plotter.auto_scale_domain(xcushion=0, subplot='12')
        self.show()
Each box_GUI captures 64 milliseconds of neural data by default, but this value can be changed in the spikesort GUI itself. Just press “b” to create your own box_GUI, give it the name “ref_box”, and the program will use its width as the new size for detecting spikes. This trick can be used in conjunction with the toolbar zoom to make very narrow bounding boxes to fit your detected spikes.
The boxes’ contents are shown in the second (“detected spikes”) subplot, as a succession of spike profiles laid on top of each other. This plot will give you a general sense of the action potentials’ shapes (and their diversity), but it’s not yet clear how to sort them.
Feature Extraction
Any given object of study can be decomposed into features in an infinite number of ways. A birdwatcher confronting a strange specimen can consider anything from the animal’s tail plumage, to its birdsong, to the motion of its flight in order to identify it. There is no single “right” feature or set of features to look at, because the bird is a whole of many parts. Nevertheless, selecting some attributes over others and describing experimental objects as vectors across these attributes can be very useful. This process, called feature extraction, discretizes nature and helps us make decisions in the face of ambiguity.
Traditionally, researchers use their domain knowledge of the subject they study to select the attributes they think are most informative. For neuroscientists, such features might be the duration or the amplitude of a spike. But Alice and Bob are closely situated in the brain. Their signals are visually abstract and more similar to one another than the appearances of crows and condors. In other words, eyeballing these spike profiles and pointing to different qualities that “seem useful” for grouping them could easily lead us astray. Is there some hidden set of features that will give us a better handle on microelectrode data? If so, how do we find it?
Alarm bells should now be ringing in the heads of PCA enthusiasts. In the last post, I described the dilemma of a psychologist who wants to find the minimum, component feelings one might use to describe all emotions. Once this set is found, any complex emotion might be described as a weighted sum of these components (e.g., “nostalgia” might decompose to 0.8*sorrow + 0.7*happiness – 0.3*anger
). PCA comes up with these components by finding a set of orthogonal directions through the data that capture the greatest amount of variance. Here, “orthogonal” essentially means that no component contains even a smidgen of another component (e.g., happiness is zero parts anger and zero parts sorrow) and “greatest amount of variance” means no other components can be composed to give a more accurate reconstruction of the data than those found by PCA (assuming the components are orthogonal). By using components as our attributes, PCA acts as an automatic feature extractor driven by mathematics, rather than intuitions.
We neuroscientists with our spikes are now in an analogous situation to the psychologist. The difference here being the components PCA finds for our data will be a little more abstract and a little less amenable to an English-language description. Fortunately, we don’t need English-language descriptions of the components, because we have Fovea: We can visualize what spike components look like in the third subplot.
Pressing “p” after detecting a set of spikes will cause spikesort to carry out PCA on the n x m observation matrix whose rows are the waveform segments detected with the “d” keypress, and whose columns are the current voltages of every spike at a given timestep. PCA gives us an m x 3 projection matrix, whose columns are the principal components (PC). These PCs are displayed in the third subplot in red, green, and yellow respectively. Meanwhile, the spikes projected onto the first and second PC are shown in the fourth subplot.
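As a sanity check on the shapes involved, here is a numpy sketch of that step: an n x m observation matrix of spikes reduced to an m x 3 projection matrix via SVD, with each spike recoverable as a weighted sum of PCs. The synthetic spikes below are illustrative stand-ins; none of this uses Fovea’s actual code:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 40, 64                      # 40 detected spikes, 64 samples each
t = np.linspace(0, 1, m)
# Each synthetic spike: a random mix of two bump shapes plus small noise.
shapes = np.stack([np.exp(-((t - 0.4) ** 2) / 0.01),
                   np.exp(-((t - 0.6) ** 2) / 0.03)])
obs = rng.normal(0, 1, (n, 2)) @ shapes + rng.normal(0, 0.01, (n, m))

# PCA via SVD of the centered observation matrix.
centered = obs - obs.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
proj = Vt[:3].T                    # m x 3 projection matrix (the PCs)
weights = centered @ proj          # n x 3: each spike's PC weights

print(proj.shape, weights.shape)   # (64, 3) (40, 3)

# Reconstructing a spike from its weights recovers it almost exactly,
# since nearly all the variance here lies in the first components.
recon = weights[0] @ proj.T + obs.mean(axis=0)
print(np.allclose(recon, obs[0], atol=0.1))
```

The `weights` array is exactly what the fourth subplot draws: each spike plotted as a point whose coordinates are its weights on two chosen PCs.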
The important thing to notice about the PCs in the third subplot is that they look a bit like action potentials. Sure, no single one of them seems quite like it would fit in with the spikes of the “detected” plot. But it’s easy to imagine how, say, hammering out a few bumps in the red PC or flipping the green PC upside down and merging them together might give us one of the black spike profiles shown. Indeed, this is what the projection step of PCA is doing. Recall that each observed instance (a spike) is a weighted combination of the components (e.g., spike1 = 0.5*pc1 – 0.8*pc2 + 0.2*pc3
). These weights stretch and flip the PCs in the manner described above before merging them through summation. In this sense, a PC might be described as a “proto-spike”: the wavy red, green, and yellow lines don’t correspond to anything we’ve really recorded the brain doing, but they can be combined in different ways to generate any of the real spikes we did observe.
The fourth subplot, “Projected Spikes”, can be interpreted as a graph of the weights used to scale the PCs. In the image below, we use Fovea’s new picker tools to select a data point in the fourth subplot, which highlights its corresponding detected spike as well. The chosen point is at approximately -230 on the x-axis (the red PC) and +40 on the y-axis (the green PC).
In other words, this spike contains a LOT of the first PC (except flipped upside-down, as the weight is negative), and a fair amount, but substantially less, of the second PC. This finding accords with intuition, as the big peak of the red PC looks like an inversion of the spike’s V-shaped refractory period (see my overlaid box A, which is not part of the Fovea output), and they both share a trough or hill at the end (see: B). Similarly, the flat onset and high peak of the spike resemble the first third of the green PC (see: C).
You may also notice that only the first and second PCs are fully drawn in, whereas the third is dotted out. This indicates the first and second PCs are being used as the axes of the fourth subplot. By clicking on the dotted PC, the user can change the projection axes of the “Projected Spikes” subplot. Below we see the same selected spike, but shown on new axes:
Classification
Two differences should stand out when projecting onto the first/second PCs rather than the second/third PCs. First, the data projected onto the first/second PCs should have a greater tendency to cluster. Second, when we introspect these clusters, they should tend to be more cohesive than groupings that show up when projecting onto the second/third PCs. In other words, if we look at only those detected spike profiles that go along with a given cluster, they should all look pretty similar to one another. We use a combination of key presses and context objects to facilitate this process. Pressing “b”, we create box_GUIs in the fourth subplot to fit around potential clusters of data points. When one of these bounding boxes is selected (in bold), we can then press “1”, “2”, or “3” to color the points in that box red, green, or blue, respectively (“0” returns them to black). The detected spike profiles will receive the new colors as well. Below, we see a few clusters grouped in this way:
Here, four distinct clusters have been picked out and they’re all fairly easy to describe. The red spikes look like “classic” action potentials, characterized by a flat onset, high peak, and slow return to baseline. The green spikes start out with a sharp V-shape and have a distinctive, yet lower peak. The blue spikes are characterized by single, smooth swells. And finally, the black spikes always manage to cross the mean of the data, and are perhaps just noise or multi-unit activity.
On the other hand, projecting data onto the third/second PCs is less clean. Not only do the clusters seem less distinctive, but the spike profiles they group together don’t appear to have very much in common:
My placement of rectangular bounding boxes using Fovea’s “b” command in this projection was more arbitrary, as these clusters weren’t as clearly separated. Although some of the colorations accord with what we’d expect, in this projection the difference between the high-peaked “classic” spikes and the multi-unit noise we picked out from the previous example doesn’t come across (both are painted in black and belong to the large, blurry cluster in the middle). However, this projection isn’t entirely without merit. Consider the blue and green spikes. In the previous example, these were all bunched together into the same group. But it’s clear from this projection that although both sets have a distinctive V shape, for one group they precede the peak, and for the other, they follow it.
Although the literature on classification algorithms is extensive, the fact that we can turn the different “knobs” provided by Fovea and come up with unexpected results suggests the benefits of visualization for exploratory analysis.
Conclusion
In addition to exploring a real-world use case, this post lays out lots of new Fovea tools resulting from our work during Google Summer of Code 2015. We’ve seen how subclassing diagnosticGUI can produce more complicated programs and let us tailor Fovea’s behavior with user function overrides. We also took a look at built-in key presses, picking selected objects with mouse clicks, and some applications for the line_GUI and box_GUI context objects.
Links
Realistic simulation of extracellular recordings
Martinez, Pedreira, Ison, & Quiroga’s simulated data
Helpful paper on spike sorting by Michael S. Lewicki
29 Jun 2015
Today’s post is guest-contributed by my Google Summer of Code student, Alex Kuefler. He has been busy developing Fovea, the new tool for interactively investigating and analyzing complex data and models, and has a demo in the context of data dimensionality reduction to show off some new features and possibilities. It also acts as a first tutorial in how you can set up your own Fovea-based project, about which there will be more. We have recently improved support for Python 3, and the latest version can be found on github.
Introduction
If you’ve worked with data, you’ve probably heard of principal component analysis. PCA is one of the most widely used techniques for preprocessing high dimensional data, but for the mathematically illiterate, it can be something of a black box. This tutorial won’t show you the algebra going on inside (for that, I highly recommend this tutorial by Jon Shlens), but it will build up some geometric intuitions about what PCA is doing and demonstrate how Fovea can be used to gain such insights.
Let’s start off in everyone’s three favorite dimensions (x, y, and z) before working our way up. Consider the following dataset:
Although I’ve taken the trouble to draw this data disc in the volume of a cube, I really didn’t need to. All the action is happening across the x-y plane, with the z-axis contributing nothing to our understanding – I may as well have just appended some arbitrary coordinate to every point making up an otherwise interesting 2D object. In fact, this is exactly what I did.
Using PyDSTool’s synthetic_data tool (imported as sd) and numpy’s concatenate function:
pts = sd.generate_ball(120, 2, 10)
pts = np.concatenate((pts,np.zeros((120, 1))), axis=1)
The end result is a list of points looking something like this:
[[ 5.94061792  3.76256284  0.        ]
 [-0.86465872  3.35053331  0.        ]
 [ 3.4455166   2.50427016  0.        ]
 [ 7.75243164  5.58465333  0.        ]
 [ 2.11125897  3.93196272  0.        ]
 [-5.87175514  5.98995138  0.        ]
 [-1.12449966  2.32071201  0.        ]
 [ 2.0811796   0.78464252  0.        ]
 [ 5.65265684  0.11505331  0.        ]
 [ 3.43255688  3.83145878  0.        ]]
One easily spots the slacking z-coordinate. In a perfect world, dimensionality reduction would consist of nothing more than tossing out the zeroes. But when the data have been rotated, or noised up, the problem becomes more apparent. By generating some random angles and composing the rotate_x(), rotate_y(), and rotate_z() functions (in addition to translate()) in pca_disc, like so:
trans_am = 10
trans_ax = 1

X = [[], [], []]
for i in range(3):
    X[i] = rotate_z(rotate_y(rotate_x(translate(pts, trans_ax, trans_am),
                                      random.uniform(0, 2*np.pi)),
                             random.uniform(0, 2*np.pi)),
                    random.uniform(0, 2*np.pi))
We end up with a few sets of points that still sit on 2D discs, but appear to make use of three dimensions.
[[ 8.2701649 6.54407486 9.43685818]
[ 10.88101268 8.97529237 8.31551898]
[ 10.87652499 10.02230807 3.48266113]
[ 4.70195135 4.70376686 5.67127247]
[ 14.28294606 12.28493836 5.26297749]
[ 7.14308586 6.18479583 2.17252043]
[ 11.8558326 9.95023447 7.14259082]
[ 10.0219924 8.51815475 4.83660148]
[ 12.01924573 11.11940641 4.34387331]
[ 13.73158322 12.05803314 2.28344063]]
Although these data have tricked the x, y, and z axes, visualization makes it apparent that we’re expressing our data in more dimensions than we need.
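pca_disc’s rotation and translation helpers aren’t reproduced in this post; assuming they apply standard rotation matrices, minimal versions might look like the following sketch (rotate_x() and rotate_y() would be analogous, acting about the other two axes):

```python
import numpy as np

def rotate_z(pts, angle):
    """Rotate an (n, 3) array of points about the z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return pts @ R.T

def translate(pts, axis, amount):
    """Shift every point by `amount` along the coordinate with index `axis`."""
    out = np.array(pts, dtype=float)
    out[:, axis] += amount
    return out
```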
PCA’s job is to throw out the x, y and z axes and come up with a new set of axes that line up better with the data. In other words, PCA produces an ordered set of orthogonal vectors (called the principal components or PC’s), such that the first PC points in the direction of greatest variance in the data, the second in the direction of second greatest variance (constrained to be orthogonal to the first), and so forth. It is perfectly valid to come up with a new set that has the same number of vectors as the old one, thus capturing ALL the data’s variance. But the real fun happens when we ask ourselves if we can get away with having fewer axes (dimensions) than we started with.
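The procedure just described can be condensed into a few lines of plain numpy (an illustration of the idea only; the tutorial itself uses PyDSTool’s doPCA, shown later):

```python
import numpy as np

def pca(X, new_dim):
    """Return (projection of X onto its first new_dim PCs,
    fraction of total variance captured by each PC)."""
    Xc = X - X.mean(axis=0)                    # center the data
    # Right singular vectors of the centered data are the PCs,
    # ordered by decreasing variance captured.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_frac = s**2 / np.sum(s**2)
    Y = Xc @ Vt.T[:, :new_dim]                 # project onto the PCs
    return Y, var_frac
```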
In pca_disc, it is apparent that the two measly a and b axes can capture the same amount of “interesting stuff” as all three x, y, and z axes. To convince ourselves, we’ll need some way to rapidly explore how our data look embedded in high-dimensional space and how they look after projection onto axes a and b. Together, Fovea layers and matplotlib’s subplots provide the needed utility.
Setting Up Fovea
Our goal is to set up some nice images showing the “before and after” of a dimensionality reduction as subplots in a figure window. But first we need to decide what kinds of information we want populating our subplots. For the purposes of this tutorial, the layers will correspond to different rotations of our disc and contain four different, but associated, groups of data (before-PCA data, after-PCA data, variance captured by PCA, and the PC’s themselves). Out in the wild, the user may want to reserve each layer for different clusters of a single dataset or different datasets entirely (the possibilities are endless). We can set aside some names and appropriate colors for our layers in a couple of lists:
rot_layers = ['rot1', 'rot2', 'rot3']
rot_styles = ['r', 'g', 'b']
And the layers can be added to the plotter like so:
plotter.addLayer('orig_data')
for rot in rot_layers:
    plotter.addLayer(rot)
I’ve also included a fourth layer, orig_data, to house the original data disc prior to any rotations. But before we can start populating rot1, rot2, and rot3 with data, we need a figure window and some axes over which our layers will be displayed. pca_disc provides a convenient function, setupDisplay(), which accepts a couple of lists of strings (i.e., rot_layers and rot_styles) and arranges a figure with three subplots:
plotter.arrangeFig([1,3], {
    '11':
        {'name': 'BEFORE',
         'scale': [(-10,10),(-10,10)],
         'layers': rot_layers+['orig_data'],
         'axes_vars': ['x', 'y', 'z'],
         'projection': '3d'},
    '12':
        {'name': 'AFTER',
         'scale': [(-20,20),(-20,20)],
         'layers': rot_layers,
         'axes_vars': ['a', 'b']},
    '13':
        {'name': 'Variance by Components',
         'scale': [(0.5,10),(0,1)],
         'layers': rot_layers,
         'axes_vars': ['x', 'y']},
    })

gui.buildPlotter2D((14,6), with_times=False)
The first subplot (entitled “BEFORE”) will be home to the original 3D disc dataset, some different rotations of the same disc, and a view of the new axes that PCA discovers for each rotation. Note that we set the ‘projection’ parameter for this subplot to ‘3d’ and supply three axes_vars instead of two. This feature is new to Fovea, and allows for provisional plotting of 3D data (with associated perks, like being able to rotate the axes by clicking and dragging). The second plot (“AFTER”) will display a 2D view of how each rotated disc looks after having been projected onto its corresponding PC’s (shown in “BEFORE”). Bear in mind that, for the purposes of this user story, two PC’s will always be used to project the “AFTER” data, even if a single PC would be more appropriate (say, after stretching the disc far across the x axis). When we move to more dimensions, we’ll also continue using only 2 axes for the “AFTER” plot, but you could make this plot 3D as well, if PCA discovers 3 strong components.
As we move to more complicated data sets, it will also be important to assess how much the data are spread along each axis (the variance). Knowing the variance captured by each PC will later let us decide if that dimension is worth keeping around. So we will also set aside a third subplot, ‘Variance by Component’, where this information will eventually be stored. Notice also that every subplot will contain a bit of data from every layer, so the ‘layers’ parameter for each one is simply the list rot_layers (with orig_data appended for the first subplot).
Looping through our list of layers, we can now perform PCA on each rotated dataset using pca_disc’s compute() function. compute() makes use of doPCA() – an aptly named function for doing PCA – included in PyDSTool’s data analysis toolbox (imported as da). You may also wish to create a pcaNode object directly with the Python MDP package (which da imports). MDP includes a method for retrieving the data’s projection matrix, whose columns are the principal components we’re after. Another handy attribute defined for pcaNodes is simply named d, which holds the list of eigenvalues corresponding to the principal components (which we can use to account for how much variance each new dimension explains).
# Create a pcaNode object.
p = da.doPCA(X, len(X[0]), len(X[0]))

# Columns of the projection matrix are the PCs.
pcMat = p.get_projmatrix()

# Store pairs of +/- column vectors, convenient for plotting as line endpoints.
pcPts = np.concatenate(([pcMat.transpose()[0,:]*15],
                        [-pcMat.transpose()[0,:]*15]), axis=0)
for j in range(1, new_dim):
    pcPts = np.concatenate((pcPts, [pcMat.transpose()[j,:]*15],
                            [-pcMat.transpose()[j,:]*15]), axis=0)

# Get data projected onto the 'new_dim' number of PCs.
Y = p._execute(X, new_dim)
compute() then adds the data (both low- and high-dimensional), the PCs, and a plot of variances to their respective layers, and makes use of the optional subplot parameter to ensure each group of data ends up on the axes assigned to it.
# Create plot of high-dimensional data.
plotter.addData([X[:,0], X[:,1], X[:,2]], layer=layer, style=style+'.',
                subplot='11')

# Add PC lines to the same subplot as the high-dimensional data.
for j in range(0, len(pcPts), 2):
    plotter.addData([pcPts[j:j+2,0], pcPts[j:j+2,1], pcPts[j:j+2,2]],
                    layer=layer, style=style, subplot='11')

# Create plot of low-dimensional data.
plotter.addData([Y[:,0], Y[:,1]], layer=layer, style=style+'.', subplot='12')

# Create line plot for fraction of variance explained by each component.
plotter.addData([range(1, len(p.d)+1), p.d/sum(p.d)], layer=layer,
                style=style+"o", subplot='13')
At the end of compute(), we loop through all the layers we’ve added and turn off their visibility as a single group (another new feature provided by Fovea), so we have a clean slate when we start interacting with our GUI.
for layer in rot_layers:
    plotter.setLayer(layer, figure='Master', display=False)
Exploratory Analysis
At this point, we can explore our data by clicking and rolling the 3D axes and changing the visibility of the various layers using setLayer(). But to make things more user-friendly, it is better to set up some callbacks. For now, this function will respond contextually by cycling through the different rotations when the arrow keys are pressed, hiding the original data when h is pressed, and displaying all rotations in response to m. It can be easily connected to our GUI’s instance of masterWin with mpl_connect:
c = 0
def keypress(event):
    global c
    if event.key == 'right':
        c += 1
    if event.key == 'left':
        c -= 1
    if event.key == 'left' or event.key == 'right':
        for layer in rot_layers:
            plotter.setLayer(layer, figure='Master', display=False)
        plotter.toggleDisplay(layer=rot_layers[c % 3], figure='Master')
    if event.key == 'm':
        for layer in rot_layers:
            plotter.toggleDisplay(layer=layer, figure='Master')
    if event.key == 'h':
        plotter.toggleDisplay(layer='orig_data', figure='Master')
    plotter.show(rebuild=False)

gui.masterWin.canvas.mpl_connect('key_press_event', keypress)
After a bit of playing around with our visualization, a few things should become apparent.
First, each rotation’s PC’s line up beautifully with the data, as one would expect from this surrogate dataset. If you rotate the data disc to be edge-on, you’ll see that datapoints and PC’s all fall in a straight line. If we want to fuzzy-up the disc with the noise() function provided in pca_disc, we may lose our straight line, but the PC’s will still point across the breadth of the data, as we would hope.
Second, the subplot “AFTER” projection looks quite a bit like the “BEFORE” image in the 3D axes. We may not be able to swivel it around, but if you turn the high-dimensional data so that its flat face is facing toward you, you will have effectively recreated the “AFTER” plot in the “BEFORE” axes (you may need to use the zoom widget to correct the scales).
The takeaway message here is that PCA is ultimately just a linear projection (or flattening) of data in one space into some other space. Nothing is intrinsically different – it is merely squished. However, when we start dealing with more dimensions (and we’re forced to view 2D/3D projections of bigger data) the myriad projections we choose to get an impression of the data may all look quite different. In such cases, the ability to compare before and after images (perhaps plotting different projections of the data rather than rotations) can be instrumental.
And third, the “variance by components” plot is very boring. Because we’re plotting an almost perfect disc, the “direction of greatest variance” we use as the first PC has only marginally more variance than the second component. We can spice it up by using pca_disc’s stretch() function to skew the data and watch as the graph becomes less even. But as always, the real intrigue happens when we add more dimensions.
High-Dimensional Data
For most realworld datasets, dimensionality reduction involves sacrifice. Every principal component captures a bit of variance and somewhere along the line, the user must make a judgment call about how much of that variance it is okay to throw out if it means getting to weasel into a lower dimension. The subjectivity of this process is why visual diagnostics tools can play such an important role in dimensionality reduction. Good analyses rely on the ability of informaticians to make informed decisions and Fovea is a great informer. By interacting with data, testing different candidate dimensionalities, and observing the associated “variance payoffs”, users can make better choices for a dimensionality reduction.
Consider how our workflow changes as we move into higher dimensions. Imagine you are a psychologist acutely interested in how frequently members of 3 different populations (children, adolescents, and adults) experience 6 different emotions (joy, sorrow, jealousy, malice, fear, and pride) throughout the day. For convenience, we’ll pretend also that each population’s data can be modeled perfectly as a normally-distributed hypersphere in 6D emotion-space. Now instead of having three rotations of a disc shown in three different colors, our dataset will consist of three distinct hyperspheres (created with synthetic_data()) representing the emotional diversity of children (red), adolescents (green) and adults (blue):
dim = 6  # six emotion variables, as described in the text

# Create and stretch different hypersphere "clusters":
X1 = translate(stretch(stretch(sd.generate_ball(133, dim, 10), 0, 1.4), 1, 1.4), 0, 25)
X2 = translate(sd.generate_ball(110, dim, 10), 1, 20)
X3 = translate(noise(sd.generate_ball(95, dim, 10), 2, 0.6, 0, 2), 2, 15)
X = [X1, X2, X3]

clus_layers = ['clus1', 'clus2', 'clus3']
clus_styles = ['r', 'g', 'b']
Notice that I’ve also tinkered with the number of datapoints making up each hypersphere (the sample size of that population) and applied pca_disc’s stretch(), translate() and noise() functions to give each cluster its own character, as we would expect from three different clusters of a real dataset.
Let’s say we suspect our choice of 6 emotions was somewhat arbitrary. Do we really need to give “joy” and “pride” their own axes, or should reports of those feelings all be lumped into a single variable (happiness)? For that matter, don’t “jealousy” and “malice” belong to the same underlying emotion (contempt) as well? PCA is designed to settle questions like these.
But a problem arises when the best set of axes PCA does identify outnumbers the 3 dimensions visible to us mortals. We may have narrowed our 6 emotions down to a more manageable list of 4 (happiness, contempt, fear, sorrow), but that’s little solace if we wish to plot our new dataset in 2D or 3D. However, we may still get a loose impression of how 4D or 6D data might look in its natural state by projecting it onto 2 or 3 orthonormal vectors. Methods like Linear Discriminant Analysis can be used to come up with “projection vectors” that yield particularly informative projections of high-dimensional data. But purely for the sake of demonstration, we’ll content ourselves with any 2 or 3 orthonormal projection vectors that transform our 4D (“AFTER”) and 6D (“BEFORE”) data into 2D and 3D point clouds that can be drawn in our existing subplots. Two arbitrary orthonormal projection vectors can be found by applying the QR factorization to a random matrix, as in pca_disc’s function ortho_proj_mat():
def ortho_proj_mat(n, m):
    Z = numpy.random.rand(n, m)
    Q, R = numpy.linalg.qr(Z)
    return Q
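A quick sanity check illustrates why QR works here: the columns of Q come out orthonormal, so they can serve directly as projection vectors:

```python
import numpy

def ortho_proj_mat(n, m):
    Z = numpy.random.rand(n, m)
    Q, R = numpy.linalg.qr(Z)
    return Q

Q = ortho_proj_mat(6, 2)                # two orthonormal vectors in 6D
# Columns are unit length and mutually orthogonal:
assert numpy.allclose(Q.T @ Q, numpy.eye(2))
# Projecting 6D points down to 2D is then a single matrix product:
pts2d = numpy.random.rand(100, 6) @ Q   # shape (100, 2)
```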
However, when dealing with 6D data, there is more leeway to select how many axes we want our new space to have. For this particular example, the third subplot reveals a sharp gap between the variance explained by the first 2 PCs and the remaining 3 for the X1 (red) dataset.
If this were real psychological data, we might conclude that we were too hasty to settle on four new dimensions, when what we really needed was just a “positive emotion” axis and a “negative emotion” axis. Then again, even if two vectors explain a big portion, we would still be throwing out a lot of variance if we ignore the other 3 completely. In other words, the process of picking the best dimensionality is nondeterministic, context-dependent, and leaves room for trial and error.
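One common way to formalize that judgment call is a cumulative-variance cutoff; a minimal sketch (the function name and the 90% default are illustrative, not part of pca_disc):

```python
import numpy as np

def choose_dim(eigenvalues, threshold=0.9):
    """Smallest number of PCs whose cumulative variance fraction reaches
    `threshold`; `eigenvalues` could be p.d from an MDP pcaNode."""
    frac = np.asarray(eigenvalues, dtype=float)
    frac = frac / frac.sum()          # fraction of variance per PC
    cum = np.cumsum(frac)             # cumulative fraction
    return int(np.searchsorted(cum, threshold) + 1)
```

With eigenvalues like [9, 9, 1, 1], the first two PCs already carry 90% of the variance, so an 80% cutoff keeps only two dimensions.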
Wouldn’t it then be nice if we could manually select the number of PCs we want to project our data onto and explore how our choice affects the “AFTER” image? By including “up” and “down” arrow clicks in our keypress() function, we can do just that. Each time we press the “up” key, another PC is added to the set of axes onto which the “BEFORE” data are projected (thus changing the appearance of the “AFTER” plot). Each time we press “down”, a PC is removed. And if the current number of selected PC’s exceeds two, the code swoops in with our arbitrary, orthonormal, projection vectors to display the data in the second subplot.
However, as we flit around dimensions, generating different “AFTER” data, and projecting what we find onto our random QR-vectors, we end up with a lot of different variables to keep track of. For instance, even if our projection vectors for displaying 4D data are chosen arbitrarily, we still want them to be the same vectors if we hop over to 5D-land for a bit, then come back. Otherwise, viewing the different plots produced by toggling through different dimensions is about as informative as randomly generating the data each time.
Fortunately, Fovea’s object-oriented design makes it easy to add new classes that store and reuse attributes during interactive plotting sessions. Reworking a bit of code from Benjamin Root’s Interactive Applications Using Matplotlib, we can come up with a control system class, ControlSys, whose fields include:
- d (the number of PC’s we wish to start with)
- c (a counter for keeping track of which hypersphere is on display)
- proj_vecsLO (two orthonormal vectors for projecting post-PCA data)
- proj_vecsHI (three orthonormal vectors for projecting pre-PCA data)
This class also includes a modified version of our function, keypress(), which now cycles between layers/clusters (left and right), recomputes PCA in different dimensions (up and down), and changes visibilities (m and h).
def keypress(self, event):
    if event.key == 'left' or event.key == 'right':
        if event.key == 'left':
            self.c += 1
        else:
            self.c -= 1
        for layer in self.clus_layers:
            plotter.setLayer(layer, display=False)
        plotter.toggleDisplay(layer=self.clus_layers[self.c % len(self.clus_layers)])

    if event.key == 'm':
        for layer in self.clus_layers:
            plotter.toggleDisplay(layer, figure='Master')

    if event.key == 'h':
        plotter.toggleDisplay(layer='orig_data', figure='Master')

    if event.key == 'up' or (event.key == 'down' and self.d != 2):
        if event.key == 'up':
            self.d += 1
        elif event.key == 'down':
            self.d -= 1
        print("Attempting to display", self.d, "dimensional data...")
        for i in range(len(self.clus_layers)):
            self.data_dict = compute(self.data[i], self.d, self.clus_layers[i],
                                     self.clus_styles[i], self.proj_vecsLO,
                                     self.proj_vecsHI)
        self.highlight_eigens()
        gui.current_domain_handler.assign_criterion_func(self.get_projection_distance)

    plotter.show(rebuild=False)
Note also the print statement and the highlight_eigens() function, called when the up or down keys are pressed. Each cues the user to the effects of their change in dimensionality. In the case of the print statement, this cue is text. In the case of highlight_eigens(), vertical bars are drawn through each component currently in use in the “Variance by Components” plot. Another, more subtle indication of the current dimensionality is the number of colored axes visible in the “BEFORE” plot, which will always be the same as the number of vertical bars.
The end result is a system that lets the user move “horizontally” through different clusters of data and “vertically” through candidate dimensionalities for the reduced data. We can view the variance subplot or printed reports of variance captured to find our low-dimensional sweet spot, all while referring to the “BEFORE” and “AFTER” images as sanity checks.
Adapting This Example To Your Application
As much as I hope you enjoyed this meditation on discs and hyperspheres, it is unlikely that you installed Fovea to process reams of conveniently Gaussian, made-up data. For that matter, it is unlikely you’ll be satisfied exploring whatever data you do have with PCA alone. So I’ll close out with a few comments about the overall design of the PCA visualization module and some simple ways you might repurpose it to meet your needs.
This user example is divided into two files: pca_disc and pca_tester. pca_tester is the end where most of the user’s interaction with the code is meant to occur. It includes the disc and hypersphere functions already discussed, but both should serve as a template for how to load real data into pca_disc’s control system and library of dimensionality reduction tools. For instance, instead of generating data with generate_ball() and the rotation functions, X could be created from a delimited .txt file with numpy.loadtxt(). Say you’ve used a classifier to group some set of feature vectors into different clusters. Just as we displayed three disc rotations in three different colors, so too could you explore clouds of “Iris Setosa”, “Iris Versicolour” and “Iris Virginica” instances in red, green and blue.
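Concretely, that loading step might look like the following sketch (the file name, delimiter, and label column are hypothetical; here the file contents are faked with random numbers so the snippet is self-contained):

```python
import numpy as np

# Hypothetical layout: four feature columns followed by an integer class label.
# In practice: data = np.loadtxt('iris.txt', delimiter=',')
rng = np.random.default_rng(1)
data = np.column_stack((rng.normal(size=(9, 4)),
                        np.repeat([0, 1, 2], 3)))

features = data[:, :-1]
labels = data[:, -1].astype(int)

# One array per cluster, analogous to X = [X1, X2, X3] in the text:
X = [features[labels == k] for k in range(3)]
clus_layers = ['clus1', 'clus2', 'clus3']
clus_styles = ['r', 'g', 'b']
```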
The three-subplot configuration is also natural to other machine learning paradigms besides PCA. All dimensionality reductions involve three things: a source (high-dimensional) data set, a reduced (lower-dimensional) data set, and some criterion that needs to be optimized in order to generate the reduced data from the source. In PCA, this criterion is variance. But there’s no reason why you might not reserve the third subplot for Kullback-Leibler divergence while using t-SNE, or mean geodesic distances between points in Isomap. Swapping out PCA for another algorithm will take a bit of digging in the “backend” functions of pca_disc, but only these few lines of code in compute() assume that you’re interested in using PCA at all:
p = da.doPCA(X, len(X[0]), len(X[0]))  # Creates a pcaNode object.
pcMat = p.get_projmatrix()
plotter.addData([range(1, len(p.d)+1), p.d/sum(p.d)], layer=layer,
                style=style+"o", subplot='13')
Y = p._execute(X, new_dim)
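For instance, swapping in t-SNE via scikit-learn (an assumed extra dependency, not used by pca_disc) could replace those lines with something along these lines:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(50, 6)                      # stand-in high-dimensional data
tsne = TSNE(n_components=2, perplexity=10.0,
            init='random', random_state=0)
Y = tsne.fit_transform(X)                      # the "AFTER" data
# Criterion for the third subplot: final KL divergence instead of variance.
kl = tsne.kl_divergence_
```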
Finally, maybe your data aren’t already clustered, so there’s nothing to toggle between with the left or right arrow keys. Or maybe you already have your heart set on a specific dimensionality, so paging up and down is no good to you. Fortunately, the keypress() function in pca_disc’s ControlSys can be easily modified with a conditional block to patch in whatever key commands do interest you. For example, if you’re curious about how your high-dimensional data might look from many, many different angles, you might modify keypress() to rotate your “BEFORE” projection vectors with presses of the left and right arrow keys and your “AFTER” projection vectors by pressing up and down. Instead of having a tool for probing different candidate dimensionalities during a dimensionality reduction, you would then have a GUI in the vein of DataHigh, a MATLAB package which lets users explore high-dimensional data with continuous, 2D projections once the reduction is over.
Myriad other methods could be added to ControlSys to assist in analyses of the core objects captured in the class fields. For instance, I’m now working on integrating some of the tools from domain2D into the PCA user example. A function for growing a bubble around “AFTER” points that needed to be projected an especially long distance during PCA might look a little like this:
def get_projection_distance(self, pt_array):
    """
    Domain criterion function for determining how far lower-dimensional
    points were projected.
    """
    nearest_pt = []
    index = []
    # Find the closest projected point.
    dist = numpy.inf
    for i, row in enumerate(self.data_dict['Y_projected']):
        if numpy.linalg.norm(pt_array - row) < dist:
            dist = numpy.linalg.norm(pt_array - row)
            nearest_pt = row
            index = i
    # Calculate distance between the projected point and the low-D point.
    orig_pt = self.data_dict['Y'][index]
    proj_pt = numpy.append(nearest_pt, np.zeros(len(orig_pt) - len(nearest_pt)))
    if numpy.linalg.norm(orig_pt - proj_pt) > self.proj_thresh:
        return 1
    else:
        return -1
Even if you don’t care to manipulate your GUIs with fancy callbacks and widgets, the ControlSys class is a great place to define scripting methods that let you interact with your applications’ important fields in real time.
Conclusion
Hopefully this walkthrough has given you a more visual understanding of PCA, but the moral of the story is that this framework is general-purpose and applicable to many data science tasks. Layers can be used to group related data and metadata and make this information available in a number of formats. Setting up GUIs and callbacks is made easy, but advanced users retain the freedom to interact with their data through scripts and command line inputs. Fovea is an interactive, graphical toolkit, not simply a GUI. It gives would-be dimensionality reducers a view inside whatever black boxes interest them.
Links
Remarkable tutorial on PCA
DataHigh MATLAB package
Interactive Applications Using Matplotlib
06 May 2015
In today’s post we will give an overview of an early prototype of a
new diagnostic tool for numeric algorithms, known as Fovea.
I began creating this tool a couple of years ago with one of my graduate
students, and I am continuing it now as part of my push for having
effective tools for my research. In fact, I’m fortunate to have a Google Summer of Code student working with me on this project right now.
Fovea has a lot more to it, involving interactive graphical widgets
for many analytical functions, including controlling parameters or
position along trajectories while enjoying automatically refreshing
subplots that reflect changes. Today, we are just looking at its
structured logging and basic plotting. This is not a full tutorial but
a demonstration of capability and a teaser for the thorough
documentation that will be forthcoming during the GSoC project.
As software developers and experimental scientists know, the first
thing you have to do at the beginning of a new project is to spend a
lot of time building (or buying) tools and instruments before you
can execute the project itself. With the kinds of exploratory computational science projects that I undertake, I always find myself short of good diagnostics to guide my (or my students’) progress towards a goal.
Because, as always, I will be introducing several new ideas here, we
will keep today’s examples elementary. Next time we look at these
tools, we’ll ramp up the complexity and apply them to a saddle
manifold computation that I’ve been working on, so you can see a more
realistic application that can really benefit from the tools. Today’s
post is pedagogical: I’ll present how to pick apart the workings of
common root-finding algorithms, and graphically track and log their
progress during their iterative steps. With the large overhead in
preparation and execution, I wouldn’t expect anyone to want to set
this up for a mere root finding algorithm unless they were using it as
an educational tool for teaching numerical analysis.
Here’s an easy-to-setup animation of some of the demo’s raw output to
motivate you to read further…
We have enough new ideas to cover here that, for the time being, I
won’t be presenting this work in the form of my literate modeling
workflow.
The need for better diagnostics with complex algorithms
Let’s say you write an algorithm, or you are asked to apply an
existing algorithm that has a lot of tricky parameters to set.
It is typical to try running it and hope that it “just works” right
out of the box, but that rarely happens, right? Common complaints will then
be: “Why is it not working properly?” or “How can I make it run
faster?” Until you understand what’s happening under the hood, you
often don’t stand a chance of optimizing the use of the algorithm.
If you even get this far beyond blind trial and error, the most common
way to track progress is with print statements. And that’s already
an order of magnitude improvement. But for more sophisticated
algorithms or hairy problems, you can still get lost in reams of text
output. In some cases, especially those related to dynamical systems
modeling, graphing some of the data generated by the algorithm during
its operation can be more helpful. Again, that probably yields another
order of magnitude improvement on understanding.
The worst offenders are the hidden assumptions
When you’re working on really hard problems involving strong
nonlinearities, multiple scales, poor conditioning, noisy data, etc.,
the algorithms you are interested in understanding can end up nested
inside other algorithms. A simple example is a predictor-corrector method, used for many things but commonly found in numerical continuation algorithms. Correction is often done using Newton’s Method, but Newton’s Method is a slippery thing to handle. If you aren’t confident about certain properties of your starting situation, that method can rapidly diverge.
This example is already nontrivial to understand if you are new to
highlevel applied math, but it’s just the beginning! As complex
projects involve more nested computations, the accumulating
assumptions that are necessary to ensure each computation will be
successful (or efficient) also compose together in a complex way. It
can become infeasibly difficult to pre-validate or post-verify the
satisfaction of all the relevant assumptions, more so than any other
challenge in executing the project.
This challenge is mainly due to the usually implicit nature of the
assumptions. For the sake of efficiency, many implementations do not
validate all their inputs and instead expect the caller (either the
end user or another algorithm) to know what they are doing. That
passing of the buck is fair but is extremely problematic once
algorithms become nested and if the user’s experience or expert
knowledge of the algorithms is limited. Moreover, in many cases, it
can be either theoretically impossible or, at least, practically
difficult for algorithms to validate all the assumptions relevant to
their effectiveness even if that could be done quickly enough to
not degrade performance.
Towards test-driven analysis
We could obviously add unit tests for the algorithm’s code
implementation, but if we don’t understand much about the possible
emergent properties of nested algorithms then it’s hard to write
thorough and appropriate tests a priori. Even with some good tests
written, if they fail at runtime because of a logical or semantic
error then we might need to diagnose and explore the behavior of the
algorithm on a specific problem or class of problems before we can
write appropriate tests. Fovea can play a role in this. The literate
modeling principles could also lead to a clearer understanding of the
convoluted assumptions when dealing with nested algorithms.
Prerequisites for this demo
Code dependencies
Download Fovea from github and install with the usual python setup.py install. This link is to a version tagged at the time of writing, to keep with the functionality exactly as presented here. You can, of course, also get the latest version and see what has progressed in the meantime.
ADDENDUM 1: For Python 3.x users, I had to subsequently update the github repo, so check out the latest one and install pyeuclid from my fork that updates for Python 3 compatibility.
ADDENDUM 2: You may need to install pyeuclid from my github fork (covers some Python 3 compatibility not present in PyPI) and the geos dependency for the python shapely package. Binaries are available by independently installing shapely from here or you can try to do it yourself using the appropriate version of OSGeo4W.
At present, the package depends on shapely, descartes, pyyaml, euclid,
PyDSTool (latest github version at time of writing or the forthcoming
v0.90.1 for those of you in the future), structlog, and tinydb, although
only the last three are actually needed for this particular presentation.
In turn, PyDSTool requires numpy, scipy, and matplotlib, which are
essential for Fovea.
At the time of writing, one package dependency (TinyDB) is not correctly
automatically installing from this setup process on some platforms, and
you may need to first run pip install tinydb (or insert the --upgrade
option before tinydb if you already have pre-v2) before rerunning my
setup.py script.
Knowledge dependencies
You can read an introduction to root finding on wikipedia. I’m not going to introduce it here except to say that it is ubiquitous and highly relevant to science, engineering, or medical computation wherever you need to detect zero crossings in images, signals, or curves.
Bisection is a method that most computational scientists have seen or even used at some point. (It is conceptually similar to the discrete version known as “binary search”.) It’s comparatively slow but reliable and robust. Once at least one root is known to exist in an interval, the method will converge. On the other hand, Newton-like methods are very fast but rather fussy about what is known of the test function’s properties ahead of time.
The secant method that we use here is a Newton-like iterative method, which uses a simple finite difference in lieu of the function’s symbolic derivative (as this may be unknown or hard to compute):
\( x_n = \left( x_{n-2} \, f(x_{n-1}) - x_{n-1} \, f(x_{n-2}) \right) / \left( f(x_{n-1}) - f(x_{n-2}) \right) \)
These two methods are sufficient to show off the principles of Fovea
for algorithm diagnostics.
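As a concrete reference, here is a minimal, self-contained sketch of the secant iteration defined by the formula above. This is my own illustration, not the demo’s implementation; the function name and default tolerances are assumptions.

```python
def secant(f, x0, x1, tol=1e-8, nmax=50):
    """Secant iteration: Newton's method with the derivative replaced
    by a finite-difference slope through the previous two iterates."""
    for _ in range(nmax):
        fa, fb = f(x0), f(x1)
        if fb == fa:            # horizontal secant line: no crossing to project
            return None
        x2 = (x0*fb - x1*fa) / (fb - fa)
        if abs(x2 - x1) < tol:  # successive estimates have settled
            return x2
        x0, x1 = x1, x2
    return None                 # failed to converge within nmax steps

# e.g. the positive root of x**2 - 2:
root = secant(lambda x: x*x - 2, 1.0, 2.0)
```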
Other assumptions
Our major assumption is that we have access to the algorithm’s source
code (i.e., a ‘white box’ situation) and knowledge of where
to put diagnostic elements. When developing open source algorithms,
this should not be a difficult assumption to satisfy.
Demonstration examples
I am using my fork of a Gist, which
contains deliberate errors in the secant implementation for us to
diagnose and fix. I am basing the algorithm implementations on someone
else’s original code to reinforce the idea that analysis can be set up
on existing functions without changing their API or, if so desired,
without affecting any of the function’s existing output (e.g. print
statements, warnings, or errors). Once either the code is understood
or properly debugged and tested, all the diagnostic statements can
simply be stripped out without having to alter any of the original
statements. For instance, any meta data or intermediate data that
needs to be stored or logged does not have to be temporarily returned
by the function by altering its return
statements or internal logic. The
logger takes care of capturing all that nonoutput related data for
later processing. This simplifies testing and avoids introducing more
bugs when trying to restore the function to its original state.
Now for the two test functions, whose roots we will seek.
Function 1
\[ f_1(x) = x^3 - \pi x + e/100 \]
This is a simple cubic function with three roots slightly offset
from integer values:
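For reference, here is the function in code. Treat the exact formula as my reading of the text above, together with a quick check that the interval [0.1, 2] used later actually brackets a root:

```python
import math

# f1 as reconstructed from the formula above (the small e/100 term
# nudges the roots of x**3 - pi*x slightly off their symmetric positions)
def f1(x):
    return x**3 - math.pi*x + math.e/100

# The demo later calls bisection(f1, 0.1, 2); a sign change across
# that interval confirms it brackets a root.
bracketed = f1(0.1) * f1(2) < 0
```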
Function 2
\( f_2(x) = -1.13 + \tanh(x-2) + 4 e^{-x} \sin\!\left(\tfrac{1}{8} x^3\right) x + e^{x/35}/10 \).
From a numerical analysis standpoint, the second function can be an unpleasant function for several reasons. Primarily, it crosses the axis many times between x = 0 and 10. It is highly non-monotone, non-linear and exhibits multiple scales. Notice that its frequency increases cubically as its amplitude decreases. The function is completely contrived but was chosen to resemble the Gibbs’ phenomenon in realistic signals and in Fourier analysis. It looks like this on our domain of interest:
Bisection with a well-behaved function
We will run the test script bisection_test1.py in this subsection. The script calls on a modified version of num in num_modded.py.
For reference, the original bisection function is given in num_original.py (taken directly from here), and looks like this:
def bisection(f, a, b, TOL=0.001, NMAX=100):
    """
    Takes a function f, start values [a,b], tolerance value (optional) TOL and
    max number of iterations (optional) NMAX and returns the root of the equation
    using the bisection method.
    """
    n = 1
    while n <= NMAX:
        c = (a+b)/2.0
        print("a=%s\tb=%s\tc=%s\tf(c)=%s" % (a, b, c, f(c)))
        if f(c) == 0 or (b-a)/2.0 < TOL:
            return c
        else:
            n = n+1
            if f(c)*f(a) > 0:
                a = c
            else:
                b = c
    return False
The original code came with no description of assumptions or tests to verify its correctness. It does include a print statement to monitor the values of a, b, and c during iteration.
On quick inspection, we realize that it has been assumed that b > a in the tolerance check before the first return statement. For now, we will not worry about the lack of abs() or a check for this ordering. Nor will we worry about the usual validation test that the given interval [a, b] actually contains a sign change in the test function f. It is reasonable to test these things before using the function, but at least the graphical diagnostics help us to see this for ourselves in practice.
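To make those omitted assumptions concrete, a hypothetical pre-validation helper (my own sketch, not part of num_modded.py) might look like this:

```python
def check_bracket(f, a, b):
    """Make the omitted assumptions explicit before calling bisection:
    the interval must be ordered and f must change sign across it."""
    if not a < b:
        raise ValueError("need a < b, got [%r, %r]" % (a, b))
    if f(a) * f(b) > 0:
        raise ValueError("f(a) and f(b) share a sign: "
                         "no root guaranteed in [a, b]")

# x**2 - 2 changes sign on [1, 2], so this passes silently
check_bracket(lambda x: x*x - 2, 1, 2)
```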
The bisection function is embellished with log statements for all the major steps, replacing the print statement with dm.log.msg('Bisect loop', a=a, b=b, c=c). This asks the logger to make an entry for the event named "Bisect loop" and record the current float values of the variables by name.
Additionally, we invoke graphical output that shows the intermediate data in each iteration step: the interval endpoints a and b (red and green dots, respectively) and their point of bisection, c.
def bisection(f, a, b, TOL=0.001, NMAX=100):
    """
    Takes a function f, start values [a,b], tolerance value (optional) TOL and
    max number of iterations (optional) NMAX and returns the root of the equation
    using the bisection method.
    """
    n = 1
    dm.log.msg('Call args', a=a, b=b, TOL=TOL, NMAX=NMAX)
    plotter.addText(0.1, 0.95, 'n=%d' % n, use_axis_coords=True,
                    name='n_value', layer='meta_data', style='k')
    while n <= NMAX:
        dm.log = dm.log.bind(n=n)
        plotter.setText('n_value', 'n=%d' % n, 'meta_data')
        if n == 1:
            rebuild = True
        else:
            plotter.toggleDisplay(layer='bisect_text_%d' % (n-1))
            plotter.toggleDisplay(layer='bisect_data_%d' % (n-1))
            rebuild = False
        plotter.addLayer('bisect_data_%d' % n)
        plotter.addLayer('bisect_text_%d' % n, kind='text')
        c = (a+b)/2.0
        dm.log.msg('Bisect loop', a=a, b=b, c=c)
        a_pt = plot_pt(a, f, n, 'a', 'r', 'bisect', 'o')
        b_pt = plot_pt(b, f, n, 'b', 'g', 'bisect', 'o')
        c_pt = plot_pt(c, f, n, 'c', 'k', 'bisect', 'x')
        if f(c) == 0 or abs(b-a)/2.0 < TOL:
            dm.log.msg('Success', fval=f(c), err=(b-a)/2.0)
            dm.log = dm.log.unbind('n')
            plotter.show(rebuild=rebuild)
            return c
        else:
            n = n+1
            if f(c)*f(a) > 0:
                dm.log.msg('Same sign')
                a = c
            else:
                dm.log.msg('Opposite sign')
                b = c
            dm.log.msg('Step', err=(b-a)/2.0)
            plotter.show(rebuild=rebuild)
    dm.log.msg('Failure', status='fail', fval=f(c), err=(b-a)/2.0)
    dm.log = dm.log.unbind('n')
    return False
The plots make use of a helper function plot_pt that’s set up to save code repetition:
def plot_pt(x, f, n, name, col, layer_root, marker='o'):
    """
    Internal diagnostic helper function.
    """
    pt = pp.Point2D(x, f(x))
    plotter.addPoint(pt, style=col+marker, name=name+'_%d' % n,
                     layer=layer_root+'_data_%d' % n, log=dm.log)
    fs = plotter.figs['Master']
    ax = fs.arrange['11']['axes_obj']
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    x_ext = xlim[1] - xlim[0]
    y_ext = ylim[1] - ylim[0]
    # add marker text at a distance proportional to size of
    # current axis extent
    plotter.addText(pt[0]-0.05*x_ext, pt[1]+0.02*y_ext, name,
                    use_axis_coords=False, name=name+'_%d' % n,
                    layer=layer_root+'_text_%d' % n, style=col)
    return pt
dm is the (shorthand) name for the global diagnostic_manager object from fovea/diagnostics.py, and plotter is the name for the global plotter2D object from fovea/graphics.py.
The dm.log object is a structlog object that performs structured logging. Details of this package can be found in its online docs. In particular, it is used here to bind the current value of n to the logging statements during each iteration. This is known as making a context. The log output lines automatically reflect this context. For example, the log line

n=6 status='ok' layer='bisect_data_6' name='a_6' figure='Master' event='Added plot data'

appears when code line 79 of num_modded.py is executed:
a_pt = plot_pt(a, f, n, 'a', 'r', 'bisect', 'o')
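The bind-then-log pattern is easy to mimic without the library. The following stand-in (my own toy, not the structlog API) shows how a bound context merges with per-call key/value pairs into one rendered line:

```python
def log_msg(event, context=None, **values):
    """Mimic a bound structured-log entry: merge the bound context
    (here, the iteration counter) with per-call key/value pairs."""
    entry = dict(context or {})
    entry.update(values)
    pairs = ' '.join('%s=%r' % (k, v) for k, v in entry.items())
    return '%s event=%r' % (pairs, event)

line = log_msg('Added plot data', context={'n': 6},
               name='a_6', layer='bisect_data_6', figure='Master')
```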
We opted to have the log dumped to a JSON file in a subdirectory named after our diagnostic attempt. This directory will be created if it doesn’t already exist. The graphics will automatically be saved there too, in PNG format.
Calls to the plotter are automatically logged by passing the log object as an optional argument to any plotter calls that add or change data.

plotter.addPoint(pt, style=col+marker, name=name+'_%d' % n,
                 layer=layer_root+'_data_%d' % n, log=dm.log)
The plotter object represents a system built over matplotlib. It contains convenient structures for organizing plots of many interrelated elements using a matplotlib-based, configurable GUI. For instance, its core organization is based on figures, axes, layers, and individual plotted object elements. Notice that it’s the “layers” that are the new feature here: the rest are just the regular matplotlib elements but grouped together in a way that allows a GUI to be built over specially designated windows. Layers are useful for exactly the same reason they are useful in a package like Photoshop: they provide a powerful organizational tool for related objects or functions that can be grouped and treated as a single entity. Thus, it is easy to switch the visibility of whole layers, to delete all the objects in a particular layer, or to change all the line styles in a layer. We will show that layers are helpful for managing and quickly navigating the large amount of metadata produced during the execution of a long iteration. At this time, there can be no nested sub-layers.
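The layer idea can be sketched in a few lines. This toy (not Fovea’s actual classes; names chosen to echo the real calls) keeps a registry of uniquely named layers whose visibility flips as a unit, and even reproduces the duplicate-name KeyError that surfaces later in the post:

```python
class Layer:
    """A named group of plot objects treated as one visibility unit."""
    def __init__(self, name, kind='data'):
        self.name, self.kind = name, kind
        self.visible = True

class ToyPlotter:
    """Keeps a registry of uniquely named layers, as the real
    plotter does per figure."""
    def __init__(self):
        self.layers = {}
    def addLayer(self, name, kind='data'):
        if name in self.layers:
            raise KeyError('Layer name already exists in figure!')
        self.layers[name] = Layer(name, kind)
    def toggleDisplay(self, layer):
        self.layers[layer].visible = not self.layers[layer].visible

p = ToyPlotter()
p.addLayer('bisect_data_1')
p.addLayer('bisect_text_1', kind='text')
p.toggleDisplay('bisect_data_1')  # hide the previous iteration's points
```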
Attempt 1: Success
The various graphs output from our test script produced this GIF:
A look through the test script
# domain of interest
DOI = ([0, 2.1], [-2.5, 2])

plotter.clean()  # in case rerun in same session
plotter.addFig('Master',
               title='Root Finding Diagnostic',
               xlabel='x', ylabel='y',
               domain=DOI)

dm.use_dir('bisect1_success')
dm.make_log('log.json')
plotter.dm = dm

# default to wait for user input on each iteration
plotter.wait_status = True
Near the beginning of the script, we declare two primary plotting
layers: ‘fn_data’ and ‘meta_data’.
plotter.addLayer('fn_data') # kind defaults to 'data'
plotter.addLayer('meta_data', kind='text')
We specify the arrangement of the figure:
plotter.arrangeFig([1,1], {'11':
                           {'name': 'Bisection method',
                            'scale': DOI,
                            'layers': '*',  # all layers will be selected
                            'axes_vars': ['x', 'y']}
                           })

gui.buildPlotter2D((8,8), with_times=False)

# permanent axes
plotter.addHLine(0, style='k:', layer='fn_data')
plotter.addVLine(0, style='k:', layer='fn_data')
We will be making more layers for this figure within the call to the bisection method, but we are able to specify now that all layers will belong to this figure using the '*' wildcard option. In the future, I could imagine making the syntax options more like what we expect from text search, using partial matches and both the * and ? wildcards to help structure the figures based on layer naming conventions.
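That future wildcard matching is essentially what the standard library’s fnmatch already provides. A sketch of the proposed selection (the layer names are illustrative):

```python
from fnmatch import fnmatch

layer_names = ['fn_data', 'meta_data', 'bisect_data_1', 'bisect_text_1']

def select_layers(pattern):
    """'*' keeps everything (the current behavior); a partial pattern
    such as 'bisect_*' would pick out just the per-iteration layers."""
    return [name for name in layer_names if fnmatch(name, pattern)]

everything = select_layers('*')
bisect_only = select_layers('bisect_*')
```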
During each iteration, text showing the iteration value of n is updated on the figure in global, absolute axis coordinates. That way, the text position remains the same regardless of how the axes are zoomed or panned within the plot.
Each iteration involves making two new layers for each n, one for the label text (‘bisect_text_n’) and one for the points (‘bisect_data_n’). Into each one, we add the relevant data. After that, we switch off the visibility of the previous n’s two layers. Thus, the layer data is still searchable and accessible after the calculation is over, and can be reviewed from interactive commands. At each iteration step, the resulting figure updates from a call to plotter.show().
We simply invoke the diagnostics using the same call that would be made when using the function for real: root = bisection(f1, 0.1, 2), which will hopefully return the x value of a root in [0.1, 2].
At the end of each iteration, we have selected to pause and wait for
user confirmation to proceed. This allows the user to inspect the
graphical output, resize it, etc. Pressing “s” autosaves the state of
the main figure before proceeding. From the saved PNG files, I simply
ran them through an online GIF maker.
Presently, there is no way to autozoom to the same settings on each
iteration in order to ensure a smooth animation. I could have tried
harder to make the plots line up better between iterations, but I’m
sure you can get the idea.
Secant method with a well-behaved function
The secant function is embellished with logging for all major steps and graphics that show intermediate data: the previous two x values (red and green dots), the secant line that they imply (blue), and the new f(x2) value (marked by a cross) that is projected out on a red line to the curve from the point at which the secant line crosses the axis.
Although all the diagnostic functions add a lot of code to the script, you’ll quickly notice that it’s all but identical to what we had in the bisection script with a few titles and filenames changed.
Attempt 1: First bug
The first time the code was marked up with plot and log statements, an error was reported:
KeyError: 'Layer name already exists in figure!'
This immediately uncovered a problem inherited from the original numeric code that I hadn’t even noticed: n was set to 1 (line 109) and then never incremented in the loop. Thus, when the logger tried to create new log entries uniquely named according to the value of n, there was a name clash. This is helpful because the algorithm might successfully converge before NMAX is reached in testing, and a case that converged extremely slowly or even diverged would not be caught. On one hand, this motivates the need for unit tests to cover all scenarios and catch these issues up front, which is the ideal way to proceed. But our logging approach is also effective at uncovering broken internal logic.
So, I added an increment for n (line 154) and tried again.
Attempt 2: Second bug
This time, the graphic shows an initial state in which the interval contains a single root and the function is monotone. This should be a no-brainer for the algorithm to converge! However, it didn’t.
We immediately see that the secant line created from x0 and x1 crosses the x-axis in the first iteration, but the crossing point is not being used to create the position of x2. We re-read the defining line

x2 = x1 - f(x1)*((x1-x0)/(f(x0)-f(x1)))

and compare it to the mathematical definition above, and we realize that the function calls in the denominator’s subtraction are in reverse order.
Attempt 3: Third bug
With the (deliberately placed) subtraction order bug fixed, we try
running the test script again after changing the output directory’s
name. This time, the algorithm stops at the end of the first iteration
with this picture.
Clearly, x2 is not nearly close enough to the actual root for the tolerance to have been met as intended. The log reports:

n=1 x2=1.5705399403789213 status='ok' x0=1.05 x1=2 event='Secant loop'
n=1 status='ok' err=-0.4294600596210787 fval=-1.0329268501690227 event='Success'

The error is negative and clearly not smaller in size than TOL (which defaults to 0.001). When we look at the code:
if x2 - x1 < TOL:
    dm.log.msg('Success', fval=fx2, err=x2-x1)
    dm.log = dm.log.unbind('n')
    plotter.show(rebuild=rebuild)
    return x2
we realize that it should have included a call to abs() and have read:

if abs(x2 - x1) < TOL:
    dm.log.msg('Success', fval=fx2, err=abs(x2-x1))
    dm.log = dm.log.unbind('n')
    plotter.show(rebuild=rebuild)
    return x2
We see that we had also “accidentally” copied the same error into the log call.
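The failure mode is easy to replay in isolation with the values from the log:

```python
TOL = 0.001
x1, x2 = 2.0, 1.5705399403789213  # values reported in the log above

# Without abs(), any leftward step gives a negative difference,
# which is always < TOL, so the loop declares success immediately.
buggy_pass = (x2 - x1) < TOL
correct_pass = abs(x2 - x1) < TOL
```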
Attempt 4: Success
We make the necessary fixes to the method’s code, change the output
directory again in the test script and rerun it. We find that it
works this time:
Bisection with a poorly behaved function
Bisection works nicely on this function, given that we bounded the
zero crossings with the initial interval. It converges after 14
iterations. Of course, we don’t get to control which of the many
zeros it converges to. Here is the output:
The test script is essentially identical to the previous ones.
Secant method with a poorly behaved function
Fortunately, we find that this method converges in this test, although after a temporary foray into much larger x values (at n=5) that gives the impression it’s about to diverge. It converges after 12 iterations, which is hardly better than bisection in this case.
You can imagine adding further diagnostics to uncover the origin of steps that go further from the root. In this case, given the many oscillations in the interval, there’s going to be a lot of luck involved, as some iterations might just be unlucky enough to line up oscillating parts of the function at similar amplitudes that project out far from the root. This can be seen below in the n=4 step, where x2 is almost at the same height as x1, so that the next secant line will be almost horizontal and point downwards far to the right of this neighborhood.
We were lucky that the function continues to increase far from the neighborhood of the roots, so that the new x2 will be much higher on the y-axis and will lead to a secant line that projects back inside the desired neighborhood. Certainly, given the highly oscillatory nature of the function on the interval of interest, the secant method is probably not a wise choice.
Viewing the logs
While the interactive python session is still open, you can access the
database of log entries. The demo scripts already fetch the DB for
you, with db = dm.log.get_DB()
. For instance, to only see the main
loop iteration log entries:
>>> db.search(where('event').matches('Bisect loop'))
[{u'a': 0.1,
u'b': 2,
u'c': 1.05,
u'event': u'Bisect loop',
u'id': 1,
u'n': 1,
u'status': u'ok'}
...
{u'a': 1.7662109375,
u'b': 1.769921875,
u'c': 1.76806640625,
u'event': u'Bisect loop',
u'id': 55,
u'n': 10,
u'status': u'ok'},
{u'a': 1.76806640625,
u'b': 1.769921875,
u'c': 1.768994140625,
u'event': u'Bisect loop',
u'id': 61,
u'n': 11,
u'status': u'ok'}]
See the
TinyDB docs for
further examples and help with making queries.
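TinyDB does the searching above, but the underlying idea is simple enough to mimic in plain Python. The entries below are hypothetical and trimmed, and query_event is my own helper (not the package’s filter_log):

```python
# a few trimmed, hypothetical entries in the shape the logger stores
entries = [
    {'event': 'Bisect loop', 'n': 1, 'a': 0.1, 'b': 2, 'c': 1.05},
    {'event': 'Added plot data', 'n': 1, 'name': 'a_1'},
    {'event': 'Bisect loop', 'n': 2, 'a': 1.05, 'b': 2, 'c': 1.525},
]

def query_event(entries, event):
    """Plain-Python analogue of the where('event') query above."""
    return [e for e in entries if e.get('event') == event]

loops = query_event(entries, 'Bisect loop')
```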
If you want to go over any of the logs after the session is closed,
you can use the diagnostics
module’s load_log
function, which
returns an OrderedDict
. You can pretty print it with the info
function, which is inherited from PyDSTool. All log entries are
numbered in order, but there are no search tools except what you make
yourself based on the dictionary format. I provided one example
function as filter_log
.
>>> d = load_log('bisect1_success/log.json')
>>> info(d)
...
64:
u'event': u'Added plot data'
u'figure': u'Master'
u'id': 64
u'layer': u'bisect_data_11'
u'n': 11
u'name': u'c_11'
u'status': u'ok'
65:
u'err': 0.00093
u'event': u'Success'
u'fval': 0.00551
u'id': 65
u'n': 11
u'status': u'ok'
Discussion
Overhead
There is some execution-time overhead incurred by the diagnostic plotting and especially the logging. However, this setup wasn’t designed for high efficiency, although I’m sure it could be better optimized. In any case, a noticeable fraction of a second added to each iteration is a small price to pay for learning precisely what’s going on inside a complex algorithm.
Next steps
I would love to GUI-fy* this example to give the user control over important algorithmic parameters and help them see the consequences. In this case, the end points of bisection and the stopping tolerance are the only parameters; for the secant method, the starting points and stopping tolerance. With this, we could further explore the methods’ convergence behavior by adding to and manipulating the stopping criteria. For instance, we could add a function value tolerance ftol instead of just the tolerance xtol between successive x values, which is what this code’s TOL argument currently represents.
* i.e. attach GUI slider widgets to user-defined params that are easy for the user to hook up to the diagnostic GUI.
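A stopping test with both tolerances could be sketched as follows (my own illustration of the proposed ftol/xtol split, not existing code):

```python
def converged(x_prev, x_new, f_new, xtol=1e-3, ftol=None):
    """The current TOL plays the role of xtol (step size between
    successive x values); an optional ftol also tests the residual
    |f(x)| at the newest estimate."""
    if abs(x_new - x_prev) < xtol:
        return True
    return ftol is not None and abs(f_new) < ftol

# passes on step size alone, as the existing code would
done = converged(1.769, 1.7690002, 5e-4)
```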
The plotter and log calls to Fovea could be set up by the algorithm’s
author for use by the enduser ahead of time, but deactivated by
default, e.g. using a verbose_level
call argument. Perhaps, if a
diagnostic_manager
object is passed, that could activate the
hooks. Or, the user can just be expected to mark up the original code
as needed, which is what we did here.
You might imagine wrapping these rootfinding algorithms inside
a loop that samples initial x values across the whole domain and
repeatedly tries these methods to find as many different roots as
possible. Different functions and different prior knowledge can
strongly affect how one might go about automatically finding “all”
roots of an arbitrary function with numeric algorithms. There will
never be a simple solution to that problem.
Not using ipython notebooks, again
I considered using ipython notebooks for this post. It’s a fine idea in principle, given some of my recent epiphanies, but what does it really gain for us here? One possibility would be to use the yaml_magic extension to simplify how the event arguments to the logger are specified. Alas, on further investigation, I realized that we can’t easily access a global diagnostic_manager object that is visible from inside the yaml_magic processing code for the YAML to be used to activate things in the dm or log.
Secondly, function definitions (or any code block) can’t be broken across cells between diagnostic-related code and original function code, so I can’t use cells to distinguish what’s original code and what’s diagnostic markup. Given that my demo involves multiple files as well, I didn’t see the added value in either developing or presenting this demo as a notebook. Please feel free to enlighten me further if you disagree.
Next installment with Fovea
Next time we look at Fovea, we’ll diagnose a much more sophisticated
algorithm that I have been developing, which computes stable and
unstable manifolds in 2D (phase plane) models.
Robert Clewley (2015). Logging and diagnostic tools for numeric python algorithms. Transient Dynamic blog.