# Google Summer of Code 2016 project internships in Literate Modeling and Diagnostics

I am mentoring up to two projects this summer with Google Summer of Code (GSoC16), following on from the success of last year’s projects. They are described in section 5 of this page on the sponsoring organization’s web site. Thanks to the INCF for their support.

A brief summary of the projects, reproduced from the INCF page:

### Next-generation neuroscience data and model exploration tools

#### 1. Improving the graphical exploration and analysis environment, Fovea.

Fovea is a user-extensible graphical diagnostic tool for working with complex data and models. It is a library of Python classes that are built over components of PyDSTool and Matplotlib. Fovea’s present capabilities and applications to neural data and models were developed significantly in GSoC 2015 (see previous blog posts), but there are several directions remaining to develop and explore to maximize the utility and accessibility of this package to less specialized users.

A primary capability of Fovea is to assist the creation of organizing “layers” of graphic data in 2D or 3D sub-plots that are associated with specific calculations. These layers can be dynamically updated as parameters are changed or as time advances during a time series, and they can be grouped and hierarchically construced to ease navigation of complex data. Examples of layer usage is to display:

• Feature representations, such as characteristic threshold crossings of time series, clusters or principal component projections, domains of an approximation’s error bound satisfaction
• Augmenting meta-data, such as vectors showing velocity or acceleration at a specified position or time point during a simulation
• Other diagnostic quantities that can visually guide parameter fitting of a model or algorithm.

In addition, GUI buttons and dialog boxes, and command-line interfacing can provide additional interactivity with the graphical views.

Aims: Depending on student interest and skills, there are three possible directions this project could usefully take:

1. Redevelop the existing prototype “application” into an actual Django or flask web-based app running over a local server (maybe using Bokeh);

2. Extend the development of the existing core tools to work better for dynamical systems applications(especially biophysical models of neural excitability);

3. Continue to streamline the workflow for constructing new applications by improving core design and adding convenient utilities;

4. Integrate existing functionality with the literate modeling prototype discussed in Project 2 (maybe in collaboration with a student working directly on that project, if both projects run).

As part of any direction of this project, there might also be opportunity to create innovative new forms of plug-and-play UI component that will assist in visualization and diagnostics of neural (or similar) data.

Skills: Required: Python, Matplotlib, GUI design, documentation tools, working knowledge of calculus and geometry or statistics and data science

Desirable: git, general experience with either model simulation or data science techniques; agile development methodology; technical writing; Django or similar, where applicable, dynamical systems theory, where applicable.

Keywords: Python, data science, GUI, diagnostics

#### 2. “Literate modeling” capabilities for data analysis and model development

A basic workflow and loosely structured package for “literate modeling” has recently been explored. This prototype reuses several PyDSTool classes and data structures, but is intended to work mostly standalone within other computational or analytical frameworks. It is a part-graphical, part-command-line tool to explore and test model behavior and parameter variations against quantitative objective data or qualitative features while working inside a core computational framework (e.g. simulator), and could work well as an integration with the Fovea package mentioned in Project 1.

“Literate modeling” is a natural extension to “literate programming” and reproducible research practices that is intended to create a rich audit trail of the model development process itself (in addition to recording the provenance of code versions and parameter choices for simulation runs once a model is developed). Literate modeling aims to add structured, interpretive metadata to version-controlled models or analysis workflows through specially structured “active documents” (similar to the Jupyter notebook). Examples of such metadata include validatory regression tests, exploratory hypotheses (as sets of expected features and data-driven constraints), and data-driven contexts for parameter search or optimization. There is presently no other software tool that aims to provide this advanced support for hypothesis generation and testing in a computational setting!

Aims: Refine, improve, or redesign existing prototype classes, database structure, and user workflow for the core functions of literate modeling. Add other core functionality to support other workflows. Option to design a browser-based interface to this system using Django or similar technology or to collaborate on integration with Fovea (see Project 5.1). Document the tools, including creating a tutorial for the website. Test according to predefined examples that will include manipulations of the Hodgkin Huxley neural model and other simple dynamical systems.

Skills: Required: Python, git, documentation tools, automatic code generation principles, agile development methodology

Desirable: Pweave/Ptangle or similar; Django, flask, or similar; SQL; XML; general experience with either model simulation or data science techniques; technical writing; working knowledge of calculus and geometry or statistics and data science

Keywords: Python, data science, literate modeling, productivity

### Inquiries and Applications for GSoC 2016

Please contact INCF at gsoc@incf.org so they can give you access to our discussion group for INCF mentors and students. Please prepare a resume and 1-2 page statement about your specific interests in the work and why you’d make a good fit with the project. You can share that within the discussion group once you’re logged in.

# Presenting complex math models to a general audience

In 2011, I proudly developed and taught a new graduate course at Georgia State University named “Introduction to Modeling for the Life Sciences”. I took a bunch of students from six different academic departments and threw them into the deep end of studying the application of mathematical modeling in the life sciences. And I don’t mean the typical basic stuff you usually show newbies like elementary statistics, linear regression models, or remedial calculus. I wanted to discuss big questions such as:

• How can a non-mathematican tell a possibly good model from a bad one?
• How can a mathematician make sure they are building a reasonable model that will satisfy scientific domain experts?

More specific questions we dug into were:

• How can we assess the modeling methods of the Blue Brain Project?
• What kinds of models of the brain can there even be? And why are there so many?

In reference to my promotional materials to attract students to my intimidating-looking course description and reading list, we also considered more cryptic questions such as:

What is the connection between (a) the ability of an F-16 fighter jet to maneuver much more nimbly than an F-4, (b) the possibility that small genetic modifications lead to large differences in phylogenetic outcome within a couple of generations, and (c) the ability of a neural circuit controlling limb motor patterns to switch an animal rapidly between locomotive gaits?

Now, of course the goal was not for my students to really understand these technical papers in depth. But most of these trainee scientists and philosophers had future plans to be involved in science in some way, and as such they needed some preparation in the emerging ways that computing is reshaping scientific understanding. Is there a middle ground to understanding such literature without abandoning it for purely popularist science journalism about the topic?

I recently discovered a dead link on the class page for our crowning achievement: a collaboratively-written document produced by the students, with my help, that summarized their high-level understanding of a pretty complex mathematical model.

My students spent about three weeks dissecting Carson’s influential 2008 paper with Kevin Hall, which uses principles of qualitative dynamical systems in the form of analysis of fixed point stability and phase-plane diagrams. Nonetheless, the assumptions and derivation of the differential equations is, of course, not at all trivial. I invite you to take a look first – it’s open access.

The document produced by my students was used as a handout to accompany a public lecture given my friend and colleague, Carson Chow, visiting from NIH. I’d invited him to speak at GSU on “The Dynamics of Obesity”.

I thought I’d share this handout with you by embedding the pages here. Now that I’m expending more effort at the interface of technical and non-technical users and readers, this seemed highly relevant. I think it’s a good example of using a mixture of diagrams, math, and prose to explain the technical basis of the work as well provide adequate context to the assumptions and literature surrounding it.

I’ll let the value of the handout speak for itself. I think the students did a great job in expressing complex ideas without dumbing them down by organizing information into small, inter-related chunks, and making mathematically meaningful diagramatic representations.

The students graduated from the class with a lot of valuable, broad insights into the world of real-world math modeling. Not everyone has the time or interest to learn dynamical systems modeling to a high degree of technical proficiency. There are not enough articles in popular science magazines that dare to delve this deeply into (relatively) complex mathematics such as this when it has clear application in experimental science, but in the right hands I don’t think it’s so difficult to present a fair overview of the ideas.

# Spike Detection and Feature Classification with Fovea

Today’s post is the second that is guest-contributed by my Google Summer of Code student, Alex Kuefler. He has just finished his project on further developing Fovea, the tool for interactively investigating and analyzing complex data and models. The previous post described setting up Fovea and using it to investigate a PCA of multielectrode spike data.

As usual, and the latest version of the project source code, including the code for this post’s example, can be found on github.

## Introduction

In my previous post, I explored how Fovea can elucidate the process of picking the best number of dimensions to express multivariate data. As a corollary, I provided a bit of information about Principal Component Analysis (PCA) and showed off the geometry of this projection-based dimensionality reduction method. With these basics in place, we can turn to a fun application of PCA (and visual diagnostics) to a real problem in neuroscience. Namely, how do we determine which brain cells emit which spikes if we record a sample containing different signals?

## Background

A multi-electrode array is a tiny bed of pin-like sensors driven into the soft jelly of the brain. Each pin (electrode) records electrical changes near individual neurons as a numerical value that dips and spikes as the cell hyper- and de-polarizes. Given that the array can contain tens or hundreds of electrodes, multi-electrode arrays collect matrix data describing voltage at many different points.

Such direct recording of many cells is sometimes likened to lowering a series of microphones into a stadium during a sports game. Whereas low-resolution brain scans (e.g., fMRI, EEG) record the roar of the entire crowd, multi-electrode arrays pick out the voices of individual neurons. But even though we can aim our recorders at individuals, we’re still bound to pick up surrounding speakers. Neuroscientists run into a problem identifying Alice the neuron, when Bob and Carol are shouting nearby.

Given waveforms recorded from electrodes at various points in a neural population, spike sorting is the process of determining which action potentials were generated by which neurons. Going from a neural waveform to a set of sorted spikes involves three basic stages:

Spike Detection Given a continuous waveform, we need to parse which peaks and troughs can really be counted as action potentials.

Feature Extraction We discriminate between Alice and Bob’s voices based on features like pitch, timbre, and word-choice. Similarly, we must decide what qualities make one cell’s spikes different than others.

Classification After detecting all the action potentials and building a profile of each one’s features, we need to leverage this information to infer which spikes (instances) were generated by which cells (classes).

Experts have a wealth of filters, classifiers, and numerical techniques for crunching through each step of a spike sort. But I’m not an expert. I like to gain insight about scary algorithms by fiddling with their knobs and seeing what happens. Fortunately, spike sorting can be visualized vividly at every step and Fovea provides the tools to do it.

## Setting Up Fovea

The previous post walks through some commands for filling layers with data and slapping them over GUI subplots. The pca_disc.py code there described is an example of working with the vanilla diagnosticGUI (a global, singleton instance simply imported from graphics.py), but for a more complicate spike-sorting program, we’ll want a more personalized GUI.

Fovea now supports the option of initializing a new diagnosticGUIs to be subclassed by a user’s objects.

class spikesorter(graphics.diagnosticGUI):
def __init__(self, title):

#global plotter
plotter = graphics.plotter2D()
graphics.diagnosticGUI.__init__(self, plotter)

Instead of creating a ControlSys class to be juggled alongside the diagnosticGUI (as in pca_disc), spikesorter is a child of diagnosticGUI and can make use of all the plotting methods and attributes we’ve seen before.

In the initialization, we load in a sample waveform as a PyDSTool Pointset using importPointset() and convert it to a trajectory using numeric_to_traj(). The sample data included in the Fovea repository is a simulated neural waveform (described in Martinez, Pedreira, Ison, & Quiroga) and looks more or less like something an electrode would read out of a brain cell. For our purposes, all we need to know is that our waveform is composed from four different sources: There are two single neurons (Alice and Bob) in the immediate area of the recording channel, a cluster of nearby neurons which can’t be resolved as single cells, but whose spikes are still picked up by the electrode (multi-unit activity), and a bunch of extra-cellular noise to fuzzy things up. The end result is a waveform looking something like this:

Where the y-axis represents voltage, and the x-axis is time.

Looking at this plot, we want to determine which peaks came from Alice, which came from Bob, and which are Multi-Unit Activity from the electrode’s periphery, all while barring noise from further consideration.

## Spike Detection

Finding true positives means admitting false positives. Such is life. When setting a threshold (i.e., the minimum acceptable height for a peak in the waveform to be considered an action potential) we invariably exclude some interesting spikes from our analysis. Or we error in the opposite direction and admit noise that’s tall enough to ride. Spike detection is a recurring theme of the last blogpost: data analysis is replete will subjective judgments, but visual diagnostics can help inform us before we make decisions. Using Fovea’s new callbacks, we can create a threshold line to be translated up and down the waveform, admitting (or dismissing) candidate spikes as it moves along. Pressing “l” will activate line creation mode, at which point, a line_GUI object can be drawn on the axes by clicking and dragging. Once created, we can produce a perfectly horizontal threshold that extends the breadth of the data by pressing “m”.

In spikesort.py, this line won’t start detecting spikes until it’s been renamed “thresh”. Since this line_GUI is our currently selected object (having just created it), we can update its name with the following call:

ssort.selected_object.update(name = 'thresh')

At this point, pressing the “up” and “down” arrow keys will translate the threshold up and down the waveform, marking the points where it is crossed along the way.

This threshold line is best thought of as a “knob” on our analysis. We turn the knob in this or that direction and see how the tweaks propagate to our final product.

To implement special behaviors when context objects (like line_GUIs) are translated across the axes, we make use of Fovea’s new “user functions”. When we create a custom diagnosticGUI object (here called spikesorter) through subclassing, we have the option of overriding diagnosticGUI methods by defining functions of the same name. user_update_func is an example of an empty method defined in diagnosticGUI for no other reason than to be overridden. It is called every time a context object is updated. So by defining user_update_func for spikesorter, we can patch in a bit of extra behavior:

def user_update_func(self):
#We only want the thresholding behavior to occur when 'thresh' is the updated context object.
if self.selected_object.name is 'thresh':

#Ensure the threshold line is horizontal by checkking for 0 slope.
if self.selected_object.m != 0:
print("Make 'thresh' a horizontal threshold by pressing 'm'.")
return

#The numeric value of the threshold is 'thresh's y-intercept.
cutoff =  self.selected_object.b

traj_samp = self.traj.sample()['x']
r = traj_samp > cutoff
above_thresh = np.where(r == True)[0]

spike = []
spikes = []
crosses = []

#Recover list of lists of points on waveform above the threshold line.
last_i = above_thresh[0] - 1
for i in above_thresh:
if i - 1 != last_i:
crosses.append(i)
#Return x value of the highest y value.
spikes.append(spike)
spike = []
spike.append(i)
last_i = i

self.traj_samp = traj_samp
self.crosses = crosses
self.spikes = spikes

style='r*', name='crossovers', force= True)

self.show()

Once the threshold is positioned where we want it, the “d” key will detect spikes by locating each local maximum to the right of a cross-over. These maxima are the peaks of the action potentials and each is centered in its own box_GUI object created by spikesort.py.

Unlike the “l” hot key, “d” is specific to our application. By writing a key handler method and attaching it to the canvas with mpl_connect, it is easy to patch in new hot keys like this one. However, you must be careful not to reuse function names from Fovea, as they will be overridden. The following method is named ssort_key_on (as opposed to key_on, used in diagnosticGUI) so that new key presses will be added to the old library of hot keys, rather than replacing it:

def ssort_key_on(self, ev):
self._key = k = ev.key  # keep record of last keypress
fig_struct, fig = self.plotter._resolveFig(None)

if k== 'd':
#Draw bounding boxes around spikes found in user_update_func.
spikes = self.spikes
self.X = self.compute_bbox(spikes)

self.default_colors = {}

#Draw the spike profiles in the second subplot.
if len(self.X.shape) == 1:
self.default_colors['spike0'] = 'k'
style= self.default_colors['spike0']+'-', name= 'spike0', force= True)

else:
c= 0
for spike in self.X:
name = 'spike'+str(c)
self.default_colors[name] = 'k'
style= self.default_colors[name]+'-', name= name, force= True)
c += 1

self.plotter.auto_scale_domain(xcushion = 0, subplot = '12')
self.show()

Once the threshold is positioned where we want it, the “d” key will detect spikes by locating each local maximum to the right of a cross-over. These maxima are the peaks of the action potentials and each is centered in its own box_GUI object created by spikesort.py.

Each box_GUI captures 64 milliseconds of neural data by default, but this value can be changed in the spikesort GUI itself. Just press “b” to create your own box_GUI, give it the name “ref_box”, and the program will use its width as the new size for detecting spikes. This trick can be used in conjunction with the toolbar zoom to make very narrow bounding boxes to fit your detected spikes.

The boxes’ contents are shown in the second (“detected spikes”) subplot, as a succession of spike profiles laid on top of each other. This plot will give you a general sense of the action potentials’ shapes (and their diversity), but it’s not yet clear how to sort them.

## Feature Extraction

Any given object of study can be decomposed into features in an infinite number of ways. A birdwatcher confronting a strange specimen can consider anything from the animal’s tail-plumage, to its birdsong, to the motion of its flight in order to identify it. There is no single “right” feature or set of features to look at, because the bird is a whole of many parts. Nevertheless, selecting some attributes over others and describing experimental objects as vectors across these attributes can be very useful. This process, called features extraction, discretizes nature and helps us make decisions in the face of ambiguity.

Traditionally, researchers use their domain knowledge of the subject they study to select the attributes they think are most informative. For neuroscientists, such features might be the duration or the amplitude of a spike. But Alice and Bob are closely situated in the brain. Their signals are visually abstract and more similar to one another than the appearances of crows and condors. In other words, eyeballing these spike profiles and pointing to different qualities that “seem useful” for grouping them could easily lead us astray. Is there some hidden set of features that will give us a better handle on microelectrode data? If so, how do we find it?

Alarm bells should now be ringing in the heads of PCA enthusiasts. In the last post, I described the dilemma of a psychologist who wants to find the minimum, component feelings one might use to describe all emotions. Once this set is found, any complex emotion might be described as a weighted sum of these components (e.g., “nostalgia” might decompose to 0.8*sorrow + 0.7*happiness – 0.3*anger). PCA comes up with these components by finding a set of orthogonal directions through the data that capture the greatest amount of variance. Here, “orthogonal” essentially means that no component contains even a smidgen of another component (e.g., happiness is zero parts anger and zero parts sorrow) and “greatest amount of variance” means no other components can be composed to give a more accurate reconstruction of the data than those found by PCA (assuming the components are orthogonal). By using components as our attributes, PCA acts as an automatic feature extractor driven by mathematics, rather than intuitions.

We neuroscientists with our spikes are now in an analogous situation to the psychologist. The difference here being the components PCA finds for our data will be a little more abstract and a little less amenable to an English-language description. Fortunately, we don’t need English-language descriptions of the components, because we have Fovea: We can visualize what spike components look like in the third subplot.

Pressing “p” after detecting a set of spikes will cause spikesort to carry out PCA on the n x m observation matrix whose rows are the waveform segments detected with the “d” key-press, and whose columns are the current voltages of every spike at a given time-step. PCA gives us an m x 3 projection matrix, whose columns are the principal components (PC). These PCs are displayed in the third subplot in red, green, and yellow respectively. Meanwhile, the spikes projected onto the first and second PC are shown in the fourth subplot.

The important thing to notice about the PCs in the third subplot is that they look a bit like action potentials. Sure, no single one of them seems quite like it would fit in with the spikes of the “detected” plot. But it’s easy to imagine how, say, hammering out a few bumps in the red PC or flipping the green PC upside down and merging them together might give us one of the black spike profiles shown. Indeed, this is what the projection step of PCA is doing. Recall that each observed instance (a spike) is a weighted combination of the components (e.g., spike1 = 0.5*pc1 – 0.8*pc2 + 0.2*pc3). These weights stretch and flip the PCs in the manner above described before merging them through summation. In this sense, a PC might be described as a “proto-spike”: The wavy red, green and yellow lines don’t correspond to anything we’ve really recorded the brain doing, but they can be combined in different ways to generate any of the real spikes we did observe.

The fourth subplot, “Projected Spikes”, can be interpreted as a graph of the weights used to scale the PCs. In the below image, we use Fovea’s new picker tools to select a data-point in the fourth subplot, which highlights its corresponding detected spike as well. The chosen point is at approximated -230 on the x-axis (the red PC) and +40 on the y-axis (the green PC).

In other words, this spike contains a LOT of first PC (except flipped upside-down, as the weight is negative), and a fair amount, but substantially less of the second PC. This finding accords with intuition, as the big peak of the red PC looks like an inversion of the spike’s V-shaped refractory period (see my overlaid box A, which is not part of the Fovea output), and they both share a trough or hill at the end (see: B). Similarly, the flat onset and high peak of the spike resemble the first third of the green PC (see: C).

You may also notice that only the first and second PCs are fully drawn in, whereas the third is dotted out. This indicates the first and second PCs are being used as the axes of the fourth subplot. By clicking on the dotted PC, the user can change the projection axes of the “Projected Spikes” subplot. Below we see the same selected spike, but shown on new axes:

## Classification

Two differences should stand out when projecting onto the first/second PCs rather than the second/third PCs. First, the data projected onto the first/second PCs should have a greater tendency to cluster.

Second, when we introspect these clusters they should tend to be more cohesive than groupings that show up when projecting to the second/third PCs. In other words, if we look at only those detected spike profiles that go along with a given cluster, they should all look pretty similar to one another. We use a combination of key-presses and context objects to facilitate this process. Pressing “b”, we create box_GUIs in the fourth subplot to fit around potential clusters of data point. When one of these bounding boxes is selected (in bold), we can then press either “1”, 2”, or “3” to color the points in that box red, green or blue, respectively (“0” returns them to black). The detected spike profiles will receive the new colors as well. Below, we see a few clusters grouped in this way:

Here, four distinct clusters have been picked out and they’re all fairly easy to describe. The red spikes look like “classic” action potentials, characterized by a flat onset, high peak, and slow return to baseline. The green spikes start out with a sharp V-shape and have a distinctive, yet lower peak. The blue spikes are characterized by single, smooth swells. And finally, the black spikes always manage to cross the mean of the data, and are perhaps just noise or multi-unit activity.

On the other hand, projecting data onto the third/second PCs is less clean. Not only do the clusters seem less distinctive, but the spike profiles they group together don’t appear to have very much in common:

My placement of rectangular bounding boxes using Fovea’s “b” command in this projection was more arbitrary, as these clusters weren’t as clearly separated. Although some of the colorations accord with what we’d expect, in this projection the difference between the high-peaked “classic” spikes and the multi-unit noise we picked out from the previous example doesn’t come across (both are painted in black and belong to the large, blurry cluster in the middle). However, this projection isn’t entirely without merit. Consider the blue and green spikes. In the previous example, these were all bunched together into the same group. But it’s clear from this projection that although both sets have a distinctive V shape, for one group they precede the peak, and for the other, they follow it.

Although the literature on classification algorithms is extensive, the fact that we can turn the different “knobs” provided by Fovea and come up with unexpected results suggests the benefits of visualization for exploratory analysis.

## Conclusion

In addition to exploring a real-world use case, this post lays out lots of new Fovea tools resulting from our work during Google Summer of Code 2015. We’ve seen how subclassing diagnosticGUI can produce more complicated programs and let us tailor Fovea’s behavior with user function overrides. We also took a look at built-in key presses, picking selected objects with the mouse-clicks, and some applications for the line_GUI and box_GUI context objects.

Realistic simulation of extracellular recordings

Martinez, Pedreira, Ison, & Quiroga’s simulated data

Helpful paper on spike sorting by Michael S. Lewicki

# PCA demo with Fovea

Today’s post is guest-contributed by my Google Summer of Code student, Alex Kuefler. He has been busy developing Fovea, the new tool for interactively investigating and analyzing complex data and models, and has a demo in the context of data dimensionality reduction to show off some new features and possibilities. It also acts as a first tutorial in how you can set up your own Fovea-based project, about which there will be more. We have recently improved support for Python 3, and the latest version can be found on github

## Introduction

If you’ve worked with data, you’ve probably heard of principal component analysis. PCA is one of the most widely used techniques for pre-processing high dimensional data, but for the mathematically illiterate, it can be something of a black box. This tutorial won’t show you the algebra going on inside (for that, I highly recommend this tutorial by Jon Shlens), but it will build up some geometric intuitions about what PCA is doing and demonstrate how Fovea can be used to gain such insights.

Let’s start off in everyone’s three favorite dimensions (x, y, and z) before working our way up. Consider the following dataset:

Although I’ve taken the trouble to draw this data-disc in the volume of a cube, I really didn’t need to. All the action is happening across the xy-plane with the z-axis contributing nothing to our understanding – I may as well have just appended some arbitrary coordinate to every point making up an otherwise interesting 2D object. In fact, this is exactly what I did.

Using PyDSTool’s synthetic_data tool and numpy’s concatenate function:

pts = sd.generate_ball(120, 2, 10)
pts = np.concatenate((pts,np.zeros((120, 1))), axis=1)

The end result is a list of points looking something like this:

[[ 5.94061792  3.76256284  0.        ]
[-0.86465872 -3.35053331  0.        ]
[ 3.4455166   2.50427016  0.        ]
[ 7.75243164 -5.58465333  0.        ]
[ 2.11125897  3.93196272  0.        ]
[-5.87175514  5.98995138  0.        ]
[-1.12449966  2.32071201  0.        ]
[ 2.0811796  -0.78464252  0.        ]
[ 5.65265684  0.11505331  0.        ]
[ 3.43255688  3.83145878  0.        ]]

One easily spots the slacking z-coordinate. In a perfect world, dimensionality reduction would consist of nothing more than tossing out the zeroes. But when the data have been rotated, or noised up, the problem becomes more apparent. By generating some random angles and composing the rotate_x(), rotate_y(), and rotate_z() functions (in addition to translate()) in pca_disc, like so:

trans_am = 10
trans_ax = 1

X = [[],[],[]]
for i in range(3):
X[i] = rotate_z(rotate_y(rotate_x(translate(pts, trans_ax, trans_am),
random.uniform(0,2*np.pi)),random.uniform(0,2*np.pi)),random.uniform(0, 2*np.pi))

We end up with a few sets of points that still sit on 2D discs, but appear to make use of three dimensions.

[[  8.2701649    6.54407486   9.43685818]
[ 10.88101268   8.97529237   8.31551898]
[ 10.87652499  10.02230807  -3.48266113]
[  4.70195135   4.70376686  -5.67127247]
[ 14.28294606  12.28493836   5.26297749]
[  7.14308586   6.18479583   2.17252043]
[ 11.8558326    9.95023447   7.14259082]
[ 10.0219924    8.51815475   4.83660148]
[ 12.01924573  11.11940641  -4.34387331]
[ 13.73158322  12.05803314   2.28344063]]

Although these data have tricked the x, y, and z axes, visualization makes it apparent that we’re expressing our data in more dimensions than we need.

PCA’s job is to throw out the x, y and z axes and come up with a new set of axes that line up better with the data. In other words, PCA produces an ordered set of orthogonal vectors (called the principal components or PC’s), such that the first PC points in the direction of greatest variance in the data, the second in the direction of second greatest variance (constrained to be orthogonal to the first), and so forth. It is perfectly valid to come up with a new set that has the same number of vectors as the old one, thus capturing ALL the data’s variance. But the real fun happens when we ask ourselves if we can get away with having fewer axes (dimensions) than we started with.

In pca_disc, it is apparent that the two measly a and b axes can capture the same amount of “interesting stuff” as all three, x, y, and z axes. To convince ourselves, we’ll need some way to rapidly explore how our data look embedded in high-dimensional space and how they look after projection onto axes a and b. Together, Fovea layers and matplotlib’s subplots provide the needed utility.

## Setting Up Fovea

Our goal is to set up some nice images showing the “before and after” of a dimensionality reduction as subplots in a figure window. But first we need to decide what kinds of information we want populating our subplots. For the purposes of this tutorial, the layers will correspond to different rotations of our disc and contain four different, but associated, groups of data (before-PCA data, after-PCA data, variance captured by PCA, and the PC’s themselves). Out in the wild, the user may want to reserve each layer for different clusters of a single dataset or different datasets entirely (the possibilities are endless). We can set aside some names and appropriate colors for our layers in a couple lists:

rot_layers = ['rot1', 'rot2', 'rot3']
rot_styles = ['r', 'g', 'b']

And the layers can be added to the plotter like so:

for rot in rot_layers:

I’ve also included a fourth layer, orig_data, to house the original data-disc prior to any rotations. But before we can start populating rot1, rot2, and rot3 with data, we need a figure window and some axes over which our layers will be displayed. pca_disc provides a convenient function, setupDisplay(), which accepts a couple lists of strings (i.e., rot_layers and rot_styles) and arranges a figure with three subplots:

plotter.arrangeFig([1,3], {
'11':
{'name': 'BEFORE',
'scale': [(-10,10),(-10,10)],
'layers': rot_layers+['orig_data'],
'axes_vars': ['x', 'y', 'z'],
'projection':'3d'},
'12':
{'name': 'AFTER',
'scale': [(-20,20),(-20,20)],
'layers': rot_layers,
'axes_vars': ['a', 'b']},
'13':
{'name': 'Variance by Components',
'scale': [(0.5,10),(0,1)],
'layers': rot_layers,
'axes_vars': ['x', 'y']},
})

gui.buildPlotter2D((14,6), with_times=False)

The first subplot, (entitled “BEFORE”) will be home to the original 3D disc dataset, some different rotations of the same disc, and a view of the new axes that PCA discovers for each rotation. Note that we set the ‘projection’ parameter for this subplot to ‘3d’ and supply three axes_vars instead of two. This feature is new to Fovea, and allows for provisional plotting of 3D data (with associated perks, like being able to rotate the axes by clicking and dragging). The second plot (“AFTER”) will display a 2D view of how each rotated disc looks after having been projected onto their corresponding PC’s (shown in “BEFORE”). Bear in mind that, for the purposes of this user story, two PC’s will always be used to project the “AFTER” data, even if only one PC was more appropriate (say, by stretching the disc far across the x axis). When we move to more dimensions, we’ll also continue using only 2 axes for the “AFTER” plot, but you could also make this plot 3d as well, if PCA discovers 3 strong components.

As we move to more complicated data sets, it will also be important to assess how much the data are spread along each axis (the variance). Knowing the variance captured by each PC will later let us decide if that dimension is worth keeping around. So we will also set aside a third subplot, ‘Variance by Component’ where this information will eventually be stored. Notice also that every subplot will contain a bit of data from every layer, so the ‘layers’ parameter for each one is simply the list rot_layers (with orig_data appended on for the first subplot).

Looping through our list of layers, we can now perform PCA on each rotated dataset using pca_disc’s compute() function. compute() makes use of doPCA() – an aptly named function for doing PCA – included in PyDSTool’s data analysis toolbox (imported as da). You may also wish to create a pcaNode object directly with the python MDP package (which da imports). MDP includes a method for retrieving the data’s projection matrix, whose columns are the principal components we’re after. Another handy method defined for pcaNodes is simply named d(), which returns the list of eigenvalues corresponding to the principal components (which we can use to account for how much variance each new dimension explains).

# Create a pcaNode object.
p = da.doPCA(X, len(X[0]), len(X[0]))

# Columns of the projection matrix are the PCs.
pcMat = p.get_projmatrix()

# Store column vectors to be convenient for plotting as line endpoints.
pcPts = np.concatenate(([pcMat.transpose()[0,:]*15],
[-pcMat.transpose()[0,:]*15]), axis=0)

for j in range(1, new_dim):
pcPts = np.concatenate((pcPts, [pcMat.transpose()[j,:]*15],
[-pcMat.transpose()[j,:]*15]), axis=0)

# Get data projected onto the 'new_dim' number of PCs.

Y = p._execute(X, new_dim)

compute() then adds the data (both low and high dimensional), the PCs, and a plot of variances to their respective layers, and makes use of the optional subplot parameter to ensure each group of data ends up on the axes assigned to it.

# Create plot of high-dimensional data.
plotter.addData([X[:,0], X[:,1], X[:,2]], layer=layer, style=style+'.', subplot='11')

# Add PC lines to the same subplot as the high-dimensional data.
for j in range(0, len(pcPts), 2):
layer= layer, style= style, subplot= '11')
layer= layer, style= style, subplot= '11')

# Create plot of low-dimensional data.
plotter.addData([Y[:,0], Y[:,1]], layer=layer, style=style+'.', subplot= '12')

# Create line plot for fraction of variance explained by each component.
plotter.addData([range(1, len(p.d)+1), p.d/sum(p.d)], layer=layer, style=style+"-o", subplot= '13')

At the end of compute(), we loop through all the layers we’ve added and turn off their visibility as a single group (another new feature provided by Fovea), so we have a clean slate when we start interacting with our GUI.

for layer in rot_layers:
plotter.setLayer(layer, figure='Master', display=False)

## Exploratory Analysis

At this point, we can explore our data by clicking and rolling the 3D axes and changing the visibility of the various layers using setLayer(). But to make things more user-friendly, it is better to set up some callbacks. For now, this function will respond contextually by cycling through the different rotations when the arrow keys are pressed, hiding the original data when h is pressed, and displaying all rotations in response to m. It can be easily connected to our GUI’s instance of masterWin with mpl_connect:

c = 0
def keypress(event):
global c

if event.key == 'right':
c += 1
if event.key == 'left':
c -= 1

if event.key == 'left' or event.key == 'right':
for layer in rot_layers:
plotter.setLayer(layer, figure='Master', display= False)

plotter.toggleDisplay(layer=rot_layers[c%3], figure='Master')

if event.key == 'm':
for layer in rot_layers:
plotter.toggleDisplay(layer=layer, figure='Master')

if event.key == 'h':
plotter.toggleDisplay(layer='orig_data', figure='Master')

plotter.show(rebuild=False)

gui.masterWin.canvas.mpl_connect('key_press_event', keypress)

After a bit of playing around with our visualization, a few things should become apparent.

First, each rotation’s PC’s line up beautifully with the data, as one would expect from this surrogate dataset. If you rotate the data-disc to be edge-on, you’ll see that data-points and PC’s all fall in a straight line. If we want to fuzzy-up the disc with the noise() function provided in pca_disc, we may lose our straight line, but the PC’s will still point across the breadth of the data, as we would hope.

Second, the subplot “AFTER” projection looks quite a bit like the “BEFORE” image in the 3D axes. We may not be able to swivel it around, but if you turn the high-dimensional data so that it’s flat face is facing toward you, you will have effectively recreated the “AFTER” plot in the “BEFORE” axes (you may need to use the zoom widget to correct the scales).

The takeaway message here is that PCA is ultimately just a linear projection (or flattening) of data in one space into some other space. Nothing is intrinsically different – it is merely squished. However, when we start dealing with more dimensions (and we’re forced to view 2D/3D projections of bigger data) the myriad projections we choose to get an impression of the data may all look quite different. In such cases, the ability to compare before and after images (perhaps plotting different projections of the data rather than rotations) can be instrumental.

And third, the “variance by components” plot is very boring. Because we’re plotting an almost perfect disc, the “direction of greatest variance” we use as the first PC has only marginally more variance than the second component. We can spice it up by using pca_disc’s stretch function to skew the data and watch as the graph becomes less even. But as always, the real intrigue happens when we add more dimensions.

## High-Dimensional Data

For most real-world datasets, dimensionality reduction involves sacrifice. Every principal component captures a bit of variance and somewhere along the line, the user must make a judgment call about how much of that variance it is okay to throw out if it means getting to weasel into a lower dimension. The subjectivity of this process is why visual diagnostics tools can play such an important role in dimensionality reduction. Good analyses rely on the ability of informaticians to make informed decisions and Fovea is a great informer. By interacting with data, testing different candidate dimensionalities, and observing the associated “variance payoffs”, users can make better choices for a dimensionality reduction.

Consider how our workflow changes as we move into higher dimensions. Imagine you are a psychologist acutely interested in how frequently members of 3 different populations (children, adolescents, and adults) experience 6 different emotions (joy, sorrow, jealousy, malice, fear, and pride) throughout the day. For convenience, we’ll pretend also that each population’s data can be modeled perfectly as a normally-distributed hypersphere in 6D emotion-space. Now instead of having three rotations of a disc shown in three different colors, our dataset will consist of three distinct hyperspheres (created with synthetic_data()) representing the emotional diversity of children (red), adolescents (green) and adults (blue):

# Create and stretch different hypersphere "clusters":

X1 = translate(stretch(stretch(sd.generate_ball(133, dim, 10), 0, 1.4), 1, 1.4), 0, 25)
X2 = translate(sd.generate_ball(110, dim, 10), 1, 20)
X3 = translate(noise(sd.generate_ball(95, dim, 10), 2, 0.6, 0, 2), 2, 15)

X = [X1, X2, X3]

clus_layers = ['clus1', 'clus2', 'clus3']
clus_styles = ['r', 'g', 'b']

Notice that I’ve also tinkered with the number of datapoints making up each hypersphere (the sample size of that population) and applied pca_disc’s stretch(), translate() and noise() function to give each cluster its own character, as we would expect from three different clusters of a real dataset.

Let’s say we suspect our choice of 6 emotions was somewhat arbitrary. Do we really need to give “joy” and “pride” their own axes, or should reports of those feelings all be lumped into a single variable (happiness)? For that matter, don’t “jealousy” and “malice” belong to the same underlying emotion (contempt) as well? PCA is designed to settle questions like these.

But a problem arises when the best set of axes PCA does identify outnumbers the 3 dimensions visible to us mortals. We may have narrowed our 6 emotions down to a more manageable list of 4 (happiness, contempt, fear, sorrow), but that’s little solace if we wish to plot our new dataset in 2D or 3D. However, we may still get a loose impression of how 4D or 6D might look in its natural state by projecting it onto 2 or 3 orthonormal vectors. Methods like Linear Discriminant Analysis can be used to come up with “projection vectors” that yield particularly informative projections of high-dimensional data. But purely for the sake of demonstration, we’ll content ourselves with any 2 or 3 orthonormal projection vectors that transform our 4D (“AFTER”) and 6D (“BEFORE”) into 2D and 3D point-clouds that can be drawn in our existing subplots. Two arbitrary orthonormal projection vectors can be found by applying the QR factorization on a random matrix as in pca_disc’s function, ortho_proj_mat():

def ortho_proj_mat(n, m):
Z = numpy.random.rand(n, m)
Q, R = numpy.linalg.qr(Z)
return Q

However, when dealing with 6D data, there is more leeway to select how many axes we want our new space to have. For this particular example, the third subplot reveals a sharp gap between the variance explained by the first 2 PCs and the remaining 3 for the X1 (red) dataset.

If this were real psychological data, we might conclude that we were too hasty to settle on four new dimensions, when what we really needed was just a “positive emotion” axis and a “negative emotion” axis. Then again, even if two vectors explain a big portion, we would still be throwing out a lot of variance if we ignore the other 3 completely. In other words, the process of picking the best dimensionality is non-deterministic, context-dependent, and leaves room for trial and error.

Wouldn’t it then be nice if we could manually select the number of PCs we want to project our data onto and explore how our choice affects the “AFTER” image? By including “up” and “down” arrow clicks in our keypress() function we can do just that. Each time we press the “up” key, another PC is added to the set of axes onto which the “BEFORE” data are projected (thus changing the appearance of the “AFTER” plot). Each time we press “down”, a PC is removed. And if the current number of selected PC’s exceeds two, the code swoops in with our arbitrary, orthonormal, projection vectors to display the data in the second subplot.

However, as we flit around dimensions, generating different “AFTER” data, and projecting what we find onto our random QR-vectors, we end up with a lot of different variables to keep track of. For instance, even if our projection vectors for displaying 4D data are chosen arbitrary, we still want them to be the same vectors if we hop over to 5D-land for a bit, then come back. Otherwise, viewing the different plots produced by toggling through different dimensions is about as informative as randomly generating the data each time.

Fortunately, Fovea’s object-oriented design makes it easy to add new classes that store and reuse attributes during interactive plotting sessions. Reworking a bit of code from Benjamin Root’s Interactive Applications Using Matplotlib, we can come up with a control system class, ControlSys, whose fields include:

• c (a counter for keeping track of which hypersphere is on display)
• proj_vecsLO (Two orthonormal vectors for projecting post-PCA data)
• proj_vecsHI (Three orthonormal vectors for projecting pre-PCA data)

This class also includes a modified version of our function, keypress(), which now cycles between layers/clusters (left and right), re-computes PCA in different dimensions (up and down), and changes visibilities (m and h).

def keypress(self, event):

if event.key == 'left' or event.key == 'right':
if event.key == 'left':
self.c += 1
else:
self.c -= 1

for layer in self.clus_layers:
plotter.setLayer(layer, display= False)

plotter.toggleDisplay(layer=self.clus_layers[self.c%len(self.clus_layers)]) #figure='Master',

if event.key == 'm':
for layer in self.clus_layers:
plotter.toggleDisplay(layer, figure='Master')

if event.key == 'h':
plotter.toggleDisplay(layer='orig_data', figure='Master')

if event.key == 'up' or (event.key == 'down' and self.d is not 2):
if event.key == 'up':
self.d += 1
elif event.key == 'down':
self.d -= 1

print("Attempting to display", self.d,"-dimensional data...")

for i in range(len(self.clus_layers)):
self.data_dict = compute(self.data[i], self.d, self.clus_layers[i],
self.clus_styles[i], self.proj_vecsLO, self.proj_vecsHI)

self.highlight_eigens()
gui.current_domain_handler.assign_criterion_func(self.get_projection_distance)

plotter.show(rebuild=False)

Note also the print statement and highlight_eigens() function, called when the up or down keys are pressed. Each cues the user to the effects their changes of dimensionality has. In the case of the print statement, this cue is text. In the case of highlight_eigens(), vertical bars are drawn through each component in the “Variance by Components” currently being viewed. Another more subtle indication of the current dimensionality is the number of colored axes visible in the “BEFORE” plot, which will always be the same as the number of vertical bars.

The end result is a system that lets the user move “horizontally” through different clusters of data and “vertically” through candidate dimensionalities for the reduced data. We can view the variance subplot or printed reports of variance captured to find our low-dimensional sweet-spot, all while referring to the “BEFORE” and “AFTER” images as sanity checks.

As much as I hope you enjoyed this meditation on discs and hyperspheres, it is unlikely that you installed Fovea to process reams of conveniently-gaussian, made-up data. For that matter, it is unlikely you’ll be satisfied exploring whatever data you do have with PCA alone. So I’ll close out with a few comments about the overall design of the PCA visualization module and some simple ways you might repurpose it to meet your needs.

This user example is divided into two files: pca_disc and pca_tester. pca_tester is the end where most of the user’s interaction with the code is meant to occur. It includes the disc and hyphersphere functions already discussed, but both should serve as a template for how to load real data into the pca_disc’s control system and library of dimensionality reduction tools. For instance, instead of generating data with generate_ball() and the rotation functions, X could be created from a delimited .txt file with numpy.loadtxt(). Say you’ve used a classifier to group some set of feature-vectors into different clusters. Just as we displayed three disc rotations in three different colors, so too could you explore clouds of “Iris Setosa”, “Iris Vesicolour” and “Iris Virginica” instances in red, green and blue.

The three-subplot configuration is also natural to other machine learning paradigms besides PCA. All dimensionality reductions involve three things: a source (high-dimensional) data set, a reduced (lower-dimensional) data set, and some criterion that needs to be optimized in order to generate the reduced data from the source. In PCA, this criterion is variance. But there’s no reason why you might not reserve the third subplot for Kullback-Leibler divergence while using t-SNE, or mean geodesic distances between points in Isomap. Swapping out PCA for another algorithm will take a bit of digging in the “backend” functions of pca_disc, but only these few lines of code in compute() assume that you’re interested in using PCA at all:

p = da.doPCA(X, len(X[0]), len(X[0])) #Creates a pcaNode object.

pcMat = p.get_projmatrix()

plotter.addData([range(1, len(p.d)+1), p.d/sum(p.d)], layer=layer, style=style+"-o", subplot= '13')

Y = p._execute(X, new_dim)

Finally, maybe your data aren’t already clustered, so there’s nothing to toggle between with the left or right arrow keys. Or maybe you already have your heart set on a specific dimensionality, so paging up and down is no good to you. Fortunately, the keypress() function in pca_disc’s ControlSys can be easily modified with a conditional block to patch in whatever key commands do interest you. For example, if you’re curious about how your high-dimensional data might look from many, many different angles, you might modify keypress() to rotate your “BEFORE” projection vectors with presses of the left and right arrow keys and “AFTER” projection vectors by pressing up and down. Instead of having a tool for probing different candidate dimensionalities during a dimensionality reduction, you would then have a GUI in the vein of DataHigh, a MATLAB package which lets users explore high-dimensional data with continuous, 2D projections once the reduction is over.

Myriad other methods could be added to ControlSys to assist in analyses of the core objects captured in the class fields. For instance, I’m now working on integrating some of the tools from domain2D into the PCA user example. A function for growing a bubble around “AFTER” points that needed to be projected an especially long distance during PCA might look a little like this:

def get_projection_distance(self, pt_array):
"""
Domain criterion function for determining how far lower dimensional points
were projected
"""

nearest_pt = []
index = []

#Find closest projected point.
dist = 10000000
for i, row in enumerate(self.data_dict['Y_projected']):
if numpy.linalg.norm(pt_array - row) < dist:
dist = numpy.linalg.norm(pt_array - row)
nearest_pt = row
index = i

#Calculate distance between projected point and loD point.
orig_pt = self.data_dict['Y'][index]
proj_pt = numpy.append(nearest_pt, np.zeros(len(orig_pt)-len(nearest_pt)))
if numpy.linalg.norm(orig_pt - proj_pt) > self.proj_thresh:
return -1
else:
return 1

Even if you don’t care to manipulate your GUIs with fancy callbacks and widgets, the ControlSys class is a great place to define scripting methods that let you interact with your applications’ important fields in real-time.

## Conclusion

Hopefully this walkthrough has given you a more visual understanding of PCA, but the moral of the story is that this framework is general-purpose and applicable to many data science tasks. Layers can be used to group related data and meta-data and make this information available in a number of formats. Setting up GUIs and callbacks is made easy, but advanced users retain the freedom to interact with their data through scripts and command line inputs. Fovea is an interactive, graphical toolkit, not simply a GUI. It gives would-be dimensionality reducers a view inside whatever black boxes interest them.

Remarkable tutorial on PCA

DataHigh MATLAB package

Interactive Applications Using Matplotlib

# Logging and diagnostic tools for numeric python algorithms

In today’s post we will give an overview of an early prototype of a new diagnostic tool for numeric algorithms, known as Fovea. I began creating this tool a couple of years ago with one of my graduate students, and I am continuing it now as part of my push for having effective tools for my research. In fact, I’m fortunate to have a Google Summer of Code student working with me on this project right now.

Fovea has a lot more to it, involving interactive graphical widgets for many analytical functions, including controlling parameters or position along trajectories while enjoying automatically-refreshing sub-plots that reflect changes. Today, we are just looking at its structured logging and basic plotting. This is not a full tutorial but a demonstration of capability and a teaser for the thorough documentation that will be forthcoming during the GSoC project.

As software developers and experimental scientists know, the first thing you have to do at the beginning of a new project is to spend a lot of time building (or buying) tools and instruments before you can execute the project itself. With the kinds of exploratory computational science project that I undertake, I always find myself short of good diagnostics to guide my (or my students’) progress towards a goal.

Because, as always, I will be introducing several new ideas here, we will keep today’s examples elementary. Next time we look at these tools, we’ll ramp up the complexity and apply them to a saddle manifold computation that I’ve been working on, so you can see a more realistic application that can really benefit from the tools. Today’s post is pedagogical: I’ll present how to pick apart the workings of common root-finding algorithms, and graphically track and log their progress during their iterative steps. With the large overhead in preparation and execution, I wouldn’t expect anyone to want to set this up for a mere root finding algorithm unless they were using it as an educational tool for teaching numerical analysis.

Here’s an easy-to-setup animation of some of the demo’s raw output to motivate you to read further…

We have enough new ideas to cover here that, for the time being, I won’t be presenting this work in the form of my literate modeling workflow.

## The need for better diagnostics with complex algorithms

Let’s say you write an algorithm, or you are asked to apply an existing algorithm that has a lot of tricky parameters to set.

It is typical to try running it and hope that it “just works” right out of the box, but that rarely happens, right? Common complaints will then be: “Why is it not working properly?” or “How can I make it run faster?” Until you understand what’s happening under the hood, you often don’t stand a chance of optimizing the use of the algorithm.

If you even get this far beyond blind trial and error, the most common way to track progress is with print statements. And that’s already an order of magnitude improvement. But for more sophisticated algorithms or hairy problems, you can still get lost in reams of text output. In some cases, especially those related to dynamical systems modeling, graphing some of the data generated by the algorithm during its operation can be more helpful. Again, that probably yields another order of magnitude improvement on understanding.

### The worst offender is the hidden assumptions

When you’re working on really hard problems involving strong nonlinearities, multiple scales, poor conditioning, noisy data, etc., the algorithms you are interested in understanding can end up nested inside other algorithms. A simple example is a predictor-corrector method, used for many things but commonly found in numerical continuation algorithms. Correction is often done using Newton’s Method, but Newton’s Method is a slippery thing to handle. If you aren’t confident about certain properties of your starting situation, that method can rapidly diverge.

This example is already non-trivial to understand if you are new to high-level applied math, but it’s just the beginning! As complex projects involve more nested computations, the accumulating assumptions that are necessary to ensure each computation will be successful (or efficient) also compose together in a complex way. It can become unfeasibly difficult to pre-validate or post-verify the satisfaction of all the relevant assumptions, moreso than any other challenge in executing the project.

This challenge is mainly due to the usually implicit nature of the assumptions. For the sake of efficiency, many implementations do not validate all their inputs and instead expect the caller (either the end user or another algorithm) to know what they are doing. That passing of the buck is fair but is extremely problematic once algorithms become nested and if the user’s experience or expert knowledge of the algorithms is limited. Moreover, in many cases, it can be either theoretically impossible or, at least, practically difficult for algorithms to validate all the assumptions relevant to their effectiveness even if that could be done quickly enough to not degrade performance.

### Towards test-driven analysis

We could obviously add unit tests for the algorithm’s code implementation, but if we don’t understand much about the possible emergent properties of nested algorithms then it’s hard to write thorough and appropriate tests a priori. Even with some good tests written, if they fail at runtime because of a logical or semantic error then we might need to diagnose and explore the behavior of the algorithm on a specific problem or class of problems before we can write appropriate test. Fovea can play a role in this. The literate modeling principles could also lead to a clearer understanding of the convoluted assumptions when dealing with nested algorithms.

## Pre-requisites for this demo

### Code dependencies

Download Fovea from github and install with the usual python setup.py install. This link is to a version tagged at the time of writing, to keep with the functionality exactly as presented here. You can, of course, also get the latest version and see what has progressed in the meantime.

ADDENDUM 1: For Python 3.x users, I had to subsequently update the github repo, so checkout the latest one and install pyeuclid from my fork that updates for Python 3 compatibility.

ADDENDUM 2: You may need to install pyeuclid from my github fork (covers some Python 3 compatibility not present in PyPI) and the geos dependency for the python shapely package. Binaries are available by independently installing shapely from here or you can try to do it yourself using the appropriate version of OSGeo4W.

At present, the package depends on shapely, descartes, pyyaml, euclid, PyDSTool (latest github version at time of writing or the forthcoming v0.90.1 for those of you in the future), structlog, and tinydb, although only the last three are actually needed for this particular presentation. In turn, PyDSTool requires numpy, scipy, and matplotlib, which are essential for Fovea.

At the time of writing, one package dependency (TinyDB) is not correctly automatically installing from this setup process on some platforms, and you may need to first run pip install tinydb (or by inserting the --upgrade option before tinydb if you already have pre-v.2) before re-running my setup.py script.

### Knowledge dependencies

You can read an introduction to root finding on wikipedia. I’m not going to introduce it here except to say that it is ubiquitous and highly relevant to science, engineering, or medical computation wherever you need to detect zero crossings in images, signals, or curves.

Bisection is a method that most computational scientists have seen or even used at some point. (It is conceptually similar to the discrete version known as “binary search”.) It’s comparably slow but reliable and robust. Once at least one root is known to exist in an interval, the method will converge. On the other hand, Newton-like methods are very fast but rather fussy about what is known of the test function’s properties ahead of time.

The secant method that we use here is a Newton-like iterative method, which uses a simple finite difference in lieu of the function’s symbolic derivative (as this may be unknown or hard to compute):

$$x_n = \left( x_{n-2} f(x_{n-2}) - x_{n-1} f(x_{n-1}) \right) / \left( f(x_{n-1}) - f(x_{n-2}) \right)$$

These two methods are sufficient to show off the principles of Fovea for algorithm diagnostics.

### Other assumptions

Our major assumption is that we have access to the algorithm’s source code (i.e., a ‘white box’ situation) and knowledge of where to put diagnostic elements. When developing open source algorithms, this should not be a difficult assumption to satisfy.

## Demonstration examples

I am using my fork of a Gist, which contains deliberate errors in the secant implementation for us to diagnose and fix. I am basing the algorithm implementations on someone else’s original code to reinforce the idea that analysis can be set up on existing functions without changing their API or, if so desired, without affecting any of the function’s existing output (e.g. print statements, warnings, or errors). Once either the code is understood or properly debugged and tested, all the diagnostic statements can simply be stripped out without having to alter any of the original statements. For instance, any meta data or intermediate data that needs to be stored or logged does not have to be temporarily returned by the function by altering its return statements or internal logic. The logger takes care of capturing all that non-output related data for later processing. This simplifies testing and avoids introducing more bugs when trying to restore the function to its original state.

Now for the two test functions, whose roots we will seek.

### Function 1

$f_1(x) = x^3- \pi/x + e/100$

This is a simple cubic function with three roots slightly offset from integer values:

### Function 2

$$f_2(x) = -1.13+\tanh(x-2)+4 \exp(-x) \sin((1/8) x^3) x+ \exp(x/35)/10$$.

From a numerical analysis standpoint, the second function can be an unpleasant function for several reasons. Primarily, it crosses the axis many times between x = 0 and 10. It is highly non-monotone, nonlinear and exhibits multiple scales. Notice that its frequency increases cubically as its amplitude decreases. The function is completely contrived but was chosen to resemble the Gibbs’ phenomenon in realistic signals and in Fourier analysis. It looks like this on our domain of interest:

### Bisection with a well behaved function

We will run the test script bisection_test1.py in this sub-section. The script calls on a modified version of num in num_modded.py.

For reference, the original bisection function is given in num_original.py (taken directly from here), and looks like this:

def bisection(f, a, b, TOL=0.001, NMAX=100):
"""
Takes a function f, start values [a,b], tolerance value(optional) TOL and
max number of iterations(optional) NMAX and returns the root of the equation
using the bisection method.
"""
n=1
while n<=NMAX:
c = (a+b)/2.0
print("a=%s\tb=%s\tc=%s\tf(c)=%s"%(a,b,c,f(c)))
if f(c)==0 or (b-a)/2.0 < TOL:
return c
else:
n = n+1
if f(c)*f(a) > 0:
a=c
else:
b=c
return False

The original code came with no description of assumptions or tests to verify the correctness of the code. It does include a print statement to monitor the values of a, b, and c during iteration.

On quick inspection, we can realize that it has been assumed that b > a in the tolerance check before the first return statement. For now, we will not worry about the lack of abs() or a check for this ordering. Nor will we worry about the usual validation test that the given interval [a, b] actually contains a sign change in the test function f. It is reasonable to test these things before using the function but at least the graphical diagnostics help us to see this for ourselves in practice.

The bisection function is embellished with log statements for all the major steps, replacing the print statement with  dm.log.msg(‘Bisect loop’, a=a, b=b, c=c). This asks the logger make an entry for the event named "Bisect loop" and record the current float values of the variables by name.

Additionally, we invoke graphical output that shows the intermediate data in each iteration step: the interval endpoints a and b (red and green dots, resp.) and their point of bisection, c.

def bisection(f, a, b, TOL=0.001, NMAX=100):
"""
Takes a function f, start values [a,b], tolerance value(optional) TOL and
max number of iterations(optional) NMAX and returns the root of the equation
using the bisection method.
"""
n=1
dm.log.msg('Call args', a=a, b=b, TOL=TOL, NMAX=NMAX)
name='n_value', layer='meta_data', style='k')
while n<=NMAX:
dm.log = dm.log.bind(n=n)
plotter.setText('n_value', 'n=%d'%n, 'meta_data')
if n == 1:
rebuild = True
else:
plotter.toggleDisplay(layer='bisect_text_%d'%(n-1))
plotter.toggleDisplay(layer='bisect_data_%d'%(n-1))
rebuild = False
c = (a+b)/2.0
dm.log.msg('Bisect loop', a=a, b=b, c=c)
a_pt = plot_pt(a, f, n, 'a', 'r', 'bisect', 'o')
b_pt = plot_pt(b, f, n, 'b', 'g', 'bisect', 'o')
c_pt = plot_pt(c, f, n, 'c', 'k', 'bisect', 'x')

if f(c)==0 or abs(b-a)/2.0 < TOL:
dm.log.msg('Success', fval=f(c), err=(b-a)/2.0)
dm.log = dm.log.unbind('n')
plotter.show(rebuild=rebuild)
return c
else:
n = n+1
if f(c)*f(a) > 0:
dm.log.msg('Same sign')
a=c
else:
dm.log.msg('Opposite sign')
b=c
dm.log.msg('Step', err=(b-a)/2.0)
plotter.show(rebuild=rebuild)
dm.log.msg('Failure', status='fail', fval=f(c), err=(b-a)/2.0)
dm.log = dm.log.unbind('n')
return False

The plots make use of a helper function plot_pt that’s set up to save code repetition:

def plot_pt(x, f, n, name, col, layer_root, marker='o'):
"""
Internal diagnostic helper function.
"""
pt = pp.Point2D(x, f(x))
layer=layer_root+'_data_%d'%n, log=dm.log)
fs = plotter.figs['Master']
ax = fs.arrange['11']['axes_obj']
xlim = ax.get_xlim()
ylim = ax.get_ylim()
x_ext = xlim[1]-xlim[0]
y_ext = ylim[1]-ylim[0]
# add marker text at a distance proportional to size of
# current axis extent
use_axis_coords=False, name=name+'_%d'%n,
layer=layer_root+'_text_%d'%n, style=col)
return pt

dm is the (shorthand) name for the global diagnostic_manager object from fovea/diagnostics.py, and plotter is the name for the global plotter2D object from fovea/graphics.py.

The dm.log object is a structlog object that performs structured logging. Details of this package can be found in its online docs. In particular, it is used here to bind the current value of n to the logging statements during each iteration. This is known as making a context. The log output lines automatically reflect this context. For example, the log line

n=6 status='ok' layer='bisect_data_6' name='a_6' figure='Master'

appears when code line 79 of num_modded.py is executed:

a_pt = plot_pt(a, f, n, 'a', 'r', 'bisect', 'o')

We selected for the log to be dumped to a JSON file in a sub-directory named after our diagnostic attempt. This directory will be created if it doesn’t already exist. The graphics will automatically be saved there too in PNG format.

Calls to the plotter are automatically logged by passing the log object as an optional argument to any plotter calls that add or change data.

layer=layer_root+'_data_%d'%n, log=dm.log)

The plotter object represents a system built over matplotlib. It contains convenient structures for organizing plots of many inter-related elements using a matplotlib-based, configurable GUI. For instance, its core organization is based on figures, axes, layers, and individual plotted object elements. Notice that it’s the “layers” that are the new feature here: the rest are just the regular matplotlib elements but grouped together in a way that allows a GUI to be built over specially designated windows. Layers are useful for exactly the same reason they are useful in a package like Photoshop: they provide a powerful organizational tool for related objects or functions that can be grouped and treated as a single entity. Thus, it is easy to switch the visibility of whole layers, or to delete all the objects in a particular layer, or change all the line styles in a layer.

We will show layers to be helpful for managing and quickly navigating lots of metadata that are produced during the execution of a long iteration. At this time, there can be no nested sub-layers.

#### Attempt 1: Success

The various graphs output from our test script produced this GIF:

#### A look through the test script

# domain of interest
DOI = ([0,2.1],[-2.5,2])

plotter.clean() # in case rerun in same session
title='Root Finding Diagnostic',
xlabel='x', ylabel='y',
domain=DOI)

dm.use_dir('bisect1_success')
dm.make_log('log.json')
plotter.dm = dm

# default to wait for user input on each iteration
plotter.wait_status = True

Near the beginning of the script, we declare two primary plotting layers: ‘fn_data’ and ‘meta_data’.

plotter.addLayer('fn_data')      # kind defaults to 'data'

We specify that the arrangement of the figure will have:

plotter.arrangeFig([1,1], {'11':
{'name': 'Bisection method',
'scale': DOI,
'layers': '*',  # all layers will be selected
'axes_vars': ['x', 'y']}
})

gui.buildPlotter2D((8,8), with_times=False)

# permanent axes

We will be making more layers for this figure within the call to the bisect method, but we are able to specify now that all layers will belong to this figure using the '*' wildcard option. In the future, I could imagine making the syntax options more like what we expect from text search using partial matches and both the * and ? wildcards to help structure the figures based on layer naming conventions.

During each iteration, text that shows the iteration value of n is updated on the figure in global, absolute axis coordinates. That way, the text position remains the same regardless of how the axes are zoomed or panned within the plot.

Each iteration involves making two new layers for each n, one for the label text (‘bisect_text_n’) and one for the points (‘bisect_data_n’). Into each one, we add the relevant data. After that, we switch off the visibility of the previous n’s two layers. Thus, the layer data is still searchable and accessible after the calculation is over, and can be reviewed from interactive commands. During this interaction over an iteration step, the resulting figure updates from a call to plotter.show().

We simply invoke the diagnostics using the same call that would be done when using the function for real: root = bisection(f1, 0.1, 2), which will hopefully return the x value of a root in [0.1, 2].

At the end of each iteration, we have selected to pause and wait for user confirmation to proceed. This allows the user to inspect the graphical output, resize it, etc. Pressing “s” autosaves the state of the main figure before proceeding. From the saved PNG files, I simply ran them through an online GIF maker.

Presently, there is no way to auto-zoom to the same settings on each iteration in order to ensure a smooth animation. I could have tried harder to make the plots line up better between iterations, but I’m sure you can get the idea.

### Secant method with a well behaved function

The secant function is embellished with logging for all major steps and graphics that show intermediate data: the previous two x values (red and green dots), the secant line that they imply (blue), and the new f(x2) value (marked by a cross) that is projected out on a red line to the curve from the point at which the secant line crosses the axis.

Although all the diagnostic functions add a lot of code to the script, you’ll quickly notice that it’s all but identical to what we had in the bisection script with a few titles and filenames changed.

#### Attempt 1: First bug

The first time code was marked up with plot and log statements, an error was reported:

KeyError: 'Layer name already exists in figure!'

This immediately uncovered a problem inherited from the original numeric code that I hadn’t even noticed. n was set to 1 (line 109) and then not being incremented in the loop. Thus, when the logger tried to create new log entries uniquely named according to the value of n, there was a name clash. This is helpful because the algorithm might successfully converge before NMAX is reached in testing, and a case that converged extremely slowly or even diverge would not be caught. On one hand, this motivates the need for unit tests to cover all scenarios and catch these issues up front, which is the ideal way to proceed. But our logging approach is also effective at uncovering broken internal logic.

So, I added an increment for n (line 154) and tried again.

#### Attempt 2: Second bug

This time, the graphic shows an initial state in which the interval contains a single root and the function is monotone. This should be a no-brainer for the algorithm to converge! However, it didn’t.

We immediately see that the secant line created from x0 and x1 is crossing the x-axis in the first iteration but the crossing point is not being used to create the position of x2. We re-read the defining line

x2 = x1 - f(x1)*((x1-x0)/(f(x0)-f(x1)))

and compare it to the mathematical definition above, and we realize that the function calls in the denominator’s subtraction are in reverse order.

#### Attempt 2: Third bug

With the (deliberately placed) subtraction order bug fixed, we try running the test script again after changing the output directory’s name. This time, the algorithm stops at the end of the first iteration with this picture.

Clearly, x2 is not nearly close enough to the actual root for the tolerance to have been met as intended. The log reports:

n=1 x2=1.5705399403789213 status='ok' x0=1.05 x1=2 event='Secant loop'
n=1 status='ok' err=-0.4294600596210787 fval=-1.0329268501690227 event='Success'

The error is negative and clearly not smaller in size than TOL (which is the default of 0.001). When we look at the code:

if x2-x1 < TOL:
dm.log.msg('Success', fval=fx2, err=x2-x1)
dm.log = dm.log.unbind('n')
plotter.show(rebuild=rebuild)
return x2

we realize that it should have included a call to abs() and have read:

if abs(x2-x1) < TOL:
dm.log.msg('Success', fval=fx2, err=abs(x2-x1))
dm.log = dm.log.unbind('n')
plotter.show(rebuild=rebuild)
return x2

We see that we had also “accidentally” copied the error into the log call, too.

#### Attempt 3: Success

We make the necessary fixes to the method’s code, change the output directory again in the test script and re-run it. We find that it works this time:

### Bisection with a poorly behaved function

Bisection works nicely on this function, given that we bounded the zero crossings with the initial interval. It converges after 14 iterations. Of course, we don’t get to control which of the many zeros it converges to. Here is the output:

The test script is essentially identical to the previous ones.

### Secant method with a poorly behaved function

Fortunately, we find that this method converges in this test, although after a temporary foray into much larger x values (at n=5) that give an impression that it’s about to diverge. It converges after 12 iterations, which is hardly better than bisection in this case.

You can imagine adding further diagnostics to uncover the origin of steps that go further from the root. In this case, given the many oscillations in the interval, there’s going to be a lot of luck involved, as some iterations might just be unlucky to line up oscillating parts of the function at similar amplitudes that project out far from the root. This can be seen below in the n=4 step, where x2 is almost at the same height as x1, so that the next secant line will be almost horizontal and point downwards far to the right of this neighborhood.

We were lucky that the function continues to increase far from the neighborhood of the roots, so that the new x2 will be much higher on the y-axis and will lead to a secant line that projects back inside the desired neighborhood. Certainly, given the highly oscillatory nature of the function on the interval of interest, the secant method is probably not a wise choice.

### Viewing the logs

While the interactive python session is still open, you can access the database of log entries. The demo scripts already fetch the DB for you, with db = dm.log.get_DB(). For instance, to only see the main loop iteration log entries:

>>> db.search(where('event').matches('Bisect loop'))
[{u'a': 0.1,
u'b': 2,
u'c': 1.05,
u'event': u'Bisect loop',
u'id': 1,
u'n': 1,
u'status': u'ok'}
...
{u'a': 1.7662109375,
u'b': 1.769921875,
u'c': 1.76806640625,
u'event': u'Bisect loop',
u'id': 55,
u'n': 10,
u'status': u'ok'},
{u'a': 1.76806640625,
u'b': 1.769921875,
u'c': 1.768994140625,
u'event': u'Bisect loop',
u'id': 61,
u'n': 11,
u'status': u'ok'}]

See the TinyDB docs for further examples and help with making queries.

If you want to go over any of the logs after the session is closed, you can use the diagnostics module’s load_log function, which returns an OrderedDict. You can pretty print it with the info function, which is inherited from PyDSTool. All log entries are numbered in order, but there are no search tools except what you make yourself based on the dictionary format. I provided one example function as filter_log.

>>> info(d)

...

64:
u'figure': u'Master'
u'id': 64
u'layer': u'bisect_data_11'
u'n': 11
u'name': u'c_11'
u'status': u'ok'
65:
u'err': 0.00093
u'event': u'Success'
u'fval': 0.00551
u'id': 65
u'n': 11
u'status': u'ok'

## Discussion

There is some execution time overhead incurred by the diagnostic plotting and especially the logging. However, this setup wasn’t designed for high efficiency although I’m sure it could be better optimized. In any case, a noticeable fraction of a second added to each iteration is a small price to pay for learning precisely what’s going on inside a complex algorithm.

### Next steps

I would love to GUI-fy* this example to give the user control over important algorithmic parameters and help them see the consequences. In this case, the end points of bisection and stopping tolerances are the only parameters; for the secant method, the starting point and stopping tolerances. With this, we could further explore the methods’ convergence behavior by adding to and manipulating the stopping criteria. For instance, we could add a function value tolerance ftol instead of just the tolerance xtol between successive x values, which is currently what this code’s TOL argument represents.

* i.e. attach GUI slider widgets to user-defined params that are easy for the user to hook up to the diagnostic GUI.

The plotter and log calls to Fovea could be set up by the algorithm’s author for use by the end-user ahead of time, but deactivated by default, e.g. using a verbose_level call argument. Perhaps, if a diagnostic_manager object is passed, that could activate the hooks. Or, the user can just be expected to mark up the original code as needed, which is what we did here.

You might imagine wrapping these root-finding algorithms inside a loop that samples initial x values across the whole domain and repeatedly tries these methods to find as many different roots as possible. Different functions and different prior knowledge can strongly affect how one might go about automatically finding “all” roots of an arbitrary function with numeric algorithms. There will never be a simple solution to that problem.

### Not using ipython notebooks, again

I considered using ipython notebooks for this post. It’s a fine idea in principle, given some of my recent epiphanies, but what does it really gain for us here? One possiblity would be using the yaml_magic extension to simplify how to specify the event arguments to the logger. Alas, on further investigation, I realized that we can’t easily access a global diagnostic_manager object that is visible from inside the yaml_magic processing code for the YAML to be used to activate things in the dm or log.

Secondly, function definitions (or any code block) can’t be broken across cells between diagnostic-related code and original function code, so I can’t use cells to distinguish what’s original code and what’s diagnostic markup. Given that my demo involves multiple files as well, I didn’t see the added value in either developing or presenting this demo as a notebook. Please feel free to enlighten me further if you disagree.

## Next installment with Fovea

Next time we look at Fovea, we’ll diagnose a much more sophisticated algorithm that I have been developing, which computes stable and unstable manifolds in 2D (phase plane) models.

Robert Clewley (2015). Logging and diagnostic tools for numeric python algorithms. Transient Dynamic blog