I’ve been working with some genomics data for a while now.  The standard approach here is to do some sort of clustering, since you’ve got 20-some-odd-thousand genes to look at, and you really don’t want to analyze them all by hand, and a lot of them are repeats or have nearly identical functions anyway.  So yes, clustering is a good thing.  The question is: “How many clusters?” and there’s answers for that; the gap statistic (see Tibshirani, Walther, and Hastie, “Estimating the number of clusters in a data set via the gap statistic,” JRSS: Series B, vol. 63, no. 2, 2001.) is my current favorite.  The problem is that all of the methods I’ve seen basically compare the within-cluster variance for some number of clusters n in your data to the same measure for some random data with similar shape to yours, usually via some SVD type mechanism.  This means that you end up generating and clustering a whole boatload of random data to essentially generate a null distribution for the sum of your within-cluster variance.

With 20k+ genes, and evaluating the gap statistic across perhaps 5 to 50 clusters, this can get incredibly slow.  Depending on how lucky you get with the initialization of the k-means search, I’ve had a single application of the function take nearly 20 minutes in MATLAB.  For one realization.  When I want (loose cannon that I am, only) hundreds.  Even farming this out to the cluster (almost leading to another ‘oops-I-broke-it’ moment when I realized that I was getting ready to send all 210,000 data points out to each of the 64 jobs I’d set up), it still took me nearly a week to run the whole thing.

So while I was sitting, constantly refreshing the queuing statistics on the cluster every few minutes, I decided to poke around in Pylab to see if it has anything I could use.  I figured that worst case I could write my own function and see how it did.

As a test case, I tried clustering all 21,000 data traces into 32 clusters (for fun: calculate the expected number of inner products you have to take to do this) in a shootout.  Average execution time in MATLAB was about seven minutes (433 seconds, if you’re into details), and was exponential to the limits of the eyeball test given n=25; mostly of them converged pretty quick, a few took for-goddamn-ever.

Pylab’s average time was 75 +/- about 4 seconds.  Over three times faster, and way more predictable.

Just to have an excuse to show off PyLab’s plotting functions (through matplotlib, I think), here’s another test of the k-means speed.  I didn’t feel like waiting around for half the day for it to finish, so it’s 1000 trials of clustering a 1000 by 10 array into 20 clusters.  Plotted below is a histogram of execution time for just the k-means step (using tic/toc and time() for MATLAB and Python, respectively).  Same story, just less dramatic; Python is faster with a tighter distribution, MATLAB slower with a higher variance.

I’ve yet to find a core function in MATLAB that I use regularly that I can’t find somewhere in Pylab/NumPy/SciPy, and honestly, if it weren’t for all the thousands of lines of my own code that I already have invested in MATLAB, I probably would probably be out the door about now.  Alas, the perils of vendor lock-in.

I just finished doing some consulting with a graduate student in Linguistics, helping him sort out some phonetic structures in a dead Alaskan language, and began to run into a fairly persistent, fairly annoying problem: the data lived in one monolithic .CSV file; everything was coded fairly nicely, but at the end of the day what I needed to do was a bunch of identical operations on various subsets of the data (fit an identical model to all the different classes of syllables, for instance).

Now you can do this in R, and it’s actually reasonably painless as compared to other programming languages, but it involves a bunch of for-loops and subsets and before you know it you’ve got a list of arrays of models on your hands that you’ve got to write another for-loop to pull all of the bits out of, and your code looks like a mess.

The solution, which I wish I had stumbled across a month ago, is the “plyr” package courtesy of the inimitable Hadley Wickham. It contains a whole bunch of very nice, very tidy wrapper functions that basically chains together the results of “subset”, “apply”, and a concatenation operation to recast it back into an array, list, or data frame, all while keeping the variable names straight and using a very intuitive syntax.

Just as an off the cuff example, say you need to take an array, break it up by rows, apply a function to each row of the matrix, and then stick everything into a data frame. This is pretty straightforward with a ‘for’ loop, but even the cleanest method involves building up your eventual data frame piecewise from the chunks of array, and then finally casting the whole thing at the end. And god help you if you’ve lost a comma somewhere. Plyr can knock this out in a single function: adply. The first letter, ‘a’, tells you that it takes an array, the ‘d’ tells you that it outputs a data frame, and the ply is just so you remember who you owe all your newfound free time to.

It’s also got functions to arrange a data frame’s columns painlessly — which is how I ran across it — what amounts to the missing ‘map’ function in R (charmingly named ‘splat’), and about 30 other useful data-munging functions. If you’re dealing with badly formatted data that you want to do some heavy processing on without losing your mind, you owe it to yourself to check it out.

RKHSs are one of those things that a lot of people seem to use without really understanding what they actually are. Plenty of packages exist for (e.g.) fitting smoothing splines or doing support vector machine learning that sort of gloss over the details of constructing the space and making it do what you want.

This is kind of a shame, because not only are they interesting in their own right (really!), understanding them is also kind of necessary to a) using them correctly, and b) extending them when you hit a problem that your favorite software package isn’t built to handle.

So this is my own attempt to give some background and explanation of them, both motivating why they’re useful and explaining how they do the neat things that they do.

To be brief, reproducing kernel Hilbert spaces have the following “nice” properties (actually they’ve got lots of nice properties, but these in particular interest me):

  1. They form a convenient space of smooth functions (for some definition of smooth that I’ll get to in a bit); this makes them nice for nonparametric estimation, for instance, where we generally want to fit some smooth function to noisy data.
  2. The inner product of two elements in that space can be computed via a (generally pretty simple) kernel function regardless of the dimension of the space.  This means that even for an infinite-dimensional space, we can evaluate inner products quickly and efficiently.  This leads to its use in classification, such as support vector machines.
  3. They have very close ties to stochastic processes: it turns out that any weakly stationary stochastic process “lives” in a RKHS created by the covariance function of the process.

This is the point that tradition dictates I begin talking about reproducing kernels, evaluation functionals, and representers of evaluation, and then go on to build up RKHS’s from scratch.  I’m actually going to try to really gloss over this, and try to come back to the details only when needed.

The short, short version is that every RKHS \mathcal{H} has a set of functions defined on it so that, for any function f\in\mathcal{H} evaluated at some point x\in\mathcal{X}, we have some other function g_x\in\mathcal{H} that will let us find the value f(x) by an inner product operation. In other words, f(x)=\left<g_x, f\right>_\mathcal{H}; in a fit of originality, the function g_x is called the “representer of evaluation at x“. It’s worth noting at this point that some of the inner products involved can be quite a bit different from the dot-product-ish formulation that we often sort of jump to, and some of them are frankly just weird. Generally, when we’re dealing with finite-dimensional spaces things are pretty easy, and when we go to infinite-dimensional spaces they go right off the rails.

The analogy with this in \mathcal{L}^2 is the Dirac delta function: f(x) = \left<\delta_x,f\right> using the “usual” inner product. It’s pretty easy to see that this can’t be a RKHS (\delta_x isn’t in \mathcal{L}^2), but the evaluation works similarly.

Once we have this, we can also see that we must have some function K_y(x) = K(x,y) = \left<g_x,g_y\right> where we can actually use these representers to evaluate themselves, and since K_y(x) actually gives us the value of g_y at the point x, we can see that f(x) = \left<K_y, f\right> and therefore K(x,y) = \left<K_y, K_x\right>. Basically, you can pull the kernel out of its own hat; hence the name “reproducing kernel”.

Take a minute and convince yourself of this. If you’re anything like me this will seem confusing for about ten minutes, then suddenly quite obvious.

So now here’s the two important parts, at least with respect to smoothness: first, if two functions in the RKHS converge in norm, then they converge pointwise (start with the pointwise difference of two functions in \mathcal{H}, express it in terms of the kernel, apply Cauchy-Schwarz to show that the pointwise difference is bounded above by a constant times the norm of the difference), and second (using a similar trick) we can show that any function in \mathcal{H} is Lipschitz continuous with constant \left\|f\right\|_\mathcal{H} and distance defined by the RK.

It turns out — through the Moore-Aronszajn theorem (Aronszajn, 1950) — that any RKHS has a reproducing kernel that is positive definite; just as importantly, the converse holds: if you have a positive definite function, then it has associated with it a unique RKHS. This means we have nice eigenfunction decompositions of our RKs, which is a Very Good Thing. In particular, it means that we can define some space \mathcal{H}_0 that is made of linear combinations of our eigenfunctions, that this space \mathcal{H}_0 is dense in \mathcal{H}, and that \mathcal{H} forms the closure of the space \mathcal{H}_0.

And the upshot of all of that is that if we have some finite number of points and we want to fit a function f\in\mathcal{H} to this data that minimizes:
 L(f(x_1), \ldots, f(x_n)) + \lambda(\left\|f\right\|^2_\mathcal{H})
where we think of L is some loss function measuring how well the data fits, and \lambda as some penalization function on e.g. smoothness, we have (Kimeldorf and Wahba, 1970) a unique solution of the form
 f = \sum_i a_i K(x_i, \cdot).

This, believe it or not, is very, very cool. Basically, with any positive definite kernel, we can find some smooth (or at least Lipschitz continuous) function in the space that we are interested in that minimizes some loss and some penalty based on an inner product. Furthermore, we can go a step further and define our space \mathcal{H} so that it only contains smooth functions, such as ones with two continuous derivatives. This RKHS may be of infinite dimension (like, say, a space of smooth functions), but we can still find a function that fits our data points in the manner we’ve specified in some dimension n subspace of \mathcal{H}.

So in practical terms, what we usually do is split \mathcal{H} into two distinct subspaces \mathcal{H}_0 (where our ‘good’ solutions live) and \mathcal{H}_1 (where the ‘roughness’ lives); this is very easy to do for RKs, see Aronszajin (1950). Then we let \lambda_0(\cdot)=0 and \lambda_1(\cdot) into a nondecreasing function; this lets us penalize some features (higher derivatives) while leaving others alone.

So that’s all highly entertaining, but what about some examples?

A pretty simple one is basic Brownian motion; recall that, for a BM starting at 0, we have that Cov(X_s, X_t) = K(X_s, X_t) = \min(s,t). Since the weak derivative of \min(t, \cdot) = I_{(0,t)}(\cdot), we can write our inner product as \left<K(t, \cdot), f\right> = \int f^\prime(x)I_{(0,t)}(x)dx = \int_0^t f^\prime(x)dx, which gets us the reproducing property. This corresponds to the space \mathcal{H} of absolutely continuous functions f with f(0)=0 and first (at least weak) derivatives in \mathcal{L}^2.

This means that our white noise process can be represented by some linear combination of the form:
X_t = \sum_i a_i\min(t, i)

Let’s pretend, for instance, that we’ve got three observations: X_0 = 0, X_1 = -0.41, X_2 = 0.07, and we want a squared error loss function. Without penalizing and assuming that the observations are noiseless, we see that we want coefficients a_0=0, a_1, a_2 so that we have
X_0 = \sum_{i=1}^2 a_i\min(t, i) = 0\\ X_1 = \sum_{i=1}^2 a_i\min(t, i) = -0.41\\ X_2 = \sum_{i=1}^2 a_i\min(t, i) = 0.07\\

This is a pretty simple linear system, giving us the solution
X_t = 0 \min(t,0) - 0.89 \min(t, 1) + 0.48 \min(t,2)
which is exactly the conditional expected value of X_t|X_0,X_1,X_2 for all times t\geq 0: it linearly interpolates between our points, and then has flat expected value of 0.07 for all future times. We add more data, and we get more accurate interpolation. Through sufficient handwaving and chanting the mantra “up to sets of measure zero”, we can see that the claim that this will regenerate our BM in the large-sample limit is at least reasonable, if not exactly proven.

So that’s interesting, if not exactly groundbreaking. But notice that we didn’t even need to evaluate that inner product, we just needed the kernel. And while that was a pretty toy case, in general (see Parzen) we can find optimal an BLUE for pretty much any second-order stationary time series by cranking out the above steps. It gets a little more complicated when you add noise, but not by much, thanks to another useful properties of RKHSs: their additivity.

Next time, we’ll take a look at the case where we’ve got simple IID Gaussian noise on top of the signal, and show how we can still recover pretty good estimates for the mean signal.

References
Aronszajn, N. “Theory of reproducing kernels.” Transactions of the American Mathematical Society 68 (1950).

Kimeldorf, George S., and Grace Wahba. “Spline Functions and Stochastic Processes.” Sankhyā: The Indian Journal of Statistics, Series A 32, no. 2 (June 1, 1970): 173-180.