RKHSs are one of those things that a lot of people seem to use without really understanding what they actually are. Plenty of packages exist for (e.g.) fitting smoothing splines or doing support vector machine learning that sort of gloss over the details of constructing the space and making it do what you want.
This is kind of a shame, because not only are they interesting in their own right (really!), understanding them is also kind of necessary to a) using them correctly, and b) extending them when you hit a problem that your favorite software package isn’t built to handle.
So this is my own attempt to give some background and explanation of them, both motivating why they’re useful and explaining how they do the neat things that they do.
To be brief, reproducing kernel Hilbert spaces have the following “nice” properties (actually they’ve got lots of nice properties, but these in particular interest me):
- They form a convenient space of smooth functions (for some definition of smooth that I’ll get to in a bit); this makes them nice for nonparametric estimation, for instance, where we generally want to fit some smooth function to noisy data.
- The inner product of two elements in that space can be computed via a (generally pretty simple) kernel function regardless of the dimension of the space. This means that even for an infinite-dimensional space, we can evaluate inner products quickly and efficiently. This leads to its use in classification, such as support vector machines.
- They have very close ties to stochastic processes: it turns out that any weakly stationary stochastic process “lives” in a RKHS created by the covariance function of the process.
This is the point that tradition dictates I begin talking about reproducing kernels, evaluation functionals, and representers of evaluation, and then go on to build up RKHS’s from scratch. I’m actually going to try to really gloss over this, and try to come back to the details only when needed.
The short, short version is that every RKHS
has a set of functions defined on it so that, for any function
evaluated at some point
, we have some other function
that will let us find the value
by an inner product operation. In other words,
; in a fit of originality, the function
is called the “representer of evaluation at
“. It’s worth noting at this point that some of the inner products involved can be quite a bit different from the dot-product-ish formulation that we often sort of jump to, and some of them are frankly just weird. Generally, when we’re dealing with finite-dimensional spaces things are pretty easy, and when we go to infinite-dimensional spaces they go right off the rails.
The analogy with this in
is the Dirac delta function:
using the “usual” inner product. It’s pretty easy to see that this can’t be a RKHS (
isn’t in
), but the evaluation works similarly.
Once we have this, we can also see that we must have some function
where we can actually use these representers to evaluate themselves, and since
actually gives us the value of
at the point
, we can see that
and therefore
. Basically, you can pull the kernel out of its own hat; hence the name “reproducing kernel”.
Take a minute and convince yourself of this. If you’re anything like me this will seem confusing for about ten minutes, then suddenly quite obvious.
So now here’s the two important parts, at least with respect to smoothness: first, if two functions in the RKHS converge in norm, then they converge pointwise (start with the pointwise difference of two functions in
, express it in terms of the kernel, apply Cauchy-Schwarz to show that the pointwise difference is bounded above by a constant times the norm of the difference), and second (using a similar trick) we can show that any function in
is Lipschitz continuous with constant
and distance defined by the RK.
It turns out — through the Moore-Aronszajn theorem (Aronszajn, 1950) — that any RKHS has a reproducing kernel that is positive definite; just as importantly, the converse holds: if you have a positive definite function, then it has associated with it a unique RKHS. This means we have nice eigenfunction decompositions of our RKs, which is a Very Good Thing. In particular, it means that we can define some space
that is made of linear combinations of our eigenfunctions, that this space
is dense in
, and that
forms the closure of the space
.
And the upshot of all of that is that if we have some finite number of points and we want to fit a function
to this data that minimizes:

where we think of
is some loss function measuring how well the data fits, and
as some penalization function on e.g. smoothness, we have (Kimeldorf and Wahba, 1970) a unique solution of the form

This, believe it or not, is very, very cool. Basically, with any positive definite kernel, we can find some smooth (or at least Lipschitz continuous) function in the space that we are interested in that minimizes some loss and some penalty based on an inner product. Furthermore, we can go a step further and define our space
so that it only contains smooth functions, such as ones with two continuous derivatives. This RKHS may be of infinite dimension (like, say, a space of smooth functions), but we can still find a function that fits our data points in the manner we’ve specified in some dimension
subspace of
.
So in practical terms, what we usually do is split
into two distinct subspaces
(where our ‘good’ solutions live) and
(where the ‘roughness’ lives); this is very easy to do for RKs, see Aronszajin (1950). Then we let
and
into a nondecreasing function; this lets us penalize some features (higher derivatives) while leaving others alone.
So that’s all highly entertaining, but what about some examples?
A pretty simple one is basic Brownian motion; recall that, for a BM starting at 0, we have that
. Since the weak derivative of
, we can write our inner product as
, which gets us the reproducing property. This corresponds to the space
of absolutely continuous functions
with
and first (at least weak) derivatives in
.
This means that our white noise process can be represented by some linear combination of the form:

Let’s pretend, for instance, that we’ve got three observations:
, and we want a squared error loss function. Without penalizing and assuming that the observations are noiseless, we see that we want coefficients
so that we have

This is a pretty simple linear system, giving us the solution

which is exactly the conditional expected value of
for all times
: it linearly interpolates between our points, and then has flat expected value of 0.07 for all future times. We add more data, and we get more accurate interpolation. Through sufficient handwaving and chanting the mantra “up to sets of measure zero”, we can see that the claim that this will regenerate our BM in the large-sample limit is at least reasonable, if not exactly proven.
So that’s interesting, if not exactly groundbreaking. But notice that we didn’t even need to evaluate that inner product, we just needed the kernel. And while that was a pretty toy case, in general (see Parzen) we can find optimal an BLUE for pretty much any second-order stationary time series by cranking out the above steps. It gets a little more complicated when you add noise, but not by much, thanks to another useful properties of RKHSs: their additivity.
Next time, we’ll take a look at the case where we’ve got simple IID Gaussian noise on top of the signal, and show how we can still recover pretty good estimates for the mean signal.
References
Aronszajn, N. “Theory of reproducing kernels.” Transactions of the American Mathematical Society 68 (1950).
Kimeldorf, George S., and Grace Wahba. “Spline Functions and Stochastic Processes.” Sankhyā: The Indian Journal of Statistics, Series A 32, no. 2 (June 1, 1970): 173-180.