
[Figure 7.10: two panels plotted in the $(t_1, t_2)$ plane, each showing the target vector $\mathbf{t}$, an ellipse for the covariance $\mathbf{C}$, and the direction of the basis vector $\boldsymbol{\phi}$; see caption below.]
Figure 7.10  Illustration of the mechanism for sparsity in a Bayesian linear regression model, showing a training set vector of target values given by $\mathbf{t} = (t_1, t_2)^{\mathrm{T}}$, indicated by the cross, for a model with one basis vector $\boldsymbol{\phi} = (\phi(x_1), \phi(x_2))^{\mathrm{T}}$, which is poorly aligned with the target data vector $\mathbf{t}$. On the left we see a model having only isotropic noise, so that $\mathbf{C} = \beta^{-1}\mathbf{I}$, corresponding to $\alpha = \infty$, with $\beta$ set to its most probable value. On the right we see the same model but with a finite value of $\alpha$. In each case the red ellipse corresponds to unit Mahalanobis distance, with $|\mathbf{C}|$ taking the same value for both plots, while the dashed green circle shows the contribution arising from the noise term $\beta^{-1}$. We see that any finite value of $\alpha$ reduces the probability of the observed data, and so for the most probable solution the basis vector is removed.


the mechanism of sparsity in the context of the relevance vector machine. In the
process, we will arrive at a significantly faster procedure for optimizing the hyper-
parameters compared to the direct techniques given above.
Before proceeding with a mathematical analysis, we first give some informal
insight into the origin of sparsity in Bayesian linear models. Consider a data set
comprising $N = 2$ observations $t_1$ and $t_2$, together with a model having a single
basis function $\phi(x)$, with hyperparameter $\alpha$, along with isotropic noise having
precision $\beta$. From (7.85), the marginal likelihood is given by $p(\mathbf{t}|\alpha, \beta) = \mathcal{N}(\mathbf{t}|\mathbf{0}, \mathbf{C})$ in
which the covariance matrix takes the form

$$\mathbf{C} = \frac{1}{\beta}\mathbf{I} + \frac{1}{\alpha}\boldsymbol{\phi}\boldsymbol{\phi}^{\mathrm{T}} \qquad\qquad (7.92)$$

where $\boldsymbol{\phi}$ denotes the $N$-dimensional vector $(\phi(x_1), \phi(x_2))^{\mathrm{T}}$, and similarly $\mathbf{t} = (t_1, t_2)^{\mathrm{T}}$. Notice that this is just a zero-mean Gaussian process model over $\mathbf{t}$ with covariance $\mathbf{C}$. Given a particular observation for $\mathbf{t}$, our goal is to find $\alpha$ and $\beta$ by maximizing the marginal likelihood.
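For concreteness (this is simply the standard log density of a zero-mean Gaussian, written out here for reference rather than quoted from the text), the quantity being maximized is

$$\ln p(\mathbf{t}|\alpha, \beta) = -\frac{1}{2}\left\{ N \ln(2\pi) + \ln|\mathbf{C}| + \mathbf{t}^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{t} \right\}$$

with $\mathbf{C}$ given by (7.92) and $N = 2$. Assigning variance along $\boldsymbol{\phi}$ (that is, choosing a finite $\alpha$) increases the penalty term $\ln|\mathbf{C}|$, and this cost is repaid through a reduction in the data-fit term $\mathbf{t}^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{t}$ only when $\mathbf{t}$ has a substantial component along $\boldsymbol{\phi}$.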
We see from Figure 7.10 that, if there is a poor alignment between the direction of $\boldsymbol{\phi}$ and that of the training data vector $\mathbf{t}$, then the corresponding hyperparameter $\alpha$ will be driven to $\infty$, and the basis vector will be pruned from the model. This arises because any finite value for $\alpha$ will always assign a lower probability to the data, thereby decreasing the value of the density at $\mathbf{t}$, provided that $\beta$ is set to its optimal value: a finite $\alpha$ elongates the distribution in a direction away from the data, increasing the probability mass in regions away from the observed data and hence reducing the value of the density at the target data vector itself. For the more general case of $M$ basis vectors $\phi_1, \ldots, \phi_M$, a similar intuition holds: a basis vector that is poorly aligned with the data vector $\mathbf{t}$ is likely to be pruned from the model. A small numerical illustration of this pruning behaviour is given below.
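The following is a minimal sketch, not taken from the text: it uses NumPy, fixes $\beta$, $\mathbf{t}$, and the candidate basis vectors at made-up illustrative values, and evaluates $\ln \mathcal{N}(\mathbf{t}|\mathbf{0}, \mathbf{C})$ over a grid of $\alpha$ for a poorly aligned and a well aligned basis vector.

import numpy as np

def log_marginal_likelihood(t, phi, alpha, beta):
    # Log of N(t | 0, C) with C = I/beta + phi phi^T / alpha, cf. (7.92).
    N = len(t)
    C = np.eye(N) / beta + np.outer(phi, phi) / alpha
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))

# Made-up illustrative numbers (not from the book): fixed noise precision and targets.
beta = 4.0
t = np.array([1.0, -1.0])
candidates = {
    "poorly aligned": np.array([1.0, 1.0]),   # orthogonal to t
    "well aligned":   np.array([1.0, -1.0]),  # parallel to t
}

alphas = np.logspace(-2, 4, 200)
for name, phi in candidates.items():
    ll = np.array([log_marginal_likelihood(t, phi, a, beta) for a in alphas])
    ll_inf = log_marginal_likelihood(t, phi, np.inf, beta)  # alpha -> infinity prunes phi
    print(f"{name}: best finite alpha {alphas[ll.argmax()]:.3g}, "
          f"max log-lik {ll.max():.4f}, log-lik at alpha=inf {ll_inf:.4f}")

With these particular numbers the poorly aligned basis vector never improves on the $\alpha = \infty$ (pruned) solution, whereas the well aligned one attains its maximum at a finite $\alpha$, mirroring the behaviour illustrated in Figure 7.10.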