
[Figure 7.10: two panels plotted in the $(t_1, t_2)$ plane, each showing the target vector $\mathbf{t}$, an ellipse for the covariance $\mathbf{C}$, and the direction of the basis vector $\boldsymbol{\phi}$; see caption below.]
Figure 7.10  Illustration of the mechanism for sparsity in a Bayesian linear regression model, showing a training set vector of target values given by $\mathbf{t} = (t_1, t_2)^{\mathrm{T}}$, indicated by the cross, for a model with one basis vector $\boldsymbol{\phi} = (\phi(x_1), \phi(x_2))^{\mathrm{T}}$, which is poorly aligned with the target data vector $\mathbf{t}$. On the left we see a model having only isotropic noise, so that $\mathbf{C} = \beta^{-1}\mathbf{I}$, corresponding to $\alpha = \infty$, with $\beta$ set to its most probable value. On the right we see the same model but with a finite value of $\alpha$. In each case the red ellipse corresponds to unit Mahalanobis distance, with $|\mathbf{C}|$ taking the same value for both plots, while the dashed green circle shows the contribution arising from the noise term $\beta^{-1}$. We see that any finite value of $\alpha$ reduces the probability of the observed data, and so for the most probable solution the basis vector is removed.


the mechanism of sparsity in the context of the relevance vector machine. In the
process, we will arrive at a significantly faster procedure for optimizing the hyper-
parameters compared to the direct techniques given above.
Before proceeding with a mathematical analysis, we first give some informal
insight into the origin of sparsity in Bayesian linear models. Consider a data set
comprising $N = 2$ observations $t_1$ and $t_2$, together with a model having a single
basis function $\phi(x)$, with hyperparameter $\alpha$, along with isotropic noise having
precision $\beta$. From (7.85), the marginal likelihood is given by $p(\mathbf{t}|\alpha, \beta) = \mathcal{N}(\mathbf{t}|\mathbf{0}, \mathbf{C})$ in
which the covariance matrix takes the form

$$\mathbf{C} = \frac{1}{\beta}\mathbf{I} + \frac{1}{\alpha}\boldsymbol{\phi}\boldsymbol{\phi}^{\mathrm{T}} \qquad\qquad (7.92)$$

where $\boldsymbol{\phi}$ denotes the $N$-dimensional vector $(\phi(x_1), \phi(x_2))^{\mathrm{T}}$, and similarly $\mathbf{t} = (t_1, t_2)^{\mathrm{T}}$. Notice that this is just a zero-mean Gaussian process model over $\mathbf{t}$ with covariance $\mathbf{C}$. Given a particular observation for $\mathbf{t}$, our goal is to find $\alpha$ and $\beta$ by maximizing the marginal likelihood.
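For concreteness (this is simply the standard log density of a zero-mean Gaussian, written out here for reference rather than quoted from the text), the quantity being maximized is

$$\ln p(\mathbf{t}|\alpha, \beta) = -\frac{1}{2}\left\{ N \ln(2\pi) + \ln|\mathbf{C}| + \mathbf{t}^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{t} \right\}$$

with $\mathbf{C}$ given by (7.92) and $N = 2$. Assigning variance along $\boldsymbol{\phi}$ (that is, choosing a finite $\alpha$) increases the penalty term $\ln|\mathbf{C}|$, and this cost is repaid through a reduction in the data-fit term $\mathbf{t}^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{t}$ only when $\mathbf{t}$ has a substantial component along $\boldsymbol{\phi}$.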
We see from Figure 7.10 that, if there is a poor alignment between the direction of $\boldsymbol{\phi}$ and that of the training data vector $\mathbf{t}$, then the corresponding hyperparameter $\alpha$ will be driven to $\infty$, and the basis vector will be pruned from the model. This arises because any finite value for $\alpha$ will always assign a lower probability to the data, thereby decreasing the value of the density at $\mathbf{t}$, provided that $\beta$ is set to its optimal value: a finite $\alpha$ elongates the distribution in a direction away from the data, increasing the probability mass in regions away from the observed data and hence reducing the value of the density at the target data vector itself. For the more general case of $M$ basis vectors $\phi_1, \ldots, \phi_M$, a similar intuition holds: a basis vector that is poorly aligned with the data vector $\mathbf{t}$ is likely to be pruned from the model. A small numerical illustration of this pruning behaviour is given below.
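The following is a minimal sketch, not taken from the text: it uses NumPy, fixes $\beta$, $\mathbf{t}$, and the candidate basis vectors at made-up illustrative values, and evaluates $\ln \mathcal{N}(\mathbf{t}|\mathbf{0}, \mathbf{C})$ over a grid of $\alpha$ for a poorly aligned and a well aligned basis vector.

import numpy as np

def log_marginal_likelihood(t, phi, alpha, beta):
    # Log of N(t | 0, C) with C = I/beta + phi phi^T / alpha, cf. (7.92).
    N = len(t)
    C = np.eye(N) / beta + np.outer(phi, phi) / alpha
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))

# Made-up illustrative numbers (not from the book): fixed noise precision and targets.
beta = 4.0
t = np.array([1.0, -1.0])
candidates = {
    "poorly aligned": np.array([1.0, 1.0]),   # orthogonal to t
    "well aligned":   np.array([1.0, -1.0]),  # parallel to t
}

alphas = np.logspace(-2, 4, 200)
for name, phi in candidates.items():
    ll = np.array([log_marginal_likelihood(t, phi, a, beta) for a in alphas])
    ll_inf = log_marginal_likelihood(t, phi, np.inf, beta)  # alpha -> infinity prunes phi
    print(f"{name}: best finite alpha {alphas[ll.argmax()]:.3g}, "
          f"max log-lik {ll.max():.4f}, log-lik at alpha=inf {ll_inf:.4f}")

With these particular numbers the poorly aligned basis vector never improves on the $\alpha = \infty$ (pruned) solution, whereas the well aligned one attains its maximum at a finite $\alpha$, mirroring the behaviour illustrated in Figure 7.10.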