7.2. Relevance Vector Machines

in the predictions made by the model and so are effectively pruned out, resulting in
a sparse model.
Using the result (3.49) for linear regression models, we see that the posterior
distribution for the weights is again Gaussian and takes the form

\[
p(\mathbf{w}\mid\mathbf{t},\mathbf{X},\boldsymbol{\alpha},\beta) = \mathcal{N}(\mathbf{w}\mid\mathbf{m},\boldsymbol{\Sigma})
\tag{7.81}
\]

where the mean and covariance are given by

\[
\mathbf{m} = \beta\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}
\tag{7.82}
\]
\[
\boldsymbol{\Sigma} = \left(\mathbf{A} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}\right)^{-1}
\tag{7.83}
\]

where Φ is the N×M design matrix with elements Φ_{ni} = φ_i(x_n), and A = diag(α_i). Note that in the specific case of the model (7.78), we have Φ = K, where K is the symmetric (N+1)×(N+1) kernel matrix with elements k(x_n, x_m).
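
As a concrete illustration, the posterior statistics (7.82) and (7.83) amount to a few lines of linear algebra. The following is a minimal NumPy sketch, assuming hypothetical inputs Phi (the N×M design matrix), t (the target vector), alpha (the vector of precision hyperparameters α_i) and a scalar noise precision beta; a practical implementation would solve the linear system via a Cholesky factorization rather than forming the explicit inverse.

```python
import numpy as np

def posterior_stats(Phi, t, alpha, beta):
    """Posterior mean m and covariance Sigma of the weights, (7.82)-(7.83)."""
    A = np.diag(alpha)                               # A = diag(alpha_i)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)    # (7.83)
    m = beta * Sigma @ (Phi.T @ t)                   # (7.82)
    return m, Sigma
```
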
The values of α and β are determined using type-2 maximum likelihood, also known as the evidence approximation (Section 3.5), in which we maximize the marginal likelihood function obtained by integrating out the weight parameters


\[
p(\mathbf{t}\mid\mathbf{X},\boldsymbol{\alpha},\beta) = \int p(\mathbf{t}\mid\mathbf{X},\mathbf{w},\beta)\, p(\mathbf{w}\mid\boldsymbol{\alpha})\,\mathrm{d}\mathbf{w}.
\tag{7.84}
\]

Because this represents the convolution of two Gaussians, it is readily evaluated (Exercise 7.10) to give the log marginal likelihood in the form


\[
\ln p(\mathbf{t}\mid\mathbf{X},\boldsymbol{\alpha},\beta) = \ln\mathcal{N}(\mathbf{t}\mid\mathbf{0},\mathbf{C})
= -\frac{1}{2}\left\{ N\ln(2\pi) + \ln|\mathbf{C}| + \mathbf{t}^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{t} \right\}
\tag{7.85}
\]

where t = (t_1, ..., t_N)^T, and we have defined the N×N matrix C given by

\[
\mathbf{C} = \beta^{-1}\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^{\mathrm{T}}.
\tag{7.86}
\]
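
To make (7.85) and (7.86) concrete, the log marginal likelihood can be evaluated directly once C has been formed. The sketch below makes the same assumptions about Phi, t, alpha and beta as the earlier snippet, and uses a Cholesky factor of C for the determinant and the quadratic form; it illustrates evaluation of the objective only, not the full type-2 maximization.

```python
import numpy as np

def log_marginal_likelihood(Phi, t, alpha, beta):
    """Evaluate ln p(t | X, alpha, beta) as given by (7.85)."""
    N = Phi.shape[0]
    # C = beta^{-1} I + Phi A^{-1} Phi^T, equation (7.86)
    C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T
    L = np.linalg.cholesky(C)                    # C = L L^T
    logdet_C = 2.0 * np.sum(np.log(np.diag(L)))  # ln|C|
    y = np.linalg.solve(L, t)                    # t^T C^{-1} t = y^T y
    quad = y @ y
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet_C + quad)
```
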

Our goal is now to maximize (7.85) with respect to the hyperparameters α and β. This requires only a small modification to the results obtained in Section 3.5 for the evidence approximation in the linear regression model. Again, we can identify two approaches. In the first, we simply set the required derivatives of the marginal likelihood to zero and obtain the following re-estimation equations (Exercise 7.12)


\[
\alpha_i^{\text{new}} = \frac{\gamma_i}{m_i^2}
\tag{7.87}
\]
\[
\left(\beta^{\text{new}}\right)^{-1} = \frac{\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}\|^2}{N - \sum_i \gamma_i}
\tag{7.88}
\]

where m_i is the ith component of the posterior mean m defined by (7.82). The quantity γ_i measures how well the corresponding parameter w_i is determined by the data (Section 3.5.3) and is defined by
