Pattern Recognition and Machine Learning

7. SPARSE KERNEL MACHINES

where β = σ^{-2} is the noise precision (inverse noise variance), and the mean is given
by a linear model of the form

y(x) = \sum_{i=1}^{M} w_i \phi_i(x) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(x)        (7.77)

with fixed nonlinear basis functions \phi_i(x), which will typically include a constant
term so that the corresponding weight parameter represents a 'bias'.
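The linear model (7.77) can be sketched as follows; the Gaussian basis functions and their centres are an illustrative choice, not prescribed by the text, with the constant basis \phi_0(x) = 1 supplying the bias weight:

```python
import numpy as np

def design_matrix(x, centres, width=1.0):
    """Return Phi with Phi[n, i] = phi_i(x[n]); column 0 is the bias basis."""
    # Gaussian bumps around illustrative centres (a hypothetical basis choice)
    gauss = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * width ** 2))
    # Prepend the constant basis phi_0(x) = 1 for the bias term
    return np.hstack([np.ones((x.shape[0], 1)), gauss])

x = np.linspace(0.0, 1.0, 5)
centres = np.array([0.2, 0.5, 0.8])
Phi = design_matrix(x, centres)   # shape (5, 4): bias column + 3 Gaussians
w = np.zeros(Phi.shape[1])
y = Phi @ w                       # y(x) = Phi w, evaluated at all inputs at once
```

Evaluating the model at all inputs reduces to the matrix product Phi w, which is the form used throughout the later derivations.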
The relevance vector machine is a specific instance of this model, which is intended
to mirror the structure of the support vector machine. In particular, the basis
functions are given by kernels, with one kernel associated with each of the data
points from the training set. The general expression (7.77) then takes the SVM-like
form

y(x) = \sum_{n=1}^{N} w_n k(x, x_n) + b        (7.78)

where b is a bias parameter. The number of parameters in this case is M = N + 1,
and y(x) has the same form as the predictive model (7.64) for the SVM, except that
the coefficients a_n are here denoted w_n. It should be emphasized that the subsequent
analysis is valid for arbitrary choices of basis function, and for generality we shall
work with the form (7.77). In contrast to the SVM, there is no restriction to positive-definite
kernels, nor are the basis functions tied in either number or location to the
training data points.
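A minimal sketch of the SVM-like form (7.78), with one kernel per training point plus a bias, so M = N + 1 parameters; the RBF kernel and all numerical values are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Illustrative kernel choice; (7.78) does not fix a particular k."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def predict(x, X_train, w, b, gamma=1.0):
    """y(x) = sum_n w_n k(x, x_n) + b, as in (7.78)."""
    return sum(w_n * rbf_kernel(x, x_n, gamma)
               for w_n, x_n in zip(w, X_train)) + b

X_train = np.array([[0.0], [1.0], [2.0]])   # N = 3 training inputs
w = np.array([0.5, -0.2, 0.1])              # one weight per training point
b = 0.3                                     # bias parameter
value = predict(np.array([1.0]), X_train, w, b)
```

Note that, unlike the SVM, nothing here requires the kernel to be positive definite; any function k(x, x_n) could be substituted.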
Suppose we are given a set of N observations of the input vector x, which we
denote collectively by a data matrix X whose nth row is x_n^T, with n = 1, ..., N. The
corresponding target values are given by t = (t_1, ..., t_N)^T. Thus, the likelihood
function is given by

p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \mathbf{w}, \beta^{-1}).        (7.79)
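Since each factor in (7.79) is a Gaussian about the model output with precision β, the log-likelihood takes a simple closed form; this sketch assumes the design-matrix representation, and the variable names are illustrative:

```python
import numpy as np

def log_likelihood(t, Phi, w, beta):
    """log p(t | X, w, beta) for t_n ~ N(w^T phi(x_n), beta^{-1})."""
    resid = t - Phi @ w
    N = t.shape[0]
    # Sum of N Gaussian log-densities: normalisation term minus weighted
    # squared residuals
    return 0.5 * N * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * resid @ resid

Phi = np.array([[1.0, 0.0], [1.0, 1.0]])   # bias column + one basis function
w = np.array([0.0, 1.0])
t = np.array([0.0, 1.0])                    # exact fit: residuals are zero
beta = 4.0                                  # noise precision sigma^{-2}
ll = log_likelihood(t, Phi, w, beta)        # here just (N/2) log(beta / 2 pi)
```

With zero residuals only the normalisation term survives, which makes the role of β as a noise precision explicit.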

Next we introduce a prior distribution over the parameter vector w and, as in
Chapter 3, we shall consider a zero-mean Gaussian prior. However, the key difference
in the RVM is that we introduce a separate hyperparameter α_i for each of the
weight parameters w_i, instead of a single shared hyperparameter. Thus the weight
prior takes the form

p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}(w_i \mid 0, \alpha_i^{-1})        (7.80)

where α_i represents the precision of the corresponding parameter w_i, and α denotes
(α_1, ..., α_M)^T. We shall see that, when we maximize the evidence with respect
to these hyperparameters, a significant proportion of them go to infinity, and the
corresponding weight parameters have posterior distributions that are concentrated
at zero. The basis functions associated with these parameters therefore play no role
in the predictions made by the model, giving rise to a sparse solution.
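The pruning effect of the prior (7.80) can be seen directly by sampling: each weight has its own precision α_i, and driving α_i to a very large value collapses that weight's distribution onto zero. The precision values below are illustrative assumptions:

```python
import numpy as np

# Independent zero-mean Gaussian prior on each weight, as in (7.80).
# A huge alpha_i plays the role of alpha_i -> infinity: the corresponding
# weight is effectively pinned to zero and its basis function is pruned.
rng = np.random.default_rng(0)
alpha = np.array([1.0, 1e8])                # second weight effectively pruned
scales = 1.0 / np.sqrt(alpha)               # std dev of each weight is alpha_i^{-1/2}
w_samples = rng.normal(0.0, scales, size=(1000, 2))
stds = w_samples.std(axis=0)                # first ~1, second vanishingly small
```

This is the mechanism behind the sparsity of the RVM: evidence maximization sends many α_i to infinity, and the associated basis functions drop out of the model.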