Pattern Recognition and Machine Learning

7. SPARSE KERNEL MACHINES

where β = σ^{-2} is the noise precision (inverse noise variance), and the mean is given
by a linear model of the form

y(x) = \sum_{i=1}^{M} w_i \phi_i(x) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(x)        (7.77)

with fixed nonlinear basis functions \phi_i(x), which will typically include a constant
term so that the corresponding weight parameter represents a 'bias'.
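The linear model (7.77) can be sketched as follows; the Gaussian basis functions and their centres are an illustrative choice, not prescribed by the text, with the constant basis \phi_0(x) = 1 supplying the bias weight:

```python
import numpy as np

def design_matrix(x, centres, width=1.0):
    """Return Phi with Phi[n, i] = phi_i(x[n]); column 0 is the bias basis."""
    # Gaussian bumps around illustrative centres (a hypothetical basis choice)
    gauss = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * width ** 2))
    # Prepend the constant basis phi_0(x) = 1 for the bias term
    return np.hstack([np.ones((x.shape[0], 1)), gauss])

x = np.linspace(0.0, 1.0, 5)
centres = np.array([0.2, 0.5, 0.8])
Phi = design_matrix(x, centres)   # shape (5, 4): bias column + 3 Gaussians
w = np.zeros(Phi.shape[1])
y = Phi @ w                       # y(x) = Phi w, evaluated at all inputs at once
```

Evaluating the model at all inputs reduces to the matrix product Phi w, which is the form used throughout the later derivations.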
The relevance vector machine is a specific instance of this model, which is intended
to mirror the structure of the support vector machine. In particular, the basis
functions are given by kernels, with one kernel associated with each of the data
points from the training set. The general expression (7.77) then takes the SVM-like
form

y(x) = \sum_{n=1}^{N} w_n k(x, x_n) + b        (7.78)

where b is a bias parameter. The number of parameters in this case is M = N + 1,
and y(x) has the same form as the predictive model (7.64) for the SVM, except that
the coefficients a_n are here denoted w_n. It should be emphasized that the subsequent
analysis is valid for arbitrary choices of basis function, and for generality we shall
work with the form (7.77). In contrast to the SVM, there is no restriction to positive-definite
kernels, nor are the basis functions tied in either number or location to the
training data points.
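A minimal sketch of the SVM-like form (7.78), with one kernel per training point plus a bias, so M = N + 1 parameters; the RBF kernel and all numerical values are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Illustrative kernel choice; (7.78) does not fix a particular k."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def predict(x, X_train, w, b, gamma=1.0):
    """y(x) = sum_n w_n k(x, x_n) + b, as in (7.78)."""
    return sum(w_n * rbf_kernel(x, x_n, gamma)
               for w_n, x_n in zip(w, X_train)) + b

X_train = np.array([[0.0], [1.0], [2.0]])   # N = 3 training inputs
w = np.array([0.5, -0.2, 0.1])              # one weight per training point
b = 0.3                                     # bias parameter
value = predict(np.array([1.0]), X_train, w, b)
```

Note that, unlike the SVM, nothing here requires the kernel to be positive definite; any function k(x, x_n) could be substituted.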
Suppose we are given a set of N observations of the input vector x, which we
denote collectively by a data matrix X whose nth row is x_n^T, with n = 1, ..., N. The
corresponding target values are given by t = (t_1, ..., t_N)^T. Thus, the likelihood
function is given by

p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \mathbf{w}, \beta^{-1}).        (7.79)
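Since each factor in (7.79) is a Gaussian about the model output with precision β, the log-likelihood takes a simple closed form; this sketch assumes the design-matrix representation, and the variable names are illustrative:

```python
import numpy as np

def log_likelihood(t, Phi, w, beta):
    """log p(t | X, w, beta) for t_n ~ N(w^T phi(x_n), beta^{-1})."""
    resid = t - Phi @ w
    N = t.shape[0]
    # Sum of N Gaussian log-densities: normalisation term minus weighted
    # squared residuals
    return 0.5 * N * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * resid @ resid

Phi = np.array([[1.0, 0.0], [1.0, 1.0]])   # bias column + one basis function
w = np.array([0.0, 1.0])
t = np.array([0.0, 1.0])                    # exact fit: residuals are zero
beta = 4.0                                  # noise precision sigma^{-2}
ll = log_likelihood(t, Phi, w, beta)        # here just (N/2) log(beta / 2 pi)
```

With zero residuals only the normalisation term survives, which makes the role of β as a noise precision explicit.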

Next we introduce a prior distribution over the parameter vector w and, as in
Chapter 3, we shall consider a zero-mean Gaussian prior. However, the key difference
in the RVM is that we introduce a separate hyperparameter α_i for each of the
weight parameters w_i, instead of a single shared hyperparameter. Thus the weight
prior takes the form

p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}(w_i \mid 0, \alpha_i^{-1})        (7.80)

where α_i represents the precision of the corresponding parameter w_i, and α denotes
(α_1, ..., α_M)^T. We shall see that, when we maximize the evidence with respect
to these hyperparameters, a significant proportion of them go to infinity, and the
corresponding weight parameters have posterior distributions that are concentrated
at zero. The basis functions associated with these parameters therefore play no role
in the predictions made by the model, giving rise to a sparse solution.
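The pruning effect of the prior (7.80) can be seen directly by sampling: each weight has its own precision α_i, and driving α_i to a very large value collapses that weight's distribution onto zero. The precision values below are illustrative assumptions:

```python
import numpy as np

# Independent zero-mean Gaussian prior on each weight, as in (7.80).
# A huge alpha_i plays the role of alpha_i -> infinity: the corresponding
# weight is effectively pinned to zero and its basis function is pruned.
rng = np.random.default_rng(0)
alpha = np.array([1.0, 1e8])                # second weight effectively pruned
scales = 1.0 / np.sqrt(alpha)               # std dev of each weight is alpha_i^{-1/2}
w_samples = rng.normal(0.0, scales, size=(1000, 2))
stds = w_samples.std(axis=0)                # first ~1, second vanishingly small
```

This is the mechanism behind the sparsity of the RVM: evidence maximization sends many α_i to infinity, and the associated basis functions drop out of the model.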