Pattern Recognition and Machine Learning

6. KERNEL METHODS

Figure 6.9 Samples from the ARD prior for Gaussian processes, in which the kernel function is given by (6.71). The left plot corresponds to η₁ = η₂ = 1, and the right plot corresponds to η₁ = 1, η₂ = 0.01.


Gaussian process framework by introducing a second Gaussian process to represent
the dependence of β on the input x (Goldberg et al., 1998). Because β is a variance,
and hence nonnegative, we use the Gaussian process to model ln β(x).
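As a minimal sketch of this construction (not code from the book), one can draw ln β(x) from a Gaussian process prior and exponentiate it, which guarantees a positive noise variance at every input; the unit length-scale, the grid of inputs, and the jitter term are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 50)

# GP prior over ln beta(x); a squared-exponential Gram matrix with a small
# jitter term added for numerical stability (assumed hyperparameters)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-8 * np.eye(50)
log_beta = rng.multivariate_normal(np.zeros(50), K)

# Exponentiating the sampled function guarantees beta(x) > 0 everywhere
beta = np.exp(log_beta)
print(beta.min())
```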

6.4.4 Automatic relevance determination


In the previous section, we saw how maximum likelihood could be used to de-
termine a value for the correlation length-scale parameter in a Gaussian process.
This technique can usefully be extended by incorporating a separate parameter for
each input variable (Rasmussen and Williams, 2006). The result, as we shall see, is
that the optimization of these parameters by maximum likelihood allows the relative
importance of different inputs to be inferred from the data. This represents an example in the Gaussian process context of automatic relevance determination, or ARD,
which was originally formulated in the framework of neural networks (MacKay,
1994; Neal, 1996). The mechanism by which appropriate inputs are preferred is
discussed in Section 7.2.2.
Consider a Gaussian process with a two-dimensional input space x = (x₁, x₂),
having a kernel function of the form

$$
k(\mathbf{x}, \mathbf{x}') = \theta_0 \exp\left\{ -\frac{1}{2} \sum_{i=1}^{2} \eta_i (x_i - x_i')^2 \right\}. \tag{6.71}
$$
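As an illustration (not code from the book), the kernel (6.71) and prior samples like those shown in Figure 6.9 can be sketched in NumPy; the function name, the evaluation points, and the jitter term are assumptions made for this example:

```python
import numpy as np

def ard_kernel(X1, X2, theta0=1.0, eta=(1.0, 1.0)):
    """ARD kernel of Eq. (6.71): theta0 * exp(-0.5 * sum_i eta_i (x_i - x'_i)^2)."""
    eta = np.asarray(eta)
    diff = X1[:, None, :] - X2[None, :, :]            # shape (N1, N2, D)
    # Weight the squared differences per input dimension by eta_i
    return theta0 * np.exp(-0.5 * np.einsum('d,nmd->nm', eta, diff ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))                   # a few 2-D inputs

for eta in [(1.0, 1.0), (1.0, 0.01)]:                 # the two settings of Figure 6.9
    K = ard_kernel(X, X, eta=eta) + 1e-8 * np.eye(len(X))  # jitter for stability
    y = rng.multivariate_normal(np.zeros(len(X)), K)  # one sample from the GP prior
    print(eta, y.round(3))
```

With η₂ small, inputs that differ only in x₂ receive nearly identical kernel values, so sampled functions are nearly constant along that direction, matching the right-hand plot of Figure 6.9.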


Samples from the resulting prior over functions y(x) are shown for two different
settings of the precision parameters ηᵢ in Figure 6.9. We see that, as a particular parameter ηᵢ becomes small, the function becomes relatively insensitive to the
corresponding input variable xᵢ. By adapting these parameters to a data set using
maximum likelihood, it becomes possible to detect input variables that have little
effect on the predictive distribution, because the corresponding values of ηᵢ will be
small. This can be useful in practice because it allows such inputs to be discarded.
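The adaptation step can be sketched as follows. This is a rough NumPy/SciPy illustration (not the book's code): it maximizes the Gaussian-process log marginal likelihood over log ηᵢ on synthetic data in which only x₁ influences the target; the kernel settings, the data set, and the choice of optimizer are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal(log_eta, X, t, theta0=1.0, beta=10.0):
    """Negative GP log marginal likelihood for the ARD kernel (constant dropped).

    Optimising in log space keeps each eta_i positive.
    """
    eta = np.exp(log_eta)
    diff = X[:, None, :] - X[None, :, :]
    K = theta0 * np.exp(-0.5 * np.einsum('d,nmd->nm', eta, diff ** 2))
    C = K + np.eye(len(X)) / beta                 # add noise variance 1/beta
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))
    # 0.5 * log|C| + 0.5 * t^T C^{-1} t
    return np.sum(np.log(np.diag(L))) + 0.5 * t @ alpha

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(60, 2))
t = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(60)  # only x1 matters

res = minimize(neg_log_marginal, x0=np.zeros(2), args=(X, t))
eta_hat = np.exp(res.x)
print(eta_hat)   # eta_1 should dominate eta_2, since t ignores x_2
```

Because the target carries no information about x₂, the marginal likelihood is increased by shrinking η₂, which is exactly the mechanism by which ARD discards irrelevant inputs.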
ARD is illustrated in Figure 6.10 using a simple synthetic data set having three inputs x₁, x₂, and x₃
(Nabney, 2002). The target variable t is generated by sampling 100
values of x₁ from a Gaussian, evaluating the function sin(2πx₁), and then adding