Pattern Recognition and Machine Learning

318 6. KERNEL METHODS

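where $\Psi(\mathbf{a}_N^\star) = \ln p(\mathbf{a}_N^\star \mid \boldsymbol{\theta}) + \ln p(\mathbf{t}_N \mid \mathbf{a}_N^\star)$. We also need to evaluate the gradient of $\ln p(\mathbf{t}_N \mid \boldsymbol{\theta})$ with respect to the parameter vector $\boldsymbol{\theta}$. Note that changes in $\boldsymbol{\theta}$ will cause changes in $\mathbf{a}_N^\star$, leading to additional terms in the gradient. Thus, when we differentiate (6.90) with respect to $\boldsymbol{\theta}$, we obtain two sets of terms, the first arising from the dependence of the covariance matrix $\mathbf{C}_N$ on $\boldsymbol{\theta}$, and the rest arising from the dependence of $\mathbf{a}_N^\star$ on $\boldsymbol{\theta}$. The terms arising from the explicit dependence on $\boldsymbol{\theta}$ can be found by using (6.80) together with the results (C.21) and (C.22), and are given by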

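$$
\frac{\partial \ln p(\mathbf{t}_N \mid \boldsymbol{\theta})}{\partial \theta_j}
= \frac{1}{2} \mathbf{a}_N^{\star\mathrm{T}} \mathbf{C}_N^{-1} \frac{\partial \mathbf{C}_N}{\partial \theta_j} \mathbf{C}_N^{-1} \mathbf{a}_N^\star
- \frac{1}{2} \mathrm{Tr}\left[ (\mathbf{I} + \mathbf{W}_N \mathbf{C}_N)^{-1} \mathbf{W}_N \frac{\partial \mathbf{C}_N}{\partial \theta_j} \right]. \tag{6.91}
$$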
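
As a minimal sketch of how the explicit-dependence terms might be evaluated numerically, the NumPy function below computes the right-hand side of (6.91) for a single hyperparameter. The function name, the argument names, and the use of linear solves in place of explicit inverses are illustrative assumptions rather than anything prescribed by the text. The trace is written with the factor $(\mathbf{I} + \mathbf{W}_N \mathbf{C}_N)^{-1} \mathbf{W}_N$, which is the ordering obtained by differentiating $\ln|\mathbf{I} + \mathbf{C}_N \mathbf{W}_N|$ with $\mathbf{W}_N$ held fixed.

```python
import numpy as np

def explicit_gradient_term(a_star, C, dC, W):
    """Explicit-dependence terms of (6.91) for one hyperparameter theta_j.

    a_star : (N,) posterior mode a_N*, treated as fixed
    C      : (N, N) covariance matrix C_N
    dC     : (N, N) elementwise derivative of C_N with respect to theta_j
    W      : (N, N) diagonal matrix with entries sigma_n (1 - sigma_n)
    """
    N = C.shape[0]
    alpha = np.linalg.solve(C, a_star)               # C_N^{-1} a_N*
    quad = 0.5 * alpha @ dC @ alpha                  # (1/2) a*^T C^{-1} dC C^{-1} a*
    M = np.linalg.solve(np.eye(N) + W @ C, W @ dC)   # (I + W C)^{-1} W dC
    return quad - 0.5 * np.trace(M)
```

With `a_star` and `W` held fixed, the returned value should agree with a finite-difference derivative of $-\tfrac{1}{2} \mathbf{a}_N^{\star\mathrm{T}} \mathbf{C}_N^{-1} \mathbf{a}_N^\star - \tfrac{1}{2} \ln|\mathbf{I} + \mathbf{C}_N \mathbf{W}_N|$, which collects the terms of (6.90) that depend on $\boldsymbol{\theta}$ through $\mathbf{C}_N$ alone.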

To compute the terms arising from the dependence of $\mathbf{a}_N^\star$ on $\boldsymbol{\theta}$, we note that the Laplace approximation has been constructed such that $\Psi(\mathbf{a}_N)$ has zero gradient at $\mathbf{a}_N = \mathbf{a}_N^\star$, and so $\Psi(\mathbf{a}_N^\star)$ gives no contribution to the gradient as a result of its dependence on $\mathbf{a}_N^\star$. This leaves the following contribution to the derivative with respect to a component $\theta_j$ of $\boldsymbol{\theta}$

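$$
-\frac{1}{2} \sum_{n=1}^{N} \frac{\partial \ln |\mathbf{W}_N + \mathbf{C}_N^{-1}|}{\partial a_n^\star} \frac{\partial a_n^\star}{\partial \theta_j}
= -\frac{1}{2} \sum_{n=1}^{N} \left[ (\mathbf{I} + \mathbf{C}_N \mathbf{W}_N)^{-1} \mathbf{C}_N \right]_{nn} \sigma_n^\star (1 - \sigma_n^\star)(1 - 2\sigma_n^\star) \frac{\partial a_n^\star}{\partial \theta_j} \tag{6.92}
$$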
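
To make the coefficient in (6.92) concrete, the following sketch (NumPy, with hypothetical function names) evaluates $-\tfrac{1}{2} \ln|\mathbf{W}_N + \mathbf{C}_N^{-1}|$ as a function of $\mathbf{a}_N$ together with its derivative with respect to each $a_n$. It relies on two facts: the diagonal entry $\sigma_n(1-\sigma_n)$ of $\mathbf{W}_N$ has derivative $\sigma_n(1-\sigma_n)(1-2\sigma_n)$ with respect to $a_n$, and $(\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1} = (\mathbf{I} + \mathbf{C}_N \mathbf{W}_N)^{-1} \mathbf{C}_N$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logdet_term(a, C):
    """-1/2 ln|W_N + C_N^{-1}| viewed as a function of a_N."""
    s = sigmoid(a)
    W = np.diag(s * (1.0 - s))
    # An explicit inverse is used here only to mirror the formula directly.
    _, logdet = np.linalg.slogdet(W + np.linalg.inv(C))
    return -0.5 * logdet

def logdet_term_grad(a, C):
    """Derivative of logdet_term with respect to each a_n, i.e. the factor
    that multiplies d a_n / d theta_j inside the sum of (6.92)."""
    s = sigmoid(a)
    W = np.diag(s * (1.0 - s))
    N = C.shape[0]
    B = np.linalg.solve(np.eye(N) + C @ W, C)    # (I + C W)^{-1} C
    return -0.5 * np.diag(B) * s * (1.0 - s) * (1.0 - 2.0 * s)
```

Taking the dot product of the vector returned by `logdet_term_grad` with the derivative of the mode with respect to $\theta_j$ gives the sum in (6.92).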

where $\sigma_n^\star = \sigma(a_n^\star)$, and again we have used the result (C.22) together with the definition of $\mathbf{W}_N$. We can evaluate the derivative of $\mathbf{a}_N^\star$ with respect to $\theta_j$ by differentiating the relation (6.84) with respect to $\theta_j$ to give

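$$
\frac{\partial \mathbf{a}_N^\star}{\partial \theta_j}
= \frac{\partial \mathbf{C}_N}{\partial \theta_j} (\mathbf{t}_N - \boldsymbol{\sigma}_N)
- \mathbf{C}_N \mathbf{W}_N \frac{\partial \mathbf{a}_N^\star}{\partial \theta_j}. \tag{6.93}
$$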

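Collecting the terms in the derivative of $\mathbf{a}_N^\star$ on the left-hand side gives $(\mathbf{I} + \mathbf{C}_N \mathbf{W}_N)\, \partial \mathbf{a}_N^\star / \partial \theta_j = (\partial \mathbf{C}_N / \partial \theta_j)(\mathbf{t}_N - \boldsymbol{\sigma}_N)$, and rearranging then gives

$$
\frac{\partial \mathbf{a}_N^\star}{\partial \theta_j}
= (\mathbf{I} + \mathbf{C}_N \mathbf{W}_N)^{-1} \frac{\partial \mathbf{C}_N}{\partial \theta_j} (\mathbf{t}_N - \boldsymbol{\sigma}_N). \tag{6.94}
$$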
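
Because (6.93) is linear in the unknown derivative, it can be solved as the system $(\mathbf{I} + \mathbf{C}_N \mathbf{W}_N)\,\mathbf{x} = (\partial \mathbf{C}_N / \partial \theta_j)(\mathbf{t}_N - \boldsymbol{\sigma}_N)$ without forming an inverse. The sketch below does this with NumPy. To allow a finite-difference comparison it also includes a hypothetical helper `find_mode`, a Newton iteration for the fixed point $\mathbf{a} = \mathbf{C}_N(\mathbf{t}_N - \boldsymbol{\sigma}(\mathbf{a}))$ of (6.84); both function names, the starting point, and the stopping rule are assumptions made for illustration, and targets are taken to be in $\{0, 1\}$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def find_mode(C, t, iters=100, tol=1e-12):
    """Newton iteration for the mode, i.e. the solution of a = C (t - sigma(a))."""
    a = np.zeros(len(t))
    I = np.eye(len(t))
    for _ in range(iters):
        s = sigmoid(a)
        W = np.diag(s * (1.0 - s))
        # Newton step written as C (I + W C)^{-1} (t - sigma + W a)
        a_new = C @ np.linalg.solve(I + W @ C, t - s + W @ a)
        if np.max(np.abs(a_new - a)) < tol:
            return a_new
        a = a_new
    return a

def mode_derivative(C, dC, t, a_star):
    """Solve (I + C W) x = dC (t - sigma) for x = d a_N* / d theta_j."""
    s = sigmoid(a_star)
    W = np.diag(s * (1.0 - s))
    return np.linalg.solve(np.eye(len(t)) + C @ W, dC @ (t - s))
```

A simple way to check the result is to recompute the mode at $\theta_j \pm \epsilon$ with `find_mode` and compare the central difference with the output of `mode_derivative`.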

Combining (6.91), (6.92), and (6.94), we can evaluate the gradient of the log likelihood function, which can be used with standard nonlinear optimization algorithms in order to determine a value for $\boldsymbol{\theta}$.

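
The combination of the three pieces can be sketched end to end as follows. This is an illustrative NumPy implementation under stated assumptions, not a definitive one: the squared-exponential `kernel` with a fixed jitter term, the Newton iteration in `laplace_fit`, targets in $\{0, 1\}$, and all function names are hypothetical choices. The evidence is computed from (6.90) with the two log-determinants merged into $\ln|\mathbf{I} + \mathbf{C}_N \mathbf{W}_N|$, and the gradient adds the explicit terms to the sum over $n$ that arises from the dependence of the mode on $\boldsymbol{\theta}$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kernel(X, theta, jitter=1e-2):
    """Hypothetical covariance theta[0] * exp(-theta[1]/2 ||x - x'||^2) + jitter * I,
    returned with its derivatives with respect to theta[0] and theta[1]."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    E = np.exp(-0.5 * theta[1] * D2)
    return theta[0] * E + jitter * np.eye(len(X)), [E, -0.5 * theta[0] * D2 * E]

def laplace_fit(C, t, iters=100, tol=1e-12):
    """Newton iteration for the mode a_N*; returns (a_star, sigma, W)."""
    a, I = np.zeros(len(t)), np.eye(len(t))
    for _ in range(iters):
        s = sigmoid(a)
        W = np.diag(s * (1.0 - s))
        a_new = C @ np.linalg.solve(I + W @ C, t - s + W @ a)
        done = np.max(np.abs(a_new - a)) < tol
        a = a_new
        if done:
            break
    s = sigmoid(a)
    return a, s, np.diag(s * (1.0 - s))

def log_evidence(theta, X, t):
    """Laplace approximation to ln p(t_N | theta)."""
    C, _ = kernel(X, theta)
    a, s, W = laplace_fit(C, t)
    log_lik = np.sum(t * a - np.logaddexp(0.0, a))           # ln p(t_N | a_N*)
    logdet = np.linalg.slogdet(np.eye(len(t)) + C @ W)[1]    # ln|C| + ln|W + C^{-1}|
    return -0.5 * a @ np.linalg.solve(C, a) + log_lik - 0.5 * logdet

def log_evidence_grad(theta, X, t):
    """Gradient: explicit terms plus the contribution through the mode."""
    C, dCs = kernel(X, theta)
    a, s, W = laplace_fit(C, t)
    I = np.eye(len(t))
    alpha = np.linalg.solve(C, a)                            # C^{-1} a*
    B = np.linalg.solve(I + C @ W, C)                        # (W + C^{-1})^{-1}
    coeff = -0.5 * np.diag(B) * s * (1 - s) * (1 - 2 * s)    # factors in the sum over n
    grad = np.empty(len(dCs))
    for j, dC in enumerate(dCs):
        explicit = 0.5 * alpha @ dC @ alpha \
            - 0.5 * np.trace(np.linalg.solve(I + W @ C, W @ dC))
        da = np.linalg.solve(I + C @ W, dC @ (t - s))        # d a_N* / d theta_j
        grad[j] = explicit + coeff @ da
    return grad
```

Comparing `log_evidence_grad` with central differences of `log_evidence` is a useful check before handing both to an optimizer. Since routines such as `scipy.optimize.minimize` minimize their objective, one would pass the negated evidence and, through the `jac` argument, the negated gradient; keeping the hyperparameters positive, for example by optimizing their logarithms, is left out of this sketch.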
We can illustrate the application of the Laplace approximation for Gaussian processes using the synthetic two-class data set (see Appendix A) shown in Figure 6.12. Extension of the Laplace approximation to Gaussian processes involving $K > 2$ classes, using the softmax activation function, is straightforward (Williams and Barber, 1998).
