##### 318 6. KERNEL METHODS

where Ψ(a_N) = ln p(a_N|θ) + ln p(t_N|a_N). We also need to evaluate the gradient of ln p(t_N|θ) with respect to the parameter vector θ. Note that changes in θ will cause changes in a★_N, leading to additional terms in the gradient. Thus, when we differentiate (6.90) with respect to θ, we obtain two sets of terms, the first arising from the dependence of the covariance matrix C_N on θ, and the rest arising from the dependence of a★_N on θ.

The terms arising from the explicit dependence on θ can be found by using (6.80) together with the results (C.21) and (C.22), and are given by

$$
\frac{\partial \ln p(\mathbf{t}_N\mid\boldsymbol{\theta})}{\partial \theta_j}
= \frac{1}{2}\mathbf{a}_N^{\star\mathrm{T}}\,\mathbf{C}_N^{-1}\frac{\partial \mathbf{C}_N}{\partial \theta_j}\mathbf{C}_N^{-1}\mathbf{a}_N^{\star}
- \frac{1}{2}\operatorname{Tr}\!\left[(\mathbf{I}+\mathbf{C}_N\mathbf{W}_N)^{-1}\mathbf{W}_N\frac{\partial \mathbf{C}_N}{\partial \theta_j}\right].
\qquad (6.91)
$$
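As a concrete illustration of this expression, the explicit-dependence term can be evaluated numerically. The sketch below is not from the text: the function name `explicit_grad_term` is hypothetical, and it assumes the mode a★_N, the matrices C_N and W_N, and the derivative ∂C_N/∂θ_j are supplied as dense NumPy arrays.

```python
import numpy as np

def explicit_grad_term(a_star, C, W, dC):
    """Explicit dependence of ln p(t_N | theta) on theta_j, as in (6.91):
    (1/2) a*^T C^{-1} dC C^{-1} a*  -  (1/2) Tr[(I + C W)^{-1} W dC].
    a_star, C, W, dC stand for a*_N, C_N, W_N, and dC_N/dtheta_j."""
    N = C.shape[0]
    Cinv_a = np.linalg.solve(C, a_star)       # C_N^{-1} a*_N via a solve, not an explicit inverse
    quad = 0.5 * Cinv_a @ dC @ Cinv_a         # first (quadratic) term of (6.91)
    trace = 0.5 * np.trace(np.linalg.solve(np.eye(N) + C @ W, W @ dC))  # second (trace) term
    return quad - trace
```

Using linear solves rather than explicit matrix inverses is the usual choice here for numerical stability.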

To compute the terms arising from the dependence of a★_N on θ, we note that the Laplace approximation has been constructed such that Ψ(a_N) has zero gradient at a_N = a★_N, and so Ψ(a_N) gives no contribution to the gradient as a result of its dependence on a★_N. This leaves the following contribution to the derivative with respect to a component θ_j of θ

$$
-\frac{1}{2}\sum_{n=1}^{N}\frac{\partial \ln|\mathbf{W}_N+\mathbf{C}_N^{-1}|}{\partial a_n^{\star}}\frac{\partial a_n^{\star}}{\partial\theta_j}
= -\frac{1}{2}\sum_{n=1}^{N}\left[(\mathbf{I}+\mathbf{C}_N\mathbf{W}_N)^{-1}\mathbf{C}_N\right]_{nn}\sigma_n(1-\sigma_n)(1-2\sigma_n)\frac{\partial a_n^{\star}}{\partial\theta_j}
\qquad (6.92)
$$

where σ_n = σ(a★_n), and again we have used the result (C.22) together with the definition of W_N. We can evaluate the derivative of a★_N with respect to θ_j by differentiating the relation (6.84) with respect to θ_j to give

$$
\frac{\partial \mathbf{a}_N^{\star}}{\partial\theta_j}
= \frac{\partial \mathbf{C}_N}{\partial\theta_j}(\mathbf{t}_N-\boldsymbol{\sigma}_N)
- \mathbf{C}_N\mathbf{W}_N\frac{\partial \mathbf{a}_N^{\star}}{\partial\theta_j}.
\qquad (6.93)
$$

Rearranging then gives

$$
\frac{\partial \mathbf{a}_N^{\star}}{\partial\theta_j}
= (\mathbf{I}+\mathbf{C}_N\mathbf{W}_N)^{-1}\frac{\partial \mathbf{C}_N}{\partial\theta_j}(\mathbf{t}_N-\boldsymbol{\sigma}_N).
\qquad (6.94)
$$
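To see the rearrangement explicitly, collect both occurrences of ∂a★_N/∂θ_j in (6.93) on the left-hand side,

$$
(\mathbf{I}+\mathbf{C}_N\mathbf{W}_N)\frac{\partial \mathbf{a}_N^{\star}}{\partial\theta_j}
= \frac{\partial \mathbf{C}_N}{\partial\theta_j}(\mathbf{t}_N-\boldsymbol{\sigma}_N),
$$

and left-multiply by (I + C_N W_N)^{-1}; this inverse exists because C_N is positive definite and the diagonal elements of W_N are non-negative.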

Combining (6.91), (6.92), and (6.94), we can evaluate the gradient of the log likelihood function, which can be used with standard nonlinear optimization algorithms in order to determine a value for θ.

We can illustrate the application of the Laplace approximation for Gaussian processes using the synthetic two-class data set (Appendix A) shown in Figure 6.12. Extension of the Laplace approximation to Gaussian processes involving K > 2 classes, using the softmax activation function, is straightforward (Williams and Barber, 1998).