# Pattern Recognition and Machine Learning

##### 318 6. KERNEL METHODS

``````whereΨ(aN)=lnp(aN|θ)+lnp(tN|aN). We also need to evaluate the gradient
oflnp(tN|θ)with respect to the parameter vectorθ. Note that changes inθwill
differentiate (6.90) with respect toθ, we obtain two sets of terms, the first arising
from the dependence of the covariance matrixCNonθ, and the rest arising from
dependence ofaNonθ.
The terms arising from the explicit dependence onθcan be found by using
(6.80) together with the results (C.21) and (C.22), and are given by``````

$$
\frac{\partial \ln p(\mathbf{t}_N|\boldsymbol{\theta})}{\partial \theta_j}
= \frac{1}{2} \mathbf{a}_N^{\star\mathrm{T}} \mathbf{C}_N^{-1} \frac{\partial \mathbf{C}_N}{\partial \theta_j} \mathbf{C}_N^{-1} \mathbf{a}_N^{\star}
- \frac{1}{2} \operatorname{Tr}\!\left[ (\mathbf{I} + \mathbf{C}_N \mathbf{W}_N)^{-1} \mathbf{W}_N \frac{\partial \mathbf{C}_N}{\partial \theta_j} \right].
\tag{6.91}
$$
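As an illustrative sketch (not code from the text), the explicit term (6.91) can be evaluated in NumPy with linear solves in place of explicit matrix inverses; the function name `explicit_grad_term` and its argument layout are our own choices:

```python
import numpy as np

def explicit_grad_term(a_star, C, W, dC):
    """Explicit part of d ln p(t_N|theta) / d theta_j, eq. (6.91).

    a_star : (N,)   mode a_N^* of the Laplace approximation
    C      : (N, N) covariance matrix C_N
    W      : (N, N) diagonal matrix with entries sigma_n (1 - sigma_n)
    dC     : (N, N) elementwise derivative of C_N w.r.t. theta_j
    """
    I = np.eye(len(a_star))
    Cinv_a = np.linalg.solve(C, a_star)             # C_N^{-1} a_N^*
    quad = 0.5 * Cinv_a @ dC @ Cinv_a               # first term of (6.91)
    trace = 0.5 * np.trace(np.linalg.solve(I + C @ W, W @ dC))
    return quad - trace                             # minus half the trace term
```

Solving against $\mathbf{I} + \mathbf{C}_N \mathbf{W}_N$ avoids forming $(\mathbf{I} + \mathbf{C}_N \mathbf{W}_N)^{-1}$ explicitly, which is both cheaper and numerically better conditioned.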

To compute the terms arising from the dependence of $\mathbf{a}_N^\star$ on $\boldsymbol{\theta}$, we note that the Laplace approximation has been constructed such that $\Psi(\mathbf{a}_N)$ has zero gradient at $\mathbf{a}_N = \mathbf{a}_N^\star$, and so $\Psi(\mathbf{a}_N)$ gives no contribution to the gradient as a result of its dependence on $\mathbf{a}_N^\star$. This leaves the following contribution to the derivative with respect to a component $\theta_j$ of $\boldsymbol{\theta}$

$$
- \frac{1}{2} \sum_{n=1}^N \frac{\partial \ln |\mathbf{W}_N + \mathbf{C}_N^{-1}|}{\partial a_n^\star} \frac{\partial a_n^\star}{\partial \theta_j}
= - \frac{1}{2} \sum_{n=1}^N \left[ (\mathbf{I} + \mathbf{C}_N \mathbf{W}_N)^{-1} \mathbf{C}_N \right]_{nn} \sigma_n (1 - \sigma_n)(1 - 2\sigma_n) \frac{\partial a_n^\star}{\partial \theta_j}
\tag{6.92}
$$

``````whereσn =σ(an), and again we have used the result (C.22) together with the
definition ofWN. We can evaluate the derivative ofaNwith respect toθjby differ-
entiating the relation (6.84) with respect toθjto give``````

``````∂an
∂θj``````

##### ∂CN

``∂θj``

``(tN−σN)−CNWN``

``````∂an
∂θj``````

##### . (6.93)

Rearranging then gives

$$
\frac{\partial \mathbf{a}_N^\star}{\partial \theta_j}
= (\mathbf{I} + \mathbf{C}_N \mathbf{W}_N)^{-1} \frac{\partial \mathbf{C}_N}{\partial \theta_j} (\mathbf{t}_N - \boldsymbol{\sigma}_N).
\tag{6.94}
$$
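As a small illustration (ours, not the book's), (6.94) can likewise be evaluated as a single linear solve; `da_dtheta` and its argument names are assumptions for this sketch:

```python
import numpy as np

def da_dtheta(t, a_star, C, dC):
    """Solve the implicit relation (6.93) for the vector d a_N^* / d theta_j."""
    s = 1.0 / (1.0 + np.exp(-a_star))       # sigma_N evaluated at the mode
    W = np.diag(s * (1.0 - s))              # W_N
    rhs = dC @ (t - s)                      # (dC_N/dtheta_j)(t_N - sigma_N)
    return np.linalg.solve(np.eye(len(t)) + C @ W, rhs)
```

The resulting vector is then substituted into the sum (6.92) to obtain the implicit part of the gradient.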

Combining (6.91), (6.92), and (6.94), we can evaluate the gradient of the log likelihood function, which can be used with standard nonlinear optimization algorithms in order to determine a value for $\boldsymbol{\theta}$.
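The pieces above can be combined into a minimal NumPy sketch of the hyperparameter gradient, under the assumption of a logistic-sigmoid likelihood and a kernel of the form $\theta \mathbf{K}_0$ so that $\partial \mathbf{C}_N / \partial \theta = \mathbf{K}_0$; the function names (`laplace_mode`, `log_marginal`, `grad_log_marginal`) and the Newton iteration schedule are our own illustrative choices, not code from the text:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_mode(t, C, iters=50):
    # Newton iteration for the posterior mode a_N^* (cf. eq. 6.83)
    a = np.zeros_like(t)
    I = np.eye(len(t))
    for _ in range(iters):
        s = sigmoid(a)
        W = np.diag(s * (1.0 - s))
        a = C @ np.linalg.solve(I + W @ C, t - s + W @ a)
    return a

def log_marginal(t, C):
    # Laplace approximation to ln p(t_N|theta), eq. (6.90); the
    # (N/2) ln 2*pi terms cancel and the two log-determinants
    # combine into ln|I + C_N W_N|.
    a = laplace_mode(t, C)
    s = sigmoid(a)
    W = np.diag(s * (1.0 - s))
    psi = -0.5 * a @ np.linalg.solve(C, a) + t @ a - np.sum(np.logaddexp(0.0, a))
    return psi - 0.5 * np.linalg.slogdet(np.eye(len(t)) + C @ W)[1]

def grad_log_marginal(t, C, dC):
    # Gradient of the approximate log marginal likelihood,
    # combining eqs. (6.91), (6.92), and (6.94).
    a = laplace_mode(t, C)
    s = sigmoid(a)
    W = np.diag(s * (1.0 - s))
    I = np.eye(len(t))
    Cinv_a = np.linalg.solve(C, a)
    # explicit dependence of C_N on theta_j, eq. (6.91)
    g = 0.5 * Cinv_a @ dC @ Cinv_a
    g -= 0.5 * np.trace(np.linalg.solve(I + C @ W, W @ dC))
    # implicit dependence through a_N^*, eqs. (6.92) and (6.94)
    da = np.linalg.solve(I + C @ W, dC @ (t - s))
    diagB = np.diag(np.linalg.solve(I + C @ W, C))
    g -= 0.5 * np.sum(diagB * s * (1.0 - s) * (1.0 - 2.0 * s) * da)
    return g
```

Because $\Psi$ is stationary at the mode, this gradient should agree with a central finite difference of `log_marginal` with respect to $\theta$, which provides a convenient correctness check before handing the gradient to an optimizer.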
We can illustrate the application of the Laplace approximation for Gaussian processes using the synthetic two-class data set (Appendix A) shown in Figure 6.12. Extension of the Laplace approximation to Gaussian processes involving $K > 2$ classes, using the softmax activation function, is straightforward (Williams and Barber, 1998).