Thus y and η must be related, and we denote this relation through η = ψ(y).
Following Nelder and Wedderburn (1972), we define a generalized linear model to be
one for which y is a nonlinear function of a linear combination of the input (or
feature) variables, so that

y = f(w^T φ)    (4.120)

where f(·) is known as the activation function in the machine learning literature,
and f^{-1}(·) is known as the link function in statistics.
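As a concrete illustration, consider a binary target t ∈ {0, 1} governed by a
Bernoulli distribution with mean y. Writing this distribution in exponential
family form gives the natural parameter η = ln{y/(1 − y)}, so that the canonical
relation is ψ(y) = ln{y/(1 − y)} (the logit function) and the matching activation
function f(a) = 1/(1 + e^{−a}) is the logistic sigmoid.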
Now consider the log likelihood function for this model, which, as a function of
η, is given by

ln p(t|η, s) = ∑_{n=1}^N ln p(t_n|η, s) = ∑_{n=1}^N { ln g(η_n) + η_n t_n / s } + const    (4.121)
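Here we have used the restricted exponential family form p(t|η, s) =
(1/s) h(t/s) g(η) exp{η t/s} assumed for the conditional distribution of the
target; the factor h(t_n/s)/s does not depend on η and has been absorbed into the
constant term.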

where we are assuming that all observations share a common scale parameter (which
corresponds to the noise variance for a Gaussian distribution, for instance) and so
s is independent of n. The derivative of the log likelihood with respect to the
model parameters w is then given by

∇_w ln p(t|η, s) = ∑_{n=1}^N { (d/dη_n) ln g(η_n) + t_n/s } (dη_n/dy_n) (dy_n/da_n) ∇a_n

                 = ∑_{n=1}^N (1/s) {t_n − y_n} ψ'(y_n) f'(a_n) φ_n    (4.122)

where a_n = w^T φ_n, and we have used y_n = f(a_n) together with the result (4.119)
for E[t|η], which gives (d/dη_n) ln g(η_n) = −y_n/s; the remaining factors follow
from dη_n/dy_n = ψ'(y_n), dy_n/da_n = f'(a_n), and ∇a_n = φ_n. We now see that
there is a considerable simplification if we choose a particular form for the link
function f^{-1}(y) given by

f^{-1}(y) = ψ(y)    (4.123)

which gives f(ψ(y)) = y and hence f'(ψ)ψ'(y) = 1. Also, because a = f^{-1}(y), we
have a = ψ(y) and hence f'(a)ψ'(y) = 1. In this case, the gradient of the error
function reduces to

∇E(w) = (1/s) ∑_{n=1}^N {y_n − t_n} φ_n.    (4.124)

For the Gaussian s = β^{-1}, whereas for the logistic model s = 1.
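As a minimal numerical sketch of (4.124), assuming NumPy is available (the names
sigmoid, canonical_glm_gradient, Phi, t and w below are illustrative and not taken
from the text), the gradient of the error function for a model with the canonical
link can be evaluated as follows:

import numpy as np

def sigmoid(a):
    # Logistic sigmoid f(a) = 1 / (1 + exp(-a)); the canonical activation
    # for a Bernoulli-distributed target.
    return 1.0 / (1.0 + np.exp(-a))

def canonical_glm_gradient(w, Phi, t, f=sigmoid, s=1.0):
    # Gradient (4.124): (1/s) * sum_n (y_n - t_n) * phi_n, which holds when
    # f is the inverse of the canonical link psi.
    # Phi: (N, M) matrix with rows phi_n; t: (N,) targets; w: (M,) weights.
    y = f(Phi @ w)               # y_n = f(a_n) with a_n = w^T phi_n
    return Phi.T @ (y - t) / s

For the logistic model one would take f = sigmoid and s = 1.0, while for the
Gaussian case f would be the identity and s the noise variance β^{-1}; such a
gradient can then be used, for example, in a simple gradient-based update of w.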

4.4 The Laplace Approximation


In Section 4.5 we shall discuss the Bayesian treatment of logistic regression. As
we shall see, this is more complex than the Bayesian treatment of linear regression
models, discussed in Sections 3.3 and 3.5. In particular, we cannot integrate exactly
over the parameter vector w since the posterior distribution is no longer Gaussian.