Thus y and η must be related, and we denote this relation through η = ψ(y).
Following Nelder and Wedderburn (1972), we define a generalized linear model to be one for which y is a nonlinear function of a linear combination of the input (or feature) variables so that

y = f(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi})    (4.120)

where f(·) is known as the activation function in the machine learning literature, and f^{-1}(·) is known as the link function in statistics.
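As a familiar illustration, for binary targets modelled by a Bernoulli distribution, as in the logistic regression model considered earlier in this chapter, the activation function can be taken to be the logistic sigmoid, and the corresponding link function is then the logit,

f(a) = \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad f^{-1}(y) = \ln\!\left(\frac{y}{1 - y}\right).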
Now consider the log likelihood function for this model, which, as a function of η, is given by

\ln p(\mathbf{t}|\eta, s) = \sum_{n=1}^{N} \ln p(t_n|\eta, s) = \sum_{n=1}^{N} \left\{ \ln g(\eta_n) + \frac{\eta_n t_n}{s} \right\} + \mathrm{const}    (4.121)
where we are assuming that all observations share a common scale parameter (which corresponds to the noise variance for a Gaussian distribution, for instance) and so s is independent of n. The derivative of the log likelihood with respect to the model parameters w is then given by
\nabla_{\mathbf{w}} \ln p(\mathbf{t}|\eta, s) = \sum_{n=1}^{N} \left\{ \frac{\mathrm{d}}{\mathrm{d}\eta_n} \ln g(\eta_n) + \frac{t_n}{s} \right\} \frac{\mathrm{d}\eta_n}{\mathrm{d}y_n} \frac{\mathrm{d}y_n}{\mathrm{d}a_n} \nabla a_n
    = \sum_{n=1}^{N} \frac{1}{s} \{ t_n - y_n \} \psi'(y_n) f'(a_n) \boldsymbol{\phi}_n    (4.122)
where a_n = w^T φ_n, and we have used y_n = f(a_n) together with the result (4.119) for E[t|η]. We now see that there is a considerable simplification if we choose a particular form for the link function f^{-1}(y) given by

f^{-1}(y) = \psi(y)    (4.123)

which gives f(ψ(y)) = y and hence f'(ψ)ψ'(y) = 1. Also, because a = f^{-1}(y), we have a = ψ(y) and hence f'(a)ψ'(y) = 1. In this case, the gradient of the error function reduces to
\nabla E(\mathbf{w}) = \frac{1}{s} \sum_{n=1}^{N} \{ y_n - t_n \} \boldsymbol{\phi}_n.    (4.124)
For the Gaussian s = β^{-1}, whereas for the logistic model s = 1.
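As a minimal numerical sketch, assuming synthetic data, a logistic model with s = 1, and hypothetical variable names, the simplified error gradient (4.124) can be checked against the general form (4.122). For the logistic case ψ'(y) = 1/(y(1−y)) and f'(a) = y(1−y), so their product is one, which is what produces the cancellation.

```python
import numpy as np

# Hypothetical synthetic data: N feature vectors phi_n (rows of Phi) and binary targets t_n.
rng = np.random.default_rng(0)
N, D = 100, 3
Phi = rng.normal(size=(N, D))                 # design matrix
w = rng.normal(size=D)                        # model parameters
t = rng.integers(0, 2, size=N).astype(float)  # binary targets

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = Phi @ w          # a_n = w^T phi_n
y = sigmoid(a)       # y_n = f(a_n), logistic activation (canonical link)

# Simplified gradient of the error function, Eq. (4.124), with s = 1.
grad_simple = Phi.T @ (y - t)

# General form, Eq. (4.122), gives the gradient of the log likelihood; negate it
# to obtain the error gradient. For the logistic case psi'(y) = 1/(y(1-y)) and
# f'(a) = y(1-y), so psi'(y) * f'(a) = 1.
psi_prime = 1.0 / (y * (1.0 - y))
f_prime = y * (1.0 - y)
grad_general = -(Phi.T @ ((t - y) * psi_prime * f_prime))

assert np.allclose(grad_simple, grad_general)
```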
4.4 The Laplace Approximation
In Section 4.5 we shall discuss the Bayesian treatment of logistic regression. As
we shall see, this is more complex than the Bayesian treatment of linear regression
models, discussed in Sections 3.3 and 3.5. In particular, we cannot integrate exactly