Thus y and η must be related, and we denote this relation through η = ψ(y).
Following Nelder and Wedderburn (1972), we define a generalized linear model to be one for which y is a nonlinear function of a linear combination of the input (or feature) variables so that

y = f(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi})    (4.120)

where f(·) is known as the activation function in the machine learning literature, and f^{-1}(·) is known as the link function in statistics.
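As a familiar illustration, for binary targets modelled by a Bernoulli distribution, as in the logistic regression model considered earlier in this chapter, the activation function can be taken to be the logistic sigmoid, and the corresponding link function is then the logit,

f(a) = \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad f^{-1}(y) = \ln\!\left(\frac{y}{1 - y}\right).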
Now consider the log likelihood function for this model, which, as a function of η, is given by

\ln p(\mathbf{t}|\eta, s) = \sum_{n=1}^{N} \ln p(t_n|\eta, s) = \sum_{n=1}^{N} \left\{ \ln g(\eta_n) + \frac{\eta_n t_n}{s} \right\} + \mathrm{const}    (4.121)
where we are assuming that all observations share a common scale parameter (which corresponds to the noise variance for a Gaussian distribution, for instance) and so s is independent of n. The derivative of the log likelihood with respect to the model parameters w is then given by
\nabla_{\mathbf{w}} \ln p(\mathbf{t}|\eta, s) = \sum_{n=1}^{N} \left\{ \frac{\mathrm{d}}{\mathrm{d}\eta_n} \ln g(\eta_n) + \frac{t_n}{s} \right\} \frac{\mathrm{d}\eta_n}{\mathrm{d}y_n} \frac{\mathrm{d}y_n}{\mathrm{d}a_n} \nabla a_n
    = \sum_{n=1}^{N} \frac{1}{s} \{ t_n - y_n \} \psi'(y_n) f'(a_n) \boldsymbol{\phi}_n    (4.122)
where a_n = w^T φ_n, and we have used y_n = f(a_n) together with the result (4.119) for E[t|η]. We now see that there is a considerable simplification if we choose a particular form for the link function f^{-1}(y) given by

f^{-1}(y) = \psi(y)    (4.123)

which gives f(ψ(y)) = y and hence f'(ψ)ψ'(y) = 1. Also, because a = f^{-1}(y), we have a = ψ(y) and hence f'(a)ψ'(y) = 1. In this case, the gradient of the error function reduces to
\nabla E(\mathbf{w}) = \frac{1}{s} \sum_{n=1}^{N} \{ y_n - t_n \} \boldsymbol{\phi}_n.    (4.124)
For the Gaussian s = β^{-1}, whereas for the logistic model s = 1.
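As a minimal numerical sketch, assuming synthetic data, a logistic model with s = 1, and hypothetical variable names, the simplified error gradient (4.124) can be checked against the general form (4.122). For the logistic case ψ'(y) = 1/(y(1−y)) and f'(a) = y(1−y), so their product is one, which is what produces the cancellation.

```python
import numpy as np

# Hypothetical synthetic data: N feature vectors phi_n (rows of Phi) and binary targets t_n.
rng = np.random.default_rng(0)
N, D = 100, 3
Phi = rng.normal(size=(N, D))                 # design matrix
w = rng.normal(size=D)                        # model parameters
t = rng.integers(0, 2, size=N).astype(float)  # binary targets

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = Phi @ w          # a_n = w^T phi_n
y = sigmoid(a)       # y_n = f(a_n), logistic activation (canonical link)

# Simplified gradient of the error function, Eq. (4.124), with s = 1.
grad_simple = Phi.T @ (y - t)

# General form, Eq. (4.122), gives the gradient of the log likelihood; negate it
# to obtain the error gradient. For the logistic case psi'(y) = 1/(y(1-y)) and
# f'(a) = y(1-y), so psi'(y) * f'(a) = 1.
psi_prime = 1.0 / (y * (1.0 - y))
f_prime = y * (1.0 - y)
grad_general = -(Phi.T @ ((t - y) * psi_prime * f_prime))

assert np.allclose(grad_simple, grad_general)
```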
4.4 The Laplace Approximation
In Section 4.5 we shall discuss the Bayesian treatment of logistic regression. As
we shall see, this is more complex than the Bayesian treatment of linear regression
models, discussed in Sections 3.3 and 3.5. In particular, we cannot integrate exactly