where $\mathbf{m}_0$ and $\mathbf{S}_0$ are fixed hyperparameters. The posterior distribution over $\mathbf{w}$ is given by

$$
p(\mathbf{w}|\mathbf{t}) \propto p(\mathbf{w})\, p(\mathbf{t}|\mathbf{w})
\tag{4.141}
$$

where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$. Taking the log of both sides, and substituting for the prior distribution using (4.140), and for the likelihood function using (4.89), we obtain

$$
\ln p(\mathbf{w}|\mathbf{t}) = -\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^{\mathrm{T}} \mathbf{S}_0^{-1} (\mathbf{w}-\mathbf{m}_0)
+ \sum_{n=1}^{N} \left\{ t_n \ln y_n + (1-t_n) \ln(1-y_n) \right\} + \mathrm{const}
\tag{4.142}
$$

where $y_n = \sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n)$. To obtain a Gaussian approximation to the posterior distribution, we first maximize the posterior distribution to give the MAP (maximum posterior) solution $\mathbf{w}_{\mathrm{MAP}}$, which defines the mean of the Gaussian. The covariance is then given by the inverse of the matrix of second derivatives of the negative log posterior, which takes the form

$$
\mathbf{S}_N^{-1} = -\nabla\nabla \ln p(\mathbf{w}|\mathbf{t}) = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n (1-y_n)\, \boldsymbol{\phi}_n \boldsymbol{\phi}_n^{\mathrm{T}}.
\tag{4.143}
$$

The Gaussian approximation to the posterior distribution therefore takes the form

$$
q(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{w}_{\mathrm{MAP}}, \mathbf{S}_N).
\tag{4.144}
$$

Having obtained a Gaussian approximation to the posterior distribution, there
remains the task of marginalizing with respect to this distribution in order to make
predictions.
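As a concrete illustration (not from the text), the following Python/NumPy sketch carries out the Laplace approximation just described: Newton-Raphson iterations, analogous to the IRLS updates of Section 4.3.3 but including the prior term, locate $\mathbf{w}_{\mathrm{MAP}}$, and the Hessian of the negative log posterior evaluated there gives $\mathbf{S}_N^{-1}$ as in (4.143). The names `sigmoid`, `laplace_approximation`, `Phi`, `t`, `m0`, and `S0` are illustrative choices, not taken from the book.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_approximation(Phi, t, m0, S0, n_iter=100, tol=1e-8):
    """Gaussian (Laplace) approximation q(w) = N(w | w_MAP, S_N), cf. (4.144).

    Phi : (N, M) design matrix whose rows are the feature vectors phi_n
    t   : (N,)  binary targets in {0, 1}
    m0  : (M,)  prior mean
    S0  : (M, M) prior covariance
    """
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()                        # start Newton iterations at the prior mean
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        # Gradient of the negative log posterior (4.142)
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)
        # Hessian of the negative log posterior, cf. (4.143)
        R = y * (1.0 - y)
        H = S0_inv + (Phi * R[:, None]).T @ Phi
        step = np.linalg.solve(H, grad)
        w -= step                        # Newton-Raphson update
        if np.linalg.norm(step) < tol:
            break
    w_map = w
    y = sigmoid(Phi @ w_map)
    R = y * (1.0 - y)
    S_N = np.linalg.inv(S0_inv + (Phi * R[:, None]).T @ Phi)  # covariance from (4.143)
    return w_map, S_N
```

Because the log posterior (4.142) is concave in $\mathbf{w}$, the Newton iterations converge to the unique mode.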

4.5.2 Predictive distribution


The predictive distribution for class $\mathcal{C}_1$, given a new feature vector $\boldsymbol{\phi}(\mathbf{x})$, is obtained by marginalizing with respect to the posterior distribution $p(\mathbf{w}|\mathbf{t})$, which is itself approximated by a Gaussian distribution $q(\mathbf{w})$, so that

$$
p(\mathcal{C}_1|\boldsymbol{\phi},\mathbf{t}) = \int p(\mathcal{C}_1|\boldsymbol{\phi},\mathbf{w})\, p(\mathbf{w}|\mathbf{t})\, \mathrm{d}\mathbf{w} \simeq \int \sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi})\, q(\mathbf{w})\, \mathrm{d}\mathbf{w}
\tag{4.145}
$$

with the corresponding probability for class $\mathcal{C}_2$ given by $p(\mathcal{C}_2|\boldsymbol{\phi},\mathbf{t}) = 1 - p(\mathcal{C}_1|\boldsymbol{\phi},\mathbf{t})$.
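Before turning to the analytic reduction that follows, note that the integral (4.145) can always be approximated by straightforward Monte Carlo averaging over samples drawn from $q(\mathbf{w})$. A minimal sketch, continuing the NumPy example above (the function name and defaults are illustrative):

```python
def predictive_mc(phi, w_map, S_N, n_samples=10000, rng=None):
    """Monte Carlo estimate of (4.145): the average of sigma(w^T phi) under q(w)."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.multivariate_normal(w_map, S_N, size=n_samples)  # samples from q(w)
    return sigmoid(W @ phi).mean()       # average the sigmoid over the samples
```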
To evaluate the predictive distribution, we first note that the function $\sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi})$ depends on $\mathbf{w}$ only through its projection onto $\boldsymbol{\phi}$. Denoting $a = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}$, we have

$$
\sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}) = \int \delta(a - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi})\, \sigma(a)\, \mathrm{d}a
\tag{4.146}
$$

where $\delta(\cdot)$ is the Dirac delta function. From this we obtain

$$
\int \sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi})\, q(\mathbf{w})\, \mathrm{d}\mathbf{w} = \int \sigma(a)\, p(a)\, \mathrm{d}a
\tag{4.147}
$$

where $p(a) = \int \delta(a - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi})\, q(\mathbf{w})\, \mathrm{d}\mathbf{w}$ is the density of $a$ induced by $q(\mathbf{w})$.
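Since $q(\mathbf{w})$ is Gaussian and $a = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}$ is linear in $\mathbf{w}$, the density $p(a)$ is itself Gaussian, with mean $\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\boldsymbol{\phi}$ and variance $\boldsymbol{\phi}^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\phi}$, so (4.147) reduces the prediction to a one-dimensional integral. A sketch that evaluates it by simple quadrature, continuing the example above (again, names and defaults are illustrative):

```python
def predictive_1d(phi, w_map, S_N, n_grid=2000, n_std=8.0):
    """Evaluate (4.147) by quadrature over the one-dimensional Gaussian p(a)."""
    mu_a = w_map @ phi                   # mean of a = w^T phi under q(w)
    var_a = phi @ S_N @ phi              # variance of a under q(w)
    sd_a = np.sqrt(var_a)
    a = np.linspace(mu_a - n_std * sd_a, mu_a + n_std * sd_a, n_grid)
    p_a = np.exp(-0.5 * (a - mu_a) ** 2 / var_a) / np.sqrt(2 * np.pi * var_a)
    return np.trapz(sigmoid(a) * p_a, a)  # integral of sigma(a) p(a) da
```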