where $\mathbf{m}_0$ and $\mathbf{S}_0$ are fixed hyperparameters. The posterior distribution over $\mathbf{w}$ is given by

$$
p(\mathbf{w}|\mathbf{t}) \propto p(\mathbf{w})\, p(\mathbf{t}|\mathbf{w})
\tag{4.141}
$$

where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$. Taking the log of both sides, and substituting for the prior distribution using (4.140), and for the likelihood function using (4.89), we obtain

$$
\ln p(\mathbf{w}|\mathbf{t}) = -\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^{\mathrm{T}} \mathbf{S}_0^{-1} (\mathbf{w}-\mathbf{m}_0)
+ \sum_{n=1}^{N} \left\{ t_n \ln y_n + (1-t_n) \ln(1-y_n) \right\} + \mathrm{const}
\tag{4.142}
$$

where $y_n = \sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n)$. To obtain a Gaussian approximation to the posterior distribution, we first maximize the posterior distribution to give the MAP (maximum posterior) solution $\mathbf{w}_{\mathrm{MAP}}$, which defines the mean of the Gaussian. The covariance is then given by the inverse of the matrix of second derivatives of the negative log posterior, which takes the form

$$
\mathbf{S}_N^{-1} = -\nabla\nabla \ln p(\mathbf{w}|\mathbf{t}) = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n (1-y_n)\, \boldsymbol{\phi}_n \boldsymbol{\phi}_n^{\mathrm{T}}.
\tag{4.143}
$$

The Gaussian approximation to the posterior distribution therefore takes the form

$$
q(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{w}_{\mathrm{MAP}}, \mathbf{S}_N).
\tag{4.144}
$$

Having obtained a Gaussian approximation to the posterior distribution, there
remains the task of marginalizing with respect to this distribution in order to make
predictions.
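As a concrete illustration (not from the text), the following Python/NumPy sketch carries out the Laplace approximation just described: Newton-Raphson iterations, analogous to the IRLS updates of Section 4.3.3 but including the prior term, locate $\mathbf{w}_{\mathrm{MAP}}$, and the Hessian of the negative log posterior evaluated there gives $\mathbf{S}_N^{-1}$ as in (4.143). The names `sigmoid`, `laplace_approximation`, `Phi`, `t`, `m0`, and `S0` are illustrative choices, not taken from the book.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_approximation(Phi, t, m0, S0, n_iter=100, tol=1e-8):
    """Gaussian (Laplace) approximation q(w) = N(w | w_MAP, S_N), cf. (4.144).

    Phi : (N, M) design matrix whose rows are the feature vectors phi_n
    t   : (N,)  binary targets in {0, 1}
    m0  : (M,)  prior mean
    S0  : (M, M) prior covariance
    """
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()                        # start Newton iterations at the prior mean
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        # Gradient of the negative log posterior (4.142)
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)
        # Hessian of the negative log posterior, cf. (4.143)
        R = y * (1.0 - y)
        H = S0_inv + (Phi * R[:, None]).T @ Phi
        step = np.linalg.solve(H, grad)
        w -= step                        # Newton-Raphson update
        if np.linalg.norm(step) < tol:
            break
    w_map = w
    y = sigmoid(Phi @ w_map)
    R = y * (1.0 - y)
    S_N = np.linalg.inv(S0_inv + (Phi * R[:, None]).T @ Phi)  # covariance from (4.143)
    return w_map, S_N
```

Because the log posterior (4.142) is concave in $\mathbf{w}$, the Newton iterations converge to the unique mode.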

4.5.2 Predictive distribution


The predictive distribution for class $\mathcal{C}_1$, given a new feature vector $\boldsymbol{\phi}(\mathbf{x})$, is obtained by marginalizing with respect to the posterior distribution $p(\mathbf{w}|\mathbf{t})$, which is itself approximated by a Gaussian distribution $q(\mathbf{w})$, so that

$$
p(\mathcal{C}_1|\boldsymbol{\phi},\mathbf{t}) = \int p(\mathcal{C}_1|\boldsymbol{\phi},\mathbf{w})\, p(\mathbf{w}|\mathbf{t})\, \mathrm{d}\mathbf{w} \simeq \int \sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi})\, q(\mathbf{w})\, \mathrm{d}\mathbf{w}
\tag{4.145}
$$

with the corresponding probability for class $\mathcal{C}_2$ given by $p(\mathcal{C}_2|\boldsymbol{\phi},\mathbf{t}) = 1 - p(\mathcal{C}_1|\boldsymbol{\phi},\mathbf{t})$.
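Before turning to the analytic reduction that follows, note that the integral (4.145) can always be approximated by straightforward Monte Carlo averaging over samples drawn from $q(\mathbf{w})$. A minimal sketch, continuing the NumPy example above (the function name and defaults are illustrative):

```python
def predictive_mc(phi, w_map, S_N, n_samples=10000, rng=None):
    """Monte Carlo estimate of (4.145): the average of sigma(w^T phi) under q(w)."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.multivariate_normal(w_map, S_N, size=n_samples)  # samples from q(w)
    return sigmoid(W @ phi).mean()       # average the sigmoid over the samples
```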
To evaluate the predictive distribution, we first note that the function $\sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi})$ depends on $\mathbf{w}$ only through its projection onto $\boldsymbol{\phi}$. Denoting $a = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}$, we have

$$
\sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}) = \int \delta(a - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi})\, \sigma(a)\, \mathrm{d}a
\tag{4.146}
$$

where $\delta(\cdot)$ is the Dirac delta function. From this we obtain

$$
\int \sigma(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi})\, q(\mathbf{w})\, \mathrm{d}\mathbf{w} = \int \sigma(a)\, p(a)\, \mathrm{d}a
\tag{4.147}
$$

where $p(a) = \int \delta(a - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi})\, q(\mathbf{w})\, \mathrm{d}\mathbf{w}$ is the density of $a$ induced by $q(\mathbf{w})$.
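Since $q(\mathbf{w})$ is Gaussian and $a = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}$ is linear in $\mathbf{w}$, the density $p(a)$ is itself Gaussian, with mean $\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\boldsymbol{\phi}$ and variance $\boldsymbol{\phi}^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\phi}$, so (4.147) reduces the prediction to a one-dimensional integral. A sketch that evaluates it by simple quadrature, continuing the example above (again, names and defaults are illustrative):

```python
def predictive_1d(phi, w_map, S_N, n_grid=2000, n_std=8.0):
    """Evaluate (4.147) by quadrature over the one-dimensional Gaussian p(a)."""
    mu_a = w_map @ phi                   # mean of a = w^T phi under q(w)
    var_a = phi @ S_N @ phi              # variance of a under q(w)
    sd_a = np.sqrt(var_a)
    a = np.linspace(mu_a - n_std * sd_a, mu_a + n_std * sd_a, n_grid)
    p_a = np.exp(-0.5 * (a - mu_a) ** 2 / var_a) / np.sqrt(2 * np.pi * var_a)
    return np.trapz(sigmoid(a) * p_a, a)  # integral of sigma(a) p(a) da
```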