Pattern Recognition and Machine Learning

9.3. An Alternative View of EM

where the likelihood p(t|w, β) and the prior p(w|α) are given by (3.10) and (3.52),
respectively, and y(x, w) is given by (3.3). Taking the expectation with respect to
the posterior distribution of w then gives

\[
\mathbb{E}[\ln p(\mathbf{t},\mathbf{w}\,|\,\alpha,\beta)]
= \frac{M}{2}\ln\!\left(\frac{\alpha}{2\pi}\right)
- \frac{\alpha}{2}\,\mathbb{E}\!\left[\mathbf{w}^{\mathrm{T}}\mathbf{w}\right]
+ \frac{N}{2}\ln\!\left(\frac{\beta}{2\pi}\right)
- \frac{\beta}{2}\sum_{n=1}^{N}\mathbb{E}\!\left[(t_n-\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n)^2\right]
\tag{9.62}
\]


Setting the derivative with respect to α to zero, we obtain the M step re-estimation
equation (Exercise 9.20)


\[
\alpha = \frac{M}{\mathbb{E}\!\left[\mathbf{w}^{\mathrm{T}}\mathbf{w}\right]}
= \frac{M}{\mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N + \mathrm{Tr}(\mathbf{S}_N)}.
\tag{9.63}
\]

An analogous result holds for β (Exercise 9.21).
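The two M-step updates above can be sketched numerically. The following is a minimal illustration, not the book's code: `em_hyperparameters`, `Phi`, and `t` are names chosen here, and the β update uses the expected sum-of-squares E[(tₙ − wᵀφₙ)²] = ‖t − Φm_N‖² + Tr(Φ S_N Φᵀ) under the posterior.

```python
import numpy as np

def em_hyperparameters(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
    """EM re-estimation of alpha and beta for Bayesian linear regression.

    Illustrative sketch of (9.63) and its analogue for beta;
    Phi is the N x M design matrix, t the target vector.
    """
    N, M = Phi.shape
    for _ in range(n_iter):
        # E step: posterior over w is Gaussian N(w | m_N, S_N)
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        # M step for alpha: alpha = M / E[w^T w],
        # with E[w^T w] = m_N^T m_N + Tr(S_N)
        alpha = M / (m_N @ m_N + np.trace(S_N))
        # M step for beta: 1/beta = (1/N) * sum_n E[(t_n - w^T phi_n)^2]
        err = t - Phi @ m_N
        beta = N / (err @ err + np.trace(Phi @ S_N @ Phi.T))
    return alpha, beta, m_N, S_N
```

Each iteration requires inverting the M × M matrix defining S_N, consistent with the cost comparison made below.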
Note that this re-estimation equation takes a slightly different form from the
corresponding result (3.92) derived by direct evaluation of the evidence function.
However, they each involve computation and inversion (or eigen decomposition) of
an M × M matrix and hence will have comparable computational cost per iteration.
These two approaches to determining α should of course converge to the same
result (assuming they find the same local maximum of the evidence function). This
can be verified by first noting that the quantity γ is defined by


\[
\gamma = M - \alpha \sum_{i=1}^{M} \frac{1}{\lambda_i + \alpha}
= M - \alpha\,\mathrm{Tr}(\mathbf{S}_N).
\tag{9.64}
\]

At a stationary point of the evidence function, the re-estimation equation (3.92) will
be self-consistently satisfied, and hence we can substitute for γ to give

\[
\alpha\,\mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N = \gamma = M - \alpha\,\mathrm{Tr}(\mathbf{S}_N)
\tag{9.65}
\]

and solving for α we obtain (9.63), which is precisely the EM re-estimation equation.
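The second equality in (9.64) rests on the fact that the eigenvalues of αI + βΦᵀΦ are λᵢ + α, so that Tr(S_N) = Σᵢ 1/(λᵢ + α). This identity is easy to confirm numerically; the matrices below are arbitrary test values, not taken from the text.

```python
import numpy as np

# Numerical check of the trace identity behind (9.64):
# Tr(S_N) = sum_i 1/(lambda_i + alpha), where lambda_i are the
# eigenvalues of beta * Phi^T Phi. Test values are arbitrary.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((30, 4))
alpha, beta = 0.5, 2.0

S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)

assert np.isclose(np.trace(S_N), np.sum(1.0 / (lam + alpha)))
```

With this identity in hand, (9.65) is pure algebra: moving α Tr(S_N) to the left and dividing by m_Nᵀm_N + Tr(S_N) recovers (9.63).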
As a final example, we consider a closely related model, namely the relevance
vector machine for regression discussed in Section 7.2.1. There we used direct
maximization of the marginal likelihood to derive re-estimation equations for the
hyperparameters α and β. Here we consider an alternative approach in which we view the
weight vector w as a latent variable and apply the EM algorithm. The E step involves
finding the posterior distribution over the weights, and this is given by (7.81). In the
M step we maximize the expected complete-data log likelihood, which is defined by

\[
\mathbb{E}_{\mathbf{w}}\!\left[\ln p(\mathbf{t}\,|\,\mathbf{X},\mathbf{w},\beta)\,p(\mathbf{w}\,|\,\alpha)\right]
\tag{9.66}
\]

where the expectation is taken with respect to the posterior distribution computed
using the ‘old’ parameter values. To compute the new parameter values we maximize
with respect to α and β (Exercise 9.22) to give