10.6. Variational Logistic Regression

We make use of the variational lower bound on the logistic sigmoid function, which we reproduce here for convenience:


\[
  \sigma(z) \ge \sigma(\xi)\exp\bigl\{(z-\xi)/2 - \lambda(\xi)(z^2 - \xi^2)\bigr\}
  \tag{10.149}
\]

where


\[
  \lambda(\xi) = \frac{1}{2\xi}\left[\sigma(\xi) - \frac{1}{2}\right].
  \tag{10.150}
\]
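As an illustrative check (a minimal NumPy sketch, with ξ = 2.5 chosen arbitrarily), we can confirm numerically that the right-hand side of (10.149) never exceeds σ(z) and that the bound is tight at z = ±ξ:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    # lambda(xi) = [sigma(xi) - 1/2] / (2 xi), equation (10.150)
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def lower_bound(z, xi):
    # right-hand side of (10.149)
    return sigmoid(xi) * np.exp((z - xi) / 2.0 - lam(xi) * (z**2 - xi**2))

z = np.linspace(-6.0, 6.0, 601)
xi = 2.5                                               # arbitrary positive variational parameter
assert np.all(lower_bound(z, xi) <= sigmoid(z) + 1e-12)
assert np.isclose(lower_bound(xi, xi), sigmoid(xi))    # equality at z = +xi
assert np.isclose(lower_bound(-xi, xi), sigmoid(-xi))  # and at z = -xi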


Each term in the likelihood function has the form p(t|w) = σ(a)^t {1 − σ(a)}^{1−t} = e^{at} σ(−a), where a = w^Tφ. Applying the bound (10.149) to σ(−a), we can therefore write


\[
  p(t|w) = e^{at}\,\sigma(-a) \ge e^{at}\,\sigma(\xi)
  \exp\bigl\{-(a+\xi)/2 - \lambda(\xi)(a^2 - \xi^2)\bigr\}.
  \tag{10.151}
\]
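To see that (10.151) behaves as expected, a short sketch (again illustrative only, with a and ξ chosen arbitrarily) checks both the identity σ(a)^t {1 − σ(a)}^{1−t} = e^{at} σ(−a) and the bound for the two possible target values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    return (sigmoid(xi) - 0.5) / (2.0 * xi)    # equation (10.150)

a, xi = 1.3, 0.7                               # arbitrary activation and variational parameter
for t in (0, 1):
    bernoulli = sigmoid(a)**t * (1.0 - sigmoid(a))**(1 - t)
    compact = np.exp(a * t) * sigmoid(-a)      # e^{at} sigma(-a)
    bound = np.exp(a * t) * sigmoid(xi) * np.exp(-(a + xi) / 2.0
                                                 - lam(xi) * (a**2 - xi**2))
    assert np.isclose(bernoulli, compact)      # the two likelihood forms agree
    assert bound <= compact + 1e-12            # and (10.151) holds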


Note that because this bound is applied to each of the terms in the likelihood function
separately, there is a variational parameter ξ_n corresponding to each training set
observation (φ_n, t_n). Using a = w^Tφ_n in each such term, and multiplying by the prior distribution, we
obtain the following bound on the joint distribution of t and w


\[
  p(t, w) = p(t|w)\,p(w) \ge h(w, \xi)\,p(w)
  \tag{10.152}
\]

where ξ denotes the set {ξ_n} of variational parameters, and


\[
  h(w, \xi) = \prod_{n=1}^{N} \sigma(\xi_n)
  \exp\bigl\{ w^{\mathrm{T}}\phi_n t_n - (w^{\mathrm{T}}\phi_n + \xi_n)/2
  - \lambda(\xi_n)\bigl([w^{\mathrm{T}}\phi_n]^2 - \xi_n^2\bigr)\bigr\}.
  \tag{10.153}
\]


Evaluation of the exact posterior distribution would require normalization of the left-
hand side of this inequality. Because this is intractable, we work instead with the
right-hand side. Note that the function on the right-hand side cannot be interpreted
as a probability density because it is not normalized. Once it is normalized to give a
variational posterior distribution q(w), however, it no longer represents a bound.
Because the logarithm function is monotonically increasing, the inequality A ≥ B
implies ln A ≥ ln B. This gives a lower bound on the log of the joint distribution
of t and w of the form


\[
  \ln\{p(t|w)\,p(w)\} \ge \ln p(w) + \sum_{n=1}^{N}
  \Bigl\{ \ln\sigma(\xi_n) + w^{\mathrm{T}}\phi_n t_n
  - (w^{\mathrm{T}}\phi_n + \xi_n)/2
  - \lambda(\xi_n)\bigl([w^{\mathrm{T}}\phi_n]^2 - \xi_n^2\bigr) \Bigr\}.
  \tag{10.154}
\]
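The right-hand side of (10.154) is straightforward to evaluate for a whole data set. As a rough sanity check (a sketch on synthetic data, with arbitrary values for w and the ξ_n), the bound can be compared against the exact log-likelihood; the prior term ln p(w) is common to both sides and is therefore omitted here:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    return (sigmoid(xi) - 0.5) / (2.0 * xi)    # equation (10.150)

N, M = 50, 3                                   # synthetic problem size (arbitrary)
Phi = rng.normal(size=(N, M))                  # rows are the phi_n
t = rng.integers(0, 2, size=N)                 # binary targets t_n
w = rng.normal(size=M)                         # any fixed weight vector
xi = np.abs(rng.normal(size=N)) + 0.1          # one positive xi_n per observation

a = Phi @ w                                    # a_n = w^T phi_n

# Exact log-likelihood: sum_n ln p(t_n | w)
exact = np.sum(t * np.log(sigmoid(a)) + (1 - t) * np.log(sigmoid(-a)))

# Sum over n on the right-hand side of (10.154), i.e. ln h(w, xi)
bound = np.sum(np.log(sigmoid(xi)) + a * t - (a + xi) / 2.0
               - lam(xi) * (a**2 - xi**2))

assert bound <= exact + 1e-10                  # the lower bound holds for any xi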


Substituting for the prior p(w), the right-hand side of (10.154) becomes, as a
function of w



\[
  -\frac{1}{2}(w - m_0)^{\mathrm{T}} S_0^{-1} (w - m_0)
  + \sum_{n=1}^{N} \Bigl\{ w^{\mathrm{T}}\phi_n (t_n - 1/2)
  - \lambda(\xi_n)\, w^{\mathrm{T}}(\phi_n\phi_n^{\mathrm{T}})\, w \Bigr\}
  + \mathrm{const}.
  \tag{10.155}
\]
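Note that (10.155) is a quadratic function of w. As a final sketch (on arbitrary synthetic data, with a unit Gaussian prior chosen purely for illustration), we can confirm numerically that it differs from the right-hand side of (10.154) only by a term independent of w:

import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    return (sigmoid(xi) - 0.5) / (2.0 * xi)    # equation (10.150)

N, M = 20, 3                                   # synthetic problem size (arbitrary)
Phi = rng.normal(size=(N, M))
t = rng.integers(0, 2, size=N)
xi = np.abs(rng.normal(size=N)) + 0.1
m0 = np.zeros(M)                               # prior mean (arbitrary choice)
S0 = np.eye(M)                                 # prior covariance (arbitrary choice)
S0_inv = np.linalg.inv(S0)

def rhs_154(w):
    # right-hand side of (10.154), with ln p(w) the Gaussian log density
    a = Phi @ w
    log_prior = (-0.5 * (w - m0) @ S0_inv @ (w - m0)
                 - 0.5 * np.log(np.linalg.det(2.0 * np.pi * S0)))
    return log_prior + np.sum(np.log(sigmoid(xi)) + a * t - (a + xi) / 2.0
                              - lam(xi) * (a**2 - xi**2))

def rhs_155(w):
    # expression (10.155), without the additive constant
    a = Phi @ w
    return (-0.5 * (w - m0) @ S0_inv @ (w - m0)
            + np.sum(a * (t - 0.5) - lam(xi) * a**2))

# The difference between the two expressions should not depend on w.
w1, w2 = rng.normal(size=M), rng.normal(size=M)
assert np.isclose(rhs_154(w1) - rhs_155(w1), rhs_154(w2) - rhs_155(w2))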