10.6. Variational Logistic Regression
we reproduce here for convenience
$$
\sigma(z) \;\geq\; \sigma(\xi)\exp\left\{ (z-\xi)/2 - \lambda(\xi)\left(z^{2}-\xi^{2}\right) \right\}
\tag{10.149}
$$
where
$$
\lambda(\xi) = \frac{1}{2\xi}\left[ \sigma(\xi) - \frac{1}{2} \right].
\tag{10.150}
$$
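As a quick numerical check of (10.149) and (10.150), the following short Python sketch (the helper names are ours, not from the text) confirms that the right-hand side never exceeds $\sigma(z)$ and that the bound is tight at $z = \pm\xi$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    # lambda(xi) from (10.150); assumes xi != 0
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def sigma_lower_bound(z, xi):
    # Right-hand side of the bound (10.149)
    return sigmoid(xi) * np.exp((z - xi) / 2.0 - lam(xi) * (z**2 - xi**2))

z = np.linspace(-6.0, 6.0, 1001)
xi = 2.5
assert np.all(sigma_lower_bound(z, xi) <= sigmoid(z) + 1e-12)
print(sigma_lower_bound(xi, xi), sigmoid(xi))  # equal: the bound is tight at z = +/- xi
```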
We can therefore write
$$
p(t|\mathbf{w}) = e^{at}\sigma(-a) \;\geq\; e^{at}\sigma(\xi)\exp\left\{ -(a+\xi)/2 - \lambda(\xi)\left(a^{2}-\xi^{2}\right) \right\}.
\tag{10.151}
$$
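For completeness, the equality on the left of (10.151) follows from the Bernoulli form of the likelihood together with the identity $\sigma(a) = e^{a}\sigma(-a)$, so that

$$
p(t|\mathbf{w}) = \sigma(a)^{t}\{1-\sigma(a)\}^{1-t} = \left\{e^{a}\sigma(-a)\right\}^{t}\sigma(-a)^{1-t} = e^{at}\sigma(-a).
$$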
Note that because this bound is applied to each of the terms in the likelihood function
separately, there is a variational parameter $\xi_n$ corresponding to each training set
observation $(\boldsymbol{\phi}_n, t_n)$. Using $a = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}$, and multiplying by the prior distribution, we
obtain the following bound on the joint distribution of $\mathbf{t}$ and $\mathbf{w}$

$$
p(\mathbf{t},\mathbf{w}) = p(\mathbf{t}|\mathbf{w})p(\mathbf{w}) \;\geq\; h(\mathbf{w},\boldsymbol{\xi})p(\mathbf{w})
\tag{10.152}
$$

where $\boldsymbol{\xi}$ denotes the set $\{\xi_n\}$ of variational parameters, and
$$
h(\mathbf{w},\boldsymbol{\xi}) = \prod_{n=1}^{N} \sigma(\xi_n)\exp\left\{ \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n t_n - (\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n + \xi_n)/2 - \lambda(\xi_n)\left([\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n]^{2} - \xi_n^{2}\right) \right\}.
\tag{10.153}
$$
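To make the structure of (10.153) concrete, here is a minimal NumPy sketch (variable names and layout are illustrative, not from the text) that evaluates $\ln h(\mathbf{w},\boldsymbol{\xi})$ given a design matrix whose rows are $\boldsymbol{\phi}_n^{\mathrm{T}}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    # lambda(xi) from (10.150); assumes xi != 0
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def log_h(w, xi, Phi, t):
    """ln h(w, xi) from (10.153).

    w   : (M,)   weight vector
    xi  : (N,)   one variational parameter per observation
    Phi : (N, M) design matrix with rows phi_n^T
    t   : (N,)   binary targets in {0, 1}
    """
    a = Phi @ w  # a_n = w^T phi_n
    return np.sum(np.log(sigmoid(xi)) + a * t - (a + xi) / 2.0
                  - lam(xi) * (a**2 - xi**2))
```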
Evaluation of the exact posterior distribution would require normalization of the left-
hand side of this inequality. Because this is intractable, we work instead with the
right-hand side. Note that the function on the right-hand side cannot be interpreted
as a probability density because it is not normalized. Once it is normalized to give a
variational posterior distribution $q(\mathbf{w})$, however, it no longer represents a bound.
Because the logarithm function is monotonically increasing, the inequality $A \geq B$
implies $\ln A \geq \ln B$. This gives a lower bound on the log of the joint distribution
of $\mathbf{t}$ and $\mathbf{w}$ of the form
$$
\ln\{p(\mathbf{t}|\mathbf{w})p(\mathbf{w})\} \;\geq\; \ln p(\mathbf{w}) + \sum_{n=1}^{N}\left\{ \ln\sigma(\xi_n) + \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n t_n - (\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n + \xi_n)/2 - \lambda(\xi_n)\left([\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n]^{2} - \xi_n^{2}\right) \right\}.
\tag{10.154}
$$
Substituting for the prior $p(\mathbf{w})$, the right-hand side of this inequality becomes, as a
function of $\mathbf{w}$,
$$
-\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^{\mathrm{T}}\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0) + \sum_{n=1}^{N}\left\{ \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n(t_n - 1/2) - \lambda(\xi_n)\,\mathbf{w}^{\mathrm{T}}(\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\mathrm{T}})\mathbf{w} \right\} + \text{const}.
\tag{10.155}
$$
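Because (10.155) is quadratic in $\mathbf{w}$, the normalized variational posterior $q(\mathbf{w})$ is Gaussian, and its mean and covariance follow by completing the square. The sketch below is a direct transcription of the quadratic and linear terms of (10.155) (the function name and array layout are our own, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    # lambda(xi) from (10.150); assumes xi != 0
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def q_posterior(m0, S0, Phi, t, xi):
    """Gaussian q(w) = N(w | mN, SN) from completing the square in (10.155).

    m0, S0 : prior mean (M,) and covariance (M, M)
    Phi    : (N, M) design matrix with rows phi_n^T
    t      : (N,)   binary targets in {0, 1}
    xi     : (N,)   variational parameters
    """
    S0_inv = np.linalg.inv(S0)
    # Quadratic terms: -(1/2) w^T [S0^{-1} + 2 sum_n lambda(xi_n) phi_n phi_n^T] w
    SN_inv = S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi
    SN = np.linalg.inv(SN_inv)
    # Linear terms: w^T [S0^{-1} m0 + sum_n (t_n - 1/2) phi_n]
    mN = SN @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
    return mN, SN
```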