10.6. Variational Logistic Regression
we reproduce here for convenience
$$
\sigma(z) \;\geq\; \sigma(\xi)\exp\left\{ (z-\xi)/2 - \lambda(\xi)\left(z^{2}-\xi^{2}\right) \right\}
\tag{10.149}
$$
where
$$
\lambda(\xi) = \frac{1}{2\xi}\left[ \sigma(\xi) - \frac{1}{2} \right].
\tag{10.150}
$$
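As a quick numerical check of (10.149) and (10.150), the following short Python sketch (the helper names are ours, not from the text) confirms that the right-hand side never exceeds $\sigma(z)$ and that the bound is tight at $z = \pm\xi$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    # lambda(xi) from (10.150); assumes xi != 0
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def sigma_lower_bound(z, xi):
    # Right-hand side of the bound (10.149)
    return sigmoid(xi) * np.exp((z - xi) / 2.0 - lam(xi) * (z**2 - xi**2))

z = np.linspace(-6.0, 6.0, 1001)
xi = 2.5
assert np.all(sigma_lower_bound(z, xi) <= sigmoid(z) + 1e-12)
print(sigma_lower_bound(xi, xi), sigmoid(xi))  # equal: the bound is tight at z = +/- xi
```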
We can therefore write
$$
p(t|\mathbf{w}) = e^{at}\sigma(-a) \;\geq\; e^{at}\sigma(\xi)\exp\left\{ -(a+\xi)/2 - \lambda(\xi)\left(a^{2}-\xi^{2}\right) \right\}.
\tag{10.151}
$$
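For completeness, the equality on the left of (10.151) follows from the Bernoulli form of the likelihood together with the identity $\sigma(a) = e^{a}\sigma(-a)$, so that

$$
p(t|\mathbf{w}) = \sigma(a)^{t}\{1-\sigma(a)\}^{1-t} = \left\{e^{a}\sigma(-a)\right\}^{t}\sigma(-a)^{1-t} = e^{at}\sigma(-a).
$$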
Note that because this bound is applied to each of the terms in the likelihood function
separately, there is a variational parameter $\xi_n$ corresponding to each training set
observation $(\boldsymbol{\phi}_n, t_n)$. Using $a = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}$, and multiplying by the prior distribution, we
obtain the following bound on the joint distribution of $\mathbf{t}$ and $\mathbf{w}$

$$
p(\mathbf{t},\mathbf{w}) = p(\mathbf{t}|\mathbf{w})p(\mathbf{w}) \;\geq\; h(\mathbf{w},\boldsymbol{\xi})p(\mathbf{w})
\tag{10.152}
$$

where $\boldsymbol{\xi}$ denotes the set $\{\xi_n\}$ of variational parameters, and
$$
h(\mathbf{w},\boldsymbol{\xi}) = \prod_{n=1}^{N} \sigma(\xi_n)\exp\left\{ \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n t_n - (\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n + \xi_n)/2 - \lambda(\xi_n)\left([\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n]^{2} - \xi_n^{2}\right) \right\}.
\tag{10.153}
$$
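To make the structure of (10.153) concrete, here is a minimal NumPy sketch (variable names and layout are illustrative, not from the text) that evaluates $\ln h(\mathbf{w},\boldsymbol{\xi})$ given a design matrix whose rows are $\boldsymbol{\phi}_n^{\mathrm{T}}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    # lambda(xi) from (10.150); assumes xi != 0
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def log_h(w, xi, Phi, t):
    """ln h(w, xi) from (10.153).

    w   : (M,)   weight vector
    xi  : (N,)   one variational parameter per observation
    Phi : (N, M) design matrix with rows phi_n^T
    t   : (N,)   binary targets in {0, 1}
    """
    a = Phi @ w  # a_n = w^T phi_n
    return np.sum(np.log(sigmoid(xi)) + a * t - (a + xi) / 2.0
                  - lam(xi) * (a**2 - xi**2))
```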
Evaluation of the exact posterior distribution would require normalization of the left-
hand side of this inequality. Because this is intractable, we work instead with the
right-hand side. Note that the function on the right-hand side cannot be interpreted
as a probability density because it is not normalized. Once it is normalized to give a
variational posterior distribution $q(\mathbf{w})$, however, it no longer represents a bound.
Because the logarithm function is monotonically increasing, the inequality $A \geq B$
implies $\ln A \geq \ln B$. This gives a lower bound on the log of the joint distribution
of $\mathbf{t}$ and $\mathbf{w}$ of the form
$$
\ln\{p(\mathbf{t}|\mathbf{w})p(\mathbf{w})\} \;\geq\; \ln p(\mathbf{w}) + \sum_{n=1}^{N}\left\{ \ln\sigma(\xi_n) + \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n t_n - (\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n + \xi_n)/2 - \lambda(\xi_n)\left([\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n]^{2} - \xi_n^{2}\right) \right\}.
\tag{10.154}
$$
Substituting for the prior $p(\mathbf{w})$, the right-hand side of this inequality becomes, as a
function of $\mathbf{w}$,
$$
-\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^{\mathrm{T}}\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0) + \sum_{n=1}^{N}\left\{ \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n(t_n - 1/2) - \lambda(\xi_n)\,\mathbf{w}^{\mathrm{T}}(\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\mathrm{T}})\mathbf{w} \right\} + \text{const}.
\tag{10.155}
$$
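Because (10.155) is quadratic in $\mathbf{w}$, the normalized variational posterior $q(\mathbf{w})$ is Gaussian, and its mean and covariance follow by completing the square. The sketch below is a direct transcription of the quadratic and linear terms of (10.155) (the function name and array layout are our own, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    # lambda(xi) from (10.150); assumes xi != 0
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def q_posterior(m0, S0, Phi, t, xi):
    """Gaussian q(w) = N(w | mN, SN) from completing the square in (10.155).

    m0, S0 : prior mean (M,) and covariance (M, M)
    Phi    : (N, M) design matrix with rows phi_n^T
    t      : (N,)   binary targets in {0, 1}
    xi     : (N,)   variational parameters
    """
    S0_inv = np.linalg.inv(S0)
    # Quadratic terms: -(1/2) w^T [S0^{-1} + 2 sum_n lambda(xi_n) phi_n phi_n^T] w
    SN_inv = S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi
    SN = np.linalg.inv(SN_inv)
    # Linear terms: w^T [S0^{-1} m0 + sum_n (t_n - 1/2) phi_n]
    mN = SN @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
    return mN, SN
```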