Pattern Recognition and Machine Learning

2.4. The Exponential Family

which can in principle be solved to obtain η_ML. We see that the solution for the
maximum likelihood estimator depends on the data only through Σ_n u(x_n), which
is therefore called the sufficient statistic of the distribution (2.194). We do not need
to store the entire data set itself but only the value of the sufficient statistic. For
the Bernoulli distribution, for example, the function u(x) is given just by x, and
so we need only keep the sum of the data points {x_n}, whereas for the Gaussian
u(x) = (x, x²)^T, and so we should keep both the sum of {x_n} and the sum of {x_n²}.
If we consider the limit N → ∞, then the right-hand side of (2.228) becomes
E[u(x)], and so by comparing with (2.226) we see that in this limit η_ML will equal
the true value η.
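For reference, the two results being compared are

\[
-\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]
\tag{2.226}
\]

\[
-\nabla \ln g(\boldsymbol{\eta}_{\mathrm{ML}}) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)
\tag{2.228}
\]

so that equality of the right-hand sides in the large-N limit forces η_ML towards η.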
In fact, this sufficiency property holds also for Bayesian inference, although
we shall defer discussion of this until Chapter 8 when we have equipped ourselves
with the tools of graphical models and can thereby gain a deeper insight into these
important concepts.
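As a minimal sketch of this bookkeeping (our illustration, not the book's; the data and estimators are the standard ones), the following Python fragment accumulates only the sufficient statistics named above and recovers the maximum likelihood estimates from them:

import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: u(x) = x, so the running sum of the data is sufficient.
x_bern = rng.binomial(1, 0.3, size=1000)
suff_bern = x_bern.sum()              # sufficient statistic: sum of the x_n
mu_ml = suff_bern / len(x_bern)       # maximum likelihood estimate of the mean

# Gaussian: u(x) = (x, x^2)^T, so we keep the sums of x_n and of x_n^2.
x_gauss = rng.normal(2.0, 1.5, size=1000)
suff_gauss = np.array([x_gauss.sum(), (x_gauss ** 2).sum()])
N = len(x_gauss)
mean_ml = suff_gauss[0] / N                # ML mean
var_ml = suff_gauss[1] / N - mean_ml ** 2  # ML (biased) variance

print(mu_ml, mean_ml, var_ml)

Nothing beyond these fixed-size sums need be retained, no matter how large N grows.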


2.4.2 Conjugate priors


We have already encountered the concept of a conjugate prior several times, for
example in the context of the Bernoulli distribution (for which the conjugate prior
is the beta distribution) or the Gaussian (where the conjugate prior for the mean is
a Gaussian, and the conjugate prior for the precision is the Wishart distribution). In
general, for a given probability distribution p(x|η), we can seek a prior p(η) that is
conjugate to the likelihood function, so that the posterior distribution has the same
functional form as the prior. For any member of the exponential family (2.194), there
exists a conjugate prior that can be written in the form


\[
p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu} \exp\left\{ \nu \boldsymbol{\eta}^{\mathrm{T}} \boldsymbol{\chi} \right\}
\tag{2.229}
\]

where f(χ, ν) is a normalization coefficient, and g(η) is the same function as appears
in (2.194). To see that this is indeed conjugate, let us multiply the prior (2.229)
by the likelihood function (2.227) to obtain the posterior distribution, up to a
normalization coefficient, in the form


\[
p(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu + N} \exp\left\{ \boldsymbol{\eta}^{\mathrm{T}} \left( \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n) + \nu \boldsymbol{\chi} \right) \right\}
\tag{2.230}
\]


This again takes the same functional form as the prior (2.229), confirming conjugacy.
Furthermore, we see that the parameter ν can be interpreted as an effective number of
pseudo-observations in the prior, each of which has a value for the sufficient statistic
u(x) given by χ.
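The update rule implied by (2.229) and (2.230) can be stated in a few lines of code. The following Python sketch (our illustration, not the book's; the helper name conjugate_update is hypothetical, and the Beta reading in the comment follows from changing variables from η back to the Bernoulli mean μ) applies it to Bernoulli data:

import numpy as np

def conjugate_update(chi, nu, suff_sum, N):
    """Posterior hyperparameters for the conjugate prior (2.229).

    Matching (2.230) against the form of (2.229) gives nu -> nu + N and
    nu * chi -> nu * chi + sum_n u(x_n), so nu behaves as a count of
    pseudo-observations whose average sufficient statistic is chi."""
    nu_post = nu + N
    chi_post = (nu * chi + suff_sum) / nu_post
    return chi_post, nu_post

# Bernoulli illustration, where u(x) = x; over the mean parameter mu this
# prior corresponds to a Beta(nu * chi, nu * (1 - chi)) distribution.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=50)

chi0, nu0 = 0.5, 2.0    # prior: 2 pseudo-observations with statistic 0.5
chi_n, nu_n = conjugate_update(chi0, nu0, x.sum(), len(x))
print(chi_n, nu_n)      # posterior statistic is pulled toward the sample mean

Note how the posterior χ is a weighted average of the prior value and the sample mean, with the prior's weight fixed by its pseudo-observation count ν.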


2.4.3 Noninformative priors


In some applications of probabilistic inference, we may have prior knowledge
that can be conveniently expressed through the prior distribution. For example, if
the prior assigns zero probability to some value of a variable, then the posterior
distribution will necessarily also assign zero probability to that value, irrespective of
any subsequent observations of data.
