
{X ∈ C} has measure zero and there is little cause to worry about events that
happen with probability zero. But for a frequentist using Bayesian techniques
for inference this actually matters. If θ is not sampled from Q, then nothing
prevents the situation that θ ∈ C, and the nonuniqueness of the posterior is an
issue (Exercise 34.10). Probability theory does not provide a way around this
issue.

One should take care to specify the version of the posterior used when
applying Bayesian techniques for inference in a frequentist setting. This is
important because in the frequentist viewpoint θ is not part of the probability
space and results are proven for P_θ for arbitrary fixed θ ∈ Θ. By contrast,
the all-in Bayesians include θ in the probability space and thus do not worry
about events with negligible prior probability; for them, any version of
the posterior will do.

34.3 Conjugate pairs, conjugate priors and the exponential family


One of the strengths of the Bayesian approach is the ability to explicitly specify
and incorporate prior beliefs into the uncertainty models in a natural way via
the prior. When it comes to Bayesian algorithms, this advantage is belied a little
by the competing necessity of choosing a prior for which the posterior can be
efficiently computed, or sampled from. The ease of computing (or sampling from) the
posterior depends on the interplay between the prior and the model. Given the
importance of computation, it is hardly surprising that researchers have worked
hard to find models and priors that behave well together. A prior and model are
called a conjugate pair if the posterior has the same parametric form as the
prior. In this case, the prior is called a conjugate prior to the model.

Gaussian model/Gaussian prior
Suppose that (Θ, G) = (Ω, F) = (R, B(R)), X : Ω → Ω is the identity, and
P_θ is Gaussian with mean θ and known signal variance σ_S². If the prior Q is
Gaussian with mean μ_P and prior variance σ_P², then the posterior distribution
having observed X = x can be chosen to be

\[
Q(\cdot \mid x) = \mathcal{N}\left( \frac{\mu_P/\sigma_P^2 + x/\sigma_S^2}{1/\sigma_P^2 + 1/\sigma_S^2},\; \left( \frac{1}{\sigma_S^2} + \frac{1}{\sigma_P^2} \right)^{-1} \right).
\]
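The update is just a precision-weighted average of the prior mean and the observation, which is easy to check numerically. Below is a minimal sketch in Python; the function and variable names (posterior_params, mu_p, var_p, var_s) are illustrative choices, not from the text.

# A minimal sketch of the Gaussian model / Gaussian prior conjugate
# update above. All names here are illustrative, not from the book.

def posterior_params(mu_p: float, var_p: float, x: float, var_s: float):
    """Posterior mean and variance for a N(theta, var_s) model with a
    N(mu_p, var_p) prior, after observing X = x."""
    precision = 1.0 / var_p + 1.0 / var_s          # posterior precision
    mean = (mu_p / var_p + x / var_s) / precision  # precision-weighted average
    return mean, 1.0 / precision


# Example: a confident prior (small var_p) pulls the posterior mean
# strongly towards mu_p.
print(posterior_params(mu_p=0.0, var_p=0.1, x=1.0, var_s=1.0))
# -> (approximately 0.0909, 0.0909)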


The proof is left to the reader in Exercise 34.1. The limiting regimes as the
prior/signal variance tends to zero or infinity are quite illuminating. For example,
as σ_P² → 0 the posterior tends to a Gaussian N(μ_P, σ_P²), which is equal to the
prior and indicates that no learning occurs. This is consistent with intuition: if
the prior variance is zero, then the statistician is already certain of the mean and
no observation can change that belief.
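To see this limit concretely, one can reuse the posterior_params sketch above: shrinking var_p pins the posterior to the prior no matter what is observed.

# Reusing posterior_params from the sketch above: as the prior variance
# shrinks, the posterior ignores the observation x.
for var_p in (1.0, 1e-2, 1e-6):
    print(var_p, posterior_params(mu_p=0.0, var_p=var_p, x=10.0, var_s=1.0))
# The posterior mean tends to mu_p = 0 and the posterior variance to
# var_p, matching the claim that no learning occurs in this limit.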