34.3 Conjugate pairs, conjugate priors and the exponential family

no amount of data can change their belief. On the other hand, as $\sigma_P^2$ tends to infinity we see that the mean of the posterior has no dependence on the prior mean, which means that all prior knowledge is washed away with just one sample. You should think about what happens when $\sigma_S^2 \to \{0, \infty\}$.

Notice how the model has fixed $\sigma_S^2$, suggesting that the model variance is known. The Bayesian can also incorporate their uncertainty over the variance. In this case the model parameters are $\Theta = \mathbb{R} \times [0, \infty)$ and $P_\theta = \mathcal{N}(\theta_1, \theta_2)$. But is there a conjugate prior in this case? Already things are getting complicated, so we will simply let you know that the family of Gaussian-inverse-gamma distributions is conjugate.
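To make the limiting behaviour concrete, here is a minimal Python sketch of the Gaussian conjugate update for a single observation, assuming the standard posterior formulas for a $\mathcal{N}(\mu_P, \sigma_P^2)$ prior on the mean and a $\mathcal{N}(\theta, \sigma_S^2)$ model with known variance. The symbol $\mu_P$ and the numerical values are ours, introduced purely for illustration.

```python
# Sketch of the Gaussian conjugate update for a single observation x,
# assuming a N(mu_p, sp2) prior on the unknown mean and a N(theta, ss2)
# model with known variance ss2 (notation mu_p is ours).

def gaussian_posterior(mu_p, sp2, x, ss2):
    """Posterior mean and variance of theta after observing X = x."""
    post_var = 1.0 / (1.0 / sp2 + 1.0 / ss2)
    post_mean = post_var * (mu_p / sp2 + x / ss2)
    return post_mean, post_var

# As sp2 -> infinity the posterior mean approaches x (prior washed away);
# as ss2 -> infinity the posterior barely moves from the prior;
# as ss2 -> 0 the posterior concentrates on the observation x.
print(gaussian_posterior(0.0, 1e12, 3.0, 1.0))   # ~ (3.0, 1.0)
print(gaussian_posterior(0.0, 1.0, 3.0, 1e12))   # ~ (0.0, 1.0)
print(gaussian_posterior(0.0, 1.0, 3.0, 1e-12))  # ~ (3.0, 0.0)
```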


Bernoulli model/beta prior
Suppose that $\Theta = [0, 1]$ and $P_\theta = \mathcal{B}(\theta)$ is Bernoulli with parameter $\theta$. In this case it turns out that the family of beta distributions is conjugate, which for parameters $(\alpha, \beta) \in (0, \infty)^2$ is given in terms of its probability density function with respect to the Lebesgue measure:
$$
p_{\alpha,\beta}(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}\,, \qquad (34.3)
$$
where $\Gamma(x)$ is the Gamma function. Then the posterior having observed $X = x \in \{0, 1\}$ is also a beta distribution with parameters $(\alpha + x, \beta + 1 - x)$.
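The update rule just stated is simple enough to express as a short Python sketch; the uniform prior and the observation sequence below are hypothetical choices of ours, used only to show the bookkeeping.

```python
# Minimal sketch of the Beta-Bernoulli conjugate update: starting from a
# Beta(alpha, beta) prior, each observation x in {0, 1} maps the
# parameters to (alpha + x, beta + 1 - x), as stated above.

def beta_bernoulli_update(alpha, beta, x):
    """Posterior Beta parameters after observing X = x in {0, 1}."""
    return alpha + x, beta + 1 - x

alpha, beta = 1.0, 1.0            # uniform prior on [0, 1] (hypothetical)
for x in [1, 1, 0, 1]:            # hypothetical observations
    alpha, beta = beta_bernoulli_update(alpha, beta, x)
print(alpha, beta)                # (4.0, 2.0): three successes, one failure
print(alpha / (alpha + beta))     # posterior mean of theta, here 2/3
```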


Here and in what follows, in line with the literature, we sweep under the rug that this posterior is just one of many choices. This is done to simplify the language, which is justified by the fact that all posteriors must agree almost everywhere, and thus the slight imprecision will hopefully not lead to confusion.

Unlike in the Gaussian case, the posterior for the Bernoulli model and beta prior
is unique (Exercise 34.2).

Exponential families
Both the Gaussian and Bernoulli families are examples of a more general family.
Let $h$ be a measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and $T, \eta : \mathbb{R} \to \mathbb{R}$ be two 'suitable' functions, where $T$ is called the sufficient statistic. Together, $h$, $\eta$ and $T$ define a measure $P_\theta$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ for each $\theta \in \Theta \subset \mathbb{R}$ in terms of its density with respect to $h$:
$$
\frac{dP_\theta}{dh}(x) = \exp\left(\eta(\theta) T(x) - A(\theta)\right),
$$
where $A(\theta) = \log \int_{\mathbb{R}} \exp(\eta(\theta) T(x))\, dh(x)$ is the log-partition function and $\Theta = \operatorname{dom}(A) = \{\theta : A(\theta) < \infty\}$ is the domain of $A$. Integrating the density shows that for any $B \in \mathcal{B}(\mathbb{R})$ and $\theta \in \Theta$,
$$
P_\theta(B) = \int_B \frac{dP_\theta}{dh}(x)\, dh(x) = \int_B \exp\left(\eta(\theta) T(x) - A(\theta)\right) dh(x).
$$
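As a concrete check (our own illustration, using only the definitions above), the Bernoulli family from earlier in this section fits this form: take $h$ to be the counting measure on $\{0, 1\}$, $T(x) = x$ and $\eta(\theta) = \log(\theta / (1 - \theta))$ for $\theta \in (0, 1)$. Then
$$
A(\theta) = \log \int_{\mathbb{R}} \exp(\eta(\theta) T(x))\, dh(x) = \log\left(1 + \frac{\theta}{1 - \theta}\right) = -\log(1 - \theta),
$$
and hence
$$
\frac{dP_\theta}{dh}(x) = \exp\left(x \log\frac{\theta}{1 - \theta} + \log(1 - \theta)\right) = \theta^x (1 - \theta)^{1 - x},
$$
which is exactly the Bernoulli probability mass function.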