Bandit Algorithms

34.2 Bayesian learning and the posterior distribution 397

effort developing the necessary tools in Chapter 2, it would seem a waste not
to use them now. And second, the subtle issues that arise highlight some real
consequences of the differences between the Bayesian and frequentist viewpoints.
As we shall see, there is a real gap between these viewpoints.
Let Θ be a set called the hypothesis space and let G be a σ-algebra on Θ.
While Θ is often a subset of a Euclidean space, we do not make this assumption.
A prior is a probability measure Q on (Θ, G). Next let (U, H) be a measurable
space and P = (Pθ : θ ∈ Θ) be a probability kernel from (Θ, G) to (U, H). We
call P the model. Let Ω = Θ × U and F = G ⊗ H. The prior and the model
combine to yield a probability P = Q ⊗ P on (Ω, F). The prior is now the marginal
distribution of the joint probability measure: Q(A) = P(A × U). Suppose we
observe the realization of a random element X defined on Ω; then the posterior
should somehow be the marginal of the joint probability measure conditioned
on the observation. To make this more precise, let (X, J) be a measurable space
and X : Ω → X an F/J-measurable map. The posterior having observed that
X = x should be a measure Q(· | x) on (Θ, G).
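To make the abstract construction concrete, here is a minimal sampling sketch. The specific choices Θ = R, Q = N(0, 1), Pθ = N(θ, 1) and X((φ, u)) = u are assumptions for illustration (they match the Gaussian example used later in this section). Drawing from the joint measure P = Q ⊗ P means first drawing θ from the prior and then drawing the observation from the kernel evaluated at θ.

```python
import random

# Illustrative assumption: Theta = R, prior Q = N(0, 1),
# model P_theta = N(theta, 1), observation X((theta, u)) = u.

def sample_joint(rng):
    """Draw one realization (theta, u) from the joint measure Q (x) P:
    first theta ~ Q, then u ~ P_theta (the kernel at theta)."""
    theta = rng.gauss(0.0, 1.0)   # theta ~ Q
    u = rng.gauss(theta, 1.0)     # u ~ P_theta
    return theta, u

rng = random.Random(0)
samples = [sample_joint(rng) for _ in range(100_000)]

# The marginal of theta under the joint measure is the prior Q = N(0, 1):
mean_theta = sum(t for t, _ in samples) / len(samples)
# The observation X has marginal variance Var(theta) + Var(noise) = 2:
var_x = sum(u * u for _, u in samples) / len(samples)
print(mean_theta, var_x)
```

The two printed statistics should be close to 0 and 2 respectively, reflecting that the prior is the θ-marginal of the joint measure and that the observation mixes prior and noise variance.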


We abuse notation by letting θ : Ω → Θ denote the F/G-measurable random
element given by the projection θ((φ, u)) = φ. This allows θ to be used in
the probability expressions below.

Without much thought we might try to apply Bayes' law (Eq. (2.2)) to claim
that the posterior distribution having observed X(ω) = x should be a measure
on (Θ, G) given by

    Q(A | x) = P(θ ∈ A | X = x) = P(X = x | θ ∈ A) P(θ ∈ A) / P(X = x) .    (34.1)
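When Θ is finite and every value of X has positive probability, (34.1) is unproblematic. A small sketch of that benign case, using a made-up two-hypothesis coin model (the hypothesis names and biases are illustrative assumptions, not from the text):

```python
from fractions import Fraction

# Hypothetical finite model: two coins with heads-probability 1/4 and 3/4,
# uniform prior, and a single observed flip. Every event has positive
# probability, so Bayes' law (34.1) applies directly.
prior = {"light": Fraction(1, 2), "heavy": Fraction(1, 2)}
heads_prob = {"light": Fraction(1, 4), "heavy": Fraction(3, 4)}

def posterior(x_is_heads):
    """Q(theta | x) via (34.1): P(X = x | theta) Q(theta) / P(X = x)."""
    def lik(theta):
        p = heads_prob[theta]
        return p if x_is_heads else 1 - p
    evidence = sum(lik(t) * prior[t] for t in prior)  # P(X = x) > 0 here
    return {t: lik(t) * prior[t] / evidence for t in prior}

print(posterior(True))  # observing heads shifts mass toward the heavy coin
```

Exact rational arithmetic makes it easy to check that the posterior is a genuine probability distribution; the failure mode discussed next only appears once {X = x} can have probability zero.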


The problem with the ‘definition’ in (34.1) is that the event {X = x} can have
probability zero, in which case P(θ ∈ A | X = x) is not defined. This is not an
esoteric problem. Consider the case where θ is randomly chosen from Θ = R
with distribution Q = N(0, 1) and is observed in Gaussian noise with variance
one: U = R, Pθ = N(θ, 1) for all θ ∈ R and X(φ, u) = u for all (φ, u) ∈ Θ × U.
Even in this very simple example we have P(X = x) = 0 for all x ∈ R. Having read
Chapter 2, the next attempt might be to define Q(A | x) as a σ(X)-measurable
random variable defined using conditional expectations: for A ∈ G,


Q(A | x) = E[I{θ ∈ A} | X](x),

where we remind the reader that E[I{θ ∈ A} | X] is a σ(X)-measurable random
variable that is uniquely defined only up to a set of measure zero, and that
the notation on the right-hand side is explained in Fig. 2.4 in Chapter 2. For
most applications of probability theory, the choice of conditional expectation
does not matter. However, as we shortly illustrate with an example, this is not
true here. A related annoying issue is that Q(· | x) as defined above need not be
