[Figure: loss curves over the environments E = [0, 1] for four policies; the curves are labelled π1 (admissible), π2 (dominated), π3 (minimax optimal) and π4 (admissible); horizontal axis: Environments, vertical axis: Loss.]
Figure 34.1 Loss as a function of the environment for four different policies π1, ..., π4 when E = [0, 1]. Which policy would you choose?
Bayesian optimal policy with respect to $q$ is an element of
\[
\operatorname*{argmin}_{\pi} \sum_{\nu \in \mathcal{E}} q(\nu)\, \ell(\nu, \pi)\,.
\]
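To make the definition concrete, the following sketch (in Python, with an invented loss table; the names losses, prior and the particular numbers are hypothetical and only for illustration) computes the Bayesian optimal policy by minimising the prior-weighted loss over a finite set of environments and policies, and, for comparison, the minimax optimal policy.

    # Hypothetical example: 3 environments, 4 policies.
    # losses[i][j] is the loss of policy j in environment i (invented numbers).
    losses = [
        [0.2, 0.5, 0.4, 0.3],   # environment 1
        [0.6, 0.5, 0.4, 0.7],   # environment 2
        [0.3, 0.5, 0.4, 0.2],   # environment 3
    ]
    prior = [0.5, 0.3, 0.2]     # q(nu) for each environment

    n_policies = len(losses[0])

    # Bayesian optimal policy: minimise the expected loss under the prior q.
    expected = [sum(q * row[j] for q, row in zip(prior, losses))
                for j in range(n_policies)]
    bayes_opt = min(range(n_policies), key=lambda j: expected[j])

    # Minimax optimal policy: minimise the worst-case loss over environments.
    worst = [max(row[j] for row in losses) for j in range(n_policies)]
    minimax_opt = min(range(n_policies), key=lambda j: worst[j])

    print("Bayes-optimal policy:", bayes_opt, "expected losses:", expected)
    print("Minimax policy:", minimax_opt, "worst-case losses:", worst)

With these (made-up) numbers the Bayes-optimal and minimax policies differ, echoing the point of Figure 34.1: which policy is 'best' depends on whether one weights environments by a prior or guards against the worst case.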
The Bayesian viewpoint is hard to criticize when the user really does know the
underlying likelihood of each environment and the user is risk-neutral. Even
when the distribution is not known exactly, however, sensible priors often yield
provably sensible outcomes, regardless of whether one is interested in the average
loss across the environments, or the worst-case loss, or some other metric.
A distinction is often made between the Bayesian and frequentist viewpoints,
which naturally leads to heated discussions on the merits of one viewpoint
relative to another. This debate does not interest us greatly. We prefer to
think about the pros and cons of problem definitions and solution methods,
regardless of the label on them. Bayesian approaches to bandits have their
strengths and weaknesses and we hope to do them a modicum of justice
here.
34.2 Bayesian learning and the posterior distribution
The last section explained the ‘forward view’ where a policy is chosen in advance
that minimizes the expected loss. The Bayesian can also act sequentially by
updating their beliefs (the prior) as data is observed to obtain a new distribution
on the set of environments (more generally, the set of hypotheses). The new