valuable because they teach us important lessons about equivalent models. For
now, however, we move on.
4.2 The learning objective
The learner's goal is to maximize the total reward $S_n = \sum_{t=1}^n X_t$, which is a
random quantity that depends on the actions of the learner and the rewards
sampled by the environment. This is not an optimization problem for three
reasons:
1 What is the value of $n$ for which we are maximizing? Occasionally prior
knowledge of the horizon is reasonable, but very often the learner does not
know ahead of time how many rounds are to be played.
2 The cumulative reward is a random quantity. Even if the reward distributions
were known, we would still need a measure of utility on distributions of $S_n$
(illustrated by the sketch after this list).
3 The learner does not know the distributions that govern the rewards for each
arm.
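To make the randomness of $S_n$ concrete, the following sketch simulates a two-armed Bernoulli bandit under a simple explore-then-commit rule. The arm means, horizon, exploration length and the policy itself are all illustrative assumptions introduced here, not part of the formal setup; the point is only that rerunning the interaction with different seeds yields different values of $S_n$.

```python
import numpy as np

def simulate_S_n(n=1000, means=(0.5, 0.6), explore=50, seed=0):
    """Run one bandit interaction and return the total reward S_n.

    Two Bernoulli arms; play each arm `explore` times, then commit to
    the empirically better one. All parameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(2)
    sums = np.zeros(2)
    total = 0.0
    for t in range(n):
        if counts[0] < explore:
            a = 0
        elif counts[1] < explore:
            a = 1
        else:
            a = int(np.argmax(sums / counts))  # commit to the empirically best arm
        x = rng.binomial(1, means[a])          # reward X_t sampled by the environment
        counts[a] += 1
        sums[a] += x
        total += x
    return total                               # S_n = sum of the X_t

# Different seeds give different totals: S_n is a random quantity.
print([simulate_S_n(seed=s) for s in range(5)])
```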
Of these points, the last is fundamental to the bandit problem and is discussed
in the next section. The lack of knowledge of the horizon is usually not a serious
issue. Generally speaking, it is possible to first design a policy assuming the
horizon is known and then adapt it to account for the unknown horizon while
proving that the loss in performance is minimal. This is almost always quite easy
and there exist generic approaches for making the conversion.
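The text only alludes to such generic conversions, but one standard example is the so-called doubling trick: run the known-horizon policy on phases of geometrically growing length, restarting it at the beginning of each phase. The sketch below is a minimal illustration; `make_policy`, its `act`/`update` interface and `environment` are hypothetical names introduced here, not notation from the book.

```python
def run_with_doubling(make_policy, environment, total_rounds):
    """Doubling trick: turn a known-horizon policy into an anytime one.

    `make_policy(horizon)` is a hypothetical factory returning a policy with
    `act() -> arm` and `update(arm, reward)` methods; `environment(arm)`
    returns a sampled reward. Phase k runs a fresh policy for 2**k rounds.
    """
    t = 0
    total_reward = 0.0
    k = 0
    while t < total_rounds:
        phase_len = 2 ** k                  # horizon assumed within this phase
        policy = make_policy(phase_len)     # restart the fixed-horizon policy
        for _ in range(min(phase_len, total_rounds - t)):
            arm = policy.act()
            reward = environment(arm)
            policy.update(arm, reward)
            total_reward += reward
            t += 1
        k += 1
    return total_reward
```

For bounds that scale like $\sqrt{n}$, the geometric growth of the phase lengths means the restarts typically inflate the bound by only a constant factor, which is the sense in which the conversion is cheap.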
Assigning a utility to distributions of $S_n$ is more challenging. Suppose
that $S_n$ is the revenue of your company. Fig. 4.1 shows the distribution of
$S_n$ for two different learners, call them $A$ and $B$. Suppose you can choose
between learners $A$ and $B$. Which one would you choose? One choice is to
go with the learner whose reward distribution has the larger expected value.
This will be our default choice for stochastic bandits, but it bears remembering
that there are other considerations, including the variance or tail behavior of
the cumulative reward, which we will discuss occasionally. In particular, in the
situation shown in Fig. 4.1, learner $B$ achieves a higher expected total reward
than $A$. However, $B$ has a reasonable probability of earning less than the least
amount that $A$ can earn, so a risk-sensitive user may prefer learner $A$.

[Figure 4.1: Alternative revenue distributions]
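Since the figure itself is not reproduced here, the following sketch constructs two purely illustrative distributions of $S_n$ with the qualitative features just described: $B$ has the larger mean, but places non-trivial probability below the smallest value $A$ attains. The normal distributions, means and variances are made-up assumptions, not data from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 100_000

# Hypothetical revenue distributions for the two learners:
# A is concentrated, B has a larger mean but much heavier lower tail.
S_A = rng.normal(loc=100.0, scale=2.0, size=n_runs)
S_B = rng.normal(loc=110.0, scale=25.0, size=n_runs)

print("mean A:", S_A.mean(), "mean B:", S_B.mean())   # B wins in expectation
print("P(B < min A):", (S_B < S_A.min()).mean())      # but B can do far worse than A ever does
```

An expected-value maximizer picks $B$; the second printed quantity is one way of measuring the risk that makes $A$ attractive to a more cautious user.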