
valuable because they teach us important lessons about equivalent models. For
now, however, we move on.

4.2 The learning objective


The learner’s goal is to maximize the total reward $S_n = \sum_{t=1}^n X_t$, which is a
random quantity that depends on the actions of the learner and the rewards
sampled by the environment. This is not an optimization problem for three
reasons:

1 What is the value of $n$ for which we are maximizing? Occasionally prior
knowledge of the horizon is reasonable, but very often the learner does not
know ahead of time how many rounds are to be played.
2 The cumulative reward $S_n$ is a random quantity. Even if the reward
distributions were known, we would still need a measure of utility on
distributions of $S_n$; the simulation sketch after this list illustrates the point.
3 The learner does not know the distributions that govern the rewards for each
arm.
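
To make the second point concrete, here is a minimal Python sketch (a hypothetical
illustration; the arm means, the horizon and the simple follow-the-leader policy are
all assumptions, not taken from the text) that simulates the distribution of $S_n$ on
a two-armed Bernoulli bandit:

    import numpy as np

    rng = np.random.default_rng(0)

    def run_once(means, n):
        """Play each arm once, then always play the arm with the best empirical mean."""
        k = len(means)
        counts, sums, total = np.zeros(k), np.zeros(k), 0.0
        for t in range(n):
            arm = t if t < k else int(np.argmax(sums / counts))
            reward = rng.binomial(1, means[arm])  # Bernoulli reward
            counts[arm] += 1
            sums[arm] += reward
            total += reward
        return total

    means, n = [0.5, 0.6], 100          # hypothetical arm means and horizon
    samples = [run_once(means, n) for _ in range(2000)]
    print("mean of S_n:", np.mean(samples), "std of S_n:", np.std(samples))

Even with the policy and the environment fixed, $S_n$ fluctuates considerably from
run to run, which is why a utility on distributions of $S_n$ is needed before
maximization is well defined.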

Of these points, the last is fundamental to the bandit problem and is discussed
in the next section. The lack of knowledge of the horizon is usually not a serious
issue. Generally speaking it is possible to first design a policy assuming the
horizon is known and then adapt it to account for the unknown horizon while
proving that the loss in performance is minimal. This is almost always quite easy
and there exist generic approaches for making the conversion.
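
One standard generic conversion, not spelled out in this passage, is the so-called
doubling trick: restart a horizon-tuned policy in phases whose lengths double, using
the phase length as the assumed horizon. The sketch below is a hypothetical Python
illustration; the policy interface (select_arm/update), the RandomPolicy placeholder
and the Bernoulli environment are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(2)

    class RandomPolicy:
        """Placeholder for a horizon-tuned policy; here it just picks arms uniformly."""
        def __init__(self, horizon, num_arms=2):
            self.num_arms = num_arms
        def select_arm(self):
            return int(rng.integers(self.num_arms))
        def update(self, arm, reward):
            pass  # a real policy would update its statistics here

    def doubling_trick(make_policy, play_round, total_rounds):
        """Run a horizon-dependent policy without knowing the horizon in advance.

        The policy is restarted at the start of each phase, with the phase length
        (1, 2, 4, 8, ...) used as the assumed horizon for that phase.
        """
        t, phase, total_reward = 0, 0, 0.0
        while t < total_rounds:
            n_phase = 2 ** phase
            policy = make_policy(n_phase)
            for _ in range(min(n_phase, total_rounds - t)):
                arm = policy.select_arm()
                reward = play_round(arm)
                policy.update(arm, reward)
                total_reward += reward
                t += 1
            phase += 1
        return total_reward

    # Hypothetical two-armed Bernoulli environment with means 0.5 and 0.6.
    means = [0.5, 0.6]
    play_round = lambda arm: float(rng.binomial(1, means[arm]))
    print(doubling_trick(RandomPolicy, play_round, total_rounds=1000))

Under typical conditions the regret only grows by a constant factor across the
restarts, which is the sense in which the loss in performance is small.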
Assigning a utility to distributions of $S_n$ is more challenging. Suppose
that $S_n$ is the revenue of your company. Fig. 4.1 shows the distribution of
$S_n$ for two different learners, call them A and B. Suppose you can choose
between learners A and B. Which one would you choose? One choice is to
go with the learner whose reward distribution has the larger expected value.
Figure 4.1 Alternative revenue distributions

This will be our default choice for stochastic bandits, but it bears remembering
that there are other considerations, including the variance or tail behavior of
the cumulative reward, which we will discuss occasionally. In particular, in the
situation shown in Fig. 4.1, learner B achieves a higher expected total reward
than A. However, B has a reasonable probability of earning less than the least
amount that A can earn, so a risk-sensitive user may prefer learner A.
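
As a hypothetical illustration of this trade-off, the following Python sketch
simulates two revenue distributions with the same qualitative shape as in Fig. 4.1
(the particular numbers are invented for the example): B is better in expectation,
yet falls below the least A can earn roughly ten percent of the time.

    import numpy as np

    rng = np.random.default_rng(1)
    m = 100_000                                   # number of simulated runs

    revenue_A = rng.uniform(80, 120, size=m)      # learner A: narrow, predictable support
    good_runs = rng.random(m) < 0.9
    revenue_B = np.where(good_runs,
                         rng.normal(140, 10, size=m),   # learner B: usually better...
                         rng.normal(50, 10, size=m))    # ...but occasionally much worse

    print("E[S_n] for A:", revenue_A.mean())      # about 100
    print("E[S_n] for B:", revenue_B.mean())      # about 131
    print("P(B below A's minimum of 80):", (revenue_B < 80).mean())  # about 0.1

The same simulation makes it easy to compare other utilities, such as a lower
quantile of $S_n$, rather than the mean.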