Bandit Algorithms

Suppose you want to use a bandit algorithm to decide which ads to display
on your website. Can you reasonably do this with one of the models/algorithms
proposed so far? A simple idea is to treat each available ad as an arm and for each
arriving user choose the ad using your favourite bandit algorithm. The reward is
one if they click on the ad and zero otherwise (a concrete sketch of this naive
approach follows the list below). No doubt you immediately realize that this is
not exactly a perfect fit. In fact, there are many problems:


  • (Delayed rewards) You don’t actually observe a user not clicking an ad. Perhaps
    you assign zero reward if the click does not happen soon enough, but this
    introduces delays and means the statistics cannot be updated between plays.

  • (Nonstationarity) Unless the time period is very short, there is significant
    nonstationarity in the preferences of users. None of the algorithms discussed
    so far behaves well in this situation.

  • (Ignoring information) If the number of available ads is large (it usually is),
    then treating each action independently is probably not a good idea. Users
    interested in ads for Mercedes might also be interested in other car ads. In
    many cases there is also information available about the user, such as prior
    purchasing decisions, and this too should be taken into account when deciding
    which ad to show each user.

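To make the naive approach concrete, here is a minimal sketch that treats each
ad as an arm of a stochastic bandit and runs UCB1 on simulated click feedback.
The ad set, the click probabilities and the choice of UCB1 are illustrative
assumptions, not prescriptions from the text.

    import math
    import random

    def ucb1(n_rounds, click_probs):
        """Treat each ad as an arm; reward is one on a (simulated) click."""
        k = len(click_probs)
        counts = [0] * k      # number of times each ad was shown
        clicks = [0.0] * k    # total clicks received by each ad
        for t in range(1, n_rounds + 1):
            if t <= k:
                a = t - 1     # show each ad once to initialise the statistics
            else:
                # UCB1 index: empirical click rate plus a confidence bonus.
                a = max(range(k), key=lambda i: clicks[i] / counts[i]
                        + math.sqrt(2 * math.log(t) / counts[i]))
            reward = 1.0 if random.random() < click_probs[a] else 0.0
            counts[a] += 1
            clicks[a] += reward
        return counts

    # Hypothetical click-through rates for three ads.
    print(ucb1(10_000, [0.02, 0.05, 0.03]))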

The challenge of delayed rewards and nonstationarity will be discussed elsewhere.
The purpose of this part is to focus on the situation where (a) the algorithm
has access to contextual information at the beginning of each round and (b) the
outcome of playing one arm may yield information about other arms. For example,
in the display ad problem the contextual information might consist of summary
statistics of the current user and the ads in the action set might be classified
into different categories. Exactly how to encode this additional structure is a
non-trivial problem. As usual, there is a trade-off between using models rich
enough to describe (almost) any world and using simple models for which
generalization is easy, but which impose more severe assumptions on the world.

Except for the first chapter, which is generic, the focus of this part will be on
the special case where the expected reward of each arm is a linear function of
some feature vector, in a way that will be made precise in Chapter 19. Along the
way we will discuss many generalizations and give references to the literature.
One aspect that will play a far larger role in this part is computation. While
finite-armed bandits with few arms present few challenges in this respect, when
the number of actions is very large or the information structure of the feedback
model is not so easily separable, computation can be a serious challenge.
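
As a preview, one common way to formalise the linear model (the notation here is
ours; the precise assumptions appear in Chapter 19) is to map every context/action
pair to a feature vector φ(c, a) ∈ R^d and assume the expected reward is linear
in an unknown parameter θ ∈ R^d:

    E[X_t | C_t = c, A_t = a] = ⟨θ, φ(c, a)⟩.

Under this assumption, the reward observed for one action carries information
about every action whose features overlap with it, which is exactly the kind of
information sharing the display ad example calls for.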
