Bandit Algorithms

11.3 The Exp3 algorithm 147

In Exercise 11.4 we ask you to show that all unbiased estimators in this setting are importance-weighted estimators.

Although the two estimators seem quite similar, it should be noted that the first estimator takes values in [0,∞) while the second takes values in (−∞,1]. Soon we will see that this difference has a big impact on the usefulness of these estimators when used in the Exp3 algorithm.

11.3 The Exp3 algorithm

The simplest algorithm for adversarial bandits is called Exp3, which stands for “Exponential-weight algorithm forExploration andExploitation”. The reason for this name will become clear after the explanation of the algorithm. Let Sˆti=∑ts=1Xˆsibe the total estimated reward by the end of roundt, whereXˆsi is given in Eq. (11.6). It seems natural play actions with larger estimated reward with higher probability. While there are many ways to mapSˆtiinto probabilities, a simple and popular choice is calledexponential weighting, which for tuning parameterη >0 sets

Pti= exp(η Sˆt− 1 ,i) ∑k j=1exp(ηSˆt−^1 ,j)

. (11.7)

The parameterηis called thelearning rate. When the learning rate is largePt concentrates about the arm with the largest estimated reward and the resulting algorithm exploits aggressively. For small learning ratesPtis more uniform and the algorithm explores more frequently. Note that asPtconcentrates the variance of the importance-weighted estimators for poorly performing arms increasing dramatically. There are many ways to tune the learning rate, including allowing it to vary with time. In this chapter we restrict our attention to the simplest case by choosingηto depend only on the number of actionskand the horizonn. Since the algorithm depends onη, this means that the horizon must be known in advance, a requirement which can be relaxed (see Note 11).

11.4 Regret analysis

We are now ready to bound the expected regret of Exp3.

Theorem11.1.Letx∈[0,1]n×kandπbe the policy of Exp3 (Algorithm 9) with learning rateη=

√

log(k)/(nk). Then

Rn(π,x)≤ 2

√

nklog(k).

Bandit Algorithms

11.3 The Exp3 algorithm

. (11.7)

11.4 Regret analysis

√

√

Get our desktop app

Company

Features

Documentation

Resources