In Exercise 11.4 we ask you to show that all unbiased estimators in this setting
are importance-weighted estimators.
Although the two estimators seem quite similar, it should be noted that the
first estimator takes values in [0,∞) while the second takes values in (−∞,1].
Soon we will see that this difference has a big impact on the usefulness of
these estimators when used in the Exp3 algorithm.
11.3 The Exp3 algorithm
The simplest algorithm for adversarial bandits is called Exp3, which stands for
“Exponential-weight algorithm for Exploration and Exploitation”. The reason
for this name will become clear after the explanation of the algorithm. Let
$\hat S_{ti} = \sum_{s=1}^t \hat X_{si}$ be the total estimated reward by the end of round $t$, where $\hat X_{si}$
is given in Eq. (11.6). It seems natural to play actions with larger estimated reward
with higher probability. While there are many ways to map $\hat S_{ti}$ into probabilities,
a simple and popular choice is called exponential weighting, which for tuning
parameter $\eta > 0$ sets
$$P_{ti} = \frac{\exp\bigl(\eta \hat S_{t-1,i}\bigr)}{\sum_{j=1}^k \exp\bigl(\eta \hat S_{t-1,j}\bigr)}\,. \tag{11.7}$$
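To make Eq. (11.7) concrete, the following is a minimal sketch of how the sampling distribution might be computed; the function name `exp_weights` and the max-subtraction trick for numerical stability are our own additions, not part of the algorithm's definition.

```python
import numpy as np

def exp_weights(S_hat, eta):
    """Sampling distribution of Eq. (11.7): P_ti proportional to exp(eta * S_hat[i])."""
    # Subtracting the maximum cancels in the ratio of Eq. (11.7), so the
    # distribution is unchanged, but it prevents floating-point overflow
    # when eta * S_hat[i] is large.
    w = np.exp(eta * (S_hat - np.max(S_hat)))
    return w / np.sum(w)
```

For example, `exp_weights(np.array([3.0, 1.0, 0.0]), eta=1.0)` puts roughly 84% of the mass on the first arm.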
The parameter $\eta$ is called the learning rate. When the learning rate is large, $P_t$
concentrates about the arm with the largest estimated reward and the resulting
algorithm exploits aggressively. For small learning rates $P_t$ is more uniform and
the algorithm explores more frequently. Note that as $P_t$ concentrates, the variance
of the importance-weighted estimators for poorly performing arms increases
dramatically. There are many ways to tune the learning rate, including allowing
it to vary with time. In this chapter we restrict our attention to the simplest
case by choosing $\eta$ to depend only on the number of actions $k$ and the horizon $n$.
Since the algorithm depends onη, this means that the horizon must be known in
advance, a requirement which can be relaxed (see Note 11).
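Putting these pieces together, the following is a minimal sketch of the full Exp3 loop, assuming the horizon-dependent learning rate of Theorem 11.1 below and taking Eq. (11.6) to be the estimator with range $(-\infty, 1]$ from the discussion above; the environment callback `pull` is a hypothetical stand-in, not part of the algorithm.

```python
import numpy as np

def exp3(pull, n, k, rng=None):
    """Run Exp3 for a known horizon n over k arms.

    pull(a) is a hypothetical environment callback returning the
    reward X_t in [0, 1] of the played arm a in the current round.
    """
    rng = np.random.default_rng() if rng is None else rng
    eta = np.sqrt(np.log(k) / (n * k))  # learning rate of Theorem 11.1
    S_hat = np.zeros(k)                 # estimated cumulative rewards
    for _ in range(n):
        # Exponential weighting, Eq. (11.7), stabilized by max-subtraction.
        w = np.exp(eta * (S_hat - S_hat.max()))
        P = w / w.sum()
        A = rng.choice(k, p=P)          # sample the arm to play
        X = pull(A)                     # observe the played arm's reward
        # Estimator of Eq. (11.6): unplayed arms get 1, the played arm
        # gets 1 - (1 - X) / P[A]; all values lie in (-inf, 1].
        S_hat += 1.0
        S_hat[A] -= (1.0 - X) / P[A]
    return S_hat
```

A toy run against a stochastic two-armed environment might look like `exp3(lambda a: float(np.random.random() < (0.3, 0.7)[a]), n=10000, k=2)`.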
11.4 Regret analysis
We are now ready to bound the expected regret of Exp3.
Theorem 11.1. Let $x \in [0,1]^{n \times k}$ and $\pi$ be the policy of Exp3 (Algorithm 9) with
learning rate $\eta = \sqrt{\log(k)/(nk)}$. Then
$$R_n(\pi, x) \le 2\sqrt{nk \log(k)}\,.$$
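As an illustration of the scale of this bound, with $k = 10$ actions and horizon $n = 10^5$ it evaluates to $2\sqrt{10^5 \cdot 10 \cdot \log(10)} \approx 3035$, which is small relative to the trivial bound of $n = 10^5$.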