Bandit Algorithms


11 The Exp3 Algorithm


In this chapter we first introduce the formal model of adversarial bandit
environments and discuss its relationship to the stochastic bandit model. This is
followed by a discussion of importance-weighted estimation, the Exp3 algorithm,
which uses this technique, and an analysis of the regret of Exp3.

11.1 Adversarial bandit environments


Let $k > 1$ be the number of arms. A $k$-armed adversarial bandit is an arbitrary
sequence of reward vectors $(x_t)_{t=1}^n$ where $x_t \in [0,1]^k$. In each round the learner
chooses a distribution over the actions $P_t \in \mathcal{P}_{k-1}$. Then the action $A_t \in [k]$ is
sampled from $P_t$ and the learner receives reward $x_{tA_t}$. The interaction protocol
is summarized in Figure 11.1.
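
To make the protocol concrete, here is a minimal simulation sketch in Python/NumPy. It is not part of the text: the function name `play_adversarial_bandit` and its arguments are illustrative, and `policy` stands for any map from observed histories to distributions $P_t \in \mathcal{P}_{k-1}$.

```python
import numpy as np

def play_adversarial_bandit(rewards, policy, rng=None):
    """Illustrative sketch of the interaction protocol in Figure 11.1.

    rewards : (n, k) array holding the adversary's reward vectors x_1, ..., x_n in [0, 1]^k.
    policy  : callable mapping the observed history, a list of (action, reward)
              pairs, to a probability vector over the k arms.
    Returns the chosen actions and the received (bandit) rewards.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, k = rewards.shape
    history, actions, received = [], [], []
    for t in range(n):
        p_t = policy(history)          # learner chooses P_t based on the history
        a_t = rng.choice(k, p=p_t)     # A_t is sampled from P_t
        x_t = rewards[t, a_t]          # only the reward of the chosen arm is observed
        history.append((a_t, x_t))
        actions.append(a_t)
        received.append(x_t)
    return np.array(actions), np.array(received)
```

For instance, the uniformly random learner corresponds to `policy = lambda history: np.ones(k) / k` with `k = rewards.shape[1]`.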
A policy in this setting is a function $\pi : ([k] \times [0,1])^* \to \mathcal{P}_{k-1}$ mapping
history sequences to distributions over actions. The performance of a policy $\pi$ in
environment $x$ is measured by the expected regret, which is the expected loss in
revenue of the policy relative to the best fixed action in hindsight:

$$R_n(\pi, x) = \max_{i \in [k]} \sum_{t=1}^n x_{ti} - \mathbb{E}\left[ \sum_{t=1}^n x_{tA_t} \right]\,, \tag{11.1}$$


where the expectation is over the actions of the learner. The arguments $\pi$ and $x$
are omitted from the regret when they are clear from context.
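
To connect the definition in (11.1) to the simulation sketch above, the following hedged snippet estimates the expected regret by Monte Carlo: the first term is the cumulative reward of the best fixed arm in hindsight, and the expectation over the learner's actions is approximated by averaging the learner's total reward over repeated independent runs. The helper `play_adversarial_bandit` (and its NumPy import) is the hypothetical function introduced earlier, not part of the text.

```python
def estimate_regret(rewards, policy, num_runs=1000, rng=None):
    """Monte Carlo estimate of the expected regret R_n(pi, x) from (11.1)."""
    rng = np.random.default_rng() if rng is None else rng
    best_fixed = rewards.sum(axis=0).max()        # max_i sum_t x_{t,i}
    learner_totals = []
    for _ in range(num_runs):
        _, received = play_adversarial_bandit(rewards, policy, rng)
        learner_totals.append(received.sum())     # sum_t x_{t, A_t} for this run
    return best_fixed - np.mean(learner_totals)   # approximates R_n(pi, x)
```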

Adversary secretly chooses rewards $(x_t)_{t=1}^n$ with $x_t \in [0,1]^k$.
For rounds $t = 1, 2, \ldots, n$:
    Learner selects distribution $P_t \in \mathcal{P}_{k-1}$ and samples $A_t$ from $P_t$.
    Learner observes reward $X_t = x_{tA_t}$.

Figure 11.1 Interaction protocol for $k$-armed adversarial bandits.