Bandit Algorithms


12 The Exp3-IX Algorithm


In the last chapter we proved a sublinear bound on the expected regret of Exp3,
but with a dishearteningly large variance. The objective of this chapter is to
modify Exp3 so that the regret stays small in expectation and is simultaneously
well concentrated about its mean. Such results are called high probability
bounds. By slightly modifying the algorithm we show that for each $\delta \in (0,1)$
there exists an algorithm such that with probability at least $1 - \delta$,

\[
\hat R_n = \max_{a \in \mathcal{A}} \sum_{t=1}^n \left( y_{t A_t} - y_{ta} \right) = O\left( \sqrt{nk \log\left( \frac{k}{\delta} \right)} \right).
\]

One way to make Exp3 more robust is to ensure that $P_{ti}$ is never too small.
The first idea that comes to mind is to mix $P_t$ with the uniform distribution.
This is an explicit way of forcing exploration, which after further modification
can be made to work. The resulting algorithm is called Exp3.P, which you will
analyze in Exercise 12.1. We explore a similar approach for which the analysis
is a little more straightforward. The idea is to change the reward estimates to
control the variance at the price of introducing some bias.
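
As a minimal sketch, and assuming losses in $[0,1]$ and a fixed exploration
parameter $\gamma > 0$ (the function names and parameterization are ours, not
the book's), the two ideas look as follows; the biased estimator shown is of
the kind developed in the rest of this chapter:

```python
import numpy as np

def mix_with_uniform(p, gamma):
    """Exp3.P-style forced exploration: mix the exponential-weights
    distribution p with the uniform distribution over len(p) arms,
    so every arm is played with probability at least gamma / len(p)."""
    return (1 - gamma) * np.asarray(p) + gamma / len(p)

def biased_loss_estimate(observed_loss, played_arm, p, gamma):
    """A biased loss estimate: divide the observed loss by P_ti + gamma
    instead of P_ti.  The estimate is then at most 1 / gamma, which
    controls the variance at the price of a (downward) bias."""
    y_hat = np.zeros(len(p))
    y_hat[played_arm] = observed_loss / (p[played_arm] + gamma)
    return y_hat
```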

12.1 The Exp3-IX algorithm


We start by summarizing what we know about the behavior of the random regret
of Exp3. Because we want to use the loss-based estimator, it is more convenient
to switch to losses, which we do for the remainder of the chapter. Rewriting
Eq. (11.15) in terms of losses,

\[
\hat L_n - \hat L_{ni} \le \frac{\log(k)}{\eta} + \frac{\eta}{2} \sum_{j=1}^k \hat L_{nj}\,, \tag{12.1}
\]
where $\hat L_n$ and $\hat L_{ni}$ are defined using the loss estimator $\hat Y_{tj}$ by
\[
\hat L_n = \sum_{t=1}^n \sum_{j=1}^k P_{tj} \hat Y_{tj} \qquad \text{and} \qquad \hat L_{ni} = \sum_{t=1}^n \hat Y_{ti}\,.
\]
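
To make these quantities concrete, here is a minimal sketch, assuming losses
in $[0,1]$, of loss-based Exp3 with the importance-weighted loss estimator
$\hat Y_{tj} = \mathbb{1}\{A_t = j\}\, y_{tj} / P_{tj}$; the function
get_loss and all names are our own illustration, not the book's notation.
The loop accumulates exactly the $\hat L_n$ and $\hat L_{ni}$ defined above:

```python
import numpy as np

def exp3_with_losses(get_loss, n, k, eta, rng=None):
    """Loss-based Exp3 for n rounds over k arms with learning rate eta.
    get_loss(t, arm) must return the loss y in [0, 1] of the played arm.
    Returns (L_hat_n, L_hat_ni) as defined in the text."""
    rng = np.random.default_rng() if rng is None else rng
    S = np.zeros(k)                       # S[j] accumulates the estimates Y_hat_tj
    L_hat_n = 0.0
    for t in range(n):
        w = np.exp(-eta * (S - S.min()))  # shift by the minimum for numerical stability
        P = w / w.sum()                   # exponential-weights distribution P_t
        A = rng.choice(k, p=P)            # play A_t ~ P_t
        y = get_loss(t, A)
        Y_hat = np.zeros(k)
        Y_hat[A] = y / P[A]               # importance-weighted loss estimate
        L_hat_n += P @ Y_hat              # adds sum_j P_tj * Y_hat_tj for this round
        S += Y_hat                        # after round n, S[i] equals L_hat_ni
    return L_hat_n, S
```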