Bandit Algorithms

(Jeff_L) #1
7.1 The optimism principle 101

The theorem will be proven by boundingE[Ti(n)] for each suboptimal armi. We
make use of a relatively standard idea, which is to decouple the randomness from
the behavior of the UCB algorithm. LetGibe the ‘good’ event defined by


Gi=

{


μ 1 <min
t∈[n]

UCB 1 (t)

}



{


μˆiui+


2


ui

log

(


1


δ

)


< μ 1

}


,


whereui∈[n] is a constant to be chosen later. SoGiis the event whenμ 1 is
never underestimated by the upper confidence bound of the first arm, while at the
same time the upper confidence bound for the mean of armiafteruiobservations
are taken from this arm is below the payoff of the optimal arm. We will show
two things:


1 IfGioccurs, thenTi(n)≤ui.
2 The complement eventGcioccurs with low probability (governed in some way
yet to be discovered byui).

BecauseTi(n)≤nno matter what, this will mean that

E[Ti(n)] =E[I{Gi}Ti(n)] +E[I{Gci}Ti(n)]≤ui+P(Gci)n. (7.5)

The next step is to complete our promise by showing thatTi(n)≤uionGiand
thatP(Gci)is small. Let us first assume thatGiholds and show thatTi(n)≤ui,
which we do by contradiction. Suppose thatTi(n)> ui. Then armiwas played
more thanuitimes over thenrounds and so there must exist a roundt∈[n]
whereTi(t−1) =uiandAt=i. Using the definition ofGi,


UCBi(t−1) = ˆμi(t−1) +


2 log(1/δ)
Ti(t−1)

(definition of UCBi(t−1))

= ˆμiui+


2 log(1/δ)
ui (sinceTi(t−1) =ui)
< μ 1 (definition ofGi)
<UCB 1 (t−1). (definition ofGi)

HenceAt=argmaxjUCBj(t−1) 6 =i, which is a contradiction. Therefore if
Gioccurs, thenTi(n)≤ui. Let us now turn to upper boundingP(Gci). By its
definition,

Gci=

{


μ 1 ≥min
t∈[n]

UCB 1 (t)

}






μˆiui+


2 log(1/δ)
ui

≥μ 1




. (7.6)

Free download pdf