Bandit Algorithms

(Jeff_L) #1
6.4 Exercises 94

(a) Prove that ifεt=ε >0, then limn→∞


Rn
n

=


ε
k

∑k

i=1

∆i.

(b)Let ∆min=min{∆i: ∆i> 0 }and letεt=min


{


1 ,t∆Ck (^2) min


}


whereC >0 is
a sufficiently large universal constant. Prove that there exists a universal
C′>0 such that

Rn≤C′

∑k

i=1

(


∆i+ ∆i
∆^2 min

log max

{


e,n∆

(^2) min
k


})


.


6.8(Elimination algorithm) A simple way to generalize the ETC policy to
multiple arms and overcome the problem of tuning the commitment time is to
use an elimination algorithm. The algorithm operates in phases and maintains
an active set of arms that could be optimal. In the`th phase the algorithm aims
to eliminate from the active set all armsifor which ∆i≥ 2 −`.

1:Input: kand sequence (m`)`
2:A 1 ={ 1 , 2 ,...,k}
3:for`= 1, 2 , 3 ,...do
4: Choose each armi∈A`exactlym`times
5: Let ˆμi,`be the average reward for armifrom this phase only
6: Update active set:

A`+1=

{


i: ˆμi,`+ 2−`≥max
j∈A`

μˆj,`

}


7:end for
Algorithm 2

Without loss of generality, assume that arm 1 is an optimal arm. You may
assume that the horizonnis known.

(a) Show that for any`≥1,


P(1∈/A`+1, 1 ∈A`)≤kexp

(


−m`^2

− 2 `
4

)


(b) Show that ifi∈[k] and≥1 are such that ∆i≥ 2 −, then


P(i∈A`+1, 1 ∈A`, i∈A`)≤exp

(


−m`(∆i−^2

−`) 2


4


)


(c)Let `i = min

{


`≥1 : 2−`≤∆i/ 2

}


. Choose min such a way that P(exists: 1∈/A)≤ 1 /nandP(i∈Ai+1)≤ 1 /n.
(d) Show that your algorithm has regret at most


Rn≤C


i:∆i> 0

(


∆i+

1


∆ilog(n)

)


,


whereC >0 is a carefully chosen universal constant.
Free download pdf