Bandit Algorithms

6.4 Exercises 94

(a) Prove that ifεt=ε >0, then limn→∞

Rn n

=

ε k

∑k

i=1

∆i.

(b)Let ∆min=min{∆i: ∆i> 0 }and letεt=min

{

1 ,t∆Ck (^2) min

}

whereC >0 is a sufficiently large universal constant. Prove that there exists a universal C′>0 such that

Rn≤C′

∑k

i=1

(

∆i+ ∆i ∆^2 min

log max

{

e,n∆

(^2) min
k

})

.

6.8(Elimination algorithm) A simple way to generalize the ETC policy to multiple arms and overcome the problem of tuning the commitment time is to use an elimination algorithm. The algorithm operates in phases and maintains an active set of arms that could be optimal. In the`th phase the algorithm aims to eliminate from the active set all armsifor which ∆i≥ 2 −`.

1:Input: kand sequence (m`)` 2:A 1 ={ 1 , 2 ,...,k} 3:for`= 1, 2 , 3 ,...do 4: Choose each armi∈A`exactlym`times 5: Let ˆμi,`be the average reward for armifrom this phase only 6: Update active set:

A`+1=

{

i: ˆμi,`+ 2−`≥max j∈A`

μˆj,`

}

7:end for Algorithm 2

Without loss of generality, assume that arm 1 is an optimal arm. You may assume that the horizonnis known.

(a) Show that for any`≥1,

P(1∈/A`+1, 1 ∈A`)≤kexp

(

−m`^2

− 2 ` 4

)

(b) Show that ifi∈[k] and≥1 are such that ∆i≥ 2 −, then

P(i∈A`+1, 1 ∈A`, i∈A`)≤exp

(

−m`(∆i−^2

−`) 2

4

)

(c)Let `i = min

{

`≥1 : 2−`≤∆i/ 2

}

. Choose min such a way that P(exists: 1∈/A)≤ 1 /nandP(i∈Ai+1)≤ 1 /n.
(d) Show that your algorithm has regret at most

Rn≤C

∑

i:∆i> 0

(

∆i+

1

∆ilog(n)

)

,

whereC >0 is a carefully chosen universal constant.

Bandit Algorithms

=

{

}

(

{

})

.

{

}

(

)

(

−`) 2

4

)

{

}

∑

(

1

)

,

Get our desktop app

Company

Features

Documentation

Resources