Bandit Algorithms


Input: A ⊂ R^d and δ

Step 0: Set ℓ = 1 and let A_1 = A.

Step 1: Let t_ℓ = t be the current timestep and find a G-optimal design π_ℓ ∈ P(A_ℓ) with |Supp(π_ℓ)| ≤ d(d+1)/2 that maximizes

    log det V(π_ℓ)   subject to   Σ_{a∈A_ℓ} π_ℓ(a) = 1.

Step 2: Let ε_ℓ = 2^{-ℓ} and

    T_ℓ(a) = ⌈(2d π_ℓ(a)/ε_ℓ²) log(k ℓ(ℓ+1)/δ)⌉   and   T_ℓ = Σ_{a∈A_ℓ} T_ℓ(a).

Step 3: Choose each action a ∈ A_ℓ exactly T_ℓ(a) times.

Step 4: Calculate the empirical estimate:

    θ̂_ℓ = V_ℓ^{-1} Σ_{t=t_ℓ}^{t_ℓ+T_ℓ} A_t X_t   with   V_ℓ = Σ_{a∈A_ℓ} T_ℓ(a) a aᵀ.

Step 5: Eliminate low-rewarding arms:

    A_{ℓ+1} = { a ∈ A_ℓ : max_{b∈A_ℓ} ⟨θ̂_ℓ, b − a⟩ ≤ 2ε_ℓ }.

Step 6: ℓ ← ℓ + 1 and goto Step 1.

Algorithm 12: Phased elimination with G-optimal exploration.
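The steps above can be sketched in code. This is a minimal illustration, not the book's implementation: the Frank-Wolfe routine for computing an (approximately) G-optimal design, the `pull(i)` reward callback, and all function names are our own assumptions.

```python
import numpy as np

def g_optimal_design(arms, tol=1e-3, max_iter=1000):
    """Frank-Wolfe iterations for the D-optimal design over the rows of
    `arms`; by Kiefer-Wolfowitz duality this is also G-optimal."""
    k, d = arms.shape
    pi = np.full(k, 1.0 / k)                      # start from the uniform design
    for _ in range(max_iter):
        V = arms.T @ (arms * pi[:, None])         # V(pi) = sum_a pi(a) a a^T
        Vinv = np.linalg.pinv(V)                  # pinv: active arms may not span R^d
        g = np.einsum("ij,jk,ik->i", arms, Vinv, arms)  # g(a) = a^T V^-1 a
        j = int(np.argmax(g))
        if g[j] <= d * (1 + tol):                 # optimal design has max_a g(a) = d
            break
        gamma = (g[j] / d - 1) / (g[j] - 1)       # exact line-search step size
        pi = (1 - gamma) * pi
        pi[j] += gamma
    return pi

def phased_elimination(arms, pull, n, delta=0.05):
    """Sketch of phased elimination with G-optimal exploration.
    `pull(i)` returns a (possibly noisy) reward for arm index i."""
    k, d = arms.shape
    active = np.arange(k)                         # indices of arms still in play
    ell, t = 1, 0
    while t < n and len(active) > 1:
        A = arms[active]
        pi = g_optimal_design(A)                  # Step 1
        eps = 2.0 ** (-ell)                       # Step 2
        T = np.ceil(2 * d * pi / eps**2
                    * np.log(k * ell * (ell + 1) / delta)).astype(int)
        V = (A * T[:, None]).T @ A                # V_ell = sum_a T(a) a a^T
        S = np.zeros(d)
        for i in range(len(active)):              # Step 3: play each arm T(a) times
            for _ in range(T[i]):
                if t >= n:
                    break
                S += A[i] * pull(active[i])
                t += 1
        theta = np.linalg.pinv(V) @ S             # Step 4: least-squares estimate
        rewards = A @ theta
        active = active[rewards.max() - rewards <= 2 * eps]  # Step 5: eliminate
        ell += 1                                  # Step 6
    return active
```

With noiseless rewards and the standard basis as the arm set, the suboptimal arms are eliminated once ε_ℓ falls below half their gaps, after which only the best arm remains.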

The proof of this theorem follows relatively directly from the high-probability
correctness of the confidence intervals used to eliminate low-rewarding arms. We
leave the details to the reader in Exercise 22.1.


22.1 Notes


1 The assumption that the action set does not change is crucial for Algorithm 12. Several complicated algorithms have been proposed and analyzed for the case where A_t is allowed to change from round to round under the assumption that |A_t| ≤ k for all rounds. For these algorithms it has been proven that


    R_n = O(√(nd log^3(nk))).   (22.1)


When k is small these results improve on the bound for LinUCB in Chapter 19 by a factor of √d.
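Up to logarithmic factors the comparison works out as follows, taking the Chapter 19 LinUCB bound as Õ(d√n) (a shorthand adopted here for the comparison, with all log terms suppressed on both sides):

```latex
\frac{\tilde{O}\!\left(d\sqrt{n}\right)}{\tilde{O}\!\left(\sqrt{nd}\right)}
  = \sqrt{d} \qquad \text{(up to logarithmic factors)},
```

so for fixed k the bound in (22.1) is smaller by roughly √d.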