Bandit Algorithms

18.7 Exercises

18.8 (Stochastic contextual bandits) Let $\mathcal{C}$ be a set of contexts, let the contexts $(C_t)_{t=1}^n$ be sampled independently from a distribution $\xi$ on $\mathcal{C}$, and let the rewards be $(X_t)_{t=1}^n$, where the conditional law of $X_t$ given $C_t$ and $A_t$ is $P_{C_t A_t}$. The mean reward when choosing action $i \in [k]$ having observed context $c \in \mathcal{C}$ is $\mu(c, i) = \int x \, dP_{ci}(x)$. Let $\Phi$ be a subset of functions from $\mathcal{C}$ to $[k]$. The regret is

\[
R_n = n \sup_{\varphi \in \Phi} \mu(\varphi) - \mathbb{E}\left[ \sum_{t=1}^n X_t \right],
\]
where $\mu(\varphi) = \int_{\mathcal{C}} \mu(c, \varphi(c)) \, d\xi(c)$. Consider a variation of explore-then-commit, which explores uniformly at random for the first $m$ rounds. Then define
which explores uniformly at random for the firstmrounds. Then define


\[
\hat{\mu}(\varphi) = \frac{k}{m} \sum_{t=1}^m \mathbb{I}\{A_t = \varphi(C_t)\} X_t.
\]
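Because $A_t$ is uniform on $[k]$ during exploration and independent of $C_t$, this importance-weighted estimator is unbiased. A quick sanity check, using that $\mathbb{E}[\mathbb{I}\{A_t = \varphi(C_t)\} \mid C_t] = 1/k$:
\[
\mathbb{E}\bigl[ k \, \mathbb{I}\{A_t = \varphi(C_t)\} X_t \bigr]
= k \, \mathbb{E}\bigl[ \mathbb{I}\{A_t = \varphi(C_t)\} \, \mu(C_t, \varphi(C_t)) \bigr]
= \mathbb{E}\bigl[ \mu(C_t, \varphi(C_t)) \bigr]
= \mu(\varphi).
\]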

For rounds $t > m$ the algorithm chooses $A_t = \hat{\varphi}^*(C_t)$, where
\[
\hat{\varphi}^* = \operatorname*{argmax}_{\varphi \in \Phi} \hat{\mu}(\varphi) = \operatorname*{argmax}_{\varphi \in \Phi} \sum_{t=1}^m \hat{X}_{t\varphi(C_t)},
\]
where $\hat{X}_{ti} = k \, \mathbb{I}\{A_t = i\} X_t$. When no maximizer exists you may assume that $\hat{\mu}(\hat{\varphi}^*) \geq \sup_{\varphi \in \Phi} \hat{\mu}(\varphi) - \varepsilon$ for any $\varepsilon > 0$ of your choice. Show that when $\Phi$ is finite, then for appropriately tuned $m$ the expected regret of this algorithm satisfies


\[
R_n = O\left( n^{2/3} \left( k \log |\Phi| \right)^{1/3} \right).
\]
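To see where this rate comes from, a heuristic sketch (not part of the exercise statement, assuming rewards in $[0,1]$): exploration costs at most $m$ in regret, while a union bound over the $|\Phi|$ policies suggests $\hat{\mu}$ is uniformly accurate to about $\sqrt{k \log|\Phi| / m}$, so committing costs about $n\sqrt{k \log|\Phi| / m}$ over the remaining rounds. Balancing the two terms,
\[
R_n \lesssim m + n \sqrt{\frac{k \log |\Phi|}{m}},
\qquad
m \approx n^{2/3} (k \log |\Phi|)^{1/3}
\;\Longrightarrow\;
R_n \lesssim n^{2/3} (k \log |\Phi|)^{1/3}.
\]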


This algorithm is the explore-then-commit version of the epoch-greedy algorithm by Langford and Zhang [2008]. You should not worry too much about these details, but of course $\mathcal{C}$ should be associated with a $\sigma$-algebra and the family of distributions $(P_{ca} : c \in \mathcal{C}, a \in [k])$ should be a probability kernel from $\mathcal{C} \times [k]$ to $\mathbb{R}$.
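The procedure is short enough to sketch in code. Below is a minimal Python illustration, not from the book: the environment interface (`sample_context`, `sample_reward`) and all names are hypothetical, and $\Phi$ is represented as a finite list of functions.

```python
import numpy as np

def explore_then_commit(policies, k, n, m, env, rng=None):
    """Explore-then-commit for stochastic contextual bandits.

    policies: finite list Phi of maps context -> arm in {0, ..., k-1}
    env: hypothetical environment with sample_context() and
         sample_reward(context, arm) methods
    """
    rng = rng or np.random.default_rng()
    mu_hat = np.zeros(len(policies))  # estimates mu_hat(phi) for each phi
    best = None
    total_reward = 0.0
    for t in range(n):
        context = env.sample_context()            # C_t ~ xi
        if t < m:
            arm = int(rng.integers(k))            # explore uniformly at random
        else:
            if best is None:
                best = int(np.argmax(mu_hat))     # commit to phi_hat_star
            arm = policies[best](context)
        reward = env.sample_reward(context, arm)  # X_t ~ P_{C_t, A_t}
        total_reward += reward
        if t < m:
            # mu_hat(phi) = (k / m) * sum_{t<=m} I{A_t = phi(C_t)} X_t
            for j, phi in enumerate(policies):
                if phi(context) == arm:
                    mu_hat[j] += k * reward / m
    return total_reward
```

The inner loop over policies makes each exploration round cost $O(|\Phi|)$ time, which is only sensible for a small finite $\Phi$, exactly the regime of this exercise.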

18.9 Consider a stochastic contextual bandit problem with the same setup as the previous exercise and $k = 2$ arms. As before, let $\Phi$ be a set of functions from $\mathcal{C}$ to $[k]$. Design an algorithm such that


\[
R_n = n \max_{\varphi \in \Phi} \mu(\varphi) - \mathbb{E}\left[ \sum_{t=1}^n X_t \right]
\leq C \sqrt{n d k \log\left( \frac{n}{d} \right)},
\]
where $C > 0$ is a universal constant and $d = \mathrm{VC}(\Phi)$ is the VC dimension of $\Phi$.


Hint Use an initial period of exploration to choose a finite ‘representative’
subset of $\Phi$ and then run Exp4 on this subset.
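(The Sauer–Shelah lemma is what makes the hint work: the number of distinct behaviors of $\Phi$ on $m$ observed contexts is at most $(em/d)^d$, so a finite representative subset exists.) For reference, here is a minimal sketch of the core of Exp4 with deterministic policies as experts. It is my own illustration of a common reward-based variant, not the book's pseudocode; the environment interface is hypothetical and the learning rate is left to the caller.

```python
import numpy as np

def exp4(policies, k, n, env, eta, rng=None):
    """Minimal Exp4 with deterministic policies as experts.

    policies: finite list of maps context -> arm in {0, ..., k-1}
    eta: learning rate, e.g. sqrt(2 * log(len(policies)) / (n * k))
    """
    rng = rng or np.random.default_rng()
    log_w = np.zeros(len(policies))   # log-weights over experts
    total_reward = 0.0
    for t in range(n):
        context = env.sample_context()
        q = np.exp(log_w - log_w.max())
        q /= q.sum()                  # exponential-weights distribution over experts
        advice = np.array([phi(context) for phi in policies])
        p = np.bincount(advice, weights=q, minlength=k)  # induced arm distribution
        arm = int(rng.choice(k, p=p))
        reward = env.sample_reward(context, arm)
        total_reward += reward
        # importance-weighted reward estimate, unbiased for each arm's mean
        x_hat = np.zeros(k)
        x_hat[arm] = reward / p[arm]
        # each expert is credited with the estimate for the arm it recommended
        log_w += eta * x_hat[advice]
    return total_reward
```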