18.7 Exercises
18.8 Consider a stochastic contextual bandit problem where the contexts
$(C_t)_{t=1}^n$ are sampled independently from a distribution $\xi$ on
$\mathcal{C}$ and the rewards are $(X_t)_{t=1}^n$, where the conditional law
of $X_t$ given $C_t$ and $A_t$ is $P_{C_t A_t}$. The mean reward when
choosing action $i \in [k]$ having observed context $c \in \mathcal{C}$ is
$\mu(c, i) = \int x \, dP_{ci}(x)$. Let $\Phi$ be a subset of functions from
$\mathcal{C}$ to $[k]$. The regret is
\[
R_n = n \sup_{\phi \in \Phi} \mu(\phi) - \mathbb{E}\left[\sum_{t=1}^n X_t\right],
\]
where $\mu(\phi) = \int \mu(c, \phi(c)) \, d\xi(c)$. Consider a variation of
explore-then-commit,
which explores uniformly at random for the first $m$ rounds. Then define
\[
\hat{\mu}(\phi) = \frac{k}{m} \sum_{t=1}^m \mathbb{I}\{A_t = \phi(C_t)\} X_t.
\]
For rounds $t > m$ the algorithm chooses $A_t = \hat{\phi}^*(C_t)$, where
\[
\hat{\phi}^* = \operatorname*{argmax}_{\phi \in \Phi} \hat{\mu}(\phi) = \operatorname*{argmax}_{\phi \in \Phi} \sum_{t=1}^m \hat{X}_{t\phi(C_t)},
\]
where $\hat{X}_{ti} = k \mathbb{I}\{A_t = i\} X_t$. When no maximizer exists
you may assume that $\hat{\mu}(\hat{\phi}^*) \ge \sup_{\phi \in \Phi} \hat{\mu}(\phi) - \varepsilon$
for any $\varepsilon > 0$ of your choice. Show that when $\Phi$ is finite,
for appropriately tuned $m$ the expected regret of this algorithm satisfies
\[
R_n = O\left(n^{2/3} (k \log |\Phi|)^{1/3}\right).
\]
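To see roughly where the tuning of $m$ comes from (a sketch, not part of the
exercise statement): the exploration phase costs at most $m$ in regret, while
each estimate $\hat{\mu}(\phi)$ is built from terms of range $[0, k]$ with
variance of order $k$, so a Bernstein-type bound uniformly over the $|\Phi|$
policies makes the estimates accurate to roughly $\sqrt{k \log(|\Phi|)/m}$.
This suggests
\[
R_n \lesssim m + n \sqrt{\frac{k \log |\Phi|}{m}},
\]
and balancing the two terms gives $m \approx n^{2/3} (k \log |\Phi|)^{1/3}$,
which yields the claimed rate.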
This algorithm is the explore-then-commit version of the epoch-greedy
algorithm by Langford and Zhang [2008]. You should not worry too much
about these details, but of course $\mathcal{C}$ should be associated with a
$\sigma$-algebra and the family of distributions $(P_{ca} : c \in \mathcal{C}, a \in [k])$
should be a probability kernel from $\mathcal{C} \times [k]$ to $\mathbb{R}$.
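In code, the algorithm might look as follows. This is a minimal Python
sketch, assuming a finite policy class and rewards in $[0, 1]$; the callables
sample_context and sample_reward stand in for the unknown environment $\xi$
and $(P_{ca})$ and are not part of the exercise.

    import math
    import random

    def etc_contextual(n, k, policies, sample_context, sample_reward):
        # Explore-then-commit for stochastic contextual bandits (sketch).
        # policies: finite list of maps phi(context) -> arm in {0, ..., k - 1}.
        # sample_context(): draws C_t from xi (placeholder).
        # sample_reward(c, a): draws X_t from P_{ca} in [0, 1] (placeholder).
        m = max(1, min(n, math.ceil(
            n ** (2 / 3) * (k * math.log(len(policies))) ** (1 / 3))))

        data = []    # (context, arm, reward) triples from exploration
        total = 0.0
        for _ in range(m):
            c = sample_context()
            a = random.randrange(k)          # uniform exploration
            x = sample_reward(c, a)
            data.append((c, a, x))
            total += x

        def mu_hat(phi):
            # Importance-weighted estimate (k/m) sum_t I{A_t = phi(C_t)} X_t.
            return (k / m) * sum(x for c, a, x in data if phi(c) == a)

        best = max(policies, key=mu_hat)     # hat(phi)^*, ties broken arbitrarily

        for _ in range(n - m):               # commit phase
            c = sample_context()
            total += sample_reward(c, best(c))
        return total

The factor $k$ makes the estimator unbiased: under uniform exploration,
$\mathbb{E}[k \mathbb{I}\{A_t = \phi(C_t)\} X_t] = \mu(\phi)$.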
18.9 Consider a stochastic contextual bandit problem with the same setup as
the previous exercise and $k = 2$ arms. As before, let $\Phi$ be a set of
functions from $\mathcal{C}$ to $[k]$. Design an algorithm such that
\[
R_n = n \max_{\phi \in \Phi} \mu(\phi) - \mathbb{E}\left[\sum_{t=1}^n X_t\right] \le C \sqrt{n d k \log\left(\frac{n}{d}\right)},
\]
where $C > 0$ is a universal constant and $d = \mathrm{VC}(\Phi)$ is the
VC dimension of $\Phi$.
Hint Use an initial period of exploration to choose a finite ‘representative’
subset of $\Phi$ and then run Exp4 on this subset.
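A minimal Python sketch of the Exp4 phase of the hint follows, assuming the
finite representative subset of $\Phi$ has already been chosen and rewards
lie in $[0, 1]$; constructing that subset from the exploration data is
omitted, and sample_context and sample_reward are again placeholders.

    import math
    import random

    def exp4(n, k, experts, sample_context, sample_reward):
        # Exp4 over a finite set of deterministic policies (sketch), e.g. the
        # representative subset of Phi selected during exploration.
        eta = math.sqrt(2 * math.log(len(experts)) / (n * k))   # learning rate
        logw = [0.0] * len(experts)    # log-weights over the experts
        total = 0.0
        for _ in range(n):
            c = sample_context()
            # Mix the experts' recommendations into an arm distribution P_t.
            mx = max(logw)
            w = [math.exp(v - mx) for v in logw]
            z = sum(w)
            p = [0.0] * k
            for we, phi in zip(w, experts):
                p[phi(c)] += we / z
            a = random.choices(range(k), weights=p)[0]
            x = sample_reward(c, a)
            total += x
            # Loss-based importance-weighted update: an expert recommending
            # the played arm is penalized by the estimated loss (1 - x) / p[a];
            # the common constant for the remaining experts is dropped.
            for e, phi in enumerate(experts):
                if phi(c) == a:
                    logw[e] -= eta * (1 - x) / p[a]
        return total

The point of the hint is that Exp4 over $M$ experts has regret
$O(\sqrt{nk \log M})$, and by the Sauer–Shelah lemma $\Phi$ realizes at most
$(en/d)^d$ distinct behaviours on $n$ contexts, so a representative subset of
this size gives $\log M = O(d \log(n/d))$ and hence the claimed bound.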