18.7 Exercises
18.8 Consider a stochastic contextual bandit problem where the contexts
$(C_t)_{t=1}^n$ are sampled independently from a distribution $\xi$ on
$\mathcal{C}$ and the rewards are $(X_t)_{t=1}^n$, where the conditional law
of $X_t$ given $C_t$ and $A_t$ is $P_{C_t A_t}$. The mean reward when
choosing action $i \in [k]$ having observed context $c \in \mathcal{C}$ is
$\mu(c, i) = \int x \, dP_{ci}(x)$. Let $\Phi$ be a subset of functions from
$\mathcal{C}$ to $[k]$. The regret is
\[
R_n = n \sup_{\phi \in \Phi} \mu(\phi) - \mathbb{E}\left[\sum_{t=1}^n X_t\right],
\]
where $\mu(\phi) = \int \mu(c, \phi(c)) \, d\xi(c)$. Consider a variation of
explore-then-commit,
which explores uniformly at random for the first $m$ rounds. Then define
\[
\hat{\mu}(\phi) = \frac{k}{m} \sum_{t=1}^m \mathbb{I}\{A_t = \phi(C_t)\} X_t.
\]
For rounds $t > m$ the algorithm chooses $A_t = \hat{\phi}^*(C_t)$, where
\[
\hat{\phi}^* = \operatorname*{argmax}_{\phi \in \Phi} \hat{\mu}(\phi) = \operatorname*{argmax}_{\phi \in \Phi} \sum_{t=1}^m \hat{X}_{t\phi(C_t)},
\]
where $\hat{X}_{ti} = k \mathbb{I}\{A_t = i\} X_t$. When no maximizer exists
you may assume that $\hat{\mu}(\hat{\phi}^*) \ge \sup_{\phi \in \Phi} \hat{\mu}(\phi) - \varepsilon$
for any $\varepsilon > 0$ of your choice. Show that when $\Phi$ is finite,
for appropriately tuned $m$ the expected regret of this algorithm satisfies
\[
R_n = O\left(n^{2/3} (k \log |\Phi|)^{1/3}\right).
\]
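To see roughly where the tuning of $m$ comes from (a sketch, not part of the
exercise statement): the exploration phase costs at most $m$ in regret, while
each estimate $\hat{\mu}(\phi)$ is built from terms of range $[0, k]$ with
variance of order $k$, so a Bernstein-type bound uniformly over the $|\Phi|$
policies makes the estimates accurate to roughly $\sqrt{k \log(|\Phi|)/m}$.
This suggests
\[
R_n \lesssim m + n \sqrt{\frac{k \log |\Phi|}{m}},
\]
and balancing the two terms gives $m \approx n^{2/3} (k \log |\Phi|)^{1/3}$,
which yields the claimed rate.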
This algorithm is the explore-then-commit version of the epoch-greedy
algorithm by Langford and Zhang [2008]. You should not worry too much
about these details, but of course $\mathcal{C}$ should be associated with a
$\sigma$-algebra and the family of distributions $(P_{ca} : c \in \mathcal{C}, a \in [k])$
should be a probability kernel from $\mathcal{C} \times [k]$ to $\mathbb{R}$.
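In code, the algorithm might look as follows. This is a minimal Python
sketch, assuming a finite policy class and rewards in $[0, 1]$; the callables
sample_context and sample_reward stand in for the unknown environment $\xi$
and $(P_{ca})$ and are not part of the exercise.

    import math
    import random

    def etc_contextual(n, k, policies, sample_context, sample_reward):
        # Explore-then-commit for stochastic contextual bandits (sketch).
        # policies: finite list of maps phi(context) -> arm in {0, ..., k - 1}.
        # sample_context(): draws C_t from xi (placeholder).
        # sample_reward(c, a): draws X_t from P_{ca} in [0, 1] (placeholder).
        m = max(1, min(n, math.ceil(
            n ** (2 / 3) * (k * math.log(len(policies))) ** (1 / 3))))

        data = []    # (context, arm, reward) triples from exploration
        total = 0.0
        for _ in range(m):
            c = sample_context()
            a = random.randrange(k)          # uniform exploration
            x = sample_reward(c, a)
            data.append((c, a, x))
            total += x

        def mu_hat(phi):
            # Importance-weighted estimate (k/m) sum_t I{A_t = phi(C_t)} X_t.
            return (k / m) * sum(x for c, a, x in data if phi(c) == a)

        best = max(policies, key=mu_hat)     # hat(phi)^*, ties broken arbitrarily

        for _ in range(n - m):               # commit phase
            c = sample_context()
            total += sample_reward(c, best(c))
        return total

The factor $k$ makes the estimator unbiased: under uniform exploration,
$\mathbb{E}[k \mathbb{I}\{A_t = \phi(C_t)\} X_t] = \mu(\phi)$.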
18.9 Consider a stochastic contextual bandit problem with the same setup as
the previous exercise and $k = 2$ arms. As before, let $\Phi$ be a set of
functions from $\mathcal{C}$ to $[k]$. Design an algorithm such that
\[
R_n = n \max_{\phi \in \Phi} \mu(\phi) - \mathbb{E}\left[\sum_{t=1}^n X_t\right] \le C \sqrt{n d k \log\left(\frac{n}{d}\right)},
\]
where $C > 0$ is a universal constant and $d = \mathrm{VC}(\Phi)$ is the
VC dimension of $\Phi$.
Hint Use an initial period of exploration to choose a finite ‘representative’
subset of $\Phi$ and then run Exp4 on this subset.
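A minimal Python sketch of the Exp4 phase of the hint follows, assuming the
finite representative subset of $\Phi$ has already been chosen and rewards
lie in $[0, 1]$; constructing that subset from the exploration data is
omitted, and sample_context and sample_reward are again placeholders.

    import math
    import random

    def exp4(n, k, experts, sample_context, sample_reward):
        # Exp4 over a finite set of deterministic policies (sketch), e.g. the
        # representative subset of Phi selected during exploration.
        eta = math.sqrt(2 * math.log(len(experts)) / (n * k))   # learning rate
        logw = [0.0] * len(experts)    # log-weights over the experts
        total = 0.0
        for _ in range(n):
            c = sample_context()
            # Mix the experts' recommendations into an arm distribution P_t.
            mx = max(logw)
            w = [math.exp(v - mx) for v in logw]
            z = sum(w)
            p = [0.0] * k
            for we, phi in zip(w, experts):
                p[phi(c)] += we / z
            a = random.choices(range(k), weights=p)[0]
            x = sample_reward(c, a)
            total += x
            # Loss-based importance-weighted update: an expert recommending
            # the played arm is penalized by the estimated loss (1 - x) / p[a];
            # the common constant for the remaining experts is dropped.
            for e, phi in enumerate(experts):
                if phi(c) == a:
                    logw[e] -= eta * (1 - x) / p[a]
        return total

The point of the hint is that Exp4 over $M$ experts has regret
$O(\sqrt{nk \log M})$, and by the Sauer–Shelah lemma $\Phi$ realizes at most
$(en/d)^d$ distinct behaviours on $n$ contexts, so a representative subset of
this size gives $\log M = O(d \log(n/d))$ and hence the claimed bound.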