18.7 Exercises

18.8 Consider a stochastic contextual bandit in which the contexts $(C_t)_{t=1}^n$ are sampled independently from a distribution $\xi$ on $\mathcal{C}$ and the rewards are $(X_t)_{t=1}^n$, where the conditional law of $X_t$ given $C_t$ and $A_t$ is $P_{C_t A_t}$. The mean reward when choosing action $i \in [k]$ having observed context $c \in \mathcal{C}$ is $\mu(c, i) = \int_{\mathbb{R}} x \, dP_{ci}(x)$. Let $\Phi$ be a subset of functions from $\mathcal{C}$ to
$[k]$. The regret is
\[
R_n = n \sup_{\phi \in \Phi} \mu(\phi) - \mathbb{E}\left[\sum_{t=1}^n X_t\right],
\]
where $\mu(\phi) = \int_{\mathcal{C}} \mu(c, \phi(c)) \, d\xi(c)$. Consider a variation of explore-then-commit, which explores uniformly at random for the first $m$ rounds. Then define
\[
\hat{\mu}(\phi) = \frac{k}{m} \sum_{t=1}^m \mathbb{I}\{A_t = \phi(C_t)\} X_t\,.
\]
For rounds $t > m$ the algorithm chooses $A_t = \hat{\phi}^*(C_t)$, where
\[
\hat{\phi}^* = \operatorname*{argmax}_{\phi \in \Phi} \hat{\mu}(\phi) = \operatorname*{argmax}_{\phi \in \Phi} \sum_{t=1}^m \hat{X}_{t\phi(C_t)}
\qquad \text{and} \qquad \hat{X}_{ti} = k \mathbb{I}\{A_t = i\} X_t\,.
\]
When no maximizer exists you may assume that $\hat{\mu}(\hat{\phi}^*) \geq \sup_{\phi \in \Phi} \hat{\mu}(\phi) - \varepsilon$ for any $\varepsilon > 0$ of your choice. Show that when $\Phi$ is finite, then for appropriately tuned $m$ the expected regret of this algorithm satisfies
\[
R_n = O\left(n^{2/3} (k \log |\Phi|)^{1/3}\right).
\]
This algorithm is the explore-then-commit version of the epoch-greedy algorithm by Langford and Zhang [2008]. You should not worry too much about these details, but of course $\mathcal{C}$ should be associated with a $\sigma$-algebra and the family of distributions $(P_{ca} : c \in \mathcal{C}, a \in [k])$ should be a probability kernel from $\mathcal{C} \times [k]$ to $\mathbb{R}$.
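To make the procedure concrete, the following is a minimal Python sketch of this explore-then-commit variant for a finite policy class. It is an illustration under stated assumptions, not part of the exercise: the names (etc_contextual, draw_context, draw_reward) are hypothetical, and policies are represented as callables mapping a context to an arm in $\{0, \ldots, k-1\}$.

\begin{verbatim}
import random

def etc_contextual(policies, k, n, m, draw_context, draw_reward):
    """Explore-then-commit for a finite policy class (sketch of Exercise 18.8).

    policies: list of callables, each mapping a context to an arm in {0,...,k-1}
    draw_context(): samples a context C_t from xi (hypothetical helper)
    draw_reward(c, a): samples a reward X_t from P_{ca} (hypothetical helper)
    """
    mu_hat = [0.0] * len(policies)  # importance-weighted value estimates
    total_reward = 0.0

    # Exploration phase: play uniformly at random for the first m rounds and
    # accumulate mu_hat(phi) = (k/m) * sum_t I{A_t = phi(C_t)} X_t for each phi.
    for _ in range(m):
        c = draw_context()
        a = random.randrange(k)
        x = draw_reward(c, a)
        total_reward += x
        for j, phi in enumerate(policies):
            if phi(c) == a:
                mu_hat[j] += k * x / m

    # Commit phase: follow the empirically best policy for the remaining rounds.
    best = max(range(len(policies)), key=lambda j: mu_hat[j])
    for _ in range(m, n):
        c = draw_context()
        total_reward += draw_reward(c, policies[best](c))
    return total_reward
\end{verbatim}

The estimates are unbiased because a uniformly random action matches $\phi(C_t)$ with probability $1/k$. Balancing the exploration cost (of order $m$) against the commit-phase estimation error suggests $m$ of order $n^{2/3}(k \log |\Phi|)^{1/3}$, consistent with the stated bound; making this precise is the content of the exercise.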
18.9 Consider a stochastic contextual bandit problem with the same setup as the previous exercise and $k = 2$ arms. As before, let $\Phi$ be a set of functions from $\mathcal{C}$ to $[k]$. Design an algorithm such that
\[
R_n = n \max_{\phi \in \Phi} \mu(\phi) - \mathbb{E}\left[\sum_{t=1}^n X_t\right] \leq C \sqrt{n d k \log\!\left(\frac{n}{d}\right)},
\]
where $C > 0$ is a universal constant and $d = \mathrm{VC}(\Phi)$ is the VC dimension of $\Phi$.
Hint Use an initial period of exploration to choose a finite `representative' subset of $\Phi$ and then run Exp4 on this subset.
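As a sketch of the hint's second step, the following Python function runs a gain-based variant of Exp4 with explicit uniform exploration over an already-chosen finite subset of $\Phi$. The names (exp4, draw_context, draw_reward) and the parameters eta (learning rate) and gamma (exploration rate) are illustrative assumptions; constructing the representative subset from the initial exploration period is left to the exercise.

\begin{verbatim}
import math
import random

def exp4(policies, k, n, eta, gamma, draw_context, draw_reward):
    """Exp4 over a finite set of policies acting as the experts (sketch).

    Each round mixes the experts' recommendations into a distribution over
    arms, samples an arm, and updates exponential weights using
    importance-weighted reward estimates.
    """
    log_w = [0.0] * len(policies)  # log-weights of the experts
    total_reward = 0.0
    for _ in range(n):
        c = draw_context()
        # Normalized expert weights, computed stably in log-space.
        w_max = max(log_w)
        q = [math.exp(v - w_max) for v in log_w]
        z = sum(q)
        # Arm distribution: uniform exploration mixed with the experts' votes.
        p = [gamma / k] * k
        for j, phi in enumerate(policies):
            p[phi(c)] += (1 - gamma) * q[j] / z
        a = random.choices(range(k), weights=p)[0]
        x = draw_reward(c, a)
        total_reward += x
        # Importance-weighted reward estimate; p[a] >= gamma/k > 0.
        x_hat = x / p[a]
        for j, phi in enumerate(policies):
            if phi(c) == a:
                log_w[j] += eta * x_hat
    return total_reward
\end{verbatim}

The explicit exploration term $\gamma$ keeps every arm probability bounded away from zero, so the importance-weighted estimates have controlled variance; the book's own presentation of Exp4 uses loss-based estimates instead, and either variant serves for the exercise.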