10.5 Exercises 137
(f)Find a choice ofhandTsuch that{Pθ:θ∈Θ}is the family of Bernoulli
distributions.
(g)Find a choice ofhandTsuch that{Pθ:θ∈Θ}is the family of Gaussian
distributions with unit variance means inR.
Exponential families represent a wide range of statistical models. We discuss
them in more detail in Chapter 34. The functiond(θ,θ′) is called therelative
entropybetweenPθandPθ′. We discuss this concept more in Chapter 14.
10.5(KL-UCB for exponential families) LetM={Pθ:θ∈Θ}be a
regular non-singular exponential family with sufficient statisticT(x) =xand
E={(Pθi)ki=1:θ∈Θk}be the set of bandits with reward distributions inM.
Design a policyπsuch that for allν∈Eit holds that
nlim→∞
Rn(π,ν)
log(n)
≤
∑
i:∆i> 0
∆i
di,inf
,
whereμ(θ) =
∫
RxdPθ(x) is the mean ofPθanddi,inf=inf{d(θ,φ) :μ(φ)>
μ∗,φ∈Φ}.
Hint Repeat the proof of Theorem 10.6, adapting as necessary. Working through
Exercise 10.4 will probably be useful. See also the paper by Capp ́e et al. [2013].
10.6(KL-UCB for non-canonical exponential families) Repeat the
previous exercise, but relax the assumption thatT(x) =x.
Hint This is a subtle problem. You should adapt the algorithm so that if there
are ties in the upper confidence bounds, then an arm with the largest number of
plays is chosen. A solution is available. Korda et al. [2013] analyzed Thompson
sampling in this setting. Their result only holds whenθ7→
∫
Rxpθ(x)dh(x) is
invertible, which does not always hold.
In the analysis of KL-UCB for canonical exponential families the asymptotic
rate is a good indicator of the finite-time regret in the sense that theo(log(n))
term hidden by the asymptotics has roughly the same leading constant as
the dominant term. By contrast, the analysis here indicates that
E[Ti(n)]≈log(n)
di,inf
+^1
di,min
,
wheredi,min=di,min(0). Although the latter term is negligible asymptotically,
it may be the dominant term for all reasonablen.
10.7(Comparison to UCB) In this exercise you compare KL-UCB and UCB
empirically.