36.5 Notes
(pushforward of $P$ under $(X,Y)$). From this form we immediately see that mutual information is symmetric (in line with its name), nonnegative, and $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent. Let $X$ be such that $H(X)$ is finite. Define $H(X|Y) = \mathbb{E}\big[H_{P_{\sigma(Y)}}(X)\big]$ with $P_{\sigma(Y)}(\cdot) = P(\cdot \mid Y)$. If $X$ is discrete, $H(X|Y)$ of course coincides with $H(X|\sigma(Y))$ as defined in the text. Note that $H(X|Y) \le H(X)$. When the involved quantities are finite, $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$ (see also Lemma 36.6), and also $H(X|Y) = H(X,Y) - H(Y)$, which gives $I(X;Y) = H(X) + H(Y) - H(X,Y)$. Lemma 36.9 can also be derived by defining the conditional mutual information $I(X;Y|Z) = H(X|Z) - H(X|Y,Z)$ and then using the chain rule $I(X;Y,Z) = I(X;Y) + I(X;Y|Z)$. The expression $I(X;Y,Z)$ is $I(X;(Y,Z))$: the use of the semicolon in the definition of $I(\cdot\,;\cdot)$ allows us to remove the extra parentheses without introducing ambiguity.
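The identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$, along with symmetry and nonnegativity, is easy to check numerically for a small discrete joint distribution. The following is a minimal sketch (the joint distribution and helper names are illustrative, not from the text):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a distribution given as a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# An arbitrary joint distribution P(X = x, Y = y); rows index x, columns index y.
joint = [[0.3, 0.1],
         [0.1, 0.5]]

px = [sum(row) for row in joint]          # marginal of X
py = [sum(col) for col in zip(*joint)]    # marginal of Y
pxy = [p for row in joint for p in row]   # flattened joint

# I(X;Y) = H(X) + H(Y) - H(X,Y)
mi = entropy(px) + entropy(py) - entropy(pxy)

# Symmetry: transposing the joint (swapping the roles of X and Y) leaves I unchanged.
joint_t = [list(col) for col in zip(*joint)]
mi_t = (entropy([sum(row) for row in joint_t])
        + entropy([sum(col) for col in zip(*joint_t)])
        - entropy([p for row in joint_t for p in row]))

# Independence: for the product of the marginals, I(X;Y) = 0.
indep = [p * q for p in px for q in py]
mi_indep = entropy(px) + entropy(py) - entropy(indep)
```

Here `mi` is strictly positive (the chosen joint is not a product distribution), `mi_t` equals `mi` exactly, and `mi_indep` is zero up to floating-point error.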
8 The analysis in Section 36.4 can be generalized to structured settings such as
linear bandits [Russo and Van Roy, 2016]. For linear bandits with an infinite
action set the entropy of the optimal action may be infinite. The analysis
can be corrected in this case by discretizing the action set and comparing to
a near-optimal action. This leads to a tradeoff between the fineness of the
discretization and its size, and when the tradeoff is resolved in an optimal
fashion, one obtains an upper bound of order $O(d\sqrt{n \log(1 + n/d)})$ on the
Bayesian regret, slightly improving on previous analyses. The reader is referred to
the recent article by Dong and Van Roy [2018] for this analysis. The assumption
of bounded rewards in Lemma 36.9 can be relaxed to a subgaussian assumption
[Russo and Van Roy, 2016].
9 Let $\mathcal{E} = [0,1]^{n \times k}$ be the set of all adversarial bandits, $\Pi$ the set of all
randomized policies, and $\mathcal{Q}$ the set of all finitely supported distributions on
$(\mathcal{E}, \mathfrak{B}(\mathcal{E}))$. Then
\[
\begin{aligned}
R_n^*(\mathcal{E})
&= \underbrace{\min_{\pi \in \Pi} \sup_{x \in \mathcal{E}} \mathbb{E}_{x,\pi}\Big[\max_{i \in [k]} \sum_{t=1}^n (x_{ti} - x_{tA_t})\Big]}_{\text{Adversarial regret}} \\
&= \underbrace{\sup_{Q \in \mathcal{Q}} \min_{\pi \in \Pi} \int \mathbb{E}_{x,\pi}\Big[\max_{i \in [k]} \sum_{t=1}^n (x_{ti} - x_{tA_t})\Big]\, dQ(x)}_{\text{Bayesian optimal regret}} \qquad (36.9) \\
&\le \sqrt{nk \log(k)/2}\,,
\end{aligned}
\]
where the second equality follows from Sion’s minimax theorem (Exercise 36.11)
and the inequality follows from Theorem 36.7. This bound is a factor of 2 better
than what we gave in Theorem 11.2, but worse than the minimax optimal
bound of $O(\sqrt{kn})$. The approach is still interesting, however, and has been used
in more sophisticated settings like the first near-optimal analysis for adversarial
convex bandits [Bubeck et al., 2015a, Bubeck and Eldan, 2016]. As noted
earlier, the main disadvantage is that the result is non-constructive.
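To get a feel for the size of the gap to the minimax rate, one can evaluate the bound numerically; the values of $n$ and $k$ below are arbitrary illustrations, and the minimax rate is shown without constants:

```python
import math

# Compare the information-theoretic bound sqrt(nk log(k)/2) from Eq. (36.9)
# with the minimax-optimal rate sqrt(kn) (order only, constants omitted).
n, k = 10_000, 10

info_bound = math.sqrt(n * k * math.log(k) / 2)
minimax_rate = math.sqrt(k * n)

# The ratio is exactly sqrt(log(k)/2): the bound is loose by a sqrt(log k) factor.
ratio = info_bound / minimax_rate
```

For these values the ratio is about $1.07$, growing only as $\sqrt{\log(k)/2}$ with the number of arms.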