where $d(p, q)$ is the relative entropy between Bernoulli distributions with biases $p$ and $q$ respectively. Now $\Delta$ will be chosen to satisfy $\Delta \le 1/4$. It follows from the entropy inequalities in Eq. (14.15) that
\begin{align}
D(P_0, P_j) \le 4\Delta^2\, \mathbb{E}_0[T_j]\,. \tag{38.23}
\end{align}
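For intuition, here is a sketch of where the factor $4\Delta^2$ could come from, assuming (the construction is not shown in this excerpt) that $P_0$ and $P_j$ differ only in Bernoulli transitions with biases $1/2$ and $1/2 + \Delta$, as the constant $4$ suggests:
\begin{align*}
d\left(\tfrac{1}{2}, \tfrac{1}{2} + \Delta\right)
= -\tfrac{1}{2}\log\left(1 - 4\Delta^2\right)
\le \frac{2\Delta^2}{1 - 4\Delta^2}
\le \frac{8\Delta^2}{3}
\le 4\Delta^2\,,
\end{align*}
where the first inequality uses $-\log(1-x) \le x/(1-x)$ and the remaining two use $\Delta \le 1/4$, which is exactly where that assumption earns its keep. Accumulating this per-visit bound over the $\mathbb{E}_0[T_j]$ expected visits (via the usual divergence decomposition) gives Eq. (38.23).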
Using the fact that $0 \le T_\sigma - T_j \le T_\sigma \le n/D$, Exercise 14.4, Pinsker's inequality (Eq. (14.11)) and Eq. (38.23),
\begin{align*}
\mathbb{E}_j[T_\sigma - T_j]
&\ge \mathbb{E}_0[T_\sigma - T_j] - \frac{n}{D}\sqrt{\frac{D(P_0, P_j)}{2}} \\
&\ge \mathbb{E}_0[T_\sigma - T_j] - \frac{n\Delta}{D}\sqrt{2\,\mathbb{E}_0[T_j]}\,.
\end{align*}
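The first inequality above is an instance of a generic pattern: for any random variable $X$ with $0 \le X \le b$ and measures $P, Q$,
\begin{align*}
\mathbb{E}_Q[X] \ge \mathbb{E}_P[X] - b\,\delta(P, Q) \ge \mathbb{E}_P[X] - b\sqrt{\frac{D(P, Q)}{2}}\,,
\end{align*}
where $\delta(P, Q)$ denotes the total variation distance. Here $X = T_\sigma - T_j$ and $b = n/D$; the first step is presumably the content of Exercise 14.4, and the second is Pinsker's inequality.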
Summing over $j$ and applying Cauchy-Schwarz yields
\begin{align}
\sum_{j=1}^k \mathbb{E}_j[T_\sigma - T_j]
&\ge \sum_{j=1}^k \mathbb{E}_0[T_\sigma - T_j] - \frac{n\Delta}{D}\sum_{j=1}^k \sqrt{2\,\mathbb{E}_0[T_j]} \nonumber\\
&\ge (k-1)\,\mathbb{E}_0[T_\sigma] - \frac{n\Delta}{D}\sqrt{2k\,\mathbb{E}_0[T_\sigma]} \nonumber\\
&\ge \frac{c_1 n(k-1)}{D} - \frac{n\Delta}{D}\sqrt{\frac{2 c_2 n k}{D}} \nonumber\\
&\ge \frac{c_1 n(k-1)}{2D}\,, \tag{38.24}
\end{align}
where the second inequality uses $\sum_{j=1}^k T_j \le T_\sigma$ (so that $\sum_{j=1}^k \mathbb{E}_0[T_j] \le \mathbb{E}_0[T_\sigma]$) together with Cauchy-Schwarz, and the last inequality follows by choosing
\begin{align*}
\Delta = \frac{c_1 (k-1)}{2}\sqrt{\frac{D}{2 c_2 n k}}\,.
\end{align*}
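Substituting this choice back into the second term of the previous display confirms that it eats exactly half of the first term, a short calculation worth recording:
\begin{align*}
\frac{n\Delta}{D}\sqrt{\frac{2 c_2 n k}{D}}
= \frac{n}{D}\cdot\frac{c_1(k-1)}{2}\sqrt{\frac{D}{2 c_2 n k}}\cdot\sqrt{\frac{2 c_2 n k}{D}}
= \frac{c_1 n(k-1)}{2D}\,.
\end{align*}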
By Eq. (38.24) there exists a $j \in [k]$ such that
\begin{align*}
\mathbb{E}_j[T_\sigma - T_j] \ge \frac{c_1 n(k-1)}{2Dk}\,,
\end{align*}
since the largest of the $k$ summands is at least their average.
Then for the last step apply Claim 38.11 to show that
\begin{align*}
R_n^j \ge c_3 D \Delta\, \mathbb{E}_j[T_\sigma - T_j]
\ge \frac{c_1^2 c_3\, n(k-1)^2}{4k}\sqrt{\frac{D}{2 c_2 n k}}\,.
\end{align*}
Naive bounding and simplification concludes the proof.
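As one reading of what the naive bounding might look like, assume $k \ge 2$ so that $(k-1)^2 \ge k^2/4$; then
\begin{align*}
R_n^j \ge \frac{c_1^2 c_3\, n(k-1)^2}{4k}\sqrt{\frac{D}{2 c_2 n k}}
\ge \frac{c_1^2 c_3\, n k}{16}\sqrt{\frac{D}{2 c_2 n k}}
= \frac{c_1^2 c_3}{16}\sqrt{\frac{D n k}{2 c_2}}\,,
\end{align*}
a lower bound of order $\sqrt{Dnk}$. One should also check that the chosen $\Delta$ satisfies $\Delta \le 1/4$, which holds once $n$ is sufficiently large relative to $D$ and $k$.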
38.8 Notes
1 MDPs in applications can have millions (or “Billions and Billions”) of states,
which should make the reader worried that the bound in Theorem 38.6 could
be extremely large. The takeaway should be that learning in large MDPs
without additional assumptions is hard, as attested by the lower bound in
Theorem 38.7.