Bandit Algorithms

(Jeff_L) #1
24.2 Unit ball 274

∑t
s=1A

(^2) si≥n/d}. Then,
Rn(A,θ) = ∆Eθ
[n

t=1
∑d
i=1


(


√^1


d

−Atisign(θi)

)]





d
2 Eθ

[n

t=1

∑d

i=1

(


√^1


d

−Atisign(θi)

) 2 ]





d
2

∑d

i=1


[τi

t=1

(


√^1


d

−Atisign(θi)

) 2 ]


,


where the second inequality uses that‖At‖^22 ≤1. Fixi∈[d]. Forx∈{± 1 }, define
Ui(x) =


∑τi
t=1(1/


d−Atix)^2 and letθ′∈{±∆}dbe another parameter vector
such thatθj=θj′forj 6 =iandθi′=−θi. Assume without loss of generality that
θi>0. LetPandP′be the laws ofUi(1) with respect to the bandit/learner
interaction measure induced byθandθ′, respectively. Then,

Eθ[Ui(1)]≥Eθ′[Ui(1)]−

(


4 n
d

+ 2


)√


1


2


D(P,P′)


≥Eθ′[Ui(1)]−


2


(


4 n
d

+ 2


)


√√


√√


E


[τi

t=1

A^2 ti

]


(24.3)


≥Eθ′[Ui(1)]−


2


(


4 n
d

+ 2


)√


n
d

+ 1 (24.4)


≥Eθ′[Ui(1)]−

4



3∆n
d


n
d

, (24.5)


where in the first inequality we used Pinsker’s inequality (Eq. (14.11)), the result
in Exercise 14.4, the bound


Ui(1) =

∑τi

t=1

(1/



d−Ati)^2 ≤ 2

∑τi

t=1

1


d

+ 2


∑τi

t=1

A^2 ti≤

4 n
d

+ 2,


and the assumption thatd≤ 2 n. The inequality in Eq. (24.3) follows from the
chain rule for the relative entropy up to a stopping time (Exercise 15.7). Eq. (24.4)
is true by the definition ofτiand Eq. (24.5) by the assumption thatd≤ 2 n.
Then,


Eθ[Ui(1)] +Eθ′[Ui(−1)]≥Eθ′[Ui(1) +Ui(−1)]−

4



3 n∆
d


n
d

= 2Eθ′

[


τi
d

+


∑τi

t=1

A^2 ti

]



4



3 n∆
d


n
d


2 n
d


4



3 n∆
d


n
d

=


n
d

.

Free download pdf