Bandit Algorithms

24.2 Unit ball 274

∑t s=1A

(^2) si≥n/d}. Then,
Rn(A,θ) = ∆Eθ
[n
∑
t=1
∑d
i=1

(

√^1

d

−Atisign(θi)

)]

≥

∆

√

d 2 Eθ

[n ∑

t=1

∑d

i=1

(

√^1

d

−Atisign(θi)

) 2 ]

≥

∆

√

d 2

∑d

i=1

Eθ

[τi ∑

t=1

(

√^1

d

−Atisign(θi)

) 2 ]

,

where the second inequality uses that‖At‖^22 ≤1. Fixi∈[d]. Forx∈{± 1 }, define
Ui(x) =

∑τi t=1(1/

√

d−Atix)^2 and letθ′∈{±∆}dbe another parameter vector such thatθj=θj′forj 6 =iandθi′=−θi. Assume without loss of generality that θi>0. LetPandP′be the laws ofUi(1) with respect to the bandit/learner interaction measure induced byθandθ′, respectively. Then,

Eθ[Ui(1)]≥Eθ′[Ui(1)]−

(

4 n d

+ 2

)√

1

2

D(P,P′)

≥Eθ′[Ui(1)]−

∆

2

(

4 n d

+ 2

)

√√

E

[τi ∑

t=1

A^2 ti

]

(24.3)

≥Eθ′[Ui(1)]−

∆

2

(

4 n d

+ 2

)√

n d

+ 1 (24.4)

≥Eθ′[Ui(1)]−

4

√

3∆n d

√

n d

, (24.5)

where in the first inequality we used Pinsker’s inequality (Eq. (14.11)), the result
in Exercise 14.4, the bound

Ui(1) =

∑τi

t=1

(1/

√

d−Ati)^2 ≤ 2

∑τi

t=1

1

d

+ 2

∑τi

t=1

A^2 ti≤

4 n d

+ 2,

and the assumption thatd≤ 2 n. The inequality in Eq. (24.3) follows from the
chain rule for the relative entropy up to a stopping time (Exercise 15.7). Eq. (24.4)
is true by the definition ofτiand Eq. (24.5) by the assumption thatd≤ 2 n.
Then,

Eθ[Ui(1)] +Eθ′[Ui(−1)]≥Eθ′[Ui(1) +Ui(−1)]−

4

√

3 n∆ d

√

n d

= 2Eθ′

[

τi d

+

∑τi

t=1

A^2 ti

]

−

4

√

3 n∆ d

√

n d

≥

2 n d

−

4

√

3 n∆ d

√

n d

=

n d

.

Bandit Algorithms

(

√^1

)]

≥

∆

√

(

√^1

) 2 ]

≥

∆

√

(

√^1

) 2 ]

,

√

(

+ 2

)√

1

2

D(P,P′)

∆

2

(

+ 2

)

√√

√√

E

]

(24.3)

∆

2

(

+ 2

)√

+ 1 (24.4)

4

√

√

, (24.5)

(1/

√

1

+ 2

+ 2,

4

√

√

[

+

]

−

4

√

√

≥

−

4

√

√

=

.

Get our desktop app

Company

Features

Documentation

Resources