Bandit Algorithms

37.6 Upper bound for easy games 466

1 −δ,

Rˆn≤Rˆ′n+γn+√ 2 nlog(1/δ). (37.15) By Lemma 37.18, the surrogate regret is bounded in terms of the local regret:

Rˆ′n=

∑n

t=1

〈`Bt−`a∗n,ut〉≤

1

εG

maxφ∈H

∑n

t=1

〈`Bt−`φ(Bt),ut〉. (37.16)

We prepare to use Hoeffding-Azuma again. Fixφ∈Harbitrarily. Then,

Et− 1

[

〈`Bt−`φ(Bt),ut〉

]

=

∑

j∈A

P ̃tj

∑

a∈A

Qtja〈`a−`φ(j),ut〉

=

∑

j∈A

P ̃tj

∑

a∈A

Qtja(yta−ytφ(j)),

where we used the fact thatP ̃ta=

∑

jP ̃tjQtja. Hoeffding-Azuma’s inequality now shows that with probability at least 1−δ/|H|, ∑n

t=1

〈`Bt−`φ(Bt),ut〉≤

∑

j∈A

∑n

t=1

P ̃tj

∑

a∈A

Qtja(yta−ytφ(j)) +

√

2 nlog(|H|/δ).

The result is completed via a union bound over allφ∈ Hand chaining with
Eqs. (37.15) and (37.16), and noting that

maxφ∈H

∑

j∈A

∑n

t=1

P ̃tj

∑

a∈A

Qtja(yta−ytφ(j))≤

∑

j∈A

maxφ∈H

∑n

t=1

P ̃tj

∑

a∈A

Qtja(yta−ytφ(j))

=

∑

j∈A

max b∈Nj∩A

∑n

t=1

P ̃tj

∑

a∈A

Qtja(yta−ytb) ︸︷︷︸ Rˆnj

. (37.17)

Proof of Theorem 37.15 The proof has two steps: Bounding the local regretRˆnj for eachj∈A, and then merging the bounds.

Step 1: Bounding the local regret For this step fixj∈ A. The local regret can be massaged into a form suitable for the application of the result of Exercise 12.2. LetZtja =P ̃tj(yta−ytj) andGt=σ(A 1 ,...,At) andG= (Gt)nt=0be the associated filtration. A simple rewriting shows that

Rˆnj= max b∈Nj∩A

∑n

t=1

P ̃tj

∑

a∈A

Qtja(yta−ytb) = max b∈Nj∩A

∑n

t=1

∑

a∈A

Qtja(Ztja−Ztjb).

In order to apply the result in Exercise 12.2 we need to check the conditions. Since (Pt)tand (P ̃t)tareG-predictable it follows that (βt)tand (Zt)tare alsoG- predictable. Similarly, (Zˆt)tisG-adapted because (At)tand (Φt)tareG-adapted. It remains to show that assumptions(a–d)are satisfied. For(a)leta∈Nj∩A.

Bandit Algorithms

1

[

]

=

∑

∑

=

∑

∑

∑

∑

∑

√

∑

∑

∑

∑

=

∑

∑

. (37.17)

∑

∑

Get our desktop app

Company

Features

Documentation

Resources