Bandit Algorithms

(Jeff_L) #1
37.6 Upper bound for easy games 466

1 −δ,


Rˆn≤Rˆ′n+γn+√ 2 nlog(1/δ). (37.15)
By Lemma 37.18, the surrogate regret is bounded in terms of the local regret:

Rˆ′n=

∑n

t=1

〈`Bt−`a∗n,ut〉≤

1


εG

maxφ∈H

∑n

t=1

〈`Bt−`φ(Bt),ut〉. (37.16)

We prepare to use Hoeffding-Azuma again. Fixφ∈Harbitrarily. Then,


Et− 1

[


〈`Bt−`φ(Bt),ut〉

]


=



j∈A

P ̃tj


a∈A

Qtja〈`a−`φ(j),ut〉

=



j∈A

P ̃tj


a∈A

Qtja(yta−ytφ(j)),

where we used the fact thatP ̃ta=



jP ̃tjQtja. Hoeffding-Azuma’s inequality
now shows that with probability at least 1−δ/|H|,
∑n

t=1

〈`Bt−`φ(Bt),ut〉≤


j∈A

∑n

t=1

P ̃tj


a∈A

Qtja(yta−ytφ(j)) +


2 nlog(|H|/δ).

The result is completed via a union bound over allφ∈ Hand chaining with
Eqs. (37.15) and (37.16), and noting that


maxφ∈H


j∈A

∑n

t=1

P ̃tj


a∈A

Qtja(yta−ytφ(j))≤


j∈A

maxφ∈H

∑n

t=1

P ̃tj


a∈A

Qtja(yta−ytφ(j))

=



j∈A

max
b∈Nj∩A

∑n

t=1

P ̃tj


a∈A

Qtja(yta−ytb)
︸ ︷︷ ︸
Rˆnj

. (37.17)


Proof of Theorem 37.15 The proof has two steps: Bounding the local regretRˆnj
for eachj∈A, and then merging the bounds.

Step 1: Bounding the local regret
For this step fixj∈ A. The local regret can be massaged into a form suitable
for the application of the result of Exercise 12.2. LetZtja =P ̃tj(yta−ytj)
andGt=σ(A 1 ,...,At) andG= (Gt)nt=0be the associated filtration. A simple
rewriting shows that

Rˆnj= max
b∈Nj∩A

∑n

t=1

P ̃tj


a∈A

Qtja(yta−ytb) = max
b∈Nj∩A

∑n

t=1


a∈A

Qtja(Ztja−Ztjb).

In order to apply the result in Exercise 12.2 we need to check the conditions.
Since (Pt)tand (P ̃t)tareG-predictable it follows that (βt)tand (Zt)tare alsoG-
predictable. Similarly, (Zˆt)tisG-adapted because (At)tand (Φt)tareG-adapted.
It remains to show that assumptions(a–d)are satisfied. For(a)leta∈Nj∩A.
Free download pdf