
Substituting into Eq. (18.11) leads to
\[
R_n \le \frac{\log(M)}{\eta} + \frac{\eta n k}{2} = \sqrt{2nk\log(M)}.
\]
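The equality holds for the learning rate that balances the two terms: setting the derivative of f(η) = log(M)/η + ηnk/2 to zero gives
\[
\eta = \sqrt{\frac{2\log(M)}{nk}},
\qquad
f(\eta) = \sqrt{\frac{nk\log(M)}{2}} + \sqrt{\frac{nk\log(M)}{2}} = \sqrt{2nk\log(M)}.
\]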

Let us see how this theorem can be applied to the contextual bandit where C is a finite set and Φ is the set of all functions from C to [k]. To each of these functions φ ∈ Φ we associate an expert m with E_{mi}(t) = I{φ(c_t) = i}. Then M = k^{|C|} and Theorem 18.1 says that
\[
R_n \le \sqrt{2nk|C|\log(k)},
\]
which is the same bound we derived using an independent copy of Exp3 for each context. More generally, if C is arbitrary (possibly infinite) and Φ is a finite set of functions from C to [k], then the theorem ensures that
\[
R_n \le \sqrt{2nk\log(|\Phi|)}.
\]
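To make the reduction concrete, here is a minimal sketch (not from the book) of how the expert set for a finite context space can be materialized; `full_expert_advice` and its argument names are illustrative, and the brute-force enumeration is only feasible when k^{|C|} is small:

```python
from itertools import product

import numpy as np


def full_expert_advice(contexts, n_contexts, k):
    """Advice E[t, m, i] = I{phi_m(c_t) = i} for ALL functions phi: C -> [k].

    contexts: observed context indices c_1, ..., c_n (each in range(n_contexts)).
    Returns an array of shape (n, M, k) with M = k ** n_contexts experts.
    """
    phis = list(product(range(k), repeat=n_contexts))  # all k^|C| functions
    n, M = len(contexts), len(phis)
    E = np.zeros((n, M, k))
    for t, c in enumerate(contexts):
        for m, phi in enumerate(phis):
            E[t, m, phi[c]] = 1.0  # expert m deterministically plays phi_m(c_t)
    return E


# |C| = 3 contexts and k = 2 actions give M = 2^3 = 8 experts.
E = full_expert_advice(contexts=[0, 2, 1, 0], n_contexts=3, k=2)
assert E.shape == (4, 8, 2)
```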

These results seem quite promising already, but in fact there is another
improvement possible. The intuition is that learning should be easier if the
experts have a high degree of agreement. One way to measure this is by

\[
E^*_t = \sum_{s=1}^{t} \sum_{i=1}^{k} \max_{m \in [M]} E_{mi}(s).
\]

In Exercise 18.7 you will show that if all experts make identical recommendations, then E^*_t = t, and that no matter how the experts behave,
\[
E^*_n \le n \min(k, M). \tag{18.13}
\]
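Continuing the sketch above, E^*_n is cheap to compute from an advice array, and both facts are easy to check numerically (function names again illustrative):

```python
import numpy as np


def e_star(E):
    """E*_n = sum_t sum_i max_m E[t, m, i] for advice E of shape (n, M, k)."""
    return E.max(axis=1).sum()  # max over experts, then sum over rounds and actions


n, M, k = 5, 3, 4

# Identical deterministic experts: sum_i max_m E[t, m, i] = 1 each round, so E*_n = n.
identical = np.zeros((n, M, k))
identical[:, :, 0] = 1.0
assert e_star(identical) == n

# Arbitrary (randomized) advice still satisfies E*_n <= n min(k, M), i.e. Eq. (18.13).
rng = np.random.default_rng(0)
arbitrary = rng.dirichlet(np.ones(k), size=(n, M))  # shape (n, M, k)
assert e_star(arbitrary) <= n * min(k, M)
```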
In this sense E^*_n/n can be viewed as the effective number of experts, which depends on the degree of disagreement in the experts' recommendations. By modifying the algorithm to use a time-varying learning rate, one can prove the following theorem.
Theorem 18.3. Assume the same conditions as in Theorem 18.1, except let \(\eta_t = \sqrt{\log(M)/E^*_t}\). Then there exists a universal constant C > 0 such that
\[
R_n \le C\sqrt{E^*_n \log(M)}. \tag{18.14}
\]
The proof of Theorem 18.3 is not hard and is left to Exercise 18.4. The bound tells us that Exp4 with the suggested learning rate is able to adapt to the degree of disagreement between the experts, which is quite an encouraging result. As a further benefit, the learning rate does not depend on the horizon, so the algorithm is anytime.
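As an illustration, here is a minimal sketch of Exp4 with this anytime learning rate, assuming losses in [0, 1] and the usual importance-weighted loss estimator; the function and argument names are ours, not the book's:

```python
import numpy as np


def exp4_anytime(advice, loss_fn, rng=None):
    """Sketch of Exp4 with the anytime rate eta_t = sqrt(log(M) / E*_t).

    advice:  array of shape (n, M, k); advice[t, m] is expert m's
             probability vector over the k actions in round t.
    loss_fn: loss_fn(t, a) -> loss of action a in round t, assumed in [0, 1].
    """
    rng = rng or np.random.default_rng()
    n, M, k = advice.shape
    L_hat = np.zeros(M)   # cumulative importance-weighted expert losses
    e_star = 0.0          # running E*_t
    total_loss = 0.0
    for t in range(n):
        E_t = advice[t]                           # (M, k) advice matrix
        e_star += E_t.max(axis=0).sum()           # update E*_t with round t's advice
        eta = np.sqrt(np.log(M) / e_star)         # eta_t of Theorem 18.3
        w = np.exp(-eta * (L_hat - L_hat.min()))  # shift for numerical stability
        Q = w / w.sum()                           # distribution over experts
        P = Q @ E_t                               # induced distribution over actions
        a = rng.choice(k, p=P / P.sum())
        y = loss_fn(t, a)
        total_loss += y
        y_hat = np.zeros(k)
        y_hat[a] = y / P[a]                       # importance-weighted loss estimate
        L_hat += E_t @ y_hat                      # propagate the estimate to the experts
    return total_loss
```

Note that η_t depends only on the advice observed up to round t, so the horizon n never enters the algorithm; this is what makes it anytime.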

18.5 Notes


1 The most important concept in this chapter is that there are tradeoffs when
choosing the competitor class. A large class leads to a more meaningful definition