Bandit Algorithms

37.1 Finite adversarial partial monitoring problems

Consider the following problem:

\[
L =
\begin{pmatrix}
0 & 0 \\
1 & 1
\end{pmatrix},
\qquad
\Phi =
\begin{pmatrix}
1 & 1 \\
1 & 1
\end{pmatrix}.
\]


Clearly, in this game the learner can safely ignore the second action and suffer
zero regret, regardless of the choices of the adversary.
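
To make the $(L, \Phi)$ representation concrete, the following is a minimal sketch (assuming numpy; the helper never_worse is our own and not part of the text) that encodes this game and checks that the first action is never worse than the second, which is why the second action can be ignored.

import numpy as np

# Loss matrix L and feedback matrix Phi for the game above.
# Rows index the learner's actions, columns index the adversary's outcomes.
L = np.array([[0, 0],
              [1, 1]])
Phi = np.array([[1, 1],
                [1, 1]])

def never_worse(L, a, b):
    # True if action a's loss is at most action b's loss for every outcome.
    return bool(np.all(L[a] <= L[b]))

# Action 0 is never worse than action 1, so always playing action 0
# guarantees zero regret no matter which outcomes the adversary picks.
print(never_worse(L, 0, 1))  # True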
Example 37.3 (Matching pennies). The penny-matching problem mentioned in
the introduction has $k = 3$ actions, $d = 2$ outcomes and is described by

\[
L =
\begin{pmatrix}
0 & 1 \\
1 & 0 \\
c & c
\end{pmatrix},
\qquad
\Phi =
\begin{pmatrix}
1 & 1 \\
1 & 1 \\
1 & 2
\end{pmatrix}.
\tag{37.2}
\]


Matching pennies is a hard game for $c > 1/2$ in the sense that the adversary can
force the regret of any learner to be at least $\Omega(n^{2/3})$. To see this, consider the
randomized adversary that chooses the first outcome with probability $p$ and the
second with probability $1 - p$. Let $\varepsilon > 0$ be a small constant to be chosen later
and assume $p$ is either $1/2 + \varepsilon$ or $1/2 - \varepsilon$, which determines two environments.
The techniques in Chapter 13 show that the learner can only distinguish between
these environments by playing the third action about $1/\varepsilon^2$ times. If the learner
does not do this, then its expected regret is $\Omega(n\varepsilon)$; if it does, each such play costs
about $c - 1/2 + \varepsilon$ in regret, since the third action has loss $c$ while the better of the
first two actions has expected loss $1/2 - \varepsilon$. Taken together, this shows the regret is
lower bounded by $R_n = \Omega(\min(n\varepsilon, (c - 1/2 + \varepsilon)/\varepsilon^2))$. Choosing $\varepsilon = n^{-1/3}$ leads to
a bound of $R_n = \Omega((c - 1/2)n^{2/3})$. Notice that the argument fails when $c \le 1/2$.
We encourage you to pause for a minute to convince yourself of the correctness of
the above argument and to consider what the situation might be when $c \le 1/2$.
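
For completeness, here is a short check of the final step; the calculation below is our own and the restriction $c \le 3/2$ is an assumption made here so that both terms in the minimum are comparable. Substituting $\varepsilon = n^{-1/3}$,
\[
n\varepsilon = n^{2/3},
\qquad
\frac{c - 1/2 + \varepsilon}{\varepsilon^2} = (c - 1/2)\,n^{2/3} + n^{1/3},
\]
so for any fixed $c \in (1/2, 3/2]$ both terms are $\Omega\big((c - 1/2)\,n^{2/3}\big)$, giving $R_n = \Omega\big((c - 1/2)\,n^{2/3}\big)$.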


Example 37.4 (Bandits). Finite-armed adversarial bandits with binary losses
can be represented in the partial monitoring framework. When $k = 2$ this is
possible with the following matrices:

\[
L =
\begin{pmatrix}
0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1
\end{pmatrix},
\qquad
\Phi =
\begin{pmatrix}
1 & 2 & 1 & 2 \\
1 & 1 & 2 & 2
\end{pmatrix}.
\]


The number of columns for this game is $2^k$. For non-binary losses you would
need even more columns. A partial monitoring problem where $\Phi = L$ can be
called a bandit problem because the learner observes the loss of the chosen action.
In bandit games we can simply use Exp3 to guarantee a regret of $O(\sqrt{kn})$.
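
To illustrate where the $2^k$ columns come from, here is a small sketch, our own construction rather than anything from the text (assuming numpy and itertools): it enumerates every assignment of binary losses to the $k$ arms as a column of $L$ and encodes the observed loss with the symbols $1$ and $2$, so that $\Phi = L + 1$. For $k = 2$ it reproduces the matrices above up to the ordering of the columns.

import itertools
import numpy as np

def bandit_game(k):
    # Each column assigns a 0/1 loss to every arm, so there are 2**k columns.
    columns = list(itertools.product([0, 1], repeat=k))
    L = np.array(columns).T          # shape (k, 2**k)
    # Feedback symbol for the chosen arm: 1 if its loss is 0, 2 if its loss is 1.
    Phi = L + 1
    return L, Phi

L, Phi = bandit_game(2)
print(L)    # the four binary loss vectors as columns
print(Phi)  # same pattern shifted by one: the learner sees only its own loss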
Example 37.5 (Full information problems). One can also represent problems
where the learner observes all the losses. With binary losses and two actions we
have


\[
L =
\begin{pmatrix}
0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1
\end{pmatrix},
\qquad
\Phi =
\begin{pmatrix}
1 & 2 & 3 & 4 \\
1 & 2 & 3 & 4
\end{pmatrix}.
\]


As for bandits, the size of the game grows very quickly as more actions or
outcomes are added.
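
A corresponding sketch for the full information case (again our own illustration and not from the text): the feedback symbol is now simply the index of the outcome, identical for every action, so observing it reveals the losses of all actions at once. For $k = 2$ this gives the matrix $\Phi$ above, with both rows equal to $1\;2\;3\;4$, up to the ordering of the loss columns.

import itertools
import numpy as np

def full_information_game(k):
    # Same loss columns as the bandit game: every binary loss assignment.
    columns = list(itertools.product([0, 1], repeat=k))
    L = np.array(columns).T                 # shape (k, 2**k)
    # One distinct feedback symbol per outcome, shared by all actions.
    symbols = np.arange(1, 2**k + 1)
    Phi = np.tile(symbols, (k, 1))          # identical rows
    return L, Phi

L, Phi = full_information_game(2)
print(Phi)   # both rows read 1 2 3 4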