Bandit Algorithms

37.1 Finite adversarial partial monitoring problems

Consider the following problem:

\[
L =
\begin{pmatrix}
0 & 0 \\
1 & 1
\end{pmatrix},
\qquad
\Phi =
\begin{pmatrix}
1 & 1 \\
1 & 1
\end{pmatrix}.
\]


Clearly, in this game the learner can safely ignore the second action and suffer
zero regret, regardless of the choices of the adversary.
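
To make the $(L, \Phi)$ representation concrete, the following is a minimal sketch (assuming numpy; the helper never_worse is our own and not part of the text) that encodes this game and checks that the first action is never worse than the second, which is why the second action can be ignored.

import numpy as np

# Loss matrix L and feedback matrix Phi for the game above.
# Rows index the learner's actions, columns index the adversary's outcomes.
L = np.array([[0, 0],
              [1, 1]])
Phi = np.array([[1, 1],
                [1, 1]])

def never_worse(L, a, b):
    # True if action a's loss is at most action b's loss for every outcome.
    return bool(np.all(L[a] <= L[b]))

# Action 0 is never worse than action 1, so always playing action 0
# guarantees zero regret no matter which outcomes the adversary picks.
print(never_worse(L, 0, 1))  # True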
Example 37.3 (Matching pennies). The penny-matching problem mentioned in
the introduction has $k = 3$ actions, $d = 2$ outcomes and is described by

\[
L =
\begin{pmatrix}
0 & 1 \\
1 & 0 \\
c & c
\end{pmatrix},
\qquad
\Phi =
\begin{pmatrix}
1 & 1 \\
1 & 1 \\
1 & 2
\end{pmatrix}.
\tag{37.2}
\]


Matching pennies is a hard game for $c > 1/2$ in the sense that the adversary can
force the regret of any learner to be at least $\Omega(n^{2/3})$. To see this, consider the
randomized adversary that chooses the first outcome with probability $p$ and the
second with probability $1 - p$. Let $\varepsilon > 0$ be a small constant to be chosen later
and assume $p$ is either $1/2 + \varepsilon$ or $1/2 - \varepsilon$, which determines two environments.
The techniques in Chapter 13 show that the learner can only distinguish between
these environments by playing the third action about $1/\varepsilon^2$ times. If the learner
does not do this, then its expected regret is $\Omega(n\varepsilon)$; if it does, each such play costs
about $c - 1/2 + \varepsilon$ in regret, since the third action has loss $c$ while the better of the
first two actions has expected loss $1/2 - \varepsilon$. Taken together, this shows the regret is
lower bounded by $R_n = \Omega(\min(n\varepsilon, (c - 1/2 + \varepsilon)/\varepsilon^2))$. Choosing $\varepsilon = n^{-1/3}$ leads to
a bound of $R_n = \Omega((c - 1/2)n^{2/3})$. Notice that the argument fails when $c \le 1/2$.
We encourage you to pause for a minute to convince yourself of the correctness of
the above argument and to consider what the situation might be when $c \le 1/2$.
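
For completeness, here is a short check of the final step; the calculation below is our own and the restriction $c \le 3/2$ is an assumption made here so that both terms in the minimum are comparable. Substituting $\varepsilon = n^{-1/3}$,
\[
n\varepsilon = n^{2/3},
\qquad
\frac{c - 1/2 + \varepsilon}{\varepsilon^2} = (c - 1/2)\,n^{2/3} + n^{1/3},
\]
so for any fixed $c \in (1/2, 3/2]$ both terms are $\Omega\big((c - 1/2)\,n^{2/3}\big)$, giving $R_n = \Omega\big((c - 1/2)\,n^{2/3}\big)$.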


Example 37.4 (Bandits). Finite-armed adversarial bandits with binary losses
can be represented in the partial monitoring framework. When $k = 2$ this is
possible with the following matrices:

\[
L =
\begin{pmatrix}
0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1
\end{pmatrix},
\qquad
\Phi =
\begin{pmatrix}
1 & 2 & 1 & 2 \\
1 & 1 & 2 & 2
\end{pmatrix}.
\]


The number of columns for this game is $2^k$. For non-binary losses you would
need even more columns. A partial monitoring problem where $\Phi = L$ can be
called a bandit problem because the learner observes the loss of the chosen action.
In bandit games we can simply use Exp3 to guarantee a regret of $O(\sqrt{kn})$.
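
To illustrate where the $2^k$ columns come from, here is a small sketch, our own construction rather than anything from the text (assuming numpy and itertools): it enumerates every assignment of binary losses to the $k$ arms as a column of $L$ and encodes the observed loss with the symbols $1$ and $2$, so that $\Phi = L + 1$. For $k = 2$ it reproduces the matrices above up to the ordering of the columns.

import itertools
import numpy as np

def bandit_game(k):
    # Each column assigns a 0/1 loss to every arm, so there are 2**k columns.
    columns = list(itertools.product([0, 1], repeat=k))
    L = np.array(columns).T          # shape (k, 2**k)
    # Feedback symbol for the chosen arm: 1 if its loss is 0, 2 if its loss is 1.
    Phi = L + 1
    return L, Phi

L, Phi = bandit_game(2)
print(L)    # the four binary loss vectors as columns
print(Phi)  # same pattern shifted by one: the learner sees only its own loss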
Example 37.5 (Full information problems). One can also represent problems
where the learner observes all the losses. With binary losses and two actions we
have


\[
L =
\begin{pmatrix}
0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1
\end{pmatrix},
\qquad
\Phi =
\begin{pmatrix}
1 & 2 & 3 & 4 \\
1 & 2 & 3 & 4
\end{pmatrix}.
\]


As for bandits, the size of the game grows very quickly as more actions or
outcomes are added.
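
A corresponding sketch for the full information case (again our own illustration and not from the text): the feedback symbol is now simply the index of the outcome, identical for every action, so observing it reveals the losses of all actions at once. For $k = 2$ this gives the matrix $\Phi$ above, with both rows equal to $1\;2\;3\;4$, up to the ordering of the loss columns.

import itertools
import numpy as np

def full_information_game(k):
    # Same loss columns as the bandit game: every binary loss assignment.
    columns = list(itertools.product([0, 1], repeat=k))
    L = np.array(columns).T                 # shape (k, 2**k)
    # One distinct feedback symbol per outcome, shared by all actions.
    symbols = np.arange(1, 2**k + 1)
    Phi = np.tile(symbols, (k, 1))          # identical rows
    return L, Phi

L, Phi = full_information_game(2)
print(Phi)   # both rows read 1 2 3 4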