Bandit Algorithms

38.7 Proof of lower bound

number of leaves and label these states s_1, ..., s_L. The last two states are s_g and
s_b (‘good’ and ‘bad’ respectively). For each i ∈ [L] the learner can take any action
a ∈ A and transitions to either the good state or the bad state according to

P_a(s_i, s_g) = 1/2 + ε(a, i)   and   P_a(s_i, s_b) = 1/2 − ε(a, i).

The function ε will be chosen so that ε(a, i) = 0 for all (a, i) pairs except one. For
this special state-action pair we let ε(a, i) = ∆ for appropriately tuned ∆ > 0.
The good state and the bad state have the same transitions for all actions:


P_a(s_g, s_g) = 1 − δ,   P_a(s_g, s_∘) = δ,
P_a(s_b, s_b) = 1 − δ,   P_a(s_b, s_∘) = δ.

Choosing δ = 4/D, which under the assumptions of the theorem is guaranteed
to be in (0, 1], ensures that the diameter of the described MDP is at most D,
regardless of the value of ∆. The reward function is r_a(s) = 1 if s = s_g and
r_a(s) = 0 otherwise.
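As a concrete illustration, here is a minimal sketch (in Python with NumPy, not from the book) of the transition probabilities and rewards described above. The function name build_lower_bound_mdp, the state ordering, and the shortcut that the learner moves from the root directly to a chosen leaf are illustrative assumptions; the book's full construction reaches the leaves through a tree rooted at s_∘.

    import numpy as np

    def build_lower_bound_mdp(A, L, D, gap, special=(0, 0)):
        """Transition tensor P[a, s, s'] and reward r[s] of the lower-bound MDP.

        State indices: 0 = root s_o, 1..L = leaves s_1..s_L,
        L+1 = good state s_g, L+2 = bad state s_b.
        `special` = (leaf index i, action a) is the unique pair with eps(a, i) = gap.
        NOTE: moving from the root straight to leaf `a mod L` is a simplification;
        the book routes the learner to a leaf through a tree of internal states.
        """
        assert D >= 4, "delta = 4/D must lie in (0, 1]"
        S = L + 3
        delta = 4.0 / D                   # probability of returning to the root
        root, good, bad = 0, L + 1, L + 2

        P = np.zeros((A, S, S))
        r = np.zeros(S)
        r[good] = 1.0                     # reward 1 in the good state, 0 elsewhere

        for a in range(A):
            P[a, root, 1 + a % L] = 1.0   # simplified root -> leaf move
            for i in range(L):            # leaf -> good/bad with prob 1/2 +/- eps
                eps = gap if (i, a) == special else 0.0
                P[a, 1 + i, good] = 0.5 + eps
                P[a, 1 + i, bad] = 0.5 - eps
            for s in (good, bad):         # good/bad states: stay or return to root
                P[a, s, s] = 1.0 - delta
                P[a, s, root] = delta
        return P, r

    # Example: 2 actions, 3 leaves, diameter parameter D = 20, gap 0.05.
    P, r = build_lower_bound_mdp(A=2, L=3, D=20, gap=0.05, special=(1, 0))
    assert np.allclose(P.sum(axis=2), 1.0)   # every row is a probability vector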
The connection to finite-armed bandits is straightforward. Each time the learner
arrives in state s_∘ it selects which leaf to visit and then chooses an action from
that leaf. This corresponds to choosing one of k = LA = Ω(SA) meta actions.
The optimal policy is to select the meta action with the largest probability of
transitioning to the good state. The choice of δ means the learner expects to
stay in the good/bad state for approximately D rounds, which also makes the
diameter of this MDP about D. This means the learner expects to make about
n/D decisions and the rewards are roughly in [0, D], so we should expect the
regret to be Ω(D√(kn/D)) = Ω(√(nDSA)).
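To spell out the arithmetic behind this last display (an expansion of the step, not present in the original text): a bandit with k arms played over roughly n/D rounds, with per-decision rewards of order D, has worst-case regret of order D√(kn/D), and

\[
D\sqrt{\frac{kn}{D}} = \sqrt{knD} = \sqrt{LA\,nD} = \Omega\bigl(\sqrt{nDSA}\bigr),
\]

where the final equality uses k = LA = Ω(SA).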

Figure 38.3 Lower bound construction for A = 2 and S = 8. The resulting MDP is
roughly equivalent to a bandit with six actions. [Diagram omitted: the root state s_∘
sits above leaves s_1, s_2, s_3; each leaf leads to the good state s_g or the bad state s_b;
the good/bad states self-loop with probability 1 − δ and return to s_∘ with probability δ,
with reward 1 in the good state and 0 in the bad state.]