transition matrix is unknown while the reward function is given. This assumption
is not especially restrictive as the case where the rewards are also unknown is
easily covered using either a reduction or a simple generalization as we explain in
the notes. The regret of a policy $\pi$ is the deficit of rewards suffered relative
to the expected average reward of an optimal policy:

\[
\hat{R}_n = n\rho^* - \sum_{t=1}^{n} r_{A_t}(S_t).
\]
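As a concrete illustration of this definition, here is a minimal numerical sketch on a hypothetical two-state, two-action MDP (a toy example of ours, not one from the text). The optimal gain $\rho^*$ is found by brute-force enumeration of deterministic stationary policies, which suffices for such a small chain, and the regret of a uniformly random learner is then measured directly.

```python
import itertools
import numpy as np

# Toy illustration of the regret definition above (hypothetical MDP, not from
# the text): R_hat_n = n * rho_star - sum of rewards collected by the learner.

rng = np.random.default_rng(0)
S, A, n = 2, 2, 10_000

# P[s, a] is the next-state distribution and r[s, a] the known reward in [0, 1].
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.9, 0.1], [0.5, 0.5]]])  # shape (S, A, S)
r = np.array([[1.0, 0.5],
              [0.0, 0.5]])                # shape (S, A)

def gain(policy):
    """Average reward rho(policy) of a deterministic stationary policy."""
    Ppi = np.array([P[s, policy[s]] for s in range(S)])
    # Stationary distribution: left eigenvector of Ppi for eigenvalue 1.
    evals, evecs = np.linalg.eig(Ppi.T)
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    mu /= mu.sum()
    return float(sum(mu[s] * r[s, policy[s]] for s in range(S)))

# Brute-force rho_star over all A^S deterministic policies (fine for a toy MDP).
rho_star = max(gain(pi) for pi in itertools.product(range(A), repeat=S))

# Run a (suboptimal) uniformly random policy and accumulate its rewards.
s, total = 0, 0.0
for _ in range(n):
    a = rng.integers(A)
    total += r[s, a]
    s = rng.choice(S, p=P[s, a])

print(f"rho_star = {rho_star:.4f}, R_hat_n = {n * rho_star - total:.1f}")
```

For this chain the random policy's long-run average reward works out to $0.6$ while $\rho^* = 11/12$, so the regret grows linearly in $n$: a learner that never exploits pays a constant deficit per round.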

The reader will notice we are comparing the nonrandom $n\rho^*$ to the random
sum of rewards received by the learner, which was also true in the study of
stochastic bandits. The difference is that $\rho^*$ is an asymptotic quantity, while for
stochastic bandits the analogous quantity was $n\mu^*$. The definition still makes
sense, however, because for MDPs with finite diameter $D$ the optimal expected
cumulative reward over $n$ rounds is at least $n\rho^* - D$, so the difference is negligible
(Exercise 38.17). The main result of this chapter is the following.
Theorem 38.6. Let $S$, $A$ and $n$ be natural numbers and $\delta \in (0,1)$. There
exists an efficiently computable policy $\pi$ that, when interacting with any MDP
$M = (\mathcal{S}, \mathcal{A}, P, r)$ with $S$ states, $A$ actions and rewards in $[0,1]$, and any initial state
distribution, satisfies with probability at least $1 - \delta$,
\[
\hat{R}_n < C D(M) S \sqrt{A n \log(nSA/\delta)},
\]
where $C$ is a universal constant.
In Exercise 38.18 we ask you to use the assumption that the rewards are
bounded to find a choice of $\delta \in (0,1)$ such that
\[
\mathbb{E}[\hat{R}_n] \le 1 + C D(M) S \sqrt{2 A n \log(n)}. \tag{38.12}
\]
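One standard route to such a bound, sketched here under the extra assumption $SA \le n^2$ and with constants absorbed into $C$ (not necessarily the intended solution of the exercise): since the rewards lie in $[0,1]$, the regret never exceeds $n$, so splitting the expectation on the failure event of Theorem 38.6 and choosing $\delta = 1/n$ gives
\begin{align*}
\mathbb{E}[\hat{R}_n]
&\le \delta n + C D(M) S \sqrt{A n \log(nSA/\delta)} \\
&= 1 + C D(M) S \sqrt{A n \log(n^2 SA)},
\end{align*}
and $\log(n^2 SA) \le 4 \log(n)$ whenever $SA \le n^2$, so enlarging $C$ by a factor of $\sqrt{2}$ yields a bound of the form (38.12).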
This result is complemented by the following lower bound.
Theorem 38.7. Let $S \ge 3$, $A \ge 2$, $D \ge 6 + 2\log_A S$ and $n \ge DSA$. Then for
any policy $\pi$ there exists a Markov decision process with $S$ states, $A$ actions and
diameter at most $D$ such that
\[
\mathbb{E}[\hat{R}_n] \ge C \sqrt{DSAn},
\]
where $C > 0$ is again a universal constant.
The upper and lower bounds are separated by a factor of at least $\sqrt{DS}$, which
is a considerable gap. Recent work has made progress towards closing this gap, as
we explain in the notes.
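To see where the $\sqrt{DS}$ separation comes from, divide the upper bound by the lower bound, ignoring the universal constants and writing $D$ for $D(M)$:
\[
\frac{D S \sqrt{A n \log(nSA/\delta)}}{\sqrt{DSAn}} = \sqrt{DS \log(nSA/\delta)},
\]
which is $\sqrt{DS}$ up to a logarithmic factor.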

38.5 Upper confidence bounds for reinforcement learning


Reinforcement learning is the subfield of machine learning devoted to
designing and studying algorithms that learn to maximize long-term reward.