transition matrix is unknown while the reward function is given. This assumption
is not especially restrictive as the case where the rewards are also unknown is
easily covered using either a reduction or a simple generalization as we explain in
the notes. The regret of a policy $\pi$ is the deficit of rewards suffered relative
to the expected average reward of an optimal policy:

\[
\hat{R}_n = n\rho^* - \sum_{t=1}^{n} r_{A_t}(S_t).
\]
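As a concrete illustration of this definition, here is a minimal numerical sketch on a hypothetical two-state, two-action MDP (a toy example of ours, not one from the text). The optimal gain $\rho^*$ is found by brute-force enumeration of deterministic stationary policies, which suffices for such a small chain, and the regret of a uniformly random learner is then measured directly.

```python
import itertools
import numpy as np

# Toy illustration of the regret definition above (hypothetical MDP, not from
# the text): R_hat_n = n * rho_star - sum of rewards collected by the learner.

rng = np.random.default_rng(0)
S, A, n = 2, 2, 10_000

# P[s, a] is the next-state distribution and r[s, a] the known reward in [0, 1].
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.9, 0.1], [0.5, 0.5]]])  # shape (S, A, S)
r = np.array([[1.0, 0.5],
              [0.0, 0.5]])                # shape (S, A)

def gain(policy):
    """Average reward rho(policy) of a deterministic stationary policy."""
    Ppi = np.array([P[s, policy[s]] for s in range(S)])
    # Stationary distribution: left eigenvector of Ppi for eigenvalue 1.
    evals, evecs = np.linalg.eig(Ppi.T)
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    mu /= mu.sum()
    return float(sum(mu[s] * r[s, policy[s]] for s in range(S)))

# Brute-force rho_star over all A^S deterministic policies (fine for a toy MDP).
rho_star = max(gain(pi) for pi in itertools.product(range(A), repeat=S))

# Run a (suboptimal) uniformly random policy and accumulate its rewards.
s, total = 0, 0.0
for _ in range(n):
    a = rng.integers(A)
    total += r[s, a]
    s = rng.choice(S, p=P[s, a])

print(f"rho_star = {rho_star:.4f}, R_hat_n = {n * rho_star - total:.1f}")
```

For this chain the random policy's long-run average reward works out to $0.6$ while $\rho^* = 11/12$, so the regret grows linearly in $n$: a learner that never exploits pays a constant deficit per round.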

The reader will notice we are comparing the nonrandom $n\rho^*$ to the random
sum of rewards received by the learner, which was also true in the study of
stochastic bandits. The difference is that $\rho^*$ is an asymptotic quantity, while for
stochastic bandits the analogous quantity was $n\mu^*$. The definition still makes
sense, however, because for MDPs with finite diameter $D$ the optimal expected
cumulative reward over $n$ rounds is at least $n\rho^* - D$, so the difference is negligible
(Exercise 38.17). The main result of this chapter is the following.
Theorem 38.6. Let $S$, $A$ and $n$ be natural numbers and $\delta \in (0,1)$. There
exists an efficiently computable policy $\pi$ that, when interacting with any MDP
$M = (\mathcal{S}, \mathcal{A}, P, r)$ with $S$ states, $A$ actions and rewards in $[0,1]$, and any initial state
distribution, satisfies with probability at least $1 - \delta$,
\[
\hat{R}_n < C D(M) S \sqrt{A n \log(nSA/\delta)},
\]
where $C$ is a universal constant.
In Exercise 38.18 we ask you to use the assumption that the rewards are
bounded to find a choice of $\delta \in (0,1)$ such that
\[
\mathbb{E}[\hat{R}_n] \le 1 + C D(M) S \sqrt{2 A n \log(n)}. \tag{38.12}
\]
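One standard route to such a bound, sketched here under the extra assumption $SA \le n^2$ and with constants absorbed into $C$ (not necessarily the intended solution of the exercise): since the rewards lie in $[0,1]$, the regret never exceeds $n$, so splitting the expectation on the failure event of Theorem 38.6 and choosing $\delta = 1/n$ gives
\begin{align*}
\mathbb{E}[\hat{R}_n]
&\le \delta n + C D(M) S \sqrt{A n \log(nSA/\delta)} \\
&= 1 + C D(M) S \sqrt{A n \log(n^2 SA)},
\end{align*}
and $\log(n^2 SA) \le 4 \log(n)$ whenever $SA \le n^2$, so enlarging $C$ by a factor of $\sqrt{2}$ yields a bound of the form (38.12).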
This result is complemented by the following lower bound.
Theorem 38.7. Let $S \ge 3$, $A \ge 2$, $D \ge 6 + 2\log_A S$ and $n \ge DSA$. Then for
any policy $\pi$ there exists a Markov decision process with $S$ states, $A$ actions and
diameter at most $D$ such that
\[
\mathbb{E}[\hat{R}_n] \ge C \sqrt{DSAn},
\]
where $C > 0$ is again a universal constant.
The upper and lower bounds are separated by a factor of at least $\sqrt{DS}$, which
is a considerable gap. Recent work has made progress towards closing this gap, as
we explain in the notes.
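To see where the $\sqrt{DS}$ separation comes from, divide the upper bound by the lower bound, ignoring the universal constants and writing $D$ for $D(M)$:
\[
\frac{D S \sqrt{A n \log(nSA/\delta)}}{\sqrt{DSAn}} = \sqrt{DS \log(nSA/\delta)},
\]
which is $\sqrt{DS}$ up to a logarithmic factor.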

38.5 Upper confidence bounds for reinforcement learning


Reinforcement learning is the subfield of machine learning devoted to
designing and studying algorithms that learn to maximize long-term reward.