38.1 Problem setup
A memoryless policy that does not randomize is called a memoryless
deterministic policy. To reduce clutter, such policies are written as S → A
maps and the set of all such policies is denoted by Π_DM. A policy is called a
Markov policy if the actions are randomized and depend only on the round
index and the most recent state. These policies are represented by fixed sequences of
memoryless policies. Under a Markov policy the sequence of states (S_1, S_2, ...)
evolves as a Markov chain (see Section 3.2). If the Markov policy is memoryless,
this chain is homogeneous.
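To make the two notions concrete, here is a minimal Python sketch; the two-state kernel and the names policy_det and markov_policy are illustrative assumptions, not objects from the text. A memoryless deterministic policy is just an S → A map, while a Markov policy is a fixed sequence of memoryless randomized policies indexed by the round.

```python
import random

# A hypothetical two-state, two-action kernel for illustration only.
states = [0, 1]
actions = [0, 1]

# P[a][s] is the distribution over next states after playing a in s.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.7, 0.3]},
}

# Memoryless deterministic policy: a plain S -> A map.
policy_det = {0: 1, 1: 0}

# Markov policy: a fixed sequence of memoryless randomized policies,
# one per round; pi_t(. | s) depends only on t and the current state s.
def markov_policy(t, s):
    p0 = 0.8 if (t + s) % 2 == 0 else 0.3
    return [p0, 1.0 - p0]  # distribution over the two actions

# Under the Markov policy the states form a Markov chain; a memoryless
# policy (no dependence on t) would make the chain homogeneous.
s = 0
for t in range(1, 6):
    a = random.choices(actions, weights=markov_policy(t, s))[0]
    s = random.choices(states, weights=P[a][s])[0]
```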
[Figure 38.2 diagram: states 1–6, with state 1 labelled ‘trap state’ and state 6 labelled ‘high reward state’. Each arrow carries a transition-probability/reward pair (e.g. 3/5, 0 and 1/5, 1 on the intermediate states; rewards of 10 and 12 appear on the arrows at the high-reward state).]
Figure 38.2 A Markov decision process with six states and two actions, represented by
solid and dashed arrows respectively. The numbers next to each arrow give the
transition probability and the reward for that action. For example, taking the
solid action in state 3 results in a reward of 0; the probability of moving to state 4 is
3/5 and the probability of staying in state 3 is 2/5. For human interpretability only, the
actions are given consistent meaning across the states (blue/solid actions ‘increment’ the
state index, black/dashed actions decrement it). In reality there is no sense of similarity
between states or actions built into the MDP formalism.
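In code, the arrow labels of such a diagram become one transition matrix and one reward vector per action. The sketch below is a hedged illustration using NumPy and hypothetical names P and r; it fills in only the entry spelled out in the caption, and every other entry would have to be read off the arrows of Figure 38.2.

```python
import numpy as np

n_states = 6
# One transition matrix per action: P[a][s, s2] is the probability of
# moving from state s+1 to state s2+1 (0-based indices) under action a.
P = {"solid": np.zeros((n_states, n_states)),
     "dashed": np.zeros((n_states, n_states))}
# One reward per state-action pair.
r = {"solid": np.zeros(n_states), "dashed": np.zeros(n_states)}

# The one transition spelled out in the caption: the solid action in
# state 3 earns reward 0, moves to state 4 with probability 3/5 and
# stays in state 3 with probability 2/5.
P["solid"][2, 3] = 3 / 5
P["solid"][2, 2] = 2 / 5
r["solid"][2] = 0.0
# ... the remaining rows would be filled in from the diagram.
```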
Probability spaces
It will be convenient to allow infinitely long interactions between the learner
and the environment. In line with Fig. 38.1, when the agent or learner follows a
policy π in MDP M = (S, A, P, r, μ), such a never-ending interaction gives rise to
a random process (S_1, A_1, S_2, A_2, ...) so that for any s, s′ ∈ S, a ∈ A and t ≥ 1,

(a) P(S_1 = s) = μ(s).
(b) P(S_{t+1} = s′ | H_t, A_t) = P_{A_t}(S_t, s′).
(c) P(A_t = a | H_t) = π(a | H_t).
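Conditions (a)–(c) amount to a recipe for sampling the process forward in time, one coordinate per step. The sketch below makes this explicit for a finite horizon; the dictionary encodings of μ, P and π are illustrative assumptions.

```python
import random

def sample_from(dist):
    """Draw a key of `dist`, a dict mapping outcomes to probabilities."""
    outcomes = list(dist)
    return random.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

def simulate(mu, P, pi, horizon):
    """Sample (S_1, A_1, S_2, A_2, ...) up to `horizon` rounds.

    mu: initial state distribution, mu[s] = P(S_1 = s)           -- (a)
    P:  transition kernel, P[a][s][s2] = P_a(s, s2)              -- (b)
    pi: policy; pi(history) returns a distribution over actions  -- (c)
    """
    s = sample_from(mu)                # (a): S_1 ~ mu
    history = [s]
    for _ in range(horizon):
        a = sample_from(pi(history))   # (c): A_t ~ pi(. | H_t)
        history.append(a)
        s = sample_from(P[a][s])       # (b): S_{t+1} ~ P_{A_t}(S_t, .)
        history.append(s)
    return history

# Usage with a toy one-state MDP and a uniform policy:
trajectory = simulate(
    mu={"s": 1.0},
    P={"a": {"s": {"s": 1.0}}, "b": {"s": {"s": 1.0}}},
    pi=lambda h: {"a": 0.5, "b": 0.5},
    horizon=3,
)
```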
Meticulous readers may wonder whether there exists a probability space (Ω, F, P)
holding the infinite sequence of random variables (S_1, A_1, S_2, A_2, ...) that satisfy
(a)–(c). The Ionescu-Tulcea theorem (Theorem 3.3) furnishes us with a positive
answer (Exercise 38.1). Item (b) above is known as the Markov property. Of