38.10 Exercises

The action set is A = {left, right} and the state space is S = [S] with S ≥ 2. In all states s > 1, action left deterministically leads to state s − 1 and provides no reward. In state 1, action left leaves the state unchanged and yields a reward of 0.05. The action right tends to make the agent move right, but not deterministically (the learner is swimming against a current). With probability 0.3 the state is incremented, with probability 0.6 the state is left unchanged, and with probability 0.1 the state is decremented. This action incurs a reward of zero in all states except state S, where it yields a reward of 1. The situation when S = 5 is illustrated in Fig. 38.5.

[Figure 38.5 (diagram): the five states of the RiverSwim MDP with the transition probabilities and rewards described above.]

Figure 38.5 The RiverSwim MDP when S = 5. Solid arrows correspond to action left and dashed ones to action right. The right-hand bank is slippery, so the learner sometimes falls back into the river.
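
Before turning to the questions, a minimal sketch of the environment may help. It assumes Python with NumPy and encodes left as 0 and right as 1; the class and method names are illustrative rather than taken from the text.

import numpy as np

class RiverSwim:
    """RiverSwim MDP with states 1, ..., S and actions 0 = left, 1 = right."""

    def __init__(self, S=5, rng=None):
        assert S >= 2
        self.S = S
        self.rng = rng if rng is not None else np.random.default_rng()
        self.state = 1  # the learner starts on the left bank

    def step(self, action):
        s, S = self.state, self.S
        if action == 0:                      # left is deterministic
            reward = 0.05 if s == 1 else 0.0
            s = max(1, s - 1)
        else:                                # right: swim against the current
            reward = 1.0 if s == S else 0.0  # reward only on the right bank
            u = self.rng.random()
            if u < 0.3 and s < S:
                s += 1                       # pushed forward w.p. 0.3
            elif u >= 0.9:
                s = max(1, s - 1)            # swept back w.p. 0.1
            # otherwise the state is unchanged
        self.state = s
        return s, reward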

(a) Show that the optimal policy always takes action right and calculate the optimal average reward ρ∗ as a function of S.
(b) Implement the MDP and test the optimal policy when started from state 1. Plot the total reward as a function of time and compare it with the plot of t ↦ tρ∗. Run multiple simulations to produce error bars. How fast do you think the total reward concentrates around tρ∗? Experiment with different values of S (a sketch addressing (a) and (b) follows below).
(c) The ε-greedy strategy can also be implemented in MDPs as follows: based on the data collected so far, estimate the transition probabilities and rewards using empirical means. Find the optimal policy π∗ of the resulting MDP and, if the current state is s, use the action π∗(s) with probability 1 − ε and choose one of the two actions uniformly at random with the remaining probability. To ensure the empirical MDP has a well-defined optimal policy, mix the empirical estimate of the next-state distributions Pa(s) with the uniform distribution using a small mixture coefficient. Implement this strategy and plot the trajectories it exhibits for various MDP sizes. Explain what you see (see the second sketch below).
(d) Implement UCRL2 and produce the same plots. Can you explain what you see?
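
For parts (a) and (b), one possible sketch (same assumptions as above; the helper names rho_star and run_right are made up for illustration) computes ρ∗ numerically from the stationary distribution of the chain induced by always swimming right and compares the cumulative reward with tρ∗:

import numpy as np
import matplotlib.pyplot as plt

def rho_star(S):
    """Average reward of the always-right policy, via its stationary distribution."""
    P = np.zeros((S, S))                 # chain induced by always swimming right
    for i in range(S):                   # row i corresponds to MDP state i + 1
        up = 0.3 if i < S - 1 else 0.0
        down = 0.1 if i > 0 else 0.0
        P[i, min(i + 1, S - 1)] += up
        P[i, max(i - 1, 0)] += down
        P[i, i] += 1.0 - up - down
    evals, evecs = np.linalg.eig(P.T)    # stationary distribution: left eigenvector
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi /= pi.sum()
    return pi[-1]                        # reward 1 is collected only in state S

def run_right(S, horizon, seed):
    env = RiverSwim(S, rng=np.random.default_rng(seed))
    return np.cumsum([env.step(1)[1] for _ in range(horizon)])

S, horizon, runs = 5, 10_000, 20
rho = rho_star(S)
t = np.arange(1, horizon + 1)
total = np.stack([run_right(S, horizon, seed) for seed in range(runs)])
mean, std = total.mean(axis=0), total.std(axis=0)
plt.plot(t, mean, label="total reward")
plt.fill_between(t, mean - std, mean + std, alpha=0.3)
plt.plot(t, t * rho, "--", label="t * rho_star")
plt.xlabel("t"); plt.ylabel("cumulative reward"); plt.legend(); plt.show()

Because the induced chain is a birth-death chain, detailed balance also gives a closed-form expression for ρ∗ against which the numerical value can be checked.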
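
For part (c), one possible realisation is sketched below. It is not prescribed by the text: the planner uses relative value iteration on the empirical model, and the names avg_reward_greedy, epsilon_greedy_run, eps and alpha are illustrative choices.

import numpy as np

def avg_reward_greedy(P, R, iters=2000, tol=1e-8):
    """Relative value iteration for the average-reward criterion.
    P has shape (S, A, S), R has shape (S, A); returns a greedy policy."""
    h = np.zeros(P.shape[0])
    for _ in range(iters):
        q = R + P @ h                # one-step lookahead values, shape (S, A)
        h_new = q.max(axis=1)
        h_new -= h_new[0]            # keep the iterates bounded
        if np.max(np.abs(h_new - h)) < tol:
            break
        h = h_new
    return q.argmax(axis=1)

def epsilon_greedy_run(env, horizon, eps=0.1, alpha=0.01, seed=0):
    """Model-based epsilon-greedy control as described in part (c)."""
    rng = np.random.default_rng(seed)
    S, A = env.S, 2
    counts = np.zeros((S, A, S))                 # transition counts
    reward_sum = np.zeros((S, A))
    visits = np.zeros((S, A))
    rewards = np.empty(horizon)
    states = np.empty(horizon, dtype=int)
    s = env.state
    for t in range(horizon):
        # empirical model, mixed with the uniform distribution for well-posedness
        P_hat = counts / np.maximum(visits[..., None], 1)
        P_hat = (1 - alpha) * P_hat + alpha / S
        P_hat /= P_hat.sum(axis=-1, keepdims=True)
        R_hat = reward_sum / np.maximum(visits, 1)
        policy = avg_reward_greedy(P_hat, R_hat)  # replanning every step is wasteful but simple
        a = policy[s - 1] if rng.random() > eps else rng.integers(A)
        s_next, r = env.step(a)
        counts[s - 1, a, s_next - 1] += 1
        reward_sum[s - 1, a] += r
        visits[s - 1, a] += 1
        rewards[t], states[t] = r, s_next
        s = s_next
    return np.cumsum(rewards), states

A call such as epsilon_greedy_run(RiverSwim(5), 10_000) returns the cumulative reward and the visited states, which can be plotted as in the previous sketch. Mixing the empirical transition estimates with the uniform distribution makes every transition probability positive, so the empirical MDP is ergodic and relative value iteration is well defined, as the exercise requires.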
