
then samples $S_{t+1}$ from the probability vector $P_{A_t}(S_t)$ and the next round begins
(Fig. 38.1).


Figure 38.1 Interaction protocol for Markov decision processes: set $t = 1$ and sample $S_1 \sim \mu$; in each round, observe the state $S_t$, choose an action $A_t \in \mathcal{A}$, receive the reward $r_{A_t}(S_t)$, sample $S_{t+1} \sim P_{A_t}(S_t)$, and increment $t$.
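
To make the protocol concrete, here is a minimal sketch of the interaction loop in Python. The callables sample_initial, policy, reward and transition are hypothetical stand-ins for $\mu$, the agent's decision rule, $r_a(s)$ and $P_a(s)$; none of these names come from the text.

```python
def interact(sample_initial, policy, reward, transition, num_rounds):
    """Run the interaction protocol of Figure 38.1 for a fixed number of rounds.

    sample_initial()  -- draws S_1 from the initial distribution mu
    policy(history)   -- returns an action A_t given the history H_t
    reward(s, a)      -- the reward r_a(s)
    transition(s, a)  -- samples the next state from the probability vector P_a(s)
    """
    s = sample_initial()              # t = 1 and S_1 ~ mu
    history = [s]                     # H_1 = (S_1)
    total_reward = 0.0
    for t in range(1, num_rounds + 1):
        a = policy(history)           # choose action A_t
        total_reward += reward(s, a)  # receive reward r_{A_t}(S_t)
        s = transition(s, a)          # sample S_{t+1} ~ P_{A_t}(S_t)
        history.extend([a, s])        # H_{t+1} = (S_1, A_1, ..., S_t, A_t, S_{t+1})
        # increment t and the next round begins
    return total_reward, history
```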

Although the action set is the same in all states, this does not mean that
$P_a(s)$ or $r_a(s)$ has any relationship to $P_a(s')$ or $r_a(s')$ for states $s \neq s'$. In
this sense it might be better to use an entirely different set of actions for
each state, which would not change the results we present. And while we
are at it, of course one could also allow the number of actions to vary over
the state space.

Histories and policies
Before considering the learning problem, we explain how to act in a known MDP.
Because there is no learning going on, we call our protagonist the 'agent' rather
than the 'learner'. In a stochastic bandit, the optimal policy given knowledge of the
bandit is to choose the action with the largest expected reward in every round.
In a Markov decision process the definition of optimality is less clear.
The history $H_t = (S_1, A_1, \ldots, S_{t-1}, A_{t-1}, S_t)$ in round $t$ contains the
information available before the action for the round is to be chosen. Note
that the state $S_t$ is included in $H_t$. The actions are also included because the agent
may randomize. For simplicity the rewards are omitted because the all-knowing
agent can recompute them if needed from the state-action pairs.
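For example, after two rounds the history is $H_3 = (S_1, A_1, S_2, A_2, S_3)$; the rewards $r_{A_1}(S_1)$ and $r_{A_2}(S_2)$ are not stored, since the agent can recompute them from these state-action pairs.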
A policy is a (possibly randomized) map from the set of possible histories
to actions. Simple policies include memoryless policies, which choose actions
based on only the current state, possibly in a randomized manner. The set
of such policies is denoted by $\Pi_M$ and its elements are identified with maps
$\pi : \mathcal{A} \times \mathcal{S} \to [0,1]$ with $\sum_{a \in \mathcal{A}} \pi(a \mid s) = 1$ for any $s \in \mathcal{S}$, so that $\pi(a \mid s)$ is
interpreted as the probability that policy $\pi$ takes action $a$ in state $s$.
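
As an illustration, the following is a minimal Python sketch of a memoryless policy, represented as a dictionary mapping each state to a probability distribution over actions; the states, actions and probabilities below are invented for the example.

```python
import random

# A memoryless policy pi: for each state s, pi[s][a] is the probability
# pi(a | s) of taking action a in state s, and the probabilities sum to one.
pi = {
    "s0": {"left": 0.25, "right": 0.75},
    "s1": {"left": 1.00, "right": 0.00},
}

def act(pi, s):
    """Sample an action A ~ pi(. | s) for the current state s."""
    actions = list(pi[s].keys())
    probs = list(pi[s].values())
    return random.choices(actions, weights=probs, k=1)[0]

# Because a memoryless policy looks only at the current state S_t, which is the
# last entry of the history H_t, it can be used as a history-based policy via
# lambda history: act(pi, history[-1]).
```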