Bandit Algorithms


38 Markov Decision Processes


Bandit environments are a sensible model for many simple problems, but they do
not model more complex environments where actions have long-term consequences.
A brewing company needs to plan ahead when ordering ingredients: the decisions made today affect its ability to brew the right amount of beer in the future. A student learning mathematics benefits not only from the immediate
reward of learning an interesting topic, but also from their improved job prospects.
A Markov decision process is a simple way to incorporate long-term
planning into the bandit framework. As in bandits, the learner chooses actions
and receives rewards. But they also observe a state, and the rewards for different
actions depend on the state. Furthermore, the actions chosen affect which state
will be observed next.

38.1 Problem setup


A Markov decision process (MDP) is defined by a tuple $M = (\mathcal{S}, \mathcal{A}, P, r, \mu)$. The
first two items $\mathcal{S}$ and $\mathcal{A}$ are sets called the state space and action space, and
$S = |\mathcal{S}|$ and $A = |\mathcal{A}|$ are their sizes, which may be infinite. An MDP is finite
if $S, A < \infty$. The quantity $P = (P_a : a \in \mathcal{A})$ is called the transition
function with $P_a : \mathcal{S} \times \mathcal{S} \to [0,1]$, so that $P_a(s, s')$ is the probability that the
learner transitions from state $s$ to $s'$ when taking action $a$. The fourth element of
the tuple is $r = (r_a : a \in \mathcal{A})$, which is a collection of reward functions with
$r_a : \mathcal{S} \to [0,1]$. When the learner takes action $a$ in state $s$ it receives a deterministic
reward of $r_a(s)$. The last element is $\mu \in \mathcal{P}(\mathcal{S})$, which is a distribution over the
states that determines the starting state. The transition and reward functions are
often represented by vectors or matrices. When the state space is finite we may
assume without loss of generality that $\mathcal{S} = [S]$. We write $P_a(s) \in [0,1]^S$ for the
probability vector whose $s'$th coordinate is $P_a(s, s')$. In the same way we let
$P_a \in [0,1]^{S \times S}$ be the right stochastic matrix with $(P_a)_{s,s'} = P_a(s, s')$. Finally, we
view $r_a$ as a vector in $[0,1]^S$ in the natural way.
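
To make the definitions concrete, the following is a minimal sketch of how a finite MDP could be stored, with the transition function kept as a stack of right stochastic matrices and each reward function as a vector in $[0,1]^S$. The class name FiniteMDP, its field names and the numbers in the toy example are illustrative assumptions, not notation from the text.

```python
import numpy as np

class FiniteMDP:
    """Sketch of a finite MDP M = (S, A, P, r, mu) with S = {0, ..., S-1}
    and A = {0, ..., A-1}; names and layout are illustrative assumptions."""

    def __init__(self, P, r, mu):
        # P[a] is the S x S right stochastic matrix for action a,
        # so P[a][s, s'] is the probability of moving from s to s'.
        self.P = np.asarray(P, dtype=float)    # shape (A, S, S)
        # r[a] is the reward vector for action a, with entries in [0, 1].
        self.r = np.asarray(r, dtype=float)    # shape (A, S)
        # mu is the initial state distribution.
        self.mu = np.asarray(mu, dtype=float)  # shape (S,)

        A, S, S2 = self.P.shape
        assert S == S2, "each P[a] must be square"
        # Every row of every P[a] must sum to one (right stochastic).
        assert np.allclose(self.P.sum(axis=2), 1.0)
        assert np.isclose(self.mu.sum(), 1.0)

# A two-state, two-action MDP with arbitrarily chosen numbers, for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
     [[0.5, 0.5], [0.0, 1.0]]]   # transitions under action 1
r = [[0.0, 1.0],                 # rewards r_0(s)
     [0.5, 0.2]]                 # rewards r_1(s)
mu = [1.0, 0.0]                  # always start in state 0
mdp = FiniteMDP(P, r, mu)
```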
The interaction protocol is similar to bandits. Before the game starts, the
initial state $S_1$ is sampled from $\mu$. In each round $t$ the learner observes the state
$S_t \in \mathcal{S}$, chooses an action $A_t \in \mathcal{A}$ and receives reward $r_{A_t}(S_t)$. The environment
then samples the next state $S_{t+1}$ from $P_{A_t}(S_t)$ and the interaction continues.
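
As a companion to the sketch above, here is a hedged example that simulates this interaction protocol for a fixed number of rounds. The function name simulate and the policy argument (any map from a state index to an action index, standing in for the learner) are assumed for illustration only.

```python
import numpy as np

def simulate(mdp, policy, n_rounds, rng=None):
    """Roll out the MDP interaction protocol for n_rounds rounds using the
    illustrative FiniteMDP sketch above; returns the total reward collected."""
    rng = np.random.default_rng() if rng is None else rng
    S = len(mdp.mu)
    # Sample the initial state S_1 from mu.
    s = rng.choice(S, p=mdp.mu)
    total_reward = 0.0
    for t in range(n_rounds):
        a = policy(s)                     # learner chooses A_t after observing S_t
        total_reward += mdp.r[a, s]       # learner receives r_{A_t}(S_t)
        s = rng.choice(S, p=mdp.P[a, s])  # environment samples S_{t+1} from P_{A_t}(S_t)
    return total_reward

# Example usage with the toy MDP above: always play action 0 for 100 rounds.
# print(simulate(mdp, policy=lambda s: 0, n_rounds=100))
```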