This means that $(S_t)_t$ is a Markov chain over the state space $\mathcal{S}$ with kernel $(P_x)_{x \in \mathcal{S}}$, and $U_t = u(S_t)$ is the utility of the state $S_t$ visited at time $t$. Let $\mathbb{P}_x$ be a measure satisfying the above conditions as well as $\mathbb{P}_x(S_1 = x) = 1$. As usual, let $\mathbb{E}_x$ denote the expectation with respect to $\mathbb{P}_x$. Define the value function $v : \mathcal{S} \to \mathbb{R}$ by
\[
v(x) = \sup_{\tau \in \mathcal{R}_1} \mathbb{E}_x[U_\tau], \tag{35.2}
\]
where $\mathcal{R}_1$ is the set of stopping times adapted to the filtration $\mathbb{F} = (\sigma(S_1, \ldots, S_t))_t$. The result provided by the next theorem is sufficient for our requirements.
Theorem 35.3. Assume for all $x \in \mathcal{S}$ that $U_\infty = \lim_{n \to \infty} U_n$ exists $\mathbb{P}_x$-a.s. and $\sup_{n \geq 1} |U_n|$ is $\mathbb{P}_x$-integrable. Then $v$ satisfies the Wald–Bellman equation,
\[
v(x) = \max\left\{ u(x), \int_{\mathcal{S}} v(y) \, P_x(dy) \right\} \qquad \text{for all } x \in \mathcal{S}.
\]
Furthermore, $\lim_{n \to \infty} v(S_n) = U_\infty$ $\mathbb{P}_x$-a.s., and the supremum in Eq. (35.2) is achieved by any stopping time $\tau$ such that for all $t$:
(a) $\tau \leq t$ on the event that $U_t > \int_{\mathcal{S}} v(y) \, P_{S_t}(dy)$.
(b) $\tau > t$ on the event that $U_t < \int_{\mathcal{S}} v(y) \, P_{S_t}(dy)$ and $\tau \geq t$.
A natural choice of stopping time satisfying conditions (a) and (b) in Theorem 35.3 is $\tau = \min\{t \geq 1 : v(S_t) = U_t\}$. The conditions allow for a possible indifference region where both stopping and continuing lead to the same expected utility.
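To make the Wald–Bellman recursion and this stopping rule concrete, here is a minimal sketch for a finite state space, where the integral reduces to a matrix–vector product. The function names, the restriction to finitely many states and the use of fixed-point iteration started from $u$ are illustrative assumptions rather than part of the theorem, which is stated for general measurable state spaces.
\begin{verbatim}
import numpy as np

def solve_wald_bellman(P, u, tol=1e-10, max_iter=100_000):
    # Fixed-point iteration v <- max(u, P v), started from v = u.
    # P is an (n, n) row-stochastic matrix with P[x, y] = P_x({y}),
    # and u is the (n,) vector of utilities u(x).  Starting from u the
    # iterates increase monotonically and, under conditions like those
    # of Theorem 35.3, their limit is the value function v.
    v = np.asarray(u, dtype=float).copy()
    for _ in range(max_iter):
        v_next = np.maximum(u, P @ v)
        if np.max(np.abs(v_next - v)) < tol:
            break
        v = v_next
    return v_next

def natural_stopping_time(trajectory, v, u):
    # The rule tau = min{t >= 1 : v(S_t) = U_t} from the text,
    # applied to a sampled trajectory of states (time is 1-indexed).
    for t, x in enumerate(trajectory, start=1):
        if np.isclose(v[x], u[x]):
            return t
    return len(trajectory)  # no stop triggered within this sample
\end{verbatim}
Iterating from $v_0 = u$ gives a non-decreasing sequence of finite-horizon stopping values, which is why the limit should be read as the value function only under integrability assumptions of the kind made in Theorem 35.3.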
The proof of Theorem 35.2 is straightforward (Exercise 35.2). Measurability
issues make the proof of Theorem 35.3 more technical (Exercise 35.3). Pointers
to the literature are given in the notes and a solution to the exercise is available.
35.3 1-armed bandits
The 1-armed bandit problem is a special case where the Bayesian optimal policy has a simple form that can often be computed efficiently. Before reading on, you might like to refresh your memory by looking at Exercises 4.9 and 8.2. Let $(\mathcal{E}, \mathcal{G}, Q, P)$ be a 2-armed Bayesian bandit environment where $P_{\nu 2} = \delta_{\mu_2}$ is a Dirac at a fixed constant $\mu_2 \in \mathbb{R}$ for all $\nu \in \mathcal{E}$. Because the mean of the second arm is known in advance, we call this a Bayesian 1-armed bandit problem. In Part (a) of Exercise 4.9 you showed that when the horizon is known it suffices to consider only retirement policies, which choose the first arm until some random time and then play the second arm until the end of the game. Since we care about Bayesian optimal policies, the result of Exercise 34.8 allows us to restrict our attention to deterministic retirement policies.
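To illustrate the structure of such policies, the sketch below implements a generic deterministic retirement policy. The act/observe interface and the user-supplied predicate should_retire are assumptions made for the sake of the example; choosing the retirement time optimally is exactly where the optimal stopping results above come in.
\begin{verbatim}
class RetirementPolicy:
    # Deterministic retirement policy: pull arm 1 until the predicate
    # `should_retire` fires on the observed reward history, then pull
    # arm 2 (whose mean mu_2 is known) for the rest of the game.
    # `should_retire` is a stand-in for the optimal stopping rule.

    def __init__(self, should_retire):
        self.should_retire = should_retire
        self.history = []        # rewards observed from arm 1 so far
        self.retired = False

    def act(self):
        if not self.retired and self.should_retire(self.history):
            self.retired = True  # retirement is irreversible
        return 2 if self.retired else 1

    def observe(self, reward):
        if not self.retired:     # only arm-1 plays are informative
            self.history.append(reward)
\end{verbatim}
For example, RetirementPolicy(lambda h: len(h) >= 10 and sum(h) / len(h) < mu_2) retires once the empirical mean of the first arm falls below $\mu_2$ after ten plays; this particular rule is only a hypothetical illustration, not the Bayesian optimal one.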