Bandit Algorithms

4.10 Exercises

# Returns the number of arms.
def K(self):
    pass

# Accepts a parameter 0 <= a <= K-1 and returns the
# realisation of random variable X with P(X = 1) being
# the mean of the (a+1)th arm.
def pull(self, a):
    pass

# Returns the regret incurred so far.
def regret(self):
    pass
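A minimal sketch of how these stubs might be completed for a Bernoulli bandit. It assumes the constructor receives the vector of arm means and that regret() should report the expected (pseudo-)regret accumulated so far; the class name and constructor are illustrative, not from the text.

import numpy as np

class BernoulliBandit:
    # means: a sequence of success probabilities, one per arm (illustrative).
    def __init__(self, means):
        self.means = np.asarray(means, dtype=float)
        self.best_mean = self.means.max()
        self.total_regret = 0.0

    # Returns the number of arms.
    def K(self):
        return len(self.means)

    # Accepts a parameter 0 <= a <= K-1 and returns the realisation of a
    # Bernoulli random variable whose mean is the mean of the (a+1)th arm.
    # The suboptimality gap of the chosen arm is added to the running regret
    # (one possible interpretation of "regret incurred so far").
    def pull(self, a):
        self.total_regret += self.best_mean - self.means[a]
        return float(np.random.rand() < self.means[a])

    # Returns the (expected) regret incurred so far.
    def regret(self):
        return self.total_regret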

4.7 (Follow-the-leader implementation) Implement the following simple
algorithm called ‘Follow-the-Leader’, which chooses each action once and
subsequently chooses the action with the largest average observed so far. Ties
should be broken randomly.
def FollowTheLeader(bandit, n):
    # implement the Follow-the-Leader algorithm by replacing
    # the code below that just plays the first arm in every round
    for t in range(n):
        bandit.pull(0)


Depending on the literature you are reading, Follow-the-Leader may be
called ‘stay with the winner’ or the ‘greedy algorithm’.
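One way to fill in the stub, given here as a sketch that assumes only the Bandit interface above (K() for the number of arms and pull(a) returning the observed reward); the bookkeeping arrays are illustrative:

import numpy as np

def FollowTheLeader(bandit, n):
    K = bandit.K()
    total_reward = np.zeros(K)  # sum of rewards observed for each arm
    pull_count = np.zeros(K)    # number of times each arm has been played
    for t in range(n):
        if t < K:
            # play each arm once in the first K rounds
            a = t
        else:
            # play the arm with the largest empirical mean,
            # breaking ties uniformly at random
            means = total_reward / pull_count
            best = np.flatnonzero(means == means.max())
            a = np.random.choice(best)
        total_reward[a] += bandit.pull(a)
        pull_count[a] += 1

For instance, with the sketched BernoulliBandit above, one could run bandit = BernoulliBandit([0.4, 0.5]) followed by FollowTheLeader(bandit, 1000) and then read off bandit.regret() to obtain one sample of the policy's regret.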

4.8 Suppose $\nu$ is a finite-armed stochastic bandit and $\pi$ is a policy such that

$$\lim_{n\to\infty} \frac{R_n(\pi, \nu)}{n} = 0\,.$$


Let $T^*(n) = \sum_{t=1}^n \mathbb{I}\{\mu_{A_t} = \mu^*\}$ be the number of times an optimal arm is chosen.
Prove or disprove each of the following statements:

(a) $\lim_{n\to\infty} \mathbb{E}[T^*(n)]/n = 1$.
(b) $\lim_{n\to\infty} \mathbb{P}(\Delta_{A_n} > 0) = 0$.


4.9 (1-armed bandits) Let $\mathcal{M}_1$ be a set of distributions on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with
finite means and $\mathcal{M}_2 = \{\delta_{\mu_2}\}$ be the singleton set with a Dirac at $\mu_2 \in \mathbb{R}$. The
set of bandits $\mathcal{E} = \mathcal{M}_1 \times \mathcal{M}_2$ is called a 1-armed bandit because, although
there are two arms, the second arm always yields a known reward of $\mu_2$. A policy
$\pi = (\pi_t)_t$ is called a retirement policy if, once action 2 has been played, it
is played until the end of the game. Precisely, if $a_t = 2$, then

$$\pi_{t+1}(2 \mid a_1, x_1, \ldots, a_t, x_t) = 1 \quad \text{for all } (a_s)_{s=1}^{t-1} \text{ and } (x_s)_{s=1}^{t}.$$

(a) Let $n$ be fixed and $\pi = (\pi_t)_{t=1}^n$ be any policy. Prove there exists a retirement
