# Returns the number of arms.
def K(self):
    pass

# Accepts a parameter 0 <= a <= K-1 and returns the
# realisation of random variable X with P(X = 1) being
# the mean of the (a+1)th arm.
def pull(self, a):
    pass

# Returns the regret incurred so far.
def regret(self):
    pass
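One possible completion of this skeleton, offered as a sketch rather than the definitive solution to the exercise: it assumes rewards are Bernoulli with the given means and that regret() should report the pseudo-regret accumulated from the gaps of the arms played so far. The class name BernoulliBandit, the __init__ signature and the use of numpy are illustrative choices.

import numpy as np

class BernoulliBandit:
    # Accepts a list of K >= 2 floats, each lying in [0, 1],
    # interpreted as the mean rewards of the arms.
    def __init__(self, means):
        self.means = np.array(means)
        self.best_mean = np.max(self.means)
        self.total_regret = 0.0

    # Returns the number of arms.
    def K(self):
        return len(self.means)

    # Samples a Bernoulli reward with success probability means[a]
    # and adds the suboptimality gap of arm a to the running regret.
    def pull(self, a):
        self.total_regret += self.best_mean - self.means[a]
        return float(np.random.rand() < self.means[a])

    # Returns the (pseudo-)regret incurred so far.
    def regret(self):
        return self.total_regret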
4.7 (Follow-the-leader implementation) Implement the following simple
algorithm called ‘Follow-the-Leader’, which chooses each action once and
subsequently chooses the action with the largest average observed so far. Ties
should be broken randomly.
def FollowTheLeader(bandit, n):
    # Implement the Follow-the-Leader algorithm by replacing
    # the code below that just plays the first arm in every round.
    for t in range(n):
        bandit.pull(0)
Depending on the literature you are reading, Follow-the-Leader may be
called ‘stay with the winner’ or the ‘greedy algorithm’.
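A possible implementation sketch, assuming the bandit object exposes the K(), pull(a) and regret() interface from the previous exercise; breaking ties with np.random.choice over the set of empirical maximisers is one illustrative choice, not the only valid one.

import numpy as np

def FollowTheLeader(bandit, n):
    K = bandit.K()
    sums = np.zeros(K)    # total reward observed from each arm
    counts = np.zeros(K)  # number of times each arm has been played
    for t in range(n):
        if t < K:
            a = t  # play each arm once in the first K rounds
        else:
            means = sums / counts
            # choose uniformly at random among the empirically best arms
            a = np.random.choice(np.flatnonzero(means == means.max()))
        x = bandit.pull(a)
        sums[a] += x
        counts[a] += 1

Running FollowTheLeader on an instance such as BernoulliBandit([0.5, 0.6]) for a few thousand rounds and then calling regret() gives a quick sanity check of both implementations.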
4.8 Suppose $\nu$ is a finite-armed stochastic bandit and $\pi$ is a policy such that
$\lim_{n\to\infty} R_n(\pi, \nu)/n = 0$.
Let $T^*(n) = \sum_{t=1}^n \mathbb{I}\{\mu_{A_t} = \mu^*\}$ be the number of times an optimal arm is chosen.
Prove or disprove each of the following statements:
(a) $\lim_{n\to\infty} \mathbb{E}[T^*(n)]/n = 1$.
(b) $\lim_{n\to\infty} \mathbb{P}(\Delta_{A_n} > 0) = 0$.
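A possible starting point (a hint, not a complete solution) is the regret decomposition lemma proved earlier in this chapter, combined with the identity relating $T^*(n)$ to the number of plays of suboptimal arms:
\[
R_n(\pi, \nu) = \sum_{a : \Delta_a > 0} \Delta_a\, \mathbb{E}[T_a(n)]
\qquad\text{and}\qquad
n - T^*(n) = \sum_{a : \Delta_a > 0} T_a(n)\,,
\]
so that whenever at least one arm is suboptimal and $\Delta_{\min} = \min\{\Delta_a : \Delta_a > 0\}$, we have $\mathbb{E}[n - T^*(n)] \le R_n(\pi, \nu)/\Delta_{\min}$.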
4.9 (1-armed bandits) Let $\mathcal{M}_1$ be a set of distributions on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with
finite means and $\mathcal{M}_2 = \{\delta_{\mu_2}\}$ be the singleton set with a Dirac at $\mu_2 \in \mathbb{R}$. The
set of bandits $\mathcal{E} = \mathcal{M}_1 \times \mathcal{M}_2$ is called a 1-armed bandit because, although
there are two arms, the second arm always yields a known reward of $\mu_2$. A policy
$\pi = (\pi_t)_t$ is called a retirement policy if, once action 2 has been played, it is
played until the end of the game. Precisely, if $a_t = 2$, then
$\pi_{t+1}(2 \mid a_1, x_1, \ldots, a_t, x_t) = 1$ for all $(a_s)_{s=1}^{t-1}$ and $(x_s)_{s=1}^{t}$.
(a) Let $n$ be fixed and $\pi = (\pi_t)_{t=1}^n$ be any policy. Prove there exists a retirement