Bandit Algorithms

4.10 Exercises

# Returns the number of arms.
def K(self):
    pass

# Accepts a parameter 0 <= a <= K-1 and returns the
# realisation of random variable X with P(X = 1) being
# the mean of the (a+1)th arm.
def pull(self, a):
    pass

# Returns the regret incurred so far.
def regret(self):
    pass
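A minimal sketch of how these stubs might be completed for a Bernoulli bandit. It assumes the constructor receives the vector of arm means and that regret() should report the expected (pseudo-)regret accumulated so far; the class name and constructor are illustrative, not from the text.

import numpy as np

class BernoulliBandit:
    # means: a sequence of success probabilities, one per arm (illustrative).
    def __init__(self, means):
        self.means = np.asarray(means, dtype=float)
        self.best_mean = self.means.max()
        self.total_regret = 0.0

    # Returns the number of arms.
    def K(self):
        return len(self.means)

    # Accepts a parameter 0 <= a <= K-1 and returns the realisation of a
    # Bernoulli random variable whose mean is the mean of the (a+1)th arm.
    # The suboptimality gap of the chosen arm is added to the running regret
    # (one possible interpretation of "regret incurred so far").
    def pull(self, a):
        self.total_regret += self.best_mean - self.means[a]
        return float(np.random.rand() < self.means[a])

    # Returns the (expected) regret incurred so far.
    def regret(self):
        return self.total_regret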

4.7 (Follow-the-leader implementation) Implement the following simple
algorithm called ‘Follow-the-Leader’, which chooses each action once and
subsequently chooses the action with the largest average observed so far. Ties
should be broken randomly.
def FollowTheLeader(bandit, n):
    # implement the Follow-the-Leader algorithm by replacing
    # the code below that just plays the first arm in every round
    for t in range(n):
        bandit.pull(0)


Depending on the literature you are reading, Follow-the-Leader may be
called ‘stay with the winner’ or the ‘greedy algorithm’.
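One way to fill in the stub, given here as a sketch that assumes only the Bandit interface above (K() for the number of arms and pull(a) returning the observed reward); the bookkeeping arrays are illustrative:

import numpy as np

def FollowTheLeader(bandit, n):
    K = bandit.K()
    total_reward = np.zeros(K)  # sum of rewards observed for each arm
    pull_count = np.zeros(K)    # number of times each arm has been played
    for t in range(n):
        if t < K:
            # play each arm once in the first K rounds
            a = t
        else:
            # play the arm with the largest empirical mean,
            # breaking ties uniformly at random
            means = total_reward / pull_count
            best = np.flatnonzero(means == means.max())
            a = np.random.choice(best)
        total_reward[a] += bandit.pull(a)
        pull_count[a] += 1

For instance, with the sketched BernoulliBandit above, one could run bandit = BernoulliBandit([0.4, 0.5]) followed by FollowTheLeader(bandit, 1000) and then read off bandit.regret() to obtain one sample of the policy's regret.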

4.8 Suppose $\nu$ is a finite-armed stochastic bandit and $\pi$ is a policy such that

$$\lim_{n\to\infty} \frac{R_n(\pi, \nu)}{n} = 0\,.$$


Let $T^*(n) = \sum_{t=1}^n \mathbb{I}\{\mu_{A_t} = \mu^*\}$ be the number of times an optimal arm is chosen.
Prove or disprove each of the following statements:

(a) $\lim_{n\to\infty} \mathbb{E}[T^*(n)]/n = 1$.
(b) $\lim_{n\to\infty} \mathbb{P}(\Delta_{A_n} > 0) = 0$.


4.9 (1-armed bandits) Let $\mathcal{M}_1$ be a set of distributions on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with
finite means and $\mathcal{M}_2 = \{\delta_{\mu_2}\}$ be the singleton set with a Dirac at $\mu_2 \in \mathbb{R}$. The
set of bandits $\mathcal{E} = \mathcal{M}_1 \times \mathcal{M}_2$ is called a 1-armed bandit because, although
there are two arms, the second arm always yields a known reward of $\mu_2$. A policy
$\pi = (\pi_t)_t$ is called a retirement policy if, once action 2 has been played, it
is played until the end of the game. Precisely, if $a_t = 2$, then

$$\pi_{t+1}(2 \mid a_1, x_1, \ldots, a_t, x_t) = 1 \quad \text{for all } (a_s)_{s=1}^{t-1} \text{ and } (x_s)_{s=1}^{t}.$$

(a) Let $n$ be fixed and $\pi = (\pi_t)_{t=1}^n$ be any policy. Prove there exists a retirement
