36.2 Frequentist analysis

1: Input: Cumulative distribution functions $F_1(1), \dots, F_k(1)$
2: for $t = 1, \dots, n$ do
3:   Sample $\theta_i(t) \sim F_i(t)$ for each $i$
4:   Choose $A_t = \operatorname{argmax}_{i \in [k]} \theta_i(t)$
5:   Observe $X_t$ and update:
       $F_i(t+1) = F_i(t)$ for $i \ne A_t$  and  $F_{A_t}(t+1) = \mathrm{Update}(F_{A_t}(t), A_t, X_t)$
6: end for
Algorithm 23: Follow the perturbed leader.

Thompson sampling is recovered by choosing $F_1(1), \dots, F_k(1)$ to be the cumulative
distribution functions of the mean reward of each arm under a prior that is
independent over the arms, and by letting $\mathrm{Update}$ be the function that updates
the posterior for the played arm. There are, however, many alternative ways to
configure this algorithm.
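
To make this configuration concrete, the following is a minimal sketch of Algorithm 23 specialised to Bernoulli rewards with independent Beta(1, 1) priors on each arm, so that sampling from $F_i(t)$ and the conjugate $\mathrm{Update}$ step recover Thompson sampling. The function name, the simulated environment and the chosen parameters are illustrative assumptions, not part of the text.

```python
import numpy as np

def follow_the_perturbed_leader(means, n, rng=None):
    """Algorithm 23 for Bernoulli rewards with Beta(1, 1) priors,
    which recovers Thompson sampling. `means` are the true arm means,
    used here only to simulate the rewards X_t."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(means)
    # Beta posterior parameters for each arm; F_i(1) is the Beta(1, 1) cdf.
    alpha = np.ones(k)
    beta = np.ones(k)
    total_reward = 0.0
    for t in range(n):
        # Sample theta_i(t) ~ F_i(t) for each arm.
        theta = rng.beta(alpha, beta)
        # Choose A_t = argmax_i theta_i(t).
        a = int(np.argmax(theta))
        # Observe X_t (simulated Bernoulli reward).
        x = float(rng.random() < means[a])
        total_reward += x
        # Update only the played arm: conjugate Beta update for F_{A_t}.
        alpha[a] += x
        beta[a] += 1.0 - x
    return total_reward

# Example usage: a two-armed Bernoulli bandit over 10,000 rounds.
print(follow_the_perturbed_leader(means=[0.5, 0.45], n=10_000))
```

Note that the posteriors of the unplayed arms are left untouched inside the loop, which is exactly the property exploited in the analysis below.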

The core property that we use in the analysis of Algorithm 23 (to be
presented soon) is that $F_i(t+1) = F_i(t)$ whenever $A_t \ne i$. When $\mathrm{Update}(\cdot)$
is a Bayesian update this corresponds to choosing an independent prior on
the distribution of each arm.

Let $F_{is}$ be the cumulative distribution function used for arm $i$ in the round $t$ when
$T_i(t-1) = s$. This quantity is defined even if $T_i(n) < s$ by using the stacked
model from Section 4.6.

Theorem 36.2. Let $i$ be an action and $\varepsilon \in \mathbb{R}$ be arbitrary. Then the expected
number of times Algorithm 23 plays action $i$ is bounded by
\[
\mathbb{E}[T_i(n)] \le 1 + \mathbb{E}\left[\sum_{s=0}^{n-1}\left(\frac{1}{G_{1s}} - 1\right)\right]
+ \mathbb{E}\left[\sum_{s=0}^{n-1}\mathbb{I}\{G_{is} > 1/n\}\right], \tag{36.3}
\]
where $G_{is} = 1 - F_{is}(\mu_1 - \varepsilon)$.


In applications $\varepsilon$ is normally chosen to be a small positive constant. In this case
the first sum in Eq. (36.3) measures the probability that the sample corresponding
to the first arm is nearly optimistic and tends to be smaller when the variance
of the perturbation is larger. The second sum measures the likelihood that
the sample from arm $i$ is close to $\mu_1$ and is small when the variance of the
perturbation is small. Balancing these two terms corresponds to optimizing the
exploration/exploitation tradeoff.
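
As a rough illustration of this tradeoff, the sketch below estimates $G_{is} = 1 - F_{is}(\mu_1 - \varepsilon)$ for the Bernoulli/Beta instance used earlier and forms Monte-Carlo estimates of the two sums in Eq. (36.3) under the stacked model, in which every arm receives $n$ i.i.d. rewards regardless of what is played. The instance, the choice $\varepsilon = 0.05$, the horizon and the number of simulations are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def bound_terms(mu, i, eps, n, sims=100, rng=None):
    """Monte-Carlo estimates of the two sums in Eq. (36.3) for a
    Bernoulli bandit with Beta(1, 1) priors, using the stacked model."""
    rng = np.random.default_rng() if rng is None else rng
    mu1 = mu[0]  # arm with index 0 plays the role of the optimal arm 1
    first_sum, second_sum = 0.0, 0.0
    for _ in range(sims):
        # Stacked rewards: n i.i.d. Bernoulli samples for arm 1 and arm i.
        x1 = rng.random(n) < mu1
        xi = rng.random(n) < mu[i]
        for s in range(n):
            # Beta(1 + successes, 1 + failures) posterior after s observations.
            a1, b1 = 1 + x1[:s].sum(), 1 + s - x1[:s].sum()
            ai, bi = 1 + xi[:s].sum(), 1 + s - xi[:s].sum()
            # G_{1s} and G_{is}: posterior mass above mu_1 - eps.
            g1 = 1.0 - beta_dist.cdf(mu1 - eps, a1, b1)
            gi = 1.0 - beta_dist.cdf(mu1 - eps, ai, bi)
            first_sum += 1.0 / g1 - 1.0
            second_sum += float(gi > 1.0 / n)
    return first_sum / sims, second_sum / sims

# Example usage: estimate both terms for a two-armed instance.
print(bound_terms(mu=[0.5, 0.45], i=1, eps=0.05, n=200))
```

Increasing the spread of the sampling distributions (for instance by flattening the posterior) tends to shrink the first estimate while inflating the second, which is the balance described above.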

Proof of Theorem 36.2. Let $\mathcal{F}_t = \sigma(A_1, X_1, \dots, A_t, X_t)$ and $E_i(t) = \{\theta_i(t) \le \mu_1 - \varepsilon\}$. By definition,
\[
\mathbb{P}(\theta_1(t) \ge \mu_1 - \varepsilon \mid \mathcal{F}_{t-1}) = G_{1 T_1(t-1)} \quad \text{a.s.}
\]