Bandit Algorithms

8.1 Asymptotically optimal UCB 114

as required.

Proof of Theorem 8.1 As usual, the starting point is the fundamental regret decomposition (Lemma 4.5),

Rn=

∑

i:∆i> 0

∆iE[Ti(n)].

The rest of the proof revolves around boundingE[Ti(n)]. Letibe the index of
some suboptimal arm. The main idea is to decomposeTi(n) into two terms. The
first measures the number of times the index of the optimal arm is less than
μ 1 −ε. The second term measures the number of times thatAt=iand its index
is larger thanμ 1 −ε.

Ti(n) =

∑n

t=1

I{At=i}≤

∑n

t=1

I

{

μˆ 1 (t−1) +

√

2 logf(t) T 1 (t−1)

≤μ 1 −ε

}

+

∑n

t=1

I

{

μˆi(t−1) +

√

2 logf(t) Ti(t−1)

≥μ 1 −εandAt=i

}

. (8.4)

The proof of the first part of the theorem is completed by bounding the expectation
of each of these two sums. Starting with the first, we again use Corollary 5.5:

E

[n ∑

t=1

I

{

ˆμ 1 (t−1) +

√

2 logf(t) T 1 (t−1)

≤μ 1 −ε

}]

≤

∑n

t=1

∑n

s=1

P

(

μˆ 1 s+

√

2 logf(t) s

≤μ 1 −ε

)

≤

∑n

t=1

∑n

s=1

exp









−

s

(√

2 logf(t) s +ε

) 2

2









≤

∑n

t=1

1

f(t)

∑n

s=1

exp

(

−

sε^2 2

)

≤

5

ε^2.

The first inequality follows from the union bound over all possible values of
T 1 (t−1). The last inequality is an algebraic exercise (Exercise 8.1). The function
f(t) was chosen precisely so this bound would hold. For the second term in(8.4)

Bandit Algorithms

∑

I

{

√

}

+

I

{

√

}

. (8.4)

E

I

{

√

}]

≤

P

(

√

)

≤









−

(√

) 2

2









≤

1

(

−

)

≤

5

Get our desktop app

Company

Features

Documentation

Resources