Bandit Algorithms

11.4 Regret analysis 149

The ratio in the product can be rewritten in terms ofPtby

Wt Wt− 1

=

∑k

j=1

exp(ηSˆt− 1 ,j) Wt− 1

exp(ηXˆtj) =

∑k

j=1

Ptjexp(ηXˆtj). (11.11)

We need the following facts:

exp(x)≤1 +x+x^2 for allx≤1 and 1 +x≤exp(x) for allx∈R.

Using these two inequalities leads to

Wt Wt− 1

≤1 +η

∑k

j=1

PtjXˆtj+η^2

∑k

j=1

PtjXˆ^2 tj

≤exp



η

∑k

j=1

PtjXˆtj+η^2

∑k

j=1

PtjXˆtj^2



. (11.12)

Notice that this was only possible becauseXˆtjis defined by Eq. (11.6), which ensures thatXˆtj ≤1 and would not have been true had we used Eq. (11.3). Combining Eq. (11.12) and Eq. (11.9),

exp

(

ηSˆni

)

≤kexp



ηSˆn+η^2

∑n

t=1

∑k

j=1

PtjXˆ^2 tj



.

Taking the logarithm of both sides, dividing byη >0 and reordering gives

Sˆni−Sˆn≤log(k) η

+η

∑n

t=1

∑k

j=1

PtjXˆtj^2. (11.13)

As noted earlier, the expectation of the left-hand side isRni. The first term on
the right-hand side is a constant, which leaves us to bound the expectation of the
second term. Lettingytj= 1−xtjandYt= 1−Xtand expanding the definition

Bandit Algorithms

=





. (11.12)

(

)





.

Get our desktop app

Company

Features

Documentation

Resources