Bandit Algorithms

(Jeff_L) #1
11.4 Regret analysis 149

The ratio in the product can be rewritten in terms ofPtby


Wt
Wt− 1

=


∑k

j=1

exp(ηSˆt− 1 ,j)
Wt− 1

exp(ηXˆtj) =

∑k

j=1

Ptjexp(ηXˆtj). (11.11)

We need the following facts:


exp(x)≤1 +x+x^2 for allx≤1 and 1 +x≤exp(x) for allx∈R.

Using these two inequalities leads to

Wt
Wt− 1

≤1 +η

∑k

j=1

PtjXˆtj+η^2

∑k

j=1

PtjXˆ^2 tj

≤exp


η

∑k

j=1

PtjXˆtj+η^2

∑k

j=1

PtjXˆtj^2


. (11.12)


Notice that this was only possible becauseXˆtjis defined by Eq. (11.6), which
ensures thatXˆtj ≤1 and would not have been true had we used Eq. (11.3).
Combining Eq. (11.12) and Eq. (11.9),

exp

(


ηSˆni

)


≤kexp


ηSˆn+η^2

∑n

t=1

∑k

j=1

PtjXˆ^2 tj


.


Taking the logarithm of both sides, dividing byη >0 and reordering gives


Sˆni−Sˆn≤log(k)
η


∑n

t=1

∑k

j=1

PtjXˆtj^2. (11.13)

As noted earlier, the expectation of the left-hand side isRni. The first term on
the right-hand side is a constant, which leaves us to bound the expectation of the
second term. Lettingytj= 1−xtjandYt= 1−Xtand expanding the definition

Free download pdf