32.4 Notes
model an adversary secretly chooses a sequence of sets $S_1,\ldots,S_n$ with $S_t \subseteq [\ell]$.
In each round $t$ the learner chooses $A_t \in \mathcal{A}$ and receives a reward $X_t(A_t)$, where
$X_t : \mathcal{A} \to [0,1]$ is given by $X_t(a) = \mathbb{I}\{S_t \cap \{a(1),\ldots,a(m)\} \neq \emptyset\}$. The feedback
is the position of the clicked item, which is $M_t = \min\{k \in [m] : A_t(k) \in S_t\}$.
The regret is
$$
R_n = \sum_{t=1}^n \big(X_t(a^*) - X_t(A_t)\big)\,,
$$
where $a^*$ is the optimal ranking in hindsight:
$$
a^* = \operatorname{argmax}_{a \in \mathcal{A}} \sum_{t=1}^n X_t(a)\,. \tag{32.6}
$$
Notice that this is the same as the cascade model when $S_t = \{i : Z_{ti} = 1\}$.
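A minimal Python sketch of one round of this adversarial model (the function name and variable names are illustrative placeholders, not notation from the text):

```python
def reward_and_feedback(action, S_t):
    """action: tuple of m distinct items from [ell] (the ranking shown);
    S_t: the set of items the adversary marked as attractive in round t."""
    clicked_positions = [k for k, item in enumerate(action, start=1) if item in S_t]
    # X_t(a) = I{S_t intersects the shown items}
    reward = 1 if clicked_positions else 0
    # M_t: position of the clicked item (None if nothing is clicked)
    feedback = min(clicked_positions) if clicked_positions else None
    return reward, feedback

# Example: the learner shows items (3, 7, 1) and the adversary chose S_t = {7, 9}.
print(reward_and_feedback((3, 7, 1), {7, 9}))  # (1, 2): reward 1, click at position 2
```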
5 A challenge in the ranked bandit model is that solving the offline problem (Eq. 32.6) for known $S_1,\ldots,S_n$ is NP-hard. How can one learn when finding an optimal solution to the offline problem is hard? First, hardness only matters if $|\mathcal{A}|$ is large. When $\ell$ and $m$ are not too large, exhaustive search is quite feasible. If this is not an option, one may use an approximation algorithm. It turns out that in a certain sense the best one can do is to use a greedy algorithm. We omit the details, but the highlight is that there exist efficient algorithms such that
$$
\mathbb{E}\left[\sum_{t=1}^n X_t(A_t)\right] \geq \left(1 - \frac{1}{e}\right) \max_{a \in \mathcal{A}} \sum_{t=1}^n X_t(a) - O\left(m\sqrt{n\ell\log(\ell)}\right)\,.
$$
See the article by Radlinski et al. [2008] for more details.
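To see where the $(1-1/e)$ factor comes from, note that the offline problem (32.6) is a maximum-coverage problem: choose $m$ items so that as many of the sets $S_1,\ldots,S_n$ as possible are hit. The sketch below shows the classical greedy routine for this offline problem; it is only an illustration under that reading, and the online algorithms of Radlinski et al. [2008] instead run a separate bandit per position rather than calling such a routine.

```python
def greedy_offline(sets, ell, m):
    """sets: list of the sets S_1,...,S_n (subsets of range(ell));
    returns m items approximately maximising the number of sets hit,
    with the usual (1 - 1/e) maximum-coverage guarantee."""
    uncovered = list(sets)
    chosen = []
    for _ in range(m):
        # pick the item hitting the largest number of still-uncovered sets
        best = max((i for i in range(ell) if i not in chosen),
                   key=lambda i: sum(i in S for S in uncovered))
        chosen.append(best)
        uncovered = [S for S in uncovered if best not in S]
    return chosen

# Example with ell = 5 items and three adversarial sets.
print(greedy_offline([{0, 1}, {1, 2}, {3}], ell=5, m=2))  # [1, 3]
```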
6 By modifying the reward function one can also define an adversarial variant of the document-based model. As in the previous note, the adversary secretly chooses $S_1,\ldots,S_n$ as subsets of $[\ell]$, but now the reward is
$$
X_t(a) = |S_t \cap \{a(1),\ldots,a(m)\}|\,.
$$
The feedback is the positions of the clicked items, $\{k \in [m] : A_t(k) \in S_t\}$. For this model there are no computational issues. In fact, the problem can be analyzed using a reduction to combinatorial semibandits, which we ask you to investigate in Exercise 32.3.
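The key structural point behind the reduction is that the reward decomposes into a sum of per-position indicators, each of which is observed, which is exactly semibandit feedback. A small illustrative sketch (names are placeholders; the reduction itself is the subject of Exercise 32.3):

```python
def document_based_round(action, S_t):
    """action: tuple of m distinct items; S_t: set of attractive items."""
    per_position = [1 if item in S_t else 0 for item in action]  # each indicator observed
    reward = sum(per_position)          # X_t(a) = |S_t ∩ {a(1),...,a(m)}|
    clicked_positions = {k + 1 for k, c in enumerate(per_position) if c}
    return reward, clicked_positions

print(document_based_round((3, 7, 1), {1, 7, 9}))  # (2, {2, 3})
```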
7 The position-based model can also be modelled in the adversarial setting by letting $S_{tk} \subset [\ell]$ for each $t \in [n]$ and $k \in [m]$, and defining the reward by
$$
X_t(a) = \sum_{k=1}^m \mathbb{I}\{a(k) \in S_{tk}\}\,.
$$
Again, the feedback is the positions of the clicked items, $\{k \in [m] : A_t(k) \in S_{tk}\}$.
This model can also be tackled using algorithms for combinatorial semibandits
(Exercise 32.4).
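For comparison with the previous sketch, here is the same kind of illustration for the position-based variant, where each position $k$ has its own adversarial set (again, the names are placeholders only):

```python
def position_based_round(action, S_tk):
    """action: tuple of m items; S_tk: list of m sets, one per position."""
    per_position = [1 if item in S else 0 for item, S in zip(action, S_tk)]
    reward = sum(per_position)          # X_t(a) = sum_k I{a(k) in S_{tk}}
    clicked_positions = {k + 1 for k, c in enumerate(per_position) if c}
    return reward, clicked_positions

print(position_based_round((3, 7, 1), [{3, 5}, {2}, {1}]))  # (2, {1, 3})
```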