Bandit Algorithms

32.4 Notes

model an adversary secretly chooses a sequence of sets $S_1, \ldots, S_n$ with $S_t \subseteq [\ell]$. In each round $t$ the learner chooses $A_t \in \mathcal{A}$ and receives a reward $X_t(A_t)$, where $X_t : \mathcal{A} \to [0,1]$ is given by $X_t(a) = \mathbb{I}\{S_t \cap \{a(1), \ldots, a(m)\} \neq \emptyset\}$. The feedback is the position of the first clicked item, which is $M_t = \min\{k \in [m] : A_t(k) \in S_t\}$.
The regret is
\[
R_n = \sum_{t=1}^n \big(X_t(a^*) - X_t(A_t)\big)\,,
\]
where $a^*$ is the optimal ranking in hindsight:
\[
a^* = \operatorname{argmax}_{a \in \mathcal{A}} \sum_{t=1}^n X_t(a)\,. \tag{32.6}
\]

Notice that this is the same as the cascade model when $S_t = \{i : Z_{ti} = 1\}$.
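The reward and feedback in the adversarial ranked model above can be sketched as follows. This is an illustrative fragment, not code from the text; the names (`ranking`, `clicked_set`) and the convention that items are integers in $[\ell]$ are assumptions.

```python
def reward(ranking, clicked_set, m):
    """X_t(a) = I{S_t intersects the shown prefix {a(1), ..., a(m)}}."""
    shown = set(ranking[:m])
    return 1 if shown & clicked_set else 0

def feedback(ranking, clicked_set, m):
    """M_t: position of the first clicked item, or None if no item is clicked."""
    for k, item in enumerate(ranking[:m], start=1):
        if item in clicked_set:
            return k
    return None

# Example round: ell = 5 items, m = 3 positions, adversary's set S_t = {2, 4}.
ranking = [3, 4, 1, 2, 0]
print(reward(ranking, {2, 4}, 3))    # 1: item 4 is shown, in position 2
print(feedback(ranking, {2, 4}, 3))  # 2
```

Note that the learner observes only $M_t$, from which the binary reward can be recovered ($X_t(A_t) = 1$ exactly when some position is clicked).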
5 A challenge in the ranked bandit model is that solving the offline problem in Eq. (32.6) for known $S_1, \ldots, S_n$ is NP-hard. How can one learn when even finding an optimal solution to the offline problem is hard? First, hardness only matters if $|\mathcal{A}|$ is large. When $\ell$ and $m$ are not too large, exhaustive search is quite feasible. If this is not an option, one may use an approximation algorithm. It turns out that in a certain sense the best one can do is to use a greedy algorithm. We omit the details, but the highlight is that there exist efficient algorithms such that


E


[n

t=1

Xt(At)

]



(


1 −


1


e

)


maxa∈A

∑n

t=1

Xt(a)−O

(


m


n`log(`)

)


.


See the article by Radlinski et al. [2008] for more details.
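For intuition about where the $(1 - 1/e)$ factor comes from: with $S_1, \ldots, S_n$ known, Eq. (32.6) asks for $m$ items such that as many sets $S_t$ as possible contain at least one of them, which is the maximum coverage problem. A minimal sketch of the classical greedy rule (function and variable names are illustrative):

```python
def greedy_ranking(sets, num_items, m):
    """Greedily pick m of the items 0..num_items-1, at each step taking the
    item that intersects the largest number of still-uncovered sets S_t."""
    uncovered = [set(s) for s in sets]
    picks = []
    for _ in range(m):
        best = max(range(num_items),
                   key=lambda i: sum(1 for s in uncovered if i in s))
        picks.append(best)
        # Sets hit by the chosen item are now covered; drop them.
        uncovered = [s for s in uncovered if best not in s]
    return picks

# Example: n = 4 rounds, ell = 4 items, m = 2 positions.
sets = [{0, 1}, {1, 2}, {2}, {3}]
print(greedy_ranking(sets, 4, 2))  # [1, 2]: covers three of the four sets
```

The standard analysis of greedy maximum coverage shows this achieves at least $(1 - 1/e)$ of the optimal coverage, matching the leading factor in the guarantee above.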
6 By modifying the reward function one can also define an adversarial variant of the document-based model. As in the previous note, the adversary secretly chooses $S_1, \ldots, S_n$ as subsets of $[\ell]$, but now the reward is
\[
X_t(a) = |S_t \cap \{a(1), \ldots, a(m)\}|\,.
\]
The feedback is the positions of the clicked items, $\{k \in [m] : A_t(k) \in S_t\}$. For this model there are no computational issues. In fact, the problem can be analyzed using a reduction to combinatorial semibandits, which we ask you to investigate in Exercise 32.3.
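The reason the semibandit reduction works is that the reward decomposes into a sum of per-position indicators, each of which is observed. A sketch with illustrative names:

```python
def doc_reward(ranking, clicked_set, m):
    """X_t(a) = |S_t intersect {a(1), ..., a(m)}|, written as a per-position sum."""
    per_position = [1 if item in clicked_set else 0 for item in ranking[:m]]
    # Semibandit feedback: every per-position indicator is observed
    # (the set of clicked positions), not just the total reward.
    return sum(per_position), per_position

total, observed = doc_reward([3, 4, 1, 2, 0], {2, 4}, 4)
print(total, observed)  # 2 [0, 1, 0, 1]
```

Each coordinate of `per_position` is a separately observed reward component, which is exactly the structure combinatorial semibandit algorithms exploit.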
7 The position-based model can also be adapted to the adversarial setting by letting $S_{tk} \subseteq [\ell]$ for each $t \in [n]$ and $k \in [m]$, and defining the reward by
\[
X_t(a) = \sum_{k=1}^m \mathbb{I}\{A_t(k) \in S_{tk}\}\,.
\]
Again, the feedback is the positions of the clicked items, $\{k \in [m] : A_t(k) \in S_{tk}\}$. This model can also be tackled using algorithms for combinatorial semibandits (Exercise 32.4).
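The only difference from the document-based variant is that each position $k$ gets its own set $S_{tk}$. A minimal sketch (names are illustrative, not from the text):

```python
def pos_reward(ranking, clicked_sets, m):
    """X_t(a) = sum over k of I{A_t(k) in S_tk}, one set S_tk per position k.
    Returns the reward and the observed set of clicked positions."""
    per_position = [1 if ranking[k] in clicked_sets[k] else 0 for k in range(m)]
    clicked_positions = [k + 1 for k in range(m) if per_position[k]]
    return sum(per_position), clicked_positions

# Example: m = 3 positions with per-position sets S_t1 = {3}, S_t2 = {0,1}, S_t3 = {2}.
print(pos_reward([3, 4, 2], [{3}, {0, 1}, {2}], 3))  # (2, [1, 3])
```

As in the previous note, the per-position indicators are individually observed, so the same combinatorial semibandit machinery applies.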