Bandit Algorithms

This material will be published by Cambridge University Press as Bandit Algorithms by Tor Lattimore and Csaba Szepesvari. This pre-publication version is free to view and download for personal use only. Not for re-distribution, sale, or use in derivative works.©Tor Lattimore and Csaba Szepesvari 2017. The latest version is available athttp://banditalgs.com. Feedback on any aspect is very welcome: [email protected]

30 Combinatorial Bandits

A combinatorial bandit is a linear bandit with a special kind of combinatorial action setA⊂{ 0 , 1 }d, which for constantm∈[d] satisfies

A⊆

{

a∈{ 0 , 1 }d:‖a‖ 1 ≤m

}

The setting is studied in both adversarial and stochastic models. We focus on the former in this chapter and discuss the latter in the notes. As usual the adversary chooses a sequence of loss vectorsy 1 ,...,ynwithyt∈Rdand the expected regret is

Rn=E

[

max a∈A

∑n

t=1

〈At−a,yt〉

]

In Chapters 27 and 28 we assumed thatyt∈ {y:supa∈A|〈a,y〉| ≤ 1 }. This restriction is not consistent with the applications we have in mind, so instead we assume thatyt∈[0,1]d, which by the definition ofAensures that|〈At,yt〉|≤m for allt. In the standard bandit model the learner observes〈At,yt〉in each round. In many applications for which combinatorial bandits are applied there is actually more information available. The simplest is thefull information setting where the learner observes the whole vectoryt. The full information setup is interesting, but does not have the flavor of a bandit problem and so we do not discuss it further. There is an inbetween setting where the learner receives semibanditfeedback, which is the vector (At 1 yt 1 ,...,Atdytd). SinceAti∈{ 0 , 1 } this is equivalent to observingytifor thoseiwhenAti= 1.

30.1 Applications

Shortest path problems LetG= (V,E) be a fixed graph with a finite set of verticesVand edgesE⊆V×V with|E|=d. The online shortest path problem is a game overnrounds between an adversary and a learner. Given fixed verticesu,v∈V, the learner’s objective in each round is to find the shortest path betweenuandv. At the beginning of the game the adversary chooses a sequence of vectorsy 1 ,...,ynwithytirepresenting the length of theith edge inEin roundt. In each round the learner chooses a path betweenuandv. The regret of the learner is the difference between the

Bandit Algorithms

30 Combinatorial Bandits

{

}

[

]

30.1 Applications

Get our desktop app

Company

Features

Documentation

Resources