The advantage of this model relative to the previous ones is that it offers more flexibility, and yet it is not so flexible that learning is impossible.
32.2 Policy
We now explain the policy for learning to rank when $v$ is unknown but satisfies Assumption 32.1. After the description is an illustration that may prove helpful.
Step 0: Initialization
The policy takes as input a confidence parameter $\delta \in (0, 1)$, together with $\ell$ and $m$. The policy maintains a binary relation $G_t \subseteq [\ell] \times [\ell]$. In the first round $t = 1$ the relation is empty: $G_1 = \emptyset$. You should think of $G_t$ as maintaining pairs $(i, j)$ for which the policy has proven with high probability that $\alpha(i) < \alpha(j)$. Ideally,
\[
G_t \subseteq \{(i, j) \in [\ell] \times [\ell] : \alpha(i) < \alpha(j)\}\,.
\]
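A minimal Python sketch of this bookkeeping, assuming items are labelled $0, \ldots, \ell - 1$ and the relation is stored as a set of pairs; the instance sizes below are hypothetical.

```python
# Hypothetical instance: ell items, m display positions, confidence delta.
ell, m = 10, 5
delta = 0.05          # confidence parameter, delta in (0, 1)

# G_t is stored as a set of pairs (i, j); membership of (i, j) encodes
# that the policy has proven with high probability that alpha(i) < alpha(j).
G = set()             # G_1 is empty: nothing has been proven in round 1
```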
Step 1: Defining a partition
In each round $t$ the learner computes a partition of the actions based on a topological sort according to relation $G_t$. Given $A \subseteq [\ell]$, define $\min_{G_t}(A)$ to be the set of minimal elements of $A$ according to relation $G_t$:
\[
\min\nolimits_{G_t}(A) = \{i \in A : (j, i) \notin G_t \text{ for all } j \in A\}\,.
\]
Then let $P_{t1}, P_{t2}, \ldots$ be the partition of $[\ell]$ defined inductively by
\[
P_{td} = \min\nolimits_{G_t}\left([\ell] \setminus \bigcup_{c=1}^{d-1} P_{tc}\right).
\]
Finally, let $M_t = \max\{d : P_{td} \neq \emptyset\}$. The reader should check that if $G_t$ does not have cycles, then $M_t$ is well defined and finite and that $P_{t1}, \ldots, P_{tM_t}$ is indeed a partition of $[\ell]$ (Exercise 32.5). The event that $G_t$ contains cycles is a failure event. In order for the policy to be well defined, we assume it chooses some arbitrary fixed action in this case.
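The peeling procedure above translates directly into code. A sketch under the same representation as before; the helper names `min_g` and `partition` are hypothetical.

```python
def min_g(G, A):
    """Minimal elements of A: those i that no j in A provably precedes."""
    return {i for i in A if all((j, i) not in G for j in A)}

def partition(G, ell):
    """Blocks P_1, ..., P_M of the topological sort, or None on a cycle."""
    remaining = set(range(ell))
    blocks = []
    while remaining:
        block = min_g(G, remaining)
        if not block:        # no minimal element left: G_t has a cycle,
            return None      # which is the failure event described above
        blocks.append(block)
        remaining -= block
    return blocks
```

For example, with $\ell = 4$ and $G = \{(0, 2), (1, 2), (2, 3)\}$, the call `partition(G, 4)` returns the blocks $\{0, 1\}, \{2\}, \{3\}$.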
Step 2: Choosing an action
Let $I_{t1}, \ldots, I_{tM_t}$ be a partition of $[\ell]$ defined inductively by
\[
I_{td} = \Bigl[\Bigl|\bigcup_{c \leq d} P_{tc}\Bigr|\Bigr] \setminus \Bigl[\Bigl|\bigcup_{c < d} P_{tc}\Bigr|\Bigr]\,.
\]
In words, $I_{td}$ is the set of $|P_{td}|$ consecutive positions immediately following those allotted to $P_{t1}, \ldots, P_{t,d-1}$.
Next let $\Sigma_t \subseteq \mathcal{A}$ be the set of actions $\sigma$ such that $\sigma(I_{td}) = P_{td}$ for all $d \in [M_t]$. The algorithm chooses $A_t$ uniformly at random from $\Sigma_t$. Intuitively, the policy first shuffles the items in $P_{t1}$ and uses these as the first $|P_{t1}|$ entries in the ranking. Then $P_{t2}$ is shuffled and the items are appended to the ranking. This process is repeated until the ranking is complete. For an item $i \in [\ell]$, we denote by $D_{ti}$ the unique index $d$ such that $i \in P_{td}$.
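Sampling uniformly from $\Sigma_t$ does not require enumerating it: shuffling each block independently and concatenating the results already gives a uniform draw. A sketch, assuming a ranking is represented as a list whose $k$-th entry is the item in position $k$; `choose_action` and `block_index` are hypothetical names.

```python
import random

def choose_action(blocks):
    """Sample a ranking uniformly at random from Sigma_t."""
    ranking = []
    for block in blocks:          # blocks = [P_1, ..., P_M] from Step 1
        items = sorted(block)
        random.shuffle(items)     # uniform permutation within the block
        ranking.extend(items)     # these items occupy the slots I_{td}
    return ranking

def block_index(blocks, i):
    """D_{ti}: the index d (1-based) of the block containing item i."""
    return next(d for d, block in enumerate(blocks, start=1) if i in block)
```

Each permutation in $\Sigma_t$ arises from exactly one combination of within-block orderings, so the draw is indeed uniform.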