Bandit Algorithms

(Jeff_L) #1
19.3 Regret analysis 232

The proof of Theorem 19.2 depends on the following lemma.
Lemma19.4.LetV 0 be positive definite andx 1 ,...,xn∈Rdbe a sequence of
vectors with‖xt‖ 2 ≤L <∞for allt∈[n]. Then


∑n

t=1

(


1 ∧‖xt‖^2 Vt−− (^11)


)


≤2 log

(


detVn
detV 0

)


≤ 2 dlog

(


traceV 0 +nL^2
ddet^1 /d(V 0 )

)


.


Proof Using that for anyu∈[0,1],u∧ 1 ≤2 ln(1 +u), we get
∑n

t=1

(


1 ∧‖xt‖^2 Vt−− (^11)


)


≤ 2



t

log

(


1 +‖xt‖^2 Vt−− (^11)


)


.


We now argue that this last expression is log


(


detVn
detV 0

)


. Fort≥1 we have


Vt=Vt− 1 +xtx>t =Vt^1 −/ 12 (I+Vt−−^11 /^2 xtx>tVt−−^11 /^2 )Vt^1 −/ 12.
Hence
det(Vt) = det(Vt− 1 ) det

(


I+Vt−−^11 /^2 xtx>tVt−−^11 /^2

)


= det(Vt− 1 )

(


1 +‖xt‖^2 V− 1
t− 1

)


,


where the second equality follows because the matrixI+yy>has eigenvalues
1 +‖y‖^22 and 1 as well as the fact that the determinant of a matrix is the product
of its eigenvalues. Putting things together we see that


det(Vn) = det(V 0 )

∏n

t=1

(


1 +‖xt‖^2 Vt−− (^11)


)


,


which is equivalent to the first inequality that we wanted to prove. To get the
second inequality note that by the inequality of arithmetic and geometric means,


det(Vn) =

∏d

i=1

λi≤

(


1


d

traceVn

)d

(


traceV 0 +nL^2
d

)d
,

whereλ 1 ,...,λdare the eigenvalues ofVn.


Proof of Theorem 19.2 By part (c) of Assumption 19.1 it suffices to prove the
bound on the event thatθ∗∈Ctfor all roundst∈[n]. LetA∗t=argmaxa∈At〈a,θ∗〉
be an optimal action for roundtandrtbe the instantaneous regret in roundt
defined by
rt=〈A∗t−At,θ∗〉.
Letθ ̃t∈Ctbe the parameter in the confidence set for which〈At,θ ̃t〉=UCBt(At).
Then using the fact thatθ∗∈Ctand the definition of the algorithm leads to


〈A∗t,θ∗〉≤UCBt(A∗t)≤UCBt(At) =〈At,θ ̃t〉.
Using Cauchy-Schwarz inequality and the assumption thatθ∗∈Ctand facts that
θ ̃t∈CtandCt⊆Etleads to

rt=〈At∗−At,θ∗〉≤〈At,θ ̃t−θ∗〉≤‖At‖Vt−− 11 ‖θ ̃t−θ∗‖Vt− 1 ≤ 2 ‖At‖Vt−− (^11)



βt.
Free download pdf