Bandit Algorithms

(Jeff_L) #1
19.3 Regret analysis 231

Soˆθtis an estimate ofθ∗, which makes it natural to chooseCtto be centered at
θˆt− 1. For what follows we will simply assume that the confidence setCtis closed
and satisfies
Ct⊆Et=

{


θ∈Rd:‖θ−θˆt− 1 ‖^2 Vt− 1 ≤βt

}


, (19.6)


where (βt)tis an increasing sequence of constants withβ 1 ≥1. The setEtis an
ellipsoid centered atθˆt− 1 and with principle axis being the eigenvectors ofVt
with corresponding lengths being the reciprocal of the eigenvalues. Notice that as
tgrows the matrixVthas increasing eigenvalues, which means the volume of the
ellipse is also shrinking (at least, providedβtdoes not grow too fast). As noted
beforehand, the next chapter will be devoted to show thatCt=Etis a natural
choice for carefully chosenβt. In the rest of this chapter, we simply examine the
consequence of using a confidence set satisfying Eq. (19.6) and assume all the
desirable properties.

19.3 Regret analysis


We prove a regret bound for LinUCB under the assumption that the confidence
intervals indeed contain the true parameter with high probability and boundedness
conditions on the action set and rewards.

Assumption19.1.The following hold:

(a)|〈a,θ∗〉|≤1 for alla∈

⋃n
t=1At.
(b)‖a‖ 2 ≤Lfor alla∈

⋃n
t=1At.
(c)There exists aδ∈(0,1) such that with probability 1−δ, for allt∈[n],
θ∗∈CtwhereCtsatisfies Eq. (19.6).

Theorem19.2. Under the conditions of Assumption 19.1 with probability 1 −δ
the regret of LinUCB satisfies

Rˆn≤


8 nβnlog

(


detVn
detV 0

)




8 dnβnlog

(


dλ+nL^2

)


.


Theorem 20.5 in the next chapter shows thatβnmay be chosen to be


βn=


λL+


2 log

(


1


δ

)


+dlog

(


dλ+nL^2

)


.


By choosingδ= 1/nwe obtain the follow corollary bounding the expected regret.

Corollary19.3.Under the conditions of Assumption 19.1 the expected regret
of LinUCB withδ= 1/nis bounded by

Rn≤Cd


nlog(nL),

whereC > 0 is a suitably large universal constant.
Free download pdf