Bandit Algorithms

(Jeff_L) #1
24.4 Unrealizable case 276

Next define matrixV∈Rp×dto be a block-diagonal matrix with 1×kblocks,
each containing the row vector (1, 2 ,...,k)>. For example, whenp= 3, we have

V=




1 ··· k 0 ··· 0 0 ··· 0
0 ··· 0 1 ··· k 0 ··· 0
0 ··· 0 0 ··· 0 1 ··· k


.


LetBt=V At∈[k]prepresent the vector of ‘base’ actions chosen by the learner
in each of thepbandits in roundt. The optimal action in theith bandit is
b∗i(θ) = argmaxb∈[k]θ(bi).

The regret can be decomposed into the regrets in thep‘base bandit’ problems (a
form of separability, again):


Rn(θ) =

∑p

i=1

∆Eθ

[n

t=1

I{Bti 6 =b∗i}

]


︸ ︷︷ ︸


Rni(θ)
Fori∈[p] we abbreviateθ(−i)= (θ(1),...,θ(i−1),θ(i+1),...,θ(p)). Then,

1
|Θ|p


θ∈Θp

Rn(θ) =^1
|Θ|p

∑p

i=1


θ∈Θp

Rni(θ)

=


∑p

i=1

1


|Θ|p−^1


θ(−i)∈Θp−^1

1


|Θ|



θ(i)∈Θ

Rni(θ)


1


8


∑p

i=1

1


|Θ|p−^1


θ(−i)∈Θp−^1


kn (24.6)

=^1


8


p


kn=^1
8


dpn.

Here, in the first equality we use the convention thatθdenotes the vector obtained
by ‘inserting’θ(i)intoθ(−i)at theith ‘block’. Other than this, the only tricky
step is the inequality, which follows by choosing ∆≈


k/nand repeating the
argument outlined in Exercise 15.2. We leave it to the reader to check the details
(Exercise 24.1).


24.4 Unrealizable case


An important generalization of the linear model is theunrealizablecase where
the mean rewards are not assumed to follow a linear model exactly. Suppose
thatA ⊂Rdis a finite set with|A| =kand thatXt=ηt+μ(At) where
μ:A→Ris an unknown function. Letθ∈Rdbe the parameter vector for which
supa∈A|〈θ,a〉−μ(a)|is as small as possible:
θ= argminθ∈Rdsup
a∈A


|〈θ,a〉−μ(a)|.
Free download pdf