Bandit Algorithms


Assumption 23.1. The following hold:

(a) (Sparse parameter) There exist known constants $m_0$ and $m_2$ such that $\|\theta_*\|_0 \le m_0$ and $\|\theta_*\|_2 \le m_2$.
(b) (Bounded mean rewards) $\langle a, \theta_*\rangle \le 1$ for all $a \in \mathcal{A}_t$ and all rounds $t$.
(c) (Subgaussian noise) The noise is conditionally 1-subgaussian: for all $\lambda \in \mathbb{R}$,
$$\mathbb{E}[\exp(\lambda \eta_t) \mid \mathcal{F}_{t-1}] \le \exp(\lambda^2/2) \quad \text{a.s.},$$
where $\mathcal{F}_t = \sigma(A_1, X_1, \ldots, A_t, X_t, A_{t+1})$.
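
As a concrete illustration of this setup, the following sketch samples rewards from the sparse linear model (23.1) under Assumption 23.1. The dimension, sparsity level and parameter vector are arbitrary choices for the example, not values from the text.

import numpy as np

# Sketch of a sparse linear bandit environment (illustrative values, not from the text).
rng = np.random.default_rng(0)

d, m0 = 10, 3
theta = np.zeros(d)
theta[:m0] = rng.uniform(-0.5, 0.5, size=m0)    # ||theta||_0 <= m0
theta /= max(1.0, np.linalg.norm(theta, 1))     # ||theta||_1 <= 1, so <a, theta> <= 1 for
                                                # all a in [-1, 1]^d, and ||theta||_2 <= m2 = 1

def pull(a):
    # Reward X_t = <a, theta> + eta_t. Standard Gaussian noise is 1-subgaussian:
    # E[exp(lambda * eta_t)] = exp(lambda^2 / 2), so Assumption 23.1(c) holds.
    return a @ theta + rng.standard_normal()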
Much ink has been spilled on what can be said about the speed of learning in
linear models like (23.1) when $(A_t)_t$ are passively generated and the parameter
vector is known to be sparse. Most results are phrased in terms of recovering $\theta_*$,
but there also exist a few results that quantify the error when predicting $X_t$. The
ideal outcome would be that the learning speed depends mostly on $m_0$, with only
a mild dependence on $d$. Almost all of these results come under the assumption that
the covariance matrix of the actions $(A_t)_t$ is well conditioned.

The condition number of a positive definite matrix $A$ is the ratio of its
largest and smallest eigenvalues. A matrix is well conditioned if it has a
small condition number.
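
For a symmetric positive definite matrix the condition number can be computed directly; a minimal numpy check (the matrix below is an arbitrary example):

import numpy as np

# Condition number = largest eigenvalue / smallest eigenvalue.
A = np.array([[2.0, 0.0],
              [0.0, 0.5]])
eigs = np.linalg.eigvalsh(A)        # eigenvalues of the symmetric matrix A
print(eigs.max() / eigs.min())      # 4.0
print(np.linalg.cond(A))            # 4.0, numpy's built-in (2-norm) condition number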

The details are a bit more complicated than just the conditioning, but the
main point is that the usual assumptions imposed on the covariance matrix of
the actions for passive learning are never satisfied when the actions are chosen
by a good bandit policy. The reason is simple: bandit algorithms want to choose
the optimal action as often as possible, which means the covariance matrix will
have an eigenvector that points (approximately) towards the optimal action with
a large corresponding eigenvalue, while the eigenvalues in the remaining directions
stay small. We need an approach that does not rely on such strong assumptions.
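
To see this effect numerically, the following sketch (an illustration added here, not from the text) plays one fixed action 99% of the time and a uniformly random one otherwise, then examines the conditioning of the empirical second-moment matrix of the actions:

import numpy as np

# Assumed illustration: a policy that mostly exploits one action yields a badly
# conditioned action covariance (second-moment) matrix.
rng = np.random.default_rng(1)
d, n = 5, 10_000
a_star = np.ones(d)                       # pretend this is the (near-)optimal action

explore = rng.random(n) < 0.01            # explore on roughly 1% of rounds
actions = np.where(explore[:, None],
                   rng.uniform(-1.0, 1.0, size=(n, d)),
                   a_star)

G = actions.T @ actions / n               # (1/n) * sum_t A_t A_t^T
print(np.linalg.cond(G))                  # very large: one dominant eigendirection
                                          # along a_star, tiny eigenvalues elsewhere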

23.2 Elimination on the hypercube


As a warmup, consider the case where the action set is the $d$-dimensional
hypercube: $\mathcal{A}_t = \mathcal{A} = [-1, 1]^d$. To reduce clutter we denote the true parameter
vector by $\theta = \theta_*$. The hypercube is notable as an action set because it enjoys
perfect separability. For each dimension $i \in [d]$ the value of $A_{ti} \in [-1, 1]$ can
be chosen independently of $A_{tj}$ for $j \ne i$. Because of this the optimal action is
$a^* = \operatorname{sign}(\theta)$, where
$$\operatorname{sign}(\theta)_i = \operatorname{sign}(\theta_i) =
\begin{cases}
1, & \text{if } \theta_i > 0\,; \\
0, & \text{if } \theta_i = 0\,; \\
-1, & \text{if } \theta_i < 0\,.
\end{cases}$$
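
In code, the optimal action and its mean reward, which equals $\|\theta\|_1$ since $\langle \operatorname{sign}(\theta), \theta\rangle = \sum_i |\theta_i|$, look as follows; the parameter vector below is an arbitrary example:

import numpy as np

# Separability on the hypercube: each coordinate is optimized independently.
theta = np.array([0.3, 0.0, -0.2])
a_star = np.sign(theta)                      # componentwise sign, 0 where theta_i = 0
print(a_star)                                # [ 1.  0. -1.]
print(a_star @ theta, np.abs(theta).sum())   # both 0.5: optimal mean reward = ||theta||_1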