Bandit Algorithms


Assumption 23.1. The following hold:

(a) (Sparse parameter) There exist known constants $m_0$ and $m_2$ such that $\|\theta_*\|_0 \le m_0$ and $\|\theta_*\|_2 \le m_2$.
(b) (Bounded mean rewards) $\langle a, \theta_*\rangle \le 1$ for all $a \in \mathcal{A}_t$ and all rounds $t$.
(c) (Subgaussian noise) The noise is conditionally 1-subgaussian: for all $\lambda \in \mathbb{R}$,
$$\mathbb{E}[\exp(\lambda \eta_t) \mid \mathcal{F}_{t-1}] \le \exp(\lambda^2/2) \quad \text{a.s.},$$
where $\mathcal{F}_t = \sigma(A_1, X_1, \ldots, A_t, X_t, A_{t+1})$.
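
As a concrete illustration of this setup, the following sketch samples rewards from the sparse linear model (23.1) under Assumption 23.1. The dimension, sparsity level and parameter vector are arbitrary choices for the example, not values from the text.

import numpy as np

# Sketch of a sparse linear bandit environment (illustrative values, not from the text).
rng = np.random.default_rng(0)

d, m0 = 10, 3
theta = np.zeros(d)
theta[:m0] = rng.uniform(-0.5, 0.5, size=m0)    # ||theta||_0 <= m0
theta /= max(1.0, np.linalg.norm(theta, 1))     # ||theta||_1 <= 1, so <a, theta> <= 1 for
                                                # all a in [-1, 1]^d, and ||theta||_2 <= m2 = 1

def pull(a):
    # Reward X_t = <a, theta> + eta_t. Standard Gaussian noise is 1-subgaussian:
    # E[exp(lambda * eta_t)] = exp(lambda^2 / 2), so Assumption 23.1(c) holds.
    return a @ theta + rng.standard_normal()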
Much ink has been spilled on what can be said about the speed of learning in
linear models like (23.1) when $(A_t)_t$ are passively generated and the parameter
vector is known to be sparse. Most results are phrased in terms of recovering $\theta_*$,
but there also exist a few results that quantify the error when predicting $X_t$. The
ideal outcome would be that the learning speed depends mostly on $m_0$, with only
a mild dependence on $d$. Almost all of these results come under the assumption that
the covariance matrix of the actions $(A_t)_t$ is well conditioned.

The condition number of a positive definite matrix $A$ is the ratio of its
largest and smallest eigenvalues. A matrix is well conditioned if it has a
small condition number.
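
For a symmetric positive definite matrix the condition number can be computed directly; a minimal numpy check (the matrix below is an arbitrary example):

import numpy as np

# Condition number = largest eigenvalue / smallest eigenvalue.
A = np.array([[2.0, 0.0],
              [0.0, 0.5]])
eigs = np.linalg.eigvalsh(A)        # eigenvalues of the symmetric matrix A
print(eigs.max() / eigs.min())      # 4.0
print(np.linalg.cond(A))            # 4.0, numpy's built-in (2-norm) condition number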

The details are a bit more complicated than just the conditioning, but the
main point is that the usual assumptions imposed on the covariance matrix of
the actions for passive learning are never satisfied when the actions are chosen
by a good bandit policy. The reason is simple: bandit algorithms want to choose
the optimal action as often as possible, which means the covariance matrix will
have an eigenvector that points (approximately) towards the optimal action with
a large corresponding eigenvalue, while the eigenvalues in the remaining directions
stay small. We need an approach that does not rely on such strong assumptions.
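
To see this effect numerically, the following sketch (an illustration added here, not from the text) plays one fixed action 99% of the time and a uniformly random one otherwise, then examines the conditioning of the empirical second-moment matrix of the actions:

import numpy as np

# Assumed illustration: a policy that mostly exploits one action yields a badly
# conditioned action covariance (second-moment) matrix.
rng = np.random.default_rng(1)
d, n = 5, 10_000
a_star = np.ones(d)                       # pretend this is the (near-)optimal action

explore = rng.random(n) < 0.01            # explore on roughly 1% of rounds
actions = np.where(explore[:, None],
                   rng.uniform(-1.0, 1.0, size=(n, d)),
                   a_star)

G = actions.T @ actions / n               # (1/n) * sum_t A_t A_t^T
print(np.linalg.cond(G))                  # very large: one dominant eigendirection
                                          # along a_star, tiny eigenvalues elsewhere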

23.2 Elimination on the hypercube


As a warmup, consider the case where the action set is the $d$-dimensional
hypercube: $\mathcal{A}_t = \mathcal{A} = [-1, 1]^d$. To reduce clutter we denote the true parameter
vector by $\theta = \theta_*$. The hypercube is notable as an action set because it enjoys
perfect separability. For each dimension $i \in [d]$ the value of $A_{ti} \in [-1, 1]$ can
be chosen independently of $A_{tj}$ for $j \ne i$. Because of this the optimal action is
$a^* = \operatorname{sign}(\theta)$, where
$$\operatorname{sign}(\theta)_i = \operatorname{sign}(\theta_i) =
\begin{cases}
1, & \text{if } \theta_i > 0\,; \\
0, & \text{if } \theta_i = 0\,; \\
-1, & \text{if } \theta_i < 0\,.
\end{cases}$$
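
In code, the optimal action and its mean reward, which equals $\|\theta\|_1$ since $\langle \operatorname{sign}(\theta), \theta\rangle = \sum_i |\theta_i|$, look as follows; the parameter vector below is an arbitrary example:

import numpy as np

# Separability on the hypercube: each coordinate is optimized independently.
theta = np.array([0.3, 0.0, -0.2])
a_star = np.sign(theta)                      # componentwise sign, 0 where theta_i = 0
print(a_star)                                # [ 1.  0. -1.]
print(a_star @ theta, np.abs(theta).sum())   # both 0.5: optimal mean reward = ||theta||_1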