Bandit Algorithms

4.3 Knowledge and environment classes 57

Name Symbol Definition

Bernoulli EBk {(B(μi))i:μ∈[0,1]k}

Uniform EUk {(U(ai,bi))i:a,b∈Rkwithai≤bifor alli}

Gaussian (known var.) ENk(σ^2 ) {(N(μi,σ^2 ))i:μ∈Rk}

Gaussian (unknown var.) ENk {(N(μi,σ^2 i))i:μ∈Rkandσ^2 ∈[0,∞)k}

Finite variance EVk(σ^2 ) {(Pi)i:VX∼Pi[X]≤σ^2 for alli}

Finite kurtosis EKurtk (κ) {(Pi)i: KurtX∼Pi[X]≤κfor alli}

Bounded support E[ka,b] {(Pi)i: Supp(Pi)⊆[a,b]}

Subgaussian ESGk(σ^2 ) {(Pi)i:Piisσ-subgaussian for alli}

Table 4.1Typical environment classes for stochastic bandits.Supp(P)is the (topological)
support of distributionP. The kurtosis of a random variableXis a measure of its tail
behavior and is defined byE[(X−E[X])^4 ]/V[X]^2. Subgaussian distributions have similar
properties to the Gaussian and will be defined in Chapter 5.

achieve a good performance in a larger class. In a way, the theory of finite-armed
stochastic bandits tries to quantify this expectation in a rigorous fashion.

Structured bandits
Environment classes that are not unstructured are called structured. Relaxing
the requirement that the environment class is a product set makes structured
bandit problems much richer than the unstructured setup. The following examples
illustrate the flexibility.

Example4.1. LetA={ 1 , 2 }andE={(B(θ),B(1−θ)) :θ∈[0,1]}. In this
environment class the learner does not know the mean of either arm, but can
learn the mean of both arms by playing just one. The knowledge of this structure
dramatically changes the difficulty of learning in this problem.

Example4.2 (Stochastic linear bandit).LetA⊂Rdandθ∈Rdand

νθ= (N(〈a,θ〉,1) :a∈A) andE={νθ:θ∈Rd}.

In this environment class the reward of an action is Gaussian and its mean is given
by the inner product between the action and some unknown parameter. Notice
that even ifAis extremely large, the learner can deduce the true environment
by playing justdactions that spanRd.

Example4.3. Consider an undirected graphGwith verticesV={ 1 ,...,|V|}
and edgesE ={ 1 ,...,|E|}. In each round the learner chooses a path from
vertex 1 to vertex|V|. Then each edgee∈[E] is removed from the graph with
probability 1−θefor unknownθ∈[0,1]|E|. The learner succeeds in reaching

Bandit Algorithms

Get our desktop app

Company

Features

Documentation

Resources