their destination if all the edges in their chosen path are present. This problem
can be formalized by letting A be the set of paths and
\[
\nu_\theta = \Big( B\Big(\prod_{e \in a} \theta_e\Big) : a \in A \Big)
\quad \text{and} \quad
\mathcal{E} = \big\{ \nu_\theta : \theta \in [0,1]^{|E|} \big\}\,.
\]
An important feature of structured bandits is that the learner can often
obtain information about some actions while never playing them.
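To make this construction concrete, here is a minimal sketch (in Python, with made-up edge reliabilities and paths, so all names and values are illustrative assumptions) of sampling rewards from a single instance ν_θ: the reward of a chosen path is a Bernoulli draw whose mean is the product of the reliabilities of its edges.

```python
import numpy as np

# Minimal sketch (hypothetical values): a structured bandit nu_theta in which
# each action is a path, given as a set of edge indices, and the reward of a
# path a is Bernoulli with mean prod_{e in a} theta_e.
rng = np.random.default_rng(0)

theta = np.array([0.9, 0.8, 0.95, 0.7])   # edge reliabilities, theta in [0, 1]^|E|
paths = [{0, 1}, {2, 3}, {0, 3}]          # the action set A: paths as sets of edges

def pull(path):
    """Sample one reward from nu_theta for the chosen path."""
    p = np.prod([theta[e] for e in path])  # prod_{e in a} theta_e
    return rng.binomial(n=1, p=p)          # Bernoulli reward in {0, 1}

means = [np.prod([theta[e] for e in a]) for a in paths]
print(means)  # approx [0.72, 0.665, 0.63]
```

Because paths can share edges, the mean rewards of different actions are coupled through θ, which is one way a learner can gain information about actions it never plays.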
4.4 The regret
In Chapter 1 we informally defined the regret as being the deficit suffered by the
learner relative to the optimal policy. Let ν = (P_a : a ∈ A) be a stochastic bandit
and define
\[
\mu_a(\nu) = \int_{-\infty}^{\infty} x \, dP_a(x)\,.
\]
Then let μ∗(ν) = max_{a∈A} μ_a(ν) be the largest mean of all the arms.
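As a simple illustration (an added example, not from the surrounding text): if P_a is a Bernoulli distribution with parameter p_a, the integral reduces to a finite sum,
\[
\mu_a(\nu) = \int_{-\infty}^{\infty} x \, dP_a(x) = 0 \cdot (1 - p_a) + 1 \cdot p_a = p_a\,.
\]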
We assume throughout that μ_a(ν) exists and is finite for all actions and
that argmax_{a∈A} μ_a(ν) is nonempty. The latter assumption could be relaxed
by carefully adapting all arguments using nearly optimal actions, but in
practice this is never required.
The regret of policy π on bandit instance ν is
\[
R_n(\pi, \nu) = n \mu^*(\nu) - \mathbb{E}\left[ \sum_{t=1}^{n} X_t \right], \tag{4.1}
\]
where the expectation is taken with respect to the measure on outcomes induced
by the interaction of π and ν. Minimizing the regret is equivalent to maximizing
the expectation of S_n, but the normalization inherent in the definition of the
regret is useful when stating results, which would otherwise need to be stated
relative to the optimal action.
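To connect definition (4.1) with something computable, the following sketch (a hypothetical setup: a three-armed Gaussian bandit with unit variance and a policy that chooses arms uniformly at random) estimates R_n by Monte Carlo, replacing the expectation with an average over independent runs.

```python
import numpy as np

# Minimal sketch: Monte Carlo estimate of the regret R_n(pi, nu) in (4.1) for a
# hypothetical Gaussian bandit nu and a policy pi that picks arms uniformly.
rng = np.random.default_rng(1)

mu = np.array([0.1, 0.5, 0.3])   # arm means mu_a(nu)
mu_star = mu.max()               # mu^*(nu), the largest mean
n, runs = 1000, 200              # horizon and number of independent runs

total = 0.0
for _ in range(runs):
    arms = rng.integers(0, len(mu), size=n)   # A_t drawn uniformly at random from A
    rewards = rng.normal(mu[arms], 1.0)       # X_t ~ N(mu_{A_t}, 1)
    total += rewards.sum()

regret = n * mu_star - total / runs           # n mu^*(nu) - E[sum_t X_t]
print(regret)  # close to n * (mu_star - mu.mean()) = 200 for this policy
```

For this particular policy the regret grows linearly in n, since the expected reward per round is just the average of the arm means.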
If the context is clear we will often drop the dependence on ν and π in various
quantities, for example by writing R_n = nμ∗ − E[∑_{t=1}^n X_t]. Similarly, the
limits in sums and maxima are abbreviated when we think you can work
out the ranges of symbols in a unique way. For example: μ∗ = max_i μ_i.