18.3 Exp4
Exp4 is not just an increased version number, but stands for Exponential weighting for Exploration and Exploitation with Experts. The idea of the algorithm is very simple. Since exponential weighting worked so well in the standard bandit problem, we adapt it to the problem at hand. However, since the goal is to compete with the best expert in hindsight, it is not the actions that we score, but the experts. Exp4 thus maintains a probability distribution $Q_t$ over experts and uses this to come up with the next action in the obvious way, by first choosing an expert $M_t$ at random from $Q_t$ and then following the chosen expert's advice to choose $A_t \sim E^{(t)}_{M_t}$. The reader is invited to check for themselves that this is the same as sampling $A_t$ from $P_t = Q_t E^{(t)}$, where $Q_t$ is treated as a row vector.
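For concreteness, the check amounts to conditioning on the chosen expert: for any action $a$,
$$\mathbb{P}(A_t = a) = \sum_{m=1}^{M} \mathbb{P}(M_t = m)\,\mathbb{P}(A_t = a \mid M_t = m) = \sum_{m=1}^{M} Q_{tm} E^{(t)}_{ma} = \bigl(Q_t E^{(t)}\bigr)_a\,.$$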
Once the action is chosen, one can use one's favorite reward estimation procedure to estimate the rewards of all the actions, which are then used to estimate how much total reward each individual expert would have accumulated so far. The reward estimates are then used to update $Q_t$ using exponential weighting.
The pseudocode of Exp4 is given in Algorithm 11.
1: Input: $n$, $k$, $M$, $\eta$, $\gamma$
2: Set $Q_1 = (1/M, \dots, 1/M) \in [0,1]^{1 \times M}$ (a row vector)
3: for $t = 1, \dots, n$ do
4:     Receive advice $E^{(t)}$
5:     Choose the action $A_t \sim P_t$, where $P_t = Q_t E^{(t)}$
6:     Receive the reward $X_t = x_{tA_t}$
7:     Estimate the action rewards: $\hat{X}_{ti} = 1 - \dfrac{\mathbb{I}\{A_t = i\}}{P_{ti} + \gamma}\,(1 - X_t)$
8:     Propagate the rewards to the experts: $\tilde{X}_t = E^{(t)} \hat{X}_t$
9:     Update the distribution $Q_t$ using exponential weighting:
$$Q_{t+1,i} = \frac{\exp(\eta \tilde{X}_{ti})\, Q_{ti}}{\sum_j \exp(\eta \tilde{X}_{tj})\, Q_{tj}} \qquad \text{for all } i \in [M]$$
10: end for

Algorithm 11: Exp4.
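To make the pseudocode concrete, here is a minimal NumPy sketch of Algorithm 11. The callables `get_advice` and `get_reward` are hypothetical stand-ins for the environment and are not part of the text; the sketch follows the pseudocode line by line, so lines 5 and 8 cost $O(Mk)$ per round rather than the $O(M + k)$ discussed below.

```python
import numpy as np


def exp4(n, k, M, eta, gamma, get_advice, get_reward, rng=None):
    """Run Exp4 for n rounds; return the final distribution over the M experts.

    get_advice(t) -> (M, k) array whose rows are probability vectors (E^{(t)});
    get_reward(t, a) -> reward X_t in [0, 1] for playing action a.
    Both callables are hypothetical stand-ins for the environment.
    """
    rng = np.random.default_rng() if rng is None else rng
    Q = np.full(M, 1.0 / M)              # line 2: Q_1 uniform over experts
    for t in range(n):
        E = get_advice(t)                # line 4: advice matrix E^{(t)}
        P = Q @ E                        # line 5: P_t = Q_t E^{(t)}
        A = rng.choice(k, p=P)           # sample A_t ~ P_t
        X = get_reward(t, A)             # line 6: observe the reward
        Xhat = np.ones(k)                # line 7: importance-weighted estimates
        Xhat[A] = 1.0 - (1.0 - X) / (P[A] + gamma)
        Xtilde = E @ Xhat                # line 8: expert-level estimates
        # line 9: exponential-weights update (shifted for numerical stability)
        w = Q * np.exp(eta * (Xtilde - np.max(Xtilde)))
        Q = w / np.sum(w)
    return Q
```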
The algorithm uses $O(M)$ memory and $O(M + k)$ computation per round (when sampling in two steps). Hence it is only practical when both $M$ and $k$ are reasonably small.
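To see why $O(M + k)$ suffices even though lines 5 and 8 of Algorithm 11 look like $O(Mk)$ matrix-vector products, note that $\hat{X}_t$ differs from the all-ones vector only in coordinate $A_t$, and that each row of $E^{(t)}$ is a probability vector. Writing $e_{A_t}$ for the $A_t$-th standard basis vector,
$$\tilde{X}_t = E^{(t)} \hat{X}_t = E^{(t)} \bigl(\mathbf{1} - (1 - \hat{X}_{tA_t})\, e_{A_t}\bigr) = \mathbf{1} - (1 - \hat{X}_{tA_t})\, E^{(t)}_{\cdot A_t}\,,$$
which costs $O(M)$ to compute, and the only entry of $P_t$ that is actually needed, $P_{tA_t} = \sum_m Q_{tm} E^{(t)}_{mA_t}$, costs another $O(M)$. Together with the two-step sampling of $A_t$ (first $M_t \sim Q_t$, then $A_t \sim E^{(t)}_{M_t}$), this gives $O(M + k)$ per round.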
18.4 Regret analysis
We restrict our attention to the case when $\gamma = 0$, which is the original algorithm. The version where $\gamma > 0$ is called Exp4-IX and its analysis is left for Exercise 18.3.
Theorem 18.1. Let $\gamma = 0$ and $\eta = \sqrt{2 \log(M)/(nk)}$ and denote by $R_n$ the