contexts, even as the specific distribution of rewards changes. If RPE channels with particular levels of optimism are selectively activated with optogenetics, this should sculpt the learned distribution, which should in turn be detectable with behavioural measures of sensitivity to moments of the distribution. We list further predictions in the Supplementary Information.
Distributional RL also gives rise to a number of broader questions. What are the circuit- or cellular-level mechanisms that give rise to a diversity of asymmetries in positive versus negative RPE scaling? It is also worth considering whether other mechanisms, aside from asymmetric scaling of RPEs, might contribute to distributional coding. It is well established, for example, that positive and negative RPEs differentially engage striatal D1 and D2 dopamine receptors^21, and that the balance of these receptors varies anatomically^22-24. This suggests a second potential mechanism for differential learning from positive versus negative RPEs^25. Moreover, how do different RPE channels anatomically couple with their corresponding reward predictions (see Extended Data Fig. 4i-k)? Finally, what effects might distributional coding have downstream, at the level of action learning and selection? With this question in mind, it is notable that some current theories in behavioural economics centre on risk measures that can be easily read out from the kind of distributional codes that the present work has considered.
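To make the asymmetric-scaling mechanism concrete, the following minimal Python sketch (our own illustration, not the authors' code; the reward magnitudes, learning rates and channel asymmetries are assumed for the example) shows how value predictors that weight positive and negative RPEs differently converge to different expectiles of the same reward distribution, so that the population as a whole encodes the distribution rather than only its mean.

import numpy as np

rng = np.random.default_rng(0)

def sample_reward():
    # Assumed reward magnitudes (microlitres), echoing the variable-magnitude task.
    return rng.choice([0.1, 2.5, 5.0, 10.0, 20.0])

# Each "RPE channel" has its own learning rates for positive and negative RPEs.
taus = np.linspace(0.1, 0.9, 9)        # assumed range of asymmetries
alpha_pos = 0.01 * taus                # learning rate applied when RPE > 0
alpha_neg = 0.01 * (1.0 - taus)        # learning rate applied when RPE <= 0
values = np.zeros_like(taus)           # one learned prediction per channel

for _ in range(50_000):
    r = sample_reward()
    delta = r - values                 # per-channel reward prediction error
    values += np.where(delta > 0, alpha_pos, alpha_neg) * delta

# Channels with tau > 0.5 settle above the mean reward (optimistic expectiles),
# channels with tau < 0.5 settle below it (pessimistic expectiles).
print(np.round(values, 2))

In this scheme a channel's asymmetry, tau = alpha_pos / (alpha_pos + alpha_neg), determines how optimistic its learned prediction is, and reading out the learned values across channels carries information about the spread and skew of the reward distribution, which is what the decoding analysis in Fig. 5 exploits.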
Finally, we speculate on the implications of the distributional hypothesis of dopamine for the mechanisms of mental disorders such as addiction and depression. Mood has been linked with predictions of future reward^26, and it has been proposed that both depression and bipolar disorder may involve biased forecasts concerning value-laden outcomes^27. It has recently been proposed that such biases may arise from asymmetries in RPE coding^28,29. There are clear potential connections between these ideas and the phenomena we have reported here, presenting opportunities for further research.


Online content


Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-019-1924-6.



1. Schultz, W., Stauffer, W. R. & Lak, A. The phasic dopamine signal maturing: from reward via behavioural activation to formal economic utility. Curr. Opin. Neurobiol. 43, 139–148 (2017).
2. Glimcher, P. W. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proc. Natl Acad. Sci. USA 108, 15647–15654 (2011).
3. Watabe-Uchida, M., Eshel, N. & Uchida, N. Neural circuitry of reward prediction error. Annu. Rev. Neurosci. 40, 373–394 (2017).
4. Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H. & Tanaka, T. Parametric return density estimation for reinforcement learning. In Proc. 26th Conference on Uncertainty in Artificial Intelligence (eds Grunwald, P. & Spirtes, P.) http://dl.acm.org/citation.cfm?id=3023549.3023592 (2010).
5. Bellemare, M. G., Dabney, W. & Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 449–458 (2017).
6. Dabney, W., Rowland, M., Bellemare, M. G. & Munos, R. Distributional reinforcement learning with quantile regression. In AAAI Conference on Artificial Intelligence (2018).
7. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction Vol. 1 (MIT Press, 1998).
8. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
9. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
10. Hessel, M. et al. Rainbow: combining improvements in deep reinforcement learning. In 32nd AAAI Conference on Artificial Intelligence (2018).
11. Botvinick, M. M., Niv, Y. & Barto, A. G. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition 113, 262–280 (2009).
12. Wang, J. X. et al. Prefrontal cortex as a meta-reinforcement learning system. Nat. Neurosci. 21, 860–868 (2018).
13. Song, H. F., Yang, G. R. & Wang, X. J. Reward-based training of recurrent neural networks for cognitive and value-based tasks. eLife 6, e21492 (2017).
14. Barth-Maron, G. et al. Distributed distributional deterministic policy gradients. In International Conference on Learning Representations https://openreview.net/forum?id=SyZipzbCb (2018).
15. Dabney, W., Ostrovski, G., Silver, D. & Munos, R. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning (2018).
16. Pouget, A., Beck, J. M., Ma, W. J. & Latham, P. E. Probabilistic brains: knowns and unknowns. Nat. Neurosci. 16, 1170–1178 (2013).
17. Lammel, S., Lim, B. K. & Malenka, R. C. Reward and aversion in a heterogeneous midbrain dopamine system. Neuropharmacology 76, 351–359 (2014).
18. Fiorillo, C. D., Tobler, P. N. & Schultz, W. Discrete coding of reward probability and uncertainty by dopamine neurons. Science 299, 1898–1902 (2003).
19. Eshel, N. et al. Arithmetic and local circuitry underlying dopamine prediction errors. Nature 525, 243–246 (2015).
20. Rowland, M. et al. Statistics and samples in distributional reinforcement learning. In International Conference on Machine Learning (2019).
21. Frank, M. J., Seeberger, L. C. & O'Reilly, R. C. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science 306, 1940–1943 (2004).
22. Hirvonen, J. et al. Striatal dopamine D1 and D2 receptor balance in twins at increased genetic risk for schizophrenia. Psychiatry Res. Neuroimaging 146, 13–20 (2006).
23. Piggott, M. A. et al. Dopaminergic activities in the human striatum: rostrocaudal gradients of uptake sites and of D1 and D2 but not of D3 receptor binding or dopamine. Neuroscience 90, 433–445 (1999).
24. Rosa-Neto, P., Doudet, D. J. & Cumming, P. Gradients of dopamine D1- and D2/3-binding sites in the basal ganglia of pig and monkey measured by PET. Neuroimage 22, 1076–1083 (2004).
25. Mikhael, J. G. & Bogacz, R. Learning reward uncertainty in the basal ganglia. PLOS Comput. Biol. 12, e1005062 (2016).
26. Rutledge, R. B. et al. A computational and neural model of momentary subjective well-being. Proc. Natl Acad. Sci. USA 111, 12252–12257 (2014).
27. Huys, Q. J., Daw, N. D. & Dayan, P. Depression: a decision-theoretic analysis. Annu. Rev. Neurosci. 38, 1–23 (2015).
28. Bennett, D. & Niv, Y. Opening Burton's clock: psychiatric insights from computational cognitive models. Preprint at https://doi.org/10.31234/osf.io/y2vzu (2018).
29. Tian, J. & Uchida, N. Habenula lesions reveal that multiple mechanisms underlie dopamine prediction errors. Neuron 87, 1304–1316 (2015).
30. Eshel, N., Tian, J., Bukwich, M. & Uchida, N. Dopamine neurons share common response function for reward prediction error. Nat. Neurosci. 19, 479–486 (2016).

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


© The Author(s), under exclusive licence to Springer Nature Limited 2020

[Figure 5 appears here. Panels: a, distributional TD decoding; b, classical TD decoding; c, d, decoded distribution (DA); e, sensitivity to distribution. Axes show reward (0.1–20 μl, or arbitrary units), density, probability of reward, and loss (log scale).]
Fig. 5 | Decoding reward distributions from neural responses. a, Distributional TD simulation trained on the variable-magnitude task, whose actual (smoothed) distribution of rewards is shown in grey. After training the model, we interpret the learned values as a set of expectiles. We then decode the set of expectiles into a probability density (blue traces). Multiple solutions are shown in light blue, and the average across solutions is shown in dark blue. (See Methods for more details.) b, Same as a, but with a classical TD simulation. c, Same as a, but using data from recorded dopamine cells. The expectiles are defined by the reversal points and the relative scaling from the slopes of positive and negative RPEs, as shown in Fig. 4. Unlike the classical TD simulation, the real dopamine cells collectively encode the shape of the reward distribution that animals have been trained to expect. d, Same decoding analysis, using data from each of the cue conditions in the variable-probability task, based on cue responses of dopamine neurons (decoding for GABAergic neurons shown in Extended Data Fig. 8i, j). e, The neural data for both dopamine and GABAergic neurons were best fit by Bernoulli distributions closely approximating the ground-truth reward probabilities in all three cue conditions.
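To illustrate the decoding step described for panel a, the sketch below imputes a set of samples whose expectiles match a given set of asymmetries and reversal points; a histogram or kernel density estimate of those samples then stands in for the decoded density. This is a minimal reconstruction under our own assumptions (a generic SciPy optimizer, made-up example values, and hypothetical function names); the procedure actually used is described in the Methods.

import numpy as np
from scipy.optimize import minimize

def expectile_violation(samples, taus, expectiles):
    # For each (tau_i, e_i) pair, e_i is exactly the tau_i-expectile of `samples`
    # when the asymmetrically weighted mean error below is zero.
    total = 0.0
    for tau, e in zip(taus, expectiles):
        d = samples - e
        w = np.where(d > 0, tau, 1.0 - tau)
        total += np.mean(w * d) ** 2
    return total

def impute_samples(taus, expectiles, n_samples=50, seed=0):
    # Search for a set of samples whose expectiles approximate the given ones.
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(min(expectiles), max(expectiles), n_samples)
    result = minimize(expectile_violation, x0, args=(taus, expectiles),
                      method="L-BFGS-B")
    return np.sort(result.x)

# Example with made-up asymmetries and reversal points (not the recorded data):
taus = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
expectiles = np.array([1.0, 3.0, 5.0, 7.0, 12.0])
samples = impute_samples(taus, expectiles)
# A histogram or kernel density estimate of `samples` gives the decoded density.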