PLoS Comput Biol. 2015 Dec 23;11(12):e1004622. doi: 10.1371/journal.pcbi.1004622. eCollection 2015 Dec.

Tamping Ramping: Algorithmic, Implementational, and Computational Explanations of Phasic Dopamine Signals in the Accumbens

Kevin Lloyd et al. PLoS Comput Biol.

Abstract

Substantial evidence suggests that the phasic activity of dopamine neurons represents reinforcement learning's temporal difference prediction error. However, recent reports of ramp-like increases in dopamine concentration in the striatum when animals are about to act, or are about to reach rewards, appear to pose a challenge to established thinking. This is because the implied activity is persistently predictable by preceding stimuli, and so cannot arise as this sort of prediction error. Here, we explore three possible accounts of such ramping signals: (a) the resolution of uncertainty about the timing of action; (b) the direct influence of dopamine over mechanisms associated with making choices; and (c) a new model of discounted vigour. Collectively, these suggest that dopamine ramps may be explained, with only minor disturbance, by standard theoretical ideas, though urgent questions remain regarding their proximal cause. We suggest experimental approaches to disentangling which of the proposed mechanisms are responsible for dopamine ramps.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Phasic dopamine signals resemble a temporal difference error.
(A) Changes in extracellular dopamine concentration (Δ[DA]) in the nucleus accumbens (NAc) core before (left; single trial) and after (right; mean + SEM) experience of repeated pairings between a predictive cue (horizontal black bar) and a reward (inverted black triangle) delivered at cue offset. Initially, a phasic increase in dopamine is observed at the time of reward delivery. After repeated experience of the relationship between cue and reward, a phasic increase is observed at the time of cue onset, but not at the time of reward, which is still delivered. Adapted from [6], with permission. (B) Models based on temporal difference (TD) learning predict transfer of the TD error δ_t from the time of reward (‘R’; left) to the time of the predictive cue (‘CS’; right) over the course of learning, for both trace and delay conditioning.
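The transfer in panel (B) can be reproduced with a few lines of tabular TD(0). The following is an illustrative sketch, not the authors' code: the cue is assumed to arrive unpredictably, so the state preceding cue onset has value zero, and reward follows the cue deterministically T steps later.

```python
import numpy as np

# Minimal tabular TD(0) sketch (illustrative; not the paper's code).
# A cue at t = 0 deterministically predicts reward T steps later; the
# TD error delta_t = r_t + V(s_{t+1}) - V(s_t) migrates over trials
# from the time of reward to the time of the cue (cf. panel B).
T = 5                  # steps between cue onset and reward
alpha = 0.2            # learning rate
V = np.zeros(T + 1)    # values of the cue-evoked states 0..T

def run_trial(V):
    """One cue-reward trial; returns TD errors at cue onset and at reward."""
    # Cue onset is unpredictable, so the preceding (ITI) state has value 0:
    delta_cue = V[0] - 0.0
    for t in range(T):
        delta = V[t + 1] - V[t]        # no reward during the delay
        V[t] += alpha * delta
    delta_reward = 1.0 + 0.0 - V[T]    # reward, then a zero-value terminal state
    V[T] += alpha * delta_reward
    return delta_cue, delta_reward

first = run_trial(V)
for _ in range(1000):
    last = run_trial(V)
```

On the first trial the error sits entirely at the reward (δ = 1, nothing at the cue); after learning it has moved to cue onset and vanished at the now fully predicted reward.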
Fig 2. Dopamine response function.
Left: Change in NAc extracellular dopamine concentration evoked by electrical stimulation of VTA (red boxes indicate points at which electrical stimulation began and ended). Adapted from [10], with permission. Right: Alpha function used to model the effect of a punctate, non-zero TD error (red triangle) on dopamine concentration (Eq (4)).
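The right-hand panel's construction can be sketched numerically: a punctate TD error is convolved with an alpha-function kernel to give the modelled change in dopamine concentration. The time constant below is an assumed value for illustration, not the paper's Eq (4) parameterisation.

```python
import numpy as np

# Sketch: punctate TD error convolved with an alpha-function dopamine
# response kernel. tau is an assumed time constant (illustration only).
dt = 0.01                      # time step (s)
t = np.arange(0.0, 5.0, dt)    # 5 s of simulated time
tau = 0.3                      # assumed kernel time constant (s)
kernel = (t / tau) * np.exp(1.0 - t / tau)   # alpha function; peaks at 1 when t = tau

delta = np.zeros_like(t)
idx = int(round(1.0 / dt))     # a unit TD error at t = 1 s
delta[idx] = 1.0

dDA = np.convolve(delta, kernel)[: t.size]   # simulated change in [DA]
peak_time = t[np.argmax(dDA)]                # response peaks tau s after the error
```

A punctate error thus produces a smooth, delayed [DA] transient rather than an instantaneous spike, which matters when interpreting the averaged signals in the later figures.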
Fig 3. Two conceptions of a cued lever press.
(A) A latency τ with which to press the lever is selected in an initial cued state (‘1’), leading to completion of the press τ seconds later (‘2’). (B) A latency τ with which to press the lever is selected in an initial cued state (‘1’), leading to a state of preparedness to press τ seconds later (‘2’). Completion of the press (‘3’) occurs only after a subsequent interval τ_post. After a further inter-trial interval τ_I, the process begins anew.
Fig 4. Roitman et al. [10] reported increases in average NAc dopamine concentration that occur shortly before completion of a lever press for reward.
(A) Task: rats press a lever at a time of their own choosing for reward (intra-oral sucrose) following a cue indicating that reward is available. (B) Cue presentation (black triangle) evokes a phasic increase in dopamine concentration (mean + SEM) if the cue indicates that reward is available (upper trace), but not when there is no such cue-reward pairing (lower trace); the decrease in signal in the latter case is not caused by dopamine [10]. (C;D) When aligned to time of lever press (vertical dashed line), dopamine concentration is observed to peak at the time of the press, beginning to increase shortly before this time. This is observed both for (C) short-latency trials, where presses are emitted shortly after presentation of the cue (<5 s; average time of presentation indicated by black triangle, range represented by horizontal scale bar) and (D) long-latency trials, where there is a longer delay between cue and response (>5 s). On long-latency trials, average peak dopamine concentration is higher around time of response than around time of cue (D, inset). A lever press leads to both sucrose infusion (black bar) and presentation of a tone-light stimulus (open bar). Figures B–D adapted from [10], with permission.
Fig 5. Howe et al. [14] reported gradual increases in striatal dopamine concentration as rats approach reward in a maze.
(A) Following an initial warning click, a position-triggered tone indicates to rats which arm of the maze to visit in order to receive reward (upper). Changes in current (middle) and dopamine concentration (lower) measured by FSCV in ventromedial striatum during a single T-maze trial. (B) Average dopamine concentration (±SEM) reaches similar peak values on short vs. long trials for the same maze (upper) and for mazes of different length (lower). (C) Single-trial example showing a close correspondence between the rat’s proximity to the goal (upper) and striatal dopamine concentration (lower). All figures adapted from [14], with permission.
Fig 6. Possible signals received by the critic about action timing.
We assume that the actor selects an action a (e.g., a latency to lever press) and communicates this choice to downstream pre-motor/motor areas for implementation. We also assume that the critic receives an ‘indirect’ signal a′′ via efference copy from downstream areas just prior to performance of the action itself. This latter signal resolves any uncertainty the critic may have about the time of action. The critic may also receive a ‘direct’ signal a′ from the actor which carries information about the selected action, and which is received immediately after the actor makes its decision.
Fig 7. Constant and variable hazard functions.
(A) Two different gamma densities of the time T at which the critic receives notification of an impending lever press. (B) Corresponding hazard functions h(t̂) = lim_{Δt̂→0} P(T ≤ t̂ + Δt̂ | T > t̂)/Δt̂. Note that the hazard function is constant in the G(1,1) case, but increases with time in the G(2,1) case.
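Both hazards have simple closed forms under a shape-rate G(shape, 1) parameterisation: h(t) = f(t)/(1 − F(t)), so for G(1,1) (exponential, f = e^−t, survivor S = e^−t) the hazard is constant at 1, while for G(2,1) (f = t e^−t, S = (1 + t) e^−t) it is t/(1 + t), which rises with elapsed time. A small sketch using these exact expressions:

```python
# Exact hazards h(t) = f(t) / (1 - F(t)) for the two gamma densities
# in panel A (shape-rate parameterisation, rate = 1); a sketch limited
# to the two shapes the figure uses.

def hazard_gamma(shape, t):
    """Hazard of a Gamma(shape, rate=1) variable at time t."""
    if shape == 1:                 # exponential: f = e^-t, S = e^-t
        return 1.0
    if shape == 2:                 # f = t e^-t, S = (1 + t) e^-t
        return t / (1.0 + t)
    raise ValueError("only shapes 1 and 2 are handled in this sketch")

# G(1,1): constant hazard; G(2,1): hazard rises with elapsed time
h1 = [hazard_gamma(1, t) for t in (0.5, 1.0, 2.0)]
h2 = [hazard_gamma(2, t) for t in (0.5, 1.0, 2.0)]
```

The contrast drives the later figures: under G(1,1) the press becomes no more imminent as time passes, whereas under G(2,1) elapsed time itself is informative.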
Fig 8. Pattern of prediction errors depends on the nature of communication between actor and critic.
In each case (A–C), we consider signals for three particular times T at which the critic receives notice of the impending lever press: 1 s (blue), 3 s (red), and 10 s (green). Parts of the signal where there is overlap between two or more different times T are plotted in black. In each case, we plot TD errors (top), TD errors convolved with a symmetric kernel (middle), and TD errors convolved with an ‘asymmetric’ kernel (bottom). (A) Indirect communication (a′′) only, T ∼ G(1,1). (B) Indirect communication (a′′) only, T ∼ G(2,1). (C) Both direct and indirect communication (a′; a′′), T ∼ G(2,1), with timing uncertainty (uncertainty scaling constant k = 0.1). Vertical dashed lines indicate times of observable events, i.e. cue presentation (t = 0, black) and lever presses (t = T + τ_post, coloured). Note the difference in y-axis scaling between (A;B) and (C). Model parameters: a = −1, b = 0, r = 1, τ_post = 0.5 s, τ_I = 30 s.
Fig 9. Simulated cue- and press-aligned changes in dopamine concentration, for comparison with Fig 4.
Simulated average changes (±SEM) in dopamine concentration for the case where the critic receives both direct and indirect communication, with asymmetric convolution of TD errors. (A) Cue-aligned, all trials. (B) Press-aligned, short-latency (<5 s) trials. (C) Press-aligned, long-latency (>5 s) trials. Insets show average peak changes (+SEM) in dopamine around the time of cue presentation and time of lever press. Number of simulated trials N = 1000, realizations of T drawn from G(2,1). Model parameters as before: a = −1, b = 0, r = 1, k = 0.1, τ_post = 0.5 s, τ_I = 30 s.
Fig 10. Simulated tonic and phasic dopamine fluctuations.
(A) Simulated tonic fluctuations of dopamine concentration [DA] around a constant level (horizontal dashed line). (B) Addition of a comparatively large phasic fluctuation in dopamine concentration due to a TD error occurring at t = 1 s (vertical dashed line).
Fig 11. Tonic fluctuations generate average ramping signals.
(A) Single trial example showing evolution of the decision variable x(t) (upper) and dopamine concentration [DA] (lower) over time. (B) Average [DA] (±SEM) aligned to time of threshold crossing. Number of simulated trials N = 1000. [DA] process parameters: dt = 0.01 s, θ = 1, κ = 0.01, σ = 0.1. DDM parameters: A = 1, c = 1, z = 5.
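The logic of this figure can be made concrete with a minimal sketch (not the paper's code) under one assumed coupling: tonic [DA] follows a mean-reverting process around a constant level and additively modulates the drift of a diffusion-to-bound decision variable x(t); the coupling gain g below is an illustrative assumption. Because crossings preferentially follow periods of elevated [DA], aligning [DA] to the threshold crossing produces an average ramp, and trials with higher mean [DA] cross sooner.

```python
import numpy as np

rng = np.random.default_rng(0)

# [DA] process and DDM parameters as listed in the caption; the
# [DA]->drift gain g is an assumption of this sketch, not the paper's.
dt, theta, kappa, sigma = 0.01, 1.0, 0.01, 0.1
A, c, z = 1.0, 1.0, 5.0
g = 5.0

def one_trial(max_steps=5000):
    """Run one trial; return the [DA] trace up to threshold crossing."""
    DA, x = theta, 0.0
    trace = []
    for _ in range(max_steps):
        # mean-reverting tonic [DA] fluctuation
        DA += kappa * (theta - DA) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        # decision variable with [DA]-modulated drift
        x += (A + g * (DA - theta)) * dt + c * np.sqrt(dt) * rng.standard_normal()
        trace.append(DA)
        if x >= z:
            return np.array(trace)
    return None  # no crossing within the time limit

window = int(round(1.0 / dt))   # average the last 1 s before crossing
mean_DA, lats, aligned = [], [], []
for _ in range(200):
    tr = one_trial()
    if tr is None:
        continue
    mean_DA.append(tr.mean())
    lats.append(tr.size * dt)
    if tr.size >= window:
        aligned.append(tr[-window:])

avg = np.mean(aligned, axis=0)                 # crossing-aligned average [DA]
rho = np.corrcoef(mean_DA, lats)[0, 1]         # higher tonic [DA] -> shorter latency
```

The particular coupling is one reading of the account; the selection effect (averaging conditioned on crossing) is what generates the ramp, not any ramp in the underlying process.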
Fig 12. Phasic fluctuations generate ramping signals.
(A) Average dopamine concentration [DA] (±SEM) aligned to time of threshold crossing. Number of simulated trials N = 1000. (B) Time of threshold-crossing (latency) is negatively correlated with size of TD error h in the model (ρ = −0.43). (C) Similarly, response magnitude of dopaminergic cells to a trial-start cue (upper plots, showing population response histograms by behavioural reaction time, RT) is negatively correlated with a monkey’s reaction time (lower) in an instrumental, reward-related task. Adapted from [83], with permission. [DA] process parameters: dt = 0.01 s, θ = 1, κ = 0.01, σ = 0.1. DDM parameters: A = 2, c = 0.1, z = 5. TD errors: μ_TD = 4, σ²_TD = 1.
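The negative correlation in (B) follows under one simple, purely illustrative reading: let each trial's phasic TD error h replace the fixed drift of the decision variable (noise c and threshold z as in the caption), so larger errors reach threshold sooner.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch (not the paper's code): the sampled TD error h is assumed to
# act as the drift of a diffusion-to-bound process, in place of the
# caption's fixed drift A.
dt, c, z = 0.01, 0.1, 5.0      # DDM time step, noise, threshold
mu_td, sd_td = 4.0, 1.0        # TD error distribution (as in the caption)

def latency(drift, max_t=60.0):
    """First-passage time to threshold z (capped at max_t seconds)."""
    x, t = 0.0, 0.0
    while x < z and t < max_t:
        x += drift * dt + c * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return t

errors = rng.normal(mu_td, sd_td, size=200)          # per-trial TD errors h
lats = np.array([latency(h) for h in errors])        # resulting latencies
rho = np.corrcoef(errors, lats)[0, 1]                # negative, as in panel B
```

With drift dominating the small noise term, latency is roughly z/h, so the error-latency relationship is monotonically decreasing.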
Fig 13. Correlation of average utility rate and size of TD error.
(A) As the utility r of a reward increases, putatively from a change in motivational state, both the average utility rate ρ (assumed to be signalled by tonic dopamine) and the size of the TD error δ_t^p in response to a trial-start cue (phasic dopamine) increase. (B) A negative correlation between TD error and latency is observed. Here, we again assume the lever-pressing task depicted in Fig 3B. The critic is assumed to receive only indirect information about the actor’s choices. Model parameters: a = −0.05, b = 0, τ_post = 0.5 s, τ_I = 0 s, β = 1.
Fig 14. Optimal latency τ* as a function of discount factor γ and cost a.
The optimal latency τ* tends to decrease as either the discount factor γ or the cost of acting quickly a decreases. (A) Terminating SMDP, V_γ(s_{t+τ}) = 1, ∀τ, γ. There exists a limit a_lim on the cost of acting below which there is no solution for τ* (solid red line). (B) Difference between the optimal τ* for the continuing and terminating SMDP cases when τ_I = 30 s. As τ_I is large relative to −1/log γ, there is little difference Δτ* from the terminating case. (C) Difference between the optimal τ* for the continuing and terminating SMDP cases when τ_I = 1 s. In this case, future rewards hasten lever pressing, as seen in the more prevalent decreases in τ*.
Fig 15. Simulations replicate Howe et al. results.
(A) [DA] gradually increases as the goal is approached, peaking at the same value whether different times were taken to traverse a maze of fixed length, or in mazes of different lengths with a fixed magnitude of reward (time is taken as a proxy for distance in the latter case). (B) Peak [DA] is greater for larger rewards. (C) [DA] tracks proximity to the goal. In this example, goal proximity increases non-monotonically over time (left), and we plot both the corresponding scaled value quantity (1 − γ)V_γ(s_{t+1}) (middle) and its convolution with the DRF, which yields [DA] (right). Parameters: γ = 0.98, r = 1 (unless indicated otherwise).
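The behaviour in (A) and (B) follows directly from the form of the scaled value: if reward r is reached at time T and V_γ(s_t) = γ^(T−t) r, then (1 − γ)V_γ ramps up to a peak of (1 − γ)r at the goal, which depends on r but not on path length T. A minimal sketch under that assumption (deterministic, constant-speed approach to the goal; the time step dt is an illustrative choice):

```python
import numpy as np

# Scaled discounted value (1 - gamma) * V_gamma(s_t) for a goal reached
# at time T with reward r, assuming V_gamma(s_t) = gamma^(T - t) * r.
gamma, dt = 0.98, 0.1   # discount factor as in the caption; dt assumed

def scaled_value(T, r):
    t = np.linspace(0.0, T, int(round(T / dt)) + 1)
    return (1.0 - gamma) * gamma ** (T - t) * r

short = scaled_value(5.0, 1.0)    # short traversal
long_ = scaled_value(15.0, 1.0)   # longer traversal, same reward
big = scaled_value(5.0, 2.0)      # same traversal, doubled reward
```

The three traces ramp toward the goal; `short` and `long_` share the same peak (1 − γ)r despite different lengths, while `big` peaks twice as high, mirroring panels A and B.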

References

    1. Montague PR, Dayan P, Sejnowski TJ. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. The Journal of Neuroscience. 1996;16(5):1936–1947.
    2. Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science. 1997;275:1593–1599. doi: 10.1126/science.275.5306.1593
    3. Sutton RS. Learning to predict by the methods of temporal differences. Machine Learning. 1988;3(1):9–44. doi: 10.1023/A:1022633531479
    4. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. MIT Press; 1998.
    5. Clark JJ, Collins AL, Sanford CA, Phillips PEM. Dopamine encoding of Pavlovian incentive stimuli diminishes with extended training. The Journal of Neuroscience. 2013;33(8):3526–3532. doi: 10.1523/JNEUROSCI.5119-12.2013
