Research Article | Psychological Science

Two sides of the same coin: Beneficial and detrimental consequences of range adaptation in human reinforcement learning


Science Advances  02 Apr 2021:
Vol. 7, no. 14, eabe0340
DOI: 10.1126/sciadv.abe0340

Abstract

Evidence suggests that economic values are rescaled as a function of the range of the available options. Although locally adaptive, range adaptation has been shown to lead to suboptimal choices, particularly notable in reinforcement learning (RL) situations when options are extrapolated from their original context to a new one. Range adaptation can be seen as the result of an adaptive coding process aiming at increasing the signal-to-noise ratio. However, this hypothesis leads to a counterintuitive prediction: Decreasing task difficulty should increase range adaptation and, consequently, extrapolation errors. Here, we tested the paradoxical relation between range adaptation and performance in a large sample of participants performing variants of an RL task, where we manipulated task difficulty. Results confirmed that range adaptation induces systematic extrapolation errors and is stronger when decreasing task difficulty. Last, we propose a range-adapting model and show that it is able to parsimoniously capture all the behavioral results.

INTRODUCTION

In the famous Ebbinghaus illusion, two circles of identical size are placed near to each other. Larger circles surround one, while smaller circles surround the other. As a result, the central circle surrounded by larger circles appears smaller than the central circle surrounded by smaller circles, indicating that the subjective perception of size of an object is affected by its surroundings.

Beyond perceptual decision-making, a wealth of evidence in neuroscience and in economics suggests that the subjective economic value of an option is not estimated in isolation but is highly dependent on the context in which the options are presented (1, 2). The vast majority of neuroeconomic studies of context-dependent valuation in human participants considered situations where subjective values are triggered by explicit cues, that is, stimuli whose value can be directly inferred, such as lotteries or snacks (3–5). However, in a series of recent papers, we and other groups demonstrated that contextual adjustments also permeate reinforcement learning situations in which option values have to be inferred from the history of past outcomes (6–8). We showed that an option, whose small objective value [for example, 7.5 euro cents (c)] is learned in a context of smaller outcomes, is preferred to an option whose objective value (25c) is learned in a context of bigger outcomes, thus providing an equivalent of the Ebbinghaus illusion in economic choices. Similar observations in birds suggest that this is a feature of decision-making broadly shared across vertebrates (9, 10).

Although (as illustrated by the Ebbinghaus example) value context dependence may lead to erroneous or suboptimal decisions, it could be normatively understood as an adaptive process aimed at rescaling the neural response according to the range of the available options. Specifically, it could be seen as the result of an adaptive coding process aiming at increasing the signal-to-noise ratio by a system (the brain) constrained by the fact that behavioral variables have to be encoded by finite firing rates. In other terms, such range adaptation would be a consequence of how the system adjusts and optimizes the function associating the firing rate to the objective values, to set the slope of the response to the optimal value for each context (11, 12).

If range adaptation is a consequence of how the brain automatically adapts its response to the distributions of the available outcomes, then factors that facilitate the identification of these distributions might make its behavioral consequences more pronounced, because of the larger difference between the objective option values (context-independent or absolute) and their corresponding subjective values (context-dependent or relative). This leads to a counterintuitive prediction in the context of reinforcement learning: Decreasing task difficulty should increase range adaptation and, consequently, extrapolation errors. This prediction is in notable contrast with the intuition embedded in virtually all learning algorithms that making a learning problem easier (in our case, by facilitating the identification of the outcome distributions) should, if anything, lead to more accurate and objective internal representations. In the present study, we aim at testing this hypothesis while, at the same time, gaining a better understanding of range adaptation at the algorithmic level.

To empirically test this hypothesis, we built on previous research and used a task featuring a learning phase and a transfer phase (6). In the learning phase, participants had to determine, by trial and error, the optimal option in four fixed pairs of options (contexts) with different outcome ranges. In the transfer phase, the original options were rearranged in different pairs, thus creating new contexts. This setup allowed us to quantify learning (or acquisition) errors during the first phase and transfer (or extrapolation) errors during the second phase. Crucially, the task contexts were designed such that the correct responses (that is, choice of the option giving a higher expected value) in the transfer phase were not necessarily correct responses during the learning phase. We varied this paradigm in eight different versions where we manipulated task difficulty in complementary ways. First, some of the experiments (E3, E4, E7, and E8) featured complete feedback information, meaning that participants were informed about the outcome of the forgone option. This manipulation reduces the difficulty of the task by resolving the uncertainty concerning the counterfactual outcome (that is, the outcome of the unchosen option). Complete feedback information has been repeatedly shown to improve learning performance (8, 13). Second, some of the experiments (E5, E6, E7, and E8) featured a block (instead of interleaved) design, meaning that all the trials featuring one context were presented in a row. This manipulation reduces task difficulty by reducing working memory demand and has also been shown to improve learning performance (14). Last, in some of the experiments (E2, E4, E6, and E8), feedback was also provided in the transfer phase, thus allowing us to assess whether and how the values learned during the learning phase can be revised.

Analysis of choice behavior provided support for the counterintuitive prediction and indicated that acquisition error rate in the learning phase is largely dissociable from extrapolation error rate in the transfer phase. Critically (and paradoxically), error rate in the transfer phase was higher when the learning phase was easier. Accordingly, the estimated deviation between the objective values and the subjective values increased in the complete feedback and block design tasks. The deviation was corrected only in the experiments with complete feedback in the transfer phase.

To complement the choice rate analysis, we developed a computational model that implements range adaptation as a range normalization process, by tracking the maximum and the minimum possible reward in each learning context. Model simulations parsimoniously captured performance in the learning and the transfer phase, including the suboptimal choices induced by range adaptation. Model simulations also allowed us to rule out alternative interpretations of our results offered by two prominent theories in psychology and economics: habit formation and risk aversion (15, 16). Model comparison results were confirmed by comparing out-of-sample likelihoods as a quantitative measure of goodness of fit.

RESULTS

Experimental protocol

We designed a series of learning and decision-making experiments involving variants of a main task. The main task was composed of two phases: the learning phase and the transfer phase. During the learning phase, participants were presented with eight abstract pictures, organized in four stable choice contexts. In the learning phase, each choice context featured only two possible outcomes: either 10/0 points or 1/0 point. The outcomes were probabilistic (75 or 25% probability of the nonzero outcome), and we labeled the choice contexts as a function of the difference in expected value between the most and the least rewarding option: ∆EV = 5 and ∆EV = 0.5 (Fig. 1A). In the subsequent transfer phase, the eight options were rearranged into new choice contexts, where options associated with 10 points were compared to options associated with 1 point [see (7, 10) for similar designs in humans and starlings]. The resulting four new contexts were also labeled as a function of the difference in expected value between the most and the least rewarding option: ∆EV = 7.25, ∆EV = 6.75, ∆EV = 2.25, and ∆EV = 1.75 (Fig. 1B). In our between-participants study, we developed eight different variants of the main paradigm where we manipulated whether we provided trial-by-trial feedback in the transfer phase (with/without), the quantity of information provided at feedback (partial: only the outcome of the chosen option is shown/complete: both outcomes are shown), and the temporal structure of choice context presentation (interleaved: choice contexts appear in a randomized order/block: all trials belonging to the same choice context are presented in a row) (Fig. 1C). All the experiments implementing the above-described experimental protocol and reported in the Results section were conducted online (n = 100 participants in each experiment); we report in the Supplementary Materials the results of a similar experiment conducted in the laboratory.

Fig. 1 Experimental design.

(A) Choice contexts in the learning phase. During the learning phase, participants were presented with four choice contexts, including high-magnitude (∆EV = 5.0) and low-magnitude (∆EV = 0.5) contexts. (B) Choice contexts in the transfer phase. The eight options were rearranged into four new choice contexts, each involving both the 1- and the 10-point outcome. (C) Experimental design. The eight experiments varied in the temporal arrangement of choice contexts (interleaved or block), the quantity of feedback in the learning phase (partial or complete), and the transfer phase feedback (without or with). (D) Successive screens of a typical trial (durations are given in milliseconds).

Overall correct response rate

The main dependent variable in our study was the correct response rate, i.e., the proportion of expected value-maximizing choices in the learning and the transfer phase (crucially, our task design allowed us to identify an expected value-maximizing choice in every choice context). In the learning phase, the average correct response rate was significantly higher than the chance level of 0.5 [0.69 ± 0.16, t(799) = 32.49, P < 0.0001, and d = 1.15; Fig. 2, A and B]. Replicating previous findings, in the learning phase, we also observed a moderate but significant effect of the choice context, where the correct choice rate was higher in the ∆EV = 5.0 compared to the ∆EV = 0.5 contexts (0.71 ± 0.18 versus 0.67 ± 0.18; t(799) = 6.81, P < 0.0001, and d = 0.24; Fig. 2C) (6).

Fig. 2 Behavioral results.

(A) Correct choice rate in the learning phase as a function of the choice context (∆EV = 5.0 or ∆EV = 0.5). Left: Learning curves. Right: Average across all trials (n = 800 participants). (B) Average correct response rate in the learning phase per experiment (in blue: one point per participant) and meta-analytical (in orange: one point per experiment). (C) Difference in correct choice rate between the ∆EV = 5.0 and the ∆EV = 0.5 contexts per experiment (in blue: one point per participant) and meta-analytical (in orange: one point per experiment). (D) Correct choice rate in the transfer phase as a function of the choice context (∆EV = 7.25, ∆EV = 6.75, ∆EV = 2.25, or ∆EV = 1.75). Left: Learning curves. Right: Average across all trials (n = 800 participants). (E) Average correct response rate in the transfer phase per experiment (in pink: one point per participant) and meta-analytical (in orange: one point per experiment). (F) Correct choice rate for the ∆EV = 1.75 context only (in pink: one point per participant) and meta-analytical (in orange: one point per experiment). In all panels, points indicate individual average, areas indicate probability density function, boxes indicate 95% confidence interval, and error bars indicate SEM.

Correct response rate was also higher than chance in the transfer phase (0.62 ± 0.17, t(799) = 20.29, P < 0.0001, and d = 0.72; Fig. 2, D and E), but it was also strongly modulated by the choice context (F2.84,2250.66 = 271.68, P < 0.0001, and η2 = 0.20, Huynh-Feldt corrected). In the transfer phase, the ∆EV = 1.75 choice context is of particular interest, because the expected value-maximizing option was the least favorable option of a ∆EV = 5.0 context in the learning phase, and conversely, the expected value-minimizing option was the most favorable option of a ∆EV = 0.5 context of the learning phase. In other words, a participant relying on expected values calculated on a context-independent scale will prefer the option with EV = 2.5 (EV2.5 option) over the option with EV = 0.75 (EV0.75 option). By contrast, a participant encoding the option values in a fully context-dependent manner (which is equivalent to encoding the rank of the two options within a given context) will perceive the EV2.5 option as less favorable than the EV0.75 option. Therefore, preferences in the ∆EV = 1.75 context are diagnostic of whether values are learned and encoded on an absolute or a relative scale. Crucially, in the ∆EV = 1.75 context, we found that participants’ average correct choice rate was significantly below chance level (0.42 ± 0.30, t(799) = −7.25, P < 0.0001, and d = −0.26; Fig. 2F), thus demonstrating that participants express suboptimal preferences in this context, i.e., they do not choose the option with the highest objective expected value.
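To make the diagnostic concrete, here is the comparison worked out from the design values above (the "fully relative" line assumes complete range adaptation, i.e., each outcome is divided by the maximum outcome of its original learning context):

$$\text{Absolute encoding:}\quad 0.25 \times 10 = 2.5 \;>\; 0.75 \times 1 = 0.75$$

$$\text{Fully relative encoding:}\quad 0.25 \times \tfrac{10}{10} = 0.25 \;<\; 0.75 \times \tfrac{1}{1} = 0.75$$

The two encoding schemes therefore predict opposite preferences in the ∆EV = 1.75 context.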

Between-experiments comparisons: Learning phase

In this section, we analyze the correct response rate as a function of the experimental factors manipulated across the eight experiments (the quantity of feedback information, which could be either partial or complete; the temporal structure of choice context presentation, which could be block or interleaved; and whether feedback was provided in the transfer phase). In the Results section, we report only the significant results; see Tables 1 and 2 for all results and effect sizes.

Table 1 Statistical effects of the ANOVA on the choice rate as a function of task factors.

LF, learning feedback (complete/partial); TF, transfer feedback (with/without); BE, block effect (block/interleaved); PE, phase effect (learning/transfer); DFn, degrees of freedom numerator; DFd, degrees of freedom denominator; F-val, Fisher value; Diff, value of the difference between the two conditions (main effects only); ηp2, portion of variance explained. *P < 0.05, **P < 0.01, and ***P < 0.001.

Table 2 Participants’ age and correct choice rate as a function of experiments and task factors.


First, we analyzed the correct choice rate in the learning phase (Fig. 2B). As expected, increasing feedback information had a significant effect on correct choice rate in the learning phase (F1,792 = 55.57, P < 0.0001, and ηp2 = 0.18); similarly, performance in the block design experiments was significantly higher (F1,792 = 87.22, P < 0.0001, and ηp2 = 0.25). We found a significant interaction between feedback information and task structure, reflecting that the difference in performance between partial and complete feedback was larger in the block design (F1,792 = 5.05, P = 0.02, and ηp2 = 0.02). We found no other significant main effect or double or triple interaction (Table 1).

We also analyzed the difference in performance between the ∆EV = 5.0 and ∆EV = 0.5 choice contexts across experiments (Fig. 2C). We found a small but significant effect of temporal structure, the differential being smaller in the block compared to the interleaved experiments (F1,792 = 7.71, P = 0.006, and ηp2 = 0.01), and found no other significant main effect or interaction.

To sum up, as expected (8, 13, 14), increasing feedback information and clustering the choice contexts had a beneficial effect on correct response rate in the learning phase. Designing the choice contexts in blocks also blunted the difference in performance between the small (∆EV = 0.5) and big (∆EV = 5.0) magnitude contexts.

Between-experiments comparisons: Transfer phase

We then analyzed the correct choice rate in the transfer phase (Fig. 2E). Expectedly, showing trial-by-trial feedback in the transfer phase led to significantly higher performance (F1,792 = 137.18, P < 0.0001, and ηp2 = 0.07). Increasing feedback information from partial to complete also had a significant effect on transfer phase correct choice rate (F1,792 = 22.36, P < 0.0001, and ηp2 = 0.01). We found no significant main effect of task structure in the transfer phase (see Table 1).

We found a significant interaction between feedback information and the presence of feedback in the transfer phase, showing that the performance benefit of transfer feedback was larger when both outcomes had been displayed during the learning phase (F1,792 = 20.18, P < 0.0001, and ηp2 = 0.01). We also found a significant interaction between transfer feedback and task structure, reflecting that the performance benefit of transfer feedback was even larger in the block design (F1,792 = 42.22, P < 0.0001, and ηp2 = 0.02). Last, we found a significant triple interaction between feedback information, the presence of feedback in the transfer phase, and task structure (F1,792 = 5.02, P = 0.03, and ηp2 = 0.003). We found no other significant double interaction. We also separately analyzed the correct choice rate in the ∆EV = 1.75 context (Fig. 2F). Overall, the statistical effects presented a pattern similar to that of the correct choice rate across all conditions (see Table 2), indicating that the overall correct choice rate and the correct choice rate in the key ∆EV = 1.75 comparison provided a coherent picture. Furthermore, comparing the ∆EV = 1.75 choice rate to chance level (0.5) revealed that participants, overall, significantly expressed expected value-minimizing preferences in this choice context. Crucially, the lowest correct choice rate was observed in the experiment featuring complete feedback, clustered choice contexts (i.e., block design), and no feedback in the transfer phase [E7; 0.27 ± 0.32, t(99) = −7.11, P < 0.0001, and d = −0.71]; the addition of feedback in the transfer phase reversed the situation, as the only experiment where participants expressed expected value-maximizing preferences was E8 [0.59 ± 0.29, t(99) = 2.96, P = 0.0038, and d = 0.30].

Between-phase comparison

We found a significant interaction between the phase (learning or transfer) and transfer feedback (without/with) on correct choice rate (F1,792 = 82.30, P < 0.0001, and ηp2 = 0.09). This interaction is shown in Fig. 3 and reflects the fact that while adding transfer feedback information had a significant effect on transfer performance (F1,792 = 137.18, P < 0.0001, and ηp2 = 0.05; Fig. 3, A and B), it was not sufficient to outperform learning performance (with transfer feedback: learning performance 0.69 ± 0.16 versus transfer performance 0.68 ± 0.15, t(399) = 0.89, P = 0.38, and d = 0.04; Fig. 3B).

Fig. 3 Learning versus transfer phase comparison and inferred option values.

(A and B) Average response rate in the learning (blue) and transfer (pink) phase for experiments without (A) and with (B) trial-by-trial transfer feedback. Left: Learning curves. Right: average across all trials. (C) Average inferred option values for the experiments without trial-by-trial transfer feedback (E1, E3, E5, and E7). (D) Trial-by-trial inferred option values for the experiments with trial-by-trial transfer feedback (E2, E4, E6, and E8). In all panels, points indicate individual average, areas indicate probability density function, boxes indicate 95% confidence interval, and error bars indicate SEM.

Last, close inspection of the learning curves revealed that in experiments where feedback was not provided in the transfer phase (E1, E3, E5, and E7), correct choice rates (and therefore option preferences) were stationary (Fig. 3, A and B). This observation rules out the possibility that reduced performance in the transfer phase was induced by progressively forgetting the values of the options (in which case we should have observed a nonstationary and decreasing correct response rate).

In conclusion, comparison between the learning and the transfer phase reveals two interrelated and intriguing facts: (i) Despite the fact that the transfer phase happens immediately after an extensive learning phase, performance is, if anything, lower compared to the learning phase; (ii) factors that improve performance (by intrinsically or extrinsically reducing task difficulty) in the learning phase have either no (feedback information) or a negative (task structure) impact on the transfer phase performance.

Inferred option values

To visualize and quantify how much observed choices deviate from the experimentally determined true option values, we treated the four possible subjective option values as free parameters. More precisely, we initialized all subjective option values to their true values (accordingly, we labeled the four possible options as follows: EV7.5, EV2.5, EV0.75, and EV0.25) and fitted their values, as if they were free parameters, by maximizing the likelihood of the observed choices. We modeled choices using the logistic function (for example, for options EV2.5 and EV0.75)

$$P(\mathrm{EV}_{2.5}) = \frac{1}{1 + e^{\,V(\mathrm{EV}_{0.75}) - V(\mathrm{EV}_{2.5})}} \tag{1}$$

Thus, if a participant chose indifferently between the EV2.5 and the EV0.75 option, their fitted values would be very similar: V(EV2.5) ≈ V(EV0.75). Conversely, a participant with a sharp (optimal) preference for EV2.5 over EV0.75 would have clearly different fitted values: V(EV2.5) > V(EV0.75). In a first step, in the experiments where feedback was not provided in the transfer phase (E1, E3, E5, and E7), we optimized one set of subjective values per participant.
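For concreteness, the following is a minimal Python sketch of this value-inference procedure (not the authors' code; the variable names and the use of scipy's optimizer are our assumptions), assuming choices are coded as indices 0 to 3 for EV7.5, EV2.5, EV0.75, and EV0.25:

```python
import numpy as np
from scipy.optimize import minimize

def infer_option_values(pairs, choices, v0=(7.5, 2.5, 0.75, 0.25)):
    """Fit subjective values of the four options by maximizing the likelihood of
    the observed choices under the logistic rule of Eq. 1.

    pairs   : (n_trials, 2) array with the indices (0..3) of the two options shown
    choices : (n_trials,) array with the index of the chosen option
    v0      : starting point, set to the objective expected values
    """
    pairs = np.asarray(pairs)
    choices = np.asarray(choices)

    def neg_log_lik(v):
        # index of the unchosen option in each trial
        unchosen = np.where(pairs[:, 0] == choices, pairs[:, 1], pairs[:, 0])
        # Eq. 1: P(chosen) = 1 / (1 + exp(V(unchosen) - V(chosen)))
        p_chosen = 1.0 / (1.0 + np.exp(v[unchosen] - v[choices]))
        return -np.sum(np.log(p_chosen + 1e-12))

    return minimize(neg_log_lik, x0=np.array(v0, dtype=float)).x
```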

Consistent with the correct choice rate results described above, we found a value inversion of the two intermediary options (EV2.5 = 4.46 ± 1.2, EV0.75 = 5.26 ± 1.2, t(399) = −7.82, P < 0.0001, and d = −0.67), which were paired in the ∆EV = 1.75 context (Fig. 3C). The differential was also strongly modulated across experiments (F3,396 = 18.9, P < 0.0001, and ηp2 = 0.13; Fig. 3C) and reached its highest value in E7 (complete feedback and block design).

As a second step, in the experiments where feedback was provided in the transfer phase (E2, E4, E6, and E8), we optimized a set of subjective values per trial. This fit allows us to estimate the trial-by-trial evolution of the subjective values over task time. The results of this analysis clearly show that suboptimal preferences progressively arise during the learning phase and disappear during the transfer phase (Fig. 3D). However, the suboptimal preference was completely corrected only in E8 (complete feedback and block design) by the end of the transfer phase.

The analysis of inferred option values clearly confirms that participants’ choices do not follow the true underlying monotonic ordering of the objective option values. Furthermore, it also clearly illustrates that in choice contexts that are supposed to facilitate the learning of the option values (complete feedback and block design), the deviation from monotonic ordering, at least at the beginning of transfer phase, is paradoxically greater. Monotonicity was fully restored only in E8, where complete feedback was provided in the transfer phase.

Computational formalization of the behavioral results

To formalize context-dependent reinforcement learning and account for the behavioral results, we designed a modified version of a standard model, where option-dependent Q values are learnt from a range-adapted reward term. In the present study, we implemented range adaptation as a range normalization process, which is one among other possible implementations (17). At each trial t, the relative reward, RRAN,t, is calculated as follows

$$R_{RAN,t} = \frac{R_{OBJ,t} - R_{MIN,t}(s)}{R_{MAX,t}(s) - R_{MIN,t}(s) + 1} \tag{2}$$

where s is the decision context (i.e., a combination of options) and RMAX and RMIN are state-level variables, initialized to 0 and updated at each trial t if the outcome is greater (RMAX) or smaller (RMIN) than their current value. The "+1" in the denominator is added, in part, to prevent division by zero (even if this could also easily be avoided by adding a simple conditional rule) and, mainly, to make the model nest a simple Q-learning model. ROBJ,t is the objective obtained reward, which in our main experiments could take the following values: 0, +1, and +10 points. Because in our task the minimum possible outcome is always zero, the RMIN,t update was omitted while fitting the first eight experiments (but included in a ninth dataset analyzed below). On the other side, RMAX converges to the maximum outcome value in each decision context, which in our task is either 1 or 10 points. In the first trial, RRAN = ROBJ [because RMAX,0(s) = 0], and in later trials, it is progressively normalized between 0 and 1 as the range value RMAX(s) converges to its true value. We refer to this model as the RANGE model, and we compared it to a benchmark model (ABSOLUTE) that updates option values based on the objective reward values (note that the ABSOLUTE model is nested within the RANGE model).
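The following Python sketch illustrates one possible implementation of the RANGE update rule described above (partial feedback case; parameter names such as alpha_q, alpha_r, and beta are illustrative, and the conditional delta-rule updates of RMAX/RMIN are our reading of the verbal description, not the authors' code):

```python
import numpy as np

def simulate_range_model(contexts, n_trials=30, alpha_q=0.3, alpha_r=0.3,
                         beta=5.0, seed=0):
    """Minimal RANGE-model simulation on a set of two-option learning contexts.

    contexts: dict mapping a context label to two (magnitude, probability) tuples,
              e.g. {"dEV5": [(10, .25), (10, .75)], "dEV0.5": [(1, .25), (1, .75)]}
    """
    rng = np.random.default_rng(seed)
    Q = {s: np.zeros(2) for s in contexts}      # option values, one pair per context
    R_max = {s: 0.0 for s in contexts}          # context-level range variables
    R_min = {s: 0.0 for s in contexts}

    for _ in range(n_trials):
        for s, options in contexts.items():
            # softmax (logistic) choice between the two options of context s
            p_right = 1.0 / (1.0 + np.exp(-beta * (Q[s][1] - Q[s][0])))
            c = int(rng.random() < p_right)
            magnitude, prob = options[c]
            r_obj = magnitude if rng.random() < prob else 0.0

            # update the range variables only when the outcome exceeds them
            if r_obj > R_max[s]:
                R_max[s] += alpha_r * (r_obj - R_max[s])
            elif r_obj < R_min[s]:
                R_min[s] += alpha_r * (r_obj - R_min[s])

            # Eq. 2: range-normalized reward ("+1" keeps ABSOLUTE nested at R_max = R_min = 0)
            r_ran = (r_obj - R_min[s]) / (R_max[s] - R_min[s] + 1.0)
            Q[s][c] += alpha_q * (r_ran - Q[s][c])   # delta rule on the chosen option

    return Q, R_max, R_min
```

Note that with alpha_r = 0 the range variables never move away from 0, the normalized reward equals the objective reward, and the updates reduce to the ABSOLUTE (Q-learning) model, which is the nesting property mentioned above.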

For each model, we estimated the optimal free parameters by likelihood maximization. We used the out-of-sample likelihood to compare goodness of fit and parsimony of the different models (Table 3). To calculate the out-of-sample likelihood in the learning phase, we performed the optimization on half of the trials (one ∆EV = 5.0 and one ∆EV = 0.5 context) of the learning phase, and the best-fitting parameters from this first set were used to predict choices in the remaining half of the trials. In the learning phase, we found that the RANGE model significantly outperformed the ABSOLUTE model [out-of-sample log-likelihood LLRAN versus LLABS, t(799) = 6.89, P < 0.0001, and d = 0.24; Table 3]. To calculate the out-of-sample likelihood in the transfer phase, we fitted the parameters on all trials of the learning phase, and the best-fitting parameters were used to predict choices in the transfer phase. Thus, the resulting likelihood is not only out-of-sample but also cross-learning phase. This analysis revealed that the RANGE model again outperformed the ABSOLUTE model [out-of-sample log-likelihood LLRAN versus LLABS, t(799) = 8.56, P < 0.0001, and d = 0.30].
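As an illustration of this cross-validation scheme (a sketch only; fit_model and log_likelihood stand in for whatever maximum-likelihood routine is used):

```python
def out_of_sample_loglik(fit_model, log_likelihood, train_trials, test_trials):
    """Twofold cross-validation: fit on one set of trials, evaluate on another.

    For the learning phase, train_trials would contain one dEV=5.0 and one dEV=0.5
    context and test_trials the remaining two; for the transfer phase, the model is
    fitted on all learning-phase trials and evaluated on the transfer-phase trials.
    """
    best_params = fit_model(train_trials)            # maximum-likelihood estimates
    return log_likelihood(best_params, test_trials)  # predictive (out-of-sample) fit
```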

Table 3 Quantitative model comparison.

Values reported here represent out-of-sample log-likelihood after twofold cross-validation. Comparison to the RANGE model: ***P < 0.001; **P < 0.01; $P < 0.08.


To study the behaviors of our computational model and assess the behavioral reasons underlying the out-of-sample likelihood results, we simulated the two models (using the individual best-fitting parameters) (18). In the learning phase, only the RANGE model managed to reproduce the observed correct choice rate. Specifically, the ABSOLUTE model predicts very poor performance in the ∆EV = 0.5 context [ABSOLUTE versus data, t(799) = −16.90, P < 0.0001, and d = 0.60; RANGE versus data, t(799) = −1.79, P = 0.07, and d = −0.06; Fig. 4A].

Fig. 4 Model comparison.

Model simulations of the ABSOLUTE and the RANGE models (dots) superimposed on the behavioral data (boxes indicate the mean and 95% confidence interval) in each context. (A) Simulated data in the learning phase were obtained with the parameters fitted on half the data (the ∆EV = 5.0 and the ∆EV = 0.5 contexts on the leftmost part of the panel) of the learning phase. (B) Data and simulations of the correct choice rate differential between high-magnitude (∆EV = 5.0) and low-magnitude (∆EV = 0.5) contexts. (C) Simulated data in the transfer phase were obtained with the parameters fitted on all the contexts of the learning phase. (D) Data and simulations in the ∆EV = 1.75 context only. (E) Average inferred option values for the behavioral data and simulated data (colored dots: RANGE model) for the experiments without trial-by-trial feedback in the transfer phase. (F) Trial-by-trial inferred option values for the behavioral data and simulated data (colored dots: RANGE model) for the experiments with trial-by-trial feedback in the transfer phase. As in Fig. 3D, the curves indicate the trial-by-trial fit of each inferred option value.

In the transfer phase, and particularly in the ∆EV = 1.75 context, only the RANGE model manages to account for the observed correct choice rate, while the ABSOLUTE model fails (ABSOLUTE versus data, t(799) = 13.20, P < 0.0001, and d = 0.47; RANGE versus data, t(799) = 0.36, P = 0.72, and d = 0.01; Fig. 4, C and D). In general, the ABSOLUTE model tends to overestimate the correct choice rate in the transfer phase.

In addition to looking at the qualitative choice patterns, we also inferred the subjective option values from the RANGE model simulations. The RANGE model was able to perfectly reproduce the subjective option value pattern that we observed in the data, specifically the violation of monotonic ranking (Fig. 4E) and their temporal dynamics (Fig. 4F).

Ruling out habit formation

One of the distinguishing behavioral signatures of the RANGE model compared to the ABSOLUTE one is the preference for the suboptimal option in the ∆EV = 1.75 context. Because the optimal option in the ∆EV = 1.75 context is not often chosen during the learning phase (where it is locally suboptimal), it could be argued that this result arises from taking decisions based on a weighted average between absolute option values and past choice propensities (a sort of habituation or choice trace). To rule out this interpretation, we fitted and simulated a version of a HABIT model, which takes decisions based on a weighted sum of the absolute Q values and a habitual choice trace (16, 19). The habitual choice trace component is updated with an additional learning rate parameter that gives a bonus to the selected action. Decisions are taken by comparing option-specific decision weights Dt

$$D_t(s,c) = (1 - \omega)\, Q_t(s,c) + \omega\, H_t(s,c) \tag{3}$$

where, at each trial t, state s, and chosen option c, ω is the arbiter, Q is the absolute Q value, and H is the habitual choice trace component. The weight ω is fitted as an additional parameter (for ω = 0, the model reduces to the ABSOLUTE model) and governs the relative influence of each controller.
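A minimal Python sketch of the HABIT decision rule (Eq. 3) follows; the delta-rule form of the choice-trace update is one plausible reading of "gives a bonus to the selected action", not necessarily the authors' exact implementation:

```python
import numpy as np

def habit_decision_weights(Q, H, omega):
    """Eq. 3: weighted sum of absolute Q values and a habitual choice trace H.
    omega = 0 reduces the model to the ABSOLUTE model."""
    return (1.0 - omega) * Q + omega * H

def update_choice_trace(H, chosen, alpha_h):
    """Move the trace of the selected action toward 1 and of the others toward 0."""
    target = np.zeros_like(H)
    target[chosen] = 1.0
    return H + alpha_h * (target - H)
```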

We found that the HABIT model, similarly to the ABSOLUTE model, fails to fully match the participants’ behavior, especially in the ∆EV = 0.5 and ∆EV = 1.75 contexts (Fig. 5A). In the learning phase, the addition of a habitual component is not enough to compensate for the difference in option values, and therefore, the model simulations in the transfer phase fail to match the observed choice pattern (Fig. 5B). This is because the HABIT model encodes values on an absolute scale and does not manage to develop a strong preference for the correct response in the ∆EV = 0.5 context in the first place (Fig. 5A). Thus, it does not carry a choice trace strong enough to overcome the absolute value of the correct response in the ∆EV = 1.75 context (Fig. 5B; fig. S2, A and B; and Table 3). Quantitative comparison of the RANGE and the HABIT models’ capacity to predict the transfer phase choices numerically favored the RANGE model, reaching only marginal statistical significance [out-of-sample log-likelihood LLRAN versus LLHAB, t(799) = 1.77, P = 0.07, and d = 0.05; Table 3]. To summarize, a model assuming absolute value encoding coupled with a habitual component could not fully explain the observed choices in the learning and transfer phases.

Fig. 5 Ruling out alternative models and validation in an additional experiment.

Model simulations of the HABIT, the UTILITY, and the RANGE models (dots) over the behavioral data (mean and 95% confidence interval) in each context. (A and C) Simulated data in the learning phase were obtained with the parameters fitted in half the data (the ∆EV = 5.0 and the ∆EV = 0.5 contexts on the leftmost part of the panel) of the learning phase. Simulated data in the transfer phase were obtained with the parameters fitted in all the contexts of the learning phase. (B and D) Data and simulations in the context ∆EV = 1.75 only. (E and F) Behavioral data from Bavard et al. (6). Comparing the full RANGE model to its simplified version RMAX in the learning phase (correct choice rate per choice context) and in the transfer test (choice rate per option). This study included both gain-related contexts (with +1€, +0.1€, and 0.0€ as possible outcomes) and loss-related contexts (with −1€, −0.1€, and 0.0€ as possible outcomes) in the learning phase. Choice rates in the transfer phase are ordered as a function of decreasing expected value as in (6).

Ruling out diminishing marginal utility

One of the distinguishing behavioral signatures of the RANGE model is that it predicts very similar correct choice rates in the ∆EV = 5.00 and the ∆EV = 0.50 contexts, in line with the behavioral data, while both the ABSOLUTE and the HABIT models predict a large drop in performance in the ∆EV = 0.50 context that directly stems from its small difference in expected value. It could be argued that this result arises from the fact that expected utilities (and not expected values) are learned in our task. Specifically, a diminishing marginal utility parameter would blunt differences in outcome magnitudes and would imply that choices are made by comparing outcome probabilities. This process could also explain the preference for the suboptimal option in the ∆EV = 1.75 context, because the optimal option in the ∆EV = 1.75 context is rewarded (10 points) only 25% of the time, while the suboptimal option is rewarded (1 point) 75% of the time. To rule out this interpretation, we fitted and simulated a UTILITY model, which updates Q values based on reward utilities calculated from the absolute reward as follows

$$R_{UTI,t} = (R_{OBJ,t})^{\nu} \tag{4}$$

where the exponent ν is the utility parameter (0 < ν < 1; for ν = 1, the model reduces to the ABSOLUTE model). We found an empirical average value of ν = 0.32 (±0.01 SEM).
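A one-line sketch of the utility transform (Eq. 4), with the average fitted exponent reported above:

```python
def utility_reward(r_obj, nu=0.32):
    """Eq. 4: diminishing-marginal-utility transform of the objective reward.
    nu = 1 reduces the model to the ABSOLUTE model."""
    return r_obj ** nu

# With nu = 0.32, the 10-point outcome is compressed to 10 ** 0.32 ≈ 2.1, while the
# 1-point outcome remains 1, so choices are driven mostly by outcome probabilities.
```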

We found that the UTILITY model, similarly to the RANGE model, captures the participants’ behavior in the learning phase quite well (Fig. 5C). However, concerning the transfer phase (especially the ∆EV = 1.75 context), it fails to capture the observed pattern (Fig. 5, C and D). Additional analyses suggest that this is specifically driven by the experiments where feedback was provided during the transfer phase (Fig. 5D). The static nature of the UTILITY model cannot accommodate the fact that the preferences in the ∆EV = 1.75 context can be reversed by providing complete feedback (fig. S2, C and D). Quantitative model comparison showed that the RANGE model also outperformed the UTILITY model in predicting the transfer phase choices [out-of-sample log-likelihood LLRAN versus LLUTI, t(799) = 3.21, P = 0.001, and d = 0.06; Table 3]. To summarize, a model assuming diminishing marginal utilities could not fully explain the observed choices in the transfer phase.

Suboptimality of range adaptation in our task

The RANGE model is computationally more complex compared to the ABSOLUTE model, as it presents additional internal variables (RMAX and RMIN), which are learnt with a dedicated parameter. Here, we wanted to assess whether this additional computational complexity really paid off in our task.

We split the participants according to the sign of the out-of-sample likelihood difference between the RANGE and the ABSOLUTE model: If positive, then the RANGE model better explains the participant’s data (RAN > ABS); if negative, the ABSOLUTE model does (ABS > RAN). Reflecting our overall model comparison result, we found more participants in the RAN > ABS category than in the ABS > RAN category (n = 545 versus n = 255).

We found no main effect of the winning model on overall (both phases) performance [F1,798 = 0.03, P = 0.87, and ηp2 = 0]. We found that while RANGE encoding is beneficial and allows for better performance in the learning phase, it leads to worse performance in the transfer phase [F1,798 = 187.3, P < 0.0001, and ηp2 = 0.19; Fig. 6A]. In other terms, in our task, the learning phase and the transfer phase seem to be engaged in a tug of war: When performance is pulled in favor of the learning phase, it comes at the cost of the transfer phase (and vice versa).

Fig. 6 The financial cost of relative value learning.

(A) Correct choice rate in the learning phase (blue) and the transfer phase (pink). Participants are split as a function of the individual difference in out-of-sample log-likelihood between the ABSOLUTE and the RANGE models. ABS > RAN participants are better explained by the ABS model (positive difference, n = 255). RAN > ABS participants are better explained by the RAN model (negative difference, n = 545). (B) Actual and simulated money won in pounds over the whole task (purple), the learning phase only (blue), and the transfer phase only (pink). Points indicate individual participants, areas indicate probability density function, boxes indicate confidence interval, and error bars indicate SEM. Dots indicate model simulations of ABSOLUTE (white) and RANGE (black) models.

A second question is whether, overall in our study, behaving as a RANGE model turns out to be economically advantageous. To answer this question, we compared the final monetary payoff in the real data to that obtained in model simulations using the participant-level best-fitting parameters. Consistent with the task design, we found that the monetary outcome was higher in the transfer phase than in the learning phase [transfer gains M = 2.16 ± 0.54, learning gains M = 1.99 ± 0.35, t(799) = 8.71, P < 0.0001, and d = 0.31]. Crucially, we found that simulating the RANGE model yields significantly lower monetary earnings (ABSOLUTE versus RANGE, t(799) = 19.39, P < 0.0001, and d = 0.69; Fig. 6B). This result indicates that despite being locally adaptive (in the learning phase), range adaptation is economically disadvantageous in our task, thus supporting the idea that it is the consequence of an automatic, uncontrolled process.

Validation of range adaptation in previous dataset

The first eight experiments featured only positive outcomes (in addition to 0). Because, in our model, the state-level variables (RMAX and RMIN) are initialized to 0, RMAX converges to the maximum outcome value in each choice context, while RMIN remains 0 in every trial and choice context. This setup is therefore not ideal to test the full normalization rule that we are proposing here. To obviate this limitation, we reanalyzed a ninth dataset (n = 60) from a previously published study on a related topic (6). Crucially, in addition to manipulating outcome magnitude (“10c” versus “1€”, similar to our learning phase), this study also manipulated the valence of the outcomes (gain versus loss). This latter manipulation allowed us to assess situations where the value of RMIN should change and converge to negative values, thus allowing us to compare the full range normalization rule to its simplified version

$$\frac{R_{OBJ} - R_{MIN}}{R_{MAX} - R_{MIN}} \quad \text{versus} \quad \frac{R_{OBJ}}{R_{MAX}}$$

We note that in this ninth dataset outcomes can take both negative and positive values: −1€, −0.1€, 0.0€, +0.1€, and +1.0€. We later refer to the simplified version of the model as the RMAX model. Model simulations show that while the RMAX model can capture the learning and transfer phase patterns for the gain-related options, it fails to do so for the loss-related options (Fig. 5, E and F). In the loss-related contexts (where the maximum possible outcome is 0) outcome value normalization can only rely on RMIN. Because the RMAX model does not take into account RMIN, it is doomed to encode loss-related outcomes on an objective scale.
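As a worked illustration (keeping the "+1" of Eq. 2 in the denominator and assuming the range variables have converged), consider a loss context whose possible outcomes are −1€ and 0€:

$$\text{RMAX only } (R_{MAX}=0): \quad \frac{R_{OBJ}}{R_{MAX}+1}=R_{OBJ}, \qquad -1 \mapsto -1,\;\; 0 \mapsto 0$$

$$\text{Full rule } (R_{MIN}=-1,\; R_{MAX}=0): \quad \frac{R_{OBJ}-R_{MIN}}{R_{MAX}-R_{MIN}+1}=\frac{R_{OBJ}+1}{2}, \qquad -1 \mapsto 0,\;\; 0 \mapsto 0.5$$

Only the full rule rescales the loss outcomes onto a bounded scale comparable to the gains.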

On the other hand, by updating both RMAX in the gain contexts and RMIN in the loss contexts, the RANGE model can normalize outcomes in all contexts and is able to match participants’ choice preferences concerning both loss-related and gain-related options in the learning and the transfer phases (Fig. 5, E and F). To conclude, this final analysis is consistent with the idea that range adaptation takes the form of a range normalization rule, which takes into account both the maximum and the minimum possible outcomes.

DISCUSSION

In the present paper, we investigated context-dependent reinforcement learning, more specifically range adaptation, in a large cohort of human participants tested online on eight different variants of a behavioral task. Building on previous studies of context-dependent learning, the core idea of the task is to juxtapose an initial learning phase with fixed pairs of options (featuring either small or large outcomes) with a subsequent transfer phase where options are rearranged in new pairs (mixing up small and large outcomes) (6, 7, 10). In some experiments, we directly reduced task difficulty by reducing outcome uncertainty through complete feedback. In other experiments, we indirectly modulated task difficulty by clustering in time the trials of a given context, thereby reducing working memory demand. Last, in some experiments, feedback was also provided in the transfer phase.

Behavioral findings

As expected, the correct choice rate in the learning phase was higher when the feedback was complete, which indicates that participants integrated the outcome of the forgone option when it was presented (8, 14). Also as expected, in the learning phase, participants displayed a higher correct choice rate when the trials of a given context were blocked together, indicating that reducing working memory demands facilitates learning (15). Replicating previous findings, we also found that, overall, the correct response rate was slightly but significantly higher in the big magnitude contexts (∆EV = 5.0), but the difference was much smaller than what one would expect assuming unbiased value learning and representation [as shown by the ABSOLUTE model simulations (6)]: a pattern consistent with partial range adaptation. The outcome magnitude–induced difference in correct choice rate was significantly smaller, and not different from zero, in the block experiments (full adaptation), thus providing a first indication that reducing task difficulty increases range adaptation. Although learning phase performance was fully consistent with our hypothesis, the crucial evidence comes from the results of the transfer phase. The overall correct response rate pattern in the transfer phase did not follow that of the learning phase. Complete feedback and block design had no direct beneficial effect on transfer phase performance. In fact, the worst transfer phase performance was obtained in a complete feedback and block design experiment. This was particularly notable in the ∆EV = 1.75 condition, where participants significantly preferred the suboptimal option and, again, the worst score was obtained in a complete feedback and block design experiment. Crucially, we ruled out that the comparatively low performance in the transfer phase was due to having forgotten the values of the options. Because the transfer phase occurs, by definition, after the learning phase, it is conceivable (although very unlikely, as the two phases were only a few seconds apart) that a drop in performance is due to the progressive forgetting of the option values. Two features of the correct choice rate curves allowed us to reject this interpretation: (i) The correct choice rate abruptly decreased just after the learning phase; (ii) when feedback was not provided, the choice rate remained perfectly stable, with no sign of regression to chance level. Conversely, when feedback was provided in the transfer phase, the correct choice rate increased to reach (on average) the level attained at the end of the learning phase. The results are therefore consistent with the idea that in the transfer phase, participants express context-dependent option values acquired during the learning phase, which entails a first counterintuitive phenomenon: Even though the transfer phase is performed immediately after the learning phase, the correct choice rate drops. This is due to the rearrangement of the options in new choice contexts, where options that were previously optimal choices (in the small magnitude contexts) become suboptimal choices. We also observed a second counterintuitive phenomenon: Factors that increase performance during the learning phase (i.e., increasing feedback information and reducing working memory load) paradoxically further undermined the transfer phase correct choice rate.
The conclusions based on these behavioral observations were confirmed by inferring the most plausible option values from the observed choices, which allowed us to compare the objective ranking of the options to their subjective estimates. The only experiment where we observed an almost monotonic ranking was the partial feedback/interleaved experiment, even though we observed no significant difference between the EV = 2.5 and the EV = 0.75 options there. In all the other experiments, the EV = 0.75 option was valued more than the EV = 2.5 option, with the largest difference observed in the complete feedback/block design. Thus, in notable opposition to the almost universally shared intuition that reducing task difficulty should lead to more accurate subjective estimates, here we present a clear instance where the opposite is true.

Computational mechanisms

The observed behavioral results were satisfactorily captured by a parsimonious model (the RANGE model) that instantiated a dynamic range normalization process. Specifically, the RANGE model learns in parallel context-dependent variables (RMAX and RMIN) that are used to normalize the outcomes. The variables RMAX and RMIN are learnt incrementally, and the speed determines the extent of the normalization, leading to partial or full range adaptation as a function of a dedicated free parameter: the contextual learning rate. Developing a new model was necessary, as previous models of context-dependent reinforcement learning did not include range adaptation and focused on different dimensions of context dependence (reference point centering and outcome comparison) (7, 8). The model also represents an improvement over a previous study where we instantiated partial range adaptation assuming a perfect and innate knowledge about the outcome ranges and a static hybridization between relative and absolute outcome values (6).

One limitation is that, in the present formulation, RMAX and RMIN can only grow and decrease, respectively. This feature is well suited for our task, which features static contingencies, but it may not correspond to many other laboratory-based and real-life situations, where the outcome range can drift over time. This limitation could be overcome by assuming, for example, that RMAX is also updated, at a smaller rate, when the observed outcome is smaller than the current RMAX (and the opposite for RMIN). We also note that our model applied to the main eight experiments (where RMIN was irrelevant) can be seen as a special case of a divisive normalization process [temporal normalization (20)]. To verify the relevance of the full range normalization rule, we reanalyzed a previous dataset involving negative outcomes, where we were able to show that both RMAX and RMIN were important to explain the full spectrum of the behavioral results. However, we acknowledge that additional functional forms of normalization could and should be considered in future studies to settle the issue of the exact algorithmic implementation of outcome normalization. Last, it is worth noting that range normalization has been shown to perform poorly in explaining context-dependent decision-making in other (i.e., not reinforcement learning) paradigms (17, 21, 22), opening the possibility that the normalization algorithm differs between experience-based and description-based choices. Future research contrasting different outcome ranges and multiple-option tasks is required to firmly determine which functional forms of normalization are better suited for both experience-based and description-based choices (23).

We compared and ruled out another plausible computational interpretation derived from learning theory (24, 25). Specifically, we considered a habit formation model (16). We reasoned that our transfer phase results (and particularly the value inversion in the ∆EV = 1.75 context) could derive from participants choosing on the basis of a weighted average between objective values and past choice propensities. In the learning phase, the suboptimal option in the ∆EV = 1.75 context (EV = 0.75) was chosen more frequently than the optimal option (EV = 2.5). However, model simulations showed that the HABIT model was not capable of explaining the observed pattern. In the learning phase, the HABIT model, just like the ABSOLUTE model, did not develop a preference for the EV = 0.75 option strong enough to generate a habitual trace sufficient to explain the transfer phase pattern. Beyond model simulation comparisons, we believe that this interpretation could have been rejected on the basis of a priori arguments. The HABIT model can be conceived as a way to model habitual behavior, i.e., responses automatically triggered by stimulus-action associations. However, both in real life and in laboratory experiments, habits have been shown to be acquired over time scales (days, months, and years) orders of magnitude longer than the time frame of our experiments (26, 27). It is debatable whether participants in our task developed even a sense of familiarity toward the (never seen before) abstract cues that we used as stimuli. The HABIT model can also be conceived as a way to model choice hysteresis, sometimes referred to as choice repetition or perseveration bias, which could arise from a form of sensory-motor facilitation, where recently performed actions become facilitated (19, 28). However, in our case, the screen position of the stimuli was randomized on a trial-by-trial basis, and most of the experiments involved an interleaved design, thus precluding any strong role for sensory-motor facilitation–induced choice inertia.

We also compared and ruled out a plausible computational interpretation derived from economic theory (29). Since the pioneering work of Daniel Bernoulli [1700 to 1782 (30)], risk aversion has been explained by assuming diminishing marginal utility of objective outcomes. In the limit, if diminishing marginal utility applied in our case, then the utility of 10 points could be perceived as close to the utility of 1 point. In this extreme scenario, choices would be based only on the comparison between the outcome probabilities. This could explain most aspects of the choice pattern. Indeed, the UTILITY model did a much better job than the HABIT model. However, compared to the RANGE model, it failed to reproduce the observed behavior in the experiments where feedback was provided in the transfer phase. This naturally results from the fact that the model assumes diminishing marginal utility to be a static property of the outcomes and therefore cannot account for experience-dependent correction of context-dependent biases. However, also in this case, a priori considerations could have ruled out the UTILITY interpretation. Our experiment involves stakes small enough to make diminishing marginal utility implausible. Rabin provides a full treatment of this issue and shows that explaining risk aversion for small stakes (such as those used in the laboratory) with diminishing marginal utility leads to extremely unlikely predictions, such as turning down gambles with infinite positive expected value (15). Indeed, if anything, following the intuition of Markowitz (31), most realistic models of the utility function suppose risk neutrality (or risk seeking) for small gains.

Our results contribute to the old and still ongoing debate about whether the brain computes option-oriented values independently from the decision process itself (2, 32). On one side of the spectrum, decision theories such as expected utility theory and prospect theory postulate that a value is attached to each option independently of the other options simultaneously available (32). On the other side of the spectrum, other theories, such as regret theory, postulate that the value of an option is primarily determined by the comparison with other available options (33). A similar gradient exists in the reinforcement learning framework, between methods such as Q-learning, on one side, and direct policy learning without value computations, on the other side (34). Recent studies in humans, coupling imaging to behavioral modeling, provided some support for direct policy learning by showing that, in complete feedback tasks, participants’ learning was driven by a teaching signal essentially determined by the comparison between the obtained and the forgone outcomes (essentially a regret/relief signal) (7, 35). Beyond behavioral model comparison, analysis of neural activity in the ventral striatum [a brain system traditionally thought to encode option-specific prediction errors (36)] was also consistent with direct policy learning. However, while our findings clearly falsify the Q-learning assumption that option values are learned on a context-independent (or objective) scale, model simulations also reject the other extreme view of direct policy learning (see the Supplementary Materials). Our results are rather consistent with a hybrid scenario where option-specific values are initially encoded on an objective scale and are progressively normalized to eventually represent the context-specific rank of each option. This view is also consistent with previous results using tasks including loss-related options, which clearly showed that option valence was taken into account in transfer learning performance (6, 8). Of note, the notion of “valence” (negative versus positive) is unknown to direct policy learning methods. However, several studies using similar paradigms clearly show that other behavioral measures, such as reaction times and confidence, are strongly affected by the valence of the learning context, thus providing additional evidence against pure direct policy learning methods (13, 37). Last, consistent with our intermediate view, other imaging studies found value-related representations more consistent with a partial normalization process (38, 39).

Last, we note that our computational analysis is at the algorithmic and not at the implementational level (40). In other terms, the RANGE model is a model of the mathematical operations that are performed to achieve a computational goal (i.e., to normalize outcomes to bound subjective option values between 0 and 1). To do so, our model learns two context-level variables (RMAX and RMIN), whose values are unbounded (they converge to their objective values). The present treatment is silent on how these context-level variables are represented at the neural level. While it is certain that coding constraints will also apply to these context-level variables (RMAX and RMIN), further modeling and electrophysiological work is needed to address this important issue.

To conclude, we demonstrated that in humans, reinforcement learning values are learnt in a context-dependent manner that is compatible with range adaptation (instantiated as a range normalization process) (41). Specifically, we tested the possibility that this normalization automatically results from the way outcome information is processed (42), by showing that the lower the task difficulty, the fuller the range adaptation. This leads to a paradoxical result: Reducing task difficulty can, on some occasions, decrease choice optimality. This unexpected result can be understood with a perceptual analogy. Going into a dark room forces us to adapt our retinal response to the dark, so that when we go back into bright light, we do not see very well. The longer we are exposed to dim light, the stronger the effect when we return to normal lighting.

Our findings fit into the debate about whether the computational processes leading to suboptimal decisions should be considered flaws or features of human cognition (43, 44). Range-adapting reinforcement learning is clearly adaptive in the learning phase. We could hypothesize that the situations in which the process is adaptive are more frequent in real life; in other terms, the performance of the system has to be evaluated as a function of the tasks it has been selected to solve. Returning to the perceptual analogy, it is true that we may be hit by a bus when exiting a dark room because we do not see well, but on average, the benefit of sharper perception in the dark room is large enough to compensate for the (rare) event of a bus waiting for us outside. Ultimately, whether context-dependent reinforcement learning should be considered a flaw or a desirable feature of human cognition should be determined by comparing the real-life frequency of the situations in which it is adaptive (as in the learning phase) with that of the situations in which it is maladaptive (as in the transfer phase). While our study does not settle this issue, our findings do demonstrate that this process induces, in some circumstances, economically suboptimal choices. Whether the same process is responsible for maladaptive economic behavior in real-life situations will have to be addressed by future studies using more ecological settings and field data (45).

MATERIALS AND METHODS

Participants

For the laboratory experiment, we recruited 40 participants (28 females, aged 24.28 ± 3.05 years) via internet advertising on a local mailing list dedicated to cognitive science–related activities. For the online experiments, we recruited 8 × 100 participants (414 females, aged 30.06 ± 10.10 years) from the Prolific platform (www.prolific.co). The online sample size was determined by a power analysis based on the behavioral results of the laboratory experiment. In the ∆EV = 1.75 context, laboratory participants reached a difference between choice rate and chance (0.5) of 0.11 ± 0.30 (mean ± SD). To detect the same effect with a power of 0.95, the MATLAB function “samsizepwr.m” indicated a required sample size of 99 participants, which we rounded to 100. The research was carried out following the principles and guidelines for experiments including human participants provided in the Declaration of Helsinki (1964, revised in 2013). The INSERM Ethical Review Committee/IRB00003888 approved the study on 13 November 2018, and participants provided written informed consent before their inclusion. To sustain motivation throughout the experiment, participants were given a bonus depending on the number of points won in the experiment [average money won in pounds: 4.14 ± 0.72; average performance against chance: 0.65 ± 0.13, t(799) = 33.91, and P < 0.0001]. A laboratory-based experiment was originally performed (n = 40) to ascertain that online testing would not significantly affect the main conclusions; its results are presented in the Supplementary Materials.
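For readers without MATLAB, the following is a minimal Python sketch of the same one-sample power calculation (using statsmodels instead of “samsizepwr.m”; the effect size is taken from the laboratory data reported above, and the script is ours, not part of the original analysis pipeline).

```python
# Minimal sketch: sample size for a one-sample t test against chance,
# assuming the laboratory effect reported above (mean 0.11, SD 0.30),
# alpha = 0.05 and a target power of 0.95.
from statsmodels.stats.power import TTestPower

effect_size = 0.11 / 0.30  # Cohen's d for the one-sample comparison
n_required = TTestPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.95, alternative="two-sided"
)
print(round(n_required))   # approximately 99, rounded up to 100 in the study
```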

Behavioral tasks

Participants performed an online version of a probabilistic instrumental learning task adapted from previous studies (6). After checking the consent form, participants received written instructions explaining how the task worked and that their final payoff would be affected by their choices in the task. During the instructions, the possible outcomes in points (0, 1, and 10 points) were explicitly shown, as well as their conversion rate (1 point = £0.005). The instructions were followed by a short training session of 12 trials aimed at familiarizing the participants with the response modalities. Participants could repeat the training session up to two times and then started the actual experiment.

In our task, options were materialized by abstract stimuli (cues) taken from randomly generated identicons, colored such that the subjective hue and saturation were very similar according to the HSLUV color scheme (www.hsluv.org). On each trial, two cues were presented on either side of the screen. The side on which a given cue was presented was pseudo-randomized, such that a given cue appeared an equal number of times on the left and on the right. Participants selected between the two cues by clicking on one of them. The choice window was self-paced. After the choice was recorded, a brief delay (500 ms) was followed by the outcome, displayed for 1000 ms. There was no fixation screen between trials. The average reaction time was 1.36 ± 0.04 s (median, 1.16), and the average experiment completion time was 325.24 ± 8.39 s (median, 277.30).

As in previous studies, the full task consisted of a learning phase followed by a transfer phase (6–8, 46). During the learning phase, cues appeared in four fixed pairs. Each pair was presented 30 times, leading to a total of 120 trials. Within each pair, the two cues were associated with a zero and a nonzero outcome with reciprocal probabilities (0.75/0.25 and 0.25/0.75). At the end of the trial, the cues disappeared and the selected one was replaced by the outcome (“10,” “1,” or “0”) (Fig. 1A). In experiments E3, E4, E7, and E8, the outcome corresponding to the forgone option (sometimes referred to as the counterfactual outcome) was also displayed (Fig. 1C). Once they had completed the learning phase, participants were shown the total points earned and their monetary equivalent.

During the transfer phase, which followed the learning phase, the cues were rearranged into four new pairs. The probability of obtaining a specific outcome remained the same for each cue (Fig. 1B). Each new pair was presented 30 times, leading to a total of 120 trials. Before the beginning of the transfer phase, participants were told that they would be presented with the same cues, but that the pairs would not necessarily have been displayed together before. To prevent explicit memorization strategies, participants were not informed that they would have to perform a transfer phase until the end of the learning phase. After making a choice, the cues disappeared. In experiments E1, E3, E5, and E7, participants were not informed of the outcome of the choice on a trial-by-trial basis, and the next trial began after 500 ms; this was specified in the instruction phase. In experiments E2, E4, E6, and E8, participants were informed about the result of their choices on a trial-by-trial basis, and the outcome was presented for 1000 ms. In all experiments, participants were informed about the total points earned at the end of the transfer phase. In addition to the presence/absence of feedback, the experiments differed in two other factors. Feedback information could be either partial (experiments E1, E2, E5, and E6) or complete (experiments E3, E4, E7, and E8; meaning that the outcome of the forgone option was also shown). When the transfer phase included feedback, the information factor was the same as in the learning phase. Trial structure was also manipulated, such that in some experiments (E5, E6, E7, and E8), all trials of a given choice context were clustered (“blocked”), whereas in the remaining experiments (E1, E2, E3, and E4), they were interleaved, in both the learning phase and the transfer phase (Fig. 1C).
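To make the trial-structure factors concrete, the sketch below (illustrative Python, not the original task code; context labels and function names are ours) builds a 120-trial learning-phase sequence in either the blocked or the interleaved arrangement and samples probabilistic outcomes as described above.

```python
# Illustrative sketch of the trial structure; not the original task code.
# Four choice contexts, 30 repetitions each. Within a context, one cue yields
# its nonzero outcome (10 or 1 point) with p = 0.75 and the other with p = 0.25.
import random

N_REPETITIONS = 30
CONTEXTS = ["pair1", "pair2", "pair3", "pair4"]  # placeholder labels

def build_trial_list(blocked: bool) -> list:
    """Return a 120-trial sequence of context labels, blocked or interleaved."""
    trials = [c for c in CONTEXTS for _ in range(N_REPETITIONS)]
    if blocked:
        # keep each context's 30 trials together; only the block order varies
        blocks = [trials[i:i + N_REPETITIONS] for i in range(0, len(trials), N_REPETITIONS)]
        random.shuffle(blocks)
        return [t for block in blocks for t in block]
    random.shuffle(trials)  # fully interleaved
    return trials

def sample_outcome(magnitude: int, p_nonzero: float) -> int:
    """One cue's probabilistic outcome: `magnitude` points with p_nonzero, else 0."""
    return magnitude if random.random() < p_nonzero else 0
```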

Reanalysis of a previous experiment involving gain and losses

In the present paper, we also include new analyses of previously published experiments (6). The general design of the previous experiments is similar to that of the present experiments, as they also involved a learning phase and a transfer phase. However, the previous experimental designs differed from the present one in several important respects. First, in addition to an outcome magnitude manipulation (“10c” versus “1€”, similar to our learning phase), the study also manipulated the valence of the outcomes (gain versus loss), generating a 2 × 2 factorial design. In the gain contexts, participants had to maximize gains, whereas in the loss contexts, they could only minimize losses. As in the other experiments, outcomes were probabilistic (75 or 25%), and an option was associated with only one type of nonzero outcome. Second, the organization of the transfer phase was quite different: each option was compared with all other possible options. The main dependent variable extracted from the transfer phase is therefore not the correct response rate but the choice rate per option (which is proportional to its subjective value). The data were pooled across two experiments featuring partial (n = 20) and partial-and-complete feedback trials (n = 40). In both experiments, the choice contexts were interleaved. Other differences include the fact that these previous experiments were laboratory-based and featured a slightly different number of trials, different stimuli, and different timing [see (6) for more details].

Analyses

Behavioral analyses. The main dependent variable was the correct choice rate, i.e., the rate of choices directed toward the option with the highest expected value. Statistical effects were assessed using multiple-way repeated measures analyses of variance (ANOVAs) with choice context (labeled in the manuscript by the difference in expected values: ΔEV) as a within-participant factor, and feedback information, feedback in the transfer phase, and task structure as between-participant factors. Post hoc tests were performed using one-sample and two-sample t tests for within- and between-experiment comparisons, respectively. To assess overall performance, additional one-sample t tests were performed against chance level (0.5). We report the t statistic, P value, and Cohen’s d (two-sample t tests only) to estimate effect size. Given the large sample size (n = 800), the central limit theorem allows us to assume a normal distribution of our overall performance data and to apply the properties of the normal distribution in our statistical analyses, as well as sphericity hypotheses. For the ANOVAs, we report the uncorrected F statistic and P value, applying the Huynh-Feldt correction for repeated measures when applicable (47), together with partial eta-squared (ηp²) and generalized eta-squared (η²; when the Huynh-Feldt correction is applied) to estimate effect size. All statistical analyses were performed using MATLAB (www.mathworks.com) and R (www.r-project.org). For visual purposes, learning curves were smoothed using a moving average filter (span of 5 in MATLAB’s smooth function).
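As an illustration of these analyses, the following is a minimal Python sketch (not the authors’ MATLAB/R scripts; variable names and the placeholder data are ours) of the test against chance, a Cohen’s d helper, and a span-5 moving-average smoothing analogous to MATLAB’s smooth function.

```python
# Minimal sketch of the descriptive statistics; placeholder data, not real results.
import numpy as np
from scipy import stats

correct = np.random.binomial(1, 0.65, size=(800, 120))  # participants x trials (0/1)

# One-sample t test of overall accuracy against chance (0.5)
accuracy = correct.mean(axis=1)
t_stat, p_value = stats.ttest_1samp(accuracy, popmean=0.5)

# Cohen's d for a two-sample (between-experiment) comparison
def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    return (x.mean() - y.mean()) / pooled_sd

# Span-5 moving average of the group learning curve (for plotting only)
learning_curve = correct.mean(axis=0)
smoothed = np.convolve(learning_curve, np.ones(5) / 5, mode="same")
```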

Models

We analyzed our data with variations of simple reinforcement learning models (48, 49). The goal of all models is to estimate, in each choice context (or state), the expected reward (Q) of each option and to pick the option that maximizes this expected reward.

At trial t, option values of the current context s are updated with the delta rule

$$Q_{t+1}(s,c) = Q_t(s,c) + \alpha_c \delta_{c,t} \tag{5}$$

$$Q_{t+1}(s,u) = Q_t(s,u) + \alpha_u \delta_{u,t} \tag{6}$$

where αc is the learning rate for the chosen (c) option and αu the learning rate for the unchosen (u) option, i.e., the counterfactual learning rate. δc and δu are prediction error terms calculated as follows

$$\delta_{c,t} = R_{c,t} - Q_t(s,c) \tag{7}$$

$$\delta_{u,t} = R_{u,t} - Q_t(s,u) \tag{8}$$

δc is calculated in both partial and complete feedback experiments, and δu is calculated in the experiments with complete feedback only.

We modeled participants’ choice behavior using a softmax decision rule representing the probability for a participant to choose one option a over the other option b

$$P_t(s,a) = \frac{1}{1 + e^{\left(Q_t(s,b) - Q_t(s,a)\right)\beta}} \tag{9}$$

where β is the inverse temperature parameter. High temperatures (β → 0) cause all actions to be (nearly) equiprobable, whereas low temperatures (β → +∞) cause a greater difference in selection probability for actions that differ in their value estimates (48).
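For concreteness, the following minimal Python sketch implements Eqs. 5 to 9 for a single trial of the ABSOLUTE model; variable names are ours, and the counterfactual update is applied only in the complete-feedback experiments.

```python
# Sketch of one trial of the ABSOLUTE model (Eqs. 5 to 9); names are illustrative.
import numpy as np

def softmax_choice_prob(q_a: float, q_b: float, beta: float) -> float:
    """Probability of choosing option a over option b (Eq. 9)."""
    return 1.0 / (1.0 + np.exp((q_b - q_a) * beta))

def update_q_values(Q, s, chosen, unchosen, r_chosen, r_unchosen,
                    alpha_c, alpha_u, complete_feedback):
    """Delta-rule update of the chosen (and, if shown, unchosen) option values (Eqs. 5 to 8)."""
    Q[s][chosen] += alpha_c * (r_chosen - Q[s][chosen])              # Eqs. 5 and 7
    if complete_feedback:
        Q[s][unchosen] += alpha_u * (r_unchosen - Q[s][unchosen])    # Eqs. 6 and 8
    return Q
```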

We compared four alternative computational models: the ABSOLUTE model, which encodes outcomes on an absolute scale independently of the choice context in which they are presented; the RANGE model, which tracks the value of the maximum reward in each context and normalizes the actual reward accordingly, rescaling rewards between 0 and 1; the HABIT model, which integrates action weights into the decision process; and the UTILITY model that assumes diminishing marginal utility.

ABSOLUTE model. The outcomes are encoded as the participants see them (i.e., at their objective value). In the eight online experiments, they are encoded as their actual value in points: ROBJ,t ∈ {10, 1, 0}. In the dataset retrieved from Bavard et al. (6), they are encoded as their actual value in euros: ROBJ,t ∈ {−1€, −0.1€, 0€, +0.1€, +1.0€}.

RANGE model. The outcomes (both chosen and unchosen) are encoded on a context-dependent relative scale. On each trial, the relative reward RRAN,t is calculated as follows

$$R_{RAN,t} = \frac{R_{OBJ,t} - R_{MIN,t}(s)}{R_{MAX,t}(s) - R_{MIN,t}(s) + 1} \tag{2}$$

As RMIN is initialized to zero and never changes in the eight online experiments, this model can be reduced to

$$R_{RAN,t} = \frac{R_{OBJ,t}}{R_{MAX,t}(s) + 1} \tag{10}$$

where s is the decision context (i.e., a combination of options) and RMAX and RMIN are context-dependent variables, initialized to 0 and updated at each trial t if the outcome is greater (or smaller, respectively) than its current value

$$R_{MAX,t+1}(s) = R_{MAX,t}(s) + \alpha_R \left( R_{OBJ,t} - R_{MAX,t}(s) \right) \text{ if } R_{OBJ,t} > R_{MAX,t}(s) \tag{11}$$

$$R_{MIN,t+1}(s) = R_{MIN,t}(s) + \alpha_R \left( R_{OBJ,t} - R_{MIN,t}(s) \right) \text{ if } R_{OBJ,t} < R_{MIN,t}(s) \tag{12}$$

Accordingly, outcomes are progressively normalized so that eventually RRAN,t ∈ [0, 1]. The chosen and unchosen option values and prediction errors are updated with the same rules as in the ABSOLUTE model. αR is an additional free parameter (the contextual, or range, learning rate) used to update the range variables. Note that the ABSOLUTE model is nested within the RANGE model (αR = 0).
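A minimal sketch of the outcome normalization of the RANGE model (Eqs. 2, 11, and 12) is given below; function names are ours, and the within-trial ordering of the update and normalization steps is left to the original scripts. Setting αR to 0 leaves RMAX and RMIN at their initial value of 0, so the sketch reduces to the identity transform used by the ABSOLUTE model.

```python
# Sketch of the RANGE model's range tracking and outcome normalization; names are ours.
def update_range(r_obj: float, r_max: float, r_min: float, alpha_r: float):
    """Update the context-level range variables when the outcome exceeds them (Eqs. 11 and 12)."""
    if r_obj > r_max:
        r_max += alpha_r * (r_obj - r_max)   # Eq. 11
    if r_obj < r_min:
        r_min += alpha_r * (r_obj - r_min)   # Eq. 12
    return r_max, r_min

def normalize_outcome(r_obj: float, r_max: float, r_min: float) -> float:
    """Range-normalized reward (Eq. 2); the +1 avoids division by zero when r_max == r_min."""
    return (r_obj - r_min) / (r_max - r_min + 1.0)
```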

HABIT model. The outcomes are encoded on an absolute scale, but decisions integrate a habitual component (16, 19). In addition to the Q values, a habitual (or choice trace) component H is tracked and updated (with a dedicated learning rate parameter) that takes into account the selected action (1 for the chosen option and 0 for the unchosen option). The choice is made with a softmax rule based on decision weights D that integrate the Q values and the habit values H

$$D_t(s,c) = (1 - \omega) \, Q_t(s,c) + \omega \, H_t(s,c) \tag{3}$$

where, at each trial t, state s, and chosen option c, D is the arbiter, Q is the goal-directed component (the matrix of Q values), and H is the habitual component. The weight ω is fitted as an additional free parameter and governs the relative contribution of values and habits (for ω = 0, the model reduces to the ABSOLUTE model).
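The sketch below shows one way to implement the choice-trace update and the arbitration of Eq. 3, consistent with the description above; the habit learning rate alpha_h and the function names are ours.

```python
# Sketch of the HABIT model's choice trace and decision weights (Eq. 3); names are ours.
def update_habit(H, s, chosen, unchosen, alpha_h: float):
    """Push the trace of the selected action toward 1 and of the other action toward 0."""
    H[s][chosen] += alpha_h * (1.0 - H[s][chosen])
    H[s][unchosen] += alpha_h * (0.0 - H[s][unchosen])
    return H

def decision_weight(q_value: float, habit: float, omega: float) -> float:
    """Arbitration between goal-directed values and habits (Eq. 3)."""
    return (1.0 - omega) * q_value + omega * habit
```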

UTILITY model. The outcomes are encoded as an exponentiation of the absolute reward, leading to a curvature of the value function (29)

$$R_{UTI,t} = \left( R_{OBJ,t} \right)^{\nu} \tag{4}$$

where the exponent ν is the utility parameter, with 0 < ν < 1 (for ν = 1, the model reduces to the ABSOLUTE model).
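As a small illustration of Eq. 4 (assuming nonnegative outcomes, as in the online experiments): with ν = 0.5, a 10-point outcome is compressed to about 3.16 points of utility, while a 1-point outcome is unchanged.

```python
# Sketch of the UTILITY model's outcome transformation (Eq. 4), for nonnegative outcomes.
def utility(r_obj: float, nu: float) -> float:
    """Concave transform r_obj ** nu with 0 < nu < 1 (nu = 1 recovers the ABSOLUTE model)."""
    return r_obj ** nu
```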

SUPPLEMENTARY MATERIALS

Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/7/14/eabe0340/DC1

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license (https://creativecommons.org/licenses/by-nc/4.0/), which permits use, distribution, and reproduction in any medium, so long as the resultant use is not for commercial advantage and provided the original work is properly cited.

REFERENCES AND NOTES

Acknowledgments: Funding: S.P. is supported by an ATIP-Avenir grant (R16069JS), the Programme Emergence(s) de la Ville de Paris, the Fondation Fyssen, the Fondation Schlumberger pour l’Education et la Recherche, the FrontCog grant (ANR-17-EURE-0017), and the Institut de Recherche en Santé Publique (IRESP, grant number 20II138-00). S.B. is supported by MILDECA (Mission Interministérielle de Lutte contre les Drogues et les Conduites Addictives) and the EHESS (Ecole des Hautes Etudes en Sciences Sociales). A.R. thanks the US Army for financial support (contract W911NF2010242). The funding agencies did not influence the content of the manuscript. Author contributions: S.B. and S.P. designed the experiments. S.B. ran the experiments. S.B. and S.P. analyzed the data. S.B., A.R., and S.P. interpreted the results. S.B. and S.P. wrote the manuscript. A.R. edited the manuscript. Competing interests: The authors declare that they have no financial or other competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials and are available from the GitHub repository (https://github.com/hrl-team/range). All custom scripts are also available from the same repository. Additional modified scripts can be accessed upon request.
