Research Article | NEUROSCIENCE

Reading between the lines: Listener’s vmPFC simulates speaker cooperative choices in communication games


Science Advances  03 Mar 2021:
Vol. 7, no. 10, eabe6276
DOI: 10.1126/sciadv.abe6276

Abstract

Humans have a remarkable ability to understand what is and is not being said by conversational partners. It has been hypothesized that listeners decode the intended meaning of a communicative signal by assuming speakers speak cooperatively, rationally simulating the speaker’s choice process and inverting it to recover the speaker’s most probable meaning. We investigated whether and how rational simulations of speakers are represented in the listener’s brain, by combining referential communication games with functional neuroimaging. We show that listeners’ ventromedial prefrontal cortex encodes the probabilistic inference of what a cooperative speaker should say given a communicative goal and context, even when such inferences are irrelevant for reference resolution. The listener’s striatum encodes the amount of update on intended meaning, consistent with inverting a simulated mental model. These findings suggest a neural generative mechanism, subserved by the frontal-striatal circuits, that underlies our ability to understand communicative and, more generally, social actions.

INTRODUCTION

A cornerstone of effective communication is our ability to read between the lines, that is, to recognize the intended meaning of a speaker, even when the meaning is not coded in the utterance directly. The process of disambiguating the implied meaning in context, often known as pragmatic interpretation, has long been hypothesized to rely on cooperation between communicators: A speaker tailors an utterance to help a listener recognize a meaning, and the listener recovers that meaning by assuming that the speaker spoke to be understood (1). The hypothesized role of cooperation has inspired a wealth of philosophical inquiries (2–5) and, more recently, empirical (6–8) and computational (9, 10) investigations into human communication, but little is known about the link between the key computational principles and underlying neural mechanisms.

Computationally, pragmatic interpretation requires a listener to identify the speaker’s underlying intention that motivated the choice of the utterance. One important class of models posits that this can be achieved via an internal generative process, similar to how the brain translates sensations into perceptions (10–13). Decades of work have suggested that the brain infers sensory causes (e.g., an object in sight) from their bodily effects (e.g., a retinal image) by modeling the sensation-generating process and then inverting this model to derive the most probable cause of the sensation (11, 14, 15). In communication, a similar strategy for a listener entails modeling the speaker’s decision-making process, that is, determining how an interacting web of causes (intention, context, and knowledge) gives rise to the choice of an utterance. It has been proposed that a listener simulates speaker behavior using a rational, goal-directed choice model (10). Speakers are expected to compare candidate expressions to make a choice for best helping the audience recognize the intended meaning in a given context. In addition, listeners need to monitor knowledge and beliefs shared with the speaker (common ground) for simulating speaker behavior based on mutual knowledge, rather than the listener’s own private knowledge (4, 6).

This generative account, also known as the Rational Speech Act (RSA) model (10), provides precise and falsifiable behavioral predictions in a parsimonious framework for social signal interpretation that can be extended from communication (10, 12, 13) to social perception (16, 17) and interpersonal decision-making (18). However, no direct evidence is available suggesting that understanding unstated intent in communication involves rational, context-specific simulation of speakers. In particular, it is unclear whether the putative mental simulation signals are represented in the listener’s brain and flexibly facilitate the utterance interpretation in varying contexts. It also remains to be explored whether the listener’s brain actively interrogates received utterances using automatically generated simulations of speakers or, given that such mental modeling is cognitively costly, produces internal estimation only when necessary (e.g., in the face of communicative ambiguity). A third unresolved question concerns how the mental simulation is internally represented—for example, whether different aspects of information (e.g., utterance, context, and common ground information) directly support or modulate the putative simulation signal in the listener’s brain, or whether some information is abstracted away during the simulation process.

We address these questions by combining tools and methods from computational pragmatics with those of neuroeconomics, a field that has provided substantial insights into the neural mechanisms of rational, cooperative behavior (19, 20). A particular strength of the neuroeconomic approach is that, by building on formal models of social behavior, it allows for connecting neurobiological data with specific, quantitative predictions for the underlying process. Although this approach has been successful in elucidating how the brain predicts actions and intentions of others with value-based goals to produce choices in a range of social interactions [e.g., trust building (21, 22), learning to anticipate others’ actions (23–27), coordination (28), and strategic reasoning (29–32)], the neurocognitive mechanisms of interpersonal communication have yet to be explored.

Specifically, using model-based functional magnetic resonance imaging (fMRI) (33), we investigated a simple yet well-characterized game of referential communication (Fig. 1A) (7, 34, 35). The experiment involved a listener and a speaker who were randomly matched in pairs in each trial and faced the same communicative context consisting of three objects with varying colors and shapes (Fig. 1B; see also Materials and Methods and figs. S1 and S2). The speaker was asked to refer to a target object by choosing between alternative expressions denoting either the color or shape of the target. The listener, who did not know the target, needed to recover the intended referent from the referring expression chosen by the speaker. Although speakers in this experiment did not describe the referents in complete detail, the rate of listeners recovering the targets was as high as 75.72 ± 0.61% (mean ± SEM), significantly greater than that resulting from literal interpretations (literal recovery rate = 66.88%, t40 = 14.53, P < 2 × 10−16) (fig. S3).

Fig. 1 Empirical approach and task schematic.

(A) Using model-based fMRI, we investigated the core assumption of the RSA framework that a listener rationally simulates the speaker’s decision-making process during communication. We tested this hypothesis in three communicative setups, designed for exploring the existence, generality, and sensitivity of rational simulation signals in the listener’s brain. (B) Schematic of the referential game in the first experimental setup, symmetric condition (top), where the presence of three geometric objects (context) is common knowledge shared between communicators (Materials and Methods). The context and target location (indicated by the arrow in the speaker screen) vary across trials. Referential interpretation is modeled as a Bayesian inferential process using RSA (bottom): A listener evaluates how likely a particular object is to be the target given the received expression and context by inverting a mental model of speaker behavior characterized by pragmatic likelihood. (C) Illustration of pragmatic likelihood computation. Pragmatic likelihood is characterized as a softmax function of the relative specificity between candidate references (e.g., “blue” or “circle”) associated with a potential target (e.g., the blue circle). The specificity of a reference is quantified by its Shannon information, negatively correlated with the number of objects in context that can be described by the referring expression (#blue = 2, #circle = 2 in this example).

The paradigm has three key advantages for providing a quantitative framework connecting listener choices with the underlying cognitive processes in the context of model-based fMRI: (i) a large number of trials for each listener, (ii) tight control of the decision space of communicators, and (iii) parametric manipulation of contexts. To eliminate potential learning effects, no feedback was provided to either the listener or the speaker during the experiment (Materials and Methods). Moreover, we used a well-established experimental setup with a population of many anonymously interacting listeners and speakers with low probability of reencounter (18, 23, 26). This setting, together with the random matching protocol, provides a natural model for scenarios such as one-shot communication between strangers in novel situations, minimizing the influence of higher-order belief considerations emerging from continuous conversation.

A total of 41 subjects were scanned while they participated as “listeners” against a separate pool of 60 anonymous “speakers” (Materials and Methods). The fMRI experiment contained three conditions, designed for exploring the existence, generality, and sensitivity of rational simulation signals in the listener’s brain (Fig. 1A). In the first condition (symmetric condition), we explored the neural encoding of a model-derived internal simulation signal that reflects the probabilistic inference of what a rational, cooperative speaker would say in context. To investigate the extent to which the internal simulation is involved in communication, we compared its neural encoding when it is, versus is not, necessary to predict speakers’ choices for referential interpretation. In the second condition (symmetric-garment condition), we performed the same experiment with altered stimuli, testing whether mental simulation depends on the specific eliciting stimuli. In the third condition (asymmetric condition), we evaluated the sensitivity of mental simulation signals in the brain to the common ground information shared between communicators, by experimentally manipulating the epistemic state of the speaker.

RESULTS

Computational model of behavior

First, we fitted listener choices in the symmetric condition with the RSA model, in which listeners anticipate speakers to choose the most specific (informative) expression to refer to a target (e.g., “square” is more specific than “blue” for denoting the blue square in Fig. 1B). By comparing the specificity between competing references, listeners simulate the probability that a speaker will choose a particular reference given a target and context [i.e., P(received expression|candidate target, context), henceforth pragmatic likelihood] and then invert the pragmatic likelihood with Bayes’ rule to derive the most probable referent (Fig. 1, B and C; Materials and Methods). The computational model, together with the task design in which we systematically varied reference specificity by manipulating features of the presented objects, allowed us to create the trial-wise regressor of pragmatic likelihood estimate and explore its neural correlate in the listener’s brain.
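The two-step computation described above (simulate the speaker, then invert with Bayes’ rule) can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ implementation: the function names, the rationality parameter `alpha`, the uniform prior, and the example context are all assumptions made here, and the utility of an expression is taken to be the negative log of the number of objects it describes, which is one common way to express the specificity idea in Fig. 1C. The fitted model and the independently measured priors are specified in Materials and Methods.

```python
import math

def pragmatic_likelihood(expression, target, context, alpha=1.0):
    """P(expression | target, context) for a softmax speaker (illustrative)."""
    if expression not in target:
        return 0.0  # a speaker cannot use a word that does not describe the target

    def utility(word):
        # Specificity: fewer objects matching the word -> higher utility.
        n_matching = sum(word in obj for obj in context)
        return -math.log(n_matching)

    # Softmax over the two candidate expressions (color and shape of the target).
    weights = {word: math.exp(alpha * utility(word)) for word in target}
    return weights[expression] / sum(weights.values())


def listener_posterior(expression, context, prior=None, alpha=1.0):
    """Invert the speaker model with Bayes' rule to score each candidate target."""
    if prior is None:
        prior = [1.0 / len(context)] * len(context)  # assumed uniform prior
    scores = [p * pragmatic_likelihood(expression, obj, context, alpha)
              for obj, p in zip(context, prior)]
    total = sum(scores)
    if total == 0:
        raise ValueError("expression does not describe any object in context")
    return [s / total for s in scores]


# Hypothetical context: each object is a (color, shape) pair.
context = [("blue", "square"), ("blue", "circle"), ("green", "circle")]
posterior = listener_posterior("blue", context)
print([round(p, 3) for p in posterior])  # [0.4, 0.6, 0.0]
```

Under these assumptions, hearing “blue” favors the blue circle (0.6) over the blue square (0.4), because a speaker who meant the blue square would more often have said “square”.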

Consistent with previous research (34, 35), the RSA model predictions closely resembled listeners’ behavior, with the regression of listeners’ actual choice frequencies against model predictions being highly significant at the group level (r = 0.89, P < 2 × 10−16; Fig. 2A) and across listeners (logistic regression coefficient = 5.84 ± 0.17, t40 = 34.24, P < 2 × 10−16). In out-of-sample tests, the RSA model correctly predicted 68.44% of listener choices (chance level, 33.3%) and outperformed a variety of alternative models that could be used to guide utterance interpretation (fig. S4; Materials and Methods).

Fig. 2 Model estimation and trial types.

(A) Actual listener choices conditional on the received referring expression and context, as a function of the model-derived posterior probabilities using a cross-validation method. Data are pooled over all listeners conditional on the received reference and context and binned by a step size of 0.1 based on the posterior probabilities generated from out-of-sample model predictions (see Materials and Methods). The dashed line represents a perfect model fit. The size of a circle is proportional to the number of observations. (B) Actual speaker choice frequencies (gray bar) match the pragmatic likelihood estimates derived from the listener data (red dashed line). The speakers’ choice frequencies are computed by averaging the choices of color expressions across all speakers, conditional on each assigned target in each trial. The mentally simulated speaker choices are derived using the value estimate of pragmatic likelihood associated with the color expression conditional on each available item in each context. The actual and simulated choice probabilities are sorted by the relative specificity between the color and shape expressions associated with the item of interest (x axis). (C) Illustration of S+/S− trial classification. Top: Examples of trials when simulating speaker choices is relevant (S+) or is not relevant (S−) for reference resolution. In the S+ example (left), if the target were the blue square, a speaker would have uttered “square,” which denotes the blue square unambiguously. The fact that the speaker sent “blue” instead of “square” indicates to the listener that the blue circle is the target. In the S− example (right), by comparison, a listener may single out the red circle upon receiving the expression “red,” without realizing that the red circle would be referred to as “red” with 50% probability by the speaker. Bottom: The data-driven classification for examples presented in the top panel. Red crosses represent the actual choice frequencies of listeners in the corresponding decision. Black crosses represent the posterior probability distribution of listener choices derived from the best-fitting RSA estimation. Gray dots are the posterior probabilities simulated from RSA based on 100 randomly perturbed pragmatic likelihood values. Perturbation is restricted to items that can be literally described by the received expression. For example, given the received reference “blue” in the S+ example in the top panel, perturbed pragmatic likelihood values are randomly assigned to the blue square and blue circle, resulting in the similar distributions of gray dots for these two items. (D) Classification outcomes across all trials. Histograms depict the average Euclidean distance between the posterior probability predictions generated by the best-fitting RSA and 100 randomly perturbed RSA.

Important for interpreting neuroimaging data, we found that pragmatic likelihood estimates, derived from listener data, matched the aggregate choice pattern of the speakers. More specifically, using the RSA parameter calibrated on the listener data, we computed the value estimates of pragmatic likelihood associated with the color expression for each item in each trial. We found a significant correlation between these pragmatic likelihood estimates and the actual frequencies that speakers referred to an assigned item by its color (r = 0.97, P < 2 × 10−16). Moreover, and consistent with the model assumption, both the simulated and actual speaker choices increased as a function of the relative specificity between candidate expressions, in a manner mimicking the softmax function (Fig. 2B). These results therefore suggested that the value estimates of pragmatic likelihood reflected not only what speakers should rationally select to achieve a communicative goal but also what they actually selected in the experiment.

Recovering the intended referent may or may not require mental simulation

To allow for testing the hypothesis that pragmatic likelihood is actively tracked by the listener’s brain, even when mental simulation of the speaker is not required, we further classified choices faced by listeners into two categories using the RSA model, in which mental simulation may (S+) or may not (S−) influence referential interpretation in the task setting. The classification was performed using a data-driven approach. We generated RSA model predictions using randomly perturbed values of pragmatic likelihood and compared these predictions with the predictions generated by the best-fitting RSA (Fig. 2, C and D). In S+ trials, pragmatic likelihood critically shapes the utterance interpretation such that perturbing pragmatic likelihood gives rise to differential interpretations of the same utterance. In S− trials, however, listeners always arrive at the same interpretation even when pragmatic likelihood is randomly perturbed, making it difficult to determine whether reference resolution entails the rational prediction of the speaker based on behavioral data alone (see also Materials and Methods and fig. S5 for mathematical definition and detailed characterization).
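The perturbation logic behind the S+/S− split can be sketched as follows. This is an illustrative reading of the procedure rather than the authors’ code: the function names, the uniform sampling range for perturbed values, the uniform prior, and the use of a nonzero likelihood as a stand-in for “literally described by the received expression” are assumptions introduced here. The actual definition and decision rule are given in Materials and Methods and fig. S5.

```python
import math
import random

def posterior(likelihoods, prior):
    """Bayes' rule over candidate targets, given per-item likelihoods."""
    scores = [l * p for l, p in zip(likelihoods, prior)]
    total = sum(scores)
    return [s / total for s in scores]


def mean_perturbation_distance(likelihoods, prior, n_perturb=100, seed=0):
    """Average Euclidean distance between the baseline posterior and
    posteriors computed from randomly perturbed pragmatic likelihoods."""
    rng = random.Random(seed)
    baseline = posterior(likelihoods, prior)
    distances = []
    for _ in range(n_perturb):
        # Perturb only items the expression can literally describe
        # (assumed here to be the items with nonzero likelihood).
        perturbed = [rng.uniform(0.01, 1.0) if l > 0 else 0.0
                     for l in likelihoods]
        distances.append(math.dist(baseline, posterior(perturbed, prior)))
    return sum(distances) / n_perturb


uniform_prior = [1 / 3, 1 / 3, 1 / 3]  # assumed for illustration

# "blue" could describe two items: the interpretation depends on the speaker model.
d_two_candidates = mean_perturbation_distance([1 / 3, 1 / 2, 0.0], uniform_prior)

# "red" describes a single item: any perturbation yields the same posterior.
d_one_candidate = mean_perturbation_distance([0.5, 0.0, 0.0], uniform_prior)

print(d_two_candidates > 0, d_one_candidate == 0)
```

A trial whose mean distance is zero behaves like an S− trial in this sketch, because no change to the simulated speaker alters the predicted choice, whereas a clearly positive distance indicates an S+ trial.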

Update-related signals are encoded in the listener striatum

Next, we investigated whether the listener’s brain activity reflected key computational components derived from the RSA model, including the Bayesian update and pragmatic likelihood estimates, at the point when listeners received messages from speakers. A standard general linear model (GLM) analysis revealed that the Bayesian update signal, as assessed by the difference between the prior and model-derived posterior probability that the chosen object was the intended referent, scaled with activity in the listener bilateral striatum on a trial-by-trial basis (Fig. 3A; see Materials and Methods and fig. S6 for how prior probability was assessed). This effect is consistent with previous findings in decision neuroscience that the striatum is involved in representing a variety of signals for error-like updates, including those derived from the Bayesian setups (36, 37).
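In symbols, the update regressor described above can be written as follows. The notation is introduced here for illustration and is an assumption: o* is the chosen object, w the received expression, c the context, and P0 the prior.

```latex
\[
P(o \mid w, c) \;=\;
\frac{P(w \mid o, c)\, P_0(o \mid c)}
     {\sum_{o'} P(w \mid o', c)\, P_0(o' \mid c)},
\qquad
\mathrm{update}(o^{*}) \;=\; P(o^{*} \mid w, c) \;-\; P_0(o^{*} \mid c)
\]
```

As a purely hypothetical example, a chosen object with a prior of 1/3 and a model-derived posterior of 0.6 would have an update of about 0.27.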

Fig. 3 Listener striatum represents update-related signals.

(A) The listener bilateral striatum represents the update estimate (posterior-prior probability) for the chosen object at the time of expression onset [P < 0.05 cluster-wise family-wise error rate (FWE)–corrected, cluster-forming threshold P < 0.001; see also fig. S7 and table S2]. (B) Across listeners, a superior RSA model fit to the listener data is associated with enhanced neural responses to the update signal, in an independent region of interest (ROI) for learning and updating (using the term “prediction error”) defined from Neurosynth (38). The ROI is predominately confined to the nucleus accumbens. Each circle represents a listener.

Activity in the listener striatum reflected both components in the update estimate, scaling positively with the posterior probability of the chosen object, but negatively with the prior probability (fig. S7). Listeners with higher striatal sensitivity to the posterior probability estimates also responded more strongly to the prior probability estimates in the same region, which is consistent with the possibility that the striatum tracks the update signal as a whole, rather than tracking its components separately. Moreover, across subjects, the strength of this update effect was indexed by the individual differences in the listeners’ behavior, such that listeners whose choices were better characterized by RSA showed a greater update-related activation in a region of interest (ROI) independently defined for learning and updating from an automated online meta-analysis (Fig. 3B) (38).

Pragmatic likelihood estimates are encoded in the listener vmPFC

We then looked for brain regions where activity reflected pragmatic likelihood estimates of the chosen object at the time when referring expressions were presented to listeners. Under the Bayesian setup, the value estimates of pragmatic likelihood were inevitably correlated with that of posterior probability (r = 0.57 ± 0.08). To control for this and other possible correlations, we included the trial-wise estimates of pragmatic likelihood [i.e., P(received expression|listener chosen item, context)], together with the posterior probability [i.e., P(listener chosen item|received expression, context)], prior probability [i.e., P0(listener chosen item|context)], trial type (S+/S−), and reaction time (RT) in a single GLM (see table S1 for correlation coefficients between regressors; see also Materials and Methods for GLM specifications). The method of orthogonalizing GLM regressors, widely used in model-based fMRI studies, provides regression coefficients that capture variances uniquely explained by each regressor, while removing any shared variations (39).
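What it means for a coefficient to capture only the variance uniquely explained by one regressor, when correlated regressors share a GLM, can be illustrated with simulated data. The sketch below is not the authors’ fMRI pipeline (there is no hemodynamic convolution, and the data, variable names, and effect sizes are invented for illustration). It checks numerically that the coefficient a regressor receives in a joint ordinary least-squares fit equals the coefficient obtained after first removing from that regressor everything it shares with the others.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 200

# Simulated, correlated trial-wise regressors (hypothetical values).
posterior_reg = rng.normal(size=n_trials)
likelihood_reg = 0.6 * posterior_reg + 0.8 * rng.normal(size=n_trials)

# Simulated signal driven by the likelihood regressor only, plus noise.
signal = 1.0 * likelihood_reg + rng.normal(size=n_trials)

# Joint GLM: intercept, likelihood, and posterior entered together.
X = np.column_stack([np.ones(n_trials), likelihood_reg, posterior_reg])
beta_joint, *_ = np.linalg.lstsq(X, signal, rcond=None)

# Remove from the likelihood regressor what the other columns can explain.
Z = np.column_stack([np.ones(n_trials), posterior_reg])
coef, *_ = np.linalg.lstsq(Z, likelihood_reg, rcond=None)
likelihood_unique = likelihood_reg - Z @ coef

# Regress the signal on the unique part alone.
beta_unique = (likelihood_unique @ signal) / (likelihood_unique @ likelihood_unique)

print(np.isclose(beta_joint[1], beta_unique))  # True
```

The agreement between the two coefficients is an algebraic property of least squares, so it holds for any data set with linearly independent regressors.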

The analysis revealed a significant effect of pragmatic likelihood estimates in a single cluster in the ventromedial prefrontal cortex (vmPFC) (Fig. 4A and table S2). The vmPFC effect was seen after the receipt of a referring expression from the speaker but was not significant when we tested the effect at the time when listeners were presented with the context while waiting for reference delivery (uncorrected P > 0.05). Moreover, the activation regions identified by this GLM analysis with respect to the prior, posterior, trial type, and RT were all spatially segregated from the vmPFC, even at a liberal, uncorrected threshold of P < 0.05, suggesting that none of these potentially confounding variables can explain the same portion of vmPFC signals (Fig. 4, B and E; see also fig. S8 for whole-brain results).

Fig. 4 Listener vmPFC encodes pragmatic likelihood estimates, above and beyond prior, posterior, and reaction time (RT).

(A) BOLD activity in the listener vmPFC is significantly correlated with pragmatic likelihood (PL) estimates of the chosen object, controlling for prior, posterior, RT, and trial type (S+/S−) (P < 0.05 cluster-wise FWE-corrected, cluster-forming threshold P < 0.001). (B) Time courses for GLM analyses of the effects of pragmatic likelihood and posterior probability on vmPFC activity. The ROI is defined as a 6-mm ball around the peak voxel in the vmPFC (MNI: −6/44/−7) as identified in (A). Vertical dashed lines indicate the onset of the referring expression. a.u., arbitrary units. (C to E) Mean fMRI activity in the vmPFC ROI for each value estimate of pragmatic likelihood, ranked by relative specificity, conditional on posterior, prior, and the median split according to RT, respectively. Insets show the mean fMRI signals in the same vmPFC ROI (y axis), colored by pragmatic likelihood estimates, and plotted against posterior, prior, and quantiles of RT, respectively. For visualization purposes, the vmPFC signals are ranked by relative specificity rather than pragmatic likelihood estimates, while all test statistics in the study are reported on the basis of the value estimates of pragmatic likelihood. See also fig. S8 (A to D) for whole-brain analyses and fig. S8F for analyses based on anatomically defined vmPFC ROI. Error bars represent intersubject SEM. Circle sizes represent sample sizes.

To illustrate the vmPFC effect, we extracted mean fMRI signals from the vmPFC ROI and plotted the extracted signals against pragmatic likelihood estimates, colored by the sizes of posterior, prior, and RT, respectively (Fig. 4, C to E). This demonstrated two features of vmPFC activity. First, the mean vmPFC signals increased with the value estimates of pragmatic likelihood (β = 0.82 ± 0.19, t40 = 4.28, P = 1.1 × 10−4). Second, at each level of pragmatic likelihood, there was no difference in vmPFC signals between high versus low posterior probability (β = 0.06 ± 0.12, t40 = 0.50, P = 0.62), prior probability (β = −0.05 ± 0.11, t40 = −0.41, P = 0.69), or RT (β = 0.14 ± 0.10, t40 = 1.40, P = 0.17), suggesting that the observed vmPFC effect could not be attributed to these variables. In complementary analyses, we plotted mean vmPFC activity, conditional on the value estimate of pragmatic likelihood, against posterior, prior, and quantiles of RT, respectively (insets in Fig. 4, C to E). This revealed a marked difference in the correlation patterns: Whereas activity in the vmPFC showed no sensitivity to posterior (β = 0.01 ± 0.25, t40 = 0.03, P = 0.98), prior (β = 0.18 ± 0.28, t40 = 0.56, P = 0.56), or RT (β = 0.002 ± 0.09, t40 = 0.02, P = 0.98), there was a significant main effect with respect to pragmatic likelihood estimates regardless of how vmPFC signals were binned. In addition to the analyses on the basis of the functional ROI selected by pragmatic likelihood, we examined an anatomical vmPFC ROI defined by automated anatomical labeling template and observed similar results (fig. S8F).

Besides these key decision variables, we tested an additional set of cognitive factors that may be processed by the listener’s brain during communication, including the message type (color/shape), choice type (left/right/middle), context configuration (see the “Experimental stimuli” section), and a number of measures related to decision difficulty. The observed vmPFC effect could not be attributed to any of these variables (fig. S9) and remained robust to the inclusion of these variables as regressors of no interest in the same regression (fig. S10B, top).

Beyond pragmatic likelihood estimates associated with the chosen item, activity in overlapping regions of the vmPFC was also correlated with other notions of pragmatic likelihood. These included pragmatic likelihood estimates for the item presented on the left of the context [i.e., P(received expression|left item, context)], as well as those related to the most salient item according to the independently measured prior probability distributions [i.e., P(received expression|item with the highest prior in context, context)] (fig. S11). That the vmPFC encodes multiple notions of pragmatic likelihood is consistent with the Bayesian assumption, under which a listener needs to evaluate the speaker’s action-intention contingencies between the received expression and each available item in context, rather than only the chosen item. Last, in line with the hypothesis that the listener’s vmPFC represents other-predictive signals, the fMRI signals extracted from the vmPFC ROI were predictive of speakers’ actual choices, outperforming the model-derived pragmatic likelihood estimates (fig. S12).

Listener’s vmPFC tracks pragmatic likelihood even when not required

Strikingly, the pragmatic likelihood estimate was encoded in the listener vmPFC even when not required for decoding the intended referents. A separate whole-brain GLM regression showed that, within S− trials, activity in an overlapping vmPFC cluster was strongly correlated with pragmatic likelihood estimates for the chosen object at the time of expression onset [P < 0.05 cluster-wise family-wise error rate (FWE)–corrected; see also fig. S10C (top) and table S2]. Figure 5A illustrates this finding by plotting the mean vmPFC activity against pragmatic likelihood estimates of the chosen item ordered by relative specificity, for S+ and S− types. Regression betas extracted from the vmPFC ROI confirmed that there was no significant difference in the regression coefficients with respect to pragmatic likelihood estimates in S+ versus S− trials (∆β = 0.02 ± 0.42, t40 = 0.05, P = 0.96).

Fig. 5 Listener’s vmPFC tracks pragmatic likelihood estimates even when not required.

(A) Mean fMRI activity in the vmPFC ROI (y axis) against pragmatic likelihood estimates ranked by relative specificity (x axis), conditional on S+/S− types. The ROI is defined as a 6-mm ball around the peak voxel (MNI: −6/44/−7) as identified in Fig. 4A. The inset shows vmPFC betas with respect to (w.r.t.) pragmatic likelihood estimates separately extracted for S+ and S− trials. Each dot represents a listener. Each gray line represents paired comparison between S+ and S− trials within a listener. (B) Breakdowns of trials in which the pragmatic likelihood of the chosen item is 0.5 (relative specificity, 1:1). Examples on the x axis are not sorted, as all are associated with the same value estimate of pragmatic likelihood. (C) Mean fMRI activity in the vmPFC ROI (y axis) ranked by relative specificity (x axis), conditional on whether trials involved determined or uncertain references. The inset presents vmPFC betas w.r.t. pragmatic likelihood estimates separately extracted for determined and uncertain trials. (D) Breakdowns of trials with determined references. Examples are sorted by relative specificity between the received versus alternative expressions (e.g., blue versus circle) for the chosen item (e.g., the blue circle). Error bars represent intersubject SEM. Circle sizes represent sample sizes. Each gray line represents a listener. ***P < 0.001; n.s., not significant.

An alternative explanation for this finding is that, because S+ trials are generally more difficult than S− ones and because choice difficulty may fluctuate within S− trials, the observed vmPFC encoding may reflect the varying level of task difficulty rather than mental simulation signals. Contrary to this interpretation, we found no significant correlation between vmPFC activation and RT, a widely used measure for task difficulty and/or decision confidence (β = 0.002 ± 0.09, t40 = 0.02, P = 0.98; Fig. 4E and fig. S8D). To unpack this finding, we assessed two contrasting situations in which the influence of mental simulation can be dissociated from that of choice difficulty. The first situation contained trials associated with the same pragmatic likelihood estimates of the chosen item, but varying in choice difficulty. If vmPFC signals reflected pragmatic likelihood estimates above and beyond task difficulty, we should expect restricted fluctuation in vmPFC activity across those trials, despite the variation in difficulty. The second situation contained trials associated with a similar level of choice difficulty but differing in pragmatic likelihood estimates. Contrary to the first situation, vmPFC activity should vary according to the value estimates of pragmatic likelihood, despite the stably distributed difficulty level.

For the first situation, we focused on trials in which the pragmatic likelihood estimates of the chosen item were equal to 0.5 [i.e., P(received expression|listener chosen item, context) = 0.5, corresponding to relative specificity = 1:1]. These contained four different subtypes, varying in context configurations and in whether or not mental simulation was required for resolving communicative uncertainty (see examples in Fig. 5B). A closer examination of these trials showed that they varied significantly in the posterior probability for the chosen item (F3,120 = 7388, P < 2 × 10−16), prior probability for the chosen item (F3,120 = 7462, P < 2 × 10−16), and choice difficulty as reflected by RT (F3,120 = 18.7, P = 5.1 × 10−10). Despite this variation, there was no significant difference in vmPFC activity across trials, either according to an analysis of variance (ANOVA) within listeners across four subtypes (F3,120 = 0.4, P = 0.76) or based on the paired comparison between the S+ and S− trials (the first subtype versus the latter three subtypes combined, as shown in the examples in Fig. 5B) (vmPFC activity in S+ = −0.50 ± 0.15, S− = −0.62 ± 0.13; t40 = 0.94, P = 0.35). The null result could not be attributed to insufficient statistical power arising from testing a fraction of decisions, as pronounced S+ versus S− differences were present in the same set of decisions in ROIs selected by posterior, prior, trial type, and RT, in a manner consistent with their respective behavior patterns (fig. S13).

For the second situation, we considered trials involving referring expressions that were perfectly specific and uniquely denoted a referent in a given context (see examples in Fig. 5D). Resolving such a “determined” expression does not require mental simulation and is always associated with the same posterior probability (model prediction for the chosen item = 1; listeners’ actual choices = 0.99 ± 0.002) and a similar level of choice difficulty as reflected by RT (F2,80 = 0.35, P = 0.71). In line with our prediction, both the whole-brain and ROI analyses showed that activity in the listener vmPFC robustly tracked the model-derived pragmatic likelihood estimates for determined references (whole-brain: P < 0.05 FWE-corrected, fig. S14; ROI: β = 1.05 ± 0.28, t40 = 3.69, P = 6.6 × 10−4, Fig. 5, C and D), with an effect size similar to decisions associated with uncertain references (∆β = 0.51 ± 0.37, t40 = 1.39, P = 0.17). The observed effect of determined referents could not be attributed to the correlation with the difficulty in sensory processing that varied according to context complexity. Using joint entropy computed from the numbers of different colors and shapes in each context as an approximation for visual complexity (40, 41), we found no correlation between vmPFC activity and visual complexity either across all decisions (β = 0.12 ± 0.10, t40 = 1.19, P = 0.24) or within decisions involving determined references (β = −0.01 ± 0.21, t40 = −0.05, P = 0.96).

Besides RT, we examined a number of behavioral measures that have been associated with choice confidence and/or task difficulty in previous studies on decision-making. None of them could explain the same portion of the vmPFC signals, compared with pragmatic likelihood estimates (fig. S15). These data were consistent with the hypothesis that the vmPFC encodes a neural signature of pragmatic likelihood estimates, above and beyond more general cognitive factors underlying communication. Findings related to S− trials, especially those involving determined references, further highlight the possibility that pragmatic likelihood estimates are actively represented in the listener’s brain, even when such simulation is irrelevant for utterance interpretation.

Functional coupling between the vmPFC and nodes on the mentalization network is associated with referential interpretation

The above results thus raise the question of what neural systems inform or facilitate the simulation signals observed in the vmPFC. On the basis of previous studies (8), we hypothesized that pragmatic likelihood computations likely involve the communication between the listener vmPFC and mentalization network. Under this possibility, activity in brain regions typically implicated in mentalization, such as the dorsomedial prefrontal cortex (dmPFC) and temporoparietal junction (TPJ) (42), may influence the encoding of pragmatic likelihood in a manner consistent with RSA predictions.

We explored this hypothesis by splitting listener decisions into two sets, according to whether a listener’s actual choice was the one predicted by RSA with the highest posterior probability (following RSA recommendations) or with a lower posterior probability (violating recommendations). A psychophysiological interaction (PPI) analysis was performed between the vmPFC and four ROIs independently defined for mentalization [dmPFC, left TPJ (LTPJ), right TPJ (rTPJ), and precuneus (PC)] at the time when referring expressions were presented to listeners (Fig. 6A; Materials and Methods). In line with our prediction, following RSA recommendations, compared to violating recommendations, was associated with enhanced functional coupling between the vmPFC and ROIs in the dmPFC and TPJ, but not in the precuneus [Fig. 6B; dmPFC: β = 0.39 ± 0.13, t(40) = 2.90, P = 0.024; LTPJ: β = 0.56 ± 0.18, t(40) = 3.19, P = 0.012; rTPJ: β = 0.46 ± 0.17, t(40) = 2.70, P = 0.04; PC: β = 0.22 ± 0.14, t(40) = 1.62, P = 0.44; all Bonferroni-corrected]. Moreover, consistent with our finding that the vmPFC represents pragmatic likelihood estimates even when not required, we observed no systematic differences in the functional coupling of vmPFC with nodes on the mentalization network in S+ versus S− trials (Fig. 6C; all P > 0.05, uncorrected).

Fig. 6 Functional connectivity between the vmPFC and nodes in the mentalization network.

(A) The vmPFC cluster as identified in Fig. 4A, and four ROIs defined from Neurosynth using the term “theory of mind” [dmPFC, LTPJ, rTPJ, and precuneus (PC, not shown)]. (B) Enhanced functional coupling between vmPFC and ROIs of dmPFC and TPJ, but not PC, when the listener followed compared to when she violated RSA recommendations. (C) There is no systematic difference in functional connectivity between the vmPFC and the same four ROIs, when listeners face S+ versus S− trials. Error bars represent intersubject SEM. *P < 0.05; all Bonferroni-corrected.

Generality and sensitivity of the listener vmPFC encoding

To what extent does the vmPFC activity reflect the mental simulation of speakers in other communicative situations? Results from two additional experimental conditions suggested that, whereas varying communicative stimuli did not affect the vmPFC encoding (symmetric-garment condition, Fig. 7, A to C, and fig. S10, middle row), altering the knowledge shared between communicators could substantially perturb signals in the vmPFC (asymmetric condition, Fig. 7, D to F, and fig. S10, bottom row). In the latter condition, we tested whether the vmPFC encoding was sensitive to perturbing the epistemic states of speakers, by exploiting the common ground effect well established in referential communication (4, 6).

Fig. 7 Generality and sensitivity of the pragmatic likelihood representation in the listener vmPFC.

(A) Schematic of symmetric-garment condition, designed for testing whether the vmPFC encoding depends on the specific eliciting stimuli. Both the whole-brain (B) and ROI (C) analyses reveal a significant correlation between the listener vmPFC activity and the pragmatic likelihood estimates of the chosen object (whole brain: P < 0.05 cluster-wise FWE-corrected, cluster-forming threshold P < 0.001, only positive activation is presented for confirmatory purposes; see also table S2 for full activation lists). By design, the symmetric-garment condition contains only the S+ trials at the relative specificity 1:1 (Supplementary Materials). (D) Schematic of asymmetric condition, designed for evaluating the vmPFC sensitivity to common ground information, by providing a privileged, rather than common, perspective to the listeners. Both the whole-brain (E) and ROI (F) analyses reveal a diminished correlation between vmPFC activity and pragmatic likelihood estimates derived from the matching symmetric condition (whole brain: P < 0.05 cluster-wise FWE-corrected for positive activation, cluster-forming threshold P < 0.001). The same vmPFC ROI is used as in the symmetric condition. Gray dots represent mean fMRI activity for each relative specificity in the symmetric condition. Error bars represent intersubject SEM. Circle sizes represent sample sizes.

Specifically, we performed the experiment in the same group of listeners, but with one important difference: instead of three geometric objects in each trial, speakers were able to see only the target when selecting references. Importantly, listeners underwent the same reasoning task inside the fMRI as in the other conditions but were told that speakers faced only the target during their decisions (Fig. 7D, see also Materials and Methods for details). If a listener models the utterance selection process from the speaker’s perspective, then the listener would expect a speaker with a restricted perspective to choose between candidate expressions randomly, regardless of context. This implied that a listener should demonstrate flattened vmPFC activation in the asymmetric condition relative to that in the symmetric condition.

As expected, we found that the behavior of listeners was sensitive to the experimental manipulation, such that now only 67.02 ± 0.15% of targets were correctly identified by listeners, a success rate similar to that of listeners choosing literally in response to random speakers [literal recovery rate = 66.88%; t(40) = 0.88, P = 0.38] (fig. S3). Consistent with our prediction, the listener vmPFC showed blunted responses to the pragmatic likelihood estimates derived from the matching symmetric condition, both at the whole-brain level (Fig. 7E; P < 0.05 cluster-wise FWE-corrected) and within the ROI obtained in the original symmetric condition [Fig. 7F; β = 0.29 ± 0.19, t(40) = 1.48, P = 0.15]. Within listeners, the neural beta with respect to pragmatic likelihood estimates was significantly lower in the asymmetric condition than in the symmetric and symmetric-garment conditions (fig. S16). Across listeners, a significant correlation was observed in neural betas of pragmatic likelihood between the symmetric and symmetric-garment conditions, but not between the symmetric and asymmetric conditions (fig. S17).

Last, the blunted vmPFC responses observed in the asymmetric condition could not be attributed to insufficient detection power, as the symmetric and asymmetric conditions contained the same number of trials (n = 152) and the same context configurations (see the “Experimental stimuli” section). In comparison, the significant vmPFC effect in the symmetric-garment condition was detected on the basis of a much smaller set of observations (n = 72) and a variant of context compositions (Materials and Methods).

DISCUSSION

Dating back to Grice’s cooperative principle (1), understanding what is meant from what is said in context is thought to involve an inferential process guided by the expectation that the speaker communicates cooperatively. While it is often assumed that such expectations arise from an internal generative process that simulates speaker behavior, direct evidence has been lacking. The challenge relates to not only the inherent complexity in human communication but also the lack of quantitative instruments that can decompose the underlying neurocognitive operations in a principled way, without resorting to reverse inferences (43). Here, we address this challenge by combining methods from neuroeconomics and computational pragmatics. The consistent results across three experimental conditions provide substantial evidence that the listener vmPFC encodes mental simulations of the speaker choice process that comply with rational cooperative principles, are inferred from a specific context, are independent of eliciting stimuli, and arise irrespective of whether such a simulation is required for utterance interpretation. The rational simulation signal in the vmPFC is likely supported by inputs from the mentalization network. Importantly, the finding that the vmPFC signal resembles a Bayesian likelihood function, together with the fact that the listener’s striatal activity correlates with the update from the Bayesian prior probability to posterior probability, supports a mechanism by which the frontal-striatal circuits are engaged in building and then inverting a choice model of the speaker to produce pragmatic interpretations, in a manner mimicking Bayesian inferences.

The vmPFC has been previously implicated in a variety of functions, such as signaling reward expectation and decision difficulty (confidence), participating in social cognition, and representing cognitive maps that reflect latent structures of task-relevant components (44–46). In the current task, the expected reward for the chosen item is expressed in units of posterior probability and encoded in areas other than the vmPFC (e.g., the striatum), after controlling for pragmatic likelihood estimates (figs. S7A and S8B). Similarly, we find no evidence that difficulty-related signals can uniquely explain the vmPFC activation, on the basis of either RT or some other difficulty measures used in previous studies (fig. S15). In contrast, our data suggest that the listener vmPFC tracks the trial-by-trial changes in mental predictions about conversational partners, independent of more general factors that may contribute to communication. First, this model-derived simulation parameter uniquely explains activity in the listener vmPFC, above and beyond a variety of potential confounds emphasized in the vmPFC literature (Figs. 4 and 5 and figs. S8 to S10). Second, the vmPFC encoding is highly robust, even in trials containing no fluctuation in reward prediction, task difficulty, and choice uncertainty (Fig. 5D and fig. S14). Moreover, in line with the assumption of other-predictive signals, activity in the listener vmPFC can accurately and specifically predict the actual behavior of speakers, outperforming the best-fitting RSA model in some situations (fig. S12).

This effect corroborates previous findings that the vmPFC is involved in calibrating social actions by processing implied, rather than explicit, social information (47–49). Our data extend these past findings by characterizing the computational role of the vmPFC in coding the speaker’s intention-action contingency, inferred from a specific context, and by specifying the generality and sensitivity of the vmPFC representation. Our findings also complement recent research implicating the involvement of the vmPFC in the production of communicative signals. These studies reveal that the vmPFC is involved in representing the intention to speak to a conversational partner (50), and damage to this region impairs one’s ability to tailor a message for a specific audience (49). Together with our data, these findings raise an interesting possibility that the vmPFC is involved in both encoding and decoding communicative intents, perhaps through coordinating the neural structures that process belief inferences central to communicative reasoning. More generally, the finding that the vmPFC tracks the relationship between action, intention, and context, independent of specific communicative stimuli, echoes the view that the vmPFC is part of the neural system involved in representing an abstract “cognitive map” of a task, which is believed to facilitate flexible, inferential processing underlying a great variety of behaviors (45, 51, 52).

The Bayesian approaches to communication, including RSA, have been proposed largely as descriptive rather than mechanistic models (10). Although we cannot rule out the possibility that other decision strategies that do not rely on mental simulation may also contribute to referential interpretation, the identification of pragmatic likelihood signals in the vmPFC lends weight to the generative account. While it is an open question whether the vmPFC is directly involved in constructing mental predictions or simply reflects the prediction computed elsewhere, the parametric encoding of action-intention contingencies sheds light on a possible representation on which Bayesian-like inferences may operate in service of communication. The projection from the vmPFC to the striatum would allow simulation-related signals to update prior beliefs for building posterior inferences about the most probable intent.

A related line of debate concerns whether the internal generative process plays a fundamental or secondary role in communication. For example, past studies have hypothesized that reasoning from the speaker’s perspective does not always take part in referential communication but is invoked only when confusion or misunderstanding emerges from egocentric processing (6, 53). In contrast to this view, our data suggest that the vmPFC represents belief inferences simulated from the speaker’s viewpoint, even when the listener’s egocentric perspective offers sufficient, or even unambiguous, information for reference resolution. This finding has close parallels with past research suggesting an automatic representation of reward and confidence in the vmPFC (54–57). It is also compatible with an emerging perspective that internal predictive systems are actively engaged in the sensory and higher-level processing in communication, facilitating the online construction and updating of prospective evaluation during an ongoing conversation (12, 13, 58, 59). On the other hand, the current study focuses on situations where active mental simulation incurs small to moderate cognitive cost: in the current experiment, the goal of communication is well defined, the speaker’s strategy space is confined, and the dynamic nature of real-life communication is removed. We do not know, therefore, whether the active prediction process can generalize to other, more naturalistic settings, or whether the brain switches to alternative strategies, such as hierarchical Bayesian (14, 60) or efficient coding (61), to account for neurobiological constraints on communicative systems when confronted with complex settings. Future studies relaxing experimental restrictions will be invaluable in addressing the ecological relevance of rational simulation in communication.

The current study provides preliminary neural evidence that predicting speaker choices depends on the mutual information shared by conversational partners. However, these data do not directly speak to the exact process by which common ground information supports mental simulation. Prior studies have provided sophisticated experimental designs and relevant cognitive signatures (53, 62–64) that, when combined with computational modeling and neuroscientific methods, will have the potential to reveal how the brain recognizes, represents, and uses information about others’ epistemic states in service of communication.

The mentalization network, particularly the dmPFC and TPJ, is often activated in neuroimaging studies of pragmatic comprehension (8), but the exact role of these regions in communication remains unclear. Our results suggest that the dmPFC and TPJ do not directly code the simulation content per se but rather support intention resolution by interacting with the vmPFC. The strength of this functional connectivity is higher when a listener follows versus violates the RSA prediction, raising a number of possibilities regarding how vmPFC connectivity contributes to communicative effectiveness. One possibility is that a fundamental distinction between successful and unsuccessful pragmatic interpretation may be the degree to which the mentalization network modulates the vmPFC. An alternative possibility is that, rather than playing a primary modulatory role, the fluctuation in vmPFC connectivity reflects the influences of other cognitive aspects involved in communication, such as the degree of cognitive engagement (e.g., attention). To clarify the neural substrates necessary for communicative effectiveness, future research may combine the current framework with methods of brain stimulation and test whether disrupting activity in regions critical for mentalization or for cognitive control will modify vmPFC connectivity, thereby affecting communicative performance.

More broadly, our study provides novel insights into neural mechanisms underlying social and strategic decision-making. First, these results may help to explain why people coordinate and cooperate with strangers in novel, one-shot situations. Past research on cooperation has typically focused on how the brain anticipates partners’ choices by learning from direct experiences, such as repeatedly interacting with the same partner within the same decision context (21, 22, 65). In contrast, by focusing on one-shot communication with no feedback, our study suggests a neural system for simulating another’s behavior based on rational principles that may substitute for learned expectations, consistent with psychological and economic theories regarding the role of strategic mentalization in a range of mutually beneficial behaviors (18, 29, 66).

Second, the referential game resembles a class of strategic environments extensively studied in the game theory literature. In particular, signaling games, characterized by asymmetric information and multistage (as opposed to simultaneous) decision-making, have long been proposed to account for goal-directed information transmission in evolutionary biology (67) and economics (7, 68) and are tightly connected to equilibrium concepts grounded in Bayesian inferences (69). While little is known about whether such normative solutions can map onto the actual data-generating processes within communicators, our research sheds light on the neural implementation of Bayesian reasoning in an important class of signaling scenarios widely used for investigating how context shapes meaning in information transmission.

By highlighting the utility of connecting tools and ideas from neuroeconomics and those of computational pragmatics, the present study constitutes an initial step toward a neural mechanistic understanding of pragmatic interpretation. Future studies are needed to address whether, and under what circumstances, these findings generalize to other communicative environments. Moreover, results from the current study raise exciting questions regarding the degree to which neurocognitive substrates of communication are shared by social decision-making and whether behaviors such as detecting sarcasm or interpreting humor can be modeled as strategic, cooperative choices in the brain and brain-inspired artificial intelligence.

MATERIALS AND METHODS

fMRI participants

A total of 46 healthy, right-handed volunteers [26 females; age = 20.2 ± 1.32 years (mean ± SD)] were recruited for the fMRI experiment from the Neuroeconomics Lab subject pool at Peking University, China. All participants reported having normal or corrected-to-normal vision, no colorblindness, and no history of neurological or psychiatric illnesses. Five subjects were excluded from data analyses due to excessive motion (N = 4) and a technical problem with the stimuli display (N = 1). Informed consent was obtained by the ethics committee at Peking University, China.

Experimental procedure

Subjects participated in a referential game adapted from previous studies (34, 35). We first conducted a behavioral session in which 60 anonymous subjects participated in the referential game in the role of speakers. Forty-six neuroimaging subjects were separately recruited and subsequently played the role of listeners with speakers under a random matching protocol. That is, a listener and a speaker were matched pseudo-randomly at the beginning of each round. The listener received a referring expression previously selected by a speaker and needed to recover the intended referent from the received expression. The random matching between speakers and listeners ensured that the probability of repeated interactions was small, thereby preventing communicators from developing hierarchical mental models to collude with their partners.

No feedback was provided to either communicator during the experiment. That is, speakers did not know which items listeners selected in response to the referring expressions, and listeners did not know whether their choices of referents were correct after each decision.

Before the experiment, all subjects (listeners and speakers) were truthfully and identically instructed to ensure that the task setup was shared by all communicators and known to be shared by all. Subjects were instructed that the speakers’ choices were to be sent to listeners in the subsequent experiment. Following the instruction, subjects completed a quiz and three practice trials to ensure comprehension. Subjects were informed that both communicators would be rewarded if a referent was successfully recovered by the listener in a trial. Subjects were paid at the end of the study, on the basis of the total payoff of 100 randomly chosen trials and a show-up fee (150 CNY for fMRI listeners and 40 CNY for speakers).

Experimental conditions

The fMRI experiment included three conditions: symmetric, asymmetric, and symmetric-garment conditions. In the symmetric condition (152 trials divided into two scanning sessions), a set of three geometric objects were presented to both the speakers and listeners in each trial and were known to be presented to both. The speaker was additionally presented with an arrow, which was randomly distributed among the three displayed items, indicating the target object that the speaker needed to refer to and the listener needed to recover on the basis of the received expression. In the asymmetric condition (152 trials divided into two scanning sessions), we reduced the common knowledge shared between the communicators such that speakers saw only the target object, whereas listeners were informed of all three objects and the fact that speakers were able to see only the target. We also included a symmetric-garment condition (72 trials, one scanning session) only for listeners as a robustness check for whether the main neuroimaging result depended on specific eliciting stimuli.

During scanning, the symmetric and asymmetric conditions were presented in a block-wise manner with a counterbalanced order [i.e., two successive sessions of the symmetric condition followed by (or following) two sessions of the asymmetric condition]. The symmetric-garment session was always administered at the end. Within each scanning session, the trial order was randomly shuffled with a unique order per listener.

Experimental stimuli

A schematic representation of the referential game and the timeline of the experiment is shown in fig. S1. On each scanning trial in the symmetric and asymmetric conditions, a listener is presented with a new context consisting of three geometric items displayed horizontally, with item locations fixed for the listener and the speaker within each context. All listeners faced the same 304 contexts that contained a total of 16 different geometric objects, generated from four colors (red, green, blue, and yellow) and four shapes (diamond, square, circle, and trapezoid). All color/shape features can be denoted by a two-character noun in Chinese. We constructed the 304 contexts pseudo-randomly by drawing 3 items out of 16, with replacement, without distinguishing between drawing orders or item locations (fig. S2). A full list of experimental stimuli is included in the Supplementary Materials. The target location was randomly distributed. The 304 contexts were evenly and pseudo-randomly split between symmetric and asymmetric conditions, with 152 each. In the asymmetric condition, only the target was revealed to the speaker as the two distractors were covered by gray masks.
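The size of the candidate pool implied by this construction can be checked with a short sketch. Drawing 3 of 16 objects with replacement, ignoring order and location, yields the multisets of size 3 over 16 objects. The seeded sampling and the 152/152 split below are illustrative assumptions only; they stand in for the authors' pseudo-random selection, whose exact procedure is not specified in this section.

```python
import itertools
import random

COLORS = ["red", "green", "blue", "yellow"]
SHAPES = ["diamond", "square", "circle", "trapezoid"]

# 16 geometric objects: every color paired with every shape.
OBJECTS = list(itertools.product(COLORS, SHAPES))

# Unordered draws of 3 objects with replacement (multisets of size 3).
all_contexts = list(itertools.combinations_with_replacement(OBJECTS, 3))

# Assumption: a plain seeded sample stands in for the pseudo-random
# selection of 304 contexts; the seed and method are hypothetical.
rng = random.Random(0)
selected = rng.sample(all_contexts, 304)

# Even split between the symmetric and asymmetric conditions.
symmetric, asymmetric = selected[:152], selected[152:]

print(len(OBJECTS), len(all_contexts), len(symmetric), len(asymmetric))
```

Under these assumptions, the 304 experimental contexts are a subset of 816 possible unordered contexts.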

Stimuli in the symmetric-garment condition were created in a similar fashion and included nine items, generated by three garment types (top, pants, and sneakers) and three brand names (Adidas, Nike, and Li-Ning). Each of these features was associated with a two-character Chinese noun (Supplementary Materials).

Prior probability evaluation

To capture the idea that the prior probability distribution reflects the common knowledge about a communicative context shared among interlocutors, we followed previous studies (34, 35) and empirically measured the prior probability distribution of target items in 304 contexts, using online surveys (www.wjx.cn) in a separate sample of Chinese participants (N = 900). In each trial, survey participants were presented with three geometric items and asked to infer the referent on the basis of an unknown expression in a foreign language (fig. S6A). We instructed subjects to follow their intuition and make a guess if they did not know the meaning of the expression. Answers elicited by this method have been thought to reflect the relative saliency among items at not only visual but also social and communicative levels (34, 35, 70). To monitor the performances of online participants, 32 sanity check questions were included and evenly distributed throughout the survey, where subjects needed to identify the referent on the basis of a Chinese referring expression that uniquely denoted an item in the context. Ninety-eight survey participants who answered incorrectly on more than 30% of sanity check questions were excluded from the data analysis, whereas the remaining participants answered sanity check questions with an accuracy rate of 91.63 ± 0.32%.

We calculated the trial-wise prior probability distribution of targets by averaging the choices of each item within each context across survey participants. Within each context, the calculated prior distribution differed significantly from a uniform distribution, as assessed by the information entropy derived from each context (fig. S6B). Across contexts, the empirically measured priors varied substantially, revealing complex sensitivity patterns in response to the changes in color, shape, and object position that could be of perceptual and cultural relevance (fig. S6C). In addition, in line with the common knowledge assumption, we found that these priors reflected not only individual responses but also variances in attitude shared among online participants, such that subjects in randomly divided subgroups demonstrated highly correlated priors (fig. S6D). Although these priors were independently measured in a separate sample, they showed a significant modulatory effect on listeners’ behavior when included in a random-effect logistic regression predicting listeners’ decisions while controlling for the model-derived pragmatic likelihood estimates (fig. S6E). Together, these results suggest that these empirically measured priors are commonly shared, context-sensitive, and choice-relevant. These priors likely reflect the trial-by-trial variance in attitudes for items in context, over and beyond visual or motor reactions that could be more random and idiosyncratic.
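As a minimal sketch of the averaging step, the prior for one context can be obtained as the fraction of survey participants choosing each item, and its entropy compared with that of a uniform distribution over three items. The function names and the toy responses are hypothetical, and the sketch does not reproduce the statistical test reported in fig. S6B.

```python
import math
from collections import Counter

def empirical_prior(choices, n_items=3):
    """Fraction of survey responses selecting each item in one context."""
    counts = Counter(choices)
    return [counts[i] / len(choices) for i in range(n_items)]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability items contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical responses from 10 participants (item indices 0, 1, 2).
choices = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
prior = empirical_prior(choices)

# A uniform prior over three items has the maximum entropy, log2(3) bits.
uniform_entropy = math.log2(3)
print(prior, entropy_bits(prior), uniform_entropy)
```

An entropy below log2(3) indicates that responses concentrate on some items rather than spreading evenly across the context.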

The prior probabilities were subsequently used for fitting listener choices in the symmetric condition and imaging data analyses. No prior probability data were collected for the asymmetric or symmetric-garment conditions.

Computational modeling

We applied the RSA model (10, 34) to characterize listener behavior observed in the symmetric condition. Listeners make their decisions based on a Bayesian inferential process that can be formalized as

$$P(i \mid e, c) = \frac{P(e \mid i, c)\,P(i)}{\sum_{i' \in c} P(e \mid i', c)\,P(i')}$$

where P(i|e, c) is the posterior probability of a listener choosing a particular item i upon receiving an expression e in context c; P(e|i, c) is the likelihood that the speaker selects expression e to refer to item i in context c; and P(i) is the prior probability that item i is the target referent. According to the Bayesian setup, listeners need to predict how speakers generate their choices for each possible target in a context in the form of conditional probability distributions P(e|i, c), which we refer to as pragmatic likelihood.
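The Bayesian step can be illustrated with a minimal sketch that normalizes likelihood times prior over the items in a context. The function name and the numbers are hypothetical and chosen only for illustration; they are not taken from the experiment.

```python
def listener_posterior(likelihood, prior):
    """Return P(i | e, c) for every item i in a context.

    likelihood[i] is P(e | i, c) and prior[i] is P(i).
    """
    joint = [l * p for l, p in zip(likelihood, prior)]
    total = sum(joint)
    if total == 0:
        raise ValueError("the expression is incompatible with every item")
    return [j / total for j in joint]

# Hypothetical three-item context: the received expression cannot
# denote the third item, so its likelihood is zero.
likelihood = [0.8, 0.5, 0.0]
prior = [0.3, 0.5, 0.2]
posterior = listener_posterior(likelihood, prior)
print(posterior)
```

In this toy case the second item ends up slightly more probable than the first, because its higher prior outweighs its lower likelihood.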

The RSA model assumes that pragmatic likelihood is computed by simulating speaker choices through a rational, goal-directed decision-making model. Specifically, listeners expect that when selecting referring expressions, speakers choose an expression to help the recipient recover the target. In the symmetric condition, this corresponds to choosing the maximally specific (informative) reference within a given context, which can be quantified using an information-theoretic measure, self-information, $I(e; c) = -\log\left(N(e; c)/N\right)$, where N denotes the total number of items contained in a context c (thus, N = 3 in our experimental setting), and N(e; c) denotes the number of objects that an expression e can denote in context c. For example, if an expression e can describe all three items in a context [i.e., N(e; c) = 3], e is not at all informative [i.e., I(e; c) = 0] and will not help the listener narrow down possible referents.

To convert self-information of candidate expressions into choice probabilities, the model assumes that speaker choices follow a logit or softmax formula widely used in decision-making research

$$P(e \mid i, c) = \frac{1}{1 + \exp\left(-\alpha\left[I(e; c) - I(e'; c)\right]\right)} = \frac{1}{1 + R^{-\alpha}}$$

where $R = N(e'; c)/N(e; c)$ reflects the relative specificity between the expression e and its alternative e′, and α reflects how sensitive speaker choice probability is to the relative specificity between competing expressions, or the “inverse temperature” of the softmax function (e.g., α = 0 means listeners expect speakers to select randomly between e and e′). For example, if an expression e is more specific than its alternative e′ in referring to a target i (i.e., expression e can denote fewer items than e′), a rational cooperative speaker should be more likely to select e over e′ [i.e., P(e|i, c) ≥ 0.5]. That is, pragmatic likelihood P(e|i, c) is a nondecreasing function with respect to the relative specificity R, as demonstrated in Fig. 1C.
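The speaker model can be sketched in a few lines, assuming self-information is the negative log of the fraction of items an expression denotes, and that the speaker chooses between exactly two expressions through a logistic rule on the difference in self-information. The function names are hypothetical, and the sign convention is the one under which a more specific expression is more likely to be chosen, as the text describes.

```python
import math

def self_information(n_denoted, n_items=3):
    """I(e; c) = -log(N(e; c) / N); zero when e denotes every item."""
    return -math.log(n_denoted / n_items)

def pragmatic_likelihood(n_e, n_alt, alpha, n_items=3):
    """P(e | i, c) for a speaker choosing between e and its alternative e'.

    n_e and n_alt are the numbers of items that e and e' can denote.
    """
    diff = self_information(n_e, n_items) - self_information(n_alt, n_items)
    return 1.0 / (1.0 + math.exp(-alpha * diff))

# Equally specific expressions (relative specificity 1:1) give 0.5.
print(pragmatic_likelihood(1, 1, alpha=4.97))
# e denotes one item, e' denotes two: e is preferred when alpha > 0.
print(pragmatic_likelihood(1, 2, alpha=1.0))
# alpha = 0 corresponds to a speaker choosing at random.
print(pragmatic_likelihood(1, 3, alpha=0.0))
```

With α = 1 and a 2:1 relative specificity, the logistic form reduces to 1/(1 + 1/2) = 2/3, which matches the closed form in terms of R.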

Model estimation

To calibrate the RSA parameter α with listener behavior observed in the symmetric condition, we estimated the behavioral model using both pooled estimation and hierarchical Bayesian analysis. For pooled estimation, we assumed that the choices of all listeners were generated by a single, shared α, and we applied the maximum likelihood estimation with grid search over a large nonnegative domain for α, because the likelihood function may not be globally concave. Specifically, we fit listener choice data by maximizing the log of posterior probability of observed listener choices, $\sum_{k}\sum_{t} \log P(i_{k,t} \mid e_{k,t}, c_{k,t})$, pooled over listeners k and trials t.
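A pooled grid search of this kind can be sketched as follows. The sketch assumes that each item is a (color, shape) pair, that the speaker chooses between the target's color word and shape word, and that the grid spans 0 to 20 in steps of 0.01; the grid, the data layout, and the toy trials are all hypothetical and are not the authors' settings.

```python
import math

def count(context, feature):
    """N(feature; c): number of items in the context carrying the feature."""
    return sum(feature in item for item in context)

def pragmatic_likelihood(expr, item, context, alpha):
    """P(e | i, c) for a speaker choosing between an item's color and shape."""
    if expr not in item:
        return 0.0  # the expression cannot literally denote this item
    color, shape = item
    alt = shape if expr == color else color
    r = count(context, alt) / count(context, expr)  # relative specificity R
    return 1.0 / (1.0 + r ** (-alpha))

def log_posterior(trial, alpha):
    """Log of P(chosen item | expression, context) for one trial."""
    context, expr, prior, chosen = trial
    joint = [pragmatic_likelihood(expr, item, context, alpha) * p
             for item, p in zip(context, prior)]
    return math.log(joint[chosen] / sum(joint))

def fit_alpha(trials, grid):
    """Return the grid value maximizing the pooled log posterior."""
    return max(grid, key=lambda a: sum(log_posterior(t, a) for t in trials))

# Toy data: "red" fits two items; three choices follow the pragmatic
# reading (red square) and one does not (red circle).
context = [("red", "circle"), ("red", "square"), ("blue", "square")]
uniform = [1 / 3, 1 / 3, 1 / 3]
trials = [(context, "red", uniform, 1)] * 3 + [(context, "red", uniform, 0)]

grid = [k * 0.01 for k in range(0, 2001)]  # assumed search domain
best_alpha = fit_alpha(trials, grid)
print(best_alpha)
```

In this toy data set the pooled objective peaks where the posterior for the red square equals 0.75, which corresponds to α = log2(5), so the grid search returns a value close to 2.32. Trials with a literal mistake would give a zero posterior for the chosen item and must be excluded before fitting.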

Second, to account for individual differences in referential interpretation, we also calibrated individual listener parameters using the well-established hierarchical Bayesian model estimation method. We assumed that the parameter α for each listener was randomly drawn from a normal distribution governed by group-level mean and variance [i.e., $\alpha_k \sim N(\mu, \sigma)$], whereas the group-level parameters were independently sampled from a uniform prior distribution taking values from 0 to infinity. We computed the posterior likelihood of observing listener choices with the Markov chain Monte Carlo (MCMC) method implemented in RStan (71). Two MCMC chains were simulated with 2500 iterations after 2500 burn-ins, resulting in 2500 posterior samples for each parameter in each chain. All parameters were checked for convergence both visually (from the trace plots) and through the Gelman-Rubin test (all $\hat{R} < 1.01$).
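For readers unfamiliar with the convergence criterion, the classic Gelman-Rubin statistic for one parameter can be sketched as below. This is the textbook formula applied to simulated draws; the R̂ reported by Stan is computed with a refined (split-chain) variant, so the sketch is an approximation of the idea rather than a reproduction of the reported diagnostic. The simulated chains are placeholders, not model output.

```python
import random
import statistics

def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                       # B
    var_hat = (n - 1) / n * within + between / n   # pooled variance estimate
    return (var_hat / within) ** 0.5

# Two simulated chains of 2500 draws each, mimicking the chain layout in the text.
random.seed(0)
mixed = [[random.gauss(0, 1) for _ in range(2500)] for _ in range(2)]
stuck = [mixed[0], [x + 5 for x in mixed[1]]]   # second chain sits elsewhere

r_hat_mixed = gelman_rubin(mixed)   # close to 1: chains agree
r_hat_stuck = gelman_rubin(stuck)   # well above 1.01: chains disagree
```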

In the model estimation and subsequent behavioral and neuroimaging analyses, the Bayesian prior probability distribution was empirically measured using an independent online sample. In addition, we excluded trials in which listeners made literal mistakes (e.g., choosing a red circle upon receiving an expression “blue”) from data analyses to avoid zero probability for the chosen item according to model prediction. This resulted in the removal of 0.50 ± 0.12%, 0.33 ± 0.07%, and 0.58 ± 0.17% of trials in the symmetric, asymmetric, and symmetric-garment conditions, respectively. According to the pooled estimation result, the best-fitting α is 4.97, and the log likelihood of listener choices observed in the symmetric condition is −2351.63. For hierarchical Bayesian analysis, the individual parameter $\alpha_k \sim N(5.93, 2.17)$, and the deviance information criterion is 107.36 ± 2.93.

Model comparison

To further verify the plausibility of the RSA model and test for alternative decision strategies that may have been used by listeners, we compared the RSA model with the following models representing competing hypotheses regarding how listeners recognize speaker intentions.

Literal listener model. This model assumes that listeners interpret received expressions literally and randomly choose among the items that the received reference can denote within the context. This model contains no free parameter and serves as a baseline for model comparison.

Flat prior model. This model assumes that a flat prior probability distribution is used for the Bayesian inferential process within RSA, serving to test the assumption that the empirically measured prior probabilities contribute to referential interpretation. This model also contains a single parameter α as in the original RSA.

Sophisticated listener model. This model assumes that speakers think one step further than maximizing reference specificity as proposed by the RSA by taking into account the possible decisions made by a more sophisticated listener who best responds to a specificity-maximizing speaker. The model serves to test for the potential involvement of higher-level reasoning in communication that the population setup and random matching protocol in our experimental design failed to remove. In particular, following the well-established cognitive hierarchy approach (72), we assume that sophisticated listeners derive the most probable referents through a Bayesian inferential process that can be characterized as

$$P_L(i \mid e, c) = \frac{P_S(e \mid i, c)\,P(i)}{\sum_{i' \in c} P_S(e \mid i', c)\,P(i')}$$

where $P_L(i \mid e, c)$ is the posterior probability of sophisticated listeners choosing a particular item i given the expression e and context c, and $P_S(e \mid i, c)$ is the probability of a speaker selecting an expression e to refer to i in a context c. Different from the assumption in RSA, here, the speaker is assumed to cooperate with an RSA listener according to the following softmax decision rule

$$P_S(e \mid T, c) = \frac{1}{1 + \left(\frac{P_{RSA}(T \mid e', c)}{P_{RSA}(T \mid e, c)}\right)^{\alpha}}$$

where $P_{RSA}(T \mid e, c)$ is the probability of an RSA listener recovering a target T from an expression e in context c, and the so-called RSA listener is a listener who uses Bayesian inferences to derive the target based on the expectation that speakers seek to maximize the specificity of the chosen reference. The sophisticated listener model also contains a single free parameter, α, reflecting the choice sensitivity of speakers to the difference between alternative referring expressions.

We compared the fits of listener choices among the competing models and found the highest predictive power by RSA using either pooled estimation or Bayesian model selection (73), according to both in-sample and out-of-sample measurements of goodness of fit (fig. S4).

S+ and S− trials

We classified choices faced by listeners in the symmetric condition into two categories, depending on whether mentally simulating speaker choices was critical for referential interpretation. Besides the data-driven approach introduced in Fig. 2 (C and D), we also provided here the formal definition for the two types. In particular,

if $P(i \mid e, c) = \frac{P(e \mid i, c)\,P(i)}{\sum_{i' \in c} P(e \mid i', c)\,P(i')} = \frac{P(i)}{\sum_{i' \in \{\text{items that can be described by } e\}} P(i')}$, simulation− (S−) type; otherwise, simulation+ (S+) type.

In other words, in the S− trials, the posterior probability distribution that an object would be referred to is independent of the pragmatic likelihood. Mathematically, this is because for the items that cannot be described by a received expression, the associated pragmatic likelihood is always 0 in our task, whereas for items that can be described, the pragmatic likelihoods cancel from the numerator and denominator of the Bayesian formula in S− trials. Figure S5 presents two examples illustrating this point.

Under the current experimental setup, the S− type contains trials where a received reference denotes either a single item (“determined” reference; fig. S5, top) or a number of identical items (i.e., same color and shape; “uncertain” reference; fig. S5, bottom). Together, there are 60 trials with determined references and 38 with uncertain references within the S− category.
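The formal S+/S− definition can be checked numerically, as in the sketch below. It takes the pragmatic likelihoods and priors of the items in one context as given (with likelihood 0 for items the expression cannot describe) and compares the full Bayesian posterior with the prior renormalized over describable items. The function name, tolerance, and example numbers are hypothetical.

```python
import math

def classify_trial(likelihood, prior, tol=1e-9):
    """Return "S-" if the posterior does not depend on pragmatic likelihood, else "S+"."""
    describable = [l > 0 for l in likelihood]
    norm_full = sum(l * p for l, p in zip(likelihood, prior))
    norm_literal = sum(p for d, p in zip(describable, prior) if d)
    for l, p, d in zip(likelihood, prior, describable):
        posterior = l * p / norm_full               # full Bayesian posterior
        literal = p / norm_literal if d else 0.0    # prior renormalized over describable items
        if not math.isclose(posterior, literal, abs_tol=tol):
            return "S+"
    return "S-"

# Determined reference: only one item can be described by the expression.
determined = classify_trial([0.9, 0.0, 0.0], [0.3, 0.3, 0.4])
# Uncertain reference: two identical items share the same pragmatic likelihood.
uncertain = classify_trial([0.5, 0.5, 0.0], [0.2, 0.5, 0.3])
# Two describable items with different likelihoods: simulation matters.
simulated = classify_trial([0.5, 0.03, 0.0], [0.3, 0.3, 0.4])
```

The first two cases correspond to the determined and uncertain references described above; in both, the likelihood terms cancel and the trial is classified as S−.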

fMRI data acquisition and preprocessing

We collected the fMRI images for each listener using a 3T Siemens Prisma scanner and a 32-channel head coil at the Center for MRI Research at Peking University. Images were acquired using echo-planar T2* images with blood oxygenation level–dependent (BOLD) contrast and angled 30° relative to the anterior commissure-posterior commissure (AC-PC) line to minimize susceptibility artifacts in the orbitofrontal area. The scanning parameters are as follows: repetition time (TR) = 2000 ms, echo time (TE) = 30 ms, flip angle = 90°, field of view (FoV) = 192 × 192 mm², slice thickness = 4 mm, slice gap = 0.4 mm, voxel size = 3 × 3 × 4 mm³, and 32 slices. A high-resolution T1-weighted structural image was acquired using a magnetization-prepared rapid gradient echo sequence with the following parameters: TR = 2530 ms, TE = 2.98 ms, flip angle = 7°, FoV = 224 × 256 mm², slice thickness = 1 mm, slice gap = 0.5 mm, voxel size = 0.5 × 0.5 × 1 mm³, and 192 slices.

Imaging preprocessing and analyses were performed in SPM12 (www.fil.ion.ucl.ac.uk/spm/software/spm12/) with MATLAB R2016b. For each fMRI session, the raw images were first slice-timing corrected and then aligned to the first volume to correct for participants’ head motion. After that, the images were spatially normalized into the Montreal Neurological Institute (MNI) template with a final image resolution of 3 × 3 × 3 mm³ and smoothed using a 6-mm full width at half maximum Gaussian kernel. All images were temporally filtered using a high-pass filter with a cutoff of 128 s.

fMRI data analysis

We implemented a GLM for model-based fMRI analysis widely used in the field of decision neuroscience. The best-fitting RSA model parameter from the pooled estimation was used to calculate the trial-wise pragmatic likelihood, posterior probability, and posterior-prior update (with prior probability obtained in a separate online sample) for each listener. These values were then used as parametric modulators in the model-based fMRI analysis for listener brain activity observed in the symmetric condition. To examine the robustness of neural encoding of pragmatic likelihood, we also included a single scanning session of the symmetric-garment condition, where we computed the corresponding pragmatic likelihood values for each trial and each listener, assuming listeners in the symmetric-garment condition shared the same α estimate as in the symmetric condition. Last, to test whether altering common ground between communicators modified the neural encoding of pragmatic likelihood, we included the asymmetric condition, where we entered the same pragmatic likelihood value from the matching symmetric trial as the parametric modulator for the fMRI analyses.

In GLMs, each trial was modeled as four discrete events—item onset, expression onset, choice submission, and fixation onset—all as stick functions (i.e., duration = 0). Regressors were convolved with the canonical hemodynamic response function and entered into a regression analysis against each listener’s BOLD responses. We were specifically interested in listeners’ brain activity when they received the referring expression from the speaker; thus, variables of interest were entered into GLMs as parametric modulators associated with expression onset. The six vectors of head motion parameters derived from preprocessing were also included as nuisance regressors in all analyses.
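The construction of an event regressor and its parametric modulator can be illustrated as follows. The sketch convolves stick functions with a double-gamma response (peak shape 6, undershoot shape 16, ratio 1/6), which approximates but is not identical to the canonical function as implemented in SPM12; sampling at scan onsets, mean-centering the modulator, and the example onsets and values are simplifying assumptions.

```python
import math

TR = 2.0  # seconds, from the acquisition parameters

def double_gamma_hrf(t):
    """Double-gamma hemodynamic response at time t (seconds) after an event."""
    if t <= 0:
        return 0.0
    peak = t ** 5 * math.exp(-t) / math.gamma(6)          # gamma density, shape 6
    undershoot = t ** 15 * math.exp(-t) / math.gamma(16)  # gamma density, shape 16
    return peak - undershoot / 6.0

def build_regressors(onsets, modulator, n_scans, tr=TR):
    """Return (main, parametric) regressors sampled at each scan onset.
    Convolving stick functions with the HRF equals summing shifted HRFs."""
    mean_mod = sum(modulator) / len(modulator)
    centered = [m - mean_mod for m in modulator]   # mean-centered modulator
    main, param = [], []
    for scan in range(n_scans):
        t = scan * tr
        main.append(sum(double_gamma_hrf(t - on) for on in onsets))
        param.append(sum(c * double_gamma_hrf(t - on)
                         for on, c in zip(onsets, centered)))
    return main, param

# Hypothetical session: two expression onsets with pragmatic likelihoods 0.2 and 0.8.
main_reg, param_reg = build_regressors([4.0, 20.0], [0.2, 0.8], n_scans=20)
```

The main regressor captures the average response to expression onset, whereas the parametric regressor captures how that response scales with the trial-wise model variable.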

In particular, the first GLM served to establish the validity of the RSA model at the neural level by testing the neural encoding of an error-like update signal. Thus, the regression model included the posterior-prior update estimate associated with the chosen object as the parametric modulator for trials in the symmetric condition (Fig. 3A).

The second GLM served to look for clusters of brain activity whose variance was uniquely explained by pragmatic likelihood estimates, while simultaneously controlling for other potentially confounding decision variables. We thus included the following variables as parametric modulators at the expression onset: pragmatic likelihood for the chosen item [i.e., P(received expression|chosen item, context)], posterior probability of the chosen item [P(chosen item| received expression, context)], prior probability [P0(chosen item| context)], trial type (S+ = 1; S− = 0), and RT. All regressors were orthogonalized against one another to remove any shared variances. Thus, the regression coefficient reflects the unique contribution of each regressor in explaining the variances in neural signals (Fig. 4A and fig. S8).
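One common way to remove shared variance among parametric modulators is serial Gram-Schmidt orthogonalization, in which each modulator is residualized against those entered before it. The sketch below illustrates that procedure on invented values; whether it matches the exact orthogonalization scheme used in the analysis is an assumption.

```python
def orthogonalize(columns):
    """Serial Gram-Schmidt: each column is made orthogonal to all earlier ones."""
    result = []
    for col in columns:
        mean = sum(col) / len(col)
        resid = [x - mean for x in col]            # mean-center first
        for prev in result:
            denom = sum(p * p for p in prev)
            if denom > 0:
                coef = sum(r * p for r, p in zip(resid, prev)) / denom
                resid = [r - coef * p for r, p in zip(resid, prev)]
        result.append(resid)
    return result

# Hypothetical trial-wise values: likelihood, posterior, and trial type (S+ = 1).
likelihood = [0.2, 0.8, 0.5, 0.9]
posterior = [0.3, 0.7, 0.6, 0.8]
trial_type = [1.0, 0.0, 1.0, 0.0]
ortho = orthogonalize([likelihood, posterior, trial_type])
```

Under a serial scheme the order of entry matters: later columns retain only the variance not already explained by earlier ones.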

In the third GLM, we examined the robustness of pragmatic likelihood encoding by expanding the set of parametric modulators included for the expression onset. This regression model was identical to the second GLM except that it additionally contained the following variables of no interest: dummy variable for message type (color/shape), dummy variable for context configuration (1A1B/2A2B/1A2B; see fig. S2A for details), dummy variable for choice (L/M/R), and two measures related to choice confidence [i.e., (i) the distance between the posterior probability of the chosen item and 0.5 and (ii) the difference in posterior probability between the best versus the second-best option according to the RSA prediction] (fig. S10B).

In the fourth and fifth GLMs, we investigated whether vmPFC encoded other notions of pragmatic likelihood estimates that were not directly associated with listeners’ choices. The parametric modulator for expression onset was the model-derived pragmatic likelihood estimates associated with either the item on the left of the context (GLM4; fig. S11, cyan) or the item with the highest prior probability compared to other items in the same context (GLM5; fig. S11, magenta).

In the sixth GLM, we investigated whether vmPFC encoded pragmatic likelihood estimates in S− trials. The regression model included pragmatic likelihood estimates associated with S+ and S− trials as separate parametric modulators at expression onset (Fig. 5A and fig. S10C).

In the seventh GLM, we investigated whether vmPFC encodes pragmatic likelihood estimates in trials with determined references. The regression model was the same as that in GLM6 except that it included pragmatic likelihood estimates for trials with determined and uncertain references as separate parametric modulators (Fig. 5C and fig. S14A).

Regression betas from each listener were averaged across sessions within each condition and then taken into random-effects group-level analyses. All whole-brain analyses were thresholded and displayed at the FWE-corrected P value (PFWE) of 0.05 at the cluster level, with a cluster-forming threshold of Punc. < 0.001, as reported by SPM. In addition, similar whole-brain results were obtained with a nonparametric thresholding approach applied to the second-level analyses using default settings in SnPM13 (74) (i.e., 5000 permutations, cluster-forming threshold of 0.001, PFWE < 0.05).

fMRI time course analyses

For each scanning session, we extracted the preprocessed BOLD time series as the average of voxels within the vmPFC ROI identified in Fig. 4A. Head motion was then regressed out of the extracted BOLD series to control for potential motion-related artifacts, a high-pass filter (cutoff, 128 s) was applied to remove low-frequency drifts, and the series were oversampled by a factor of 20 to obtain a time resolution of 0.1 s. For each trial, a 13-s window (130 time points) time locked to expression onset (3 s before and 10 s after) was applied. To obtain the parameter estimate time course, we first performed linear regression for each time point to estimate the effects of variables of interest (e.g., pragmatic likelihood and posterior) on the extracted brain activity and then concatenated the regression betas across time points. The time courses plotted in this paper (i.e., Fig. 4B and fig. S11) serve an illustrative purpose only; no statistical tests were performed on these data.
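The epoching and time point-wise regression steps can be sketched as follows. Linear interpolation is assumed for the oversampling step because the interpolation method is not stated, a single predictor is regressed at each time point for simplicity, and all function names are illustrative.

```python
def upsample(series, factor=20):
    """Linear interpolation between successive scans (TR 2 s -> 0.1-s steps)."""
    out = []
    for a, b in zip(series, series[1:]):
        out.extend(a + (b - a) * k / factor for k in range(factor))
    out.append(series[-1])
    return out

def epoch(signal, onset_s, dt=0.1, pre_s=3.0, post_s=10.0):
    """Cut a window from 3 s before to 10 s after onset (130 points).
    Onsets earlier than pre_s are not handled in this sketch."""
    start = round((onset_s - pre_s) / dt)
    return signal[start:start + round((pre_s + post_s) / dt)]

def slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

def beta_time_course(signal, onsets, predictor):
    """Regress the signal on the predictor across trials at each time point."""
    epochs = [epoch(signal, on) for on in onsets]
    return [slope(predictor, [ep[t] for ep in epochs])
            for t in range(len(epochs[0]))]
```

Here `signal` would be the oversampled ROI series for one session, `onsets` the expression onsets in seconds, and `predictor` the trial-wise model variable (e.g., pragmatic likelihood).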

Functional connectivity analyses

To test whether the listener vmPFC differentially connected with areas within the well-established theory-of-mind (ToM) network according to behavioral model predictions, we analyzed functional connectivity between listener vmPFC and ROIs that were a priori selected using Neurosynth (www.neurosynth.org) for the term “theory of mind”. The vmPFC cluster identified in Fig. 4A was used as the seed region for PPI analyses. Four ToM ROIs were defined by 6-mm spheres around peaks of the map automatically generated by Neurosynth for “theory of mind” [dmPFC: (4, 58, 24); LTPJ: (−54, −54, 22); rTPJ: (58, −54, 20); and PC: (−2, −56, 40)].
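Membership in the spherical ROIs can be expressed compactly, as in the sketch below. It uses the peak coordinates listed above and treats 6 mm as the sphere radius, which is an assumption because the text does not state whether 6 mm refers to radius or diameter. The helper names are illustrative.

```python
import math

# Peak MNI coordinates (mm) of the four ToM ROIs, as listed in the text.
TOM_PEAKS = {
    "dmPFC": (4, 58, 24),
    "LTPJ": (-54, -54, 22),
    "rTPJ": (58, -54, 20),
    "PC": (-2, -56, 40),
}

def in_sphere(voxel_mm, center_mm, radius_mm=6.0):
    """True if the voxel center lies within radius_mm of the sphere center."""
    return math.dist(voxel_mm, center_mm) <= radius_mm

def roi_label(voxel_mm):
    """Return the name of the ROI containing the voxel, or None."""
    for name, peak in TOM_PEAKS.items():
        if in_sphere(voxel_mm, peak):
            return name
    return None
```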

We performed two PPI analyses using SPM12. The first PPI model included the following regressors for the event of expression onset: (i) the average BOLD time series extracted from the vmPFC cluster, (ii) the dummy variable indicating whether a listener choice follows the RSA model recommendation (i.e., the choice is assigned with the highest posterior probability by the best-fitting model), and (iii) the interaction term between the average vmPFC time course and the dummy variable. The second PPI model was identical to the first one, except that instead of looking for how functional connectivity differed between following and violating model recommendations, it tested whether the vmPFC connectivity varied according to S+ versus S− trials.
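The logic of the three PPI regressors can be illustrated with a deliberately simplified sketch. It forms the interaction directly on scan-wise signals as the product of the mean-centered seed time series and a contrast-coded psychological variable, and it omits the deconvolution and reconvolution steps that a full PPI implementation applies. The function name and the ±1 coding are assumptions made for illustration.

```python
def ppi_regressors(seed, psych):
    """Return (physiological, psychological, interaction) regressors.
    seed: scan-wise seed time series; psych: scan-wise booleans
    (e.g., True where the choice follows the model recommendation)."""
    mean_seed = sum(seed) / len(seed)
    seed_c = [s - mean_seed for s in seed]               # mean-centered seed signal
    psych_coded = [1.0 if p else -1.0 for p in psych]    # +1 / -1 contrast coding
    interaction = [s * p for s, p in zip(seed_c, psych_coded)]
    return seed_c, psych_coded, interaction

# Hypothetical four-scan example.
phys, psy, inter = ppi_regressors([1.0, 2.0, 3.0, 4.0], [True, True, False, False])
```

A target region whose activity loads on the interaction term, over and above the two main effects, shows connectivity with the seed that differs between the two trial categories.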

SUPPLEMENTARY MATERIALS

Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/7/10/eabe6276/DC1

https://creativecommons.org/licenses/by-nc/4.0/

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is not for commercial advantage and provided the original work is properly cited.

REFERENCES AND NOTES

Acknowledgments: We thank the National Center for Protein Sciences at Peking University and the high-performance computing platform at the Peking-Tsinghua Center for Life Sciences for facilitating data acquisition and computation. Funding: This work is supported by NSFC (32071095, 31671171, and 31630034 to L.Z.). Author contributions: Q.M. and L.Z. designed the study. Q.M. conducted the experiments. All authors analyzed the data and wrote the manuscript. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Code, behavioral, and imaging data are available at Open Science Framework (https://osf.io/jktuq/). Additional data related to this paper may be requested from the authors.
