1. Introduction
Linguists and philosophers have long noticed distinct yet overlapping roles for ‘linguistic knowledge’ and ‘world knowledge’ in language comprehension (Frege, 1948). Linguistic knowledge refers to information that is internal to language, such as grammatical agreement or semantic constraints. A sentence such as ‘the professor suggested the student the idea’ is ungrammatical because the verb suggested does not permit a dative construction without the preposition to (Chomsky, 1957). World knowledge, by contrast, refers to facts about the world itself which make a sentence true or false. The sentence ‘Charlie Chaplin suggested the theory of relativity to Albert Einstein’ is perfectly grammatical, but (as far as we know) false.
A variety of studies indicate that world knowledge has an impact on how we understand language (Warren & Dickey, 2021). We read false sentences more slowly than true ones (Garrod et al., 1994; Milburn et al., 2016), use visual information to resolve ambiguous references (Tanenhaus et al., 1995), and produce similar N400 responses to false sentences as we do to semantically implausible ones (Hagoort et al., 2004). Collectively, these kinds of results suggest that understanding language involves rapidly accessing and integrating arbitrary general knowledge about the world, which has important implications for theories of language comprehension (Barsalou, 1999; Garnham, 2001; Talmy, 2000).
In general, these studies work by manipulating whether or not a sentence is consistent with world knowledge and measuring changes in a relevant processing variable, such as reading time. If comprehenders read consistent sentences faster than inconsistent ones and relevant linguistic factors have been controlled for, we can infer that the difference in reading time must be caused by the comprehender’s sensitivity to world knowledge itself. While experimenters generally control for traditional linguistic confounds such as word length and frequency, an important confound that has rarely been controlled for is the distributional likelihood of the expression. Words are distributed non-randomly in language and some sequences of words appear more frequently than others. In particular, because language describes the world, scenarios that are plausible in the world are also more likely to produce probable sequences of words. A growing body of work shows that comprehenders are sensitive to the distributional likelihood of expressions, above and beyond the lexical frequency of individual words (Arnon & Snider, 2010; Goodkind & Bicknell, 2018; Michaelov et al., 2022).
Until recently, state-of-the-art language models were underpowered to accurately quantify distributional likelihood in experimental stimuli (Jurafsky & Martin, 2014). However, rapid improvement in computational resources and architectures (Vaswani et al., 2017) has led to large language models (LLMs), which use neural networks to generate probability distributions over word sequences (Radford et al., 2019). LLMs serve as helpful baselines for measuring the extent to which variance in a given phenomenon can be accounted for by distributional likelihood. They learn purely from statistical patterns in language and have no access to the innate, sensory, memory, or reasoning resources that might underlie more traditional conceptions of world knowledge (Frege, 1948; Johnson-Laird, 1989). Thus, if an LLM can account for experimental effects that have been attributed to world knowledge, it suggests that distributional information is sufficient in principle to explain the effect in humans, and undermines the claim that world knowledge is necessary to explain that effect.
In the present work, we focus on the role of world knowledge in a specific linguistic phenomenon: ambiguous pronoun resolution. The phenomenon is particularly useful as it allows us to examine the effects that world knowledge can have on a comprehender’s interpretation of a sentence. While other paradigms show that world knowledge violations can lead to processing difficulty, this does not imply that they influence the eventual product of the comprehension process (Ferreira & Yang, 2019). In the case of pronominal ambiguities, however, world knowledge could fundamentally alter the propositional meaning of a sentence – the comprehender’s understanding of who did what to whom. One’s response to the question Can you throw an egg at a concrete floor without cracking it? will differ depending on how one resolves the ambiguous pronoun, it. This not only highlights the importance of explaining these ambiguities, it also makes them easier to study. Differing pronoun interpretations can produce discrete and radically different understandings of the sentence, often more cleanly than other types of ambiguity like polysemy.
In two experiments, we use LLMs as a distributional baseline (DeLong et al., 2023; Jones et al., 2022) to test whether the effects of world knowledge on interpretation can be explained by distributional linguistic information. If LLMs are able to account for knowledge effects, it would suggest that human comprehenders could also, in principle, use distributional information to resolve pronouns. This would undermine claims that non-linguistic general world knowledge is necessary for human language processing. In contrast, if world knowledge has an effect over and above distributional information, it would suggest that human comprehenders are using resources that are not available to the model when resolving pronouns, such as sensory information, embodied cognition, or general reasoning processes. This, in turn, would imply an up-front limit on the capabilities of text-only LLMs and suggest that non-linguistic information is a necessary component of human language comprehension.
In Section 1.1, we briefly survey theories of pronoun interpretation, focusing on evidence for the role of world knowledge. In Section 1.2, we discuss theoretical and empirical support for the idea that distributional information could influence human language comprehension, and consider how LLMs could be used to measure this. In Section 1.3, we outline the two experiments and how their results relate to the research question.
1.1. Theories of pronoun interpretation
Words alone often fail to convey intended meanings. A reader of (1), for instance, might understand that either the baseball or the bat broke, due to the ambiguity of the pronoun it.
A variety of linguistic features have been found to influence ambiguous pronoun resolution. Comprehenders prefer to resolve pronouns to the subject of the previous clause (Crawley et al., 1990) or to a noun phrase in the same grammatical role as the pronoun (grammatical parallelism; Chambers & Smyth, 1998). Other linguistic factors, such as the semantic class of verbs, have also been found to influence pronoun resolution, including the implicit causality of verbs (Garvey & Caramazza, 1974). Although some researchers interpret implicit causality effects as resulting from knowledge about the typical causes of events (Pickering & Majid, 2007; Van den Hoven & Ferstl, 2018), others argue that they result from purely linguistic knowledge about verbs (Hartshorne, 2014). Finally, some pragmatic features, such as the coherence relations between sentences, have been found to alter pronoun interpretation. In a sentence completion task, Kehler and Rohde (2013) found that participants completing the prompt John passed the comic to Bill. He … were more likely to interpret he as referring to John if their continuation elaborated on the first sentence, but to Bill if it described a subsequent event. While the process of inferring a coherence relation between clauses might itself rely on non-linguistic world knowledge (Kehler et al., 2008), there are other cases in which surface features such as conjunctions or grammatical structure can influence coherence relations, which in turn can influence pronoun resolution.
The idea that linguistic features govern pronoun resolution is intuitively appealing. These features are explicitly available to both producer and comprehender, minimizing the potential for miscommunication. They are also easily accessible. Memory-based models of discourse comprehension, such as the minimalist hypothesis (McKoon & Ratcliff, 1992, 2015), argue that comprehenders should only make expensive knowledge-driven inferences when they are necessary to maintain the local coherence of the text. Thus, comprehenders should make use of structural features to resolve pronouns wherever this does not lead to incoherence.
However, in some cases, these linguistic features fail to account for our intuitions about how pronouns should be resolved. In (1), for instance, the grammatical subjecthood and parallelism biases, as well as surface features suggesting an occasion coherence relation between the clauses, all favour the subject of the previous clause (the baseball) as the antecedent of it. A reader who is familiar with baseballs and bats, however, might know that the bat is more likely to break in this case, and have an intuition that the pronoun should be resolved to the object (the bat). These kinds of cases have motivated researchers to posit that comprehenders can access and deploy arbitrary general knowledge during sentence parsing in order to rapidly determine which of the possible interpretations of the sentence is most plausible (Graesser et al., 1994; Hobbs, 1979; Sanford & Garrod, 1998).
There is a wide range of theoretical and empirical support for the idea that world knowledge can have this kind of influence. Constructivist theories of discourse processing argue that comprehenders routinely deploy their world knowledge to form a coherent understanding of the described situation (Graesser et al., 1994; Sanford & Garrod, 1998), and that pronouns are inevitably resolved as a by-product of this process (Garnham, 2001; Hobbs, 1979). Related psycholinguistic research shows that world knowledge can interact with other pragmatic phenomena, such as scalar implicature and the informativity of labels (Degen et al., 2015, 2019). Although many theoretical accounts of knowledge-driven pronoun resolution do not specify detailed mechanisms, some more general models of world knowledge influence provide promising candidate mechanisms for the phenomenon. One type of mechanism proposes that a comprehender’s initial interpretation of a sentence is validated against world knowledge (O’Brien & Cook, 2016), and will be rejected or revised if an inconsistency is discovered. For example, a reader of (1) might initially use structural cues to interpret it as referring to the baseball. Upon validating this inference, the reader would recognize the inconsistency with world knowledge and revise their interpretation of it as referring to the bat. Alternatively, world knowledge might influence expectations about how the text will unfold before an initial interpretation has been selected (Sanford & Garrod, 1998; Venhuizen et al., 2019). In our example, a comprehender may increase their calculated probability of broke(bat) even before they encounter the ambiguous pronoun. Although linguistic bias will later encourage the comprehender to resolve it to the subject of the previous clause (the baseball), this might not overcome the prior, knowledge-driven bias toward the bat.
Several empirical studies provide support for knowledge-driven pronoun resolution specifically. Marslen-Wilson and colleagues (Tyler & Marslen-Wilson, 1982a, 1982b) used contexts such as (2)–(3) to test whether participants were faster to name a congruous (her) or incongruous (him) completion following (4).
While participants can use explicit name and gender information in (4a) and (4b) to resolve the subject to Philip, participants who heard (4c) must make the inference that the old woman was unable to run and hence is unlikely to be the subject of the clause. Nevertheless, participants showed a similar-sized delay for incongruous vs congruous probes in all three conditions. This suggests that knowledge-driven inferences can be used to resolve ambiguous references even in the absence of linguistic cues.
In a pilot study, Gordon and Scearce (1995) found that pronoun interpretation is influenced by modulating the verb in sentences like Bill wanted John to look over some important papers… Unfortunately he never [sent/received] them. More recently, Bender (2015) established a human baseline for the Winograd Schema Challenge: an artificial intelligence (AI) benchmark consisting of pronoun resolution problems designed to require world knowledge. Given a pair of sentences such as (5), comprehenders tended to resolve the pronoun she to Ann in (5a), but to Mary in (5b).
While the test is used to evaluate AI models under the assumption that the knowledge-consistent answer is correct, the human baseline of 92% provided by Bender (2015) establishes empirically that human comprehenders’ responses conform to the test designers’ intuitions, tending to be sensitive to the plausibility of interpretations.
Although these results are consistent with the hypothesis that world knowledge influences pronoun resolution, they are also open to alternative interpretations that cannot be ruled out based on the design of the studies. First, these studies did not measure or control for other factors known to influence pronoun resolution, including the implicit causality of verbs (Garvey & Caramazza, 1974; Hartshorne, 2014) and conjunctions that alter coherence relations (Kehler & Rohde, 2013; e.g., (5)). Effects that have been attributed to world knowledge could therefore be caused by uncontrolled variance in these other factors, just as selectional restrictions and co-occurrence statistics have been found to account for world knowledge effects in other domains (Warren & Dickey, 2021; Willits et al., 2015).
Second, these studies do not provide any independent measure of world knowledge plausibility. The experimenter, relying on their intuition to label one antecedent as more plausible, might inadvertently be influenced by pragmatic and lexical information that was not controlled for. Third, the methods used in existing studies (explicit comprehension questions and cross-modal probing) could impose unnatural task demands on comprehenders, which might encourage them to deploy world knowledge more readily than they would in a more naturalistic language comprehension scenario (Ferreira & Patson, 2007). Even theories that propose a limited role for world knowledge in language comprehension acknowledge that strong task-specific incentives can motivate strategic knowledge-driven inferences (McKoon & Ratcliff, 2015). This weakens the bearing of existing evidence on the stronger claim that world knowledge is deployed automatically in the course of understanding language. Finally, existing work does not control for the distributional confound: the possibility that distributional cues, learnable from co-occurrence statistics in language, could explain the proposed effect. We turn to this account in more detail in the next section.
1.2. Distributional information
In addition to generalizable linguistic features that influence ambiguity resolution, the rich signal of natural language provides a panoply of subtler cues. Some sequences of words appear more frequently than others, and comprehenders might use their implicit knowledge of these patterns to select interpretations that are more statistically likely. The way that words are distributed in language implicitly encodes information about the world. If baseball bats are more likely to break than baseballs are, then the word breaks might be more likely to follow bat than baseball. Even in cases where the exact sequence has never been observed before, a distributional learner can learn that bat breaks is more likely based on other similar contexts in which bat and break are used (Firth, 1957; Mikolov et al., 2013). A comprehender could use this statistical knowledge to resolve it in (1) to the bat by asking which of the two noun phrases is more likely to appear in the context that surrounds the ambiguous pronoun.
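To make the distributional strategy concrete, the following minimal sketch shows how co-occurrence counts alone can favour one antecedent. It uses a toy bigram model over a miniature corpus that we invent purely for illustration; a real distributional learner would of course be trained on vastly more data and would generalise beyond exact counts.

```python
from collections import Counter

# Invented miniature corpus standing in for distributional experience
# with 'bat' and 'baseball'; illustration only.
corpus = (
    "the bat broke . the bat broke . the bat cracked . "
    "the baseball flew . the baseball flew . the baseball broke ."
).split()

# Count bigram and unigram frequencies.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(word, context):
    """Estimate P(word | context) from bigram counts."""
    return bigrams[(context, word)] / unigrams[context]

# 'broke' is more probable after 'bat' than after 'baseball' in this
# corpus, so a purely distributional comprehender would favour
# resolving 'it broke' to the bat.
print(p_next("broke", "bat"))
print(p_next("broke", "baseball"))
```

Neural language models replace these raw counts with learned representations, but the comparison they support is the same: which continuation is more probable given the context.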
Although such distributional accounts of language understanding are not new (Firth, 1957; Harris, 1954; see Lenci, 2018 for discussion), the recent success of large language models has created renewed interest in these theories. Language models learn to assign probabilities to word sequences based on statistical patterns in the way that words are distributed in language. While early n-gram models simply learned transition probabilities between one sequence of words and the next, modern language models use neural networks to represent words in a multidimensional meaning space, allowing them to generalise to sequences they have never observed before (Jurafsky & Martin, 2014). Additionally, they contain attention mechanisms that allow them to relate words in the input stream to one another and represent each word differently depending on its context (Vaswani et al., 2017). Modern large language models are neural language models with billions of parameters trained on corpora of hundreds of billions of words or more. Some LLMs are additionally fine-tuned using reinforcement learning from human feedback (RLHF) to make their responses to input prompts safer and more useful for downstream tasks (Ouyang et al., 2022).
Not only do LLMs provide an explicit computational operationalization of the distributional hypothesis, but a spate of recent work shows that they are predictive of a number of human behavioural measurements, lending credence to the idea that distributional information might be sufficient to explain some aspects of human language comprehension. LLMs accurately predict a variety of measures including word relatedness judgements (Li & Joanisse, 2021; Trott & Bergen, 2021), visual similarity ratings (Lewis et al., 2019), category-membership judgements (Lenci, 2018), N400 amplitude (Michaelov et al., 2022) and reading time (Goodkind & Bicknell, 2018). Schrimpf et al. (2021) find that transformer-based LLMs predict nearly 100% of explainable variance in neural responses to sentences (fMRI and ECoG) and suggest that LLMs ‘serve as viable hypotheses for how predictive language processing is implemented in human neural tissue’ (p. 8).
Even in cases where we might expect world knowledge and contextual reasoning to be crucial, LLMs show an uncanny ability to mimic human response patterns. Nieuwland and Van Berkum (2007) found that human comprehenders show a large N400 response to implausible sentences such as ‘The peanut was in love’, except when they are preceded by a motivating context (e.g. a story about an animate peanut meeting an almond). The typical explanation of such a result is that comprehenders can use contextual information and world knowledge to process unlikely and otherwise implausible sentences. However, Michaelov et al. (2023) find that distributional models replicate the human effect, preferring the animate critical sentence to an inanimate control sentence when given the motivating story as context. This suggests that a sufficiently sensitive distributional learner can recognize that even a globally very unlikely sequence can become probable in the right context.
To the extent that LLMs can predict human responses, it suggests that distributional information is sufficient to generate these responses. Although human comprehenders could still be using alternative mechanisms to reach the same results, evidence for the sufficiency of distributional information undermines claims that other resources – such as innate capacities, sensory input, or world knowledge – are necessary to produce the relevant behaviour. This matters because existing evidence for world knowledge influence is implicitly based on the assumption that – known linguistic factors having been controlled for – differences in responses between conditions must be attributable to non-linguistic world knowledge. A distributional language learner, however, might infer that agents who are described as old or have previously been the subject of fall are unlikely to later be the subject of the verb to run. Such a learner might assign a much lower probability to the incongruous completion of (4), which could explain the observed reading time effect in humans (Marslen-Wilson et al., 1993).
While previous work (Kehler et al., 2004) found that predicate-argument frequency statistics did not improve the accuracy of a morphology-based pronoun resolution system, the size and complexity of modern LLMs might allow them to exploit subtler and more nuanced statistical relationships. Winograd Schemas were initially very challenging for computational models due to the deep and complex knowledge apparently required to solve them correctly. Recent advances, however, have allowed LLMs to perform as well as humans at this challenge (Kocijan et al., 2019, 2023; Sakaguchi et al., 2020). If computational models can resolve these ambiguous pronouns with access only to distributional information, additional evidence would be required to make the case that human comprehenders are drawing on non-linguistic world knowledge directly, rather than using the same distributional information available to language models.
1.3. The present study
We present two experiments designed to control for potential confounds in existing work in order to provide a more robust estimate of the influence of non-linguistic world knowledge on pronoun resolution. We

1. develop a set of stimuli similar to (1), varying the plausibility of different ambiguous pronoun interpretations while holding linguistic factors constant;
2. norm stimuli for their degree of linguistic and world knowledge bias;
3. measure the distributional likelihood of different pronoun interpretations in our stimuli using GPT-3, an LLM;
4. explicitly probe how comprehenders resolve ambiguous pronouns using comprehension questions (experiment 1);
5. measure spontaneous pronoun resolution in the absence of explicit task demands using a self-paced reading paradigm (experiment 2);
6. predict responses in each experiment using the world knowledge bias norms, controlling for the influence of linguistic bias and distributional likelihood.
We are interested in three distinct questions, each of which has different implications for the theories discussed above. First, do we see a significant effect of world knowledge bias on pronoun resolution decisions after controlling for linguistic bias? Accounts that explain pronoun resolution decisions on the basis of syntactic factors (Chambers & Smyth, 1998; Crawley et al., 1990) or lexical semantics (Hartshorne, 2014) do not predict a marginal effect of world knowledge, as the predictive features in these theories have been held constant across conditions in our experiments. Although these theories do not claim that structural features exhaustively determine resolution decisions, a marginal effect of world knowledge would point to a systematic way in which these theories collectively fail to predict pronoun interpretation. Empirical work suggests that pragmatic biases such as scalar implicature can attenuate potential world knowledge effects (Degen et al., 2015), and so we might expect to see a similar attenuation for pronoun interpretation where informative structural cues are available.
Second, does this effect of world knowledge persist when controlling for the distributional likelihood of interpretations? If LLM predictions are sufficient to explain away world knowledge effects, it would undermine the claim that humans must be using non-linguistic world knowledge to resolve these ambiguities and raise the possibility that humans could also be exploiting distributional statistics (Michaelov et al., 2022; Schrimpf et al., 2021). In contrast, however, if world knowledge continues to have an independent effect on pronoun interpretation, it will provide robust evidence that non-linguistic world knowledge influences comprehenders’ interpretation in a way that cannot be captured by current state-of-the-art distributional models, and suggest a way in which these models may need to be augmented in the future if they are to achieve human-like understanding of language.
Finally, do the effects of world knowledge persist in a self-paced reading paradigm without cues to resolve the pronoun (experiment 2)? Theories which posit that expensive knowledge-driven inferences are only made strategically in response to a break in coherence (McKoon & Ratcliff, 1986, 2015) might predict an effect of world knowledge in experiment 1 (where comprehenders are encouraged to deliberate on their interpretation by a comprehension question), but not in experiment 2 (where comprehenders could form an alternative coherent interpretation of the passage without drawing on world knowledge). A marginal effect of world knowledge in experiment 2 would suggest that non-linguistic world knowledge is deployed spontaneously, even in the absence of specific task demands or cues (Garnham, 2001; Hobbs, 1979; O’Brien & Cook, 2016; Sanford & Garrod, 1998; Venhuizen et al., 2019).
2. Experiment 1
In experiment 1, we tested whether knowledge about the plausibility of physical events would influence pronoun resolution. Participants in the main experiment read sentences such as (6a) or (6b) and then responded to comprehension questions that indirectly probed their interpretation of the pronoun (e.g. What broke?).
In each sentence, we refer to the first noun phrase (e.g. the vase in (6a)) as NP1 and the second noun phrase (e.g. the rock in (6a)) as NP2. Collectively we refer to these noun phrases (NPs) as the candidate antecedents. The only difference between the two versions of the sentence is that the order of the NPs is swapped. We are interested in the proportion of participants who resolve the pronoun to NP2 in each case.
We held the linguistic factors discussed in Section 1.1 constant across the versions of each item. In both cases, NP1 is the subject of the previous clause, meaning it is favoured by both the subject assignment and grammatical parallelism biases. The lexical semantics of the two sentences are identical except for the fact that the positions of rock and vase are reversed, so any semantically induced subject or object biases should be identical between the sentences. Finally, the surface features influencing the coherence relation between the clauses were identical across versions. We refer to these non-world knowledge factors as linguistic bias. Orthogonally, to the extent that a comprehender’s commonsense knowledge about the physical world influences their pronoun resolution decision, each sentence also has some latent world knowledge bias. For example, if the participant knows that vases are more fragile than rocks then they might be biased toward NP1 in (6a), but toward NP2 in (6b). We refer to this knowledge-driven influence as the world knowledge bias.
In order to independently measure the strength of the linguistic and world knowledge biases we ran two norming studies using modified versions of the stimuli. For the linguistic bias norming study, we replaced the NPs in each experimental item with two NPs deemed equally likely to participate in the critical event, such as in (7).
There is no commonsense reason why a purple vase should be more or less likely to break than a green vase, so participants’ pronoun resolution decisions should be wholly driven by linguistic factors (such as grammatical role). We confirmed this by checking that there were no large differences between responses to version (a) and version (b) (i.e. the linguistic bias for each version should be the same). We operationalized the linguistic bias for an item as the proportion of participants who responded with NP2 in the linguistic bias norming study.
Second, in order to measure the world knowledge bias for each item, we reframed the pronoun resolution problem as an explicit hypothetical reasoning question:
Here, linguistic factors ought to have no influence as participants are explicitly encouraged to reason about the physical situation using their knowledge about the world. Again we can confirm this by checking that the bias is inverted between versions (if the bias for version (a) is 0.1, the bias for version (b) ought to be around 0.9).
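As a concrete sketch of how these norming measures are operationalized, the snippet below computes the NP2 proportion for each version of a hypothetical item and checks that the world knowledge bias approximately inverts between versions. The response data are invented for illustration; they are not our actual norming data.

```python
# Hypothetical responses for one item: 1 = participant chose NP2, 0 = NP1.
# The numbers are invented for illustration only.
responses = {
    "world_knowledge": {"a": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
                        "b": [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]},
}

def np2_bias(trials):
    """Operationalize bias as the proportion of NP2 responses."""
    return sum(trials) / len(trials)

bias_a = np2_bias(responses["world_knowledge"]["a"])
bias_b = np2_bias(responses["world_knowledge"]["b"])

# Because swapping the NPs swaps which referent is knowledge-favoured,
# the world knowledge bias should approximately invert between versions.
print(bias_a, bias_b)
assert abs(bias_a - (1 - bias_b)) < 0.15
```

The same `np2_bias` computation, applied to the linguistic norming responses, would instead be expected to give similar values for the two versions rather than inverted ones.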
If participants are guided purely by the surface cues discussed above, then there should be no difference in the proportion of participants who respond with NP2 between (6a) and (6b). Furthermore, their responses should be predicted by the linguistic bias values elicited in the norming study. In contrast, if comprehenders are deploying physical world knowledge in order to select the most plausible interpretation, they will select the same antecedents as the participants who were asked explicit reasoning questions in the world knowledge norming study. That is, there will be a positive effect of world knowledge bias on pronoun interpretation, even when controlling for the influence of linguistic bias.
We used LLMs as a distributional baseline to control for the possibility that effects could be driven by uncontrolled variance in the probability of word sequences. We included LLM responses for each item as a predictor in our regression model and tested whether world knowledge explained independent variance, just as one might control for word frequency in a lexical decision task. To the extent that participants are using distributional knowledge to resolve pronouns, probabilities assigned to sequences by an LLM should explain variance in human responses. Yet if humans are still using non-linguistic world knowledge – not learnable from language alone – to resolve pronouns, then we expect that world knowledge bias will explain additional variance even when controlling for the LLM responses.
2.1. Norming studies
2.1.1. Method
2.1.1.1. Participants
All research was approved by the UC San Diego Institutional Review Board. We recruited 35 native English-speaking undergraduate students from the UC San Diego Psychology Department subject pool, who provided informed consent using a button press and received course credit as compensation for their time. All participants successfully answered $ \ge 2/3 $ of attention check trials. We excluded 1 participant who indicated they were not a native English speaker and 1 participant who took over 1 hour to complete the experiment. We excluded 43 trials where the response time was <500 ms (indicating guessing) and 55 trials where the response time (offset by 191 ms per syllable of question length) was >10 s (indicating inattention or excessive deliberation). We used 191 ms/syllable based on an estimate of the mean reading speed for English (Trauzettel-Klosinski et al., 2012). We retained 892 trials (463 world knowledge, 429 linguistic) from 33 participants (17 world knowledge, 16 linguistic; 23 female, 8 male, 2 non-binary; mean age = 20.3, $ SD=1.8 $). The world knowledge norming study lasted 7.3 min on average ($ SD=2.2 $), while the linguistic norming study lasted 20.6 min on average ($ SD=6.4 $). The difference in duration was due to the inclusion of filler items in the linguistic norming study.
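For concreteness, the trial-level exclusion rule can be sketched as follows (an illustrative sketch, not the authors' code; the function and the syllable-count input are hypothetical, while the thresholds follow the text):

```python
# Sketch of the response-time exclusion criteria described above.
# A trial is dropped if the response is implausibly fast (<500 ms,
# suggesting guessing) or implausibly slow (>10 s plus an allowance
# of 191 ms per syllable of the question, suggesting inattention).

READING_RATE_MS_PER_SYLLABLE = 191  # estimated mean English reading speed

def keep_trial(response_time_ms: float, question_syllables: int) -> bool:
    """Return True if a trial passes both response-time criteria."""
    if response_time_ms < 500:  # too fast: likely guessing
        return False
    ceiling = 10_000 + READING_RATE_MS_PER_SYLLABLE * question_syllables
    if response_time_ms > ceiling:  # too slow: inattention or deliberation
        return False
    return True
```
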
2.1.1.2. Materials
We created two alternate versions of each of the critical items from the main experiment (see Section 2.2). To elicit linguistic bias norms, we replaced the candidate antecedents with two objects that we deemed equally physically plausible. We used either modifiers that did not alter the physical properties relevant to the plausibility of the candidate, or different objects that were similar in relevant properties. To elicit world knowledge norms we reframed the pronoun resolution problem as an explicit reasoning task (see Table 1). All materials, data, and analysis code that support these results are available on the Open Science Framework at https://osf.io/v8rjm/.
Note: In the main experiment (rows 1–2), we measured the proportion of responses that resolved an ambiguous pronoun to the second of two noun phrases (NP2, in bold). In the linguistic norming study (rows 3–4) we replaced experimental NPs with two NPs that were similar in relevant physical characteristics, in order to measure how the linguistic structure of the sentence biased interpretation. In the world knowledge norming study (rows 5–6), we reframed the pronoun resolution problem as an explicit physical reasoning task, to measure the plausibility of interpretations.
2.1.1.3. Procedure
The experiment was designed using jsPsych (De Leeuw, 2015) and hosted online. Passages were presented for 250 ms + 191 ms/syllable (Trauzettel-Klosinski et al., 2012). A question then appeared below the passage with two response options. In the world knowledge norming study, the question was presented immediately and the response options were revealed after a delay. Participants used the keyboard to indicate their responses. Participants saw two examples with instructions on how to respond in each case. The examples were counterbalanced with respect to presentation order and (in the linguistic bias norming study) did not require the use of physical inference to resolve. Participants in both norming tasks were presented with 30 critical items and 3 attention check trials. Participants saw a randomly selected version of each critical item (e.g. either (7a) or (7b)). In attention check trials, participants answered simple binary questions (e.g. ‘which word contains more letters: elephant or dog?’). In the linguistic norming study, 45 filler items were included in order to mask the purpose of the study from participants. Filler items were taken from other pronoun resolution studies (Bender, 2015; Crawley et al., 1990; Smyth, 1994). Filler items did not encourage physical inference and were balanced with respect to NP1/NP2 bias. The presentation order of items was randomized. The position of response options was also randomized so that the NP1 response appeared on the right in half of the trials.
2.1.1.4. Results
Responses were aggregated by item to find the proportion of NP2 responses in each norming study. Results for a single item are shown in Table 1. Items in the linguistic bias norming study elicited responses that were heavily skewed toward NP1 (see Figure 1). This is likely due to subjecthood bias (as NP1 was often the subject) and grammatical parallelism (as ambiguous pronouns were often grammatical subjects). We confirmed that differences between the NPs were not influencing decisions by calculating the mean absolute difference in the proportion of NP2 responses when the order of the NPs was reversed ($ M=0.189 $, $ SD=0.178 $). Most items in the world knowledge norming study elicited 0% or 100% NP2 responses, indicating high agreement among participants. Because reversing the order of the NPs in an item effectively reverses its bias with respect to NP1/NP2-coding, we confirmed that the order of the two NPs was not otherwise influencing decisions in the world knowledge norming study by checking that the mean absolute difference in the proportion of NP2 responses between item versions was close to 1 ($ M=0.900 $, $ SD=0.160 $).
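The by-item aggregation and the order-counterbalancing check amount to the following computation (an illustrative sketch with a hypothetical trial format; the actual analysis code is available on the OSF repository):

```python
from collections import defaultdict

def np2_proportions(trials):
    """Aggregate binary responses (1 = chose NP2) by (item, version)."""
    responses = defaultdict(list)
    for item, version, chose_np2 in trials:
        responses[(item, version)].append(chose_np2)
    return {key: sum(vals) / len(vals) for key, vals in responses.items()}

def mean_abs_version_difference(props, items):
    """Mean |P(NP2 | version a) - P(NP2 | version b)| across items.
    Small values suggest the two NPs were interchangeable (linguistic
    norming); values near 1 suggest that reversing NP order reverses
    the bias (world knowledge norming)."""
    diffs = [abs(props[(i, 'a')] - props[(i, 'b')]) for i in items]
    return sum(diffs) / len(diffs)
```
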
2.2. Main experiment
2.2.1. Methods
2.2.1.1. Participants
Participants were recruited, excluded, and compensated in the same manner as described for the norming studies. Forty-eight participants were recruited, and 7 were excluded (5 who were not native English speakers; 1 who failed to answer $ \ge 2/3 $ of attention check trials; 1 with a completion time $ >1 $ h), leaving 41 (25 female, 13 male, 1 non-binary, 2 prefer not to say; mean age = 20.3, $ SD=2.6 $). Mean completion time was 20.6 minutes ($ SD=7.4 $). We excluded trials where the response time was <500 ms (46 trials) or exceeded 10 s offset by 191 ms/syllable (105 trials), leaving 1,079.
2.2.1.2. Materials
Thirty critical items were designed so that each featured an introductory clause that referred to two objects (the candidates) and an ambiguous pronoun that referred back to one of the candidates in a later clause. The latter clause described a physical event in which one of the candidates was a more plausible participant than the other, as in (6). We used a variety of situations that required invoking different physical properties to infer the most plausible candidate, including mass, velocity, momentum, brittleness, mass distribution, surface area, scratch hardness, indentation hardness, melting point, and flammability. We created novel stimuli to minimize the risk of dataset contamination: the possibility that LLMs had already been exposed to the stimuli in their pre-training dataset. All items were designed so that the candidates could be switched, and the order of the candidates was randomized across participants, forming pairs (see Table 1, rows 1–2).
2.2.1.3. Procedure
The main experiment proceeded exactly as the linguistic bias norming study, described above (including the same instructions and filler items).
2.2.1.4. LLM analysis
We elicited predictions for each item using an LLM, GPT-3 (Brown et al., 2020). We selected GPT-3 because it is one of the best-performing LLMs available to the general public, and because it performs particularly well in a zero-shot setting, where it is not fine-tuned on a specific task. More specifically, we used GPT-3 text-davinci-002, a 175-billion-parameter model that was pre-trained on more than 200 billion words and additionally fine-tuned on user requests. We chose not to use later models in the GPT series because they have been additionally fine-tuned using reinforcement learning from human feedback (RLHF). RLHF introduces a training signal beyond the likelihood of word sequences in language, making these models unsuitable for measuring how far language statistics alone can account for an effect. We accessed GPT-3 text-davinci-002 (henceforth, GPT-3) through the OpenAI API. Following the method used for pronoun resolution problems by Brown et al. (2020), we replaced the pronoun in each stimulus with each of the candidate antecedents and elicited the summed log probability of the tokens that followed the pronoun. For (6a), this meant finding:
Importantly, the model is not asked to estimate the likelihood of the candidate antecedent itself. Instead, the model’s estimate of the completion of the sentence is conditioned on the pronoun being replaced by the antecedent. This allows us to measure the likelihood that the model assigns to the completion of the sequence, given that the pronoun is taken as referring to a given antecedent. This method has been found to be effective for knowledge-driven pronoun resolution in other settings, such as the Winograd Schema Challenge (Kocijan et al., 2023; Radford et al., 2019). We used the logistic function to transform the log odds ratio ((9b)–(9a)) into a probability of the model selecting NP2.
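The transformation from the two completion log probabilities to a model ‘response’ can be sketched as follows (an illustrative sketch; the function and argument names are ours, and the log-probability values would come from the API):

```python
import math

def p_np2(logp_np1: float, logp_np2: float) -> float:
    """Logistic transform of the log odds ratio between the completion
    log probabilities obtained with each candidate antecedent
    substituted for the pronoun ((9b) - (9a) in the text)."""
    return 1.0 / (1.0 + math.exp(-(logp_np2 - logp_np1)))
```

When the two substitutions yield equal completion log probabilities, the transform returns 0.5; a higher log probability for the NP2 substitution pushes the value toward 1.
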
2.2.2. Statistical analysis
We constructed mixed-effects logistic regression models using the lme4 package (v1.1.31; Bates et al., 2007) in R (v4.2.2; R Core Team, 2013). Regression models predicted the proportion of NP2 responses for each item version in the main experiment. We fit a maximal random effects structure (Barr et al., 2013) in order to minimize the risk of spurious explanatory power being attributed to our fixed effects. Each model contained random slopes for world knowledge bias and linguistic bias by participant, and random intercepts by participant and by item-version nested within item. We used likelihood ratio tests to perform nested model comparisons that measured the predictive value of adding additional predictors to a null model with random effects only. Our full model structure was as follows:
2.3. Results
No significant effect of linguistic bias was detected compared to a null model with only random effects ( $ {\chi}^2(1)=0.387 $ , $ p=0.534 $ ; marginal $ {R}^2=0.002 $ ; see Figure 2). Distributional likelihood (operationalized as GPT-3 probabilities) significantly improved the fit of a model with only linguistic bias as a predictor ( $ {\chi}^2(1)=20.1 $ , $ p<0.001 $ ; marginal $ {R}^2=0.121 $ ). World knowledge had a significant effect on responses when controlling for linguistic bias only ( $ {\chi}^2(1)=65.2 $ , $ p<0.001 $ ; marginal $ {R}^2=0.549 $ ), and when controlling for both linguistic bias and distributional likelihood ( $ {\chi}^2(1)=50.6 $ , $ p<0.001 $ ; marginal $ {R}^2=0.555 $ ).
The full model showed a significant positive effect of world knowledge bias ($ \beta =5.56 $, $ p<0.001 $), and nonsignificant effects of GPT-3 predictions ($ \beta =1.38 $, $ p=0.171 $) and linguistic bias ($ \beta =0.028 $, $ p=0.967 $; see Table 2). These results show that world knowledge bias explains additional variance in responses that is not accounted for by linguistic or distributional information. Consequently, world knowledge appears to affect interpretation in ways that cannot be explained away by existing linguistic models or the distributional knowledge account.
Note: There was a significant effect of world knowledge even after controlling for the other predictors.
Bold typeface indicates p-values < 0.05.
We performed follow-up analyses to better understand the divergence between world knowledge and distributional information. There was a fairly strong correlation between the distributional likelihood of a response and the world knowledge bias toward it ($ r=0.548 $). GPT-3 predictions did not improve the fit of a model with linguistic bias and world knowledge as predictors, indicating that distributional information does not explain variance independent of these measures ($ {\chi}^2(1)=2.80 $, $ p=0.094 $). In order to test whether the world knowledge variable benefited from being less graded (and hence more decisive) than distributional likelihood, we ran a follow-up analysis with transformed GPT-3 probabilities that had been binned into the number of unique values in the world knowledge variable (13). However, the pattern of results was very similar (distributional likelihood vs linguistic bias: $ {\chi}^2(1)=19.6 $, $ p<0.001 $, marginal $ {R}^2=0.120 $; world knowledge vs distributional likelihood + linguistic bias: $ {\chi}^2(1)=50.7 $, $ p<0.001 $, marginal $ {R}^2=0.556 $).
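Since the binning scheme is not fully specified here, the following sketch illustrates one plausible implementation using equal-width bins (an assumption on our part, not the authors' procedure):

```python
# Hedged sketch of the binning analysis: map graded GPT-3 probabilities
# onto k discrete levels, where k is the number of unique values in the
# world knowledge variable (13 in the text). Equal-width bins and the
# use of bin midpoints are assumptions.

def bin_probabilities(probs, k=13):
    """Assign each probability in [0, 1] to one of k equally wide bins,
    returning the bin midpoint as the transformed value."""
    width = 1.0 / k
    binned = []
    for p in probs:
        idx = min(int(p / width), k - 1)  # clamp p == 1.0 into top bin
        binned.append((idx + 0.5) * width)
    return binned
```
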
Overall, GPT-3 preferred the more physically plausible antecedent on 73% of items, compared to 85% for human comprehenders. Of the 16 items where GPT-3 produced an answer that was inconsistent with physical world knowledge, 8 were paired versions from the same 4 item templates (i.e. GPT-3 produced knowledge-inconsistent answers on both version A and version B of the item, suggesting that its implicit representation of the physical world was inconsistent with that of human comprehenders). All of these items involved relatively complex physical interactions that took place over time in sealed containers (e.g. whether a lime or a can of tomatoes would be squashed in a shopping bag; whether a shirt or a book would be creased in a suitcase; whether a cardboard or steel box would be crushed in a moving van; whether keys or coins would create a hole in a pocket), suggesting that the model’s representations may not be sufficiently fine-grained to infer the results of more involved physical interactions. The other 8 errors occurred on distinct items (i.e. GPT-3 produced a knowledge-consistent answer on the reversed counterpart version). In each of these cases, GPT-3 predicted NP2 for both item versions. In 6/8 cases this was inconsistent with the linguistic bias as measured in the norming study. This suggests that GPT-3 was also making use of some structural cues (though different ones than human comprehenders) to make predictions, and that in some cases these cues were strong enough to override any influence of physical plausibility.
2.4. Discussion
Participants were more likely to select an NP as the antecedent of a pronoun if the NP was judged to be a more plausible participant in the described event. In contrast, the linguistic bias of the sentence – exerted by grammatical features and measured in the linguistic norming study – did not show a significant effect on pronoun resolution decisions. Although the distributional likelihood of an interpretation (measured using GPT-3) had a significant effect on comprehenders’ responses, world knowledge bias improved model fit when controlling for both linguistic and distributional information.
The results suggest that non-linguistic world knowledge does exert an influence on pronoun resolution. They also provide more robust evidence that pronoun resolution cannot be explained purely by syntactic, lexical, and discourse coherence factors (Crawley et al., 1990; Grosz et al., 1995; Hartshorne, 2014). Moreover, they suggest that while LLMs can implicitly represent some of the world knowledge comprehenders use to resolve ambiguities, a large portion of the effect of world knowledge is not currently captured by these models. This result is inconsistent with the distributional hypothesis and with claims that large language models approximate the human language comprehension process (Schrimpf et al., 2021). Instead, the effect confirms the prediction of accounts which argue that comprehenders activate relevant world knowledge during language comprehension in order to resolve ambiguities in the linguistic signal by selecting the most plausible interpretations (Hobbs, 1979; Sanford & Garrod, 1998).
There are several limitations of the study, however, which constrain the generalizability of the findings. First, the passages are very short (1–2 sentences), so they might lead participants to engage special strategies that are not representative of more ecologically typical reading of longer runs of text (van den Broek et al., 2011; Zwaan & Van Oostendorp, 1993). Second, we probed participants’ pronoun resolution decisions by asking them explicit comprehension questions. This provides participants with a crucial opportunity and motivation to reason deliberatively about the plausibility of the interpretation. It may be this question-induced reasoning that leads to the deployment of world knowledge, rather than the ambiguous pronoun itself (McKoon & Ratcliff, 2015). In an attempt to address some of these concerns while replicating the result, we ran a follow-up study with several modifications: i) we embedded critical sentences within longer passages in order to lower the salience of the ambiguous pronoun; ii) we presented comprehension questions on a separate page from the passage so participants could not re-read the passage after reading the question; and iii) we included two filler comprehension questions in order to lower the salience of the critical comprehension question. The pattern of results was the same as in the original experiment: world knowledge bias explained additional variance when controlling for linguistic and distributional information ($ {\chi}^2(1)=48.3 $, $ p<0.001 $; see Appendix A). However, this replication continued to provide participants with a crucial opportunity for strategic reasoning by asking a comprehension question about the critical pronoun.
We addressed this limitation in experiment 2 by using self-paced reading to detect participants’ spontaneous pronoun resolution decisions more indirectly.
3. Experiment 2
Theories of language comprehension distinguish between strategic and automatic inferences (Long & Lea, 2005; McKoon & Ratcliff, 1992). Automatic processes are fast, outside of conscious control, and insensitive to contextual factors. Strategic processes are slow, deliberate, and sensitive to the specific goals of the reader. Determining whether a process is automatic or strategic is crucial for understanding whether an observed effect is an invariant component of the language comprehension system or an artefact of specific task demands (McKoon & Ratcliff, 2015).
The results of experiment 1 could be driven by a process that automatically activates world knowledge and selects the most plausible interpretation of the pronoun: when answering comprehension questions, participants would simply recall the entity that they had encoded as the referent of the pronoun (Hobbs, 1979; Sanford & Garrod, 1998). Alternatively, the effect could result from specific features of the task that motivate strategic reasoning about physical plausibility. Specifically, participants might not perform any knowledge-driven inference during reading and only deploy world knowledge when they are presented with the comprehension question. Previous work has suggested that comprehenders do not always uniquely resolve pronouns (Greene et al., 1992), and may produce only a “Good Enough” interpretation of the text unless specific cues or task demands require them to process it more deeply (Ferreira & Patson, 2007). It could be that comprehenders would ordinarily forego expensive knowledge-driven pronoun resolution unless they are specifically incentivized to deploy this process strategically.
Fortunately, these interpretations make divergent predictions. If participants are automatically deploying world knowledge to resolve the pronoun, then the results of their inference should be available soon after reading the critical sentence and should influence how they interpret later sentences. For example, after reading (10), a comprehender might infer that it refers to the vase, and therefore that the vase is broken. This comprehender should have no difficulty in subsequently integrating the assertion in (11a), which is consistent with their current situation model. However, if (10) is instead followed by (11b), the comprehender will encounter a contradiction. The vase, which they had inferred was broken, is still intact.
Existing work demonstrates that comprehenders read more slowly when a text contradicts their current situation model (Albrecht & O’Brien, 1993; van Moort et al., 2018). Therefore, theories which claim that comprehenders automatically deploy world knowledge predict that participants will read continuations like (11b) more slowly than control sentences that contain no inconsistency. In contrast, if comprehenders deploy world knowledge only strategically during question answering, we should observe no such slowdown. We conducted a self-paced reading study to test whether participants were slower to read continuations which contradicted more plausible pronoun interpretations.
3.1. Methods
3.1.1. Participants
A total of 205 participants were recruited and compensated as described in experiment 1. A larger sample was used due to an increase in the number of experimental conditions from 2 to 8. We excluded 37 participants for indicating they were not native English speakers; 14 who were inaccurate on $ >50\% $ of attention check questions; 5 who took over an hour to complete the experiment (indicating inattention); 7 who indicated they did not have normal or corrected-to-normal vision; and 1 who indicated they were dyslexic, retaining 141 (88 female, 47 male, 6 non-binary; mean age = 21.6, $ SD=2.9 $). Mean completion time was 19.8 minutes ($ SD=7.2 $). From 4,230 trials we excluded 283, retaining 3,947. We excluded trials where passage reading times were <50 ms/syllable (91 trials) or >350 ms/syllable (97 trials), indicating inattention. We also excluded trials where the reading time for any recorded region was <100 ms (52 trials) or >5 s (38 trials).
3.1.2. Materials
Thirty stimulus passage templates were designed based on the stimuli from experiment 1. Each passage contained six sections (see Figure 3). The introduction section (2–4 sentences) mentioned each candidate exactly once in the same grammatical role. Half of the passages mentioned the more plausible candidate first. The setup sentence – a buffer between the introduction and critical sentence – did not refer explicitly to either candidate. The critical sentence was identical to its respective experiment 1 stimulus, and the critical spillover ensured that participants had time to make the pronoun resolution inference; it mentioned neither candidate nor any information that would influence interpretation of the pronoun. The continuation sentence described one of the candidates in a state that was inconsistent with it having been the antecedent of the critical pronoun, and the continuation spillover was used to record delayed reading slowdowns. One comprehension question was designed for each passage: a statement about the passage that was either true or false and was not relevant to the critical or continuation sentences. Half of the comprehension statements were false.
We created 8 versions of each passage template by factorially varying i) the order of the two NPs, ii) whether the continuation referred to NP1 or NP2, and iii) whether the critical sentence was ambiguous or unambiguous (see Table 3). As in experiment 1, we counterbalanced the order of the two NPs to ensure linguistic and world knowledge biases were not correlated. We varied whether the continuation referred to NP1 or NP2 in order to measure the effect of contradicting a more or less plausible interpretation of the pronoun. Finally, for each critical sentence, we generated a consistent unambiguous control sentence by replacing the ambiguous pronoun with an explicit reference to whichever NP was not mentioned in the continuation sentence. We did this to control for the possibility that an effect might be caused by the continuation sentence itself, rather than the inconsistency between the continuation and the interpretation of the pronoun. For instance, imagine that comprehenders read the continuation in row 4 of Table 3 (the vase was still intact) more slowly than the continuation in row 3 (the rock was still intact). This could either be because the continuation sentence in 4 contradicts the comprehender’s earlier inference that the vase is broken, or because the continuation causes slower reading per se. If the difference is caused by the continuation sentence itself, we should see an equivalent slowdown for row 8 vs row 7, where the critical sentence unambiguously states that the rock broke, ensuring there is no inconsistency. If instead the slowdown is caused by an inconsistency between the continuation sentence and the pronoun interpretation, any difference in reading time between rows 3 and 4 should not be explained by reading times for unambiguous versions.
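The factorial construction of item versions can be illustrated schematically (a hypothetical sketch; the condition labels are ours, not taken from the stimuli):

```python
from itertools import product

# Sketch of the 2 x 2 x 2 factorial design described above: each passage
# template yields eight versions, crossing NP order, the NP mentioned in
# the continuation, and the ambiguity of the critical sentence.
def passage_versions():
    factors = {
        "np_order":     ["plausible_first", "plausible_second"],
        "continuation": ["refers_to_NP1", "refers_to_NP2"],
        "critical":     ["ambiguous", "unambiguous"],
    }
    return [dict(zip(factors, combo)) for combo in product(*factors.values())]
```
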
Note: Versions varied across three dimensions: whether the reference in the critical sentence was ambiguous or unambiguous; the order of the two NPs in the critical sentence; and the NP to which the continuation referred.
Texts were divided into regions of 2–5 words for self-paced reading presentation. Breaking the text into smaller regions (rather than entire sentences) ensured that our measurement was sensitive to smaller or more temporary processing difficulties. Region boundaries and linebreaks were consistent across conditions. We recorded reading times for the region in the continuation that contradicted one interpretation of the pronoun (e.g. /was still intact/). We also recorded the 3 preceding regions to measure baseline reading pace, in order to control for trial-level idiosyncrasies. Finally, we recorded the 3 regions following the critical region in order to capture delayed effects, which are common in self-paced reading studies (Just et al., 1982). We number these regions 1–7, where region 4 is the critical region that contains the potentially contradictory information.
3.1.3. Procedure
The experiment was designed using jsPsych, based on a GitHub repository provided by the Utrecht Institute of Linguistics (Duijndam, 2020), and hosted online. Participants read 30 passages, broken up into regions. Participants fixated a cross at the location of the first region and pressed the space bar to reveal each region in turn. They were instructed to read each region at their normal reading speed. Following the moving-window paradigm, only one region was visible at any time (Just et al., 1982). All other regions were replaced with an underscore. After 1/3 of passages, participants were asked to indicate whether a statement about the passage was true or false. Participants completed two practice trials before beginning the main experiment. Participants were prevented from participating in the experiment if their screen size was less than 1,000 px × 650 px or if they were using a mobile device or tablet. The text was 25 px black Open Sans presented on a pale grey background (#f5f5f5). The order of NPs in the critical sentences and the NP to which the continuation referred were randomized within-participant. On average, each participant saw 7.5 items from each of the 4 combinations of these conditions. We varied whether the critical sentence was ambiguous between participants in order to prevent participants from recognizing that there were two different types of stimuli and comparing them directly.
3.1.4. LLM analysis
As in experiment 1, we used an LLM to control for the possibility that comprehenders could be using distributional information to resolve pronouns. For each token in each region, we elicited from GPT-3 the surprisal, $ -{\log}_2(p) $ , of the token conditioned on all preceding tokens in the passage (including all preceding tokens in the token’s own region). We then summed the surprisals of each token in the region to find the overall surprisal for the region. This measure attempts to capture the extent to which reading time can be explained by the predictability of a word sequence given the previous linguistic context.
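Concretely, the region-level surprisal measure amounts to the following computation (a minimal sketch; in practice the per-token log probabilities would come from the model API, so the values here are hypothetical):

```python
import math

def region_surprisal(token_logprobs):
    """Sum per-token surprisals, -log2(p), over a region, given each
    token's natural-log probability conditioned on all preceding tokens
    in the passage (including preceding tokens within the region)."""
    return sum(-lp / math.log(2) for lp in token_logprobs)
```
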
3.2. Statistical analysis
We hypothesised that we might see an effect in any of regions 4–7, so we tested each region separately and corrected for multiple comparisons. We constructed separate linear mixed-effects models to predict reading time for each region. All reported p-values are corrected for multiple comparisons using the Holm-Bonferroni method unless otherwise stated (Holm, 1979). In a base model, we predicted log reading time for each region using the following predictors: the NP mentioned in the continuation (NP1 or NP2); the linguistic bias toward the NP mentioned in the continuation; the mean log reading speed for the trial across regions 1–3 (preceding the regions of interest); and the mean reading time for that region in the unambiguous control version of each item. In the full model, we added world knowledge bias toward the NP mentioned in the continuation as a predictor. We attempted to fit a maximal random effects structure with random intercepts and slopes for each predictor by participant and random intercepts by item-version nested within item (Barr et al., 2013). The full model did not converge, so we iteratively removed random slopes until we found a random effects structure that converged for all regions (random slopes for world knowledge and linguistic bias by participant; random intercepts by participant and by item-version nested within item). We used likelihood ratio tests to compare the fit of models with and without world knowledge bias as a predictor.
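The Holm-Bonferroni correction applied to the four regional tests is the standard step-down procedure; a minimal sketch (our illustration, not the authors' analysis code):

```python
def holm_bonferroni(p_values):
    """Holm's step-down correction: sort p-values ascending, multiply
    the i-th smallest (0-indexed) by (m - i), enforce monotonicity of
    the adjusted values, and cap at 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted
```
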
In order to test whether distributional information could account for the effect of world knowledge, we re-performed the above analyses, including GPT-3 surprisal as a control predictor (in both the base and full models). For each model – predicting the reading time of a given target region – we included the surprisal of the target region itself as well as the 3 preceding regions (to account for delayed reading time slowdowns in response to surprising information). For example, the region 5 model contained as predictors the surprisal for regions 2, 3, 4, and 5. The formula for the full converging model was as follows:
3.3. Results
World knowledge bias significantly improved the fit of the model for region 5 – the region immediately following the critical region ( $ {\chi}^2(1)=9.94 $ , $ p=0.006 $ ) – but not for any other region (see Table 4 and Figure 4). To ensure that this result was not an artefact of our unambiguous control, we re-performed our analysis without the control predictor and again found a positive effect of world knowledge bias in region 5 ( $ {\chi}^2(1)=9.07 $ , $ p=0.01 $ ) and no effect in other regions. These results indicate that participants read continuations more slowly when they contradict the more physically plausible interpretation of the pronoun. This in turn suggests that comprehenders use world knowledge to resolve ambiguous pronouns automatically; when they encounter a continuation that contradicts their knowledge-driven pronoun interpretation, they interpret it as an inconsistency and their reading is disrupted.
Note: After correcting for multiple comparisons, a positive effect of world knowledge bias was detected in region 5: the region immediately following the potentially contradictory information in the critical sentence.
Bold typeface indicates p-values < 0.05.
Including GPT-3 surprisals in the base and full models did not change the pattern of results: there was a significant positive effect of world knowledge bias on reading time in region 5 ( $ {\chi}^2(1)=9.87 $ , $ p=0.007 $ ) and no significant effect in any other region. The full region 5 model showed no significant effects of GPT-3 surprisal for any of the recorded regions (see Table 5 and Figure 5). This suggests that the effect of world knowledge cannot be captured by distributional statistics insofar as they are learned by GPT-3.
Note: Degrees of freedom and p-values were calculated using the lmerTest package (v3.1.3) in R (Kuznetsova et al., Reference Kuznetsova, Brockhoff and Christensen2015).
Bold typeface indicates p-values < 0.05.
3.4. Discussion
The results from experiment 2 indicate that world knowledge is deployed during reading to resolve ambiguous pronouns. Log reading times for region 5 of the continuation (the region immediately following the potentially contradictory information) were positively correlated with the world knowledge bias toward the contradicted interpretation. For instance, if the critical sentence was When the rock fell on the vase, it broke, participants were slower to read a continuation that stated that the vase was still intact than one that stated that the rock was still intact. This suggests that participants had inferred from the critical sentence that the vase was broken, and so were delayed in processing when they encountered an apparently inconsistent statement.
A potential alternative explanation is that the continuation the vase was still intact is simply more surprising per se than the continuation the rock was still intact. We controlled for this alternative explanation using the unambiguous consistent control, where the critical sentence explicitly referred to one NP in place of the pronoun (e.g. When the rock fell on the vase, the rock broke). If the continuation, the vase was still intact, was itself causing the slowdown in reading, we should expect to see this effect in the unambiguous control, which we do not (see Figure 4). Moreover, the world knowledge bias toward the continuation NP should not explain any additional variance on top of the control predictor, Unambiguous log(rt), which it does (see Table 5). In short, the world knowledge bias effect only occurs in the ambiguous condition, indicating that it is the result of contradicting an earlier pronoun interpretation, not of reading the continuation sentence itself.
As with experiment 1, we used an LLM as a distributional baseline to control for the possibility that participants were using information about the distribution of words in language rather than non-linguistic world knowledge to resolve ambiguous pronouns. The surprisal for each region (elicited from GPT-3) appeared to show some sensitivity to world knowledge consistency (see Figure 5). However, when we included surprisal for each region and the three preceding regions in a baseline model, we continued to find an effect of world knowledge bias on reading time in region 5 (see Table 5). This suggests that although distributional information might capture some of the physical world knowledge that humans deploy to resolve ambiguous pronouns, it is not sufficient to capture all of this variance.
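The per-region surprisal predictors described above are simply the sum of token-level surprisals (negative log-probabilities) within each region. A minimal sketch, assuming token log-probabilities are already available from a language model (the values and region boundaries below are hypothetical, not GPT-3 output):

```python
def region_surprisal(token_logprobs, region_slices):
    """Sum per-token surprisals (negative log-probabilities) within each
    region, given model log-probabilities for every token in a sentence."""
    return [sum(-lp for lp in token_logprobs[s]) for s in region_slices]

# Hypothetical log-probabilities for six tokens, split into two regions:
logprobs = [-1.0, -2.0, -0.5, -3.0, -0.25, -1.25]
regions = [slice(0, 3), slice(3, 6)]
print(region_surprisal(logprobs, regions))  # [3.5, 4.5]
```

Summing over a region means a single highly unexpected token can dominate the region's predictor, which is the intended behaviour for surprisal-based controls.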
Unlike experiment 1, the results of experiment 2 cannot be explained as products of a strategic reasoning process prompted by explicit comprehension questions. The results therefore indicate that participants spontaneously inferred during reading that the pronoun referred to the more physically plausible NP and hence that that NP was in some state (e.g. the vase was broken). These results confirm the predictions of the theory that comprehenders spontaneously deploy world knowledge during language comprehension to resolve ambiguous pronouns (Garnham, Reference Garnham2001; Garrod et al., Reference Garrod, Freudenthal and Boyle1994; Hobbs, Reference Hobbs1979). The results are inconsistent with accounts that argue that such world knowledge is only deployed strategically in response to specific motivations such as comprehension questions (McKoon & Ratcliff, Reference McKoon, Ratcliff, Cook, O’Brien, Lorch and Robert2015).
4. General discussion
Together, results from these experiments provide evidence that non-linguistic world knowledge is routinely deployed to resolve referential ambiguity. Independent norms for the physical plausibility of events – established by asking a separate group of participants explicit hypothetical reasoning questions – were found to predict the majority of variance in pronoun resolution decisions (experiment 1). The effect of world knowledge bias persisted when controlling for the linguistic factors which influence pronoun resolution (again using an independent norming study), and the distributional association between candidate antecedents and the critical sentence (using an LLM, GPT-3). Finally, the world knowledge bias norms also predicted reading times for a passage continuation that was inconsistent with one interpretation of the pronoun. Specifically, the more physically plausible the contradicted interpretation was, the slower participants were to read the continuation. This last result suggests that the product of the knowledge-driven pronoun resolution inference is available to comprehenders during reading, and therefore that world knowledge is being deployed routinely and automatically.
These studies differ from previous work in important ways that alter the conclusions that can be drawn. First, they differ from results suggesting that world knowledge violations can cause processing difficulty (Hagoort et al., Reference Hagoort, Hald, Bastiaansen and Petersson2004; van Moort et al., Reference van Moort, Koornneef and van den Broek2018). While these studies show that world knowledge is active and available during comprehension, they do not imply that this knowledge influences the comprehender’s interpretation of the sentence (Ferreira & Yang, Reference Ferreira and Yang2019). Importantly, in our second experiment, the reading slowdown is not caused by a world knowledge violation directly: rather it is in response to an inconsistency between the continuation sentence and the prior interpretation of the pronoun. This implies that world knowledge has already been used spontaneously to alter interpretation (i.e. resolve the pronoun) before any apparent violation was discovered. Second, in contrast to previous studies suggesting that world knowledge could influence pronoun resolution (Bender, Reference Bender2015; Gordon & Scearce, Reference Gordon and Scearce1995; Marslen-Wilson et al., Reference Marslen-Wilson, Tyler and Koster1993), the results here cannot easily be explained by known linguistic confounds or distributional likelihood as these factors were measured and controlled for. This suggests that at least part of the knowledge used in pronoun resolution is not available in language alone, and must come from alternative resources such as embodiment or reasoning processes.
The results have implications for diverse aspects of language research, including theories of pronoun resolution, discourse comprehension, and natural language processing. First, the results imply a need to augment contemporary models of pronoun interpretation to incorporate a more explicit role for world knowledge. The linguistic features proposed by many theories – including grammatical role (Chambers & Smyth, Reference Chambers and Smyth1998; Crawley et al., Reference Crawley, Stevenson and Kleinman1990; Grosz et al., Reference Grosz, Joshi and Weinstein1995) and lexical semantics (Hartshorne, Reference Hartshorne2014) – were held constant across conditions in our stimuli. Under these conditions, world knowledge was found to have a strong and independent effect on interpretation. In order to accurately predict how comprehenders will resolve a given pronoun, and to provide a psycholinguistic mechanism for how an interpretation is reached, models must explicitly articulate how world knowledge influences comprehension above and beyond linguistic factors. Centering theory, for instance (Grosz et al., Reference Grosz, Joshi and Weinstein1995), acknowledges world knowledge as a potential exception; however, these results suggest that it could be a crucial and constitutive part of pronoun interpretation. Similarly, Hartshorne (Reference Hartshorne2014) argues that world knowledge is less plausible as a mechanism for implicit causality effects because its influence is relatively rare and peripheral in pronoun interpretation. In contrast, these results suggest that comprehenders spontaneously use plausibility to resolve ambiguous pronouns, and hence support the idea that this process could also underlie implicit causality effects. Finally, Kehler and Rohde (Reference Kehler and Rohde2018) develop a Bayesian model of pronoun interpretation based on weighting structural cues against pragmatic expectations about which referent is likely to be mentioned next.
While this model neatly synthesises diverse observations about pronoun interpretation, the present results suggest a specific way in which it could be augmented: to account for the plausibility of a given interpretation, which may not be clear until after the pronoun is encountered.
Evidence for knowledge-driven pronoun resolution also has implications for discourse processing more generally. The results contrast with predictions of Minimalist accounts of language comprehension, which propose that knowledge-driven inferences are only deployed where knowledge is highly available or there is a break in local coherence (McKoon & Ratcliff, Reference McKoon, Ratcliff, Cook, O’Brien, Lorch and Robert2015). The knowledge needed to make the inferences in the present experiments was not highly available – relevant object properties were not mentioned or otherwise made salient in the text. Moreover, comprehenders would have no way of identifying a break in local coherence unless they had already activated relevant world knowledge. Minimalist accounts therefore do not predict the routine deployment and influence of world knowledge seen in these experiments. Moreover, even models that allow for world knowledge influence, such as Kintsch and Van Dijk’s (Reference Kintsch and Van Dijk1978) text comprehension model, relegate its effect to elaborating on a core interpretation that is produced before world knowledge is activated. Instead, the results presented here support a constitutive role for world knowledge in language comprehension. World knowledge is activated and incorporated routinely, and can influence the core propositional parsing of the sentence (Garnham, Reference Garnham2001; Graesser et al., Reference Graesser, Singer and Trabasso1994; Hobbs, Reference Hobbs1979).
The spontaneous influence of world knowledge raises questions about the mechanism by which it occurs. How are comprehenders able to rapidly integrate arbitrary knowledge and assess the plausibility of different interpretations before a parse for a sentence has been selected? Two more general discourse processes, validation and expectation, provide promising candidate mechanisms. On validation accounts, comprehenders check tentative interpretations of text against their world knowledge, and reject or revise interpretations that are found to be invalid (Isberner & Richter, Reference Isberner and Richter2013; O’Brien & Cook, Reference O’Brien and Cook2016). On expectation-driven accounts, comprehenders use world knowledge to generate predictions about how events will unfold, and use these predictions to guide comprehension (Sanford & Garrod, Reference Sanford and Garrod1998; Venhuizen et al., Reference Venhuizen, Crocker and Brouwer2019). While these accounts are both consistent with the present data, they are fundamentally different mechanisms and further work is needed to adjudicate between them. One approach is to vary the strength of world knowledge bias. The validation account predicts that linguistic biases will govern resolution decisions so long as the structurally preferred candidate is not so implausible as to be rejected. Alternatively, the expectation account predicts that world knowledge will be routinely used to direct interpretation, so that even small world knowledge biases will influence pronoun resolution decisions. Future work along these lines is needed to identify the mechanisms that support world knowledge influence in pronoun disambiguation.
The results also have theoretical and practical implications for distributional theories of language understanding. It is notable that GPT-3 predictions correlated with both world knowledge norms and pronoun resolution decisions. This suggests that the LLM has implicitly encoded some of the world knowledge information that comprehenders use to resolve ambiguous pronouns. However, the influence of world knowledge on pronoun resolution was not fully accounted for by distributional likelihood. While GPT-3 predictions explained around 12% of the variance in human responses in experiment 1, world knowledge explained around 55%, suggesting that a large portion of the influence of world knowledge is not captured by LLMs. Moreover, GPT-3 likelihood was not predictive of reading times at all in experiment 2. These results address an important confound in previous research: the possibility that apparent world knowledge effects were being driven by distributional word knowledge. More generally, the results imply that in order to understand language, human comprehenders make use of information that is not available in the linguistic signal, perhaps because perceptually obvious features are unlikely to be explicitly reported (Shwartz & Choi, Reference Shwartz, Choi, Scott, Bel and Zong2020). This in turn implies an up-front limit on the ability of language-only models to emulate human understanding. 
In order to understand language in a humanlike way, models may need to be augmented with multimodal data (Zellers et al., Reference Zellers, Lu, Hessel, Yu, Park, Cao, Farhadi and Choi2021b), simulated environments (Bisk et al., Reference Bisk, Holtzman, Thomason, Andreas, Bengio, Chai, Lapata, Lazaridou, May, Nisnevich, Pinto and Turian2020; Liu et al., Reference Liu, Wei, Gu, Wu, Vosoughi, Cui, Zhou and Dai2022; Zellers et al., Reference Zellers, Holtzman, Peters, Mottaghi, Kembhavi, Farhadi and Choi2021a), or human norm data (Lynott et al., Reference Lynott, Connell, Brysbaert, Brand and Carney2019).
The method outlined here – using LLMs as a distributional baseline – can be applied to other linguistic phenomena to understand the extent to which distributional information could account for other aspects of language understanding. Existing work in this vein suggests that distributional information can explain a large proportion of variance in brain activity (Schrimpf et al., Reference Schrimpf, Blank, Tuckute, Kauf, Hosseini, Kanwisher, Tenenbaum and Fedorenko2021), including in response to highly contextual phenomena (Michaelov et al., Reference Michaelov, Coulson and Bergen2023). Other studies suggest that models can only partially account for certain behavioural phenomena, including the influence of sense boundaries on similarity judgements (Trott & Bergen, Reference Trott and Bergen2023), affordances on sensibility ratings (Jones et al., Reference Jones, Chang, Coulson, Michaelov, Trott and Bergen2022), and a character’s knowledge state in the False Belief Task (Trott et al., Reference Trott, Jones, Chang, Michaelov and Bergen2023). Several hybrid theories of semantic grounding argue that comprehenders use a combination of embodied and distributional knowledge to understand language (Barsalou et al., Reference Barsalou, Santos, Simmons, Wilson, de Vega, Glenberg and Graesser2008; Dove, Reference Dove2011; Louwerse, Reference Louwerse2018). The distributional baseline method and the norming studies used here allow us to quantify the extent to which different sources of information can account for specific phenomena. This could allow us to articulate more perspicuous hybrid theories and test claims about the independence or redundancy of embodied and distributional information.
One potential limitation of this finding is that more capable language models may be better at identifying complex statistical relationships that underlie world knowledge. Existing research suggests that as the size and training data of language models increase, so does their performance. A future, truly massive language model may be able to capture all of the variance in responses which here is explained by world knowledge. However, current language models are already psychologically implausible as models of human cognition. Children are estimated to be exposed to around 3–11 million words per year, for a total of 30–110 million words by the time they reach adult-like linguistic competence at age 10 (Hart & Risley, Reference Hart and Risley1992; Hosseini et al., Reference Hosseini, Schrimpf, Zhang, Bowman, Zaslavsky and Fedorenko2022). By contrast, GPT-3 – the model used in our analysis – has been exposed to more than 200 billion words: ~ 2000 times that of a 10-year-old (Warstadt & Bowman, Reference Warstadt, Bowman, Lappin and Bernardy2022). While larger and better-trained models may be able to tell us more about what is learnable in principle from distributional information, evidence that this is a possible mechanism for human language comprehension will need to come from more developmentally plausible models.
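The scale comparison above can be made explicit with a back-of-the-envelope calculation using the cited estimates:

```python
# Estimated linguistic input by age 10, from 3-11 million words/year:
child_total_low = 3e6 * 10     # 30 million words
child_total_high = 11e6 * 10   # 110 million words
gpt3_words = 200e9             # > 200 billion words of training exposure

# GPT-3's exposure relative to a 10-year-old's:
ratio_high = gpt3_words / child_total_high  # ~1800x at the high estimate
ratio_low = gpt3_words / child_total_low    # ~6700x at the low estimate
print(round(ratio_high), round(ratio_low))
```

The "~2000 times" figure in the text corresponds to the high (110 million word) estimate of a child's input; against the low estimate the gap is several times larger still.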
Finally, the results suggest that non-linguistic information and reasoning abilities exert influence on a core language comprehension process: reference assignment. Comprehenders were able to use a wide variety of physical knowledge to compare the plausibility of events while resolving pronouns. What resources underlie the rapid deployment of this physical knowledge during language comprehension? Battaglia et al. (Reference Battaglia, Hamrick and Tenenbaum2013) propose that humans are equipped with an intuitive physics engine (IPE), which they can use to simulate hypothetical situations and predict their outcomes. Previous research has tested this claim on non-linguistic stimuli, but future work should examine whether the IPE can also explain physical inferences during language comprehension. Similarly, Barsalou (Reference Barsalou1999) proposes that language comprehension involves relating linguistic information to multimodal perceptual symbols grounded in sensorimotor experience. Activation of embodied perceptual symbols provides an intuitively plausible hypothesis about how world knowledge can be leveraged so efficiently to influence language interpretation (Zwaan, Reference Zwaan2016). However, more work is needed to test whether sensorimotor processes are causally involved in comprehension more generally (Ostarek & Bottini, Reference Ostarek and Bottini2021), and in knowledge-driven inference specifically.
Understanding language necessarily involves connecting words to the world around us. However, there has been much debate about whether world knowledge can influence our interpretation of what is said. These results support a tightly integrated model in which comprehenders spontaneously retrieve relevant world knowledge and assess different possible interpretations in order to select the most plausible. However, the results also raise many more questions for future research. Is world knowledge always deployed or is it only activated by some internal or external trigger? Will world knowledge always determine the interpretation of ambiguities or can other factors overwhelm its influence? Finally, do comprehenders make knowledge-driven inferences by performing formal operations on proposition-like statements, or by simulating the sensorimotor implications of different interpretations? Answering these questions will help to illustrate the mechanisms by which we make meaning from words.
Data availability statement
The materials, data, and analysis code that support the findings of this study are openly available on Open Science Framework at https://osf.io/v8rjm/.
Acknowledgements
The authors would like to thank Andy Kehler, Noortje Venhuizen, and two anonymous reviewers for thoughtful feedback on earlier versions of this paper.
Competing interest
The authors declare none.
A. Experiment 1B
A.1. Method
A.1.1. Participants
Forty-three participants were recruited and compensated in the same manner as described for experiment 1. We excluded 6 participants for indicating they were not native English speakers and 9 who were inaccurate in $ \ge 20\% $ of filler questions (indicating inattention), leaving 28 (22 female, 6 male; mean age = 22.5, $ \sigma =3.7 $ ). Mean completion time was 22.6 minutes ( $ \sigma =8.0 $ ). We excluded 61 trials where passage reading times were <50 ms/syllable or >350 ms/syllable, and 14 trials where the question response time was <500 ms or >10 s. We retained 775 trials in total.
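The trial-level exclusion criteria can be expressed as a simple filter (an illustrative Python sketch using the thresholds stated above; the trial values in the usage lines are hypothetical):

```python
def keep_trial(reading_ms_per_syllable, response_ms):
    """Retain a trial only if passage reading speed and question response
    time both fall within the stated bounds (exclude <50 or >350
    ms/syllable reading, and <500 ms or >10 s responses)."""
    plausible_reading = 50 <= reading_ms_per_syllable <= 350
    plausible_response = 500 <= response_ms <= 10_000
    return plausible_reading and plausible_response

print(keep_trial(180, 2200))  # True: within both ranges
print(keep_trial(40, 2200))   # False: reading faster than 50 ms/syllable
```

Filtering on per-syllable reading speed rather than raw passage time normalises for passage length, so long and short passages are held to the same standard.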
A.1.2. Materials
Thirty stimulus passages were designed based on the stimuli from experiment 1. Each passage contained four sections (introduction, setup, critical, and continuation). The introduction section (2–4 sentences) introduced the two candidates and provided an appropriate context for the event. Each candidate was mentioned exactly once in the same grammatical role. Passages were balanced with respect to whether the candidate that was favoured by physics bias was mentioned first or second in the introduction. The setup section contained one sentence that acted as a buffer between introducing the candidates and the critical sentence in order to minimise any structural effects of the order in which candidates are mentioned before the critical sentence. The setup did not refer explicitly to either candidate, but could refer to the candidates together using a generic term (such as the objects). The critical sentence was identical to its respective experiment 1 stimulus. The order of candidates was randomly varied among participants as in experiment 1. The continuation section (1–3 sentences) did not mention either of the candidates and was designed not to contain any information that might be more consistent with one interpretation of the ambiguous pronoun than the other. Two filler comprehension questions were designed for each passage. These probed the participants’ understanding of aspects of the passage that were unrelated to the critical sentence.
A.1.3. Procedure
The experiment was designed using jsPsych and hosted online. Participants were instructed to read short passages and then answer comprehension questions about them. Each passage was presented in its entirety. Participants pressed a button when they had finished reading the passage to advance to the comprehension questions. The comprehension questions appeared one at a time. Participants indicated their chosen response using a button press, and the next question appeared immediately. After participants had completed all three comprehension questions, the next passage was presented. The order of the comprehension questions was randomized (to minimize the salience of the critical question).
A.2. Results
We constructed logistic mixed-effects models to predict responses to the critical comprehension questions using the biases elicited in the experiment 1 norming studies. All models had random slopes by participant for the effects of physics and structural bias, and random intercepts by participant and by item-version nested within item.
Including a fixed effect of linguistic bias did not improve model fit over a null model with an intercept and random effects ( $ {\chi}^2(1)=0.230 $ , $ p=0.631 $ ). Adding world knowledge bias as a predictor significantly improved model fit over the linguistic bias model ( $ {\chi}^2(1)=51.6 $ , $ p<0.001 $ ). GPT-3 predictions also improved the fit of a model with linguistic bias only ( $ {\chi}^2(1)=21.5 $ , $ p<0.001 $ ). World knowledge bias significantly improved the fit of a model with both GPT-3 predictions and linguistic bias ( $ {\chi}^2(1)=33.3 $ , $ p<0.001 $ ). These results replicate the effect of world knowledge bias on responses that was observed in experiment 1.
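The Likelihood Ratio Test comparisons reported here follow the standard form for nested models. A minimal sketch for a single added parameter (df = 1), using the fact that a chi-square variable with one degree of freedom is the square of a standard normal; the log-likelihoods below are hypothetical, not the fitted values from our models:

```python
import math

def lr_test_df1(loglik_reduced, loglik_full):
    """Likelihood ratio test for one added parameter: the statistic
    2 * (llf - llr) is referred to a chi-square distribution with 1 df,
    whose survival function is erfc(sqrt(x / 2))."""
    chi2 = 2.0 * (loglik_full - loglik_reduced)
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Hypothetical log-likelihoods for a reduced and a full nested model:
chi2, p = lr_test_df1(-512.3, -486.5)
print(chi2, p)
```

In practice these comparisons are produced by `anova()` on fitted `lmer`/`glmer` models in R; the sketch only makes explicit what the reported $ {\chi}^2 $ and p-values measure.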