Dissociable frequency effects attenuate as large language model surprisal predictors improve

A long-standing core research agenda of psycholinguistics is to provide an account of the cognitive mechanism underlying human sentence processing. In other words, what are the processes that enable the rapid and efficient comprehension of linguistic input? How is this achieved in real time with limited cognitive resources? Decades of psycholinguistics research have highlighted the role of prediction in sentence processing (for overviews, see Kuperberg and Jaeger, 2016, Staub, 2015), which has received support from a large body of experimental and computational studies showing how predictive processing influences real-time reading behavior. However, there are still largely open questions about the characteristics and scope of prediction in human sentence processing, such as whether the increased processing difficulty observed at low-frequency words is also a by-product of predictive processing or driven by a separate lexical access mechanism.

Previous experiments that factorially manipulate frequency and predictability have reported independent effects of the two factors on reading times (Kretzschmar et al., 2015, Staub and Benatar, 2013), which have provided support for the latter view. However, stimuli in psycholinguistic experiments are usually short and presented in isolation without much surrounding context, which raises concerns about their ecological validity (Hasson & Honey, 2012). This can be addressed through naturalistic experiments, where subjects are instructed to read naturalistic stimuli (e.g. short stories, newspaper articles). The resulting data is then analyzed by first operationalizing constructs of interest like frequency and predictability, and then evaluating how much separate variance they explain. Often, predictability is operationalized by surprisal (i.e. negative log probabilities) from Transformer-based large language models (LLMs), which are artificial neural networks that are trained to predict upcoming words in a corpus. While surprisal from such LLMs has been shown to be predictive of measures of processing difficulty such as self-paced reading times and eye fixation durations (Shain, Meister, Pimentel, Cotterell, & Levy, 2024), the precise hypotheses about predictive processing that they represent remain unclear due to the opaque nature of their computations.

However, it has been shown that factors like the number of model parameters and the amount of training data have a reliable effect on the fit of LLM surprisal to measures of processing difficulty. More specifically, LLMs that have more parameters and are trained on more data generally yield surprisal estimates that are poorer predictors of the processing difficulty that manifests in naturalistic reading times (Oh and Schuler, 2023a, Oh and Schuler, 2023b). This appears to be driven by the LLMs’ capability to predict rare words accurately, which is readily learned with more parameters and larger amounts of training data (Oh, Yue, & Schuler, 2024). In addition to providing an explanation for the discrepancy between LLMs and human-like predictive processing, this finding has crucial methodological implications for studying whether frequency effects are separable from predictability effects in naturalistic reading (e.g. Goodkind and Bicknell, 2021, Shain, 2019, Shain, 2024). That is, given this strong relationship between word frequency and LLM surprisal, using surprisal from larger models trained on more data is likely to result in an underestimate of predictability effects and an overestimate of frequency effects, as the large number of parameters and amount of training data effectively serve to wash out difficulty associated with infrequent words that could otherwise be explained by predictability.

The present article demonstrates this point by conducting regression analyses on multiple reading time datasets that span different languages, modalities, and genres, using LLMs that vary in model size and training data amount. The results reveal a robust positive effect of both model size and training data amount of the LLM on the ability of word frequency to predict human reading times, which indicates that frequency compensates for surprisal to a greater degree as bigger LLMs trained on more data are used to calculate surprisal. Subsequent follow-up analyses examine how the increase in model size helps the prediction of low-frequency tokens. To this end, low-frequency tokens are first categorized according to factors informed by architectural properties of Transformers and contemporary language modeling practices, and the learning trajectories of each category are subsequently analyzed. The results show that the influence of model size is strongest on tokens that are not part of a bigram sequence observed earlier in the context, and which therefore cannot be predicted by simply copying. This suggests that limitations in model size may cause a bottleneck for learning specific associations during training, which results in less accurate predictions of the correct token, improved fit to human reading times, and a correspondingly lower contribution due to frequency.
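To make the copying criterion concrete, the following is a minimal sketch of one way to flag tokens whose (previous token, current token) bigram has already occurred earlier in the same context. The function name `seen_bigram_flags` and the whitespace tokenization are illustrative assumptions and are not taken from the analyses reported in this article.

```python
def seen_bigram_flags(tokens):
    """Return one boolean per token: True if the bigram (previous, current)
    already occurred earlier in the sequence, i.e. the token could in
    principle be predicted by copying from the context."""
    seen = set()
    flags = []
    for i, tok in enumerate(tokens):
        if i == 0:
            # The first token has no preceding token, so no bigram exists.
            flags.append(False)
            continue
        bigram = (tokens[i - 1], tok)
        flags.append(bigram in seen)
        seen.add(bigram)
    return flags


# Hypothetical example context, tokenized on whitespace for simplicity.
tokens = "the cat saw the cat and the dog".split()
flags = seen_bigram_flags(tokens)
```

In this toy context, only the second occurrence of "cat" is flagged, because "the cat" appeared earlier; "dog" also follows "the" but the bigram "the dog" is new, so it would fall into the category that cannot be predicted by copying.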

To contextualize the article, the remainder of this section defines Transformer-based LLMs, draws a theoretical connection to human sentence processing, provides a review of empirical work on the dissociability of frequency effects and predictability effects, and introduces the framework of continuous-time deconvolutional regressive neural networks for modeling reading times.

LLMs are a class of language models that are trained on the in-context word prediction objective. These models are typically based on the Transformer neural network architecture (Vaswani et al., 2017), which does not maintain a vector representation of the context that is updated at each timestep (cf. recurrent neural networks; Elman, 1991, Hochreiter and Schmidhuber, 1997) but newly calculates a contextualized representation at each timestep through its self-attention mechanism. More specifically, autoregressive language models (e.g. Brown et al., 2020, Radford et al., 2019) are trained to predict the ‘next’ word given the sequence of previous words, and are therefore closer to the traditional definition of language models. These models are typically trained on large amounts of Internet text, although the exact details about their training data are usually not disclosed.

More recent approaches also employ reinforcement learning and use predicted human preferences for the generated response as a reward to fine-tune language models into general-purpose dialogue agents (e.g. reinforcement learning from human feedback; OpenAI, 2023, Ouyang et al., 2022). However, such methods entail a domain shift in the models' probability distributions and thereby weaken the interpretation of LLMs as models of next-word prediction trained on large-scale corpora. Therefore, throughout this article, LLMs refer specifically to autoregressive language models that have not been adapted to specific tasks.

There are two conceptually similar senses in which LLMs are relevant for studying human sentence processing. The first is as a computational model based on surprisal theory (Hale, 2001, Levy, 2008), which posits that the processing difficulty of a word in context is proportional to its surprisal (Shannon, 1948), or negative log probability. Surprisal theory views prediction as ongoing probabilistic inference over possible structure- or message-level analyses given the context, which are continuously updated upon observing the bottom-up input (e.g. reading the next word). Assuming that the human comprehender maintains multiple probabilistic analyses in parallel, surprisal of the observed word is equivalent to the Kullback–Leibler divergence between the probability distribution over analyses before observing the word and after observing it (Levy, 2008). Therefore, surprisal has the interpretation of the amount of ‘cognitive effort’ taken to readjust the analyses after observing a word. As such, early surprisal-based processing models explicitly modeled this process of probabilistic inference, mostly in the form of maintaining and updating partial syntactic structures generated by probabilistic incremental parsers. Examples of incremental parsers that have been applied as models of sentence processing include Earley parsers (Hale, 2001), top-down parsers (Roark, Bachrach, Cardenas, & Pallier, 2009), Recurrent Neural Network Grammars (Dyer et al., 2016, Hale et al., 2018), and left-corner parsers (Jin and Schuler, 2020, van Schijndel et al., 2013). Naturally, these models were employed to study the role of syntactic expectation in human sentence processing.
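The equivalence between surprisal and Kullback–Leibler divergence mentioned above can be reconstructed in a few lines. The sketch below follows the standard argument and uses notation that is assumed here rather than given in the text: $T$ ranges over analyses, $w_{1..i}$ denotes the first $i$ words, and each analysis $T$ is assumed to determine its word string, so that $P(w_i \mid T, w_{1..i-1})$ is either 0 or 1.

```latex
% Assumed notation: T = structure- or message-level analysis,
% w_{1..i} = words observed up to and including position i.
\begin{align*}
D_{\mathrm{KL}}\big(P(T \mid w_{1..i}) \,\|\, P(T \mid w_{1..i-1})\big)
  &= \sum_{T} P(T \mid w_{1..i})
     \log \frac{P(T \mid w_{1..i})}{P(T \mid w_{1..i-1})} \\
  &= \sum_{T} P(T \mid w_{1..i})
     \log \frac{P(w_i \mid T, w_{1..i-1})}{P(w_i \mid w_{1..i-1})}
     && \text{(Bayes' rule)} \\
  &= \sum_{T} P(T \mid w_{1..i})
     \log \frac{1}{P(w_i \mid w_{1..i-1})}
     && \text{(analyses with posterior mass are consistent with $w_i$)} \\
  &= -\log P(w_i \mid w_{1..i-1})
\end{align*}
```

The last step holds because the remaining logarithm does not depend on $T$ and the posterior over analyses sums to one, which leaves exactly the surprisal of $w_i$.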

Non-structural ‘sequential’ language models, which do not explicitly maintain multiple structure- or message-level representations of the partial sentence, have also been evaluated as expectation-based models of human sentence processing, as they directly define and estimate a conditional probability distribution necessary for surprisal calculation. As neural networks were increasingly being trained as language models, surprisal estimates from both n-gram language models and those based on neural network architectures such as Simple Recurrent Networks (Elman, 1991), Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997), Gated Recurrent Unit networks (Cho et al., 2014), and Transformers (Vaswani et al., 2017) have been evaluated against behavioral measures of processing difficulty (Aurnhammer and Frank, 2019, Fossum and Levy, 2012, Goodkind and Bicknell, 2018, Hao et al., 2020, Merkx and Frank, 2021, Smith and Levy, 2013, Wilcox et al., 2020). In addition to high-level neural network architectures, this line of research also studies the impact of factors that influence language model probabilities on the fit of surprisal, such as the amount of input context or various decoding strategies (Kuribayashi et al., 2022, Liu et al., 2024).

The other closely related sense in which LLMs are relevant for human sentence processing is quantifying predictability in psycholinguistic experiments, or how predictable the word is given its context. Probabilities derived from data collected through the cloze task (Taylor, 1953) have traditionally been used to demonstrate the effect of predictability on real-time processing in early psycholinguistic research (Ehrlich and Rayner, 1981, Kutas and Hillyard, 1980, Kutas and Hillyard, 1984). However, it is prohibitively expensive to collect data using the cloze task on various stimuli of interest, as large samples are required for reliable estimates of predictability. Additionally, even with large samples, it is often difficult to make fine-grained distinctions at the low end of the predictability scale, as many words are unobserved as responses to the cloze task. In order to mitigate these issues with the cloze task, more recent research has begun to rely on predictability estimates that are approximated using corpus statistics. These estimates include conditional probabilities from bigram (Boston et al., 2008, McDonald and Shillcock, 2003), trigram (Smith & Levy, 2013), 5-gram (Shain, 2019), and Simple Recurrent Network language models (Hofmann, Remus, Biemann, Radach, & Kuchinke, 2022). From this perspective, Transformer-based LLMs are appealing as more accurate approximations of corpus statistics (cf. computational models of predictive processing that relate linguistic input to intermediate representations) in that they are trained to estimate probabilities based on very large amounts of text and can condition on a large number of preceding words, unlike earlier n-gram models. As such, conditional probabilities from LLMs have recently been used to examine the shape of the linking function between predictability and reading times (Hoover et al., 2023, Shain et al., 2024) and whether frequency effects dissociate from predictability effects in naturalistic reading (Shain, 2024).
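As an illustration of how predictability can be approximated from corpus statistics, the following minimal sketch estimates bigram surprisal from a toy corpus. The add-one smoothing, the tiny corpus, and the function names are assumptions made for the example; the cited studies use much larger corpora and their own smoothing schemes.

```python
import math
from collections import Counter


def train_bigram(corpus_tokens):
    """Collect the counts needed for an add-one smoothed bigram model."""
    vocab = set(corpus_tokens)
    # Count each token only where it serves as a context (all but the last).
    context_counts = Counter(corpus_tokens[:-1])
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return vocab, context_counts, bigram_counts


def bigram_surprisal(prev, word, vocab, context_counts, bigram_counts):
    """Surprisal in bits: -log2 P(word | prev) under add-one smoothing."""
    prob = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
    return -math.log2(prob)


# Hypothetical toy corpus, tokenized on whitespace.
corpus = "the dog barked . the dog slept . the cat slept .".split()
vocab, context_counts, bigram_counts = train_bigram(corpus)

s_dog = bigram_surprisal("the", "dog", vocab, context_counts, bigram_counts)
s_cat = bigram_surprisal("the", "cat", vocab, context_counts, bigram_counts)
```

Here "dog" follows "the" more often than "cat" does, so it receives the lower surprisal. Unlike such a model, an LLM conditions on the full preceding context rather than on a single word.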

It is a well-established finding in experimental psycholinguistics that less frequent words take longer to read (Juhasz and Rayner, 2006, Just and Carpenter, 1980, Rayner and Duffy, 1986). However, different theoretical views about sentence processing have posited different explanations for this effect. A procedural view of sentence processing, which emphasizes the role of retrieval, integration, and construction of meaning (Gibson, 2000, Lewis and Vasishth, 2005), argues that this frequency effect is due to differential encoding strength in the mental lexicon, where more frequent words have stronger representations that are easier to retrieve (Coltheart et al., 2001, Just and Carpenter, 1980, Reichle et al., 1998). This view also construes these processes as distinct from prediction, and therefore predicts dissociable frequency effects from predictability effects.

In contrast, an inferential view (e.g. surprisal theory; Hale, 2001, Levy, 2008) emphasizes the probabilistic inference over possible structure- or message-level analyses given the partial sentence, and posits that the contextual predictability of a word determines its processing difficulty. According to this view, frequency effects should be subsumed by predictability effects, because more frequent words are also more predictable than less frequent words in unconstrained contexts (i.e. they have higher prior probabilities).

Modeling studies that aim to answer this question using naturalistic reading times have yielded mixed results. For example, Goodkind and Bicknell (2021) analyzed fixation durations in the Dundee dataset (Kennedy, Hill, & Pynte, 2003) and found dissociable frequency effects from predictability effects that were operationalized by n-gram language models. This is in contrast to the earlier work by Shain (2019), who found that word frequency did not improve fit to unseen data on the Dundee, Natural Stories (Futrell et al., 2021), and UCL (Frank, Monsalve, Thompson, & Vigliocco, 2013) datasets over a baseline including 5-gram surprisal. In light of these conflicting findings, Shain (2024) revisited this question at scale with more reading time datasets and a more flexible modeling approach that relaxes assumptions that are inappropriate for modeling naturalistic reading. Using surprisal estimates from the GPT-2 language model (Radford et al., 2019) to operationalize predictability, Shain (2024) found a dissociable frequency effect on most datasets, which is consistent with the predictions of the procedural view.

While Shain (2024) acknowledges that model size has an impact on the quality of surprisal estimates as predictors of reading times (based on results in e.g. Oh et al., 2022, Shain et al., 2024), the potential influence of training data amount was not considered. More importantly, word frequency was found to modulate the influence of model size and training data amount on the ability of LLM surprisal to predict human reading times (Oh et al., 2024). It is therefore likely that different estimates of frequency effects could be derived depending on the LLM used to operationalize predictability in a modeling study using naturalistic reading times. We illustrate this point through an experiment that closely follows the procedures of Shain (2024), as well as an experiment using multilingual reading time data.

Shain (2024) studies the dissociation of frequency and predictability effects at scale using the modeling framework of continuous-time deconvolutional regressive neural networks (CDR-NN; Shain, 2021, Shain and Schuler, 2024), which we also employ in our first experiment. While standard approaches like linear mixed-effects models (LMEM; Bates, Mächler, Bolker, & Walker, 2015) and generalized additive models (GAM; Wood, 2006) are typically used to study human reading times (e.g. Hoover et al., 2023, Oh and Schuler, 2023b, Wilcox et al., 2020), these modeling approaches assume that the current response reading time $y_i$ depends solely on the corresponding predictors $x_i$ and is independent of any preceding predictors. This limits LMEMs and GAMs in capturing the lingering influence of the current word on future reading times, which is known as ‘spillover’ effects in psycholinguistics (Rayner et al., 1983, Vasishth, 2006). While this issue is commonly addressed by including ‘spillover variants’ of predictors from preceding words as predictors of the current response reading time, this may lead to identifiability issues in LMEMs/GAMs and additionally makes the assumption that previously processed words are dispersed evenly throughout time.
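The ‘spillover variant’ approach can be sketched as follows: each word's row of predictors is extended with the values of the same predictor at the preceding word positions. The function name `add_spillover`, the zero padding at the start of the sequence, and the example values are assumptions for illustration only.

```python
def add_spillover(values, max_lag=1, pad=0.0):
    """For each position i, return [x_i, x_{i-1}, ..., x_{i-max_lag}].

    Positions before the start of the sequence are filled with `pad`.
    Note that lags are defined over word positions, not elapsed time.
    """
    rows = []
    for i in range(len(values)):
        row = [values[i - lag] if i - lag >= 0 else pad
               for lag in range(max_lag + 1)]
        rows.append(row)
    return rows


# Hypothetical surprisal values for four consecutive words.
surprisal = [3.2, 7.5, 1.1, 4.0]
rows = add_spillover(surprisal, max_lag=1)
```

Because the lagged columns are indexed by position, a preceding word read 150 ms earlier and one read 600 ms earlier are treated identically, which is the even-spacing assumption noted above.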

Continuous-time deconvolutional regression (CDR; Shain & Schuler, 2021) was developed to address these limitations by estimating a parametric continuous-time impulse response function (e.g. the three-parameter shifted Gamma function) in a data-driven manner. CDR-NN models are extensions of CDR models based on deep neural networks that estimate a nonlinear function that relates a set of predictors (e.g. unigram surprisal) to its effect on the parameters of the predictive distribution over the response (i.e. reading times) with some continuous time delay. In doing so, CDR-NN models additionally relax assumptions that the influence of the predictor on the response is linear and homoscedastic (constant error), which are also implausible for modeling the time course of naturalistic reading (Shain & Schuler, 2024).
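The idea of a continuous-time impulse response can be illustrated with a simplified sketch. The parameterization of the shifted Gamma function below (shape `alpha`, rate `beta`, shift `delta`), the parameter values, and the function names are assumptions for illustration. This is not the API of any CDR implementation, and in an actual CDR model the parameters and per-predictor coefficients are estimated from data.

```python
import math


def shifted_gamma_irf(t, alpha, beta, delta):
    """Gamma density evaluated at (t - delta); zero where t - delta <= 0."""
    x = t - delta
    if x <= 0:
        return 0.0
    return (beta ** alpha) * (x ** (alpha - 1)) * math.exp(-beta * x) / math.gamma(alpha)


def convolved_predictor(word_times, predictor_values, response_time,
                        alpha=2.0, beta=5.0, delta=-0.1):
    """Sum each preceding word's predictor value, weighted by the impulse
    response at the time elapsed between that word and the response."""
    total = 0.0
    for t_word, value in zip(word_times, predictor_values):
        lag = response_time - t_word
        if lag < 0:
            continue  # words presented after the response do not contribute
        total += value * shifted_gamma_irf(lag, alpha, beta, delta)
    return total


# Hypothetical word onsets (seconds) and surprisal values.
onsets = [0.0, 0.25, 0.9]
surprisals = [3.2, 7.5, 1.1]
effect_at_last_word = convolved_predictor(onsets, surprisals, response_time=0.9)
```

In contrast to position-based spillover predictors, the weight given to a preceding word here depends on how much time has elapsed since it was read.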

JOURNAL OF MEMORY AND LANGUAGE
