Segmental contributions to prosodic weight in processing English auxiliary contractions

It is widely recognized that the comprehension of spoken language proceeds through several stages: from auditory processing, through lexical processing and syntactic processing, to higher-level processing such as discourse processing. One important question is what degree of detail is passed on from one level to another. In spoken-word recognition, it has been shown that uncertainty about segment identity is retained and used at later stages (McMurray et al., 2009), so that listeners find it easier to recognize bumpernickel as pumpernickel when the [b] in bumpernickel is closer to the [b]-[p] category boundary. In contrast, it has often been assumed that discourse processing relies on a propositional representation of sentence meaning in which the exact form of the sentence (e.g., what kind of dative construction is used) does not matter. This assumption is based on the finding that explicit memory for surface sentence structure is indeed quite poor: although recognized words serve as input in some form, listeners do not appear to rely on the exact forms of those words (Ferreira & Çokal, 2016). The same may hold in reading, where an impact of common misspellings (e.g., broccoli spelled as brocoli with only one ‘c’) on sentence-level processing has, to our knowledge, not been clearly documented. The assumption may also be justified in speech processing, as listeners appear to be relatively insensitive to the exact forms of words, even when asked about specific input properties. Listeners may report hearing segments that are absent from a reduced word form in casual speech but typically present in clear speech (Kemps et al., 2004). For instance, when hearing a reduced form like ntuuk for the Dutch word natuurlijk (English: naturally), listeners report hearing an /l/ despite its absence in the acoustic input.
This is also reminiscent of findings that exact word order is not clearly represented in an explicit manner in sentence memory (e.g., Masson, 1984).

However, relying on such explicit judgements from listeners to determine the relationship between the form of spoken input and sentence processing may be problematic. For instance, human observers are sensitive to the frequency of longer surface strings, known as N-grams (Shaoul et al., 2014), which is likely to influence sentence processing. This suggests that the surface structure of sentences must be remembered (or stored in some form); otherwise, N-grams could not influence processing. Although such information may not be accessible through explicit recall, it is reasonable to assume that it can still influence implicit levels of processing. For instance, Brown-Schmidt and Toscano (2017) elaborated on the findings of McMurray et al. (2009) and showed that listeners retain uncertainty about word identity (between he or she) at the sentence-processing level. In the data presented here, we address a similar question at a more fine-grained level: can the exact segmental phonetic form of a word influence sentence processing?

The phrase segmental phonetic form in this question needs refinement; otherwise, the answer would be obviously yes. It is particularly important to understand how fine phonetic details of a segmental phonetic form are shaped by prosodic structure. It is well-established that the prosodic realization of a word in terms of duration, f0, and intensity influences sentence processing (Bögels et al., 2011, Ito and Speer, 2008, Steinhauer et al., 1999, Weber et al., 2006), as this realization is often used to compute prosodic structure, which frequently correlates with syntactic structure (Bennett & Elfner, 2019; cf. Elfner, 2018). In fact, these prosodic influences form the basis of the current study, as prosodic structure—such as prosodic phrasing and prominence distribution within an utterance—not only affects the realization of typical suprasegmental parameters (e.g., f0, amplitude, and duration) but also shapes the precise segmental details of the phonetic form (e.g., Cho, 2016, Fletcher, 2010). Consequently, fine phonetic details of segmental realization may systematically covary with prosodic parameters, collectively signalling the prosodic structure of a sentence and thereby influencing sentence-level processing. In this regard, we use the term prosodic weight to refer to the degree to which phonetic cues—both suprasegmental and subsegmental—contribute to the listener’s computation of prosodic structure, which in turn modulates sentence processing. Such influences can be subtle and gradient, as demonstrated by Cho et al. (2017), who measured the degree of coarticulatory nasality in the vowel of words such as ban. They found that the extent of vowel nasalization varied with prosodic structure, showing, for example, less vowel nasalization when the word carried a pitch accent compared to when it did not. This pitch accent effect was further influenced by information structure, as the pitch accent was induced on words receiving narrow or contrastive focus. 
The degree of vowel nasalization was also found to vary depending on the prosodic position or boundary in which the segment occurred (see also Jang et al., 2018, Jang et al., 2022, Li et al., 2020). Taken together, these findings motivate the central theoretical premise of the present study: subphonemic segmental variation, when it systematically patterns with prosodic phrasing and prominence distribution, directly contributes to the construct of prosodic weight.

A more extreme example of prosodic influences on segmental realization can be observed in contraction processes, such as wanna contraction where want to is reduced to wanna. These contractions are often blocked by prosodic boundaries created, for example, by a syntactic trace (Ackema and Neeleman, 2003, Goodall, 2017). Thus, the contraction is possible in the question in (1a), where the trace is at the end of the sentence, but not in (1b), where the trace separates want and to. This pattern suggests that, since the syntactic juncture marked by a trace typically aligns with a prosodic juncture, the contraction is only permitted when want and to occur within the same prosodic phrase. However, as the studies on coarticulatory nasalization cited above demonstrate, prosodic influences on segmental details arise not only from prosodic boundaries, as seen in the case of wanna contractions, but also from information structure related to prominence distribution, which is often marked by pitch accents in English. This forms the foundation of our current study, which examines a different type of contraction involving the auxiliary have, as shown in examples (2a) and (2b). We assume here that the contraction in (2a) signals that they’ve is backgrounded, either because it carries little new information (see also Frank & Jaeger, 2008) or because they have is given. This would apply, for example, if the team was expected to win the game, or if the focus is on the fact that they (only) won the game but not the championship. In both scenarios, the pitch accent typically falls on game, the head of the noun phrase, but is more prominent when the noun phrase contrasts with an element in a preceding question. In contrast, (2b) may better suit a context in which the team was expected to lose, but they did not. This would make the auxiliary verb more likely to receive the pitch accent, as in, Believe it or not, they HAVE won the game.
The capitalization of HAVE here indicates that speakers would not only use the uncontracted segmental form of the auxiliary have [hæv] but also produce it with suprasegmental variation, marked by pitch accent correlates such as increased pitch, duration, and amplitude. In this case, pitch accent placement on have can be attributed to information structure, where the interlocutors may have different levels of knowledge or contrasting beliefs. Thus, as noted by Baker (1971), the fact that the pitch-accented (stressed) form is produced as an uncontracted variant suggests a correlation between segmental form and information structure, as the formation of prosodic structure, particularly as reflected in the distribution of pitch accents within an utterance, is often shaped by information structure. In this paper, we explore this possibility by asking whether segmental cues to prosodic structure, specifically the contracted versus uncontracted forms of the auxiliary have, contribute to prosodic weight, convey information-structural meaning, and thereby support the listener’s inferences, even in the absence of concomitant suprasegmental cues.

As such, the question we examine here concerns the interplay between prosodic and segmental processing. At this stage, it is useful to consider how our work relates to previous research on this interface. A substantial body of research has shown that prosodic processing influences word recognition. For example, segment duration, a prosodic cue, can influence where listeners place word boundaries, clearly implicating processes in the segmental stream (Davis et al., 2002, Salverda et al., 2003). More recent studies have demonstrated that utterance-level suprasegmental information in the distal (i.e., preceding) context can shape how currently processed (proximal) target words are perceived (e.g., Dilley & McAuley, 2008; Dilley & Pitt, 2010; Dilley et al., 2010; Pitt et al., 2016; see McQueen & Dilley, 2020, for discussion). For instance, distal speech rate influences whether listeners perceive a vowel as a singleton or a (false) geminate in phrases such as minor (or) child [ˈmaɪnɚ(:)tʃaɪɫd] (Dilley & Pitt, 2010), mirroring speech-rate effects on consonant geminate perception (Mitterer, 2018). Interestingly, these effects persist even when proximal cues for segmental structure are present (for example, acoustic correlates of glottalization that signal two [ɚ] vowels; Heffner et al., 2013) and are further modulated by the overall speech rate of the stimuli during the whole study (Baese-Berk et al., 2014). Eye-tracking evidence also shows that distal speech rate can influence early stages of segmental processing (Brown et al., 2011; Brown et al., 2021; see also Reinisch & Sjerps, 2013), affecting, for example, whether listeners fixate “pan” or “panda” (Brown et al., 2011).
Auditory evoked potentials (Breen et al., 2014) further reveal that, depending on distal speech rate and rhythmic context (e.g., strong–weak patterns), the likelihood that a lexically stressed syllable is perceived as word-initial can be modulated, which in turn influences how ambiguous word boundaries (e.g., tie murder bee vs. timer derby) are resolved.

More recently, Brown, Tanenhaus, and Dilley (2021) proposed a Syllable Inference account of spoken word recognition, in which listeners dynamically generate and evaluate probabilistic hierarchical language models—composed of syllables, words, and phonemes—to predict and interpret unfolding speech. Across three experiments, they demonstrated that subtle temporal cues, especially the relative timing between proximal and distal speech regions, influence whether listeners perceive singular or plural phrases (e.g., saw a raccoon vs. saw raccoons). Critically, they found that perception is not categorical but gradient: manipulating timing cues led to graded shifts in both eye movements and explicit responses, reflecting continuous updates in the likelihood of alternative interpretations. For instance, eye-tracking data showed that listeners gradually shifted their gaze toward singular or plural referents depending on how strongly timing cues supported each model, revealing dynamic probabilistic inference. Experiments 2 and 3 also revealed “right-context” effects, where later-occurring speech altered the interpretation of earlier ambiguous material—supporting the view that listeners continually re-evaluate interpretations as new input arrives. These findings support a predictive, context-sensitive, and gradient model of speech processing, consistent with Dilley and colleagues’ previous studies mentioned above. They provide a framework for understanding how listeners resolve ambiguity in real time under variable speech conditions.

Taken together, these studies demonstrate a clear interaction between prosodic and segmental processing. In this line of research, the main independent variable—typically distal speech rate—is prosodic in nature, and the observed effects are measured in terms of word segmentation and recognition. While such effects on word recognition undoubtedly have downstream consequences for sentence processing, as different words must be integrated into a sentence frame, it would be a stretch to claim that these effects arise at the level of sentence processing itself. Note, however, that some studies clearly involve grammatical structure, as examined in Brown et al. (2021), in relation to fine-grained phonetic detail in both distal and proximal segmental and suprasegmental information. Studies that directly address such effects at the sentence level still remain limited, and those in which segmental information serves as the main independent variable are especially rare—particularly those designed to examine the independent role of segmental detail in shaping online sentence comprehension.

However, exploring the impact of prosodically conditioned segmental detail on sentence processing is not new. Previous studies have investigated how segmental details shaped by prosodic structure may influence sentence processing. For instance, in an offline judgment task, Scott and Cutler (1984) used sentences such as The day we met Ann was horrible, which could be interpreted as meaning that either the day or the person Ann was horrible. When the /t/ in met was flapped (indicating that met and Ann belonged to the same prosodic phrase, which conditions flapping), listeners were more likely to interpret the sentence as [The day we met Ann]#[was horrible], where ‘#’ indicates a syntactic juncture aligned with a prosodic boundary. In another offline judgment task, Mitterer et al. (2021) investigated the processing impact of the glottal stop in Maltese, which serves a dual role as both a segment and a marker of prosodic structure. That is, the glottal stop has phonemic status but also occurs as an epenthetic element at the onset of vowel-initial words. Crucially, in the latter function, the epenthetic glottal stop is more robustly realized when the vowel-initial word occurs at a higher prosodic boundary (e.g., a phrase boundary), which may again coincide with a syntactic juncture. Mitterer et al. demonstrated that listeners could indeed use it as a cue to prosodic boundaries when parsing structurally ambiguous phrases of coordinated names (e.g., Daniel and Malcolm or Gordon) in a forced-choice task, i.e., deciding whether the phrase is parsed as [[Daniel and Malcolm] or Gordon] or [Daniel and [Malcolm or Gordon]]. All else being equal, when the /u/ (and in Maltese) following “Daniel” was glottalized, listeners preferred to separate “Daniel” from “Malcolm,” resulting in the second parsing.

While these studies demonstrate that segmental detail can inform decisions at the sentence processing level, both rely on overt responses in a forced-choice task. This raises two questions: first, whether this information is used routinely or only in ambiguous contexts; and second, whether it plays a role in online sentence processing or only at the decision-making stage. To address these questions, Mitterer et al. (2024) tested the use of the German verum focus. The German verum focus is a language-dependent strategy to highlight the auxiliary verb in a dialogue in order to underscore the agreement with the given information (Turco et al., 2014). Mitterer et al. (2024) devised a mouse-tracking task using a task model first used by Roettger and Franke (2019). In German, an affirmative answer to a yes–no question can be accompanied by a pitch accent on the auxiliary, as in the example (3b) below, with the capitalization indicating a pitch accent.

Using a mouse-tracking experiment similar to that employed by Roettger and Franke (2019), Mitterer et al. (2024) developed a web-based mouse-tracking task designed to engage participants and prompt them to anticipate how the sentence would end. Their online task was loosely based on the classic arcade game Galaga. Participants heard a dialogue like the one in (3) while viewing a screen with a cartoon spaceship at the bottom and two objects positioned at the top, which then fell slowly toward the bottom of the screen with a wiggling horizontal motion. This motion was intended to encourage participants to track a potential target object even before it was mentioned. Participants were then asked to enact the scene to match the dialogue. To that end, participants could move the spaceship left and right along the bottom of the screen and use a mouse click to “activate a tractor beam” to collect an object if it was directly above the spaceship. The critical dependent variable was the point at which participants moved toward the eventual target and continued tracking its wiggling motion.

There were three experiments in Mitterer et al. (2024); the first showed that the effect found in a lab-based mouse-tracking experiment (Roettger & Franke, 2019) could be replicated in a web-based setting. Specifically, when participants heard a verum focus in the reply, they quickly moved toward the object that had already been mentioned in the question (e.g., a violin, as in 3b), which was always the target. In the absence of a verum focus, however, they moved toward the new object (e.g., an object other than the violin already mentioned in 3a). Their second experiment tested whether segmental variations of the auxiliary haben could trigger similar effects without any changes in suprasegmental features typically associated with a verum focus. The auxiliary typically occurs in three main forms: the full form [habən], the slightly reduced form [habm], and the strongly reduced form [ham]. Importantly, the full form is rarely used in normal speech, while the slightly reduced form serves as the frequent default in spoken communication. The strongly reduced form, on the other hand, is still common but is primarily used in casual, conversational contexts. Just as Cho et al. (2017) demonstrated in their study on vowel nasalization in English, where coarticulatory vowel nasalization was conditioned by prosodic weight (i.e., less coarticulatory nasalization indicates a stronger prosodic position or prominence), Mitterer et al. considered prosodic weight to influence the choice of these forms, with greater prosodic weight resulting in less coarticulation. The authors made the following predictions. The full form [habən], being infrequent and marked, should signal a segmental version of verum focus, which often triggers the use of such a full segmental form. 
In contrast, the strongly reduced form [ham] should create an expectation of a pitch accent (or focus) elsewhere in the sentence, which in turn backgrounds or de-emphasizes other elements (e.g., haben in this case), making them more susceptible to reduction processes (e.g., Cangemi and Baumann, 2020, Lam and Watson, 2010, Mücke and Grice, 2014). This would align with a contrastive focus on the object (e.g., It was the PEAR, not the violin.), leading to the expectation that the answer would feature the object not mentioned in the question. On the other hand, the slightly reduced form [habm] is expected to be uninformative or relatively neutral, as it does not clearly signal either of the two extremes. Indeed, their results were largely consistent with these predictions. The experiments in Mitterer et al. (2024) showed that participants were faster to turn toward the eventual target (which was always the one consistent with the above predictions) when the phonetic form of the auxiliary was expected to be informative—either in its full form [habən] or its strongly reduced form [ham]—than when it was in the default form [habm]. However, the size of this advantage was only half that observed with prosodic cues to pitch accent.

In a third experiment, they tested whether these results were truly due to participants' expectations based on their experience as native speakers of German, or if the informative forms of the auxiliary haben were simply informative within the specific experimental context. In fact, the full form of the auxiliary was always followed by the given object (i.e., the one mentioned in the question), while the strongly reduced form was always followed by the new object. In contrast, the uninformative form [habm] provided no indication of the experimental context: either the given or the new object could be the eventual target with this form. Thus, it was possible that participants learned the informativeness throughout the experiment, which may have contributed to the observed results. That is, participants may have simply learned the existing co-variation between segmental detail and prosodic weight (i.e., prominence) within the experimental setting. Their third experiment therefore assessed whether the informativeness of the two forms (full or strongly reduced) was a result of learning within the experimental context. A new set of stimuli was created in which the phonetic form of the informative versions of the auxiliary was consistently masked with noise. The reduced form [ham] was masked by a dog bark, while the full form [habən] was masked by a car horn. Because the context remained unchanged, the stimuli with masking noise were just as informative within the experimental setting as in the previous experiment. Thus, the reasoning behind this design is that, if the results of their second experiment (suggesting that segmental detail is linked to prosodic weight and serves as an informative cue for speech processing) were merely an artifact of the experimental setting that facilitated learning, similar effects should be observed when the critical stimuli were replaced with alternative sounds in a consistent way.
The results revealed that learning with the alternative sounds does occur over the course of such an experiment. Participants, for example, became faster at moving to the new object when cued by a dog bark sound, which was matched with the reduced form [ham], and to the given object when cued by a car horn sound, which was matched with the full form [habən]. Crucially, however, learning occurred gradually over experimental blocks, with the effect of the alternative sounds stabilizing only in the later part of the third experiment, whereas the effect of the original stimuli was already observable at the start of the experiment and remained stable throughout.

Mitterer et al. (2024) suggest that, while learning effects may contribute to their results, segmental details, independent of suprasegmental features, carry prosodic weight that can be aligned with prosodic structure and are effectively used by listeners when referencing information structure. More broadly, these findings have implications for sentence processing in relation to the syntax-prosody interface, where the prevailing view is that sentence comprehension is modulated primarily by suprasegmental (prosodic) variation (cf. Elfner, 2018, Bennett and Elfner, 2019). However, how segmental details are processed in sentence comprehension, specifically as auxiliary cues in computing prosodic structure and its role in shaping information structure, remains underexplored. Therefore, it is essential to investigate these phenomena across different contexts and languages to fully understand the role of segmental details in speech comprehension, especially in conjunction with their dual function in both lexical access and prosodic structuring. Furthermore, it is crucial to validate Mitterer et al.’s (2024) findings, given the potential confounding effects of learning and differential prosodic weight. To address these gaps, we extend this line of research by examining English, where auxiliary contractions serve as an additional test case for how segmental details cue prosodic structure. It is also worth noting that English auxiliary contractions seem to be more standardized than their German counterparts, as the former are frequently used in both spoken and written forms (e.g., ‘They’ve’ in English), whereas the latter (e.g., sie ham for sie haben) are primarily limited to spoken language and are rarely found in formal writing. Thus, it remains to be seen how the two languages may differ in using segmental details to compute prosodic structure in reference to information structure, given their different levels of codification.

In sum, in the current study—while building on, but clearly contrasting with Brown et al. (2021) discussed earlier—the main manipulation is segmental in nature: whether the phrase they have is produced in its full or contracted form. The dependent measure is whether participants interpret this phrase as conveying contrast or agreement within a discourse context—and thereby anticipate an upcoming word—which reflects a core process in sentence-level comprehension. Clearly, such discourse context is analogous to Dilley and colleagues’ distal context in some respects (discussed earlier), but it does not provide the distal prosodic and rhythmic structure. Therefore, while both lines of research explore interactions between segmental and prosodic processing, they target distinct levels of linguistic representation and processing: prior distal-context studies operate primarily at the level of word recognition and segmentation, whereas the present work examines how segmental variation feeds into sentence-level interpretation via information structure. This study thus extends prior work on prosody–segment interaction by shifting the focus toward how fine-grained segmental variation—shaped by prosodic structure—can itself serve as a cue to information structure and influence sentence-level interpretation. Rather than using prosodic variables like speech rate to modulate segmental perception, we test whether variation in segmental form (i.e., full vs. contracted have) can trigger predictive inferences in real time. In doing so, the study contributes to a more nuanced understanding of how phonetic detail and prosodic prominence, in conjunction with information structure, jointly support the construction of meaning during spoken language comprehension.
