Cue-based analysis of speech: Implications for prosodic transcription
Stefanie Shattuck-Hufnagel
Speech Communication Group, Research Laboratory of Electronics, MIT
A stark view: Some unanswered questions
• What are the contrastive categories of spoken prosody?
• How does their phonetic implementation vary systematically with context?
• How do they relate to meaning and to interaction?
Prosodic parallels to a feature-cue-based approach to speech processing?
1) Segmental phonology: growing evidence that language users systematically control:
  • individual acoustic cues to contrastive phonemic segments
  • contextually appropriate parameter values of these cues
2) Models: representation and processing of surface phonetic information at this level of detail
  • feature-cue-based processing (Halle, Stevens)
3) Parallels in prosodic phonology?
  • if so, what are the implications for prosodic transcription?
Instruction giver’s map Instruction follower’s map
Reduction of surface word forms It’s probably the same thing.
probably the
Strengthening/clarification of surface word forms Are you going to have to do that all over again? ProbabLY .
Extremes of variation in word forms
Surface phonetic segments often not appropriate for transcription
• Cues not aligned in time
  – Cues to a feature can be distributed over time
    • nasality in V preceding a nasal coda C in I can go
    • duration of V preceding a voiceless coda C in I can’t go
  – Cues to features of two segments can overlap in time
    • /n + dh/ of win those realized as an interdental nasal
• Cues selected individually
  – Individual cues to features survive ‘deletion’ of segment
    • Duration of V preceding a ‘deleted’ voiced coda C in cat
  – Individual cues to features are sometimes added
    • Glottalized word-final /t/ sometimes also has closure and release burst
Feature-cue-based transcription provides a better fit
• Stevens 2002 (extending Halle 1972): Two types of features, two types of cues
  – Landmarks: abrupt spectral changes as cues to articulator-free features
    • Consonant, Vowel, Glide, Continuant, Sonorant, Strident
  – Landmark-related cues: spectral patterns near Landmarks, as cues to articulator-bound features
    • Labial, Coronal, Velar, Voiced, Nasal etc.
  – Additional acoustic events
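One way to picture cue-level labelling, as opposed to a string of surface segments, is a tier of individually time-stamped cues, each tied to the feature and segment it signals. The sketch below is illustrative only: the `Cue` record, the helper names, and every time value are assumptions invented for the example and are not part of the Stevens 2002 system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    """One labelled acoustic cue (hypothetical record layout)."""
    name: str       # what was observed, e.g. "nasal murmur"
    feature: str    # the feature it signals, e.g. "Nasal"
    segment: str    # the segment that feature belongs to
    start_s: float  # start time in seconds (invented values below)
    end_s: float    # end time in seconds


def overlaps(a: Cue, b: Cue) -> bool:
    """True when the two cues share some stretch of time."""
    return a.start_s < b.end_s and b.start_s < a.end_s


def cues_for(tier: list[Cue], segment: str) -> list[str]:
    """Names of all cues on the tier that signal features of one segment."""
    return [c.name for c in tier if c.segment == segment]


# "win those": cues to /n/ and /dh/ overlap in time rather than
# forming two successive segments (times are invented).
win_those = [
    Cue("nasal murmur", "Nasal", "n", 0.210, 0.290),
    Cue("interdental place", "Coronal", "dh", 0.230, 0.300),
]

# "can": nasality for the coda /n/ is distributed over the preceding
# vowel as well as the nasal itself (times are invented).
can = [
    Cue("vowel nasalization", "Nasal", "n", 0.080, 0.160),
    Cue("nasal murmur", "Nasal", "n", 0.160, 0.220),
]
```

With this layout, `overlaps(win_those[0], win_those[1])` reports the temporal overlap directly, and `cues_for(can, "n")` lists both cues to the nasal even though one of them sits on the vowel.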
Landmark cues
Rapid spectral changes across several energy bands, which provide information about articulator-free features (Boyce et al. 2013)
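The idea of "rapid spectral changes across several energy bands" can be sketched as a frame-by-frame comparison of band energies. This is a minimal illustration, not the detector used in the cited work: the function name, the 9 dB threshold, the three-band requirement, and the frame-to-frame comparison are all assumptions chosen for the example.

```python
def detect_landmarks(band_energies_db, threshold_db=9.0, min_bands=3):
    """Find frames where energy changes abruptly in several bands at once.

    band_energies_db: list of frames; each frame is a list of per-band
        energies in dB (same number of bands in every frame).
    Returns a list of (frame_index, polarity) pairs, where polarity is
    "+" for an abrupt rise and "-" for an abrupt fall.
    Threshold and band count are assumed values, not published settings.
    """
    landmarks = []
    for t in range(1, len(band_energies_db)):
        previous, current = band_energies_db[t - 1], band_energies_db[t]
        # Count bands whose energy jumps up or down by at least the threshold.
        rises = sum(1 for p, c in zip(previous, current) if c - p >= threshold_db)
        falls = sum(1 for p, c in zip(previous, current) if p - c >= threshold_db)
        if rises >= min_bands:
            landmarks.append((t, "+"))
        elif falls >= min_bands:
            landmarks.append((t, "-"))
    return landmarks


# Invented four-band example: quiet, then a loud stretch, then quiet again.
quiet = [20.0, 20.0, 20.0, 20.0]
loud = [40.0, 40.0, 40.0, 20.0]
frames = [quiet] * 5 + [loud] * 5 + [quiet] * 5
found = detect_landmarks(frames)
```

Requiring agreement across several bands is what separates an abrupt, landmark-like change from a change confined to a single frequency region.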
Landmark labelling captures individual cue patterns
Advantages of Landmark Cues in Speech Perception
• Reliably produced
  – 80% of predicted LMs in AEMT Corpus (Shattuck-Hufnagel & Veilleux 2007)
• Robustly detectable (‘auditory edges’)
• Highly informative
  – Articulator-free features (~manner) provide estimate of CV structure of the utterance
  – Identification of regions rich in cues to other features (place, voicing)
  – Inter-Landmark times provide estimate of durational markers of prosodic structure
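Inter-Landmark times can be read directly off a list of landmark times. The sketch below assumes landmark times in seconds are already available; the function names and the choice of the median interval as a reference point are assumptions made for illustration, not a published procedure.

```python
from statistics import median


def inter_landmark_intervals(landmark_times_s):
    """Durations between successive landmarks, in seconds."""
    times = sorted(landmark_times_s)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def relative_to_median(intervals):
    """Each interval as a ratio to the median interval.

    Ratios well above 1 mark stretches that are long relative to their
    neighbours, which is one rough pointer to durational lengthening.
    """
    if not intervals:
        return []
    typical = median(intervals)
    return [interval / typical for interval in intervals]


# Invented landmark times: the last interval is three times the others.
intervals = inter_landmark_intervals([0.0, 0.1, 0.2, 0.5])
ratios = relative_to_median(intervals)
```

A ratio is used here rather than a raw duration so that utterances spoken at different rates can be compared on the same scale.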
Extension to Production
A sketch of an extrinsic timing model
Stage 1: a phonological planning stage
  – symbolic segmental representations are sequenced and slotted into an appropriate prosodic structure
  – appropriate acoustic cues are selected for each segment’s features in its context
Stage 2: a phonetic planning stage
  – cues are mapped onto sets of articulators
  – appropriate values for spatial and temporal parameters of movement are computed
Stage 3: a motor-sensory implementation stage
  – articulator movements are generated and tracked.
Turk and Shattuck-Hufnagel 2014
Evidence for a Feature-Cue-Based production planning model
• Evidence that speakers can choose among individual cues
  – Feature cues left behind in phonetic reduction
  – New cues in challenging speaking circumstances
  – Inventory constraints on LM modification
• Evidence that speakers compute cue parameter values
  – Conversational convergence: partial, governed by social values
  – Covert contrast in development
  – Inventory constraints on final lengthening
Conversational convergence/divergence Neilson 2011
Covert contrast in child speech Scobbie 1998; see also Gibbon 1990
Covert contrast for stop voicing Macken & Barton 1980 JCL
Characteristics of the FCBP approach
• More complex planning by the speaker
  – Not ‘choose a surface allophone’
  – But instead, ‘choose context-appropriate feature cues and cue parameter values’
• Extensive interpretation by the listener
  – Which linguistic constituents and structures does the signal contain cues for?
  – What information about the interaction and the situation does the signal contain cues for?
Parallels in Prosodic Processing?
• Individual variation in cue patterns
  – Irregular pitch periods at prosodic boundaries and prominences (Pierrehumbert & Talkin 1992, Dilley et al. 1996)
• New cues in challenging speaking situations
  – Dysarthric speakers use duration instead of F0 to signal question vs statement (Patel 2003)
  – Whispered speech in Mandarin shows amplitude variation analogous to F0 shape for tones (Gao 2003)
• Interpretation of ambiguous cues in context
  – Early prominence patterns influence interpretation of ambiguous later prominence (Dilley & Shattuck-Hufnagel 1998)
  – Early speaking rate influences interpretation of ambiguous cues to function words (Dilley & Pitt 2008)
New cues in challenging speaking situations: Dysarthric Speech Patel 2003
New cues in challenging speaking situations: Whispered Speech https://lingos.co/blog/mandarin-tones/ Gao 1999
New cues in challenging speaking situations: Whispered Speech Gao 2003
Implications for Prosodic Transcription?
• Determine the contrastive categories
• Determine the range of appropriate cues and cue parameter values for each category, across contexts
• Determine the relationship of the categories (and cue parameter values) to meaning and to interaction
• Can cue-based transcription move us toward these goals?
Some useful steps
• Consider prosodic elements in terms of distributed cues to contrastive elements and parameter values for those cues
  – Rather than as a sequence of surface elements
• Develop displays of parameters as compelling as F0 contours
  – Duration and amplitude as % of typical
  – Autodetection of irregular pitch periods
• Create inventories of contrastive use of prosodic phrasing and prominence across languages
• Investigate ‘phonological equivalence’ in prosody
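The two display ideas above (a parameter as % of typical, and autodetection of irregular pitch periods) can each be sketched in a few lines. Both functions are illustrative assumptions: the "typical" value must come from some reference the analyst supplies, and the period-to-period change threshold of 20% is an invented heuristic, not the criterion used in the studies cited in this talk.

```python
def percent_of_typical(observed, typical):
    """Express a measured duration or amplitude as a percentage of a
    reference ("typical") value supplied by the analyst."""
    if typical <= 0:
        raise ValueError("typical value must be positive")
    return 100.0 * observed / typical


def flag_irregular_periods(periods_s, max_rel_change=0.2):
    """Return indices of pitch periods that differ from the preceding
    period by more than max_rel_change (a simple jitter-style heuristic).

    periods_s: successive glottal period durations in seconds.
    """
    flagged = []
    for i in range(1, len(periods_s)):
        previous, current = periods_s[i - 1], periods_s[i]
        if abs(current - previous) / previous > max_rel_change:
            flagged.append(i)
    return flagged


# Invented values: a 150 ms vowel against a 100 ms reference, and a run
# of periods with one abrupt lengthening followed by an abrupt shortening.
vowel_percent = percent_of_typical(150, 100)
irregular = flag_irregular_periods([0.008, 0.008, 0.0081, 0.013, 0.007, 0.008])
```

Plotting `percent_of_typical` values along the utterance would give a contour comparable in form to an F0 track, and the flagged indices could be overlaid on it as markers of irregular phonation.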
Phonological equivalence
Which differences distinguish contrasts?
Recommendations
More recommendations