Spontaneous Speech: How People Really Talk and Why Engineers Should Care
Elizabeth Shriberg
Acknowledgments: Matthew Aylett, Harry Bratt, Ozgur Cetin, Nizar Habash, Dilek Hakkani-Tur, Mary Harper, Jeremy Kahn, Kornel Laskowski, Robin Lickley, Yang Liu, Evgeny Matusov, Hermann Ney, Mari Ostendorf, Fernando Pereira, Owen Rambow, Andreas Stolcke, Isabel Trancoso, Gokhan Tur, Dimitra Vergyri, Wen Wang, Jing Zheng, Matthias Zimmermann, and other SRI and ICSI colleagues. Artwork: Patrick Stolcke.
Spontaneous speech
- Most speech produced every day is spontaneous
- It has been this way for a long time (today, long ago, long, long ago)
- Natural spoken language precedes written language
- Speaking requires no special training, is efficient, and carries a wealth of information
Problems for NLP
- Most natural language processing, however, is based on text
- Two problems:
  1. Spontaneous speech violates assumptions stemming from text-based NLP approaches
  2. Spontaneous speech is rich in information that is often not utilized by spoken language technology
- Goal of this talk: suggest that technology can work better if we pay attention to the special properties of spontaneous speech
Four challenge areas
1. Recovering punctuation
2. Coping with disfluencies
3. Allowing real turn-taking
4. Hearing real emotion
- Humans do these easily, but computers do not
- The tasks are important for a range of computational applications
- The tasks are interrelated
- Currently far from “solved”
- They apply across languages (although the focus here is on English)
Topics cover a range from lower level to higher level
- Punctuation: one speaker, basic segmentation
- Disfluencies: within segments, regions of disfluency
- Turn-taking: expand from single to multiple speakers
- Emotion: hearing more than just words
In this talk, for the benefit of engineers: more focus on lower-level than higher-level tasks
Example (punctuation, disfluencies, turn-taking, emotion):
  we -- it drives more like a car anyway . that’s something i i wouldn’t go as far as to say that it’s it’s just like a car . but uh the- that’s what the advertisement would say . ah ok
Four claims
If computers were really listening:
1. They would listen for sentence units, not speech between pauses.
2. They would cope with (and maybe even use) disfluencies.
3. They would model overlap and turns that are not strictly sequential.
4. They would “hear” our emotions.
Goal: not “strong AI”, but rather engineering solutions to model cues that humans use.
1. Recovering Hidden Punctuation
If computers were really listening, they would listen for sentences instead of speech between pauses.
Recovering hidden punctuation
- In many written languages, punctuation is explicit
- But in speech, punctuation is conveyed by other means
- Most ASR systems output only a stream of words; punctuation is “hidden”:
    Raw ASR:    tomorrow is fado here is the banquet tonight where
    Punctuated: Tomorrow is fado, here. Is the banquet tonight? Where?
- This is a problem for downstream natural language processing (NLP)
- The focus here is on sentence-level punctuation, the most important kind
Tasks that need sentence boundaries
- Processing by humans: humans comprehend transcripts of spoken language more effectively if they contain punctuation [Jones et al.]
- Processing by machine: ASR, parsing, information extraction, summarization, translation
Segmenting speech: the common approach
- ASR systems perform better on shorter segments of speech:
  - Keeps the search manageable
  - Avoids insertions in nonspeech regions
- In some dialog systems this is no problem: turns ≈ 1 sentence
- But conversational (and read) speech often has longer turns
- Current ASR systems typically chop at pauses (see the sketch below):
  - Pauses are easy to detect automatically
  - The approach avoids fragmenting words
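For concreteness, a minimal sketch of the pause-based chopping just described, assuming a (token, start, end) word-timing format and an illustrative 0.5 s threshold; both are hypothetical, not any particular system's values:

```python
# Minimal sketch of pause-based segmentation (the common ASR approach).
# Word timing format and the 0.5 s pause threshold are illustrative
# assumptions, not the values of any particular system.

def segment_at_pauses(words, pause_threshold=0.5):
    """Split a timed word stream wherever the silent gap before a word
    exceeds pause_threshold seconds."""
    segments, current, prev_end = [], [], None
    for token, start, end in words:
        if prev_end is not None and start - prev_end > pause_threshold:
            segments.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        segments.append(current)
    return segments

words = [("tomorrow", 0.0, 0.4), ("is", 0.45, 0.6), ("fado", 0.65, 1.1),
         ("here", 1.8, 2.1)]  # 0.7 s gap before "here"
print(segment_at_pauses(words))  # [['tomorrow', 'is', 'fado'], ['here']]
```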
Pauses ≠ sentence boundaries
- Many real sentence boundaries have no pause:
  - Speakers use other cues, including intonation
  - Some use a “rush-through” to prevent interruption
- And some nonboundaries do have a pause (hesitations)
- Example statistics (Switchboard; see the sketch below for how such counts can be computed):
  - 56% of within-turn sentence boundaries have no pause
  - 10% of within-turn pauses are not sentence boundaries
- Focus here is on NLP, but FYI, sentences also help ASR: a significant (3% relative) reduction in WER by segmenting at sentence boundaries instead of pauses
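A sketch of how such statistics could be computed from a labeled corpus; the per-gap record format and the 0.1 s minimum-pause cutoff are hypothetical assumptions, not the Switchboard annotation scheme:

```python
# Hypothetical data layout: one record per within-turn inter-word gap,
# with the pause duration in seconds and a sentence-boundary flag.

def pause_boundary_stats(gaps, min_pause=0.1):
    """gaps: list of dicts {'pause': float_seconds, 'is_boundary': bool}.
    Returns (% of boundaries with no pause, % of pauses that are not
    boundaries)."""
    boundaries = [g for g in gaps if g["is_boundary"]]
    pauses = [g for g in gaps if g["pause"] >= min_pause]
    pct_boundary_no_pause = 100.0 * sum(
        g["pause"] < min_pause for g in boundaries) / len(boundaries)
    pct_pause_not_boundary = 100.0 * sum(
        not g["is_boundary"] for g in pauses) / len(pauses)
    return pct_boundary_no_pause, pct_pause_not_boundary
```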
Computational models for punctuation
- Typically involve combining lexical and prosodic cues (a simplified combination is sketched below):
  - Language model: N-grams over words and punctuation tokens
  - Prosody model: features include pauses, duration, F0, and turn-taking; models include decision trees and neural networks; improved by sampling and ensemble techniques
- Prosody and LM combined via HMMs, maximum entropy models, or CRFs [Liu et al., 2005]
- Gains from multiple-system combination
- Research based on reference words or 1-best ASR; recent work uses multiple ASR hypotheses [Hillard et al., 2004]
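A deliberately simplified, per-boundary sketch of the LM + prosody combination: log-linear interpolation of the two knowledge sources at a single inter-word gap. A real system decodes the whole event sequence with an HMM or CRF [Liu et al., 2005]; the interpolation weight and input scores here are placeholders:

```python
import math

def boundary_posterior(lm_logp_bd, lm_logp_no, prosody_p_bd, weight=0.5):
    """lm_logp_bd / lm_logp_no: LM log-probs of a boundary / no-boundary
    token at this gap; prosody_p_bd: prosody-model P(boundary | features).
    Returns a combined boundary posterior via log-linear interpolation."""
    log_bd = weight * lm_logp_bd + (1 - weight) * math.log(prosody_p_bd)
    log_no = weight * lm_logp_no + (1 - weight) * math.log(1 - prosody_p_bd)
    m = max(log_bd, log_no)  # normalize in a numerically stable way
    p_bd = math.exp(log_bd - m)
    return p_bd / (p_bd + math.exp(log_no - m))

# E.g., the LM slightly favors a boundary and prosody strongly does:
print(boundary_posterior(math.log(0.3), math.log(0.2), 0.9))  # ~0.79
```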
Sentence segmentation: state-of-the-art results
- Same system of Liu et al. just mentioned; systems use lexical, prosodic, POS, and ‘turn’ information
- NIST error rate (errors per reference sentence; the metric is sketched below):

    Task                              Baseline (chance)   Ref Words   ASR
    Conversational Telephone Speech   100                 29.3        41.9
    Broadcast News Speech             100                 46.3        54.3

- A difficult problem; large degradation for ASR (especially CTS)
- Broadcast News has higher NIST error rates, due to:
  - Fewer true boundaries (longer sentences)
  - Few 1st-person pronouns and fillers (cues to sentence starts)
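A hedged sketch of the NIST-style metric in the table, read as misses plus false alarms divided by the number of reference boundaries, so that hypothesizing no boundaries at all scores 100 (the chance baseline); the example boundary positions are invented:

```python
def nist_boundary_error(ref_boundaries, hyp_boundaries):
    """Boundaries are inter-word gap indices. Returns errors per 100
    reference boundaries: (misses + false alarms) / |ref| * 100."""
    ref, hyp = set(ref_boundaries), set(hyp_boundaries)
    misses = len(ref - hyp)
    false_alarms = len(hyp - ref)
    return 100.0 * (misses + false_alarms) / len(ref)

print(nist_boundary_error({3, 8, 14}, set()))       # 100.0 (chance)
print(nist_boundary_error({3, 8, 14}, {3, 9, 14}))  # ~66.7 (1 miss + 1 FA)
```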
Sentence segmentation and parsing
- Parsing is useful for many downstream NLP tasks
- Parsing algorithms need short input units; otherwise processing becomes too computationally expensive (super-linear algorithmic complexity)
- Parsing of text can use sentence punctuation; for speech, the units must be inferred automatically
- Hot off the press: results from the JHU 2005 Workshop project on parsing and “metadata” (thanks to M. Harper and Y. Liu). Earlier related work: [Kahn et al., 2004]
Sentence segmentation and parsing [JHU WS-2005; M. Harper, Y. Liu]
- Charniak parser on true words or ASR (~13% WER) output
- Parsing results (bracket F-measure):

    Sentence segmentation          Ref Words   ASR
    Human                          83.25       71.42
    Pause-based (0.5 sec)          63.09       54.62
    Automatic [Liu et al., 2005]   74.34       64.03

- Sentences really matter: large effects (1% is significant)
- The automatic system (words and prosody) scores more than halfway from pause-based to reference-based performance
Decision threshold depends on task [Y. Liu]
[Figure: sentence detection error metric and parsing F-measure (higher = better) plotted against the decision threshold, roughly 0.2 to 0.8.]
- The threshold optimal for the sentence boundary task itself may be suboptimal for downstream NLP
- Optimal threshold for the boundary task: a wide range in the middle
- But parsers prefer lower decision thresholds (shorter units); a tuning sketch follows
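A sketch of tuning the decision threshold on a downstream metric rather than on the boundary task itself; `posteriors` (per-gap boundary probabilities) and `downstream_score` (e.g., parse F-measure for a given segmentation) are hypothetical stand-ins for a real system's components:

```python
def tune_threshold(posteriors, downstream_score, grid=None):
    """Return the threshold in `grid` that maximizes downstream_score,
    where each candidate segmentation places boundaries at the gaps
    whose posterior meets the threshold."""
    grid = grid or [t / 20 for t in range(1, 20)]  # 0.05 .. 0.95

    def segment(threshold):
        return [i for i, p in enumerate(posteriors) if p >= threshold]

    return max(grid, key=lambda t: downstream_score(segment(t)))
```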
Sentence segmentation and other NLP
- Other areas of NLP have also been using models trained on text containing punctuation:
  - Information extraction [Makhoul et al., IS05]
  - Summarization [Murray et al., IS05]
  - Machine translation (in a moment)
- As these areas become consumers of ASR, problems arise when punctuation must be inferred:
  - Automatic segmentation can cause downstream errors
  - Basic assumptions about the scoring paradigm are violated
- Little published work in this new area, but new programs (like DARPA GALE) mean we should see some soon
Sentence segmentation & machine translation
- Like parsing, MT requires chopping speech into small units
- Some meanings depend on within-sentence context, so we need to get the boundaries right
- For example, suppose Isabel asks Fernando whether the audience gave him a hard time at his keynote:

    ASR + Auto Punctuation      MT Output
    Não . Foram simpáticos .    No. They were nice!
    Não foram simpáticos .      They were not nice.
Sentence segmentation and MT scoring
- MT scoring relies on sentence-by-sentence comparison of reference and hypothesized translations; if the segmentations differ, sentence-level comparison is not meaningful
- In parsing, one can string together all reference and all system output into one long ‘sentence’ and then apply standard metrics
- But in MT, N-gram metrics (BLEU) are too forgiving of reorderings; too many spurious far-away matches get counted as correct
- A recently proposed solution [Matusov et al., 2005]: resegment hypotheses according to the reference sentences, as sketched below
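A hedged sketch of the resegmentation idea, as one might implement it: Levenshtein-align the unsegmented hypothesis word stream to the reference word stream, then cut the hypothesis wherever a reference sentence ends. This is a simplified reading of the approach, not the exact algorithm of [Matusov et al., 2005]:

```python
def resegment(ref_sents, hyp_words):
    """ref_sents: reference sentences as lists of words; hyp_words: flat
    system output. Returns hyp_words re-cut into len(ref_sents) segments."""
    ref_words = [w for s in ref_sents for w in s]
    n, m = len(ref_words), len(hyp_words)
    # Standard edit-distance DP table
    d = [[i + j if i == 0 or j == 0 else 0 for j in range(m + 1)]
         for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1]),
                          d[i - 1][j] + 1,   # ref word deleted in hyp
                          d[i][j - 1] + 1)   # hyp word inserted
    # Backtrace, recording the hyp position each ref position aligns to
    align, i, j = {0: 0}, n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           d[i][j] == d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1]):
            align[i] = j
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            align[i] = j
            i -= 1
        else:
            j -= 1
    # Cut hyp at the positions aligned to reference sentence ends
    ends, total = [], 0
    for sent in ref_sents:
        total += len(sent)
        ends.append(align.get(total, m))
    segments, prev = [], 0
    for e in ends:
        segments.append(hyp_words[prev:max(e, prev)])
        prev = max(e, prev)
    segments[-1].extend(hyp_words[prev:])  # keep any trailing hyp words
    return segments

ref = [["no", "they", "were", "nice"], ["really"]]
hyp = "no they was nice really".split()
print(resegment(ref, hyp))  # [['no', 'they', 'was', 'nice'], ['really']]
```

After this resegmentation, hypothesis and reference have the same number of segments, so standard sentence-level MT metrics can be applied.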