Various Approaches
The Prosody of Turn-Taking:
• acoustic measurement and hypothesis testing
• classic linguistic methods
• machine learning
• dialog system user studies
• perception of synthesized stimuli
• conversation analysis
A Case Study in the Identification of Prosodic Cues to Turn-Taking: Back-Channeling in Arabic
Nigel Ward and Yaffa Al Bayyari
University of Texas at El Paso
Interspeech 2006
The Second Channel
Form: gesture, gaze, prosody ...
Content: uncertainty, novelty, dialog control ...
Value: efficiency, satisfaction, + (Shriberg 2005) ...
1. Project Aims
Discover the rules governing back-channeling in Arabic, to teach soldiers how to “show you’re listening”:
• a qualitative description to use for teaching, plus
• a quantitative description to drive the characters
2. Problem Formulation
Other framings: planning for dialog goals, mutual models, joint action, co-construction
Framing used here: if X then back-channel
bfgklfm ... tvzyxyv ...
• using only past information (no look-ahead)
• using only features computable from the signal (no hand-labeling)
3. Corpus Preparation
All the usual issues ... The corpus must be big enough to
• find a good dialog
• do proper evaluation
UTEP Corpus of Iraqi Arabic: 112 minutes, 689 back-channel tokens
4a: Feature Discovery: unclear where to look
[Figure: timeline leading up to the back-channel, with distant causes earlier and the likely location of the proximate cause between -1000 ms and -200 ms]
Complications:
• time from cue to back-channel varies (complicates machine learning)
• salient events can obscure the cues (complicates perceptual analysis)
Example: what do the following have in common?
4b: Feature Discovery: the overwhelming multitude of features
Features can be computed over prosody (pitch, energy, timing), voicing ... (possibly in combination), for example:
• height of the highest pitch peak in the last 400 ms, relative to the baseline over the past 2000 ms
• first coefficient of a second-order approximation to the pitch curve over the last three syllables before a pause of at least 200 milliseconds
• presence of a 150 millisecond region with the pitch consistently below the 26th percentile
...
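As a rough illustration of how two of the example features on this slide might be computed, here is a minimal sketch. It is not the authors' implementation (their detectors were written in C). It assumes pitch is given as one value in Hz per 10 ms frame, with `None` for unvoiced frames; it takes the "baseline" to be the median voiced pitch and computes the percentile over the frames passed in. Both choices, and all function names, are assumptions made for this example.

```python
import math
from statistics import median
from typing import List, Optional

FRAME_MS = 10  # assumed frame step; the slides do not state one for features


def voiced(frames: List[Optional[float]]) -> List[float]:
    """Keep only frames that have a pitch value."""
    return [f for f in frames if f is not None]


def peak_height_vs_baseline(pitch: List[Optional[float]],
                            peak_ms: int = 400,
                            base_ms: int = 2000) -> Optional[float]:
    """Ratio of the highest pitch in the last peak_ms to the baseline
    over the last base_ms. Baseline = median voiced pitch (an assumption)."""
    recent = voiced(pitch[-(peak_ms // FRAME_MS):])
    context = voiced(pitch[-(base_ms // FRAME_MS):])
    if not recent or not context:
        return None
    return max(recent) / median(context)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def has_low_pitch_region(pitch: List[Optional[float]],
                         pct: float = 26,
                         region_ms: int = 150) -> bool:
    """True if some run of at least region_ms has pitch strictly below
    the pct-th percentile of the voiced frames supplied."""
    values = voiced(pitch)
    if not values:
        return False
    threshold = percentile(values, pct)
    needed = region_ms // FRAME_MS
    run = 0
    for f in pitch:
        run = run + 1 if (f is not None and f < threshold) else 0
        if run >= needed:
            return True
    return False
```

Even in this small sketch, each feature carries several free parameters (window lengths, percentile, baseline definition), which hints at why the space of candidate features is so large.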
4c: Feature Discovery: harnessing perception
Audio inspection:
• perceive lots of information
• hard to focus on specific features
• hard to scan to specific places
Visual inspection:
• perceive only what’s salient graphically
• easy to focus on specific features
• easy to scan to specific places
Neither:
• no subjectivity
• no insight
Both:
• perceive lots of information
• navigate quickly
• focus easily
• need tools
A Custom Tool for Integrated Analysis: Didi
4d: Feature Discovery: quantifying perceptions
[Flowchart: perceptually identified feature → label occurrences → identify acoustic correlates → program feature detector (in C, alas) → good match? If yes, done; if no, repeat]
Since some features are pervasive, and hence uninformative, listen casually first to get familiar with the pervasive patterns.
5. Feature Combination
[Figure: timeline from -1000 ms to -200 ms before the back-channel, with the labels: pitch downdash, pause, no back-channel, flat pitch region, substantial speech]
Feature combination is tricky, since features are not always synchronized.
6. Hypothesis Refinement
Ideas feed two activities:
• refining the quantitative description (by programming and debugging)
• refining the qualitative description (by listening and looking)
plus evaluation against the corpus: missed predictions and false alarms
Examples: a back-channel cue in Spanish; a false alarm
7. Hypothesis Tuning
Hill-climbing suffices (iff the previous steps were done well).
Resulting rule:
If
• an utterance has lasted at least 1.2 seconds, and
• contains a pitch downdash
• lasting at least 40 milliseconds, with
• a pitch drop of at least 0.7% every 10 ms
...
then
• predict a back-channel in response, 300 ms later
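The stated conditions of the rule can be sketched in a few lines. This is only an illustrative reading, not the authors' detector: the conditions elided by "..." on the slide are left out, the input is assumed to be a single utterance given as pitch in Hz per 10 ms frame (`None` where unvoiced), and the function and constant names are hypothetical. The sketch fires once per downdash, at the moment the downdash reaches 40 ms, provided the utterance has by then lasted 1.2 seconds.

```python
from typing import List, Optional

FRAME_MS = 10               # frame step, matching "every 10 ms" in the rule
MIN_UTTERANCE_MS = 1200     # utterance has lasted at least 1.2 seconds
MIN_DOWNDASH_MS = 40        # downdash lasts at least 40 milliseconds
MIN_DROP_PER_FRAME = 0.007  # pitch drop of at least 0.7% every 10 ms
RESPONSE_DELAY_MS = 300     # predict the back-channel 300 ms later


def predict_back_channels(pitch_hz: List[Optional[float]]) -> List[int]:
    """Return predicted back-channel times in ms from utterance start.

    Only the conditions stated on the slide are checked; the elided
    ones are not modeled here.
    """
    predictions = []
    run_ms = 0  # duration of the current run of sufficiently steep drops
    for i in range(1, len(pitch_hz)):
        prev, cur = pitch_hz[i - 1], pitch_hz[i]
        steep_drop = (prev is not None and cur is not None
                      and cur <= prev * (1 - MIN_DROP_PER_FRAME))
        run_ms = run_ms + FRAME_MS if steep_drop else 0
        now_ms = (i + 1) * FRAME_MS  # end time of frame i
        # Fire exactly once, when the run first reaches the minimum length.
        if run_ms == MIN_DOWNDASH_MS and now_ms >= MIN_UTTERANCE_MS:
            predictions.append(now_ms + RESPONSE_DELAY_MS)
    return predictions
```

Because the rule uses only frames up to the current one, it respects the no-look-ahead constraint from the problem formulation. The thresholds are plain constants, which is what makes simple hill-climbing over them practical.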
8. Evaluation
• by native-speaker acclamation
• by interacting with it
• by correspondence to the corpus (51% coverage, 16% accuracy)
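The slide does not define coverage and accuracy. One plausible reading, sketched below, takes coverage as the fraction of actual back-channels that have a prediction nearby, and accuracy as the fraction of predictions that have an actual back-channel nearby. Those definitions, the 500 ms tolerance, and the function name are all assumptions for illustration, not details from the talk.

```python
from typing import List, Tuple


def evaluate(predicted_ms: List[int],
             actual_ms: List[int],
             tolerance_ms: int = 500) -> Tuple[float, float]:
    """Return (coverage, accuracy) under the assumed definitions:
    coverage = share of actual back-channels near some prediction,
    accuracy = share of predictions near some actual back-channel."""
    covered = sum(
        any(abs(a - p) <= tolerance_ms for p in predicted_ms)
        for a in actual_ms
    )
    correct = sum(
        any(abs(p - a) <= tolerance_ms for a in actual_ms)
        for p in predicted_ms
    )
    coverage = covered / len(actual_ms) if actual_ms else 0.0
    accuracy = correct / len(predicted_ms) if predicted_ms else 0.0
    return coverage, accuracy


# Toy data, invented for this example: 1 of 2 actual tokens is covered,
# and 1 of 4 predictions is correct.
print(evaluate([1000, 5000, 9000, 12000], [1200, 20000]))  # (0.5, 0.25)
```

Under this reading, the two figures correspond to recall and precision, so a rule that predicts often can score well on coverage while scoring poorly on accuracy.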
Summary
An integrated answer (qualitative + quantitative) is
• achievable
• costly (~$90,000)
What Next?
• need more usable tools
• need more feature-rich tools
The Prosody of Turn-Taking: acoustic measurement and hypothesis testing, classic linguistic methods, perception of synthesized stimuli, machine learning, conversation analysis, dialog system user studies
An Integrated Method
Eight steps to discovery of a prosodic cue:
1. Project aims
2. Problem formulation
3. Corpus preparation
4. Feature discovery
5. Feature combination
6. Hypothesis refinement
7. Tuning
8. Evaluation
Fostering Progress
• let’s build tools!
• let’s look at the same elephants!
Why Engineers Should Care
• Spontaneous speech is different, in ways that affect recognition (Shriberg 2005)
• Dialog systems are pervasive but unnatural and disliked
• Intrinsic scientific interest
• Language teaching applications
3. Corpus Preparation
Corpus size is a Goldilocks question.
5 minutes, 25 tokens: this corpus is too small
• results not general
• can analyze too deeply
50 hours, 20,000 tokens: this corpus is too big
• labeling too expensive
• can’t listen to the data
80 minutes, 400 tokens: this corpus is just right (for us)
• can find a good dialog
• can evaluate properly
Applications
Making Machines more like People
• acknowledgements in tutorial systems
• adapting pace in information-delivery systems
• noticing user reactions in persuasive systems
Making People more like People
• learning to show you’re listening ... actively