Speaker State
Elizabeth Shriberg
Speech Technology and Research Lab, SRI International, Menlo Park, CA
May 7-8, 2015, NSF Speech Science Workshop
Overview
• Umbrella term covering variations within an individual
  – Emotional
  – Cognitive (uncertainty, engagement)
  – Health (stress, fatigue, Parkinson’s…)
  – Mental health (depression, PTSD, MCI, mTBI)
  – Social, pragmatic (engagement, entrainment)
• Synergy with some of the other talks here: Anton, Tom, Julia, Florian…
• Standard approach
  – Annotate data (“gold standard”)
  – Extract features from speech (words, acoustic, prosodic, discourse)
  – Machine learning to predict annotations
  – Range of metrics for evaluation
• Funding: some govt, some commercial; but limited
• Interest from industry
  – e.g. call centers, but largely ASR-based, and data is often proprietary
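The "standard approach" steps above can be sketched end to end on toy data. This is only an illustration under loud assumptions: the feature values and the labels "aroused" / "neutral" are invented, the two features merely stand in for something like normalized pitch and speaking rate, and a nearest-centroid classifier is swapped in for whatever learner a real study would use. Accuracy and unweighted average recall (the mean of per-class recalls) are shown as two examples from the "range of metrics".

```python
from collections import defaultdict

# Hypothetical annotated data: (feature vector, gold-standard label).
# All values are invented for illustration only.
train = [
    ((0.9, 0.8), "aroused"),
    ((1.0, 0.7), "aroused"),
    ((0.2, 0.3), "neutral"),
    ((0.1, 0.2), "neutral"),
    ((0.3, 0.1), "neutral"),
]
test = [
    ((0.8, 0.9), "aroused"),
    ((0.2, 0.2), "neutral"),
    ((0.6, 0.6), "neutral"),
]


def fit_centroids(samples):
    """Return the mean feature vector for each annotated label."""
    sums = {}
    counts = defaultdict(int)
    for feats, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(feats)
        for i, value in enumerate(feats):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: tuple(s / counts[label] for s in total)
        for label, total in sums.items()
    }


def predict(centroids, feats):
    """Predict the label whose centroid is closest (squared Euclidean)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(feats, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(gold, pred):
    """Fraction of items where the prediction matches the annotation."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def unweighted_average_recall(gold, pred):
    """Mean of per-class recalls, so rare classes count as much as common ones."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for g, p in zip(gold, pred):
        totals[g] += 1
        hits[g] += int(g == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


centroids = fit_centroids(train)
gold = [label for _, label in test]
pred = [predict(centroids, feats) for feats, _ in test]
acc = accuracy(gold, pred)
uar = unweighted_average_recall(gold, pred)
print(f"accuracy={acc:.3f}  UAR={uar:.3f}")
```

The two metrics diverge here because the classes are unbalanced in the test set, which is one reason the choice of metric matters when comparing results across studies.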
Impact for Speech Technology
1. Detection of state from speech
  – For adaptation / action of system / filtering
  – For monitoring / filtering
  – Massively applicable, including for passive speech, especially with increases in mobile phone use and apps
  – Growing interest in industry in emotion, but speech content analysis is generally behind that of text and video
2. Improvement of speech recognition (via modeling of context for better train/test data matching)
Challenges
• Major effects of speaker, context, and semantics, but almost no understanding of those effects
• Hundreds of papers/year, but we start over with each data set
• Small data sets
• Annotation issues: validity, reliability, unit of analysis
• Common evaluations have been a great service to the community, but the focus has been on large feature sets + deep learning; we’re adding layers, not understanding
• Feature sets biased toward those available from ASR
• Metrics and evaluation
• Sensitive data sets can’t be shared
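Annotation reliability is often summarized with a chance-corrected agreement statistic. The sketch below computes Cohen's kappa for two annotators; it is offered only as one common way to quantify reliability, not as a measure the slides prescribe. The labels "pos" / "neg" and both annotation lists are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(freq_a) | set(freq_b)
    )
    if expected == 1.0:
        # Both annotators used a single identical label throughout;
        # kappa is undefined, so this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of eight segments by two annotators.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa={kappa:.3f}")
```

In this toy case raw agreement is 75%, but kappa is noticeably lower because much of that agreement would be expected by chance given how often each annotator uses "neg".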
Future Directions
• Core pursuits
  – Understand how to decouple effects of speaker from state, and context
  – Go smaller, not bigger. What’s the minimum feature set and what can we learn from it?
  – Value generalization across data sets: learn what features and approaches transfer to new data
  – Explore robustness in real-world data: studies often assume better audio than we can get in real applications
  – Understand role of lexical, visual, physiological information: increasingly available, and we need to understand where speech offers added value
• Needs
  – Invest in longitudinal data with real-world spontaneous speech
  – Add spontaneous collection to studies in medical community
  – Community focus on annotations and meaningful metrics: working group support if no government evaluations
  – User studies that involve real-world end applications
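One simple baseline for separating speaker effects from state effects is to z-normalize each feature within a speaker, so that values express deviation from that speaker's own habitual range. The sketch below is an assumption-laden illustration, not a method proposed in the talk: the speaker IDs and pitch-like values are invented, and it presumes enough data per speaker to estimate a mean and standard deviation (which is part of why longitudinal data helps).

```python
from collections import defaultdict
from statistics import mean, pstdev


def speaker_z_normalize(rows):
    """Z-score each value against its own speaker's mean and std deviation.

    rows: list of (speaker_id, value) pairs.
    Returns the normalized values in the same order as the input.
    """
    by_speaker = defaultdict(list)
    for speaker, value in rows:
        by_speaker[speaker].append(value)
    stats = {
        speaker: (mean(values), pstdev(values))
        for speaker, values in by_speaker.items()
    }
    normalized = []
    for speaker, value in rows:
        mu, sigma = stats[speaker]
        # A speaker with no variation carries no within-speaker signal.
        normalized.append((value - mu) / sigma if sigma > 0 else 0.0)
    return normalized


# Hypothetical mean-pitch values (Hz) for two speakers with different ranges.
rows = [("A", 100.0), ("A", 120.0), ("B", 200.0), ("B", 240.0)]
normalized = speaker_z_normalize(rows)
print(normalized)
```

After normalization the two speakers' values are directly comparable even though their raw ranges do not overlap. Note that this also removes any state that is constant for a speaker across the whole recording, which is one reason the decoupling problem remains open.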