  1. Meetings Research at ICSI
     Barbara Peskin, reporting on work of: Don Baron, Sonali Bhagat, Hannah Carvey, Rajdip Dhillon, Dan Ellis, David Gelbart, Adam Janin, Ashley Krupski, Nelson Morgan, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, Chuck Wooters
     International Computer Science Institute, Berkeley, CA
     M4 Meeting, Sheffield, 29-30 January 2003

  2. Overview
     • Automatic Speech Recognition (ASR) Research
       – Baseline performance
       – Language modeling exploration
       – Far-field acoustics
       – Speech activity detection
     • Sentence Segmentation & Disfluency Detection
     • Dialogue Acts: Annotation & Automatic Modeling

  3. ASR Research: Baselines
     Meeting data formed a track of NIST’s RT-02 evaluation
     • Eval data (and limited dev) was available from 4 sites
       – test on 10-minute excerpts from 2 meetings from each site
       – only 5 transcribed meetings for dev (not all sites represented)
       – evaluation included both close-talking and table-top recordings
       – close-talking test used hand-segmented turns; far-field used automatic chopping
     • We used a Switchboard-trained recognizer from SRI
       – no Meeting data was used to train the models!
       – waveforms were downsampled to 8 kHz (for telephone bandwidth)
       – recognizer used gender-dependent models, feature normalization, VTLN, speaker adaptation (MLLR) and speaker-adaptive training (SAT), bigram lattice generation with trigram expansion, then interpolated class 4-gram LM N-best rescoring, … (fairly standard Hub 5 evaluation system)

  4. Baselines (cont’d)
     Word error rates (WER) on the Meeting track of RT-02:

     Data source ⇒       ICSI   CMU    LDC    NIST   all    SWB
     close-talking mic    25.9   47.9   36.8   35.2   36.0   30.2
     table-top mic *      53.6   64.5   69.7   61.6   61.6    ---

     • Performance on close-talking mics quite comparable to SWB
     • Table just shows bottom-line numbers, but incremental improvements at each recognition stage parallel those on SWB
     • Overall, far-field WERs about twice as high as close-talking
     • CMU data worst for close-talking (they used lapel mics, not headsets), but the difference disappears on far-field
     * note: table-top mic system was somewhat simplified (bigram LM, etc.) – insufficient gains from full system to justify added complexity
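     For reference, the WER figures above are edit-distance based: substitutions + deletions + insertions between the reference and hypothesis word sequences, divided by the number of reference words. A minimal sketch follows; the example transcripts are made up for illustration, not taken from the evaluation data.

     ```python
     # Minimal WER sketch: Levenshtein alignment of reference vs. hypothesis words.
     def wer(reference, hypothesis):
         ref, hyp = reference.split(), hypothesis.split()
         # d[i][j] = minimum edits to turn the first i reference words
         # into the first j hypothesis words.
         d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
         for i in range(len(ref) + 1):
             d[i][0] = i                      # deletions
         for j in range(len(hyp) + 1):
             d[0][j] = j                      # insertions
         for i in range(1, len(ref) + 1):
             for j in range(1, len(hyp) + 1):
                 sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                 d[i][j] = min(d[i - 1][j] + 1,        # deletion
                               d[i][j - 1] + 1,        # insertion
                               d[i - 1][j - 1] + sub)  # match / substitution
         return 100.0 * d[len(ref)][len(hyp)] / len(ref)

     # Illustrative transcripts only (one deletion + one insertion => 33.3%).
     print(wer("so the meeting starts at ten", "so meeting starts at ten ten"))
     ```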

  5. A Language Modeling Experiment
     Problem: RT-02 recognizer does not speak the Meetings language (many OOV words, unknown n-grams, etc.)
     Experiment:
       – train Meeting LM on 270k words of data from 28 ICSI meetings (excluding RT-02’s dev & eval meetings)
       – include all words from these meetings in recognizer’s vocabulary (~1200 new words)
       – interpolate Meeting LM with SWB-trained LM
       – choose interpolation weights by minimizing perplexity on 2 ICSI RT-02 dev meetings
       – test on 2 ICSI eval meetings using simplified recognition protocol

              SWB LM    Interpolated LM
     WER      30.6%     28.4%
     OOV       1.5%      0.5%
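     To make the weight-tuning step concrete, here is a toy sketch of choosing the interpolation weight by minimizing perplexity on held-out text. The unigram "models" and dev text below are placeholders, not the actual SWB and Meeting n-gram LMs; only the idea (pick the lambda that minimizes dev-set perplexity of the mixture) carries over.

     ```python
     import math

     # Placeholder unigram LMs standing in for the SWB and Meeting LMs.
     swb_lm     = {"the": 0.05, "meeting": 0.001, "recognizer": 0.0005, "<unk>": 1e-6}
     meeting_lm = {"the": 0.04, "meeting": 0.01,  "recognizer": 0.004,  "<unk>": 1e-6}

     def prob(lm, word):
         return lm.get(word, lm["<unk>"])

     def perplexity(lam, text):
         """Perplexity of the mixture p = lam * p_meeting + (1 - lam) * p_swb."""
         log_sum = 0.0
         for w in text:
             p = lam * prob(meeting_lm, w) + (1.0 - lam) * prob(swb_lm, w)
             log_sum += math.log(p)
         return math.exp(-log_sum / len(text))

     # Held-out "dev" text (illustrative only).
     dev_text = "the meeting recognizer the the meeting".split()

     # Grid search over interpolation weights on the dev text.
     best_lam = min((l / 100.0 for l in range(101)),
                    key=lambda l: perplexity(l, dev_text))
     print("best lambda:", best_lam, "perplexity:", perplexity(best_lam, dev_text))
     ```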

  6. Far-Field Acoustics
     • Far-field performance was improved by applying Wiener filtering techniques developed for the Aurora program
       – On RT-02 dev set, WER dropped 64.1% → 61.7%
     • Systematically addressed far-field acoustics using the Digits task
       – Model as convolutive distortion (reverb) followed by additive distortion (background noise)
       – For additive noise: used Wiener filtering approach, as above
       – For reverb: used long-term log spectral subtraction (similar to CMS but longer window)
       – See [D. Gelbart & N. Morgan, ICSLP-2002] for details

     WER on Mtg Digits   baseline   noise reduction   log spectral subtraction   both
     near                  4.1%       3.6%              3.1%                     2.7%
     far                  26.3%      24.8%              8.2%                     7.2%

     • Also explored PZM (high-quality) vs. “PDA” (cheap mic) performance
       – “PDA” performance much worse, but above techniques greatly reduced the difference
       – Error rates roughly comparable after processing as above
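     A rough sketch of the long-term log spectral subtraction idea mentioned above, assuming a simple STFT front end: subtract a per-frequency running mean of the log magnitude spectrum estimated over a window much longer than typical CMS. Frame length, hop, and window size are illustrative choices, not the Gelbart & Morgan settings, and the Wiener-filter noise-reduction stage is omitted.

     ```python
     import numpy as np

     def log_spectral_subtraction(signal, frame_len=512, hop=256, win_frames=125):
         """Illustrative long-term log spectral subtraction on a 1-D waveform."""
         # Frame the signal and take the magnitude STFT.
         n_frames = 1 + (len(signal) - frame_len) // hop
         frames = np.stack([signal[i * hop : i * hop + frame_len] * np.hanning(frame_len)
                            for i in range(n_frames)])
         spec = np.abs(np.fft.rfft(frames, axis=1)) + 1e-10
         log_spec = np.log(spec)

         # Running mean of the log spectrum over a long window, per frequency bin.
         kernel = np.ones(win_frames) / win_frames
         mean = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"),
                                    0, log_spec)

         # Subtracting the long-term mean removes slowly varying convolutive effects
         # (channel / reverberation) in the log-spectral domain.
         return log_spec - mean

     # Example: one second of synthetic noise at 16 kHz (illustrative only).
     if __name__ == "__main__":
         x = np.random.randn(16000)
         print(log_spectral_subtraction(x).shape)
     ```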

  7. Speech Activity Detection
     Detecting regions of speech activity is a challenge for Meeting data, even on close-talking channels (due to cross-talk, etc.)
     • Standard echo cancellation techniques ineffective (due to head movement)
     • We devised an algorithm which performs SAD on the close-talking channel, using information from all recorded channels
       – First, detect speech region candidates on each channel separately, using a standard two-class HMM with minimum duration constraints
       – Then compute cross-correlations between channels and threshold them to suppress detections due to cross-talk
       – Key feature is normalization of energy features on each channel not only for the channel minimum but also by the average across all channels
     • Greatly reduces error rates
       – Frame error rate for speech/nonspeech detection: 18.6% → 13.7% → 12.0%
       – WER for SWB-trained recognizer: within 10% (rel.) of the hand-segmented result (cf. unsegmented waveforms: 75% higher, largely due to cross-talk insertions)
     Note: details can be found in [T. Pfau, D. Ellis, and A. Stolcke, ASRU-2001].
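     The sketch below illustrates only the cross-channel energy normalization idea; as a stand-in for the cross-correlation thresholding step it uses a simple "most energetic channel" check, so it should be read as a hedged illustration of the normalization, not the ICSI algorithm. All function names and the margin parameter are hypothetical.

     ```python
     import numpy as np

     def normalized_energies(frame_energies):
         """frame_energies: array of shape (n_channels, n_frames), linear energy."""
         log_e = np.log(frame_energies + 1e-10)
         # Normalize each channel by its own minimum (channel noise floor) ...
         log_e = log_e - log_e.min(axis=1, keepdims=True)
         # ... and by the average energy across all channels at each frame, which
         # makes cross-talk (energy present on every mic) look less speech-like.
         return log_e - log_e.mean(axis=0, keepdims=True)

     def suppress_crosstalk(candidates, norm_e, margin=0.0):
         """candidates: boolean array (n_channels, n_frames) of per-channel speech
         candidates. Keep a candidate frame on channel c only if that channel is
         (within the stated margin of) the most energetic one; otherwise treat it
         as cross-talk. A simple stand-in for cross-correlation thresholding."""
         kept = candidates.copy()
         peak = norm_e.max(axis=0)
         for c in range(candidates.shape[0]):
             kept[c] &= norm_e[c] >= peak - margin
         return kept
     ```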

  8. “Hidden Event” Modeling
     • Detect events implicit in the speech stream (e.g. sentence and topic breaks, disfluency locations, …) using prosodic & lexical cues
     • Developed by Shriberg, Stolcke, et al. at SRI (for topic and sentence segmentation of Broadcast News and Switchboard)
     • 3 main ingredients
       – Hidden event language model built from n-grams over words and event labels
       – Prosodic model built from features (phone & pause durations, pitch, energy) extracted within a window around each interword boundary; classifies via decision trees
       – Model combination using an HMM defined from the hidden event LM and incorporating observation likelihoods for states from prosodic decision tree posteriors
     • Meetings work used parallel feature databases (true words, ASR) to detect sentence boundaries and disfluencies
       – for true words: LM better than prosody
       – for recognized words: prosody better than LM
       – combining models always helps, even when one is much better
     Note: details can be found in [D. Baron, E. Shriberg, and A. Stolcke, ICSLP-2002].
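     To illustrate the model-combination ingredient, here is a minimal two-state Viterbi sketch over interword boundaries. The event set, priors, and input probabilities are placeholders; in the system described above the transition scores come from the hidden-event n-gram LM and the emission likelihoods from decision-tree posteriors divided by the class priors (that posterior/prior scaling is the only part shown exactly, the rest is an assumed simplification).

     ```python
     import math

     EVENTS = ["NO_BOUNDARY", "BOUNDARY"]
     PRIORS = {"NO_BOUNDARY": 0.85, "BOUNDARY": 0.15}   # made-up class priors

     def viterbi(lm_trans, prosody_post):
         """lm_trans[t][prev][cur]: LM-style transition prob at boundary t (t >= 1).
         prosody_post[t][e]: decision-tree posterior P(event e | prosodic features)."""
         n = len(prosody_post)

         # Emission likelihood up to a constant: posterior divided by prior.
         def emit(t, e):
             return prosody_post[t][e] / PRIORS[e]

         delta = [{e: math.log(PRIORS[e]) + math.log(emit(0, e)) for e in EVENTS}]
         back = [{}]
         for t in range(1, n):
             delta.append({})
             back.append({})
             for e in EVENTS:
                 best_prev = max(EVENTS,
                                 key=lambda p: delta[t - 1][p] + math.log(lm_trans[t][p][e]))
                 delta[t][e] = (delta[t - 1][best_prev]
                                + math.log(lm_trans[t][best_prev][e])
                                + math.log(emit(t, e)))
                 back[t][e] = best_prev

         # Trace back the best event sequence over all boundaries.
         path = [max(EVENTS, key=lambda e: delta[-1][e])]
         for t in range(n - 1, 0, -1):
             path.append(back[t][path[-1]])
         return list(reversed(path))
     ```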

  9. Dialogue Acts
     To understand what’s going on in a meeting, we need more than the words ⇒ DAs tell the role of an utterance in the discourse; use them to spot topic shifts, floor grabbing, agreement / disagreement, etc.
       e.g.  Yeah. (as backchannel)
             Yeah. (as response)
             Yeah? (as question)
             Yeah! (as exclamation)
     • Hand labeling now, with the goal of automatic labeling later
     • Using a set of 58 tags refined for this work, based on SWB-DAMSL conventions
     • Using cues from both prosody and words
     • Currently more than 20 meetings (over 20 hours of speech) hand labeled
     • Started work on automatic modeling (in collaboration with SRI)
     A draft of the DA spec is available at our Meetings website: http://www.icsi.berkeley.edu/Speech/mr/

  10. Summary
      • Meetings support an amazing range of speech & language research (nearly “ASR complete”)
      • We are just starting to tap some of the possibilities, including
        – Automatically transcribing natural, spontaneous multi-party speech
        – Enriching language models to handle new / specialized topics
        – Detecting speech activity, segmenting the speech stream, labeling talkers
        – Dealing with far-field acoustics
        – Moving beyond the words to model
          • hidden events such as sentence breaks and disfluencies
          • dialogue acts and discourse structure
      • We look forward to continued collaboration with the M4 community to tackle the challenges posed by Meeting data
