  1. Automatic Prosody Labeling Final Presentation Andrew Rosenberg ELEN 6820 - Speech and Audio Processing and Recognition 4/27/05

  2. Overview • Project Goal • ToBI standard for prosodic labeling • Previous Work • Method • Results • Conclusion

  3. Project Goal: • Automatic assignment of tones tier elements – Given the waveform, orthographic and break index tiers, predict a subset/simplification of elements in the tones tier. – Distinct experiments for determining each of pitch accents, phrase tones, and phrase boundary tones

  4. ToBI Annotation • Tones and Break Index (ToBI) labeling scheme consists of a speech waveform and 4 tiers: – Tones • Annotation of pitch accents and phrasal tones – Orthographic • Transcription of text – Break Index • Pauses between words, rated on a scale from 0-4. – Miscellaneous • Notes about the annotation (e.g., ambiguities, non-speech noise)
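The four tiers above can be sketched as a simple data structure. This is a hypothetical container for illustration only; the class and field names are not part of any ToBI toolkit.

```python
from dataclasses import dataclass, field

# Hypothetical representation of one ToBI-annotated utterance.
# Field names and tuple layouts are invented for this sketch.
@dataclass
class ToBIAnnotation:
    # tones tier: (time, label), e.g. a pitch accent "H*" or phrase tone "L-"
    tones: list = field(default_factory=list)
    # orthographic tier: (start_time, end_time, word)
    orthographic: list = field(default_factory=list)
    # break index tier: (time, index 0-4) marking juncture strength between words
    break_indices: list = field(default_factory=list)
    # miscellaneous tier: (time, note), e.g. ambiguities or non-speech noise
    misc: list = field(default_factory=list)

ann = ToBIAnnotation()
ann.orthographic.append((0.0, 0.41, "Marianna"))
ann.tones.append((0.25, "H*"))
ann.break_indices.append((0.41, 1))
```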

  5. ToBI Transcription Example

  6. ToBI Examples • Pitch Accents (made3.wav): – H*, L*, L+H* • Boundary Tones (money.wav): – L-H%, H-H%, L-L%, H-L%, (H-, L-)

  7. Previous Work • Ross: “Prediction of abstract prosodic labels for speech synthesis” 1996 – BU Radio News Corpus (~48 minutes) • Public news broadcasts spoken by 7 speakers – Uses decision tree output as input to an HMM for pitch accent identification; decision trees for phrase/boundary tone identification – Employs no acoustic features. • Narayanan: “An Automatic Prosody Recognizer using a Coupled Multi-Stream Acoustic Model and a Syntactic-Prosodic Language Model” 2005 – BU Radio News Corpus – Detects stressed syllables (collapsed ToBI labels) and all boundaries. – Uses CHMM on pitch, intensity and duration to track these “asynchronous” acoustic features, and a trigram POS/stress-boundary language model • Wightman: “Automatic Labeling of Prosodic Patterns” 1994 – Single-speaker subset of BNC and ambiguous sentence corpus (read speech). – Like Ross, uses decision tree output as input to an HMM – Uses many acoustic features

  8. Method • JRip – Classification rule learner – Better at working with nominal attributes – Easier-to-read output • Corpus – Boston Directions Corpus • 4 speakers • ~65 minutes of semi-spontaneous speech • Original Plan: – HMMs and SVMs • SVMs took a prohibitive amount of time to learn and performed worse. • HMM implementation problems, and not enough time to implement my own
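JRip (Weka's implementation of the RIPPER algorithm) learns an ordered list of if-then rules over the word-level features, ending in a default rule. The sketch below imitates the shape of such a rule list in plain Python; the feature names and thresholds are invented for illustration, not taken from the trained model.

```python
# Hand-written rules in the style of JRip output.
# Feature names and thresholds are hypothetical.
def predict_pitch_accent(word):
    # rule 1: high-pitched multi-syllable words get a high accent
    if word["zscore_max_f0"] >= 1.2 and word["num_syllables"] >= 2:
        return "H*"
    # rule 2: strongly falling pitch suggests a low accent
    if word["f0_slope"] <= -0.5:
        return "L*"
    # default rule: no accent
    return "none"

label = predict_pitch_accent(
    {"zscore_max_f0": 1.5, "num_syllables": 3, "f0_slope": 0.1}
)
```

Rule lists like this are easy to inspect, which is the readability advantage over SVM or HMM output noted above.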

  9. Method - Features • Min, max, mean, std.dev. F0 and Intensity • # Syllables, Duration, approx. vowel length, POS • F0 slope (weighted) • zscore of max F0 and intensity • Phrase-length F0, intensity and vowel length features • Phrase position
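The F0 statistics above can be sketched as follows. The function name and input format are hypothetical; a real system would read voiced-frame F0 values from a pitch tracker (e.g. Praat), and the plain least-squares slope here stands in for the weighted F0 slope on the slide.

```python
import statistics

def f0_features(f0):
    """Aggregate one word's voiced-frame F0 samples (Hz) into
    min, max, mean, stdev, and a simple least-squares slope."""
    n = len(f0)
    mean = statistics.fmean(f0)
    sd = statistics.pstdev(f0)
    # least-squares slope of F0 against frame index
    xbar = (n - 1) / 2
    slope = (sum((i - xbar) * (v - mean) for i, v in enumerate(f0))
             / sum((i - xbar) ** 2 for i in range(n)))
    return {"min": min(f0), "max": max(f0),
            "mean": mean, "stdev": sd, "slope": slope}

def zscore(x, phrase_values):
    """Normalize a word-level value (e.g. max F0) against the phrase."""
    return (x - statistics.fmean(phrase_values)) / statistics.pstdev(phrase_values)

feats = f0_features([180.0, 190.0, 210.0, 205.0, 195.0])
```

The same aggregation pattern applies to the intensity features, and `zscore` illustrates the phrase-level normalization used for max F0 and intensity.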

  10. Results - Tasks • Pitch Accent – Identification – Detection • Phrase Tone identification • Boundary Tone identification • Phrase/Boundary Tone – Identification – Detection

  11. Results - Pitch Accent Identification • Accuracy:

              Best    No Breaks   Base    Ross*
        Acc.  79.2%   78.0%       58.8%   80.2%

  • Relevant Features – # syllables, duration (previous 2), vowel length (prev, next 2), POS, max & stdev F0, slope F0, max & stdev intensity, zscore of F0, phrase-level zscore of F0 and intensity
  *Ross identifies a different subset of ToBI pitch accents

  12. Results - Pitch Accent Detection • Accuracy:

              Narayanan    Best       No Breaks   Ross         Wightman
        Acc.  85.7%        83.9%      82.5%       -            -
        T/F   83.2/12.4%   80.1/14%   -           79.5/13.2%   83/14%

  Baseline: 58.9%. On BNC, human agreement of 91%; in general, 86-88%.
  Identical relevant features as the identification task

  13. Results - Phrase Tone • Accuracy:

              Best    Base    No Break   Base
        Acc.  72.4%   57.9%   86.7%      77.4%

  • Relevant Features – Duration of next word; max, min, mean F0; linear slope F0; zscore of intensity; phrase zscores of F0 and intensity

  14. Results - Boundary Tone Identification • Accuracy:

              Best    Base    No Break   Base
        Acc.  73.2%   65.1%   91.3%      84.5%

  • Relevant Features – Quadratically weighted F0 slope

  15. Results - Phrase/Boundary Tone Identification • Accuracy:

              Best    Base    Ross    Base
        Acc.  54.7%   33.8%   66.9%   56.3%

  • Relevant Features – Duration of next two words; POS (current and 2 next); max, mean and slope (all weightings) of F0; mean intensity; phrase zscores of F0 and intensity; zscore of the difference between the max intensity of the current word and that of the phrase

  16. Results – Phrase/Boundary Tone Detection • Accuracy:

              Best        Narayanan    Wightman
        T/F   82.5/3.9%   80.9/16.0%   77/3%

  • Human agreement (in general): 95% • Best agreement: 93.0% over a 77% baseline • Relevant Features – Vowel length (current and next word) – POS of the next word

  17. Conclusion • Relatively low-tech acoustic features and ML algorithms can perform competitively with more complicated NLP approaches • Break index information was not as helpful as initially suspected. • Potential Improvements: – Sequential modeling (HMM) – Different features • More sophisticated pitch contour features • Content-based features (similar to Ross)
