Thai Speech Processing Activities at NECTEC - Chai Wutiwiwatchai

  1. Thai Speech Processing Activities at NECTEC
     Chai Wutiwiwatchai, Ph.D.
     National Electronics and Computer Technology Center (NECTEC)
     NSTDA-TITECH Workshop - November 2006

  2. Outline
     • Brief history
     • Current activities
       - Speech corpora
       - Automatic speech recognition (ASR)
       - Text-to-speech synthesis (TTS)
       - Other related topics
       - Demonstration & problems
     • Future plan

  3. Brief History
     [Timeline figure with marks at 1997, 2000 and 2005; topics shown: TTS, ASR, SID, SST]
     • SID : Speaker identification
     • TTS : Text-to-speech synthesis
     • ASR : Automatic speech recognition
     • SST : Speech-to-speech translation

  4. ASR Project
     • ASR resources
     • “iSpeech” toolkit
     • Robust ASR
     • Thai LVCSR

  5. ASR Resources
     • NECTEC-ATR (2002), in collaboration with ATR, Japan
       - Purpose: various Thai speech for ASR research
       - 5000 freq. words
       - Phone-balanced utts.
       - Hotel reservation utts.
       - Read, 54 hrs., 48 spks.
     • LOTUS (2005), in collaboration with PSU & MU, Thailand
       - Purpose: 5000-word dictation system
       - Phone-balanced utts.
       - 5000-covered utts.
       - Read, 70 hrs., 48 spks.
       - http://www.nectec.or.th/rdi/lotus
     • VoiceCom (2005)
       - Purpose: isolated commands
       - Common isolated commands
       - 24 spks.

  6. “iSpeech” Toolkit
     • Version 1.0 (2005)
       - Isolated word recognition
       - Monophone model
     • Version 1.5 (2006)
       - Model selection for robust ASR
       - Automatic endpoint detection
     • Version 2.0 (2006)
       - Regular grammar model
       - Cross-word triphone model
     • Website http://www.nectec.or.th/rdi/ispeech

  7. Robust ASR (1)
     • General approaches for robust ASR
       - Robust parameterization
       - Model selection
       - Robust topology
       - Combination

  8. Robust ASR (2)
     • Wavelet-based denoising
       [Block diagram: speech is split into high-band (H) and low-band (L) coefficients, each band passes through wavelet thresholding, giving denoised speech]
       [Bar chart: Accuracy % (axis 20 to 60), Baseline vs. Denoising, for the conditions Clean, Waterfall, Fan, Computer, Shaving]
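
The band-split-and-threshold idea on this slide can be illustrated with a minimal sketch. It assumes a single-level Haar transform and soft thresholding; the slide does not state the wavelet family, the decomposition depth, or the threshold rule, so all three are assumptions, and the function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(signal):
    """Single-level Haar transform: returns (low-band, high-band) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    low = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return low, high

def haar_inverse(low, high):
    """Rebuild the signal from its low-band and high-band coefficients."""
    out = []
    for l, h in zip(low, high):
        out.append((l + h) / SQRT2)
        out.append((l - h) / SQRT2)
    return out

def soft_threshold(coeffs, t):
    """Shrink every coefficient toward zero by t (assumed rule, not from the slide)."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def denoise(signal, low_t, high_t):
    """Threshold each band separately, then reconstruct."""
    low, high = haar_forward(signal)
    return haar_inverse(soft_threshold(low, low_t), soft_threshold(high, high_t))
```

With both thresholds at zero the input is reconstructed unchanged; a large high-band threshold removes the sample-to-sample detail and leaves pairwise averages.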

  9. Robust ASR (3)
     • Model selection
       [Block diagram with boxes: Speech, Noise classification, Noise-specific acoustic models, Speech recognition, Result]
       - Feature: MFCC, LSF, NLS (+ PCA)
       - Classifier: SVM, ANN, HMM
       [Bar chart: Accuracy % (axis 40 to 80) for No robustness, Multiconditioned acoustic model, PCA-NLS & ANN, 100% Noise classification]
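
Model selection as outlined here means classifying the noise first and then recognizing with the acoustic model trained for that noise. The sketch below shows only that control flow. It swaps in a nearest-centroid classifier for the SVM/ANN/HMM classifiers named on the slide, and the 2-D features, centroids, and model identifiers are hypothetical placeholders.

```python
import math

def classify_noise(feature, centroids):
    """Return the noise label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: math.dist(feature, centroids[label]))

def select_acoustic_model(feature, centroids, models):
    """Pick the noise-specific acoustic model for the classified noise type."""
    label = classify_noise(feature, centroids)
    return label, models[label]

# Hypothetical noise features and model identifiers, for illustration only.
CENTROIDS = {"clean": (0.0, 0.0), "fan": (1.0, 0.2), "waterfall": (0.3, 1.0)}
MODELS = {"clean": "am_clean", "fan": "am_fan", "waterfall": "am_waterfall"}

if __name__ == "__main__":
    print(select_acoustic_model((0.9, 0.1), CENTROIDS, MODELS))
```

In this arrangement a misclassified noise type sends the utterance to the wrong acoustic model, which is presumably why the chart includes a "100% Noise classification" bar as an upper bound.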

  10. Robust ASR (4)
      • Tree-based model selection
        - Automatic noise clustering/merging
        - MLLR transformation matrix / node
        - GMM-based similarity measure
        [Tree figure: root "All noises, All SNRs"; leaves "Noise1 SNR 1", "Noise1 SNR 2", ..., "NoiseN SNR N"]

  11. Thai LVCSR (1)
      • Phoneme inventory optimization
        - Basic phonemes
          Consonant: p t k c ph th kh ch b d m n ng w j r l z h
          Vowel: i ii e ee x xx v vv q qq a aa u uu o oo @ @@
        - Syllable-structured phonemes
          Initial consonant: p t k c ph th kh ch b d m n ng w j r l z h pr tr kr phr thr khr kl phl khl kw khw
          Vowel: i ii e ee x xx v vv q qq a aa u uu o oo @ @@ ia iia va vva ua uua
          Final consonant: P T K M N NG W J

  12. Thai LVCSR (2)
      • 5K-word dictation system
        - Acoustic modeling: 40 hrs. 48 spks.
        - Language modeling: 0.07 Mwords
        - Perplexity: 140
        - Evaluation: 460 utts. 10 spks.
        [Bar chart: Word accuracy % (axis 40 to 80) for No LM, LM by Original Transcription, LM by Realigned Transcription]
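
For readers unfamiliar with the perplexity figure quoted on this slide, the sketch below shows the standard definition: the exponential of the negative mean log-probability the language model assigns to each test word. The function name and the input format (a list of per-word probabilities) are assumptions for illustration; the slide does not say how the figure of 140 was computed.

```python
import math

def perplexity(word_probs):
    """Perplexity from the probabilities a language model assigned to each test word."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    if any(p <= 0.0 for p in word_probs):
        raise ValueError("probabilities must be positive")
    mean_log = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-mean_log)
```

A model that assigned every word a probability of 1/140 would score a perplexity of 140, that is, it would be as uncertain as a uniform choice among 140 words at each position.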

  13. TTS Project
      • “Vaja” TTS engine
      • TTS resources
      • Prosody prediction
      • Text processing
      • Space reduction

  14. “Vaja” TTS Engine
      • Version 2.0 (2000)
        - Demisyllable concatenation
      • Version 3.0 (2003)
        - Corpus-based unit-selection
      • Version 4.0 (2006)
        - Multithread
        - Client/server
      • Version 5.0 (2007)
        - Naturalness improvement
        - Space reduction
      • Website http://www.nectec.or.th/rdi/vaja

  15. TTS Resources
      • ORCHID (1997)
        - Purpose: Thai text corpus for text processing
        - 27,000 sentences
        - Word segmentation
        - POS-tagged
      • TSynC-1 (2003)
        - Purpose: Thai speech corpus for unit-selection speech synthesis
        - Triphone, tritone covered
        - 13 hrs., a fluent female speaker
        - Prosody tagged

  16. Prosody Prediction (1)
      • Sentence/Phrase breaking
      • Syllable-duration modeling

  17. Prosody Prediction (2)
      • Sentence/Phrase breaking
        [Flow diagram: Preprocessed text → Feature extraction → Machine learning → Break/Non-break]
        - Feature extraction: POS of current and neighboring words; no. of syllables/words from previous break
        - Machine learning: C4.5, RIPPER, CART, Neural network, POS n-gram
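
The features listed for phrase breaking can be sketched as a small extraction function. This is an illustration under assumptions: the tuple layout `(token, pos, n_syllables)`, the boundary symbols, the feature names, and the English placeholder tokens and tags are all hypothetical, not the ORCHID tag set or NECTEC's actual feature encoding.

```python
def break_features(words, index, phrase_start=0):
    """Features for the juncture after words[index].

    words: list of (token, pos, n_syllables) tuples.
    phrase_start: index of the first word after the previous break.
    """
    pos = [w[1] for w in words]
    span = words[phrase_start:index + 1]
    return {
        "pos_prev": pos[index - 1] if index > 0 else "<s>",
        "pos_cur": pos[index],
        "pos_next": pos[index + 1] if index + 1 < len(words) else "</s>",
        "words_since_break": len(span),
        "syllables_since_break": sum(w[2] for w in span),
    }

# Placeholder sentence, for illustration only.
SENTENCE = [("I", "PRON", 1), ("want", "VERB", 1),
            ("coffee", "NOUN", 2), ("today", "ADV", 2)]

if __name__ == "__main__":
    print(break_features(SENTENCE, 2))
```

Each feature dictionary, paired with a break/non-break label, would form one training example for any of the learners listed on the slide.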

  18. Prosody Prediction (3)
      • Syllable-duration modeling
        [Flow diagram: Duration-tagged speech samples → Regression analysis → Regression model]
        - Factors: Phoneme, Tone, Position
        - Regression model gives a fair precision of duration prediction (0.73 correlation to references)
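
The slide reports a 0.73 correlation between predicted and reference durations without naming the measure. Assuming it is the Pearson correlation coefficient (an assumption, not confirmed by the slide), such a score could be computed as in this sketch; the function name is hypothetical.

```python
import math

def pearson(predicted, reference):
    """Pearson correlation between predicted and reference durations."""
    n = len(predicted)
    if n != len(reference) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0.0 or var_r == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)
```

A value of 1.0 means the predictions rise and fall exactly with the references; the measure ignores any constant offset or scale error, so it is usually read alongside an absolute error figure.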

  19. Text Processing (1)
      • Word segmentation
      • Part-of-speech tagging
      • Grapheme-to-phoneme (G2P) conversion

  20. Text Processing (2)
      • G2P difficulties
        - Context-dependent segmentation ambiguity (CDSA)
          NOWHERE → |NOW|HERE| or |NOWHERE|
        - Context-independent segmentation ambiguity (CISA)
          TOGETHER → |TOGETHER| or |TO|GET|HER|
        - Homograph ambiguity
          LEAD → /l i d/ or /l e d/

        %Acc        Trigram   Bayesian   Winnow
        CDSA        73.0      93.2       95.7
        CISA        98.3      99.7       99.7
        Homograph   52.5      94.3       96.5
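
The segmentation ambiguities on this slide arise because unspaced text can be split into dictionary words in more than one way. The sketch below enumerates every split for the slide's two English examples. The lexicon contains only the words shown on the slide, and the function name is hypothetical; it illustrates the ambiguity itself, not the trigram, Bayesian, or Winnow disambiguators being compared.

```python
def segmentations(text, lexicon):
    """Return every way to split text into a sequence of lexicon words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results

# Only the words that appear in the slide's examples.
LEXICON = {"TO", "GET", "HER", "TOGETHER", "NOW", "HERE", "NOWHERE"}

if __name__ == "__main__":
    print(segmentations("TOGETHER", LEXICON))  # [['TO', 'GET', 'HER'], ['TOGETHER']]
    print(segmentations("NOWHERE", LEXICON))   # [['NOW', 'HERE'], ['NOWHERE']]
```

The plain recursion is exponential in the worst case, which is acceptable for short strings like these; a realistic segmenter would memoize on the remaining suffix and score the candidates rather than list them.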

  21. Space Reduction
      [Chart: % Space Reduction (left axis, 0 to 100) and Mean Opinion Score (right axis, 0 to 5) against Maximum frequency of diphone (1, 10, 20, 50, 100, 200, 500, All)]
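
The chart's x-axis suggests that the unit inventory is pruned by capping how many instances of each diphone are kept. A minimal sketch of that idea follows. It assumes that the first instances encountered are the ones kept and that space is proportional to the number of unit instances; the slide states neither, and the function names and sample labels are hypothetical.

```python
from collections import Counter

def prune_inventory(units, max_freq):
    """Keep at most max_freq instances of each diphone label (None keeps all)."""
    kept, seen = [], Counter()
    for unit in units:
        if max_freq is None or seen[unit] < max_freq:
            kept.append(unit)
            seen[unit] += 1
    return kept

def space_reduction_percent(units, max_freq):
    """Percentage of unit instances removed by the cap."""
    if not units:
        raise ValueError("inventory is empty")
    kept = prune_inventory(units, max_freq)
    return 100.0 * (len(units) - len(kept)) / len(units)

# Hypothetical inventory: one label per recorded diphone instance.
INVENTORY = ["a-b"] * 6 + ["b-a"] * 3 + ["k-a"]

if __name__ == "__main__":
    for cap in (1, 2, 5, None):
        print(cap, space_reduction_percent(INVENTORY, cap))
```

Lower caps save more space but leave the unit selector fewer candidates per diphone, which is the trade-off the chart plots against Mean Opinion Score.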

  22. SST Project (1)
      • 2006 SST prototype

  23. SST Project (2)
      • 2006 SST prototype
        - English-to-Thai
        - Travel domain
        - Push-to-talk
        - ASR : CMU Sphinx III
        - MT : Nectec Parsit, a rule-based MT
        - TTS : Nectec Vaja

  24. Conclusion
      Thai Speech Technology at NECTEC
      • ASR
        - Toolkit: Isolated word, Regular grammar
        - Robust: Robust feature, Model selection
        - LVCSR: Phone inventory, Transcript system
        - Corpora: Nectec-ATR, LOTUS
      • TTS
        - Engine: Unit selection, Space reduction
        - Prosody: Phrase break, Duration
        - Corpora: TSynC-1
        - Text process: Word segment, G2P
      • SST

  25. Future Plan
      ASR
      • “iSpeech-N” : N-gram based ASR
      • Telephone conversational corpus & model
      • Modified tree-based model selection
      TTS
      • Incorporating prosodic models
      • TSynC-2
      • HMM-based TTS

  26. Future Plan
      SST
      • Two-way SST
      • A travel domain parallel corpus
      • Example-based MT & Translation memory
      • Spoken language MT

  27. Tentative Collaborative Projects
      HMM-based TTS
      • An available large speech corpus
      • Producing highly smoothed speech
      • The first system for Thai
      ASR for spontaneous telephone speech
      • Corpus under development
      • Highly spontaneous dialogues
      • Telephone channel & environmental noises

  28. Thank you for your attention
