
Thai Speech Processing Activities at NECTEC
Chai Wutiwiwatchai, Ph.D.
National Electronics and Computer Technology Center (NECTEC)
NSTDA-TITECH Workshop - November 2006

Outline
• Brief history
• Current activities
  - Speech corpora
  - Automatic speech recognition (ASR)
  - Text-to-speech synthesis (TTS)
  - Other related topics
  - Demonstration & problems
• Future plan

Brief History
[Timeline figure, 1997 / 2000 / 2005: project tracks SID, TTS, ASR, SST]
• SID : Speaker identification
• TTS : Text-to-speech synthesis
• ASR : Automatic speech recognition
• SST : Speech-to-speech translation

ASR Project
• ASR resources
• “iSpeech” toolkit
• Robust ASR
• Thai LVCSR

ASR Resources
• NECTEC-ATR (2002)
  - Collab.: ATR, Japan
  - Purpose: Various Thai speech for ASR research
  - Detail: 5000 freq. words; Phone-balanced utts.; Hotel reservation utts.; Read, 54 hrs., 48 spks.
• LOTUS (2005)
  - Collab.: PSU & MU, Thailand
  - Purpose: 5000-word dictation system
  - Detail: Phone-balanced utts.; 5000-covered utts.; Read, 70 hrs., 48 spks.
  - Website: http://www.nectec.or.th/rdi/lotus
• VoiceCom (2005)
  - Collab.: -
  - Purpose: Isolated commands
  - Detail: Common isolated commands; 24 spks.

“iSpeech” Toolkit
• Version 1.0 (2005)
  - Isolated word recognition
  - Monophone model
• Version 1.5 (2006)
  - Model selection for robust ASR
  - Automatic endpoint detection
• Version 2.0 (2006)
  - Regular grammar model
  - Cross-word triphone model
• Website http://www.nectec.or.th/rdi/ispeech

Robust ASR (1)
• General approaches for robust ASR
  - Robust parameterization
  - Model selection
  - Robust topology
  - Combination

Robust ASR (2)
• Wavelet-based denoising
[Block diagram: Speech is split into high-band (H) and low-band (L) wavelet coefficients; each band is thresholded; the output is denoised speech]
[Bar chart: Accuracy % (20-60), Baseline vs. Denoising, for Clean, Waterfall, Fan, Computer, Shaving]
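The two-band thresholding scheme in the diagram can be illustrated with a minimal sketch. The slide does not say which wavelet, thresholding rule, or threshold values were used, so the one-level Haar transform, the soft-thresholding rule, and the function names below are all assumptions chosen for brevity.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # Haar normalization factor


def haar_forward(signal):
    """One-level Haar transform: returns (low_band, high_band) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    low = [(a + b) * _S for a, b in pairs]
    high = [(a - b) * _S for a, b in pairs]
    return low, high


def haar_inverse(low, high):
    """Invert haar_forward, interleaving the reconstructed sample pairs."""
    out = []
    for l, h in zip(low, high):
        out.append((l + h) * _S)
        out.append((l - h) * _S)
    return out


def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t (assumed rule, not from the slide)."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


def denoise(signal, low_t, high_t):
    """Threshold the low and high bands separately, then reconstruct."""
    low, high = haar_forward(signal)
    return haar_inverse(soft_threshold(low, low_t), soft_threshold(high, high_t))
```

With both thresholds at zero the signal is reconstructed unchanged. Raising `high_t` removes small sample-to-sample fluctuations while leaving the local averages intact.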

Robust ASR (3)
• Model selection
  - Feature: MFCC, LSF, NLS (+ PCA)
  - Classifier: SVM, ANN, HMM
[Block diagram: Speech -> Noise classification -> Noise-specific acoustic models -> Speech recognition -> Result]
[Bar chart: Accuracy % (40-80) for No robustness, Multiconditioned acoustic model, PCA-NLS & ANN, 100% noise classification]
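The model-selection flow (classify the noise, then decode with the matching acoustic model) can be sketched as follows. The slide lists SVM, ANN, and HMM classifiers; the nearest-centroid classifier here is a deliberately simpler stand-in, and the feature vectors, noise labels, and model names are hypothetical toy values.

```python
import math

# Hypothetical per-noise feature centroids (toy 2-D vectors, not real MFCC/LSF/NLS data).
NOISE_CENTROIDS = {
    "clean": [0.0, 0.0],
    "fan": [1.0, 0.0],
    "waterfall": [0.0, 1.0],
}

# Hypothetical mapping from noise label to a noise-specific acoustic model name.
ACOUSTIC_MODELS = {
    "clean": "am_clean",
    "fan": "am_fan",
    "waterfall": "am_waterfall",
}


def classify_noise(feature, centroids=NOISE_CENTROIDS):
    """Return the noise label whose centroid is closest (Euclidean) to the feature."""
    return min(centroids, key=lambda label: math.dist(feature, centroids[label]))


def select_acoustic_model(feature):
    """Pick the acoustic model that matches the classified noise type."""
    label = classify_noise(feature)
    return label, ACOUSTIC_MODELS[label]
```

The recognizer would then decode the utterance with the returned model. Recognition accuracy therefore depends on the noise classifier, which is consistent with the slide reporting a separate result for 100% noise classification.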

Robust ASR (4)
• Tree-based model selection
[Tree diagram: root node "All noises, All SNRs"; branches Noise1 ... NoiseN; leaves SNR 1, SNR 2, ... SNR N. Labels: Automatic noise clustering/merging; MLLR transformation matrix / GMM-based similarity measure]

Thai LVCSR (1)
• Phoneme inventory optimization
Basic phonemes
  - Consonant: p t k c ph th kh ch b d m n ng w j r l z h
  - Vowel: i ii e ee x xx v vv q qq a aa u uu o oo @ @@
Syllable-structured phonemes
  - Initial consonant: p t k c ph th kh ch b d m n ng w j r l z h pr tr kr phr thr khr kl phl khl kw khw
  - Vowel: i ii e ee x xx v vv q qq a aa u uu o oo @ @@ ia iia va vva ua uua
  - Final consonant: P T K M N NG W J

Thai LVCSR (2)
• 5K-word dictation system
  - Acoustic modeling: 40 hrs., 48 spks.
  - Language modeling: 0.07 Mwords
  - Perplexity: 140
  - Evaluation: 460 utts., 10 spks.
[Bar chart: Word accuracy % (40-80) for No LM, LM by Original Transcription, LM by Realigned Transcription]
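For readers unfamiliar with the perplexity figure, the standard definition is the exponential of the average negative log-probability the language model assigns to each test word. The sketch below only illustrates that definition; the probabilities are made up and say nothing about how the LOTUS language model was built.

```python
import math


def perplexity(word_probs):
    """Perplexity from the per-word probabilities a language model assigned to a test text."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)


# A model that is as uncertain as a uniform choice among 140 words
# at every position has perplexity 140.
uniform_140 = perplexity([1.0 / 140.0] * 10)
```

Read this way, a perplexity of 140 means the recognizer faces, on average, roughly the uncertainty of choosing among 140 equally likely words at each position.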

TTS Project
• “Vaja” TTS engine
• TTS resources
• Prosody prediction
• Text processing
• Space reduction

“Vaja” TTS Engine
• Version 2.0 (2000)
  - Demisyllable concatenation
• Version 3.0 (2003)
  - Corpus-based unit-selection
• Version 4.0 (2006)
  - Multithread
  - Client/server
• Version 5.0 (2007)
  - Naturalness improvement
  - Space reduction
• Website http://www.nectec.or.th/rdi/vaja

TTS Resources
• ORCHID (1997)
  - Purpose: Thai text corpus for text processing
  - Detail: 27,000 sentences; Word segmentation; POS-tagged
• TSynC-1 (2003)
  - Purpose: Thai speech corpus for unit-selection speech synthesis
  - Detail: Triphone, tritone covered; 13 hrs., a fluent female; Prosody tagged

Prosody Prediction (1)
• Sentence/Phrase breaking
• Syllable-duration modeling

Prosody Prediction (2)
• Sentence/Phrase breaking
[Flow diagram: Preprocessed text -> Feature extraction -> Machine learning -> Break/Non-break]
  - Feature extraction: POS of current and neighboring words; No. of syllables/words from previous break
  - Machine learning: C4.5, RIPPER, CART, Neural network, POS n-gram
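The feature-extraction step can be sketched as below. The slide names the feature types but not their encoding, so the token tuple layout, the feature names, the "NONE" padding tag, and the English toy sentence are assumptions for illustration. The resulting dictionary would be fed to whichever learner is used (C4.5, RIPPER, CART, and so on).

```python
def break_features(tokens, index, last_break):
    """Features for deciding whether a break follows tokens[index].

    tokens: list of (word, pos_tag, n_syllables) tuples (hypothetical layout).
    last_break: index of the first token after the previous break.
    """
    def pos_at(i):
        # Pad with a placeholder tag outside the sentence boundaries.
        return tokens[i][1] if 0 <= i < len(tokens) else "NONE"

    span = tokens[last_break:index + 1]
    return {
        "pos_prev": pos_at(index - 1),
        "pos_cur": pos_at(index),
        "pos_next": pos_at(index + 1),
        "words_since_break": len(span),
        "syllables_since_break": sum(n for _, _, n in span),
    }


# Toy example with made-up POS tags and syllable counts.
toy = [("I", "PRON", 1), ("really", "ADV", 2), ("like", "VERB", 1), ("mangoes", "NOUN", 2)]
features = break_features(toy, 1, 0)
```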

Prosody Prediction (3)
• Syllable-duration modeling
[Flow diagram: Duration-tagged speech samples -> Regression analysis -> Regression model]
  - Factors: Phoneme, Tone, Position
  - The regression model predicts duration with fair precision (0.73 correlation to references)
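The slide does not state which correlation measure produced the 0.73 figure. Assuming it is the usual Pearson coefficient between predicted and reference syllable durations, it could be computed as in this sketch. The duration values are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented syllable durations in milliseconds, for illustration only.
reference_ms = [180.0, 220.0, 150.0, 300.0]
predicted_ms = [190.0, 210.0, 170.0, 260.0]
r = pearson(predicted_ms, reference_ms)
```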

Text Processing (1)
• Word segmentation
• Part-of-speech tagging
• Grapheme-to-phoneme (G2P) conversion

Text Processing (2)
• G2P difficulties
  - Context-dependent segmentation ambiguity (CDSA): NOWHERE → |NOW|HERE| or |NOWHERE|
  - Context-independent segmentation ambiguity (CISA): TOGETHER → |TOGETHER| or |TO|GET|HER|
  - Homograph ambiguity: LEAD → /l i d/ or /l e d/

%Acc        Trigram   Bayesian   Winnow
CDSA        73.0      93.2       95.7
CISA        98.3      99.7       99.7
Homograph   52.5      94.3       96.5
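The segmentation ambiguity in the English examples can be reproduced by enumerating every way a string splits into lexicon words. This is only a sketch of where the ambiguity comes from, with a toy lexicon taken from the slide's examples. It is not the Trigram, Bayesian, or Winnow disambiguation method being evaluated.

```python
def segmentations(text, lexicon):
    """Return every split of text into a sequence of words found in lexicon."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Toy lexicon built from the slide's examples.
LEXICON = {"NOW", "HERE", "NOWHERE", "TO", "GET", "HER", "TOGETHER"}

nowhere = segmentations("NOWHERE", LEXICON)
together = segmentations("TOGETHER", LEXICON)
```

Both strings yield two candidate segmentations, so a disambiguation model is needed to pick one.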

Space Reduction
[Chart: x-axis Maximum frequency of diphone (1, 10, 20, 50, 100, 200, 500, All); y-axes % Space Reduction (0-100) and Mean Opinion Score (0-5)]
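One plausible reading of the chart is that the unit inventory is pruned by capping how many instances of each diphone are kept, and that space reduction is the share of instances dropped. The slide does not describe the procedure, so the sketch below rests on that assumption and uses invented counts. It also treats every instance as the same size, which a real database would not.

```python
def space_reduction_percent(diphone_counts, max_freq=None):
    """Percent of diphone instances dropped when each diphone keeps at most max_freq.

    max_freq=None stands for the chart's "All" setting (no pruning).
    """
    total = sum(diphone_counts.values())
    if total == 0:
        raise ValueError("empty inventory")
    if max_freq is None:
        return 0.0
    kept = sum(min(count, max_freq) for count in diphone_counts.values())
    return 100.0 * (total - kept) / total


# Invented instance counts per diphone, for illustration only.
counts = {"k-a": 300, "a-n": 50, "n-i": 5}
reduction_at_100 = space_reduction_percent(counts, 100)
```

Lower caps save more space but leave fewer candidates for unit selection, which is presumably why the chart plots Mean Opinion Score alongside space reduction.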

SST Project (1)
• 2006 SST prototype

SST Project (2)
• 2006 SST prototype
  - English-to-Thai
  - Travel domain
  - Push-to-talk
  - ASR : CMU Sphinx III
  - MT : Nectec Parsit, a rule-based MT
  - TTS : Nectec Vaja

Conclusion
Thai Speech Technology at NECTEC
• ASR
  - Toolkit: Isolated word, Regular grammar
  - Robust: Robust feature, Model selection
  - LVCSR: Phone inventory, Transcript system
  - Corpora: Nectec-ATR, LOTUS
• TTS
  - Engine: Unit selection, Space reduction
  - Prosody: Phrase break, Duration
  - Corpora: TSynC-1
  - Text process: Word segment, G2P
• SST

Future Plan
ASR
• “iSpeech-N” : N-gram based ASR
• Telephone conversational corpus & model
• Modified tree-based model selection
TTS
• Incorporating prosodic models
• TSynC-2
• HMM-based TTS

Future Plan
SST
• Two-way SST
• A travel domain parallel corpus
• Example-based MT & Translation memory
• Spoken language MT

Tentative Collaborative Projects
HMM-based TTS
• An available large speech corpus
• Producing highly smoothed speech
• The first system for Thai
ASR for spontaneous telephone speech
• Corpus under development
• Highly spontaneous dialogues
• Telephone channel & environmental noises

Thank you for your attention
