Cyberon Voice Commander - PowerPoint PPT Presentation


  1. Cyberon Voice Commander
     Experience in Developing a Multilingual Voice Command System
     劉進榮, Assistant Vice President, R&D Department, Cyberon Corporation (賽微科技)
     2007/08/16

  2. Cyberon Profile
     • One of the leading embedded speech solution providers worldwide
     • Establishment: Jan 2000
     • Headquarters: Hsin-Tien City, Taipei, Taiwan
     • China Office: XiaMen
     • Employees: 36 (R&D: 27)
     • More than 15 million units shipped worldwide
     • 2006 Revenue: NTD 87 million, EPS: NTD 13.2

  3. Powered By Cyberon - Grand China

  4. Powered By Cyberon - Worldwide

  5. Cyberon's Solutions
     • Speaker-Dependent Voice Recognition
       - Speed Dial engine
       - Cyberon Voice Speed Dial (CVSD)
     • Speaker-Independent Voice Recognition
       - CListener engine
       - Cyberon Voice Dialer (CVD)
       - Cyberon Voice Commander (CVC)
     • Text-To-Speech
       - CReader engine
       - Cyberon Talking Tutor (CTT)
       - Cyberon Talking Dictionary (CTD)

  6. Cyberon Voice Commander
     • A Voice Dialing and Command & Control Application
       - Name Dial / Digit Dial
       - Phone Book Lookup
       - Program Launch
       - Media Player Control
       - E-mail / SMS / Calendar reader
       - Callback, Redial, Time, etc.
       - Voice Feedback
       - Bilingual Speech Recognition
     • Technology
       - Speaker-Independent Command-Based SR
       - Text-To-Speech
       - Continuous Digit Recognition
       - Speaker-Dependent SR (Voice Tag)
       - Speaker Adaptation (for Digit Model)

  7. Supported Language
     • European Region: UK English, German, French, Italian, Spanish, Portuguese, Russian, Dutch, Polish, Czech, Turkish, Danish, Swedish, Finnish (07'Q3), Norwegian (07'Q3), Greek (07'Q3), Slovak (07'Q4), Hungarian (07'Q4), Ukrainian (07'Q4)
     • American Region: Northern American English, Brazilian Portuguese, Southern American Spanish
     • Asian Region: Traditional Chinese, Simplified Chinese, Chinese Accent English, Korean, Thai, Cantonese, Japanese

  8. Speaker-Independent Speech Recognition

  9. Architecture
     • Pipeline: Voice Signal → Feature Extraction → Feature Vectors → Search Algorithm → Result
     • Inputs to the search: Grammar, Lexicon, Acoustic Database
     • Search Algorithm: Isolated Word, Discrete Speech, Continuous Speech, Keyword Spotting
     • Vocabulary Size: Small (tens), Medium (hundreds), Large (thousands), Very Large (tens of thousands)
     • Speaker Dependence: Speaker-Dependent (SD), Speaker-Independent (SI), Speaker Adaptation (SA)
     • Approach: Neural Network, HMM
     • Unit: Word based, Phoneme based

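  10. Feature & Grammar
      • Feature
        - Input Signal: 8 kHz, 16-bit PCM
        - 8-Dim MFCC and 8-Dim Delta MFCC
        - 100 Frames Per Second
        - Cepstral Mean Subtraction
      • Grammar (branches between the start and end nodes)
        - Call command: 打電話 (call), 人名 (person name), 住家、公司、手機 (home, office, mobile)
        - Launch command: 開啟 (open), 應用程式名 (application name)
        - Play command: 播放 (play), 歌曲名 (song name)
        - 其他單詞命令 (other single-word commands)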
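
The feature bullets above can be illustrated with a small sketch. It assumes the 8 static MFCCs per frame are already computed (the MFCC front end itself is not shown), and it uses a simple two-point delta, which is an assumption since the slide does not say how the deltas are calculated. All function names are hypothetical.

```python
from typing import List

Frame = List[float]

SAMPLE_RATE_HZ = 8000      # from the slide: 8 kHz input
FRAMES_PER_SECOND = 100    # from the slide: 100 frames per second
HOP_SAMPLES = SAMPLE_RATE_HZ // FRAMES_PER_SECOND  # samples between frame starts


def cepstral_mean_subtraction(frames: List[Frame]) -> List[Frame]:
    """Subtract the per-dimension mean over the utterance from every frame."""
    if not frames:
        return []
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def delta_features(frames: List[Frame]) -> List[Frame]:
    """Two-point delta (assumed): (next - previous) / 2, edge frames repeated."""
    n = len(frames)
    deltas = []
    for t in range(n):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, n - 1)]
        deltas.append([(b - a) / 2.0 for a, b in zip(prev_f, next_f)])
    return deltas


def build_feature_vectors(mfcc: List[Frame]) -> List[Frame]:
    """Return 16-dim vectors: 8 mean-normalized MFCCs followed by 8 deltas."""
    normalized = cepstral_mean_subtraction(mfcc)
    deltas = delta_features(normalized)
    return [static + delta for static, delta in zip(normalized, deltas)]
```

With 8 kHz input and 100 frames per second, consecutive frames start 80 samples apart, and each frame yields one 16-dimensional vector.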

  11. Lexicon, Model & Search
      • Lexicon
        - Word-to-Phone Conversion
        - Several approaches for different languages
        - 30 KB ~ 250 KB per language
      • Model
        - Phoneme-Based HMM
        - 3 Left-to-Right States for a Phoneme Model
        - Decision-Tree Triphone Model
        - Forward-Backward Training
        - 180 KB ~ 220 KB per language
      • Search Algorithm
        - Viterbi Search
        - Word transition governed by Grammar
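
As a rough illustration of Viterbi search over a 3-state left-to-right phoneme model, the sketch below finds the best state path for one model in the log domain. The transition probabilities (0.6 stay, 0.4 advance) and the emission scores passed in are made-up placeholders, and grammar-governed word transitions are not modeled; this is not the CListener implementation.

```python
import math
from typing import List, Tuple

NEG_INF = float("-inf")


def viterbi_left_to_right(
    log_emit: List[List[float]],
    log_stay: float = math.log(0.6),   # hypothetical self-loop probability
    log_next: float = math.log(0.4),   # hypothetical advance probability
) -> Tuple[float, List[int]]:
    """Best path through a left-to-right HMM.

    log_emit[t][s] is the log-likelihood of frame t under state s.
    The path starts in state 0, ends in the last state, and at each step
    either stays in the current state or advances by one.
    """
    n_frames = len(log_emit)
    n_states = len(log_emit[0])
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emit[0][0]

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_stay
            advance = score[t - 1][s - 1] + log_next if s > 0 else NEG_INF
            if stay >= advance:
                best, back[t][s] = stay, s
            else:
                best, back[t][s] = advance, s - 1
            score[t][s] = best + log_emit[t][s]

    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path
```

If there are fewer frames than states, the returned score is negative infinity, meaning no valid alignment exists. In a full recognizer the same recursion would run over phoneme models chained into words, with the grammar deciding which word may follow which.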

  12. Language Development
      • Procedure
        - Define Phoneme Set
          - Wikipedia, SAMPA, Language Learning Web Site, ...
        - Build Lexicon Module
          - Rule: Academic Paper, Language Learning Web Site, ...
          - Pronunciation Dictionary: LDC, ELRA, other research organizations, ...
        - Design Recording Scripts
          - News Web Site
        - Collect Speech Data
          - Local Agents
        - Train Model & Test
      • 3 ~ 6 months for developing a language

  13. Lexicon Module
      • Basic Approaches
        - Rule
          - Simple Letter-to-Phone Rules
          - Ex: Italian, Spanish, Portuguese, etc.
        - Hardcode
          - Ex: Chinese, Korean, etc.
        - Decision Tree
          - Trained by a pronunciation dictionary
          - Accuracy: inside test 92% ~ 98%, outside test 60% ~ 75%
          - Ex: English, German, French, etc.
      • Hybrid for Most Languages
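
The "Rule" approach can be sketched as a greedy, longest-match letter-to-phone converter. The rule table below is a toy, loosely Spanish-like example invented for illustration; it is not Cyberon's rule set, it is far from complete, and the phone symbols are placeholders.

```python
from typing import Dict, List

# Hypothetical rule table: grapheme -> list of phone symbols.
RULES: Dict[str, List[str]] = {
    "ch": ["tS"],
    "ll": ["j"],
    "qu": ["k"],
    "rr": ["rr"],
    "a": ["a"], "e": ["e"], "i": ["i"], "o": ["o"], "u": ["u"],
    "b": ["b"], "c": ["k"], "d": ["d"], "l": ["l"], "m": ["m"],
    "n": ["n"], "p": ["p"], "r": ["r"], "s": ["s"], "t": ["t"],
    "h": [],  # silent letter in this toy table
}


def letter_to_phone(word: str, rules: Dict[str, List[str]] = RULES) -> List[str]:
    """Convert a word to phones by trying the longest grapheme first."""
    max_len = max(len(g) for g in rules)
    phones: List[str] = []
    i = 0
    word = word.lower()
    while i < len(word):
        for length in range(min(max_len, len(word) - i), 0, -1):
            grapheme = word[i:i + length]
            if grapheme in rules:
                phones.extend(rules[grapheme])
                i += length
                break
        else:
            # No rule covers this letter; a real module would need a fallback.
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return phones
```

For example, `letter_to_phone("chico")` matches "ch" before falling back to single letters. Context-dependent rules (such as a letter whose sound depends on the following vowel) are left out, which is one reason a hybrid approach is used for most languages.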

  14. Data Collection
      • Corpus
        - 100 ~ 800 Informants Per Language
        - Per Speaker
          - 40 ~ 60 short words for bootstrapping the model
          - 200 ~ 300 sentences (25 ~ 30 min) for training
      • Accent Issue
        - Collect data in big cities
        - Try to enlarge the coverage of accents
      • Verification & Phoneme Transcription
        - Done by tools

  15. Engine Simulation Test
      • Vocabulary: 200 full names
      • Tester: 4 ~ 6 native speakers
      • Device: Dopod 900 (HTC Universal)
      • Add several degrees of AURORA car noise to source data
      • Accuracy (%) by S/N:

      Language                Clean    15dB    10dB     5dB     0dB
      Taiwan Mandarin         98.03   97.04   96.37   93.09   75.33
      China Mandarin          96.62   96.21   95.21   90.33   71.67
      Cantonese               95.36   94.01   93.97   88.01   71.62
      US English              98.90   97.90   96.68   92.58   79.40
      UK English              93.88   94.85   94.21   91.45   77.79
      German                  95.17   95.17   93.65   87.81   75.29
      French                  94.83   95.02   94.08   90.25   76.62
      Italian                 95.77   94.15   93.64   91.56   81.73
      Spanish                 96.18   95.37   92.83   89.28   78.00
      Brazilian Portuguese    96.20   97.15   95.49   93.35   80.29
      Dutch                   94.25   93.12   92.62   88.12   74.75
      Japanese                96.55   96.10   92.40   90.40   81.10
      Russian                 97.15   95.60   93.62   87.07   75.47
      Average                 96.07   95.51   94.21   90.25   76.85
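
One common way to add noise at a given S/N is to scale the noise so that the ratio of speech power to noise power hits the target. The sketch below shows that calculation; the slides do not describe how the AURORA noise was actually mixed, so the whole-utterance power measure and the function names are assumptions.

```python
import math
from typing import List, Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average power of a signal (mean of squared samples)."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Add noise to speech so the result has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise is silent")
    # SNR_dB = 10 * log10(P_speech / P_noise_scaled)
    target_noise_power = p_speech / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling `mix_at_snr(speech, noise, snr_db)` once per level (15, 10, 5, 0) would produce the four noisy conditions in the table from one clean recording.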

  16. CVC Field Test
      • Vocabulary: 200 full names, 20 ~ 30 apps with grammar
      • Tester: 4 ~ 6 native speakers
      • Device: Several PocketPC phone models
      • Environment: Office, Roadside, and Highway
      • Accuracy (%) by environment:

      Language                Office   Roadside   Highway
      Taiwan Mandarin          98.6       92.8       93.5
      China Mandarin           96.2       90.4       92.3
      Cantonese                94.8       89.7       91.5
      US English               93.7       85.2       90.5
      UK English               93.2       83.7       88.5
      German                   95.7       86.3       93.8
      French                   96.5       91.4       92.6
      Italian                  97.5       92.3       94.0
      Spanish                  97.1       89.4       91.2
      Brazilian Portuguese     95.3       87.6       88.7
      Dutch                    92.4       84.0       91.3
      Japanese                 96.2       88.3       91.2
      Russian                  96.3       88.4       92.8
      Average                  95.63      88.42      91.68

  17. Text-To-Speech

  18. TTS in CVC
      • Mainly for voice feedback of VR result
      • 16 kHz, 16-bit PCM Output
      • Compact Size: 300 KB ~ 600 KB per Language
      • Acceptable quality
      • Lack of rich prosody (Robotic)
      • Good for pronunciation of single word and short phrase after fine tuning
        - Cyberon Talking Dictionary

  19. Architecture
      • Pipeline: Input Text → Text Analysis → Synthesizer → Output Speech
      • Text Analysis output: Word/Phrase break, Pronunciation, POS tag
      • Resources: Pronunciation Lexicon, POS Lexicon, Prosody Model, Speech Unit Database

  20. Text Analysis
      • Word Boundary
        - For Chinese and Thai
        - Longest word first
      • POS Tagging
        - By POS n-gram and Viterbi search
      • Phrase Boundary
        - By boundary n-gram and Viterbi search
        - Simplified approach: by syllable length
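
"Longest word first" is commonly implemented as forward maximum matching: at each position, take the longest lexicon entry that matches and fall back to a single character otherwise. The sketch below shows that idea with a tiny made-up lexicon; the actual CReader lexicon and tie-breaking rules are not described in the slides.

```python
from typing import List, Set


def segment_longest_first(text: str, lexicon: Set[str]) -> List[str]:
    """Split text into words, always preferring the longest lexicon match."""
    max_len = max((len(w) for w in lexicon), default=1)
    words: List[str] = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical lexicon for illustration only.
DEMO_LEXICON = {"打電話", "電話", "王小明", "手機"}
```

With this lexicon, `segment_longest_first("打電話給王小明", DEMO_LEXICON)` picks "打電話" rather than the shorter "電話", and the out-of-lexicon character "給" becomes a one-character word.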

  21. Prosody Model
      • Mandarin & Cantonese (Syllable Unit)
        - Save first tone of each syllable in database
        - Pre-define F0 contour of each tone
        - Adopt fixed base F0 contour of phrase
        - Compute duration by syllable position in word and in phrase
      • Other Languages (Diphone Unit)
        - Predict accent position and type by CART (Classification And Regression Tree)
        - Generate F0 contour by linear regression
        - Predict duration by CART
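
The Mandarin part of the model (a pre-defined contour per tone placed on a fixed phrase-level base contour) might look roughly like the sketch below. The tone shapes use the conventional Chao tone numerals for the four Mandarin tones; the base F0 values, the linear declination, and the semitone spacing are invented for illustration and are not taken from the slides.

```python
from typing import Dict, List

# Chao tone numerals for the four Mandarin lexical tones (5 = high, 1 = low).
TONE_LEVELS: Dict[int, List[int]] = {
    1: [5, 5],      # high level
    2: [3, 5],      # rising
    3: [2, 1, 4],   # dipping
    4: [5, 1],      # falling
}


def phrase_base_f0(position: int, n_syllables: int,
                   start_hz: float = 220.0, end_hz: float = 180.0) -> float:
    """Fixed base contour: linear decline across the phrase (assumed values)."""
    if n_syllables == 1:
        return start_hz
    frac = position / (n_syllables - 1)
    return start_hz + frac * (end_hz - start_hz)


def level_to_hz(level: int, base_hz: float,
                semitones_per_level: float = 2.0) -> float:
    """Map a tone level to Hz around the base; level 3 sits on the base."""
    return base_hz * 2.0 ** ((level - 3) * semitones_per_level / 12.0)


def f0_targets(tones: List[int]) -> List[List[float]]:
    """Return F0 target points (Hz) for each syllable of a phrase."""
    targets = []
    for i, tone in enumerate(tones):
        base = phrase_base_f0(i, len(tones))
        targets.append([level_to_hz(lv, base) for lv in TONE_LEVELS[tone]])
    return targets
```

A synthesizer would then interpolate between these targets and modify the stored syllable unit to follow the resulting contour.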

  22. Synthesizer
      • LPC (Linear Predictive Coding)-Based Approach
        - Save LPC coefficients and the pitch-period residual of each speech unit in the database
        - Adjust residual length for F0 contour
        - Adjust number of pitch periods for duration
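
To make the LPC idea concrete, the sketch below computes predictor coefficients with the autocorrelation method (Levinson-Durbin recursion), derives the residual, and resynthesizes the signal from coefficients plus residual. It processes one whole signal at a fixed order, which is a simplification: the per-pitch-period storage and the residual-length and pitch-count adjustments listed on the slide are not shown, and all names are hypothetical.

```python
import math
from typing import List


def autocorrelation(x: List[float], max_lag: int) -> List[float]:
    """r[lag] = sum over n of x[n] * x[n - lag]."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r: List[float], order: int) -> List[float]:
    """Coefficients a[1..order] with x[n] ~ sum_k a[k] * x[n - k]."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:]


def lpc_residual(x: List[float], coeffs: List[float]) -> List[float]:
    """Prediction error e[n] = x[n] - sum_k a[k] * x[n - k]."""
    p = len(coeffs)
    return [x[n] - sum(coeffs[k] * x[n - 1 - k] for k in range(min(p, n)))
            for n in range(len(x))]


def lpc_synthesize(residual: List[float], coeffs: List[float]) -> List[float]:
    """All-pole filter driven by the residual; inverts lpc_residual."""
    p = len(coeffs)
    y: List[float] = []
    for n, e in enumerate(residual):
        y.append(e + sum(coeffs[k] * y[n - 1 - k] for k in range(min(p, n))))
    return y


def demo_signal(n_samples: int = 200) -> List[float]:
    """Deterministic test signal: two sinusoids plus a small jitter term."""
    return [math.sin(0.3 * n) + 0.5 * math.sin(0.7 * n)
            + 0.1 * (((n * 7919) % 13) - 6) / 6.0
            for n in range(n_samples)]
```

Feeding the unmodified residual back through `lpc_synthesize` reproduces the input; changing the residual's length or repeating pitch periods before synthesis is where F0 and duration control would come in.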

  23. Conclusion

  24. Cyberon Voice Commander
      • A successful commercial voice application on mobile devices
      • Integrates several speech technologies, such as SI VR and TTS, into an embedded system
      • Experience of developing many languages
      • Shows that speech technologies are workable in real daily life

  25. Future Work
      • Improve TTS quality
      • Enhance recognition performance in heavily noisy conditions
      • Find an accurate approach to verify and transcribe speech data
      • Create a more effective procedure for developing a language
      • Develop other advanced speech technologies and applications

  26. The End and Thanks
      Cyberon Corporation
      TEL: +886-2-2910-9088
      FAX: +886-2-2910-7986
      Website: www.cyberon.com.tw
