Speech Processing 11-492/18-492 - PowerPoint PPT Presentation


  1. Speech Processing 11-492/18-492: Speech Synthesis Evaluation

  2. Evaluating Speech Synthesis
     - How good is the voice?
     - This voice is a 45.67
     - Is voice X better than voice Y?
     - Why?

  3. Evaluation
     - Objective measures
       - Run a program and get a number
     - Subjective measures
       - Have human listeners give a score
     - Do objective and subjective scores correlate?
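
One way to check whether objective and subjective scores correlate is to compute a correlation coefficient over per-system scores. The sketch below uses Pearson correlation as one possible choice; the slide does not name a statistic, and the per-system numbers are invented purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for four systems (made up, not from the lecture).
objective_distortion = [4.8, 5.6, 6.1, 7.3]  # program output, lower is better
subjective_score = [4.1, 3.6, 3.4, 2.5]      # listener score, higher is better

r = pearson(objective_distortion, subjective_score)
print(f"correlation between objective and subjective scores: {r:.2f}")
```

A strongly negative value here would mean the objective distortion tracks listener opinion; a value near zero would mean the program's number says little about what people hear.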

  4. Human Tests
     - Synthesis people are warped
       - The more you listen, the better it becomes
       - They hear things others don’t
     - Non-synthesis people are warped
     - People are very sensitive to listening conditions
       - What question you ask
       - What hardware you play it on
     - There are (at least) two orthogonal scales
       - Understandability
       - Naturalness

  5. Standard Tests
     - DRT: diagnostic rhyme tests
       - Test confusable phones
         - “bat” vs “pat”
       - Good for identifying phone errors
       - Sometimes in carrier sentences
         - Now we will say pat again.
     - Unit selection
       - Just include the standard words in the database
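
A two-choice rhyme test like the one above can be scored by counting right and wrong choices. The sketch below is a minimal illustration, assuming each trial records the word played and the word the listener picked; the chance-corrected formula `100 * (right - wrong) / total` is a commonly used adjustment for two-alternative tests, not something stated on the slide, and the trial data is invented.

```python
from collections import Counter

def drt_score(trials):
    """Score a two-choice test.

    trials: list of (played_word, chosen_word) pairs.
    Returns 100 * (right - wrong) / total, so pure guessing scores about 0.
    """
    if not trials:
        raise ValueError("no trials to score")
    right = sum(1 for played, chosen in trials if played == chosen)
    wrong = len(trials) - right
    return 100.0 * (right - wrong) / len(trials)

def confusions(trials):
    """Count each (played, chosen) error to show which phones get confused."""
    return Counter((played, chosen) for played, chosen in trials if played != chosen)

# Hypothetical listener responses for the "bat" vs "pat" pair.
trials = [("bat", "bat"), ("pat", "bat"), ("pat", "pat"), ("bat", "bat")]
print(drt_score(trials))
print(confusions(trials))
```

The confusion counts are what make the test diagnostic: they point at the specific phone contrast the voice is getting wrong.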

  6. Standard Tests
     - SUS: semantically unpredictable sentences
       - Det adj noun verb det adj noun
       - Automatically filled in with low-frequency words
         - The parklike holders threw the vague vegetables.
         - The simplistic consonants swam the episcopal quartet.
         - The dark geniuses woke the humane emptiness.
         - The masterly serials withdrew the collaborative brochure.
     - Test for understandability
       - Ask users to type in what they hear
     - Good for discrimination
     - Very hard for even fluent non-natives
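
The template and the typing task above can be sketched in a few lines. This is an illustration under stated assumptions: the word lists simply reuse words from the example sentences rather than a real low-frequency lexicon, and scoring by word-level edit distance is one reasonable choice, not necessarily the one used in the course. The names `make_sus` and `word_errors` are hypothetical.

```python
import random

# Assumption: toy word lists taken from the slide's examples.
ADJECTIVES = ["parklike", "vague", "simplistic", "episcopal", "dark", "humane"]
PLURAL_NOUNS = ["holders", "consonants", "geniuses", "serials"]
PAST_VERBS = ["threw", "swam", "woke", "withdrew"]
OBJECT_NOUNS = ["vegetables", "quartet", "emptiness", "brochure"]

def make_sus(rng):
    """Fill the template: det adj noun verb det adj noun."""
    return " ".join([
        "The", rng.choice(ADJECTIVES), rng.choice(PLURAL_NOUNS),
        rng.choice(PAST_VERBS),
        "the", rng.choice(ADJECTIVES), rng.choice(OBJECT_NOUNS),
    ])

def _words(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    return [w.strip(".,!?") for w in text.lower().split()]

def word_errors(reference, typed):
    """Return (edit_distance_in_words, reference_length).

    Uses the standard Levenshtein recurrence over word tokens, counting
    substitutions, insertions, and deletions equally.
    """
    ref, hyp = _words(reference), _words(typed)
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1], len(ref)

rng = random.Random(0)  # fixed seed so the test set is reproducible
sentence = make_sus(rng)
errors, total = word_errors(
    "The dark geniuses woke the humane emptiness.",
    "the dark genius woke the human emptiness",
)
print(sentence)
print(f"{errors} word errors out of {total}")
```

Because the sentences carry no semantic context, a listener cannot guess a misheard word from its neighbours, which is why the per-word error count reflects what was actually understood.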

  7. Standard Tests
     - MOS: mean opinion scores
       - 1-5 quality, naturalness, “like it”
       - Take average score
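
Taking the average of 1-5 ratings is straightforward to sketch. The example below also reports a rough 95% interval using a normal approximation, which is an addition for illustration and not part of the slide; the ratings are invented.

```python
from math import sqrt
from statistics import mean, stdev

def mos(ratings):
    """Return (mean opinion score, half-width of an approximate 95% interval)."""
    if not ratings:
        raise ValueError("no ratings")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of the 1-5 range: {rating}")
    score = mean(ratings)
    # Normal approximation; only meaningful with more than one rating.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = float("nan")
    return score, half_width

# Hypothetical ratings from five listeners for one voice.
score, half_width = mos([4, 5, 3, 4, 4])
print(f"MOS = {score:.2f} +/- {half_width:.2f}")
```

Reporting the interval alongside the mean helps when asking whether voice X is really better than voice Y: two averages that differ by less than the interval width are hard to tell apart with that many listeners.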

  8. Some experimental problems
     - Order of presentation
     - Other aids change perception
       - Showing the text makes it much easier
       - Having a talking head “improves” the synthesis
     - Hardware quality
       - Some voices are better on the telephone
       - Loudspeaker quality (headphone quality)
       - Room acoustics
       - Volume
     - Understandability
       - Harder if doing another task
     - Personal preference
       - Voice is fully understandable but “creepy”
       - Voice is incomprehensible but “funny”
       - Sounds like my grade school teacher
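
The order-of-presentation problem is often handled by rotating the order across listeners. The sketch below builds a cyclic Latin square so that each system appears in each position equally often; this is one common design and an assumption here, since the slide only names the problem. The system labels are placeholders.

```python
def latin_square_orders(systems):
    """Return one presentation order per row; row k is the list rotated by k."""
    n = len(systems)
    return [[systems[(row + col) % n] for col in range(n)] for row in range(n)]

def order_for_listener(systems, listener_index):
    """Assign listeners to rows of the square in round-robin fashion."""
    orders = latin_square_orders(systems)
    return orders[listener_index % len(orders)]

systems = ["A", "B", "C"]  # hypothetical system labels
for listener in range(4):
    print(listener, order_for_listener(systems, listener))
```

A cyclic square balances position only. It does not balance which system immediately precedes which, so a fuller design would also vary the neighbour relationships.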

  9. TTS Evaluation
     - How good are your ears?

  10. SUS Sentences
      - sus_00005
      - sus_00012
      - sus_00017
      - sus_00022

  11. SUS Sentences
      - The sorrowful premieres sang the ostentation gymnast
      - The temperamental gateways forgave the weatherbeaten finalist
      - The disruptive billboards blew the sugary endorsement
      - The serene adjustments foresaw the acceptable acquisition

  12. TTS Evaluation

  13. TTS Evaluation
      - In mud eels are, in mud none are
      - A 1918 state constitutional amendment made Massachusetts one of 23 states where citizens can enact laws by plebiscite.
      - Which is which
        - The numbers are 25 and 34.
        - The numbers 20 5 and 34.
      - What is the temperature in Pittsburgh

  14. Objective Synthesis Tests
      - Text analysis
        - How well do you cover NSWs (non-standard words)
        - How well do you cover homographs
      - Lexical coverage
        - How often do you see a new word
      - Lexical correctness
        - How correct are pronunciations
          - For unseen words
          - For seen words
      - Phonetic intelligibility
        - DRT tests
      - Semantic intelligibility
        - SUS tests
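
Lexical coverage is the kind of objective test where you run a program and get a number. A minimal sketch, assuming the lexicon is available as a set of lowercase words and the text is already tokenized; the function name `oov_rate` and the toy lexicon are hypothetical.

```python
def oov_rate(tokens, lexicon):
    """Return (fraction of tokens not in the lexicon, sorted list of new words)."""
    if not tokens:
        raise ValueError("no tokens")
    unknown = [t.lower() for t in tokens if t.lower() not in lexicon]
    return len(unknown) / len(tokens), sorted(set(unknown))

# Toy lexicon and text, for illustration only.
lexicon = {"the", "acceptable", "acquisition"}
tokens = "The serene adjustments foresaw the acceptable acquisition".split()

rate, new_words = oov_rate(tokens, lexicon)
print(f"{rate:.0%} of tokens are new: {new_words}")
```

The list of new words is the natural input for the next question on the slide, how correct the pronunciations are for unseen words.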

  15. Blizzard Challenge
      - Annual event from 2005 (15 years plus)
      - Distribute large databases of speech
      - Participants
        - Build a voice
        - Synthesize a set of sentences
      - Listeners
        - Listen and grade results

  16. Blizzard Challenge
      - 2005: US English synthesis, 4 voices, 1 hour each
        - 4 teams plus “Studio” (human speech)
      - 2006: US English: 1 voice: 6 hours and 1 hour
        - 12 teams
      - 2007: US English: 1 voice: 9 hours and 1 hour
        - 14 teams
      - 2008: UK English: 15 hours: Mandarin 5 hours
        - 19 teams
      - 2009: UK English: 15 hours: Mandarin 5 hours
      - 2010: UK English 18 hours: Mandarin 6 hours
      - 2010-: Audio Books, Indian Languages, Speaking in Noise
      - Split between industry and academia
      - Split between Asia, Europe, America (mostly Europe and Asia)

  17. Listeners
      - Three sets of listeners
        - Speech experts (participants)
        - Paid undergrads (native speakers)
        - Volunteers
      - Types of tests
        - MOS tests (1-5)
        - SUS tests
        - DRT tests
      - About 300 listeners in total

  18. Listening
      - Web based
        - So everyone did it in a different environment
        - But we got access to more people
      - Asked to do it in a quiet office with headphones
      - Could listen multiple times

  19. Blizzard Challenge Results
      - Speech experts
        - Like synthesis better
        - Understand synthesis better
      - Volunteers don’t always finish tests
      - Undergrads sometimes finish tests (or put in filler answers)
      - Results were correlated over different subgroups

  20. Application Tests
      - How does it work *in* the application
        - With real application data
      - A good voice is not noticed
      - Have *real* users evaluate it
        - Give them a choice (even if artificial)
      - CEO chooses the one they like!
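
When real users are given a choice between two voices, the resulting preference counts can be checked against chance. The sketch below uses an exact two-sided sign test, which is a standard statistic but an assumption here, as the slides do not say how choices are analysed; the counts are invented.

```python
from math import comb

def sign_test_p(prefer_a, prefer_b):
    """Exact two-sided binomial p-value for 'no preference' (p = 0.5).

    Ties (users with no preference) should be dropped before calling.
    """
    n = prefer_a + prefer_b
    if n == 0:
        raise ValueError("no preferences recorded")
    k = max(prefer_a, prefer_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical outcome: 9 of 10 users picked voice A.
p_value = sign_test_p(9, 1)
print(f"p = {p_value:.4f}")
```

A small p-value suggests the users' preference is unlikely to be a coin flip, which gives a firmer basis for the decision than one person's taste.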
