Speech Processing 15-492/18-492 Speech Synthesis Evaluation


  1. Speech Processing 15-492/18-492 Speech Synthesis Evaluation

  2. Evaluating Speech Synthesis
     - How good is the voice?
       - "This voice is a 45.67"
     - Is voice X better than voice Y?
     - Why?

  3. Evaluation
     - Objective measures
       - Run a program and get a number
     - Subjective measures
       - Have human listeners assign a score
     - Do objective and subjective scores correlate?
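One way to check whether objective and subjective scores correlate is to compute a correlation coefficient over per-voice scores. The sketch below uses Pearson's r; the score lists are hypothetical and only illustrate the calculation, they are not results from the course.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: one objective score and one mean listener score per voice.
objective = [4.1, 5.3, 6.0, 7.2]    # assumed distance-style measure, lower is better
subjective = [4.2, 3.6, 3.1, 2.4]   # assumed mean listener rating, higher is better

r = pearson(objective, subjective)
print(f"r = {r:.2f}")
```

A value of r near -1 or +1 would suggest the objective measure tracks listener judgments; a value near 0 would suggest it does not. With only a handful of voices the estimate is very noisy.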

  4. Human Tests
     - Synthesis people are warped
       - The more you listen the better it becomes
       - They hear things others don’t
     - Non-synthesis people are warped
       - People very sensitive to listening conditions
       - What question do you ask
       - What hardware you play it on
     - There are (at least) two orthogonal scales
       - Understandable
       - Natural

  5. Standard Tests
     - DRT: diagnostic rhyme tests
       - Test confusable phones
         - “bat” vs “pat”
       - Good for identifying phone errors
       - Sometimes in carrier sentences
         - “Now we will say pat again.”
       - Unit selection
         - Just include the standard words in the database

  6. Standard Tests
     - SUS: semantically unpredictable sentences
       - Det adj noun verb det adj noun
       - Automatically filled in with low-frequency words
         - The parklike holders threw the vague vegetables.
         - The simplistic consonants swam the episcopal quartet.
         - The dark geniuses woke the humane emptiness.
         - The masterly serials withdrew the collaborative brochure.
       - Test for understandability
       - Ask users to type in what they hear
       - Good as discrimination
       - Very hard for even fluent non-natives
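The template above (det adj noun verb det adj noun) can be filled in mechanically. The following is a minimal sketch, assuming tiny hand-written word lists borrowed from the example sentences; a real test set would draw low-frequency words from a large part-of-speech-tagged lexicon, and the function name `make_sus` is hypothetical.

```python
import random

# Small illustrative word lists taken from the example sentences on the slide;
# a real test would sample low-frequency words from a large lexicon.
ADJECTIVES = ["parklike", "vague", "simplistic", "episcopal", "dark", "humane"]
NOUNS = ["holders", "vegetables", "consonants", "quartet", "geniuses", "emptiness"]
VERBS = ["threw", "swam", "woke", "withdrew"]

def make_sus(rng):
    """Fill the template: det adj noun verb det adj noun."""
    adj1, adj2 = rng.sample(ADJECTIVES, 2)   # two distinct adjectives
    noun1, noun2 = rng.sample(NOUNS, 2)      # two distinct nouns
    verb = rng.choice(VERBS)
    return f"The {adj1} {noun1} {verb} the {adj2} {noun2}."

rng = random.Random(0)  # fixed seed so the generated test set is reproducible
sentences = [make_sus(rng) for _ in range(3)]
for sentence in sentences:
    print(sentence)
```

Because the words are chosen independently, listeners cannot guess a word from its neighbours, which is what makes the sentences useful for testing understandability.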

  7. Standard Tests
     - MOS: mean opinion scores
       - 1-5 quality, naturalness, “like it”
       - Take average score
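As a small illustration of "take average score", the sketch below computes a mean opinion score per voice from 1-5 listener ratings. The voice names and ratings are made up for the example.

```python
from statistics import mean

def mos(ratings):
    """Mean opinion score: the average of listener ratings on a 1-5 scale."""
    if not ratings:
        raise ValueError("no ratings given")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
    return mean(ratings)

# Hypothetical ratings: one list of listener scores per voice.
ratings_by_voice = {
    "voice_x": [4, 5, 3, 4, 4],
    "voice_y": [2, 3, 3, 2, 4],
}

mos_by_voice = {voice: mos(r) for voice, r in ratings_by_voice.items()}
for voice, score in mos_by_voice.items():
    print(voice, score)
```

In practice it is worth reporting the number of ratings and some measure of spread alongside the mean, since a difference between two averages can be small relative to listener disagreement.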

  8. Some experimental problems
     - Order of presentation
     - Other aids change perception
       - Showing the text makes it much easier
       - Having a talking head “improves” the synthesis
     - Hardware quality
       - Some voices better on the telephone
       - Loudspeaker quality (headphone quality)
       - Room acoustics
       - Volume
     - Understandability
       - Harder if doing another task
     - Personal preference
       - Voice is fully understandable but “creepy”
       - Voice is incomprehensible but “funny”
       - Sounds like my grade school teacher

  9. TTS Evaluation
     - How good are your ears?

  10. SUS Sentences
      - sus_00022
      - sus_00012
      - sus_00005
      - sus_00017

  11. SUS Sentences
      - The serene adjustments foresaw the acceptable acquisition
      - The temperamental gateways forgave the weatherbeaten finalist
      - The sorrowful premieres sang the ostentatious gymnast
      - The disruptive billboards blew the sugary endorsement
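When listeners type in what they hear, their responses can be compared with reference sentences like these. The sketch below scores one response by word error rate (word-level edit distance divided by reference length). This is one common scoring choice and an assumption here, not necessarily the scoring used in the course; the typed response is invented.

```python
def normalize(text):
    """Lowercase and drop punctuation so typing style is not counted as an error."""
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return kept.split()

def word_error_rate(reference, response):
    """Word-level edit distance between response and reference, per reference word."""
    ref, hyp = normalize(reference), normalize(response)
    if not ref:
        raise ValueError("reference sentence is empty")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from the response
                            curr[j - 1] + 1,      # extra word in the response
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

reference = "The sorrowful premieres sang the ostentatious gymnast"
typed = "the sorrowful premiers sang the ostentatious gymnast."  # hypothetical response
wer = word_error_rate(reference, typed)
print(f"WER = {wer:.3f}")
```

Exact string matching treats a misspelling the same as a mishearing, so a real evaluation may want a spelling-tolerant comparison before counting errors.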

  12. TTS Evaluation

  13. TTS Evaluation
      - In mud eels are, in mud none are
      - A 1918 state constitutional amendment made Massachusetts one of 23 states where citizens can enact laws by plebiscite.
      - Which is which
        - The numbers are 25 and 34.
        - The numbers 20 5 and 34.
      - What is the temperature in Pittsburgh

  14. Objective Synthesis Tests
      - Text analysis
        - How well do you cover NSWs (non-standard words)
        - How well do you cover homographs
      - Lexical coverage
        - How often do you see a new word
      - Lexical correctness
        - How correct are pronunciations
          - For unseen words
          - For seen words
      - Phonetic intelligibility
        - DRT tests
      - Semantic intelligibility
        - SUS tests
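Lexical coverage ("how often do you see a new word") can be estimated as the fraction of tokens in some text that are missing from the pronunciation lexicon. The sketch below assumes the lexicon is simply a set of lowercase words and uses a deliberately crude tokenizer; the function name `oov_rate` and the tiny lexicon are illustrative only.

```python
import re

def oov_rate(text, lexicon):
    """Return (fraction of tokens not in the lexicon, sorted list of unseen words)."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer: letters and apostrophes
    if not tokens:
        raise ValueError("text contains no word tokens")
    unseen = [t for t in tokens if t not in lexicon]
    return len(unseen) / len(tokens), sorted(set(unseen))

# Hypothetical lexicon; a real one would hold many thousands of entries.
lexicon = {"what", "is", "the", "temperature", "in"}

rate, unseen = oov_rate("What is the temperature in Pittsburgh", lexicon)
print(f"OOV rate = {rate:.2f}, unseen = {unseen}")
```

The words this reports as unseen are the ones whose pronunciations must be predicted rather than looked up, which is where the "for unseen words" correctness measure applies.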

  15. Blizzard Challenge
      - Annual event from 2005
      - Distribute large databases of speech
      - Participants
        - Build a voice
        - Synthesize a set of sentences
      - Listeners
        - Listen and grade results

  16. Blizzard Challenge
      - 2005: US English synthesis, 4 voices, 1 hour each
        - 4 teams plus “Studio” (human speech)
      - 2006: US English: 1 voice: 6 hours and 1 hour
        - 12 teams
      - 2007: US English: 1 voice: 9 hours and 1 hour
        - 14 teams
      - 2008: UK English: 15 hours: Mandarin 5 hours
        - 19 teams
        - Split between industry and academia
        - Split between Asia, Europe, Americas

  17. Listeners
      - Three sets of listeners
        - Speech experts (participants)
        - Paid undergrads (native speakers)
        - Volunteers
      - Types of tests
        - MOS tests (1-5)
        - SUS tests
        - DRT tests
      - About 300 listeners in total

  18. Listening
      - Web based
        - So everyone did it in a different environment
        - But we got access to more people
        - Asked to do it in a quiet office with headphones
        - Could listen multiple times

  19. Blizzard Challenge Results
      - Speech experts
        - Like synthesis better
        - Understand synthesis better
      - Volunteers don’t always finish tests
      - Undergrads sometimes finish tests
        - (or put in filler answers)
      - Results were correlated over different subgroups

  20. Application Tests
      - How does it work *in* the application
      - With real application data
      - A good voice is not noticed
      - Have *real* users evaluate it
      - Give them a choice (even if artificial)
        - CEO chooses the one they like!
