Speech Processing 15-492/18-492
Speech Synthesis Evaluation

Evaluating Speech Synthesis
- How good is the voice?
  - This voice is a 45.67
- Is voice X better than voice Y?
- Why?

Evaluation
- Objective measures
  - Run a program and get a number
- Subjective measures
  - Have human listeners extract a score
- Do objective and subjective scores correlate?

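One way to check whether objective and subjective scores agree is to compute a correlation coefficient over a set of voices. The sketch below uses Pearson correlation; the function name and the numbers are illustrative assumptions, not data from any real evaluation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, one pair per voice (made up for illustration):
objective = [4.1, 5.3, 6.0, 7.2]    # some distortion measure, lower is better
subjective = [4.2, 3.6, 3.1, 2.4]   # mean listener score, higher is better

r = pearson(objective, subjective)
print(f"correlation: {r:.3f}")  # strongly negative for these made-up numbers
```

A value near -1 or +1 suggests the program's number tracks listener judgments; a value near 0 suggests the objective measure says little about what listeners hear.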
Human Tests
- Synthesis people are warped
  - The more you listen the better it becomes
  - They hear things others don’t
- Non-synthesis people are warped
  - People very sensitive to listening conditions
  - What question do you ask
  - What hardware you play it on
- There are (at least) two orthogonal scales
  - Understandable
  - Natural

Standard Tests
- DRT: diagnostic rhyme tests
  - Test confusable phones
    - “bat” vs “pat”
  - Good for identifying phone errors
  - Sometimes in carrier sentences
    - Now we will say pat again.
- Unit selection
  - Just include the standard words in the database

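A DRT response sheet can be scored with a few lines of code. The sketch below assumes each trial is recorded as (word played, rhyming alternative, word chosen), which is a hypothetical format. It reports plain percent correct, a guessing-adjusted score of the form 100 * (right - wrong) / total, and a tally of which words were confused.

```python
from collections import Counter

def drt_scores(responses):
    """Score two-alternative rhyme-test trials.

    responses: list of (played_word, alternative_word, chosen_word) tuples.
    Returns (percent_correct, guessing_adjusted_score, confusion_counts).
    """
    total = len(responses)
    if total == 0:
        raise ValueError("no responses to score")
    right = sum(1 for played, _alt, chosen in responses if chosen == played)
    wrong = total - right
    percent_correct = 100.0 * right / total
    # With two alternatives, pure guessing gives 50% correct, i.e. 0 here.
    adjusted = 100.0 * (right - wrong) / total
    confusions = Counter(
        (played, chosen) for played, _alt, chosen in responses if chosen != played
    )
    return percent_correct, adjusted, confusions

# Hypothetical trials for the "bat" vs "pat" contrast:
trials = [
    ("bat", "pat", "bat"),
    ("pat", "bat", "pat"),
    ("pat", "bat", "bat"),   # listener heard "bat" when "pat" was played
    ("bat", "pat", "bat"),
]
percent, adjusted, confusions = drt_scores(trials)
```

The confusion tally is what makes the test diagnostic: it points at the phone contrast the voice gets wrong.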
Standard Tests
- SUS: Semantically unpredictable sentences
  - Det adj noun verb det adj noun
  - Automatically filled in with low frequency words
    - The parklike holders threw the vague vegetables.
    - The simplistic consonants swam the episcopal quartet.
    - The dark geniuses woke the humane emptiness.
    - The masterly serials withdrew the collaborative brochure.
- Test for understandability
  - Ask users to type in what they hear
  - Good as discrimination
  - Very hard for even fluent non-natives

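The template-filling step can be sketched as follows. The word lists are a tiny hypothetical sample drawn from the example sentences above; a real generator would presumably draw from a large part-of-speech-tagged lexicon restricted to low frequency words.

```python
import random

# Slot order from the slide: det adj noun verb det adj noun.
TEMPLATE = ["det", "adj", "noun", "verb", "det", "adj", "noun"]

# Hypothetical word lists (assumption: taken from the example sentences).
WORDS = {
    "det": ["the"],
    "adj": ["parklike", "vague", "simplistic", "episcopal", "dark", "humane"],
    "noun": ["holders", "vegetables", "consonants", "quartet", "geniuses", "emptiness"],
    "verb": ["threw", "swam", "woke", "withdrew"],
}

def make_sus(rng):
    """Fill each template slot with a random word of the right category."""
    words = [rng.choice(WORDS[slot]) for slot in TEMPLATE]
    return " ".join(words).capitalize() + "."

rng = random.Random(0)  # fixed seed so a test set can be regenerated
for _ in range(3):
    print(make_sus(rng))
```

Because the words are chosen independently per slot, the sentence is grammatical but gives the listener no semantic context to guess from.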
Standard tests
- MOS: mean opinion scores
  - 1-5 quality, naturalness, “like it”
  - Take average score

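Taking the average score is a one-liner, but reporting an interval alongside it helps when comparing voices. The sketch below adds a 95% confidence half-width using a normal approximation; that interval is an addition for illustration, not something the slide prescribes, and the ratings are made up.

```python
import math
import statistics

def mos(ratings):
    """Mean opinion score plus an approximate 95% confidence half-width."""
    if not ratings:
        raise ValueError("no ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, math.nan  # spread cannot be estimated from one rating
    # Normal approximation; rough for small listener counts.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical ratings from eight listeners for one voice:
ratings = [4, 3, 5, 4, 4, 2, 4, 3]
mean, half_width = mos(ratings)
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```

If the intervals of two voices overlap heavily, the difference in their averages may not mean much.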
Some experimental problems
- Order of presentation
- Other aids change perception
  - Showing the text makes it much easier
  - Having a talking head “improves” the synthesis
- Hardware quality
  - Some voices better on the telephone
  - Loudspeaker quality (headphone quality)
  - Room acoustics
  - Volume
- Understandability
  - Harder if doing other task
- Personal preference
  - Voice is fully understandable but “creepy”
  - Voice is incomprehensible but “funny”
  - Sounds like my grade school teacher

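One common way to reduce order-of-presentation effects is to rotate the playback order across listeners so that every system appears in every position equally often. The sketch below builds a cyclic Latin square; it is one possible design, named here as an assumption rather than the method used in any particular test, and the system labels are hypothetical.

```python
def latin_square_orders(systems):
    """Cyclic Latin square: row r is the system list rotated left by r."""
    n = len(systems)
    return [[systems[(row + col) % n] for col in range(n)] for row in range(n)]

def order_for_listener(systems, listener_index):
    """Assign listeners to rows of the square in round-robin fashion."""
    orders = latin_square_orders(systems)
    return orders[listener_index % len(orders)]

systems = ["A", "B", "C"]  # hypothetical system labels
orders = latin_square_orders(systems)
for row in orders:
    print(row)
```

A cyclic square balances position only. Each system is still always followed by the same neighbour, so carry-over effects between adjacent samples are not balanced.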
TTS Evaluation
- How good are your ears?

SUS Sentences
- sus_00022
- sus_00012
- sus_00005
- sus_00017

SUS Sentences
- The serene adjustments foresaw the acceptable acquisition
- The temperamental gateways forgave the weatherbeaten finalist
- The sorrowful premieres sang the ostentatious gymnast
- The disruptive billboards blew the sugary endorsement

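What a listener types for sentences like these can be scored against the reference text by counting word errors. The sketch below uses word-level edit distance (substitutions, insertions, deletions); the typed response is a made-up example, and the normalization (lowercasing, dropping punctuation) is an assumption.

```python
import re

def normalize(text):
    """Lowercase and keep only word tokens, so punctuation is not penalized."""
    return re.findall(r"[a-z']+", text.lower())

def word_errors(reference, typed):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = normalize(reference), normalize(typed)
    prev = list(range(len(hyp) + 1))  # distances from the empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # reference word missing from response
                cur[j - 1] + 1,          # extra word in response
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1], len(ref)

reference = "The serene adjustments foresaw the acceptable acquisition"
typed = "the serene adjustment saw the acceptable acquisition"  # hypothetical response
errors, n_words = word_errors(reference, typed)
print(f"word error rate: {errors / n_words:.2f}")
```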
TTS Evaluation
- In mud eels are, in mud none are
- A 1918 state constitutional amendment made Massachusetts one of 23 states where citizens can enact laws by plebiscite.
- Which is which
  - The numbers are 25 and 34.
  - The numbers 20 5 and 34.
- What is the temperature in Pittsburgh

Objective Synthesis Tests
- Text analysis
  - How well do you cover NSWs (non-standard words)
  - How well do you cover homographs
- Lexical coverage
  - How often do you see a new word
- Lexical correctness
  - How correct are pronunciations
    - For unseen words
    - For seen words
- Phonetic intelligibility
  - DRT tests
- Semantic intelligibility
  - SUS tests

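Lexical coverage is one of the easier measures to run as a program. The sketch below counts how often a token in some text is missing from the pronunciation lexicon. The tokenizer is deliberately crude and the lexicon is a tiny hypothetical set; a real system would tokenize with its own text-analysis front end.

```python
import re

def oov_rate(text, lexicon):
    """Fraction of tokens not in the lexicon, plus the sorted unseen words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("no tokens found in text")
    unseen = [t for t in tokens if t not in lexicon]
    return len(unseen) / len(tokens), sorted(set(unseen))

# Hypothetical lexicon entries (assumption, for illustration only):
lexicon = {"the", "numbers", "are", "and", "what", "is", "temperature", "in"}
rate, unseen = oov_rate("What is the temperature in Pittsburgh", lexicon)
print(f"{rate:.1%} of tokens are new: {unseen}")
```

The list of unseen words is also the natural input for the lexical correctness check: those are the words whose pronunciations must be predicted rather than looked up.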
Blizzard Challenge
- Annual Event from 2005
- Distribute large databases of speech
- Participants
  - Build a voice
  - Synthesize a set of sentences
- Listeners
  - Listen and grade results

Blizzard Challenge
- 2005: US English synthesis, 4 voices, 1 hour each
  - 4 teams plus “Studio” (human speech)
- 2006: US English: 1 voice: 6 hours and 1 hour
  - 12 teams
- 2007: US English: 1 voice: 9 hours and 1 hour
  - 14 teams
- 2008: UK English: 15 hours; Mandarin: 5 hours
  - 19 teams
  - Split between industry and academia
  - Split between Asia, Europe, Americas.

Listeners
- Three sets of listeners
  - Speech experts (participants)
  - Paid undergrads (native speakers)
  - Volunteers
- Types of tests
  - MOS tests (1-5)
  - SUS tests
  - DRT tests
- About 300 listeners in total

Listening
- Web based
  - So everyone did it in a different environment
  - But we got access to more people
  - Asked to do it in quiet office with headphones
  - Could listen multiple times

Blizzard Challenge Results
- Speech Experts
  - Like synthesis better
  - Understand synthesis better
- Volunteers don’t always finish tests
- Undergrads sometimes finish tests
  - (or put in filler answers)
- Results were correlated over different subgroups

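Agreement between listener subgroups can be checked with a rank correlation, which asks whether two groups order the systems the same way even if one group scores everything higher. The sketch below computes Spearman's rho as the Pearson correlation of ranks. The per-system scores are invented for illustration and are not Blizzard results.

```python
import math

def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation between two equal-length score lists."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical mean scores for four systems from two listener groups:
experts = [3.9, 3.1, 2.5, 3.4]      # experts rate everything higher...
volunteers = [3.5, 2.8, 2.0, 2.9]   # ...but rank the systems the same way
rho = spearman(experts, volunteers)
```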
Application Tests
- How does it work *in* the application
- With real application data
- A good voice is not noticed
- Have *real* users evaluate it
- Give them a choice (even if artificial)
  - CEO chooses the one they like!

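When users are given a choice between two voices, the preference counts can be checked against chance. The sketch below uses an exact two-sided sign test (a binomial test with p = 0.5). This test is a suggestion added here, not something the slides prescribe, and the counts are made up.

```python
from math import comb

def sign_test_p_value(prefer_a, prefer_b):
    """Two-sided exact binomial test that both voices are equally preferred."""
    n = prefer_a + prefer_b
    if n == 0:
        raise ValueError("no preferences recorded")
    k = min(prefer_a, prefer_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical result: 15 users chose voice A, 5 chose voice B.
p = sign_test_p_value(15, 5)
print(f"p = {p:.3f}")
```

A small p-value suggests the preference is unlikely to be a coin flip; users who expressed no preference are left out of the counts.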