Speech Processing 11-492/18-492
Speech Synthesis Evaluation
Evaluating Speech Synthesis
  - How good is the voice?
    - This voice is a 45.67
  - Is voice X better than voice Y?
    - Why?
Evaluation
  - Objective measures
    - Run a program and get a number
  - Subjective measures
    - Have human listeners extract a score
  - Do objective and subjective scores correlate?
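One way to check whether objective and subjective scores correlate is a Pearson correlation over per-voice scores. The sketch below is illustrative only: the numbers are made up, and the choice of Pearson (rather than a rank correlation) is an assumption, not something the slides specify.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: one objective score and one mean listener score per voice.
objective = [4.1, 5.3, 6.0, 7.2]    # made-up distance measure, lower is better
subjective = [4.2, 3.8, 3.1, 2.5]   # made-up mean listener rating, higher is better
r = pearson(objective, subjective)
print(f"correlation between objective and subjective scores: {r:.2f}")
```

A value near -1 or +1 suggests the objective measure tracks listener judgments; a value near 0 suggests it does not.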
Human Tests
  - Synthesis people are warped
    - The more you listen the better it becomes
    - They hear things others don’t
  - Non-synthesis people are warped
  - People very sensitive to listening conditions
    - What question do you ask
    - What hardware you play it on
  - There are (at least) two orthogonal scales
    - Understandability
    - Naturalness
Standard Tests
  - DRT: diagnostic rhyme tests
    - Test confusable phones
      - “bat” vs “pat”
    - Good for identifying phone errors
    - Sometimes in carrier sentences
      - Now we will say pat again.
  - Unit selection
    - Just include the standard words in the database
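A DRT response is a choice between two rhyming words, so half of all answers would be right by pure guessing. One widely used convention corrects for this by subtracting wrong answers from right ones. The sketch below assumes that convention and a hypothetical `(chosen, target)` response format; the slides do not say how the scores were computed.

```python
def drt_score(responses):
    """Chance-corrected score for a two-choice rhyme test.

    responses: list of (chosen_word, target_word) pairs (hypothetical format).
    Returns 100 for all correct, 0 for chance-level guessing.
    """
    if not responses:
        raise ValueError("no responses to score")
    right = sum(1 for chosen, target in responses if chosen == target)
    wrong = len(responses) - right
    return 100.0 * (right - wrong) / len(responses)

# Made-up listener responses for the "bat" vs "pat" contrast.
responses = [("bat", "bat"), ("pat", "pat"), ("bat", "pat"), ("pat", "pat")]
print(drt_score(responses))
```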
Standard Tests
  - SUS: Semantically unpredictable sentences
    - Det adj noun verb det adj noun
    - Automatically filled in with low frequency words
      - The parklike holders threw the vague vegetables
      - The simplistic consonants swam the episcopal quartet
      - The dark geniuses woke the humane emptiness.
      - The masterly serials withdrew the collaborative brochure
  - Test for understandability
    - Ask users to type in what they hear
    - Good as discrimination
    - Very hard for even fluent non-natives
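The "det adj noun verb det adj noun" template can be filled in automatically. The sketch below is a minimal illustration: the word lists are taken from the example sentences above rather than from a real low-frequency lexicon, and the function and list names are hypothetical.

```python
import random

# Words borrowed from the example sentences; a real test would draw
# low-frequency words from a tagged lexicon instead (assumption).
ADJECTIVES = ["parklike", "vague", "simplistic", "episcopal", "dark",
              "humane", "masterly", "collaborative"]
SUBJECT_NOUNS = ["holders", "consonants", "geniuses", "serials"]
OBJECT_NOUNS = ["vegetables", "quartet", "emptiness", "brochure"]
VERBS = ["threw", "swam", "woke", "withdrew"]

def make_sus(rng):
    """Fill the template: det adj noun verb det adj noun."""
    adj1, adj2 = rng.sample(ADJECTIVES, 2)   # two different adjectives
    subject = rng.choice(SUBJECT_NOUNS)
    verb = rng.choice(VERBS)
    obj = rng.choice(OBJECT_NOUNS)
    return f"The {adj1} {subject} {verb} the {adj2} {obj}."

rng = random.Random(0)  # fixed seed so the same test set can be regenerated
for _ in range(3):
    print(make_sus(rng))
```

Because the words are chosen independently, the listener cannot use meaning to guess a word they did not actually hear.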
Standard Tests
  - MOS: mean opinion scores
    - 1-5 quality, naturalness, “like it”
    - Take average score
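Taking the average of 1-5 ratings is straightforward; reporting a confidence interval alongside it helps when asking whether voice X is really better than voice Y. The sketch below uses a normal-approximation 95% interval, which is an assumption added here, and the ratings are made up.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (mean, half_width) for a list of 1-5 ratings.

    half_width is an approximate 95% confidence interval half-width,
    using the normal approximation 1.96 * standard error.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.mean(ratings)
    stderr = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * stderr

# Made-up ratings for one voice.
mos, half_width = mean_opinion_score([4, 3, 5, 4, 4, 2, 3, 4])
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```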
Some experimental problems
  - Order of presentation
  - Other aids change perception
    - Showing the text makes it much easier
    - Having a talking head “improves” the synthesis
  - Hardware quality
    - Some voices better on the telephone
    - Loudspeaker quality (headphone quality)
    - Room acoustics
    - Volume
  - Understandability
    - Harder if doing other task
  - Personal preference
    - Voice is fully understandable but “creepy”
    - Voice is incomprehensible but “funny”
    - Sounds like my grade school teacher
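For the order-of-presentation problem, one common mitigation (an assumption here; the slides do not say how it was handled) is to counterbalance: rotate the order so that each system appears in each position equally often across listeners. A minimal sketch using a cyclic Latin square, with hypothetical system names:

```python
def latin_square_orders(systems):
    """Cyclic Latin square: row i is the system list rotated left by i."""
    n = len(systems)
    if n == 0:
        raise ValueError("need at least one system")
    return [[systems[(row + col) % n] for col in range(n)] for row in range(n)]

def order_for_listener(systems, listener_index):
    """Pick the presentation order for a listener by cycling through rows."""
    orders = latin_square_orders(systems)
    return orders[listener_index % len(orders)]

systems = ["A", "B", "C"]  # hypothetical voices under test
for listener in range(4):
    print(listener, order_for_listener(systems, listener))
```

Every system appears once in every position across a full set of rows, so position effects average out when the number of listeners is a multiple of the number of systems.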
TTS Evaluation
  - How good are your ears?
SUS Sentences
  - sus_00005
  - sus_00012
  - sus_00017
  - sus_00022
SUS Sentences
  - The sorrowful premieres sang the ostentation gymnast
  - The temperamental gateways forgave the weatherbeaten finalist
  - The disruptive billboards blew the sugary endorsement
  - The serene adjustments foresaw the acceptable acquisition
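What listeners type can be scored against reference sentences like these. The sketch below uses word error rate (word-level edit distance divided by reference length); choosing that metric, and the normalization that strips case and punctuation, are assumptions added here rather than anything the slides prescribe.

```python
def normalize(text):
    """Lowercase, drop punctuation, and split into words."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return kept.split()

def word_error_rate(reference, typed):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = normalize(reference)
    hyp = normalize(typed)
    if not ref:
        raise ValueError("reference sentence is empty")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # reference word was missed
                           cur[j - 1] + 1,       # extra word was typed
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)

reference = "The serene adjustments foresaw the acceptable acquisition"
typed = "the serene adjustment foresaw the acceptable acquisition"  # made-up response
print(word_error_rate(reference, typed))
```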
TTS Evaluation
  - In mud eels are, in mud none are
  - A 1918 state constitutional amendment made Massachusetts one of 23 states where citizens can enact laws by plebiscite.
  - Which is which
    - The numbers are 25 and 34.
    - The numbers 20 5 and 34.
  - What is the temperature in Pittsburgh
Objective Synthesis Tests
  - Text analysis
    - How well do you cover NSWs (non-standard words)
    - How well do you cover homographs
  - Lexical coverage
    - How often do you see a new word
  - Lexical correctness
    - How correct are pronunciations
      - For unseen words
      - For seen words
  - Phonetic intelligibility
    - DRT tests
  - Semantic intelligibility
    - SUS tests
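Lexical coverage ("how often do you see a new word") can be measured as an out-of-vocabulary rate over running text. The sketch below is illustrative: the tiny lexicon and the whitespace tokenization are assumptions, and a real system would use its own pronunciation lexicon and tokenizer.

```python
def oov_rate(tokens, lexicon):
    """Fraction of running tokens that are not in the lexicon (case-insensitive)."""
    if not tokens:
        raise ValueError("no tokens to check")
    known = {word.lower() for word in lexicon}
    unseen = sum(1 for token in tokens if token.lower() not in known)
    return unseen / len(tokens)

# Hypothetical toy lexicon and input text.
lexicon = {"what", "is", "the", "temperature", "in"}
tokens = "What is the temperature in Pittsburgh".split()
print(f"OOV rate: {oov_rate(tokens, lexicon):.2%}")
```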
Blizzard Challenge
  - Annual event from 2005 (15 years plus)
  - Distribute large databases of speech
  - Participants
    - Build a voice
    - Synthesize a set of sentences
  - Listeners
    - Listen and grade results
Blizzard Challenge
  - 2005: US English synthesis, 4 voices, 1 hour each
    - 4 teams plus “Studio” (human speech)
  - 2006: US English, 1 voice: 6 hours and 1 hour
    - 12 teams
  - 2007: US English, 1 voice: 9 hours and 1 hour
    - 14 teams
  - 2008: UK English 15 hours, Mandarin 5 hours
    - 19 teams
  - 2009: UK English 15 hours, Mandarin 5 hours
  - 2010: UK English 18 hours, Mandarin 6 hours
  - 2010-: Audio books, Indian languages, speaking in noise
  - Split between industry and academia
  - Split between Asia, Europe, America (mostly Europe and Asia)
Listeners
  - Three sets of listeners
    - Speech experts (participants)
    - Paid undergrads (native speakers)
    - Volunteers
  - Types of tests
    - MOS tests (1-5)
    - SUS tests
    - DRT tests
  - About 300 listeners in total
Listening
  - Web based
    - So everyone did it in a different environment
    - But we got access to more people
  - Asked to do it in a quiet office with headphones
  - Could listen multiple times
Blizzard Challenge Results
  - Speech experts
    - Like synthesis better
    - Understand synthesis better
  - Volunteers don’t always finish tests
  - Undergrads sometimes finish tests (or put in filler answers)
  - Results were correlated over different subgroups
Application Tests
  - How does it work *in* the application
    - With real application data
  - A good voice is not noticed
  - Have *real* users evaluate it
    - Give them a choice (even if artificial)
  - CEO chooses the one they like!