SPEECH SYNTHESIS EVALUATION
Sébastien Le Maguer
ADAPT Centre, Sigmedia Lab, EE Engineering, Trinity College Dublin
11-07-2019
LET'S RECAPITULATE [2/2]
WHAT WE HEAR NOWADAYS
Objectively:
- WaveNet was a game changer
- Tacotron: easy to use if you have enough data
What you may read: "Human-like", "High-fidelity", "Highly Natural", ...
KEY QUESTIONS / PROBLEMS
AI hype: https://www.economist.com/technology-quarterly/2020/06/11/an-understanding-of-ais-limitations-is-starting-to-sink-in
Environmental issues: [Strubell et al., (2019)] and follow-ups
Problematic question: is the quality really that good?
Fundamental questions: did we solve anything? If yes, what did we solve?
LET'S GET STARTED
WHAT IS EVALUATION
WHAT IS EVALUATION
The ideal: being able to describe in detail what a system brings compared to other ones
In practice: classify/order the systems based on their synthesis
WHAT TO EVALUATE?
- acoustic prosody: temporal structure, tonal structure, amplitude profile
- symbolic prosody: syllabic stress, word accent, sentence mode
- pronunciation of words, also in sentence context
- phrasing, rhythm
- voice quality: inherent or introduced by signal processing?
- discontinuities in unit concatenation
- ...
EVALUATION AXES
- Similarity
- Intelligibility
- Naturalness
WHERE DOES IT TAKE PLACE?
[Pipeline diagram: offline training stage (corpus: text → NLP → linguistic description; signal → acoustic parameters; both feed the models) and online generation stage (text → NLP → linguistic description → models → acoustic parameters → rendering → signal). Objective evaluation targets the acoustic parameters; subjective evaluation targets the rendered signal.]
OBJECTIVE EVALUATION
OBJECTIVE EVALUATION - THE METRICS
Which axes?
- Intelligibility: not much used
- Similarity: assessment/validation
The main metrics:
- Spectrum: MCD, RMSE, Euclidean distances
- F0: RMSE (Hz/cents), V/UV ratio, log-likelihood ratio
- BAP: RMSE
- Duration: RMSE, syllable/phoneme rate
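To make these metrics concrete, here is a minimal sketch (not from the lecture) of three of them in Python/NumPy: Mel-Cepstral Distortion, F0 RMSE in cents, and a V/UV error rate. It assumes the reference and synthesized features have already been extracted and frame-aligned (e.g., by DTW); the function names and conventions (excluding the 0th cepstral coefficient) are common choices, not prescriptions.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """MCD in dB between two aligned mel-cepstrum matrices (frames x coeffs).

    The 0th coefficient (energy) is excluded, as is conventional.
    """
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)  # standard MCD scaling constant
    return float(np.mean(const * np.sqrt(np.sum(diff ** 2, axis=1))))

def f0_rmse_cents(f0_ref, f0_syn):
    """F0 RMSE in cents over frames that are voiced in both signals."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    cents = 1200.0 * np.log2(f0_syn[voiced] / f0_ref[voiced])
    return float(np.sqrt(np.mean(cents ** 2)))

def vuv_error_rate(f0_ref, f0_syn):
    """Fraction of frames whose voiced/unvoiced decision disagrees."""
    return float(np.mean((f0_ref > 0) != (f0_syn > 0)))
```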
SUBJECTIVE EVALUATION
SUBJECTIVE EVALUATION - INTRODUCTION
Subjective evaluations are human focused... so expensive!
They should be prepared really carefully:
- you won't be able to repeat the test if you mess up the preparation
- the analysis of the results depends a lot on the preparation
- be careful about the question asked and the targeted listeners (see checklist later!)
Generally at least 3 systems are involved:
- the original voice
- a reference (anchor) system
- the analyzed system
INTELLIGIBILITY TEST
Semantically Unpredictable Sentences (SUS):
- unpredictable ⇒ forces the listener to "decipher" the message
- syntax is correct
- example: "A table eats the doctor"
Protocol guideline, one step:
1. The listener listens to a SUS
2. He/she writes down what he/she heard (joker character for a word that was not heard)
3. A distance is computed between what was typed and the original sentence
Score = Word Error Rate / Phone Error Rate (less common)
A nice paper: [Benoit, (1990)]
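As an illustration (not part of the slides), a minimal Word Error Rate sketch in Python using a word-level edit distance; the tokenisation and the handling of the joker character ("*" below) are assumptions made for the example.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# SUS example: "*" stands for a word the listener did not catch
print(word_error_rate("a table eats the doctor", "a table * the doctor"))  # 0.2
```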
THE STANDARD PROTOCOLS FOR SUBJECTIVE EVALUATION
SCORING METHODOLOGIES
The ACR protocol [ITU-T, (1996)]:
- Absolute Category Rating (ACR) ⇒ Mean Opinion Score (MOS)
- scores from 1 (bad) to 5 (excellent)
Key points:
- systems and utterances are randomized (Latin-square design)
- the question asked is going to condition the listener ⇒ caution!
- major problem: scores are "flattened"
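To show how ACR ratings are usually summarized, a small sketch (my own, not from the slides) that turns one system's 1-5 ratings into a MOS with a t-based 95% confidence interval; the ratings in the usage line are invented.

```python
import numpy as np
from scipy import stats

def mos_with_ci(scores, confidence=0.95):
    """Mean Opinion Score plus a t-distribution confidence interval."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half_width = stats.sem(scores) * stats.t.ppf((1 + confidence) / 2.0, len(scores) - 1)
    return mean, (mean - half_width, mean + half_width)

# Hypothetical ratings collected for one system
print(mos_with_ci([4, 5, 3, 4, 4, 5, 3, 4, 2, 4]))
```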
ACR - INTERFACE
PREFERENCE-BASED METHODOLOGIES
AB(X) test:
- 2 samples (A) and (B) are presented
- 3 choices: A, B and no preference
- ABX: a fixed reference X is also presented
Key points:
- stricter than ACR ⇒ results are more significant
- "no preference" can be removed ⇒ post-processing analysis required!
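One common way to check whether an AB preference is meaningful is a two-sided binomial (sign) test on the A/B counts. The sketch below is an assumption-laden example (it relies on `scipy.stats.binomtest`, available in SciPy >= 1.7); how to treat the "no preference" answers is exactly the post-processing choice mentioned above and must be reported.

```python
from scipy import stats

def ab_preference_pvalue(prefer_a, prefer_b, no_pref=0, drop_ties=True):
    """Two-sided binomial test on AB preference counts.

    Dropping "no preference" answers is one option; splitting them
    evenly between A and B is another. Report whichever you use.
    """
    n_a, n_b = float(prefer_a), float(prefer_b)
    if not drop_ties:
        n_a += no_pref / 2.0
        n_b += no_pref / 2.0
    n = int(round(n_a + n_b))
    return stats.binomtest(int(round(n_a)), n=n, p=0.5).pvalue

# Made-up counts: 62 listeners prefer A, 38 prefer B, 20 have no preference
print(ab_preference_pvalue(62, 38, no_pref=20))
```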
AB - INTERFACE
MUSHRA
MUltiple Stimuli with Hidden Reference and Anchor [ITU-R, (2001)]
Idea: combining scoring and preference
- continuous score from 0 to 100, with labelled steps every 20
- some constraints: given reference + hidden reference (consistency check), given anchors
Key points:
- mixes the scoring and preference methodologies
- but: difficult from the listener's perspective; small differences are difficult to interpret
MUSHRA - INTERFACE
WHAT TO DO WITH THE RESULTS
STATISTICAL ANALYSIS
Why? We are using a sample ⇒ we want to generalize
How? Using statistical tests:
- generally a t-test or a Wilcoxon-based test
- generally α = 0.05
- report the confidence interval and the effect size
Important!!! Be careful and honest with the conclusion
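As a hedged sketch of such an analysis (the function name and the effect-size choice are mine, not from the slides): a paired Wilcoxon signed-rank test on per-utterance scores of two systems, plus the magnitude of a rank-biserial effect size.

```python
import numpy as np
from scipy import stats

def paired_wilcoxon(scores_a, scores_b, alpha=0.05):
    """Wilcoxon signed-rank test on paired listening-test scores."""
    scores_a, scores_b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    statistic, p_value = stats.wilcoxon(scores_a, scores_b)
    # Magnitude of the rank-biserial correlation, one common effect size.
    # SciPy's two-sided statistic is min(T+, T-); zero differences are dropped.
    n = np.count_nonzero(scores_a - scores_b)
    effect = 1.0 - 2.0 * statistic / (n * (n + 1) / 2.0)
    return {"p_value": p_value, "significant": p_value < alpha, "effect_size": effect}

# Made-up MOS-like scores for two systems on the same 8 utterances
print(paired_wilcoxon([4, 3, 5, 4, 4, 3, 4, 5], [3, 3, 4, 4, 3, 2, 4, 4]))
```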
A COUNTER EXAMPLE (BLOG POST; THE PAPER IS BETTER)
Graphic results [figure not reproduced here]
"Explanation": "... were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese."
HOW TO INTERPRET RESULTS
Results taken from [Al-Radhi et al., (2018)]
VALIDATION
SOME PRECAUTIONS
Results have to be reproducible!
Environment setup reproducibility:
- description of the conditions (speakers/headphones, ...)
Protocol reproducibility:
- description of the test
- description of the question
- description of the corpora
- description of the "cognitive aspects" (duration, pauses?, ...)
Statistical reproducibility:
- description of the listeners (number, expert vs. non-expert, ...)
- statistical analysis (confidence interval, ...)
CHECKLIST - 1 [WESTER ET AL., (2015)]
- What test to use? MOS, MUSHRA, preference, intelligibility, and same/different judgments all fit different situations.
- Which question(s) to ask? Be aware that the question you ask may influence the answer you get. The terms you use may be interpreted differently by listeners, e.g., what does "quality" or "naturalness" actually mean?
- Which data to use for testing? Factor out aspects that affect the evaluation but are unrelated to the research question studied.
- Is the evaluation material unbiased and free of training data?
- Is a reference needed? Consider giving a reference or adding training material, particularly for intonation evaluation. Also consider the case for including other anchors.
CHECKLIST - 2 [WESTER ET AL., (2015)]
- What type of listeners? Native vs. non-native? Speech experts vs. naive listeners? Age, gender, hearing impairments? Different listener groups can lead to different results.
- How many listeners to use?
- How many data points are needed?
- Is the task suitable for human listeners? Take into consideration listener boredom, fatigue, and memory constraints, as well as cognitive load.
- Can you use crowd-sourcing? The biggest concern here is how to ensure the quality of the test-takers.
- How is the experiment going to be conducted? With headphones or speakers, over the web or in a listening booth?
SOME BIASES (EX: [CLARK ET AL., (2019)])
Background: how to evaluate long utterances in an ACR-based test
Some results [figure not reproduced here]
THE BLIZZARD CHALLENGE
THE BLIZZARD CHALLENGE
Website: http://festvox.org/blizzard
When? Every year since 2005
Who (participants)? Universities and some companies
Which kinds of systems? Parametric, unit selection, hybrid
Philosophy:
- focus on the analysis and the exchange rather than pure rating!
- results are made anonymous (although by reading the different papers you can rebuild the ranking)
WHAT NOW?
CURRENT SITUATION
Strong need for new protocols ([Wagner et al., (2019)]):
- MOS and MUSHRA are not refined enough!
- what does a preference or a score actually mean?
- get more precise feedback, qualify the speech
SUBJECTIVE EVALUATION - THE NEW WAYS
Behavioural: task-focused measures (reaction time, task completion, ...) [Wagner and Betz, (2017)]
Physiological:
- pupillometry [Govender and King, (2018)] (be careful: [Winn et al., (2018)])
- EEG [Parmonangan et al., (2019)] (illustration from [Siuly et al., (2016)]; be careful: [Belardinelli et al., (2019)])
OBJECTIVE EVALUATION - THE BIG MISCONCEPTION
For more details, see [Wagner et al., (2019)]
Subjective evaluation seems way better:
- more robust
- humans are involved in the loop
Key problem(s):
- it is expensive
- what do we learn about the signal?
Objective evaluation has 2 goals:
1. pointing out differences
2. classifying systems
TAKE HOME MESSAGES [6/6]
WHAT TO REMEMBER
Speech synthesis ≠ easy problem:
- a lot of human effort
- a lot of computer effort
Different solutions for different problems:
- parametric synthesis: handcrafted + database
- control/speed vs. quality
THE CURRENT STATE
The DNN (r)evolution:
- everything is moving to DNN-based architectures
- definitely a jump in quality (but how much and why?)
Don't forget the "user":
- CHI: GAFA (obviously!), startups
- blind people: using diphone synthesis (why?)
- a lot of others: speech researchers/scientists, entertainment, ...
SOME SENSITIVE POINTS
A big potential danger: spoofing (see the ASVspoof challenge [Wu et al., (2015)])
Black box vs. control:
- DNN: we don't understand what the system is doing
- and what happens when it fails?
Environmental issues: [Strubell et al., (2019)]
Evaluation:
- an important issue
- information vs. marketing