Outliers Detection vs. Control Questions to Ensure Reliable Results in Crowdsourcing. A Speech Quality Assessment Case Study Rafael Zequeira Jiménez , Laura Fernández Gallardo, Sebastian Möller Quality and Usability Lab, Technische Universität Berlin HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop
Crowdsourcing HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 2
Motivation Spe Speech quali lity is important for for the e Quality of of Ex Exper erience (QoE) in: audio books virtual or robotic conversational agents * * The The coll collected ra rati tings can can be be used used to to trai train AI AI systems to to pre redict the the speech qu quali ality au auto tomat atical ally ly HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 3
Motivation Spe Speech quali lity ex experiments traditionally ly co conducted in Laboratory La - Professional audio equipment - Soundproof room - Limited number of participants HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 4
Crowdsourcing Study o Con onducted a a spe peech quality as assessment ex experiment o Cr Crowd-workers were ere pre presen ented with th 20 20 sp speech sti stimuli o Opinion ab about ove overall ll quality gat gathered in a a 5-point sc scale
Speech Material: o Database number 50 501 from from ITU-T Re Rec. P.8 P.863 o 4 4 Ge Germans were ere rec recorded per per co condition o 20 200 sp speech sti stimuli (9s (9s lon ong on on av avg.) o 50 50 deg degradation co conditions: o nar arrow- & wide- ba band o tem emporal l cl clipping o sig signal-corre related nois oise, o com ombinations of of these de degradations o Th The e da database con contains quali lity ra ratings to o the e 20 200 sti stimuli li made by by 24 24 di different nati tive Ge German listeners, in acc accordance to o ITU-T Rec Rec. P.8 .800 HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 6
Study Conditions: o Address the st study to nati ative Ge Germ rmans o Coll Collect 24 24 ra rati tings per per sti stimulus fro from di different listener ers o Ex Experiment in ac accordance with the ITU-T Re Rec. . P.8 P.800 HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 7
Crowdsourcing Platform: o Ge German bas based CS CS pla platform o Rep Reported 1 1 mill llion glo global user sers in Se September 201 2017 o Most of of their user sers are are from from Ge German sp speaking co countries HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 8
Crowdsourcing Experiment Scre Screening ta task to o recr recruit Ge Germ rman listeners o Speech quali Spe lity as assessment task: o o Qualification ph phase o Spe Speech quali lity ass assessment t HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 9
Crowdsourcing Experiment Speech Quali ality ty Quali alific icat ation Asse ssessment • co consent req request • intr troduction • use se of of hea eadphone • en environment rec record • au audio Math ath trapping up p to o 15 15s questi tion • 20 20 sti stimuli to o ra rate te • 5 5 st stimuli li as as an an an anchor • 2 2 tra rapping Question HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 10
Crowdsourcing Experiment Speech Quali ality ty Quali alific icat ation Asse ssessment • co consent req request • intr troduction • use se of of hea eadphone • en environment rec record • au audio Math ath trapping up p to o 15 15s questi tion • 20 20 sti stimuli to o ra rate te • 5 5 st stimuli li as as an an an anchor • 2 2 tra rapping Question HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 11
Results 87 87 wor orkers pa participated in the stu study o 8 8 wor orkers fa failed the e Qualification ph phase o 53 3 unique listen eners: o o 60, 60,4% male les o 96, 96,2% nati tive Ge Germans o pro rovided 48 4840 ra rati tings the co colle llected ra rati tings acc account for for 24 24 to 26 26 ass assessment fro from dif different t listeners o per per file file HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 12
Crowdsourcing vs. Laboratory Spearman’s rank -order cor correla lation: : o o rho = 0,864 (p<0,001 01) Mon onotonic rel relationship betw between La Lab- an and CS CS- MOS o Ro Root Mean Sq Square e Err Error: o o RMSE SE=0 =0,47 ,474 HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 13
Filtering from unreliable workers Work in [1] [1] and and [2] [2] re recommends: o o the use se of of trapping ques estion, to cat catch inattentive user sers o when en the use ser fa fail, l, then en all all of of their ra rati tings are are dis discarded Thi This app approach was ef effe fective in [1] 1] an and improved slig slightly the res result lts in [2] 2] [1] B. Naderi, T. Polzehl, I. Wechsung, F. Köster , and S. Möller, “Effect of Trapping Questions on the Reliability of Speech Quality Judgments in a Crowdsourcing Paradigm,” in Interspeech, 2015, pp. 2799 – 2803. [2] R. Zequeira Jiménez, L. Fernández Gallardo, and S. Möller, “Scoring Voice Likability using Pair -Comparison: Laboratory vs. Crowdsourcing Approach,” in Ninth International Conference on Quality of Multimedia Experience (QoMEX), 2017, pp. 1 – 3. Page 14
Filtering from unreliable workers A wor orker is s unre reli liable or or untrustworthy when: : o s/h s/he fa fails the e trapping ques estion in the SQ SQAT AT o s/h s/he fa fails the e Qualification mor ore than onc once HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 15
Filtering from unreliable workers A wor orker is s unre reli liable or or untrustworthy when: : o s/h s/he fa fails the e trapping ques estion in the SQ SQAT AT o s/h s/he fa fails the e Qualification mor ore than onc once HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 16
Filtering from unreliable workers o Discarded 32 320 ra ratings in total fro from W4, 4, W5, W7 o W6 did did not ot co conduct the SQ SQAT Meth thod: : “ fi filt ltering by by tra trappin ing question" ( F-TQ ) o Spearman’s rank -order co correla lation on on 45 4520 ra rati tings: : o rh rho o = = 0,8 0,862 (p (p<0,001) When dis discarding all all the e wor orkers ( F- TQ’ ) ) : o rh rho o = = 0,8 0,854 (p< p<0,001) HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 17
Outlier Detection outliers: o rati tings abo above 1,5 1,5 inte terquartile le ran range (I (IQR) o dep epicted by by ci circle les ex extr treme out outliers: o rati tings at t 3,0 3,0 IQR or or abo above o dep epicted by by ast asterisks HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 18
Outlier Detection o Discarded 12 122 ra ratings identified as as ex extreme out outliers Meth thod: : “ fi filt ltering by by ou outli tlier de detectio ion 1" " ( F-OD1 ) o Spearman’s rank -order co correla lation: : o rh rho o = = 0,8 0,863 (p (p<0,001) sti still not be better r than the fir first coef coefficient when en no o da data was dis discarded HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 19
Outlier Detection 2 o Discarded 14 1480 ra ratings fro from 12 12 wor orkers that wer ere out outliers or or ex extr treme out outliers three times or or mor ore [5] 5]. Meth thod: : “ fi filt ltering by by ou outli tlier de detectio ion 2" " ( F-OD2 ) o Spearman’s rank -order co correla lation: : o rh rho o = = 0,8 0,867 (p (p<0,001) [5] B. Naderi, Motivation of Workers on Microtask Crowdsourcing Platforms. Springer, 2018. Page 20
Alternative Approach o Appli lied F-OD1 an and F-OD2 an and dis discarded 15 1529 ra ratings in total. o Ide dentify the ou outl tliers made by by all all the wor orkers that fa failed the e trapping ques estions. . The Then rem removed 17 17 ra rati tings. Method: : F-TQ-OD o Spearman’s rank -order co correla lation on on 32 3294 ra rati tings: : o rh rho o = = 0,8 0,868 (p< (p<0,001) HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 21
Results Overview Ratings Approach rho RMSE discarded - 0 0,864* 0,474 F-TQ 320 0,862* 0,476 F- TQ’ 780 0,854* 0,480 F-OD1 122 0,863* 0,477 F-OD2 1480 0,867* 0,474 F-TQ-OD 1546 0,868* 0,479 *p < 0,001 HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 22
Recommend
More recommend