Speaking under cover: The impact of face-concealing garments on the acoustics of fricatives.
Natalie Fecher
Language & Linguistic Science, University of York, York, UK
PhD supervisors: Dominic Watt, David van Leeuwen
IAFPA 2011, Vienna, Austria, 27th July 2011
Outline
• Background of the Project
• The ‘Face Cover’ Corpus
• Acoustic Fricative Study
• Conclusions and Outlook
Background
Forensic Speech Science
PhD Project on Multimodal Speech and Speaker Recognition
Audio-Visual Speech Processing
Joint processing and transformation of acoustic and facial information under qualitatively variable input:
Acoustic Noise
• microphone type/placement
• acoustic environment
• channel characteristics
• complexity of the scenario
• ► face coverings
Visual Noise
• lighting, occlusion, perspective
• image background
• resolution/compression
• appearance change
• ► face coverings
A / V / AV Speech/Speaker Recognition by Human Perceptual or Automatic System: ‘Identify’ Speech/Speaker or Verify Claimed Content/Identity
(on the basis of Aleksic & Katsaggelos, 2006)
Previous Research
• Llamas/Harrison/Donnelly/Watt (2009): set of common confusions during bimodal presentation of AV stimuli; sound transmission loss characteristics (TL) of 3 fabrics
• Watt/Llamas/Harrison (2010): sound quality judgement of speech filtered with TL spectra
• Zhang/Tan (2008): test of an ASR system with 10 types of voice disguise; ‘masking’ amongst the 3 guises with the lowest similarity rate
• Coniam (2005): impact of surgical masks in oral exams during SARS outbreak
‘Face Cover’ Corpus
Where? High-quality audio/video recordings in a professional TV studio at the University of York.
Who? 10 British English speakers. Control for demographic, educational and language background (details in Fecher, 2011a/b).
Disguise?
Selection criteria:
• forensic relevance
• facial parts covered
• mask material
No voice disguise per se.
What?
Phonetically controlled stimuli
• syllable structure: /C1VC2/ (existing English words excluded)
• vowel: [ɑː] as in <father>
• consonants: /p, t, k, b, d, g, f, s, ʃ, θ, v, z, ʒ, ð, m, n, ŋ, h/
• syllable position: initial, final
• carrier phrase: He said /stimulus/.
• phonotactic rules: no /ŋ/ initial, no /h/ final
• presentation: IPA, randomised
► 576 stimuli per speaker
How? VIDEO / AUDIO (setup diagram labels: half-profile camera; 3 m; headband)
Fricative Study
Method
(Figure: FFT spectrum of /s, ʃ, f, θ/; axes f [Hz], A [dB] and t)
/s ʃ f θ/ × 2 tokens × 2 syllable positions × 6 speakers × 8 disguise conditions
• less standardised analysis procedures for obstruents (see e.g. Haley et al., 2010; Maniwa et al., 2009; Jongman et al., 2000; Flipsen et al., 1999; Shadle & Mair, 1996; Tabain & Watson, 1996)
• no bandpass filter, no pre-emphasis (48 kHz/16 bit/PCM)
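The size of the design follows directly from the factors listed above. The short sketch below enumerates it; it is an illustration only, not the author's tooling. The speaker labels `S1` to `S6` are hypothetical placeholders, and the eight condition codes are assumed to be the ones that label the points in the skewness/kurtosis plot.

```python
from itertools import product

# Factors of the fricative study as listed on the Method slide.
FRICATIVES = ["s", "ʃ", "f", "θ"]
TOKENS = [1, 2]                     # two tokens per item
POSITIONS = ["initial", "final"]    # syllable position of the fricative
# Hypothetical speaker labels; the slides do not give speaker IDs.
SPEAKERS = [f"S{i}" for i in range(1, 7)]
# Assumption: the eight codes shown in the skewness/kurtosis plot
# are the eight disguise conditions.
CONDITIONS = ["CON", "BAL", "HEL", "HOO", "NIQ", "RUB", "SUR", "TAP"]


def enumerate_tokens():
    """Return one record per analysed fricative token in the full design."""
    return [
        {
            "fricative": fric,
            "token": tok,
            "position": pos,
            "speaker": spk,
            "condition": cond,
        }
        for fric, tok, pos, spk, cond in product(
            FRICATIVES, TOKENS, POSITIONS, SPEAKERS, CONDITIONS
        )
    ]


if __name__ == "__main__":
    print(len(enumerate_tokens()))  # 4 * 2 * 2 * 6 * 8 = 768
```

Under these assumptions the fully crossed design has 768 fricative tokens, 192 per fricative.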
Variables
• intensity
• peak
• spectral moments: CoG, variance, skewness, kurtosis
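The four spectral moments can be written out as weighted statistics over the spectrum. The sketch below is an illustration under stated assumptions, not the analysis script used in the study: it treats the normalised spectrum as a distribution over frequency, and the function name `spectral_moments`, the power weighting (`power=2.0`) and the reporting of excess kurtosis (kurtosis minus 3) are all assumptions.

```python
from typing import Sequence, Tuple


def spectral_moments(
    freqs_hz: Sequence[float],
    magnitudes: Sequence[float],
    power: float = 2.0,
) -> Tuple[float, float, float, float]:
    """Return (centre of gravity, variance, skewness, excess kurtosis).

    Each frequency bin is weighted by |magnitude| ** power; power=2.0
    (power spectrum weighting) is an assumption, not taken from the slides.
    """
    if len(freqs_hz) != len(magnitudes):
        raise ValueError("freqs_hz and magnitudes must have equal length")
    weights = [abs(m) ** power for m in magnitudes]
    total = sum(weights)
    if total == 0:
        raise ValueError("spectrum has no energy")
    probs = [w / total for w in weights]

    # First moment: centre of gravity (CoG) in Hz.
    cog = sum(f * p for f, p in zip(freqs_hz, probs))

    def central_moment(order: int) -> float:
        return sum(((f - cog) ** order) * p for f, p in zip(freqs_hz, probs))

    variance = central_moment(2)        # second central moment, Hz^2
    if variance == 0:
        raise ValueError("all energy sits in one bin; skewness is undefined")
    skewness = central_moment(3) / variance ** 1.5      # dimensionless
    kurtosis = central_moment(4) / variance ** 2 - 3.0  # dimensionless
    return cog, variance, skewness, kurtosis
```

For a spectrum that is symmetric around its centre, the CoG falls on the centre frequency and the skewness is zero; energy loss in the higher bands pulls the CoG down and makes the skewness more positive.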
intensity (figure: panels for /ʃ, s, θ, f/)
peak frequency (figure: panels for /s, θ, f, ʃ/)
centre of gravity (figure: panels for /s, θ, f, ʃ/)
skewness * kurtosis (figure: scatter plot of kurtosis against skewness, both dimensionless, for /s, ʃ, f, θ/; points labelled by condition: HEL, RUB, HOO, NIQ, BAL, CON, SUR, TAP; fits shown in the plot: r²=.68, p<.05; r²=.10, p=.44; r²=.67, p<.05; r²=.90, p<.001)
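The r² values in the plot describe how well kurtosis is predicted linearly from skewness across the conditions. A minimal sketch of that calculation follows, assuming a simple Pearson correlation over one (skewness, kurtosis) pair per condition; the function names and the example numbers are hypothetical and are not the study's data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Coefficient of determination of a simple linear regression of ys on xs."""
    return pearson_r(xs, ys) ** 2


if __name__ == "__main__":
    # Hypothetical per-condition values for one fricative (not real data).
    skewness = [0.3, 0.9, 1.5, 2.1]
    kurtosis = [1.0, 3.5, 5.0, 9.0]
    print(round(r_squared(skewness, kurtosis), 2))
```

With only eight conditions per fricative, a single outlying covering can move r² considerably, which is worth keeping in mind when comparing the four fits.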
Summary
• sound energy absorption dependent on mask material, but: additional intensity variation due to the speakers’ individual compensation strategies
• overall stronger effects for the spectrally diffuse and low-energy non-sibilants /f, θ/ than for the sibilants /s, ʃ/: more prone to energy absorption in higher frequency bands
• lower centre of gravity for most coverings
• highly variable peak frequencies
• positive correlation for skewness*kurtosis; for both measures same ranking of guises by size of effect (NIQ least, HEL most)
Conclusions
Speech production
• Misarticulation: physiological and somatosensory effects, e.g. lip/nose contact, restricted jaw movement, skin stretching (Fuchs et al., 2010; Haley et al., 2010; Iskarous et al., 2008; Maniwa et al., 2009)
• Articulatory compensation: e.g. increased vocal effort (Coniam, 2005; Sluijter et al., 1997), which may be increased further when auditory self-monitoring is impaired
Speech acoustics
• Interdependence of physiological and physical events in the vocal tract
• Acoustic damping effects: mask materials are assumed to act like a low-pass filter which attenuates energy in higher frequency bands (Watt et al., 2010; Llamas et al., 2009; Coniam, 2005)
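The low-pass assumption can be connected to the lower centre of gravity reported in the summary with a toy calculation. The sketch below is not a model of any of the fabrics: it assumes a first-order low-pass response with a hypothetical 4 kHz cutoff and a flat input spectrum, and it does not use the measured transmission loss spectra from Llamas et al. (2009).

```python
import math
from typing import List, Sequence


def lowpass_gain(freq_hz: float, cutoff_hz: float) -> float:
    """Magnitude response of a first-order low-pass filter (an assumption)."""
    return 1.0 / math.sqrt(1.0 + (freq_hz / cutoff_hz) ** 2)


def apply_lowpass(
    freqs_hz: Sequence[float], magnitudes: Sequence[float], cutoff_hz: float
) -> List[float]:
    """Scale each spectral magnitude by the low-pass gain at its frequency."""
    return [m * lowpass_gain(f, cutoff_hz) for f, m in zip(freqs_hz, magnitudes)]


def centre_of_gravity(
    freqs_hz: Sequence[float], magnitudes: Sequence[float]
) -> float:
    """Power-weighted mean frequency of a magnitude spectrum, in Hz."""
    weights = [m * m for m in magnitudes]
    total = sum(weights)
    if total == 0:
        raise ValueError("spectrum has no energy")
    return sum(f * w for f, w in zip(freqs_hz, weights)) / total


if __name__ == "__main__":
    freqs = [1000.0 * k for k in range(1, 13)]   # 1 kHz to 12 kHz bins
    flat = [1.0] * len(freqs)                    # idealised diffuse spectrum
    damped = apply_lowpass(freqs, flat, cutoff_hz=4000.0)  # hypothetical cutoff
    print(centre_of_gravity(freqs, flat))
    print(centre_of_gravity(freqs, damped))
```

Because the gain falls with frequency, the damped spectrum always has the lower centre of gravity in this toy setting; real coverings need not behave like a first-order filter.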
Speech perception
Upcoming research: Investigating speech intelligibility when the (visual) speech signal is impaired, i.e. when the mapping between acoustically distinct signals and perceptually consistent categories may be constrained due to
a) acoustic transmission loss caused by the mask material,
b) the auditory consequences of impaired speech production and acoustics,
c) impoverished visual (facial) speech cues.
References
Aleksic, P.S., Katsaggelos, A.K. 2006. Audio-visual biometrics. Proc. IEEE 94(11), 2025-44.
Coniam, D. 2005. The impact of wearing a face mask in a high-stakes oral examination: An exploratory post-SARS study in Hong Kong. Language Assessment Quarterly 2, 235-261.
Fecher, N. 2011a. Spectral properties of fricatives: a forensic approach. Proc. of the 4th ISCA Tutorial and Research Workshop on Experimental Linguistics, May 25-27, Paris, France, 71-74.
Fecher, N., Watt, D. 2011b. Speaking under cover: The effect of face-concealing garments on spectral properties of fricatives. Proc. of the 17th International Congress of Phonetic Sciences, Hong Kong, August 2011 (accepted).
Flipsen, P. Jr., Shriberg, L., Weismer, G., Karlsson, H., McSweeny, J. 1999. Acoustic characteristics of /s/ in adolescents. JSLHR 42, 663-677.
Fuchs, S., Weirich, M., Kroos, C., Fecher, N., Pape, D., Koppetsch, S. 2010. Time for a shave? Does facial hair interfere with visual speech intelligibility? In: Fuchs, S., Hoole, P., Mooshammer, C., Zygis, M. (eds.). Between the regular and the particular in speech and language. Frankfurt/M.: Peter Lang, 247-264.
Haley, K.L., Seelinger, E., Mandulak, K.C., Zajac, D.J. 2010. Evaluating the spectral distinction between sibilant fricatives through a speaker-centered approach. Journal of Phonetics 38(4), 548-554.
Iskarous, K., Shadle, C., Proctor, M. 2008. Evidence for the dynamic nature of fricative production: American English /s/. Proc. of the 8th Int. Seminar on Speech Production, Strasbourg, France, 405-408.
Jongman, A., Wayland, R., Wong, S. 2000. Acoustic characteristics of English fricatives. JASA 108(3), 1252-63.
Llamas, C., Harrison, P., Donnelly, D., Watt, D. 2009. Effects of different types of face coverings on speech acoustics and intelligibility. York Papers in Linguistics (Series 2) 9, 80-104.
Maniwa, K., Jongman, A., Wade, T. 2009. Acoustic characteristics of clearly spoken English fricatives. JASA 125(6), 3962-73.
Shadle, C., Mair, S.J. 1996. Quantifying spectral characteristics of fricatives. Proc. of Interspeech, Philadelphia, 1521-24.
Sluijter, A.M.C., van Heuven, V.J., Pacilly, J.J.A. 1997. Spectral balance as a cue in the perception of linguistic stress. JASA 101(1), 503-513.
Tabain, M., Watson, C. 1996. Classification of fricatives. Proc. 6th Aust. Int. Conf. Speech Sci. Technol., Adelaide, 623-628.
Watt, D., Llamas, C., Harrison, P. 2010. Differences in perceived sound quality between speech recordings filtered using transmission loss spectra of selected fabrics. Talk given at the IAFPA Conference 2010, Trier, Germany.
Zhang, C., Tan, T. 2008. Voice disguise and automatic speaker recognition. Forensic Science International 175(2-3), 118-122.