Identification of voices in disguised speech
Jessica Clark* & Paul Foulkes**
* University of York
** University of York & JP French Associates
pf11@york.ac.uk
IAFPA, Göteborg 2006
0.1 outline
• experiment to test ability of lay listeners to identify disguised familiar voices
• voices have been disguised artificially, as with commercially available voice changers
  – pitch modified
0.2 structure
1. introduction
  – rationale for experiment
2. experimental design
  – speakers
  – listeners
  – Control condition
  – Experimental conditions
3. results
4. discussion & conclusion
1. Introduction
1. Introduction
• technical speaker identification is the most frequent task for the forensic phonetician
• lay identification is also common in legal cases
• many previous studies have thus examined lay listeners’ ability to identify voices and the factors which affect their ability
1.1 previous studies
• identification is not automatic or flawless
• listeners can make errors even with highly familiar voices
  – Ladefoged did not recognise his mother from a short sample (Ladefoged & Ladefoged 1980)
  – flatmates scored only 68% with 10 second samples (Foulkes & Barron 2000)
1.1 previous studies
• identification may be affected by [Bull & Clifford 1984]
  – type of exposure (active/passive)
  – length of sample
  – nature of sample (phone, direct, shouting etc)
  – delay between exposure and test
  – age of listener
  – hearing ability
  – sightedness
  – natural variability across individual listeners
  – specific features of voice
  – degree of familiarity
  – nature and extent of any disguise
1.2 degree of familiarity
• all things equal, more familiar voices are easier to identify
• e.g. Hollien, Majewski & Doherty (1982)
  – listening tests with 10 male voices

listener group   N    % correct (normal condition)
familiar         10   98
trained          47   40
unfamiliar       14   27
1.3 disguise
• all things equal, disguised voices are harder to identify
• e.g. Hollien, Majewski & Doherty (1982)
  – various forms of disguise used

listener group   N    % correct (normal)   % correct (disguised)
familiar         10   98                   79
trained          47   40                   21
unfamiliar       14   27                   18

machine approach (LTAS): 30
1.3 disguise
• previous studies have examined various types of disguise
  – whisper, pencils between teeth, hypernasality, dialect change, rate change, professional mimics
• but little if any work on voice changers
  – hardware based
  – software based
  – easily available
www.crimebusters911.com
www.blazeaudio.com
www.maplin.co.uk
1.3 disguise
• in our study we chose not to use real voice changers, in favour of total control over effects
• pitch shift chosen as a universal function
2. Experimental design
2.1 design outline
• simple design
• listeners asked to identify samples of familiar voices
• Control condition: unmodified stimuli
• 4 Experimental conditions: modified stimuli
2.1 design outline
• degree of familiarity known to affect rate of successful identification
• thus we trained listeners to identify a group of speakers
  – controls degree of familiarity
  – all listeners had exactly the same exposure in terms of length & quality of samples
  – identification task carried out under same conditions
2.2 speakers
• 4 male speakers
  – 16-18 years old
• taken from IViE corpus (Grabe, Post & Nolan 2001)
  – Leeds dialect (nearest to York)
  – reading text of Cinderella story

IViE speaker   Experimental name
JP             Edward
JW             Matthew
MD             Harry
RP             David
2.2 speakers
• training materials created for each speaker
  – c. 90 seconds of Cinderella (302 words)
  – edited out disfluencies, non-speech sounds, long pauses
  – samples normalised for amplitude with Audacity 1.2.5
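The slides do not say which normalisation method was applied in Audacity. As a rough illustration only, the sketch below assumes simple peak normalisation, where every sample is scaled by one gain factor so that the loudest sample reaches a chosen level. The function name `peak_normalise` and the 0.9 target are hypothetical choices, not settings reported in the study.

```python
def peak_normalise(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak.

    Assumes samples are floats in the range -1.0 to 1.0. The target of
    0.9 is an illustrative assumption, not a value from the study.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        # Pure silence: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


# Toy waveform, purely for demonstration.
quiet_sample = [0.1, -0.5, 0.25]
print(peak_normalise(quiet_sample))
```

Applying the same target to every training and test sample keeps overall loudness from acting as a cue to speaker identity.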
2.3 listeners
• 36 listeners
• variety of regional/social backgrounds
• York residents
• age range 19-55
• 10 male, 26 female
2.4 Control condition
• all 36 listeners
1. training phase
  – 4 voices × 90 seconds = c. 6 minutes
  – presented by PowerPoint with speakers’ names
  – Toshiba laptop
  – Aiwa A170 headphones
  – individually in quiet room
2. break
  – 10 minutes
3. listening test
  – 8 stimuli (2 per speaker)
  – duration c. 10 seconds
  – 5 second gap between
  – extracts from other parts of Cinderella story
  – normalised for amplitude with Audacity 1.2.5
  – answer sheet with names
2.5 Experimental conditions
• 4 Experimental conditions
• listening tests same format as Control condition
• but stimuli modified for pitch
• Sound Forge 8.0
  – pitch shift effect
  – accuracy setting ‘high’
  – speech 1 mode
  – preserved durations
2.5 Experimental conditions
(i) +8 semitones
(ii) +4 semitones
(iii) -4 semitones
(iv) -8 semitones
• pitch shifts > 8 semitones were unnatural and partly incomprehensible
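For orientation, a shift of n equal-tempered semitones scales frequency by a factor of 2^(n/12). The sketch below works this out for the four conditions. The 100 Hz reference f0 is an illustrative assumption, not a measurement from the four speakers, and the code says nothing about how Sound Forge implements its pitch shift effect.

```python
def semitone_ratio(n):
    """Frequency ratio for a shift of n equal-tempered semitones."""
    return 2.0 ** (n / 12.0)


REFERENCE_F0_HZ = 100.0  # hypothetical reference value, not from the study

for shift in (+8, +4, -4, -8):
    ratio = semitone_ratio(shift)
    shifted = REFERENCE_F0_HZ * ratio
    print(f"{shift:+d} semitones: ratio {ratio:.3f}, "
          f"{REFERENCE_F0_HZ:.0f} Hz -> {shifted:.1f} Hz")
```

On this scale +8 semitones raises frequency by roughly 59% and -8 semitones lowers it by roughly 37%, so the two extreme conditions are symmetric in semitones but not in Hz.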
2.5 Experimental conditions

listener group   N    conditions (semitones)
A                18   -8, +4
B                18   -4, +8
2.5 Experimental conditions
• listening test 16-92 days after Control test
  – no clear effects for length of delay
• same training as in Control condition
• 10 minute break
• 2 stimuli for familiarisation
• 8 experimental stimuli per condition
  – consecutive runs for + and - stimuli
  – order reversed for half of each group, but no effect
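The counterbalancing described here (two groups of 18, each hearing one lowered and one raised condition, with run order reversed for half of each group) can be sketched as follows. How listeners were actually allocated to groups and orders is not stated, so the alternating allocation and the name `build_schedule` are assumptions made for illustration.

```python
# Conditions per group in semitones, as given in the design.
GROUP_CONDITIONS = {"A": (-8, +4), "B": (-4, +8)}


def build_schedule(n_listeners=36):
    """Assign listeners to groups and run orders.

    Alternating allocation is an assumption; the slides only report
    the group sizes and that order was reversed for half of each group.
    """
    schedule = []
    for i in range(n_listeners):
        group = "A" if i % 2 == 0 else "B"
        order = GROUP_CONDITIONS[group]
        # Every second listener within a group hears the runs reversed.
        if (i // 2) % 2 == 1:
            order = order[::-1]
        schedule.append({"listener": i + 1, "group": group, "order": order})
    return schedule


for entry in build_schedule()[:4]:
    print(entry)
```

With 36 listeners this gives 18 per group and 9 listeners for each run order within a group.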
3. Results
3.1 Control condition
• average correct identification = 4.8/8 (60%)
[Figure: average N correct (0-8) by condition: Minus 8, Minus 4, Control, Plus 4, Plus 8]
3.1 Control condition
• individuals’ range 8 to 0
• 29/36 performed better than chance
[Figure: Control condition, N correct (0-8) for each listener]
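With 4 speakers and 8 test stimuli, a listener guessing at random would be expected to get 8 × 0.25 = 2 correct. The sketch below computes the binomial probability of reaching a given score by guessing alone. It assumes independent guesses with equal probability for each of the four names; the criterion the authors used for "better than chance" is not stated on the slide.

```python
from math import comb


def prob_at_least(k, n=8, p=0.25):
    """Probability of k or more correct out of n by pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_STIMULI = 8
P_GUESS = 1 / 4  # four candidate speakers, assumed equally likely

print(f"expected score by guessing: {N_STIMULI * P_GUESS:.1f}")
for score in range(N_STIMULI + 1):
    print(f"P(score >= {score}) = {prob_at_least(score):.4f}")
```

Under these assumptions a score of 5 or more out of 8 would occur by guessing less than 3% of the time.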
3.2 Experimental conditions
• ** sig. lower than in Control (p < .005, Wilcoxon)
• trend (n.s.) for higher scores in + conditions
[Figure: average N correct (0-8) by condition: Minus 8, Minus 4, Control, Plus 4, Plus 8; the four Experimental conditions are marked **]
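Because each listener took both the Control test and the Experimental tests, the comparison is a paired one, which is what the Wilcoxon signed-rank test addresses. The sketch below computes only the signed-rank sums, not the p-value. The scores in it are made-up toy numbers for demonstration and are not data from this study.

```python
def signed_rank_sums(control, experimental):
    """Return (W+, W-) for paired scores, using average ranks for ties.

    Pairs with zero difference are dropped, as in the classic procedure.
    """
    diffs = [c - e for c, e in zip(control, experimental) if c != e]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j across a run of tied absolute differences.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical scores out of 8 for five imaginary listeners.
toy_control = [5, 6, 4, 7, 3]
toy_shifted = [3, 6, 5, 4, 1]
print(signed_rank_sums(toy_control, toy_shifted))
```

A large W+ relative to W- indicates that scores tended to be higher in the first list; turning the statistic into a p-value needs the exact distribution or a normal approximation, which a statistics package would normally supply.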
[Figure: four panels (-8, +8, -4 and +4 semitones), each showing N correct (0-8) for each listener]
• variability in listener performance, esp. ±4
• majority perform above chance except -8
3.3 variation by listener sex
• women sig. better in Control (p = .008, Mann-Whitney)
  – trend (n.s.) maintained in Experimental tests
  – same pattern reported by Bull & Clifford (1984)
[Figure: N correct (0-8) by condition (Minus 8, Minus 4, Control, Plus 4, Plus 8) for Male and Female listeners; one difference marked **]
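Male and female listeners form two independent groups of unequal size (10 and 26), which is the situation the Mann-Whitney test is designed for. The sketch below computes the U statistics by direct pair counting. The scores are invented toy values used only to show the calculation, not results from this experiment.

```python
def mann_whitney_u(group_x, group_y):
    """Return (U_x, U_y) by counting pairwise wins, ties counting 0.5."""
    u_x = 0.0
    for x in group_x:
        for y in group_y:
            if x > y:
                u_x += 1.0
            elif x == y:
                u_x += 0.5
    u_y = len(group_x) * len(group_y) - u_x
    return u_x, u_y


# Hypothetical scores out of 8 for two small imaginary groups.
toy_group_x = [6, 7, 5]
toy_group_y = [4, 5, 3]
print(mann_whitney_u(toy_group_x, toy_group_y))
```

The two statistics always sum to the number of pairs, and the smaller one is the value usually compared against critical tables.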
3.4 summary
• as predicted, identification rates were lower with disguised voices
  – lowest scores with most extreme form of disguise (±8 semitones)
• identification rates slightly better when pitch shifted up than down
• trend for women to perform better than men
• variability across listeners
4. Discussion & conclusion
4. discussion & conclusion
• tests reported here were not forensically realistic
• results may be affected by e.g.
  – degree of familiarity with voice
  – content of sample (vocabulary, syntax etc)
  – conditions of exposure (stress etc)
  – specific form of artificial disguise
    • software, hardware system
    • combination of effects
4. discussion & conclusion
• considerable variation in listeners’ scores
  – courts should not assume all witnesses are equally good at such tasks
  – supports broader principle that lay witnesses should be tested in their ability to identify a voice
4. discussion & conclusion
• but even marked disguise was not catastrophic for listeners
• a broadly positive conclusion for lay speaker identification
  – a reasonable chance of identifying familiar voices
4. discussion & conclusion
• but a less positive conclusion with respect to the use of voice changers as a means of protecting vulnerable witnesses giving evidence
• more extreme forms of modification may affect intelligibility & naturalness
• less extreme forms of modification may render witness’s voice recognisable
• different modifications for different voices?
4. discussion & conclusion
• as ever…
• more work is needed
thanks / tack
thanks to Peter French, Phil Harrison, Robin How