Understanding human speech recognition: Reverse-engineering the engineering solution using EMEG and RSA
Cai Wingfield, CSLB, Department of Psychology
Workshop on Neurocomputation: From Brains to Machines, University of Cambridge, 25 November 2015
Brains and Machines
▸ We’ve seen from previous speakers how:
▸ Machine systems are designed to perform the same tasks as humans.
▸ The architectures of machine models of (e.g.) vision may relate to those of biological systems.
▸ Using methods such as RSA, intermediate-level derived representations in one can be compared to those in the other.
Speech and vision
▸ Unlike visual objects, speech stimuli unfold over time.
▸ There’s no standard neurocomputational model of speech comprehension.
▸ Humans alone amongst animals have this faculty.
▸ The designs of the most effective artificial systems don’t tend to relate to biological models.
▸ However, machines do provide a computational model of the process.
Speech recognition
▸ Both human brains and machines can recognise speech accurately: each transforms raw acoustic input (e.g. “what a lovely day”) into abstract word “objects”.
▸ Artificial speech recognition (ASR) systems are nearly as good as humans.
▸ In brains, this is mediated by some complex, poorly understood neurobiological process.
▸ We will compare intermediate-level representations in an ASR system and in human auditory cortex using RSA.
HTK: GMM-HMM
[Diagram: triphone HMMs (e.g. [sil-w-oh], [w-oh-t], [oh-t-sil]) with GMM emission probabilities over acoustic frames map the input onto the word “WHAT”; the triphone label inventory runs from [sil-aa-b] to [uh-zh-sil].]
Young et al. (1997), The HTK Book
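To make the GMM emission step concrete, here is a minimal numpy/scipy sketch (illustrative only, not HTK code: the mixture count, dimensionality and all parameters are made up) of the log-probability a single triphone state’s Gaussian mixture assigns to one acoustic frame:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical GMM for one triphone HMM state, e.g. [sil-w-oh]. In a
# GMM-HMM system each state holds a mixture of Gaussians over acoustic
# feature frames; all parameters here are randomly generated.
rng = np.random.default_rng(0)
n_mix, n_dim = 8, 39                    # 8 mixtures over 39-dim MFCC+Δ+Δ² frames
weights = np.full(n_mix, 1.0 / n_mix)   # mixture weights, sum to 1
means = rng.standard_normal((n_mix, n_dim))
variances = np.ones((n_mix, n_dim))     # diagonal covariances

def log_emission(frame):
    """log p(frame | state) = log Σ_m w_m · N(frame; μ_m, diag(σ²_m))."""
    log_terms = [
        np.log(weights[m])
        + multivariate_normal.logpdf(frame, means[m], np.diag(variances[m]))
        for m in range(n_mix)
    ]
    return np.logaddexp.reduce(log_terms)

print(log_emission(rng.standard_normal(n_dim)))
```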
Searchlight GLM RSA
▸ At each searchlight patch, the data RDM is modelled as a weighted sum of dynamic phonetic model RDMs derived from HTK’s states:
RDM_data = β_[ɑ] · RDM_[ɑ] + β_[æ] · RDM_[æ] + … + β_[z] · RDM_[z] + E
▸ The fitted coefficients β_[ɑ], …, β_[z] give the contributions of the individual phonetic models.
Su et al. (2014), Frontiers in Neuroscience
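As an illustration of the GLM step (a Python sketch, not the authors’ code; the published analysis may constrain or regularise the fit differently), the vectorised data RDM at one searchlight patch is regressed onto the vectorised phone-model RDMs, yielding one β per phone model:

```python
import numpy as np

# Hypothetical shapes: 400 words give 400x400 RDMs; we work with the
# vectorised upper triangle. All RDM values here are random placeholders.
rng = np.random.default_rng(0)
n_words, n_phones = 400, 40
n_pairs = n_words * (n_words - 1) // 2

model_rdms = rng.random((n_phones, n_pairs))   # one dynamic RDM per phone model
data_rdm = rng.random(n_pairs)                 # data RDM at one searchlight patch

# Fit RDM_data = Σ_p β_p · RDM_p + E by ordinary least squares.
X = model_rdms.T                               # (n_pairs, n_phones)
beta, *_ = np.linalg.lstsq(X, data_rdm, rcond=None)
print(beta.shape)                              # (40,): one β per phone model
```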
Evidence for sensitivity to phonetic features
[Figure: normalised classifier weights (0 to 1) for /ba/ vs /da/, /da/ vs /ga/ and /ba/ vs /ga/ over cortical electrode grids in four subjects, with dorsal and anterior axes marked.]
Chang et al. (2010), Nature Neuroscience; Mesgarani et al. (2014), Science
Evidence for sensitivity to phonetic features
[Table: binary phone-by-feature matrix. Columns: the phone inventory in IPA and HTK notation (aa, ae, ah, ao, aw, ay, b, ch, d, ea, eh, er, ey, f, g, hh, ia, ih, iy, jh, k, l, m, n, ng, oh, ow, oy, p, r, s, sh, t, th, uh, uw, v, w, y, z). Rows: features grouped as broad categories (Sonorant, Voiced, Obstruent), place (Labial, Coronal, Dorsal), manner (Plosive, Fricative, Sibilant, Nasal), vowel frontness (Front, Central, Back), vowel closeness (Close, Close-mid, Open-mid, Open), and Rounded; a 1 marks that a phone carries the feature.]
Searchlight GLM RSA: from phone models to feature maps
▸ For each feature f, a map of feature fit is computed from the phone-model coefficients: fit(f) = χ_f · β, where χ_f is the binary row of the phone-by-feature table marking which phone models carry feature f, and β is the vector of fitted coefficients β_[ɑ], …, β_[z] (the contributions of the individual phonetic models).
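A minimal sketch of the feature-fit computation, assuming β from the GLM above and the binary table as χ (all values below are random placeholders):

```python
import numpy as np

# Hypothetical: beta from the searchlight GLM above (one coefficient per
# phone model) and chi, the binary feature-by-phone matrix from the table.
rng = np.random.default_rng(0)
n_phones, n_features = 40, 18
beta = rng.random(n_phones)                        # β_[ɑ], ..., β_[z] at one patch
chi = rng.integers(0, 2, (n_features, n_phones))   # row χ_f: which phones carry f

# fit(f) = χ_f · β for every feature at once; repeating this over all
# searchlight patches and time points yields a fit map per feature.
feature_fit = chi @ beta                           # shape (18,)
print(feature_fit)
```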
Speech recognition: results
▸ 16 subjects, 400 words, EMEG.
▸ Most features we tested showed significant fit in auditory cortex (bilateral HG, STG, STS), in the [100, 170] ms window.
▸ Broad-category features fit best on the right; regions on the left tended to be more focussed.
▸ Within-category features showed fits bilaterally.
Wingfield et al. (in prep.)
Moving forward: DNN-based ASR (work in progress)
▸ DNNs have proved very effective in the visual domain.
▸ Hidden-layer representations provide “bottom-up” features which are used to disambiguate speech.
HTK: DNN-HMM
[Diagram: MFCCs with Δ and Δ² coefficients, computed over 25 ms frames with ±40 ms of context, feed a 720-unit fully-connected input, hidden layers of 1000 units, a 26-unit bottleneck (BN) layer, and an output layer of ~6000 triphone-state labels.]
Zhang & Woodland (2015), submission to InterSpeech
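For concreteness, a minimal PyTorch sketch of a bottleneck DNN with these layer sizes (the sizes are read off the diagram; the activation functions, frame stacking and training setup are assumptions, and HTK’s actual implementation differs):

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """Bottleneck DNN acoustic model, sized after the slide's diagram:
    720 stacked-feature inputs -> 1000-unit hidden layers -> 26-unit
    bottleneck -> ~6000 triphone-state outputs."""
    def __init__(self, n_in=720, n_hidden=1000, n_bn=26, n_states=6000):
        super().__init__()
        self.pre_bn = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_bn),            # bottleneck (BN) layer
        )
        self.post_bn = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(n_bn, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_states),        # triphone-state logits
        )

    def forward(self, x):
        bn = self.pre_bn(x)       # BN activations, shape (batch, 26)
        logits = self.post_bn(bn)
        return logits, bn         # keep BN activations for RSA analyses

# Usage: a batch of stacked acoustic frames, 720 inputs each (assumed stacking).
frames = torch.randn(32, 720)
logits, bn = BottleneckDNN()(frames)
```

The point of the 26-unit BN layer is that everything the ~6000-way triphone-state classifier needs must be squeezed through a low-dimensional code, which is what makes the BN activations a compact candidate representation to compare against brain data.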
Individual node responses
▸ The BN architecture provides a low-dimensional feature space sufficient to accurately determine 6000+ phonetic labels.
▸ Dynamic inputs elicit dynamic BN responses.
▸ Can we investigate this BN representation space, and compare it to brain representations?
[Figure: activation of BN node 04 over time for each word; colour scale − 0 +.]
Nodes track phonetic features?
[Figure: activation of BN node 04 (words × time) alongside a matching sibilance annotation of the same words; colour scale − 0 +.]
Nodes track phonetic features?
[Figure: activation of BN node 20 (words × time) alongside a matching vowel-frontness annotation of the same words; colour scale − 0 +.]
BN–feature similarity
[Figure: matrix of similarities between BN nodes and phonetic features (colour scale −1 to +1), with an MDS plot of the same similarities.]
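One plausible way to compute such a node–feature similarity matrix and embed it with MDS (a sketch with random placeholder data; the slide does not specify the similarity measure, so Pearson correlation is assumed here):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Hypothetical inputs: BN activations per word (averaged over time) and a
# binary word-by-feature annotation matrix; both randomly generated here.
rng = np.random.default_rng(0)
bn_acts = rng.standard_normal((400, 26))                 # 400 words x 26 BN nodes
features = rng.integers(0, 2, (400, 18)).astype(float)   # 400 words x 18 features

def corr_matrix(a, b):
    """Pearson correlation between every column of a and every column of b."""
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return a.T @ b / a.shape[0]

sim = corr_matrix(bn_acts, features)     # (26 nodes x 18 features), in [-1, +1]

# 2-D MDS embedding of the nodes, using correlation distance between
# their feature-similarity profiles.
dist = squareform(pdist(sim, metric="correlation"))
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(dist)
print(embedding.shape)                   # (26, 2)
```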
Summary
▸ We found evidence of regions of articulatory feature representation in human auditory cortex.
▸ We modelled speech-recognition-relevant features using machine systems which perform the task well.
▸ RSA allows comparison of brain states and machine states at the level of representations.
▸ EMEG records rich brain response data over time, non-invasively.
▸ The processes of sound-to-meaning mapping are still poorly understood.
Acknowledgements
Department of Psychology: Andrew Thwaites, Elisabeth Fonteneau, Cai Wingfield, William Marslen-Wilson
Department of Engineering: Xunying Liu, Chao Zhang, Phil Woodland
Department of Psychiatry: Li Su