Investigating neural representations of spoken language Grzegorz Chrupała
In collaboration with Afra Alishahi, Lieke Gelderloos, Marie Barking, Mark van der Laan
Automatic Speech Recognition: a major success story in Language Technology
Large amounts of fine-grained supervision: transcribed speech, e.g. "I can see you"
Grounded speech perception
Modeling spoken language: induce representations that mediate between the auditory signal and visual semantics. Understand: what representations emerge in the models? How much do they match linguistic analyses? Which parts of the architecture encode what?
Datasets: Flickr8K Audio Caption Corpus (8K images, five audio captions each); MS COCO Synthetic Spoken Captions (300K images, five synthetically spoken captions each); Places Audio Caption Corpus (400K spoken captions).
Project speech and image to a joint space (example captions: "a bird walks on a beam", "bears play in water").
Image retrieval: given a spoken caption such as "a bird walks on a beam", retrieve the matching image from the joint space. Grzegorz Chrupała, Lieke Gelderloos and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal. In ACL.
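As a rough illustration of the joint-space training objective, here is a minimal sketch of a margin-based contrastive loss over a batch of speech and image embeddings (PyTorch). The encoders, dimensionalities, and margin below are illustrative placeholders, not the exact architecture or hyperparameters of the published model.

```python
# Minimal sketch of a speech-image joint embedding objective (PyTorch).
# Encoders and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Encode acoustic frames (batch, time, features) into a fixed-size embedding."""
    def __init__(self, n_features=13, hidden=256, embed=512):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, embed)

    def forward(self, x):
        _, h = self.rnn(x)                      # final hidden state: (1, batch, hidden)
        z = self.proj(h.squeeze(0))             # (batch, embed)
        return nn.functional.normalize(z, dim=1)

class ImageEncoder(nn.Module):
    """Project precomputed image features (e.g. CNN activations) into the same space."""
    def __init__(self, n_features=2048, embed=512):
        super().__init__()
        self.proj = nn.Linear(n_features, embed)

    def forward(self, x):
        return nn.functional.normalize(self.proj(x), dim=1)

def contrastive_loss(speech_emb, image_emb, margin=0.2):
    """Margin-based ranking loss over all in-batch negatives."""
    scores = speech_emb @ image_emb.t()         # (batch, batch) cosine similarities
    pos = scores.diag().unsqueeze(1)            # matching pairs on the diagonal
    cost_im = (margin + scores - pos).clamp(min=0)       # wrong image for a caption
    cost_sp = (margin + scores - pos.t()).clamp(min=0)   # wrong caption for an image
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_im.masked_fill(mask, 0).mean() + cost_sp.masked_fill(mask, 0).mean()
```

In this line of work the speech encoder is typically a recurrent (or convolutional) network over acoustic features and the image encoder projects pretrained CNN features; the ranking loss pulls matching speech-image pairs together and pushes in-batch mismatches apart by at least the margin.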
Further advances Harwath, D., Torralba, A., & Glass, J. (2016). Unsupervised learning of spoken language with visual context. In NeurIPS. Harwath, D., & Glass, J. (2017). Learning Word-Like Units from Joint Audio-Visual Analysis. In ACL. Chrupała, G. (2019). Symbolic inductive bias for visually grounded learning of spoken language. In ACL. Merkx, D., Frank, S. L., & Ernestus, M. (2019). Language learning using Speech to Image retrieval. In Interspeech. Ilharco, G., Zhang, Y., & Baldridge, J. (2019). Large-scale representation learning from visually grounded untranscribed speech. In CoNLL. Havard, W. N., Chevrot, J. P., & Besacier, L. (2019). Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech. In CoNLL.
Levels of representation What aspects of sentences are encoded? Which parts of the architecture encode what?
Homonym disambiguation: utterances containing homonyms (pair/pear, waste/waist, ...). Decide which meaning was present in an utterance: easier if meaning is represented, harder if only form is. Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.
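A minimal sketch of the diagnostic-classifier setup for this task, assuming utterance-level activation vectors and meaning labels are already extracted; the arrays below are random stand-ins for real data.

```python
# Sketch of a diagnostic classifier for homonym disambiguation.
# `activations` are pooled layer activations per utterance and `labels`
# mark which meaning (e.g. pair vs. pear) the utterance contains;
# both are random placeholders here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 512))   # stand-in for real layer activations
labels = rng.integers(0, 2, size=200)       # stand-in for meaning labels

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, activations, labels, cv=5)
print("Homonym disambiguation accuracy: %.3f" % scores.mean())
# High accuracy suggests the layer encodes meaning, not just acoustic form.
```

The same probe, with labels marking which synonym was uttered, carries over to the synonym discrimination task on the next slide.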
[Figure: homonym disambiguation results on Synthetic COCO]
Synonym discrimination Disentangle phonological form and semantics. Discriminate between synonyms in identical context: A girl looking at a photo. A girl looking at a picture. How invariant to phonological form is a representation? Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.
[Figure: synonym discrimination results on Synthetic COCO]
Phoneme discrimination: ABX task (Schatz et al. 2013). A: /si/ B: /mi/ X: /me/. Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.
ABX on Synthetic COCO: especially challenging when the target (B) and distractor (A) belong to the same phoneme class.
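A sketch of how an ABX score can be computed over extracted representations, assuming each triple pairs X with a same-category item B and a different-category distractor A; the vectors and the cosine distance below are placeholders for the actual representations and metric.

```python
# Sketch of ABX phoneme discrimination over extracted representations.
# X counts as correct if it lies closer to B (same category) than to A.
import numpy as np

def abx_accuracy(triples, distance):
    """triples: iterable of (a, b, x) representation vectors, with x and b
    drawn from the same phoneme category."""
    correct = sum(distance(x, b) < distance(x, a) for a, b, x in triples)
    return correct / len(triples)

def cosine_distance(u, v):
    return 1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
triples = [tuple(rng.normal(size=(3, 64))) for _ in range(100)]  # placeholder data
print("ABX accuracy:", abx_accuracy(triples, cosine_distance))
```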
Interim summary: bottom layers encode form, top layers meaning; even the top layers are not completely form-invariant.
Caveats
Phoneme decoding on Synthetic COCO. Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.
Belinkov, Y., Ali, A., & Glass, J. (2019). Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition. In Interspeech.
Phoneme decoding from random networks (Flickr8K)
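A sketch of the random-network sanity check implied here: run the same phoneme-decoding probe on activations from a trained encoder and from an untrained copy with random weights. Everything below (GRU encoder, feature tensors, label array) is a placeholder; the point is the comparison, not the numbers.

```python
# Sketch of the random-network baseline for phoneme decoding.
import torch
import torch.nn as nn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def frame_activations(encoder, mfcc):
    """Run a GRU encoder over acoustic frames and return per-frame activations."""
    with torch.no_grad():
        out, _ = encoder(mfcc)                # (batch, time, hidden)
    return out.reshape(-1, out.size(-1)).numpy()

rng = np.random.default_rng(0)
mfcc = torch.randn(32, 100, 13)               # stand-in for real speech features
phoneme_labels = rng.integers(0, 40, size=32 * 100)  # stand-in frame labels

trained = nn.GRU(13, 256, batch_first=True)   # placeholder: would be loaded from a checkpoint
random_init = nn.GRU(13, 256, batch_first=True)

for name, enc in [("trained", trained), ("random", random_init)]:
    feats = frame_activations(enc, mfcc)
    acc = cross_val_score(LogisticRegression(max_iter=1000), feats,
                          phoneme_labels, cv=3).mean()
    print(f"{name} encoder: phoneme decoding accuracy {acc:.3f}")
# If the gap between the two is small, decoding accuracy alone is not
# evidence that training made the representation more phonemic.
```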
Representational Similarity Analysis. Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4.
RSA: an example. The RSA score is the correlation between the pairwise similarities computed within space A (Sim_A) and within space B (Sim_B).
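A minimal sketch of the RSA score computation, assuming two sets of representations of the same items (e.g. network activations and some reference embeddings); the data below are random placeholders.

```python
# Sketch of Representational Similarity Analysis: compute pairwise
# similarities within each representation space, then correlate the two
# sets of similarities across the same item pairs.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(space_a, space_b, metric="cosine"):
    """space_a, space_b: (n_items, dim) arrays of representations of the
    same n items (the two spaces may have different dimensionality)."""
    sim_a = 1.0 - pdist(space_a, metric=metric)   # condensed pairwise similarities
    sim_b = 1.0 - pdist(space_b, metric=metric)
    rho, _ = spearmanr(sim_a, sim_b)
    return rho

rng = np.random.default_rng(0)
a = rng.normal(size=(50, 512))    # e.g. network activations
b = rng.normal(size=(50, 300))    # e.g. embeddings of the transcriptions
print("RSA score:", rsa_score(a, b))
```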
Structured spaces: RSA applies given a similarity/distance metric WITHIN spaces A and B; no metric BETWEEN A and B is needed. A can be a vector space, while B can be a space of strings/trees/graphs. For an application to syntax, see: Chrupała, G., & Alishahi, A. (2019). Correlating neural and symbolic representations of language. In ACL.
Phonemes with RSA: A – cosine distances between activation vectors; B – edit distances between phonemic transcriptions.
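The same recipe with a structured space B: a sketch correlating pairwise cosine distances over activation vectors with pairwise edit distances over phonemic transcriptions. The toy transcriptions and activations below are placeholders.

```python
# Sketch of RSA between activation space and phonemic-transcription space.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def edit_distance(s, t):
    """Plain Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def rsa_phonemes(activations, transcriptions):
    dist_a = pdist(activations, metric="cosine")   # pairwise cosine distances
    dist_b = np.array([edit_distance(transcriptions[i], transcriptions[j])
                       for i in range(len(transcriptions))
                       for j in range(i + 1, len(transcriptions))])
    rho, _ = spearmanr(dist_a, dist_b)
    return rho

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 64))                          # placeholder activations
trans = [["s", "i"], ["m", "i"], ["m", "e"], ["s", "e"]] # placeholder transcriptions
print("RSA (activations vs. phonemic edit distance):", rsa_phonemes(acts, trans))
```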
Pooling: parameters W, u optimized with respect to RSA scores.
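A sketch of what such a pooling layer with parameters W and u can look like (attention pooling over frame activations); since the pooled vectors are differentiable in W and u, they can in principle be optimized against an RSA-based objective. The shapes and the module below are illustrative, not the exact parameterization used in the work.

```python
# Sketch of attention pooling over frame activations, parameterized by W and u.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool (batch, time, hidden) activations into (batch, hidden) vectors."""
    def __init__(self, hidden):
        super().__init__()
        self.W = nn.Linear(hidden, hidden)
        self.u = nn.Parameter(torch.randn(hidden))

    def forward(self, frames):
        scores = torch.tanh(self.W(frames)) @ self.u        # (batch, time)
        alpha = torch.softmax(scores, dim=1).unsqueeze(-1)  # attention weights
        return (alpha * frames).sum(dim=1)                  # weighted sum over time

pool = AttentionPooling(hidden=256)
frames = torch.randn(8, 100, 256)     # placeholder frame activations
pooled = pool(frames)                 # (8, 256), differentiable w.r.t. W and u
```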
Phonemes with RSA (Flickr8K)
Conclusion, again: baselines and sanity checks are a must; diagnostic classifiers may lack sensitivity to details of the representation; use multiple analytical approaches to cross-check results.
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP https://blackboxnlp.github.io 2018: EMNLP in Brussels 2019: ACL, Florence 2020?
References Grzegorz Chrupała, Lieke Gelderloos and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal. In ACL. Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL. Grzegorz Chrupała and Afra Alishahi. 2019. Correlating neural and symbolic representations of language. In ACL. Grzegorz Chrupała. 2019. Symbolic inductive bias for visually grounded learning of spoken language. In ACL.
Extras
Model settings
Representational similarity: correlations between sets of pairwise similarities according to activations vs. edit operations on text vs. human judgments (SICK dataset).
Decoding speaker attributes (Flickr8K): gender and speaker identity.
Decoding speaker attributes: a substantial amount of speaker information remains in the top layers, especially gender. Idea: disentangle semantics from speaker information?
RSA + Tree Kernels: InferSent (Conneau et al. 2017) trained on NLI; BERT (Devlin et al. 2018) trained on cloze and next-sentence classification; random versions of these.
[Figure: results per BERT layer]