  1. Investigating neural representations of spoken language Grzegorz Chrupała

  2. In collaboration with Afra Alishahi Lieke Gelderloos Marie Barking Mark van der Laan

  3. Automatic Speech Recognition A major success story in Language Technology

  4. Large amounts of fine-grained supervision: "I can see you"

  5. Grounded speech perception

  6. Modeling spoken language  Induce representations between auditory signal and visual semantics.  Understand:  What representations emerge in models?  How much do they match linguistic analyses?  Which parts of the architecture encode what?

  7. Datasets  Flickr8K Audio Caption Corpus  8K images, five audio captions each  MS COCO Synthetic Spoken Captions  300K images, five synthetically spoken captions each  Places Audio Caption Corpus  400K spoken captions

  8. Project speech and image to joint space  "a bird walks on a beam"  "bears play in water"
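The projection into a joint space is typically trained with a margin-based ranking objective: a matched speech-image pair should be more similar than a mismatched one. The sketch below is illustrative (toy embeddings and a hypothetical margin value, not the exact setup of the papers cited here):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(speech, image, distractor, margin=0.2):
    # Zero loss once the matched pair beats the mismatched pair
    # by at least `margin` in cosine similarity.
    return max(0.0, margin - cosine(speech, image) + cosine(speech, distractor))

# Toy embeddings standing in for encoder outputs:
s = [1.0, 0.0]           # speech: "a bird walks on a beam"
img_match = [0.9, 0.1]   # the matching image
img_other = [0.0, 1.0]   # an unrelated image ("bears play in water")
```

With these toy vectors the matched triplet already satisfies the margin, so its loss is zero, while swapping the images yields a positive loss.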

  9. a bird walks on a beam

  10. Image retrieval Grzegorz Chrupała, Lieke Gelderloos and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal. In ACL.

  11. Further advances  Harwath, D., Torralba, A., & Glass, J. (2016). Unsupervised learning of spoken language with visual context. In NeurIPS.  Harwath, D., & Glass, J. (2017). Learning Word-Like Units from Joint Audio-Visual Analysis. In ACL.  Chrupała, G. (2019). Symbolic inductive bias for visually grounded learning of spoken language. In ACL.  Merkx, D., Frank, S. L., & Ernestus, M. (2019). Language learning using Speech to Image retrieval. In Interspeech.  Ilharco, G., Zhang, Y., & Baldridge, J. (2019). Large-scale representation learning from visually grounded untranscribed speech. In CoNLL.  Havard, W. N., Chevrot, J. P., & Besacier, L. (2019). Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech. In CoNLL.

  12. Levels of representation  What aspects of sentences are encoded?  Which parts of the architecture encode what?

  13. Homonym disambiguation  Utterances with homonyms  pair/pear, waste/waist ...  Decide which meaning was present in an utterance.  Easier if meaning is represented, harder if only form. Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.
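The homonym test is a diagnostic probe: given activation vectors for utterances labelled with the meaning that was actually present, check how well that meaning can be recovered from the activations. A minimal sketch, with a nearest-centroid rule standing in for a trained classifier and invented toy activations:

```python
def centroid(vectors):
    # Elementwise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def probe(train, x):
    # train maps a meaning label (e.g. "pair" or "pear") to a list of
    # activation vectors; predict the label whose centroid is nearest.
    cents = {label: centroid(vs) for label, vs in train.items()}
    dist2 = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    return min(cents, key=lambda label: dist2(cents[label], x))

train = {
    "pair": [[0.9, 0.1], [1.0, 0.0]],   # toy activations, "pair" contexts
    "pear": [[0.1, 0.9], [0.0, 1.0]],   # toy activations, "pear" contexts
}
print(probe(train, [0.8, 0.2]))  # prints "pair"
```

If the probe succeeds, the meaning is (at least) linearly separable in the representation; if only phonological form were encoded, the two homonyms would be indistinguishable.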

  14. Synthetic COCO

  15. Synonym discrimination  Disentangle phonological form and semantics.  Discriminate between synonyms in identical context: A girl looking at a photo. A girl looking at a picture.  How invariant to phonological form is a representation? Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.

  16. Synthetic COCO

  17. Phoneme discrimination  ABX task (Schatz et al. 2013)  A: /si/ B: /mi/ X: /me/  Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.

  18. Synthetic COCO ABX  Especially challenging when the target (B) and the distractor (A) belong to the same phoneme class.
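The ABX decision rule itself is simple: X shares its category with B, and a trial counts as correct when the model's representation of X is more similar to B than to the distractor A. A sketch with invented representation vectors (cosine similarity is one common choice of metric):

```python
import math

def cosine(u, v):
    # Cosine similarity between two representation vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def abx_correct(a, b, x):
    # X belongs to B's category; correct when X is closer to B than to A.
    return cosine(x, b) > cosine(x, a)

# Toy vectors (illustrative, not real activations):
a = [1.0, 0.0, 0.0]   # /si/
b = [0.0, 1.0, 0.1]   # /mi/
x = [0.1, 0.9, 0.2]   # /me/ -- same initial phoneme as B
```

Accuracy over many such triples gives the ABX discrimination score.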

  19. Interim summary  Bottom layers encode form; top layers encode meaning.  Even top layers are not completely form-invariant.

  20. Caveats

  21. Synthetic COCO Phoneme decoding Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.

  22. Belinkov, Y., Ali, A., & Glass, J. (2019). Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition. In Interspeech.

  23. Flickr8K Phoneme decoding from random networks

  24. Representational Similarity Analysis Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis- connecting the branches of systems neuroscience. Frontiers in systems neuroscience , 2, 4.

  25. RSA: an example RSA score: correlation between Sim A and Sim B
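Concretely, RSA computes all pairwise similarities among the same items in each of the two spaces, then correlates the two resulting lists. A minimal sketch (Pearson correlation here for simplicity; Spearman rank correlation is also commonly used):

```python
import math

def pearson(xs, ys):
    # Pearson correlation between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rsa_score(items, sim_a, sim_b):
    # items: list of (rep_in_space_A, rep_in_space_B) pairs.
    # Compute within-space pairwise similarities, then correlate.
    idx = range(len(items))
    pairs = [(i, j) for i in idx for j in idx if i < j]
    a = [sim_a(items[i][0], items[j][0]) for i, j in pairs]
    b = [sim_b(items[i][1], items[j][1]) for i, j in pairs]
    return pearson(a, b)
```

Note that `sim_a` and `sim_b` can be entirely different metrics; only the pairwise similarities within each space are ever compared.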

  26. Structured Spaces  RSA applies given a similarity/distance metric WITHIN spaces A and B.  No need for a metric BETWEEN A and B.  A can be a vector space, while B can be a space of strings/trees/graphs.  For application to syntax, see:  Chrupała, G., & Alishahi, A. (2019). Correlating neural and symbolic representations of language. In ACL.

  27. Phoneme with RSA  A – cosine distances between activation vectors  B – edit distances between phonemic transcriptions
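The within-space metric on the symbolic side (space B) is the Levenshtein edit distance between phonemic transcriptions. A self-contained sketch, using illustrative ARPAbet-style phoneme symbols:

```python
def edit_distance(p, q):
    # Levenshtein distance between two phoneme sequences:
    # the minimum number of insertions, deletions, and substitutions
    # needed to turn p into q (standard dynamic-programming table).
    m, n = len(p), len(q)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

print(edit_distance(["k", "ae", "t"], ["b", "ae", "t"]))  # 1
```

The RSA score is then the correlation between these edit distances and the cosine distances computed over the corresponding activation vectors.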

  28. Pooling  Parameters W and u are optimized with respect to RSA scores.
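A hedged sketch of one common form of learned pooling with parameters W and u: score each hidden state h_t with u · tanh(W h_t), softmax-normalize the scores into weights, and take the weighted sum (the exact parameterization in the slides may differ):

```python
import math

def attention_pool(states, W, u):
    # states: list of hidden-state vectors h_t; W: matrix; u: vector.
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # Scalar score per time step: u . tanh(W h_t)
    scores = [dot(u, [math.tanh(z) for z in matvec(W, h)]) for h in states]
    # Numerically stable softmax over the scores.
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of the hidden states.
    dim = len(states[0])
    return [sum(w * h[i] for w, h in zip(weights, states)) for i in range(dim)]

states = [[1.0, 0.0], [0.0, 1.0]]   # toy hidden states h_1, h_2
W = [[1.0, 0.0], [0.0, 1.0]]        # identity, for illustration
u_zero = [0.0, 0.0]                 # zero u => uniform weights => mean pooling
```

With u set to zero the weights are uniform and the result reduces to mean pooling; optimizing W and u against the RSA objective lets the pooling emphasize informative time steps.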

  29. Flickr8K Phonemes with RSA

  30. Conclusion, again  Baselines and sanity checks are a must.  Diagnostic classifiers may lack sensitivity to details of representation.  Use multiple analytical approaches to cross-check results.

  31. BlackboxNLP  Workshop on Analyzing and Interpreting Neural Networks for NLP  https://blackboxnlp.github.io  2018: EMNLP, Brussels  2019: ACL, Florence  2020?

  32. References  Grzegorz Chrupała, Lieke Gelderloos and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal. In ACL.  Afra Alishahi, Marie Barking and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In CoNLL.  Grzegorz Chrupała and Afra Alishahi. 2019. Correlating neural and symbolic representations of language. In ACL.  Grzegorz Chrupała. 2019. Symbolic inductive bias for visually grounded learning of spoken language. In ACL.

  33. Extras

  34. Model settings

  35. Representational Similarity  Correlations between sets of pairwise similarities according to:  activations vs.  edit ops on text  human judgments (SICK dataset)

  36. Decoding speaker attributes (Flickr8K): gender, identity

  37. Decoding speaker attributes  Substantial amount of speaker information in top layers  Especially gender  Idea: disentangle semantics from speaker info?

  38. RSA + Tree Kernels  InferSent (Conneau et al. 2017)  trained on NLI  BERT (Devlin et al. 2018)  trained on cloze and next-sentence classification  Random versions of these

  39. BERT layers
