
Vision and Language Representation Learning: Self-Supervised Pretraining and Multi-Task Learning (PowerPoint presentation by Jiasen Lu, April 21, 2020)



  1. Vision and Language Representation Learning: Self-Supervised Pretraining and Multi-Task Learning. Jiasen Lu, April 21, 2020.

  2. Vision and Language tasks: Visual Question Answering, Image Captioning, Visual Commonsense Reasoning, Referring Expressions. [Antol et al. 2015, Vinyals et al. 2015, Zellers et al. 2018, Yu et al. 2018]

  3. Vision and Language tasks (cont.): Visual Question Answering, Image Captioning, Visual Commonsense Reasoning (highlighted), Referring Expressions. [Antol et al. 2015, Vinyals et al. 2015, Zellers et al. 2018, Yu et al. 2018]

  4. Visual Grounding. Example: C: "A bunch of red and yellow flowers on a branch." Q: "What type of plant is this?" A: "Banana." Goal: a common model for visual grounding that can be leveraged across a wide array of vision-and-language tasks. [Shen et al. 2018]

  5. Pretrain-Transfer. Vision: Object Detection, Semantic Segmentation, Pose Estimation. Language: Question Answering, Commonsense Inference, Sentiment Analysis. [Deng et al. 2009, Devlin et al. 2018]

  6. Pretrain-Transfer: the Conceptual Captions dataset.
      • Aligned image-caption pairs.
      • 3.3 million images, compared to 0.12 million in COCO Captions.
      • Automatically collected.
      Example alt-text: "Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee." Conceptual Caption: "pop artist performs at the festival in a city." [Sharma et al. 2018]

  7. BERT.
      [Diagram: BERT over a sentence pair, <CLS> Tok 1 ... Tok N <SEP> (Sentence A) followed by Tok 1 Tok 2 ... <SEP> (Sentence B), producing a hidden state per position.]
      Running example from the Conceptual Captions dataset: alt-text "Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee." vs. Conceptual Caption "pop artist performs at the festival in a city." [Sharma et al. 2018, Devlin et al. 2018]

  8. ViLBERT.
      [Diagram: the same BERT setup with some input tokens replaced by <MASK>; the model predicts the masked tokens from their output hidden states.] [Sharma et al. 2018, Devlin et al. 2018]

  9. Single-Stream Model.
      [Diagram: a single BERT over the concatenation of sentence tokens and image regions, <CLS> sentence <SEP> image <SEP>.] [Sharma et al. 2018, Devlin et al. 2018]

  10. Single-Stream Model.
      [Diagram: the same single-stream BERT with some text tokens and image regions replaced by <MASK>, predicted from the output hidden states.] [Sharma et al. 2018, Devlin et al. 2018]

  11. ViLBERT. Problem: different modalities may require different levels of abstraction.
      [Diagram contrasting the linguistic stream and the visual stream.] [He et al. 2015]

  12. ViLBERT. Solution: a two-stream model that processes the visual and linguistic inputs separately.
      [Diagram: a linguistic BERT (L-BERT) with l layers over <CLS> Tok1 Tok2 ... <SEP>, and a visual BERT (V-BERT) with m layers over <IMG> and the image regions.]

  13. ViLBERT. Problem: how to fuse the two modalities? Solution: use co-attention [Lu et al. 2016] to fuse information between the two streams.

  14. ViLBERT. Co-attention [Lu et al. 2016] fuses information between the two streams: each stream's queries attend over the other stream's keys and values (a minimal sketch follows below).
      [Diagram of the co-attention transformer block.]
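Below is a minimal PyTorch sketch of the co-attention idea. The class name, dimensions, and single-sublayer structure are illustrative assumptions, not ViLBERT's actual implementation (the real model interleaves co-attention with standard self-attention and feed-forward sublayers in each stream).

```python
# Minimal sketch of a co-attention transformer layer (illustrative, not ViLBERT's code).
# Queries come from one stream; keys/values come from the other, so each
# stream is conditioned on the opposite modality.
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Each stream attends over the *other* modality.
        self.txt_attends_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, txt, img):
        # txt: (B, T_txt, dim) linguistic stream; img: (B, T_img, dim) visual stream.
        txt_out, _ = self.txt_attends_img(query=txt, key=img, value=img)
        img_out, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        # Residual connection + layer norm, as in a standard transformer block
        # (feed-forward sublayers omitted for brevity).
        return self.norm_txt(txt + txt_out), self.norm_img(img + img_out)
```

Stacking such layers on top of the separate l-layer L-BERT and m-layer V-BERT encoders from slide 12 gives the two-stream-with-fusion architecture.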

  15. Pre-training Objectives.
      Masked multi-modal modelling:
      • Follows the masked LM objective in BERT.
      • 15% of the words or image regions are selected for prediction.
      • Linguistic stream: 80% of the time, replace with [MASK]; 10% of the time, replace with a random word; 10% of the time, keep the word unchanged.
      • Visual stream: 80% of the time, replace the region feature with a zero vector.
      Multi-modal alignment prediction:
      • Predict whether the image and the caption are aligned or not. (A sketch of the masking recipe follows below.)
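As a concrete illustration of the masking recipe above, here is a hedged sketch in plain Python; the function names, the `vocab` argument, and the list-based region features are assumptions for illustration only.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_words(tokens, vocab, select_prob=0.15):
    """Linguistic stream: of the ~15% selected words, replace with [MASK]
    80% of the time, a random word 10%, and keep unchanged 10%.
    Returns the corrupted tokens and the prediction targets."""
    out, targets = [], []
    for tok in tokens:
        if random.random() < select_prob:
            targets.append(tok)            # model must reconstruct this token
            r = random.random()
            if r < 0.8:
                out.append(MASK_TOKEN)
            elif r < 0.9:
                out.append(random.choice(vocab))
            else:
                out.append(tok)
        else:
            targets.append(None)           # position not predicted
            out.append(tok)
    return out, targets

def mask_regions(region_feats, select_prob=0.15, zero_prob=0.8):
    """Visual stream: selected regions have their feature vector zeroed
    80% of the time (per the slide); the model then predicts a semantic
    class distribution for each selected region."""
    out, selected = [], []
    for feat in region_feats:
        pick = random.random() < select_prob
        selected.append(pick)
        if pick and random.random() < zero_prob:
            out.append([0.0] * len(feat))  # zeroed feature vector
        else:
            out.append(feat)
    return out, selected
```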

  16. Visualizations. Example caption: "A boat covered in flowers near the market." [Sharma et al. 2018]

  17. Sentence → Image attention.
      [Visualization of attention heads H0 through H7 at Layer 0 and Layer 5, rendered with BertViz: https://github.com/jessevig/bertviz]

  18. Sentence → Image attention (cont.).
      [Visualization of attention heads H0 through H7 at Layer 0 and Layer 5, rendered with BertViz: https://github.com/jessevig/bertviz]

  19. Image → Sentence attention.
      [Visualization of attention heads H0 through H7 at Layer 0 and Layer 5, rendered with BertViz: https://github.com/jessevig/bertviz]

  20. Image → Sentence attention (cont.).
      [Visualization of attention heads H0 through H7 at Layer 0 and Layer 5, rendered with BertViz: https://github.com/jessevig/bertviz]

  21. Fine-tuning Procedure.
      [Diagram: pre-training the Vision & Language BERT on image-text pairs from Conceptual Captions with a masked sentence and masked image regions ("Man shopping ..."), then fine-tuning the same model on image-question pairs ("What is the ...") for downstream tasks such as VQA, VCR, and referring expressions.]

  22. Tasks. [Antol et al. 2015, Zellers et al. 2018, Yu et al. 2016, Plummer et al. 2015]

  23. Results.
      [Bar chart comparing model variants on VQA (test-dev), VCR Q->A (val), RefCOCO+ (val), and Image Retrieval (test); bar values, left to right: 70.55, 58.2, 70.22, 54.04, 72.34, 52.73, 68.85, 68.93, 69.21, 49.48, 68.61, 47.27, 48.6, 65.33, 65.64, 45.5, 65.9, 43.1.]

  24. Concurrent Work. [Li et al. 2019, Tan et al. 2019, Li et al. 2019, Su et al. 2019, Zhou et al. 2019, Chen et al. 2019]

  25. Summary. Task-agnostic visiolinguistic representation pretraining for visual grounding:
      • Introduces pretrain-transfer to vision-and-language tasks.
      • Achieves SOTA on multiple vision-and-language tasks.
      Limitation: the model can still learn inconsistent grounding through task-specific fine-tuning.
      Next step: train multiple vision-and-language tasks together (multi-task V&L).

  26. Multi-Task V&L Learning: one model for V&L, based on ViLBERT.
      Problems:
      • Inconsistent grounding from task-specific fine-tuning.
      • Only four V&L tasks so far.
      • The model is huge and prone to overfitting.
      What we want:
      • Test on more tasks.
      • Consistent grounding across tasks.
      • Explore the limits of the model.
      Task groups:
      • Referring Expression: RefCOCO, RefCOCO+, RefCOCOg, Visual 7W, GuessWhat.
      • VQA: VQA, Genome QA, GQA.
      • Image Description: caption-based retrieval (COCO).
      • V&L Verification: NLVR2, Visual Entailment.

  27. Multi-Task V&L Learning: model improvements over ViLBERT.

  28. Multi-Task V&L Learning: model improvements over ViLBERT.
      • Masked multi-modal modelling only for aligned image-caption pairs.
      [Diagram: L-BERT and V-BERT over an image and its aligned caption containing a <MASK> token.]

  29. Multi-Task V&L Learning: model improvements over ViLBERT.
      • Masked multi-modal modelling only for aligned image-caption pairs.
      • Masking overlapped image regions (IoU > 0.4); see the sketch below.
      [Diagram: L-BERT and V-BERT over an image and its aligned caption containing a <MASK> token.]
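A small sketch of the overlap rule, assuming boxes as (x1, y1, x2, y2) tuples; the 0.4 threshold comes from the slide, everything else (function names, the seed-region formulation) is illustrative. The motivation, presumably, is that detector proposals overlap heavily, so an unmasked near-duplicate region would leak the masked region's content.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def regions_to_mask(boxes, seed_idx, iou_thresh=0.4):
    """When region `seed_idx` is chosen for masking, also mask every
    region whose IoU with it exceeds the threshold."""
    return [i for i, b in enumerate(boxes)
            if i == seed_idx or iou(boxes[seed_idx], b) > iou_thresh]
```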

  30. Multi-Task V&L Learning: multi-task vision and language learning.
      • Use a different head per task, but similar tasks share the same head. Heads: VQA/Genome QA, GQA, Retrieval, NLVR, Visual Entailment.
      [Diagram: L-BERT and V-BERT with task-specific heads on top of the joint representation.]

  31. Multi-Task V&L Learning: multi-task vision and language learning.
      • Use a different head per task, but similar tasks share the same head. Heads: VQA/Genome QA, GQA, Retrieval, NLVR, Visual Entailment, Referring Expression.
      [Diagram: as above, with the Referring Expression head added.]

  32. Multi-Task V&L Learning: multi-task vision and language learning.
      • Use a different head per task, but similar tasks share the same head.
      • Add a <TSK> token to the input for multi-task training (a sketch of both ideas follows below).
      [Diagram: L-BERT and V-BERT with <TSK> prepended to the text input.]
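Putting the last two bullets together, here is a hedged PyTorch sketch of how per-task-group heads and a learned <TSK> embedding might be wired up; `trunk`, the head names, and the output sizes (e.g. 3129 VQA answer classes) are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class MultiTaskVLModel(nn.Module):
    """Sketch: one shared two-stream trunk, one output head per task
    *group* (similar tasks share a head), and a learned per-task <TSK>
    embedding prepended to the text input. `trunk` stands in for the
    two-stream ViLBERT encoder and is assumed to return a pooled
    joint representation of shape (B, dim)."""
    def __init__(self, trunk, dim=768, num_tasks=12, head_dims=None):
        super().__init__()
        self.trunk = trunk
        # One learned embedding per task, used as the <TSK> token.
        self.task_embedding = nn.Embedding(num_tasks, dim)
        # Similar tasks share a head, e.g. VQA and Genome QA both use "qa".
        head_dims = head_dims or {"qa": 3129, "retrieval": 1, "verification": 2}
        self.heads = nn.ModuleDict(
            {name: nn.Linear(dim, out) for name, out in head_dims.items()})

    def forward(self, txt_embeds, img_feats, task_id, head_name):
        # txt_embeds: (B, T, dim) already-embedded text tokens.
        tsk = self.task_embedding(task_id).unsqueeze(1)  # (B, 1, dim)
        txt = torch.cat([tsk, txt_embeds], dim=1)        # prepend <TSK>
        pooled = self.trunk(txt, img_feats)              # (B, dim)
        return self.heads[head_name](pooled)
```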

  33. Multi-Task V&L Learning: multi-task vision and language learning.
      • Use a different head per task, but similar tasks share the same head.
      • Add a <TSK> token for multi-task training.
      • Dynamic Stop and Go (sketched below).
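The slide only names Dynamic Stop and Go; the sketch below shows one plausible reading of the scheduling idea from the multi-task work: a task whose validation metric has plateaued is throttled ("stop") and resumes full training ("go") if its metric decays. All thresholds and the throttling ratio here are placeholder assumptions, not values from the paper.

```python
class StopAndGoScheduler:
    """Hedged sketch of dynamic stop-and-go task scheduling. A task that
    has converged (validation improvement below `improve_eps`) enters
    "stop" mode and is only trained every `stop_stride` iterations; it
    returns to "go" mode if its validation metric drops by more than
    `decay_eps`. All constants are illustrative placeholders."""
    def __init__(self, tasks, improve_eps=0.001, decay_eps=0.005, stop_stride=4):
        self.state = {t: "go" for t in tasks}
        self.best = {t: float("-inf") for t in tasks}
        self.improve_eps = improve_eps
        self.decay_eps = decay_eps
        self.stop_stride = stop_stride

    def update(self, task, val_score):
        if val_score > self.best[task] + self.improve_eps:
            self.best[task] = val_score   # still improving: keep training
            self.state[task] = "go"
        elif val_score < self.best[task] - self.decay_eps:
            self.state[task] = "go"       # performance decayed: resume training
        else:
            self.state[task] = "stop"     # converged: throttle updates

    def should_train(self, task, iteration):
        # Stopped tasks still get an occasional update to avoid drift.
        return self.state[task] == "go" or iteration % self.stop_stride == 0
```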
