
ActBERT: Learning Global-Local Video-Text Representations, by Linchao Zhu



  1. ActBERT: Learning Global-Local Video-Text Representations • Linchao Zhu

  2. Self-supervised pretraining
     • Single-modal pretraining
       • Image: Jigsaw, CPC, MoCo, SimCLR
       • Video: Shuffle and Learn, Video GAN
       • Text: Word2Vec, GPT, BERT
     • Multi-modal pretraining
       • Image-text: ViLBERT, LXMERT, VisualBERT, VL-BERT, UNITER, Unified VLP
       • Video-text: HowTo100M, VideoBERT, CBT, MIL-NCE

  3. Video and text pre-training
     • Instructional videos
       • A natural source for video and text representation learning
       • Instructions are available from ASR
     • Diverse domains
       • Cooking
       • Assembling furniture

  4. Video and text pre-training
     • HowTo100M: Miech et al., "HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips," ICCV 2019.

  5. Video and text pre-training
     • VideoBERT: Sun et al., "VideoBERT: A Joint Model for Video and Language Representation Learning," ICCV 2019.

  6. ActBERT
     • Decouple verbs and nouns
       • "Add spinach" -> verb: "add", noun: "spinach"
     • The verb label is extracted from the description.
     • A 3D ConvNet is trained for verb classification.
     • The object label can be produced by a pre-trained Faster R-CNN.
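
The verb/noun decoupling on this slide can be illustrated with a toy sketch. It is not the method used in ActBERT: the slide only says the verb label is extracted from the description. The `KNOWN_VERBS` lexicon, the `STOPWORDS` set, and the `decouple` helper are hypothetical names made up for this example, and a real pipeline would more likely rely on a part-of-speech tagger.

```python
# Toy sketch of splitting an instruction into a verb and object words.
# KNOWN_VERBS and STOPWORDS are hypothetical, hand-written sets used only
# for illustration; they are not taken from the presentation.
KNOWN_VERBS = {"add", "rotate", "cut", "mix", "pour", "stir"}
STOPWORDS = {"the", "a", "an", "to", "of", "and", "in", "into", "with"}


def decouple(description):
    """Return (verb, nouns) for a short instruction.

    The verb is the first token found in KNOWN_VERBS (None if there is none).
    The "nouns" are simply the remaining non-stopword tokens, which is a
    crude stand-in for real noun detection.
    """
    tokens = [t.strip(".,!?").lower() for t in description.split()]
    tokens = [t for t in tokens if t]
    verb = next((t for t in tokens if t in KNOWN_VERBS), None)
    nouns = [t for t in tokens if t != verb and t not in STOPWORDS]
    return verb, nouns


print(decouple("Add spinach"))          # ('add', ['spinach'])
print(decouple("Rotate shrimp balls"))  # ('rotate', ['shrimp', 'balls'])
```

In the setup described on the slide, the extracted verb would serve as a training label for the 3D ConvNet, while object labels come from the detector rather than from the text.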

  7. Tangled transformer
     • [Figure: two panels, "Joint embedding BERT" and "ActBERT with Tangled Transformer". Input labels: action feature, language model, clip-level feature, region feature. Example text: "Rotate shrimp balls".]
