ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Self-supervised pre-training
• Single-modal pre-training
  • Image: Jigsaw, CPC, MoCo, SimCLR
  • Video: Shuffle and Learn, Video GAN
  • Text: Word2Vec, GPT, BERT
• Multi-modal pre-training
  • Image-text: ViLBERT, LXMERT, VisualBERT, VL-BERT, UNITER, Unified VLP
  • Video-text: HowTo100M, VideoBERT, CBT, MIL-NCE
Video and text pre-training
• Instructional videos
  • A natural source for video and text representation learning
  • Instructions are available from ASR
  • Diverse domains
    • Cooking
    • Assembling furniture
Video and text pre-training
• HowTo100M
Miech et al., HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, ICCV 2019.
Video and text pre-training
• VideoBERT
Sun et al., VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019.
ActBERT
• Decouple verbs and nouns
  • "Add spinach" -> verb: "add", noun: "spinach"
• Verb labels are extracted from the description
  • Train a 3D ConvNet for verb classification
• Object labels are produced by a pre-trained Faster R-CNN
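The verb/noun decoupling above can be sketched as a simple tokenize-and-split step. This is a minimal illustration only: the tiny verb lexicon below is an assumption for the example, whereas ActBERT derives verb labels from a POS-tagged narration and object labels from Faster R-CNN detections.

```python
# Illustrative sketch of decoupling verbs and nouns from an ASR description.
# The verb lexicon is a stand-in assumption; the actual pipeline tags the
# narration and uses detector outputs for nouns/objects.
VERBS = {"add", "rotate", "cut", "pour", "mix"}

def decouple(description):
    """Split a short instruction into (verb tokens, noun tokens)."""
    tokens = description.lower().split()
    verbs = [t for t in tokens if t in VERBS]
    nouns = [t for t in tokens if t not in VERBS]
    return verbs, nouns

print(decouple("Add spinach"))  # (['add'], ['spinach'])
```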
Tangled transformer
[Figure: ActBERT's tangled transformer builds a joint embedding from three input streams: clip-level action features (action model), region features, and language features (language model / BERT); example narration: "Rotate shrimp balls"]
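The three input streams in the slide above can be sketched as one token sequence with per-modality type embeddings, as in BERT-style joint models. This is a hedged sketch: the dimensions, random features, and simple additive type embeddings are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Sketch: assemble action, region, and language tokens into one sequence.
# Sizes are illustrative assumptions (the paper uses BERT-scale dimensions).
D = 8
rng = np.random.default_rng(0)

action = rng.normal(size=(2, D))   # clip-level 3D-ConvNet action features
regions = rng.normal(size=(3, D))  # Faster R-CNN region features
words = rng.normal(size=(4, D))    # word embeddings for the narration

# One type embedding per modality, broadcast over that modality's tokens,
# so the transformer can tell the streams apart.
type_emb = rng.normal(size=(3, D))
seq = np.concatenate([
    action + type_emb[0],
    regions + type_emb[1],
    words + type_emb[2],
])
print(seq.shape)  # (9, 8)
```

The joint sequence would then be fed to transformer layers; the "tangled" design additionally lets each modality's stream attend to the others rather than using a single undifferentiated stack.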