ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Self-supervised pre-training
• Single-modal pre-training
  • Image: Jigsaw, CPC, MoCo, SimCLR
  • Video: Shuffle and Learn, Video GAN
  • Text: Word2Vec, GPT, BERT
• Multi-modal pre-training
  • Image-text: ViLBERT, LXMERT, VisualBERT, VL-BERT, UNITER, Unified VLP
  • Video-text: HowTo100M, VideoBERT, CBT, MIL-NCE
Video and text pre-training
• Instructional videos
  • A natural source for video and text representation learning
  • Instructions are available from ASR
  • Diverse domains
    • Cooking
    • Assembling furniture
Video and text pre-training
• HowTo100M
Miech et al., HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, ICCV 2019.
Video and text pre-training
• VideoBERT
Sun et al., VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019.
ActBERT
• Decouple verbs and nouns
  • "Add spinach" -> verb: "add", noun: "spinach"
• Verb labels are extracted from the description
  • Train a 3D ConvNet for verb classification
• Object labels are produced by a pre-trained Faster R-CNN
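The verb/noun decoupling above can be sketched as a simple tokenize-and-split step. This is a minimal illustration only: the tiny verb lexicon below is an assumption for the example, whereas ActBERT derives verb labels from a POS-tagged narration and object labels from Faster R-CNN detections.

```python
# Illustrative sketch of decoupling verbs and nouns from an ASR description.
# The verb lexicon is a stand-in assumption; the actual pipeline tags the
# narration and uses detector outputs for nouns/objects.
VERBS = {"add", "rotate", "cut", "pour", "mix"}

def decouple(description):
    """Split a short instruction into (verb tokens, noun tokens)."""
    tokens = description.lower().split()
    verbs = [t for t in tokens if t in VERBS]
    nouns = [t for t in tokens if t not in VERBS]
    return verbs, nouns

print(decouple("Add spinach"))  # (['add'], ['spinach'])
```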
Tangled transformer
[Figure: ActBERT's tangled transformer builds a joint embedding from three input streams: clip-level action features (action model), region features, and language features (language model / BERT); example narration: "Rotate shrimp balls"]
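The three input streams in the slide above can be sketched as one token sequence with per-modality type embeddings, as in BERT-style joint models. This is a hedged sketch: the dimensions, random features, and simple additive type embeddings are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Sketch: assemble action, region, and language tokens into one sequence.
# Sizes are illustrative assumptions (the paper uses BERT-scale dimensions).
D = 8
rng = np.random.default_rng(0)

action = rng.normal(size=(2, D))   # clip-level 3D-ConvNet action features
regions = rng.normal(size=(3, D))  # Faster R-CNN region features
words = rng.normal(size=(4, D))    # word embeddings for the narration

# One type embedding per modality, broadcast over that modality's tokens,
# so the transformer can tell the streams apart.
type_emb = rng.normal(size=(3, D))
seq = np.concatenate([
    action + type_emb[0],
    regions + type_emb[1],
    words + type_emb[2],
])
print(seq.shape)  # (9, 8)
```

The joint sequence would then be fed to transformer layers; the "tangled" design additionally lets each modality's stream attend to the others rather than using a single undifferentiated stack.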