Towards generating stories about video
Anna Rohrbach
The End-of-End-to-End: A Video Understanding Pentathlon, CVPR 2020
Let’s look at a human-generated video description

A young singer with moppy dark brown hair strums a guitar at the mic. Debbie brings Pete a bottle of beer. Setting her own drink down, she faces the stage and takes his hand. Pete shrugs and gives a delighted smile. Debbie smiles encouragingly.
Human descriptions …
• are relevant to the video
• are coherent and non-redundant
• mention distinct person identities and make use of co-references (e.g. she)

Besides, they …
• may contain references to objects and places significant to the story, including named entities (e.g. they meet at the Denny’s)
• may require common sense for deeper understanding of events (e.g. they make up after quarreling)
• and much more (connect to audio, dialog, etc.)
This talk

• Connecting video description to person identities
Example: His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

• Coherent and diverse multi-sentence video description
Our work: A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.
Multi-Sentence Video Description

Ground Truth:
1. A man is seen hosting a news segment that shows clips of various floats moving down a rapid with people.
2. One group of people fall out and pull each other off to the side and one man speaks to the camera.
3. More shots are shown of people riding down the river and falling out on the side.

Desired properties:
• Visually Relevant
• Linguistically Fluent
• Diverse & Coherent Across Sentences

Park et al. Adversarial Inference for Multi-Sentence Video Description. CVPR 2019
Masked Transformer (Zhou et al.):
1. A man is seen speaking to the camera and leads into clips of people riding in the water.
2. The man continues to speak to the camera while more clips of people riding.
3. The man continues talking to the camera.

Move Forward and Tell (Xiong et al.):
1. A man is seen speaking to the camera and leads into a man speaking to the camera.
2. A man is seen speaking to the camera and leads into him riding down a river.
3. A man is seen speaking to the camera and leads into him riding down a river.
Error types marked in the Masked Transformer and Move Forward and Tell outputs:
• Content Error
• Incoherent Sentence
• Repetition Across Sentences
Adversarial Inference (Ours):
1. A man is seen speaking to the camera and leads into several people riding down a rough river.
2. People are shown in the water riding a boat.
3. Several people are shown in slow motion as well as people riding along.
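The contrast between the Move Forward and Tell output and the Adversarial Inference output can be made concrete with a small repetition score. The sketch below is only an illustration: the function name `cross_sentence_repetition` and the choice of 4-grams are assumptions made here, and this is not necessarily the metric used by Park et al. It counts how many n-grams in each sentence already occurred in an earlier sentence of the same description.

```python
import re

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cross_sentence_repetition(sentences, n=4):
    """Fraction of n-grams that already appeared in an EARLIER sentence.

    Hypothetical illustrative measure; repeats inside one sentence are ignored.
    """
    seen, repeated, total = set(), 0, 0
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        grams = ngrams(tokens, n)
        repeated += sum(1 for g in grams if g in seen)
        total += len(grams)
        seen.update(grams)  # only later sentences can repeat these
    return repeated / total if total else 0.0

# Outputs quoted from the slides.
move_forward_and_tell = [
    "A man is seen speaking to the camera and leads into a man speaking to the camera.",
    "A man is seen speaking to the camera and leads into him riding down a river.",
    "A man is seen speaking to the camera and leads into him riding down a river.",
]
adversarial_inference = [
    "A man is seen speaking to the camera and leads into several people riding down a rough river.",
    "People are shown in the water riding a boat.",
    "Several people are shown in slow motion as well as people riding along.",
]

print("Move Forward and Tell:", round(cross_sentence_repetition(move_forward_and_tell), 3))
print("Adversarial Inference:", round(cross_sentence_repetition(adversarial_inference), 3))
```

A higher score means more text is reused across sentences. On these two examples the baseline scores well above the adversarial-inference output, which matches the "Repetition Across Sentences" label on the slide.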
Conventional Video Captioning Model

MLE Training: the Generator is trained with Maximum Likelihood Estimation (MLE)
• Favors frequent n-grams in training set

Inference: the Generator is decoded with Greedy Max / Beam Search
• Explores limited vocabulary space

Example captions on the slide: "People are riding down …", "A man is seen speaking …"
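To see why MLE training combined with greedy decoding gravitates towards frequent n-grams, here is a minimal sketch using a toy bigram language model. Everything in it is an assumption made for illustration: the four training captions are invented, the bigram model stands in for a neural generator, and the helper names `fit_bigram_mle` and `greedy_decode` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy training captions (not taken from the talk's dataset).
CAPTIONS = [
    "a man is seen speaking to the camera",
    "a man is seen speaking to the camera",
    "a man is seen riding a boat",
    "people are riding down a river",
]

def fit_bigram_mle(captions):
    """MLE for a bigram model: P(w2 | w1) = count(w1, w2) / count(w1)."""
    counts = defaultdict(Counter)
    for caption in captions:
        tokens = ["<s>"] + caption.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(nxt_counts.values()) for w, c in nxt_counts.items()}
        for prev, nxt_counts in counts.items()
    }

def greedy_decode(model, max_len=12):
    """Greedy max decoding: always take the most probable next word."""
    word, output = "<s>", []
    for _ in range(max_len):
        dist = model[word]
        word = max(dist, key=dist.get)
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

model = fit_bigram_mle(CAPTIONS)
print(greedy_decode(model))  # a man is seen speaking to the camera
```

The decoder reproduces the most frequent training caption, and the words "people", "boat" and "river" are never reached even though the model assigns them non-zero probability. Beam search keeps a few more high-probability continuations but stays in the same region of the vocabulary.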