Computer Vision: Weakly-supervised learning from video and images

CSClub, Saint Petersburg, November 17, 2014
Computer Vision: Weakly-supervised learning from video and images
Ivan Laptev, ivan.laptev@inria.fr
WILLOW, INRIA/ENS/CNRS, Paris
Joint work with: Piotr Bojanowski, Rémi Lajugie, Maxime Oquab, Francis Bach, Leon Bottou, Jean Ponce, Cordelia Schmid, Josef Sivic

What is Computer Vision?

What is the recent progress?
• Industry, 1990s: Automated quality inspection (controlled lighting, scale, …)
• Research, 1990s: Recognition at the level of a few toy objects (COIL 20 dataset)
• Industry, now: Face recognition in social media
• Research, now: ImageNet: 14M images, 21K classes; 6% top-5 error rate in the 2014 challenge

Why image and video analysis?
Data:
• ~2.5 billion new images / month
• TV channels recorded since the 60's
• ~5K image uploads every min.
• >34K hours of video upload every day
• ~30M surveillance cameras in US => ~700K video hours/day
And even more with future wearable devices

Why looking at people?
How many person-pixels are in the video?
• Movies: 35%
• TV: 34%
• YouTube: 40%

How many person pixels in our daily life?
• Wearable camera data (Microsoft SenseCam dataset): ~4%

What are the difficulties?
• Large variations in appearance: occlusions, non-rigid motion, viewpoint changes, clothing, … (figure example: action "Hugging")
• Manual collection of training samples is prohibitive: many action classes, rare occurrence
• Action vocabulary is not well-defined (figure example: action "Open")

This talk:
• Brief overview of recent techniques
• Weakly-supervised learning from video and scripts
• Weakly-supervised learning with convolutional neural networks

Standard visual recognition pipeline
• Collect image/video samples and corresponding class labels
• Design appropriate data representation, with certain invariance properties
• Design / use existing machine learning methods for learning and classification
Example action classes: GetOutCar, AnswerPhone, HandShake, StandUp, DriveCar, Kiss

Bag-of-Features action recognition
• Extraction of local features (space-time patches)
• Feature description
• Feature quantization: K-means clustering (k=4000)
• Occurrence histogram of visual words
• Non-linear SVM with χ² kernel
[Laptev, Marszałek, Schmid, Rozenfeld 2008]
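The quantization, histogram and kernel steps of this pipeline can be sketched in a few lines. This is a minimal illustration, not the code behind the talk: the centroids are assumed to come from a K-means run done beforehand, the helper names are hypothetical, and `gamma` is an assumed scaling parameter for a commonly used exponentiated χ² kernel.

```python
import math

def quantize(descriptor, centroids):
    """Index of the nearest centroid (visual word) for one local descriptor."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, centroids[k])),
    )

def bof_histogram(descriptors, centroids):
    """L1-normalized occurrence histogram of visual words for one video."""
    hist = [0.0] * len(centroids)
    for d in descriptors:
        hist[quantize(d, centroids)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def chi2_distance(h1, h2):
    """Chi-square distance between two histograms (empty bins are skipped)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def chi2_kernel(h1, h2, gamma=1.0):
    """Exponentiated chi-square kernel; gamma is an assumed free parameter."""
    return math.exp(-gamma * chi2_distance(h1, h2))

# Toy example with 2 visual words instead of k=4000.
centroids = [[0.0, 0.0], [10.0, 10.0]]
video = [[0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [11.0, 10.0]]
hist = bof_histogram(video, centroids)
```

The resulting kernel values between all pairs of training videos would form the Gram matrix passed to an SVM solver that accepts precomputed kernels.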

Action classification Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”

Where to get training data?
• Shoot actions in the lab: KTH dataset, Weizmann dataset, …
 - Limited variability
 - Unrealistic
• Manually annotate existing content: HMDB, Olympic Sports, UCF50, UCF101, …
 - Very time-consuming
• Use readily-available video scripts
 - Scripts are available for 1000's of hours of movies and TV series: www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
 - Scripts describe dynamic and static content of videos

As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...

Script-based video annotation
• Scripts available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, …
• Subtitles (with time info) are available for most movies
• Can transfer time to scripts by text alignment

Subtitles:
…
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me? Why'd you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard. Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends knew about our marriage.
…

Movie script:
…
RICK
Why weren't you honest with me? Why did you keep your marriage a secret?
Rick sits down with Ilsa. (transferred time: 01:20:17 to 01:20:23)
ILSA
Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.
…

[Laptev, Marszałek, Schmid, Rozenfeld 2008]
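The time transfer can be sketched as follows. This is a simplified illustration under stated assumptions, not the alignment procedure of the cited paper: speech lines are matched to subtitles by word overlap using the standard-library `difflib.SequenceMatcher`, each scene description then takes the interval spanned by its nearest matched neighbours, and the function names and the `min_ratio` threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher

def words(text):
    """Lower-cased word tokens, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def align(script_lines, subtitles, min_ratio=0.5):
    """script_lines: list of (kind, text), kind is "speech" or "scene".
    subtitles: list of (start, end, text) in temporal order.
    Returns one (start, end) pair or None per script line."""
    times = [None] * len(script_lines)
    next_sub = 0  # enforce monotonic order of matches
    for i, (kind, text) in enumerate(script_lines):
        if kind != "speech":
            continue
        best = None
        for j in range(next_sub, len(subtitles)):
            ratio = SequenceMatcher(None, words(text), words(subtitles[j][2])).ratio()
            if ratio >= min_ratio and (best is None or ratio > best[0]):
                best = (ratio, j)
        if best is not None:
            j = best[1]
            times[i] = (subtitles[j][0], subtitles[j][1])
            next_sub = j + 1
    speech_times = list(times)
    for i, (kind, _) in enumerate(script_lines):
        if kind != "scene":
            continue
        prev = next((speech_times[k] for k in range(i - 1, -1, -1) if speech_times[k]), None)
        nxt = next((speech_times[k] for k in range(i + 1, len(times)) if speech_times[k]), None)
        if prev and nxt:
            # Scene description spans from the previous speech to the next one.
            times[i] = (prev[0], nxt[1])
    return times

subtitles = [
    ("01:20:17,240", "01:20:20,437", "Why weren't you honest with me? Why'd you keep your marriage a secret?"),
    ("01:20:20,640", "01:20:23,598", "It wasn't my secret, Richard. Victor wanted it that way."),
    ("01:20:23,800", "01:20:26,189", "Not even our closest friends knew about our marriage."),
]
script = [
    ("speech", "Why weren't you honest with me? Why did you keep your marriage a secret?"),
    ("scene", "Rick sits down with Ilsa."),
    ("speech", "Oh, it wasn't my secret, Richard. Victor wanted it that way. "
               "Not even our closest friends knew about our marriage."),
]
times = align(script, subtitles)
```

In this sketch a speech that spans several subtitles is only matched to its single best subtitle; a fuller implementation would merge consecutive subtitles before matching.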

Scripts as weak supervision
Challenges:
• Imprecise temporal localization (figure example: uncertainty interval from 24:25 to 24:51)
• No explicit spatial localization
• NLP problems, scripts ≠ training labels:
 "… Will gets out of the Chevrolet. …" vs. Get-out-car
 "… Erin exits her new truck…"
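The gap between script text and training labels can be illustrated with a deliberately naive keyword rule. The rule below is hypothetical and only meant to show the failure mode: it fires on the first sentence from the slide and misses the second one, although both describe the same action.

```python
import re

# Hypothetical hand-written rule for a single action class (not from the talk).
RULES = {
    "GetOutCar": re.compile(r"\bgets? out of the (car|chevrolet|taxi)\b", re.IGNORECASE),
}

def label_sentence(sentence):
    """Return the first action whose rule matches the sentence, else None."""
    for action, pattern in RULES.items():
        if pattern.search(sentence):
            return action
    return None

hit = label_sentence("… Will gets out of the Chevrolet. …")
miss = label_sentence("… Erin exits her new truck…")
```

Here `hit` is `"GetOutCar"` while `miss` is `None`, which is why mapping script sentences to action labels needs more than fixed patterns.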

Previous work
• Sivic, Everingham, and Zisserman, "'Who are you?' - Learning Person Specific Classifiers from Video", In CVPR 2009.
• Buehler, Everingham, and Zisserman, "Learning sign language by watching TV (using weakly aligned subtitles)", In CVPR 2009. (figure example subtitle: "…wanted to know about the history of the trees")
• Duchenne, Laptev, Sivic, Bach and Ponce, "Automatic Annotation of Human Actions in Video", In ICCV 2009.

Joint Learning of Actors and Actions [Bojanowski et al. ICCV 2013]
Script: "Rick walks up behind Ilsa"
Candidate labels: Rick? Walks?
Resolved labels: Rick, Walks

Formulation: Cost function
[Equation not recovered from the slide. Annotated terms: actor classifier, actor labels, actor image features. Example actors: Rick, Ilsa, Sam]

Formulation: Cost function Weak supervision from scripts: Person p appears at least once in clip N : p = Rick

Formulation: Cost function Weak supervision from scripts: Action a appears at least once in clip N : a = Walk

Formulation: Cost function
Weak supervision from scripts:
• Person p appears in clip N
• Action a appears in clip N
• Person p and Action a appear in clip N
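The slide equations were not recovered, but the three "at least once" statements can be written as linear and bilinear constraints. The notation here is an assumption made for illustration: $z_{np} \in \{0,1\}$ indicates that person track $n$ is assigned to person $p$, $t_{na} \in \{0,1\}$ indicates that track $n$ performs action $a$, and $\mathcal{N}$ is the set of tracks falling in the clip's time interval.

```latex
\begin{align*}
\text{Person } p \text{ appears in clip } \mathcal{N}: \quad
  & \sum_{n \in \mathcal{N}} z_{np} \ge 1 \\
\text{Action } a \text{ appears in clip } \mathcal{N}: \quad
  & \sum_{n \in \mathcal{N}} t_{na} \ge 1 \\
\text{Person } p \text{ and action } a \text{ appear together in clip } \mathcal{N}: \quad
  & \sum_{n \in \mathcal{N}} z_{np}\, t_{na} \ge 1
\end{align*}
```

Under this reading, the script sentence "Rick walks up behind Ilsa" only says that at least one track in the clip is both Rick and walking, without saying which one.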

Image and video features
Face features
• Facial features [Everingham'06]
• HOG descriptor on normalized face image
Action features
• Dense Trajectory features in person bounding box [Wang et al., '11]

Results for Person Labelling
• American Beauty (11 character names)
• Casablanca (17 character names)

Results for Person + Action Labelling
Casablanca, Walking

Finding Actions and Actors in Movies [Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]

Action Learning with Ordering Constraints [Bojanowski et al. ECCV 2014]
