Ambient Sound Provides Supervision for Visual Learning
Andrew Owens¹, Jiajun Wu¹, Josh H. McDermott¹, William T. Freeman¹,², and Antonio Torralba¹
¹MIT, ²Google Research
ECCV 2016
Presented by An T. Nguyen
Introduction
Problem
◮ Learn an image representation without labels ...
◮ ... that is useful for a real task (e.g. object recognition).
Idea
◮ Set up a pretext task.
◮ To solve the pretext task, the model must learn a good representation.
Learn to predict a “natural signal” ...
◮ ... that is available for ‘free’.
◮ This paper: sound.
◮ Others: camera motion (Agrawal et al. 2015; Jayaraman & Grauman 2015).
Data
Yahoo Flickr Creative Commons 100 Million dataset (Thomee et al. 2015).
◮ A 360,000-video subset.
◮ Sample one image every 10 seconds.
◮ Extract 3.75 s of sound around each sampled frame.
◮ 1.8 million training examples.
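A minimal sketch of this sampling scheme, assuming ffmpeg is available on the system; the output naming, seek details, and the choice of ffmpeg are illustrative assumptions, not the authors' pipeline.

```python
# Sketch: one frame every 10 s, plus ~3.75 s of mono audio centred on it.
import subprocess

def sample_video(path, duration, out_prefix, step=10.0, audio_len=3.75):
    """Extract (frame, audio clip) pairs from one video of known duration (seconds)."""
    pairs = []
    t, i = step / 2.0, 0
    while t < duration:
        frame = f"{out_prefix}_{i:04d}.jpg"
        audio = f"{out_prefix}_{i:04d}.wav"
        # Grab a single frame at time t.
        subprocess.run(["ffmpeg", "-y", "-ss", str(t), "-i", path,
                        "-vframes", "1", frame], check=True)
        # Grab audio_len seconds of mono audio centred on t.
        start = max(t - audio_len / 2.0, 0.0)
        subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-t", str(audio_len),
                        "-i", path, "-ac", "1", audio], check=True)
        pairs.append((frame, audio))
        t, i = t + step, i + 1
    return pairs
```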
Example 1 (flickr.com/photos/41894173046@N01/4530333858): video with sound.
Example 2 (flickr.com/photos/42035325@N00/8029349128): video with sound.
Example 3 (flickr.com/photos/zen/2479982751): video with sound.
Challenges
◮ Sound is sometimes indicative of the image ...
◮ ... but sometimes not.
Sound-producing objects
◮ may be outside the image.
◮ do not always produce sound.
Video
◮ is edited.
◮ has noisy background sound.
Question: what representation can we learn?
Represent sound
Pre-process
◮ Filter the waveform (mimicking the human ear).
◮ Compute statistics (e.g. the mean of each frequency channel).
◮ → sound texture: a 502-dimensional vector.
Two labeling models
1. Cluster the sound textures (k-means).
2. PCA, 30 projections, threshold → binary codes.
Given an image
1. Predict the sound cluster.
2. Predict the 30 binary codes (multi-label classification).
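A rough sketch of the two labeling models, assuming librosa and scikit-learn. The mel filterbank and the statistics below are stand-ins for the cochlear-style filterbank and the full 502-dimensional sound-texture features in the paper, and `audio_paths` is a hypothetical list of extracted audio clips.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def sound_texture(wav_path, n_bands=32):
    """Crude sound-texture features: per-band statistics of band envelopes."""
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands)
    env = np.log1p(spec)                               # compressed band envelopes
    feats = [env.mean(axis=1), env.std(axis=1)]        # per-band mean and spread
    feats.append(np.corrcoef(env)[np.triu_indices(n_bands, k=1)])  # band correlations
    return np.concatenate(feats)

textures = np.stack([sound_texture(p) for p in audio_paths])  # audio_paths: clip files

# Labeling model 1: cluster sound textures with k-means.
cluster_labels = KMeans(n_clusters=30, n_init=10).fit_predict(textures)

# Labeling model 2: project onto the top PCA directions and threshold at the
# median, giving one binary code per projection (multi-label targets).
proj = PCA(n_components=30).fit_transform(textures)
binary_codes = (proj > np.median(proj, axis=0)).astype(int)
```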
Training
Convolutional Neural Network
◮ Similar to (Krizhevsky et al. 2012).
◮ Implemented in Caffe.
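The original model is an AlexNet-style network in Caffe; purely as an illustration, here is a hedged PyTorch sketch of the pretext training loop, where `train_loader`, the optimizer settings, and the use of torchvision's AlexNet are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

num_clusters = 30                       # number of k-means sound clusters (assumed)
net = alexnet(num_classes=num_clusters)
opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()         # use BCEWithLogitsLoss for the binary-code variant

for images, sound_clusters in train_loader:   # train_loader yields (frame, cluster id) batches
    opt.zero_grad()
    loss = loss_fn(net(images), sound_clusters)
    loss.backward()
    opt.step()
```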
Visualizing neurons (in upper layers)
Method: for each neuron
1. Find the images with the largest activations.
2. Find the locations that contribute most to the activation.
3. Highlight these regions.
4. Show them to humans on AMT.
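A sketch of step 1 (ranking images by a unit's activation), assuming a PyTorch model; the forward hook and the `image_batches` iterable are implementation assumptions, not the paper's visualization code.

```python
import torch

def top_activating_images(net, layer, unit, image_batches, k=10):
    """Return the k (activation, image index) pairs with the largest response of one unit."""
    acts, feats = [], {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(out=o))
    with torch.no_grad():
        for batch_idx, images in enumerate(image_batches):
            net(images)
            # Max response of this unit over spatial positions, per image.
            a = feats["out"][:, unit].flatten(1).max(dim=1).values
            acts.extend((v.item(), (batch_idx, j)) for j, v in enumerate(a))
    handle.remove()
    return sorted(acts, reverse=True)[:k]
```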
Visualizing neurons (example figures)
Detectors
Histograms of detectors learned from: Sound; Ego Motion; Labeled Scenes (supervised).
Observations
◮ Each method learns some kind of representation ...
◮ ... depending on the pretext task.
Representation learned from sound
◮ Objects with distinctive sounds.
◮ Complementary to other methods.
Object/Scene Recognition (one-vs-rest SVM)
Comparable performance to other methods.
[1] Agrawal et al. 2015; [4] Doersch et al. 2015; [20] Krähenbühl et al. 2016; [35] Wang & Gupta 2015.
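A sketch of the transfer protocol, assuming scikit-learn; `extract_features`, `net`, and the label arrays are hypothetical placeholders for a frozen feature extractor and a labeled recognition dataset.

```python
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# Features from the pretext-trained network, with its weights frozen.
train_feats = extract_features(net, train_images)   # shape (N, D)
test_feats = extract_features(net, test_images)

clf = OneVsRestClassifier(LinearSVC(C=1.0))
clf.fit(train_feats, train_labels)
accuracy = clf.score(test_feats, test_labels)
```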
Object Detection (pretraining for Fast R-CNN)
Similar performance to the motion-based methods.
[1] Agrawal et al. 2015; [4] Doersch et al. 2015; [20] Krähenbühl et al. 2016; [35] Wang & Gupta 2015.
Discussion
Sound
◮ is abundant.
◮ can be used to learn good representations.
◮ is complementary to visual information.
Future work
◮ Other sound representations.
◮ Which objects/scenes are detectable by sound?
Bonus: Visually Indicated Sounds (Owens et al. 2016, vis.csail.mit.edu)