Vision and Sound Computer Vision Fall 2018 Columbia University
Single-modality video representations: 3D convolutions over the frames for vision, 1D convolutions over the waveform for hearing. Slide credit: Andrew Owens
(McGurk 1976)
Same audio, different video! (McGurk 1976)
Object Recognition: Sound → f(x^s; ω) → Objects
Natural Synchronization: transfer the vision network's predictions F(x_i^v; Ω) (e.g. "Lion") to the sound network f(x_i^s; ω) by minimizing min_ω Σ_i D_KL( F(x_i^v; Ω) || f(x_i^s; ω) )
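The objective above is a teacher-student KL distillation: the sound network is trained to match the vision network's class posterior. A minimal NumPy sketch with toy logits standing in for the two networks (`softmax` and `kl_divergence` are hypothetical helpers, not the authors' code):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q), summed over classes and averaged over the batch
    p, q = p + eps, q + eps
    return np.mean(np.sum(p * np.log(p / q), axis=-1))

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))   # F(x^v; Ω): vision teacher
student_logits = rng.normal(size=(4, 10))   # f(x^s; ω): sound student

# The student's parameters would be updated to drive this loss down
loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
```

Because the videos supply the synchronization for free, no human labels are needed for this loss.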
Millions of Unlabeled Videos
SoundNet: Waveform → Convolutional Neural Network → Categories
Sound Recognition: classifying sounds in ESC-50

Method               Accuracy
Chance                     2%
SVM-MFCC                  39%
Random Forest             44%
CNN (Piczak 2015)         64%
SoundNet                  74%   (10% gain over prior best)
Human consistency         81%
Vision vs Sound: low-dimensional embeddings via t-SNE (van der Maaten and Hinton, 2008)
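Embeddings like these are typically produced by running t-SNE on the networks' feature vectors. A sketch using scikit-learn's `TSNE` on random features standing in for the learned activations (hypothetical toy data):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# 30 toy 64-dim feature vectors; in the slides these would be
# activations from the vision and sound networks
features = rng.normal(size=(30, 64))

# Project to 2D for plotting; perplexity must stay below n_samples
emb = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(features)
```

Coloring the resulting 2D points by modality (vision vs. sound) reveals whether the two embeddings occupy a shared space.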
Sensor Power Consumption: camera ~1 watt, microphone ~1 milliwatt
What does it learn?
Layer 1
What does it learn?
Layer 5 Smacking-like
Layer 5 Chime-like
What does it learn?
Layer 7 Scuba-like
Layer 7 Parents-like
Audiovisual Grounding Which regions are making which sounds?
Audiovisual Grounding
Which objects make which sounds?
The sound of the clicked object
Collect unlabeled videos
Mix Sound Tracks
How to recover the originals? With audio alone the problem is:
• ill-posed
• subject to the permutation problem (which output corresponds to which source?)
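This "mix and separate" setup builds its own supervision: sum two unlabeled tracks and use the originals as ground-truth targets. A toy NumPy sketch (`mix_tracks` is a hypothetical helper):

```python
import numpy as np

def mix_tracks(a, b):
    """Mix two mono tracks of equal length into one training input.
    The unmixed originals serve as the ground-truth separation targets."""
    assert a.shape == b.shape
    mixture = a + b
    targets = np.stack([a, b])
    return mixture, targets

rng = np.random.default_rng(0)
track_a = rng.normal(size=16000)  # toy 1 s clip at 16 kHz
track_b = rng.normal(size=16000)
mixture, targets = mix_tracks(track_a, track_b)
```

A model trained purely on audio must still guess which recovered track is which; conditioning on video resolves that ambiguity.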
Vision can help: a Video Analysis Network and an Audio Analysis Network feed an Audio Synthesizer Network that outputs the sound of the target video.
Audiovisual Model
• Video Analysis Network: CNN + max pool → K vision channels
• Audio Analysis Network: STFT → sound spectrogram → U-Net → K audio channels s_1 … s_K
• Audio Synthesizer Network: combines both → sound of the target video
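One common instantiation of the synthesizer, sketched below, is a weighted sum: each of the K audio spectrogram channels is scaled by the corresponding vision channel and the results are added (assumed simplification; `synthesize` is a hypothetical helper, not the authors' code):

```python
import numpy as np

def synthesize(vision_weights, audio_channels):
    """Weighted sum of K audio spectrogram channels.
    vision_weights: (K,) pooled visual features for the target video
    audio_channels: (K, F, T) spectrogram components s_1 .. s_K
    Returns an (F, T) spectrogram for the target video's sound."""
    return np.tensordot(vision_weights, audio_channels, axes=1)

rng = np.random.default_rng(0)
K, F, T = 8, 64, 100
v = rng.random(K)            # toy vision channels
s = rng.random((K, F, T))    # toy audio channels from the U-Net
spec = synthesize(v, s)
```

Because the weighting is per-video (or per-pixel, if the pooling is skipped), different image regions can select different audio components.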
Original Audio
What does this sound like?
What does this sound like?
What does this sound like?
What regions are making sound? Original Video Estimated Volume
What sounds are they making? Original Video Embedding (projected and visualized as color)
Adjusting Volume
Learning audio-visual correspondences: given an (image, sound) pair, is it real or fake? Slide credit: Andrew Owens
Learning audio-visual correspondences: does this image pair with "moo"? Real or fake? Slide credit: Andrew Owens
Idea #1: random pairs — a real pair is a frame and audio from the same video; a fake pair draws them from different videos. Arandjelovic, Zisserman. ICCV 2017. Slide credit: Andrew Owens
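The sampling behind the random-pairs idea can be sketched in a few lines (toy stand-in data; `sample_pair` is a hypothetical helper, not the paper's code):

```python
import numpy as np

def sample_pair(videos, rng, p_real=0.5):
    """videos: list of (frame, audio) clips (toy stand-ins here).
    Returns (frame, audio, label): label 1 for a corresponding pair,
    0 for a frame paired with audio drawn from a different video."""
    i = rng.integers(len(videos))
    frame, audio = videos[i]
    if rng.random() < p_real:
        return frame, audio, 1
    # pick a different video's audio for the "fake" pair
    j = (i + 1 + rng.integers(len(videos) - 1)) % len(videos)
    return frame, videos[j][1], 0

rng = np.random.default_rng(0)
videos = [(i, 100 + i) for i in range(5)]  # frame=i, its audio=100+i
frame, audio, label = sample_pair(videos, rng)
```

A binary classifier trained on such pairs learns, without labels, which sights and sounds co-occur.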
Arandjelovic, Zisserman. ICCV 2017
Vision hidden units Arandjelovic, Zisserman. ICCV 2017
Sound hidden units Arandjelovic, Zisserman. ICCV 2017
Sound Recognition Arandjelovic, Zisserman. ICCV 2017
Visual Recognition Linear classifier on top of features (ImageNet) Arandjelovic, Zisserman. ICCV 2017
Idea #1: random pairs. Slide credit: Andrew Owens
Idea #2: time-shifted pairs — a fake pair is a frame and audio from the same video, shifted in time. Slide credit: Andrew Owens
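Misaligned examples for the time-shift idea come from the same clip, with the audio offset by at least some minimum shift. A toy NumPy sketch (the paper uses shifts of a few seconds; `time_shift_pair` and the shift value are illustrative assumptions):

```python
import numpy as np

def time_shift_pair(frames, audio, rng, min_shift=8000, aligned=True):
    """Same video supplies both streams; a misaligned example circularly
    shifts the audio by at least `min_shift` samples so the classifier
    cannot rely on which video the audio came from."""
    if aligned:
        return frames, audio, 1
    shift = int(rng.integers(min_shift, len(audio) - min_shift))
    return frames, np.roll(audio, shift), 0

rng = np.random.default_rng(0)
audio = np.arange(32000)       # toy 2 s waveform at 16 kHz
frames = np.zeros((60, 8, 8))  # toy 2 s of video at 30 fps
_, shifted, label = time_shift_pair(frames, audio, rng, aligned=False)
```

Unlike random pairs, both streams share scene content, so the network is forced to use fine-grained temporal synchrony rather than semantics alone.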
Fused audio-visual representation (aligned vs. misaligned): a video stream of 3D convolutions and an audio stream of 1D convolutions, fused by concatenation at "conv2" and followed by further 3D convolutions. Slide credit: Andrew Owens
What does the network learn? Visualizing the aligned-vs-misaligned decision with class activation maps (Zhou et al. 2016). Slide credit: Andrew Owens
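A class activation map weights each final-conv feature map by the classifier weight for the chosen class and sums over channels. A minimal sketch with toy activations (hypothetical shapes; not the paper's code):

```python
import numpy as np

def class_activation_map(features, weights, class_idx):
    """CAM (Zhou et al. 2016): CAM_c(h, w) = sum_k weights[c, k] * features[k, h, w].
    features: (C, H, W) last-conv activations after the conv stack;
    weights:  (num_classes, C) weights of the classifier after global
    average pooling. Returns an (H, W) heatmap."""
    return np.tensordot(weights[class_idx], features, axes=1)

rng = np.random.default_rng(0)
features = rng.random((16, 7, 7))    # toy conv activations
weights = rng.normal(size=(2, 16))   # 2 classes: aligned vs. misaligned
cam = class_activation_map(features, weights, 0)
```

Upsampling the heatmap to the input resolution shows which image regions drove the aligned-vs-misaligned decision, such as moving lips or a bouncing ball.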
Top responses per category (speech examples omitted) Dribbling basketball
Playing organ
Chopping wood
Application: on/off-screen source separation Good morning! Guten Morgen! Task: separate on-screen sounds from background noise Slide credit: Andrew Owens
Creating training data: a synthetic sound mixture = on-screen speech (VoxCeleb) + off-screen audio. Slide credit: Andrew Owens
On/off-screen source separation: the STFT of the mixture (frequency × time), together with the multisensory features, is regressed to on-screen and off-screen components. Slide credit: Andrew Owens
On/off-screen source separation: a U-Net (Ronneberger 2015) over the spectrogram (frequency × time), with the multisensory features concatenated in. Slide credit: Andrew Owens
On/off-screen source separation — training:
• 4-sec. videos from VoxCeleb + AudioSet
• U-Net with feature concatenation (Ronneberger 2015)
• L1 loss on log spectrograms; inverse STFT to recover waveforms
• no labels or face detection
Slide credit: Andrew Owens
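The spectrogram pipeline above — forward STFT for the network input, L1 loss in the log-magnitude domain, inverse STFT to get audio back — can be sketched with SciPy (toy signal; `l1_log_spec_loss` is a hypothetical helper, and the window settings are assumptions, not the paper's):

```python
import numpy as np
from scipy.signal import stft, istft

def l1_log_spec_loss(pred, target, eps=1e-7):
    """L1 distance between log-magnitude spectrograms."""
    return np.mean(np.abs(np.log(np.abs(pred) + eps)
                          - np.log(np.abs(target) + eps)))

rng = np.random.default_rng(0)
x = rng.normal(size=64000)                  # toy 4 s clip at 16 kHz
_, _, Z = stft(x, fs=16000, nperseg=256)    # spectrogram (freq x time)
_, x_rec = istft(Z, fs=16000, nperseg=256)  # inverse STFT back to audio
loss = l1_log_spec_loss(Z, Z)               # zero for a perfect prediction
```

The log compresses the large dynamic range of speech spectra, so the L1 loss does not get dominated by the loudest time-frequency bins.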
Input video
On-screen prediction
Off-screen prediction
Input video
On-screen prediction
Off-screen prediction