  1. CMP722 ADVANCED COMPUTER VISION, Lecture #4: Multimodality. Aykut Erdem // Hacettepe University // Spring 2019. (Illustration: detail from Fritz Kahn’s Der Mensch als Industriepalast)

  2. Previously on CMP722: • sequential data • convolutions in time • recurrent neural networks (RNNs) • autoregressive generative models • attention models • case study: the Transformer model. (Illustration: DeepMind)

  3. Lecture overview • what is multimodality • a historical view on multimodal research • core technical challenges • joint representations • coordinated representations • multimodal fusion • Disclaimer: Much of the material and slides for this lecture were borrowed from Louis-Philippe Morency and Tadas Baltrusaitis’s CMU 11-777 class, and from Qi Wu’s slides for the ACL 2018 Tutorial on Connecting Language and Vision to Actions

  4. What is Multimodal? Multimodal distribution: multiple modes, i.e., distinct “peaks” (local maxima) in the probability density function
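As a quick illustration (not from the original slides; the mixture weights, means, and standard deviations below are arbitrary assumptions), a two-component Gaussian mixture is a simple example of a density with two distinct peaks:

```python
# Minimal sketch: a two-component Gaussian mixture as a multimodal density.
# All parameters are illustrative choices, not values from the lecture.
import numpy as np
from scipy.stats import norm

weights, means, stds = [0.4, 0.6], [-2.0, 3.0], [1.0, 1.5]

def mixture_pdf(x):
    # p(x) = sum_k w_k * N(x; mu_k, sigma_k^2)
    return sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, stds))

xs = np.linspace(-8, 10, 2000)
density = mixture_pdf(xs)

# Local maxima of the density are the modes ("peaks") of the distribution.
is_peak = (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])
print("modes near:", xs[1:-1][is_peak])
```

Running this prints two locations, roughly at the component means, which is exactly the “multiple peaks” sense of multimodal used on this slide before the term is broadened to multiple input modalities.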

  5. What is Multimodal? Sensory Modalities

  6. What is Multimodal? Modality: the way in which something happens or is experienced. • Modality refers to a certain type of information and/or the representation format in which information is stored. • Sensory modality: one of the primary forms of sensation, such as vision or touch; a channel of communication. Medium (“middle”): a means or instrumentality for storing or communicating information; a system of communication/transmission. • Medium is the means whereby this information is delivered to the senses of the interpreter.

  7. Examples of Modalities • Natural language (both spoken and written) • Visual (from images or videos) • Auditory (including voice, sounds and music) • Haptics/touch • Smell, taste and self-motion • Physiological signals - Electrocardiogram (ECG), skin conductance • Other modalities - Infrared images, depth images, fMRI

  8. Multiple Communities and Modalities: Psychology, Medical, Speech, Vision, Language, Multimedia, Robotics, Learning

  9. A Historical View

  10. Prior Research on “Multimodal”: Four eras of multimodal research • The “behavioral” era (1970s until late 1980s) • The “computational” era (late 1980s until 2000) • The “interaction” era (2000 - 2010) • The “deep learning” era (2010s until ...), the main focus of this lecture

  11. The “Behavioral” Era (1970s until late 1980s) Multimodal Behavior Therapy by Arnold Lazarus [1973]: 7 dimensions of personality (or modalities). Multi-sensory integration (in psychology): • Multimodal signal detection: independent decisions vs. integration [1980] • Infants’ perception of substance and temporal synchrony in multimodal events [1983] • A multimodal assessment of behavioral and cognitive deficits in abused and neglected preschoolers [1984] • TRIVIA: Geoffrey Hinton received his B.A. in Psychology

  12. The McGurk Effect (1976): “Hearing lips and seeing voices”, Nature

  13. The “Computational” Era (Late 1980s until 2000) 1) Audio-Visual Speech Recognition (AVSR) • Motivated by the McGurk effect • First AVSR system in 1986: “Automatic lipreading to enhance speech recognition” • Good survey paper [2002]: “Recent Advances in the Automatic Recognition of Audio-Visual Speech” • TRIVIA: The first multimodal deep learning paper was about audio-visual speech recognition [ICML 2011]

  14. The “Computational” Era (Late 1980s until 2000) 2) Multimodal/multisensory interfaces • Multimodal Human-Computer Interaction (HCI): “Study of how to design and evaluate new computer systems where humans interact through multiple modalities, including both input and output modalities.” • Glove-Talk: A neural network interface between a data-glove and a speech synthesizer, by Sidney Fels & Geoffrey Hinton [CHI’95]

  15. The “Computational” Era (Late 1980s until 2000) 2) Multimodal/multisensory interfaces • Rosalind Picard: Affective Computing is computing that relates to, arises from, or deliberately influences emotion or other affective phenomena.

  16. The “Computational” Era (Late 1980s until 2000) 3) Multimedia Computing [1994-2010] “The Informedia Digital Video Library Project automatically combines speech, image and natural language understanding to create a full-content searchable digital video library.”

  17. The “Computational” Era (Late 1980s until 2000) 3) Multimedia Computing: Multimedia content analysis • Shot-boundary detection (1991 - ): parsing a video into continuous camera shots • Still and dynamic video abstracts (1992 - ): making video browsable via representative frames (keyframes); generating short clips carrying the essence of the video content • High-level parsing (1997 - ): parsing a video into semantically meaningful segments • Automatic annotation (indexing) (1999 - ): detecting prespecified events/scenes/objects in video

  18. The “Computational” Era (Late 1980s until 2000) • Hidden Markov Models [1960s] • Factorial Hidden Markov Models [1996] • Coupled Hidden Markov Models [1997] [Figure: HMM graphical model with hidden states h_0, ..., h_4 and observations x_1, ..., x_4]
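For reference (this equation is not on the original slide), the HMM diagram above corresponds to the standard factorization of the joint distribution over hidden states h_0, ..., h_T and observations x_1, ..., x_T:

```latex
p(x_{1:T}, h_{0:T}) \;=\; p(h_0) \prod_{t=1}^{T} p(h_t \mid h_{t-1}) \, p(x_t \mid h_t)
```

Factorial and coupled HMMs extend this by splitting the hidden state into several chains, e.g. one per modality, with independent or coupled transition distributions.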

  19. Multimodal Computational Models • Artificial Neural Networks [1940s] • Backpropagation [1975] • Convolutional neural networks [1980s]

  20. The “Interaction” Era (2000s) 1) Modeling Human Multimodal Interaction • AMI Project [2001-2006, IDIAP]: 100+ hours of meeting recordings; fully synchronized audio-video; transcribed and annotated • CHIL Project [Alex Waibel]: Computers in the Human Interaction Loop; multi-sensor multimodal processing; face-to-face interactions • TRIVIA: Samy Bengio started at IDIAP working on the AMI project

  21. The “Interaction” Era (2000s) 1) Modeling Human Multimodal Interaction • CALO Project [2003-2008, SRI]: Cognitive Assistant that Learns and Organizes; Personalized Assistant that Learns (PAL); Siri was a spinoff from this project • SSP Project [2008-2011, IDIAP]: Social Signal Processing; first coined by Sandy Pentland in 2007; great dataset repository: http://sspnet.eu/ • TRIVIA: LP Morency’s PhD research was partially funded by CALO ☺

  22. The “Interaction” Era (2000s) 2) Multimedia Information Retrieval “Yearly competition to promote progress in content-based retrieval from digital video via open, metrics-based evaluation” [Hosted by NIST, 2001-2006] Research tasks and challenges: • Shot boundary, story segmentation, search • “High-level feature extraction”: semantic event detection • Introduced in 2008: copy detection and surveillance events • Introduced in 2010: Multimedia event detection (MED)

  23. Multimodal Computational Models • Dynamic Bayesian Networks: Kevin Murphy’s PhD thesis and Matlab toolbox • Asynchronous HMM for multimodal data [Samy Bengio, 2007]: audio-visual speech segmentation

  24. Multimodal Computational Models • Discriminative sequential models: Conditional random fields [Lafferty et al., 2001]; Latent-dynamic CRF [Morency et al., 2007]

  25. The “deep learning” era (2010s until ...) Representation learning (a.k.a. deep learning) • Multimodal Deep Learning [ICML 2011] • Multimodal Learning with Deep Boltzmann Machines [NIPS 2012] • Visual attention: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention [ICML 2015] Key enablers for multimodal research: • New large-scale multimodal datasets • Faster computers and GPUs • High-level visual features (the topic of our next lecture) • “Dimensional” linguistic features

  26. Real-world tasks tackled by MMML • Affect recognition: emotion, persuasion, personality traits • Media description: image captioning, video captioning, Visual Question Answering • Event recognition: action recognition, segmentation • Multimedia information retrieval: content-based/cross-media

  27. Core Technical Challenges

  28. Multimodal Machine Learning (verbal, vocal, and visual modalities) Core Technical Challenges: Representation, Translation, Alignment, Fusion, Co-Learning

  29. Core Challenge 1: Representation. Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities. Joint representations: [Figure: Modality 1 and Modality 2 mapped into a single joint Representation]
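As a minimal sketch of the joint-representation idea in the diagram above (not the lecture’s reference implementation; layer sizes, module names, and concatenation fusion are illustrative assumptions), two modality-specific encoders can be fused into one shared vector:

```python
# Sketch of a joint multimodal representation: each modality is encoded
# separately, then both are projected into a single shared embedding.
# Dimensions and architecture are illustrative assumptions.
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    def __init__(self, dim_img=2048, dim_txt=300, dim_joint=512):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(dim_img, dim_joint), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(dim_txt, dim_joint), nn.ReLU())
        # The joint layer sees both modalities at once (concatenation fusion),
        # so it can exploit their complementarity and redundancy.
        self.joint = nn.Sequential(nn.Linear(2 * dim_joint, dim_joint), nn.ReLU())

    def forward(self, img_feat, txt_feat):
        z = torch.cat([self.img_enc(img_feat), self.txt_enc(txt_feat)], dim=-1)
        return self.joint(z)

# Usage: a batch of 2048-d image features and 300-d text features
# is mapped to a single 512-d joint representation per example.
model = JointRepresentation()
joint_vec = model(torch.randn(8, 2048), torch.randn(8, 300))
print(joint_vec.shape)  # torch.Size([8, 512])
```

This contrasts with coordinated representations (covered later in the lecture), where each modality keeps its own embedding space and the spaces are linked through a constraint such as a similarity objective.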
