Multimodal Machine Learning
Louis-Philippe (LP) Morency
CMU Multimodal Communication and Machine Learning Laboratory [MultiComp Lab]
CMU Course 11-777: Multimodal Machine Learning
Lecture Objectives
▪ What is Multimodal?
▪ Multimodal: core technical challenges
  ▪ Representation learning, translation, alignment, fusion and co-learning
▪ Multimodal representation learning
  ▪ Multimodal tensor representation
▪ Implicit alignment
  ▪ Temporal attention
▪ Fusion and temporal modeling
  ▪ Multi-view LSTM and memory-based fusion
What is Multimodal?
What is Multimodal? Multimodal distribution ➢ Multiple modes, i.e., distinct “peaks” (local maxima) in the probability density function
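To make the statistical meaning concrete, here is a minimal sketch (not from the slides) that evaluates a two-component Gaussian mixture and locates its peaks numerically; the means, weights, and grid are arbitrary illustrative choices:

```python
import numpy as np

# A "multimodal" distribution in the statistical sense: a mixture of two
# Gaussians whose density has two distinct peaks (local maxima).
def mixture_pdf(x, means=(-2.0, 3.0), stds=(1.0, 1.0), weights=(0.5, 0.5)):
    pdf = np.zeros_like(x)
    for m, s, w in zip(means, stds, weights):
        pdf += w * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return pdf

x = np.linspace(-6.0, 8.0, 1000)
density = mixture_pdf(x)
# A mode is a point of the density higher than both of its neighbors.
is_peak = (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])
print("modes near:", x[1:-1][is_peak])  # approximately -2.0 and 3.0
```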
What is Multimodal? Sensory Modalities
Multimodal Communicative Behaviors

Verbal
▪ Lexicon
  ▪ Words
▪ Syntax
  ▪ Part-of-speech
  ▪ Dependencies
▪ Pragmatics
  ▪ Discourse acts

Vocal
▪ Prosody
  ▪ Intonation
  ▪ Voice quality
▪ Vocal expressions
  ▪ Laughter, moans

Visual
▪ Gestures
  ▪ Head gestures
  ▪ Eye gestures
  ▪ Arm gestures
▪ Body language
  ▪ Body posture
  ▪ Proxemics
▪ Eye contact
  ▪ Head gaze
  ▪ Eye gaze
▪ Facial expressions
  ▪ FACS action units
  ▪ Smile, frowning
What is Multimodal?

Modality
The way in which something happens or is experienced.
• Modality refers to a certain type of information and/or the representation format in which information is stored.
• Sensory modality: one of the primary forms of sensation, as vision or touch; channel of communication.

Medium (“middle”)
A means or instrumentality for storing or communicating information; system of communication/transmission.
• Medium is the means whereby this information is delivered to the senses of the interpreter.
Multiple Communities and Modalities: Psychology, Medical, Speech, Vision, Language, Multimedia, Robotics, Learning
Examples of Modalities
▪ Natural language (spoken or written)
▪ Visual (from images or videos)
▪ Auditory (including voice, sounds and music)
▪ Haptics / touch
▪ Smell, taste and self-motion
▪ Physiological signals
  ▪ Electrocardiogram (ECG), skin conductance
▪ Other modalities
  ▪ Infrared images, depth images, fMRI
Prior Research on “Multimodal”
Four eras of multimodal research:
➢ The “behavioral” era (1970s until late 1980s)
➢ The “computational” era (late 1980s until 2000)
➢ The “interaction” era (2000–2010)
➢ The “deep learning” era (2010s until …) ❖ Main focus of this tutorial
The McGurk Effect (1976)
Hearing lips and seeing voices – Nature
➢ The “Computational” Era (late 1980s until 2000)
1) Audio-Visual Speech Recognition (AVSR)
Core Technical Challenges
Core Challenges in “Deep” Multimodal ML
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency
https://arxiv.org/abs/1705.09406
These challenges are non-exclusive.
Core Challenge 1: Representation
Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy.
A) Joint representations: [diagram: Modality 1 and Modality 2 encoded into a single joint Representation]
Joint Multimodal Representation
[diagram: verbal inputs (“I like it!”, “Wow!”) and vocal inputs (joyful tone, tensed voice) mapped into a joint representation (multimodal space)]
Joint Multimodal Representations
▪ Audio-visual speech recognition
  • Bimodal Deep Belief Network [Ngiam et al., ICML 2011]
▪ Image captioning
  • Multimodal Deep Boltzmann Machine [Srivastava and Salakhutdinov, NIPS 2012]
▪ Audio-visual emotion recognition
  • Deep Boltzmann Machine [Kim et al., ICASSP 2013]
Multimodal Vector Space Arithmetic
[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
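The sketch below (not the authors' code) only illustrates the mechanics of this arithmetic: image and word embeddings live in one shared space, so an image vector can be shifted by word vectors and used as a retrieval query. The names `emb` and `gallery` and the random vectors are hypothetical stand-ins for learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for learned embeddings in a shared
# visual-semantic space (in practice these come from trained encoders).
emb = {name: rng.normal(size=300) for name in ("image_of_blue_car", "blue", "red")}
gallery = {f"img_{i}": rng.normal(size=300) for i in range(1000)}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Vector arithmetic in the shared space: image - "blue" + "red",
# then retrieve the nearest gallery image by cosine similarity.
query = emb["image_of_blue_car"] - emb["blue"] + emb["red"]
best = max(gallery, key=lambda k: cosine(gallery[k], query))
print("nearest image:", best)
```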
Core Challenge 1: Representation
Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy.
A) Joint representations: [diagram: Modality 1 and Modality 2 encoded into a single joint Representation]
B) Coordinated representations: [diagram: Modality 1 → Repres. 1 ↔ Repres. 2 ← Modality 2]
Coordinated Representation: Deep CCA
Learn linear projections of the two views that are maximally correlated:

$(\mathbf{u}^*, \mathbf{v}^*) = \operatorname*{argmax}_{\mathbf{u},\mathbf{v}} \; \operatorname{corr}\left(\mathbf{u}^T \mathbf{H}_x, \mathbf{v}^T \mathbf{H}_y\right)$

where $\mathbf{H}_x$ and $\mathbf{H}_y$ are the top-layer outputs of deep networks over the two views (e.g., text $\mathbf{X}$ and image $\mathbf{Y}$).

[Andrew et al., ICML 2013]
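As a concrete reference point, here is a small NumPy sketch of the linear CCA problem that Deep CCA maximizes at its top layer; the whitening-plus-SVD solution and the toy data are standard textbook material, not code from Andrew et al.:

```python
import numpy as np

def cca_first_direction(X, Y, reg=1e-4):
    """Find u, v maximizing corr(u^T X, v^T Y).
    X: (d_x, n) and Y: (d_y, n), one sample per column."""
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sxx = Xc @ Xc.T / (n - 1) + reg * np.eye(X.shape[0])
    Syy = Yc @ Yc.T / (n - 1) + reg * np.eye(Y.shape[0])
    Sxy = Xc @ Yc.T / (n - 1)
    # Whiten each view with an inverse Cholesky factor; the top singular
    # pair of the whitened cross-covariance then yields u and v.
    Lx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Ly = np.linalg.inv(np.linalg.cholesky(Syy))
    U, s, Vt = np.linalg.svd(Lx @ Sxy @ Ly.T)
    return Lx.T @ U[:, 0], Ly.T @ Vt[0, :], s[0]  # s[0] = canonical corr.

# Toy check: two views that share one latent signal correlate strongly.
rng = np.random.default_rng(0)
z = rng.normal(size=(1, 500))
X = np.vstack([z, rng.normal(size=(4, 500))])
Y = np.vstack([z, rng.normal(size=(3, 500))])
u, v, corr = cca_first_direction(X, Y)
print(f"canonical correlation: {corr:.3f}")  # close to 1.0
```

In Deep CCA the inputs X and Y above are replaced by the top-layer activations H_x and H_y, and the correlation objective is backpropagated through both networks.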
Core Challenge 2: Alignment
Definition: Identify the direct relations between (sub)elements from two or more different modalities.
A) Explicit alignment: the goal is to directly find correspondences between elements of different modalities.
B) Implicit alignment: uses internally latent alignment of modalities in order to better solve a different problem.
Temporal sequence alignment
Applications:
- Re-aligning asynchronous data
- Finding similar data across modalities (we can estimate the alignment cost)
- Event reconstruction from multiple sources
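One classical tool for this kind of re-alignment is dynamic time warping (DTW); the sketch below is a plain textbook O(nm) implementation over 1-D sequences, not code from any paper on the slides:

```python
import numpy as np

def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Minimum-cost monotonic alignment between sequences x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of: match, step in x, step in y.
            D[i, j] = dist(x[i - 1], y[j - 1]) + min(
                D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]  # total alignment cost

# The same signal sampled at two different rates aligns cheaply.
a = np.sin(np.linspace(0, 2 * np.pi, 50))
b = np.sin(np.linspace(0, 2 * np.pi, 80))
print(f"alignment cost: {dtw(a, b):.3f}")
```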
Alignment examples (multimodal)
Implicit Alignment
Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, https://arxiv.org/pdf/1406.5679.pdf
Core Challenge 3: Fusion
Definition: To join information from two or more modalities to perform a prediction task.
A) Model-Agnostic Approaches
1) Early fusion: the modality features are combined (e.g., concatenated) before a single classifier.
2) Late fusion: each modality gets its own classifier, and their predictions are combined.
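A minimal PyTorch sketch of the two model-agnostic options, with hypothetical feature sizes; late fusion here averages class scores, though any combination rule (voting, a learned weighting) fits the same template:

```python
import torch
import torch.nn as nn

d1, d2, n_classes = 64, 32, 5  # illustrative feature and label sizes

# 1) Early fusion: concatenate the features, train one classifier.
early = nn.Sequential(nn.Linear(d1 + d2, 128), nn.ReLU(),
                      nn.Linear(128, n_classes))

# 2) Late fusion: one classifier per modality, combine the predictions.
clf1, clf2 = nn.Linear(d1, n_classes), nn.Linear(d2, n_classes)

x1, x2 = torch.randn(8, d1), torch.randn(8, d2)
early_logits = early(torch.cat([x1, x2], dim=-1))
late_logits = (clf1(x1) + clf2(x2)) / 2  # simple averaging rule
print(early_logits.shape, late_logits.shape)  # both: (8, n_classes)
```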
Core Challenge 3: Fusion
Definition: To join information from two or more modalities to perform a prediction task.
B) Model-Based (Intermediate) Approaches
1) Deep neural networks
2) Kernel-based methods (e.g., multiple kernel learning)
3) Graphical models (e.g., Multi-View Hidden CRF)
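For the neural-network case, fusion can also happen inside the model, at a hidden layer, rather than at the raw inputs or the final predictions; a minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

class IntermediateFusion(nn.Module):
    """Each modality gets its own encoder; fusion happens at a
    hidden layer, between early and late fusion."""
    def __init__(self, d1=64, d2=32, hidden=128, n_classes=5):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Linear(d1, hidden), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Linear(d2, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x1, x2):
        h = torch.cat([self.enc1(x1), self.enc2(x2)], dim=-1)
        return self.head(h)

model = IntermediateFusion()
print(model(torch.randn(8, 64), torch.randn(8, 32)).shape)  # (8, 5)
```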
Core Challenge 4: Translation
Definition: Process of changing data from one modality to another, where the translation relationship can often be open-ended or subjective.
A) Example-based
B) Model-driven
Core Challenge 4 – Translation
Transcriptions + audio streams → visual gestures (both speaker and listener gestures)
Marsella et al., Virtual character performance from speech, SIGGRAPH/Eurographics Symposium on Computer Animation, 2013
Core Challenge 5: Co-Learning
Definition: Transfer knowledge between modalities, including their representations and predictive models.
Core Challenge 5: Co-Learning
A) Parallel
B) Non-Parallel
C) Hybrid
Taxonomy of Multimodal Research [https://arxiv.org/abs/1705.09406]

Representation
▪ Joint
  o Neural networks
  o Graphical models
  o Sequential
▪ Coordinated
  o Similarity
  o Structured

Translation
▪ Example-based
  o Retrieval
  o Combination
▪ Model-based
  o Grammar-based
  o Encoder-decoder
  o Online prediction

Alignment
▪ Explicit
  o Unsupervised
  o Supervised
▪ Implicit
  o Graphical models
  o Neural networks

Fusion
▪ Model agnostic
  o Early fusion
  o Late fusion
  o Hybrid fusion
▪ Model-based
  o Kernel-based
  o Graphical models
  o Neural networks

Co-learning
▪ Parallel data
  o Co-training
  o Transfer learning
▪ Non-parallel data
  o Zero-shot learning
  o Concept grounding
  o Transfer learning
▪ Hybrid data
  o Bridging

Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy
Multimodal Applications [ https://arxiv.org/abs/1705.09406 ] Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy
Multimodal Representations
Core Challenge: Representation
Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy.
A) Joint representations: [diagram: Modality 1 and Modality 2 encoded into a single joint Representation]
B) Coordinated representations: [diagram: Modality 1 → Repres. 1 ↔ Repres. 2 ← Modality 2]
Deep Multimodal Autoencoders
▪ A deep representation learning approach
▪ A bimodal autoencoder
▪ Used for audio-visual speech recognition
[Ngiam et al., Multimodal Deep Learning, 2011]
Deep Multimodal Autoencoders – Training
▪ Individual modalities can be pretrained
  ▪ RBMs
  ▪ Denoising autoencoders
▪ To train the model to reconstruct the other modality, vary the inputs:
  ▪ Use both modalities
  ▪ Remove audio (reconstruct both from video alone)
  ▪ Remove video (reconstruct both from audio alone)
[Ngiam et al., Multimodal Deep Learning, 2011]
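A minimal PyTorch sketch in the spirit of the Ngiam et al. architecture, assuming simple dense encoders and zeroing-out to "remove" a modality; the layer sizes and training loop are illustrative, not the paper's configuration (which uses RBM pretraining):

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Encode audio and video into one shared code; decode both back."""
    def __init__(self, d_audio=100, d_video=300, d_shared=128):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(d_audio, d_shared), nn.ReLU())
        self.enc_v = nn.Sequential(nn.Linear(d_video, d_shared), nn.ReLU())
        self.shared = nn.Linear(2 * d_shared, d_shared)
        self.dec_a = nn.Linear(d_shared, d_audio)
        self.dec_v = nn.Linear(d_shared, d_video)

    def forward(self, a, v):
        h = torch.relu(self.shared(
            torch.cat([self.enc_a(a), self.enc_v(v)], dim=-1)))
        return self.dec_a(h), self.dec_v(h)

model = BimodalAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
a, v = torch.randn(8, 100), torch.randn(8, 300)
# Three training conditions: both inputs, audio removed, video removed.
# The targets are ALWAYS both modalities, forcing a cross-modal code.
for a_in, v_in in [(a, v), (torch.zeros_like(a), v), (a, torch.zeros_like(v))]:
    rec_a, rec_v = model(a_in, v_in)
    loss = mse(rec_a, a) + mse(rec_v, v)
    opt.zero_grad(); loss.backward(); opt.step()
print("trained one batch under all three conditions")
```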