
Part 2: Audio-Visual HRI: Methodology and Applications in Assistive Robotics



  1. Part 2: Audio-Visual HRI: Methodology and Applications in Assistive Robotics
  Petros Maragos and Athanasia Zlatintsi
  Computer Vision, Speech Communication & Signal Processing Group, Intelligent Robotics and Automation Laboratory, National Technical University of Athens (NTUA), Greece; Robot Perception and Interaction Unit, Athena Research and Innovation Center (Athena RIC)
  Slides: http://cvsp.cs.ntua.gr/interspeech2018
  Tutorial at INTERSPEECH 2018, Hyderabad, India, 2 Sep. 2018

  2. 2A. Audio-Visual HRI: General Methodology
  Interspeech 2018 Tutorial: Multimodal Speech & Audio Processing in Audio-Visual Human-Robot Interaction

  3. Multimodal HRI: Applications and Challenges
  • Applications: assistive robotics, education, entertainment
  • Challenges
    • Speech: distance from microphones, noisy acoustic scenes, variabilities
    • Visual recognition: noisy backgrounds, motion, variabilities
    • Multimodal fusion: incorporation of multiple sensors, integration issues
    • Special user groups: elderly users, children

  4. Database of the Multimodal Gesture Challenge (in conjunction with ACM ICMI 2013)
  • 20 cultural/anthropological signs of the Italian language: 'vattene' (get out), 'ok' (ok), 'cosa ti farei' (what would I make to you!), 'vieni qui' (come here), 'perfetto' (perfect), 'basta' (that's enough), 'furbo' (clever), 'prendere' (to take), 'che due palle' (what a nuisance!), 'non ce ne piu' (there is none more), 'che vuoi' (what do you want?), 'fame' (hunger), 'd'accordo' (together), 'tanto tempo' (a long time ago), 'sei pazzo' (you are crazy), 'buonissimo' (very good), 'combinato' (combined), 'messi d'accordo' (agreed), 'freganiente' (damn), 'sono stufo' (I am fed up)
  • 22 different users
  • ~20 repeats per user (~1 minute per gesture video)

  5. Multimodal Gesture Signals from the Kinect Sensor
  RGB video & audio, depth, skeleton, and user mask streams (example gesture: 'vieni qui' - come here).
  ChaLearn dataset [S. Escalera, J. Gonzalez, X. Baro, M. Reyes, O. Lopes, I. Guyon, V. Athitsos, and H. Escalante, "Multi-modal gesture recognition challenge 2013: Dataset and results", Proc. 15th ACM Int'l Conf. Multimodal Interaction, 2013]

  6. Multimodal Hypothesis Rescoring + Segmental Parallel Fusion
  Single-stream models (audio, skeleton, handshape) each generate an N-best list of gesture-sequence hypotheses. Multiple-hypothesis rescoring selects the best single-stream hypotheses, and segmental parallel fusion with list resorting yields the best recognized multistream gesture sequence.
  [V. Pitsikalis, A. Katsamanis, S. Theodorakis & P. Maragos, "Multimodal Gesture Recognition via Multiple Hypotheses Rescoring", JMLR 2015]
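The N-best rescoring-and-fusion idea above can be sketched as follows. This is a toy illustration, not the JMLR 2015 implementation: stream names, the scoring function, and the weights are all assumptions made for the example.

```python
# Hypothetical sketch of N-best multimodal hypothesis rescoring: each
# single-stream model emits an N-best list of (gesture sequence, log-score);
# every pooled hypothesis is rescored by all streams and the list is re-sorted.

def rescore_nbest(nbest_per_stream, score_fn, weights):
    """Pool hypotheses from all streams, rescore each with every stream model."""
    pooled = {seq for nbest in nbest_per_stream.values() for seq, _ in nbest}
    rescored = []
    for seq in pooled:
        # weighted sum of per-stream log-scores for this hypothesis
        total = sum(w * score_fn(stream, seq) for stream, w in weights.items())
        rescored.append((seq, total))
    rescored.sort(key=lambda x: x[1], reverse=True)
    return rescored

# toy example: two streams with tiny N-best lists of Italian gesture labels
nbest = {
    "audio":    [(("ok", "basta"), -10.0), (("ok", "furbo"), -12.0)],
    "skeleton": [(("ok", "furbo"), -9.0),  (("vattene",), -15.0)],
}

def toy_score(stream, seq):
    # illustrative stand-in for a model score: agreement with the stream's 1-best
    best_seq = nbest[stream][0][0]
    return -sum(a != b for a, b in zip(seq, best_seq)) - abs(len(seq) - len(best_seq))

ranked = rescore_nbest(nbest, toy_score, {"audio": 1.0, "skeleton": 2.0})
best = ranked[0][0]   # hypothesis favoured after cross-stream rescoring
```

In this toy run the skeleton stream's 1-best wins after rescoring because both streams rank it highly, which is the intuition behind letting every stream vote on every hypothesis.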

  7. Audio-Visual Fusion & Recognition
  • Audio and visual modalities for an A-V gesture word sequence.
  • Ground-truth transcriptions ("REF") and decoding results for audio and 3 different A-V fusion schemes.
  • Results in the top rank of the ChaLearn ACM 2013 Gesture Challenge (50 teams; 22 users x 20 gesture phrases x 20 repeats).
  [V. Pitsikalis, A. Katsamanis, S. Theodorakis & P. Maragos, JMLR 2015]

  8. Visual Activity Recognition
  Examples - Action: sit to stand; Sign (GSL): Europe; Gestures: come here, come near

  9. Visual Action Recognition Pipeline
  Video → Visual Feature Extraction → Temporal Sliding Window → Classifier → Post-processing → Recognized Sequence (e.g., Sit to Stand, Walk)
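The sliding-window stage of the pipeline can be sketched as below. This is a minimal stand-in: the 1-D "feature" and the threshold classifier are toy assumptions; a real system would pool dense-trajectory encodings per window and feed them to an SVM.

```python
import numpy as np

# Toy sketch of temporal sliding-window classification: per-frame features are
# pooled over an overlapping window, the window is classified, and the label is
# assigned to the window's centre frame.

def sliding_window_labels(features, win, classify):
    """Classify each overlapping window; assign its label to the centre frame."""
    half = win // 2
    n = len(features)
    labels = np.empty(n, dtype=int)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        window_feat = features[lo:hi].mean(axis=0)   # pool features over window
        labels[t] = classify(window_feat)
    return labels

# toy data: a 1-D "feature" that jumps when the action changes (sit=0, walk=1)
feat = np.concatenate([np.zeros(20), np.ones(20)])[:, None]
classify = lambda f: int(f[0] > 0.5)
labels = sliding_window_labels(feat, win=5, classify=classify)
```

Note how window pooling already smooths the boundary: frames near the transition are labelled by the majority content of their window.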

  10. Visual Front-End
  Video → Optical Flow → Dense Trajectories → Feature Descriptors

  11. Features: Dense Trajectories
  1. Feature points are sampled on a regular grid at multiple scales.
  2. Feature points are tracked through consecutive video frames.
  3. Descriptors are computed in space-time volumes along the trajectories.
  [Wang et al., IJCV 2013]
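Steps 1-3 can be sketched numerically. This is a toy version under a synthetic constant flow field; real dense trajectories track grid points with median-filtered optical flow at multiple spatial scales, and also attach HOG/HOF/MBH descriptors. Only the trajectory-shape descriptor (displacements normalised by total magnitude, as in Wang et al.) is shown.

```python
import numpy as np

# Toy sketch of dense-trajectory extraction under a known optical-flow field.

def track_trajectories(points, flows):
    """Propagate grid points through per-frame flow vectors; return (P, L+1, 2)."""
    traj = [points]
    for flow in flows:                   # one displacement per frame pair
        traj.append(traj[-1] + flow)     # move every point by the local flow
    return np.stack(traj, axis=1)

def trajectory_shape(traj):
    """Shape descriptor: displacement vectors normalised by total magnitude."""
    disp = np.diff(traj, axis=1)                               # (P, L, 2)
    norm = np.linalg.norm(disp, axis=2).sum(axis=1, keepdims=True)
    return (disp / norm[:, :, None]).reshape(len(traj), -1)    # (P, 2L)

# dense grid of points, tracked for L=3 frames under constant flow (+2, 0)
xs, ys = np.meshgrid(np.arange(0, 20, 5), np.arange(0, 20, 5))
pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
flows = [np.array([2.0, 0.0])] * 3
traj = track_trajectories(pts, flows)        # (16, 4, 2) tracked positions
desc = trajectory_shape(traj)                # (16, 6) shape descriptors
```

The normalisation makes the descriptor invariant to motion magnitude, so a slow and a fast version of the same movement map to similar shapes.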

  12. K-means Clustering and Dictionary
  Feature Samples → K-means → Dictionary

  13. Feature Encoding
  • BoF histogram - size: K
  • VLAD - size: K × D
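The two encodings can be sketched side by side: BoF counts nearest-word assignments (length K), while VLAD sums residuals of descriptors from their assigned centre (length K×D). The random dictionary and descriptors are stand-ins, and the power/L2 normalisation is the commonly used variant, assumed here rather than taken from the tutorial.

```python
import numpy as np

# Sketch of BoF vs. VLAD encoding of one video's local descriptors.

def bof_encode(desc, centers):
    """Bag-of-features: histogram of nearest-word counts, length K."""
    nearest = np.argmin(((desc[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    return np.bincount(nearest, minlength=len(centers)).astype(float)

def vlad_encode(desc, centers):
    """VLAD: per-word sum of residuals (descriptor minus centre), length K*D."""
    nearest = np.argmin(((desc[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    K, D = centers.shape
    v = np.zeros((K, D))
    for k in range(K):
        if np.any(nearest == k):
            v[k] = (desc[nearest == k] - centers[k]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))      # power normalisation
    return v / (np.linalg.norm(v) + 1e-12)   # L2 normalisation

rng = np.random.default_rng(1)
centers = rng.normal(size=(8, 16))           # K=8 words, D=16 dims
desc = rng.normal(size=(100, 16))            # one video's descriptors
bof, vlad = bof_encode(desc, centers), vlad_encode(desc, centers)
```

VLAD's extra factor of D in the encoding length is why it typically outperforms plain BoF with the same vocabulary size: it keeps first-order statistics of the residuals instead of only counts.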

  14. Visual Action Classification
  Labeled videos train the classifier; unlabeled videos receive test labels.
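The train/test split above can be sketched with scikit-learn. The synthetic data stand in for per-video BoF/VLAD encodings, and the class labels, cluster separation, and kernel choice are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of action classification: labelled video encodings train an SVM,
# unlabelled encodings are classified. Clusters are well separated so the
# toy problem is trivially solvable.

rng = np.random.default_rng(2)
K = 64                                              # encoding length
train_X = np.vstack([rng.normal(loc=c, size=(30, K)) for c in (0.0, 3.0)])
train_y = np.repeat(["sit_to_stand", "walk"], 30)   # action labels

clf = SVC(kernel="linear", probability=True).fit(train_X, train_y)

test_X = rng.normal(loc=3.0, size=(5, K))           # unlabelled "walk"-like videos
pred = clf.predict(test_X)
scores = clf.predict_proba(test_X)                  # per-class probabilities
```

Keeping `probability=True` matters for the later pipeline stages: the per-class SVM scores (not just hard labels) feed the temporal post-processing.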

  15. Temporal Segmentation Results
  Classes: Sit, Walk, Stand - B.M.
  Comparison of segmentations: Ground Truth vs. SVM vs. SVM + Filtering + HMM Viterbi decoding.
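The SVM + HMM-Viterbi smoothing can be sketched as below: frame-wise classifier scores are decoded under a "sticky" transition model, which removes short spurious segments that a raw frame classifier produces. The self-transition probability is a hand-set assumption, not a value from the tutorial.

```python
import numpy as np

# Sketch of Viterbi smoothing of frame-wise classifier scores.

def viterbi_smooth(log_scores, stay=0.95):
    """log_scores: (T, C) frame log-likelihoods -> smoothed label path."""
    T, C = log_scores.shape
    log_A = np.full((C, C), np.log((1 - stay) / (C - 1)))  # switching penalty
    np.fill_diagonal(log_A, np.log(stay))                  # cheap to stay put
    delta = log_scores[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + log_A          # (prev state, current state)
        back[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) + log_scores[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):               # backtrack the best path
        path[t - 1] = back[t, path[t]]
    return path

# noisy frame scores: mostly class 0, one weakly flipped frame at t=5
scores = np.full((12, 2), np.log(0.2))
scores[:, 0] = np.log(0.8)
scores[5] = [np.log(0.4), np.log(0.6)]
smoothed = viterbi_smooth(scores)   # the single-frame flip is smoothed away
```

The two state switches needed for a one-frame excursion cost more than the small emission gain, so the decoder keeps the long segment intact, which is exactly the effect shown in the slide's SVM vs. SVM+Filter+HMM comparison.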

  16. Action Recognition Results (Task 4a, 6 patients): Descriptors + Post-processing Smoothing
  • Dense Trajectories + BoF Encoding
  • Results improve by adding depth and/or more advanced encodings.

  17. Gesture Recognition

  18. Gesture Recognition Challenges
  Challenging task of recognizing human gestural movements:
  • Large variability in gesture performance.
  • Some gestures can be performed with either the left or the right hand.
  Example gestures: I want to Sit Down; Come Closer; I want to Perform a Task; Park

  19. Visual Gesture Classification Pipeline
  Output: class probabilities (SVM scores)

  20. Applying Dense Trajectories on Gesture Data

  21. Extended Results on Gesture Recognition
  MOBOT-I, comparisons: multiple descriptors, multiple encodings; Task 6a (8 gestures, 8 patients), mean accuracy (%) over patients.
  [Bar chart comparing descriptors - trajectory shape, HOG, HOF, MBH, combined - under BoVW, VLAD, and Fisher encodings.]
