
Part 2: Audio-Visual HRI: Methodology and Applications in Assistive Robotics



  1. Part 2: Audio-Visual HRI: Methodology and Applications in Assistive Robotics
  Petros Maragos and Athanasia Zlatintsi
  Computer Vision, Speech Communication & Signal Processing Group, Intelligent Robotics and Automation Laboratory, National Technical University of Athens (NTUA), Greece; Robot Perception and Interaction Unit, Athena Research and Innovation Center (Athena RIC)
  Slides: http://cvsp.cs.ntua.gr/interspeech2018
  Tutorial at INTERSPEECH 2018, Hyderabad, India, 2 Sep. 2018

  2. 2A. Audio-Visual HRI: General Methodology
  Interspeech 2018 Tutorial: Multimodal Speech & Audio Processing in Audio-Visual Human-Robot Interaction

  3. Multimodal HRI: Applications and Challenges
  • Applications: assistive robotics, education, entertainment
  • Challenges
    • Speech: distance from microphones, noisy acoustic scenes, variabilities
    • Visual recognition: noisy backgrounds, motion, variabilities
    • Multimodal fusion: incorporation of multiple sensors, integration issues
    • Special user groups: elderly users, children

  4. Database of the Multimodal Gesture Challenge (in conjunction with ACM ICMI 2013)
  • 20 cultural/anthropological signs of the Italian language: 'vattene' (get out), 'ok' (ok), 'cosa ti farei' (what would I make to you!), 'vieni qui' (come here), 'perfetto' (perfect), 'basta' (that's enough), 'furbo' (clever), 'prendere' (to take), 'che due palle' (what a nuisance!), 'non ce ne piu' (there is none more), 'che vuoi' (what do you want?), 'fame' (hunger), 'd'accordo' (together), 'tanto tempo' (a long time ago), 'sei pazzo' (you are crazy), 'buonissimo' (very good), 'combinato' (combined), 'messi d'accordo' (agreed), 'freganiente' (damn), 'sono stufo' (I am fed up)
  • 22 different users
  • ~20 repeats per user (~1 minute per gesture video)

  5. Multimodal Gesture Signals from the Kinect Sensor
  RGB video & audio, depth, skeleton, and user mask streams (example gesture: 'vieni qui' - come here).
  ChaLearn dataset [S. Escalera, J. Gonzalez, X. Baro, M. Reyes, O. Lopes, I. Guyon, V. Athitsos, and H. Escalante, "Multi-modal gesture recognition challenge 2013: Dataset and results", Proc. 15th ACM Int'l Conf. Multimodal Interaction, 2013]

  6. Multimodal Hypothesis Rescoring + Segmental Parallel Fusion
  Single-stream models (audio, skeleton, handshape) each generate an N-best list of gesture-sequence hypotheses. Multiple-hypothesis rescoring selects the best single-stream hypotheses, and segmental parallel fusion with list resorting yields the best recognized multistream gesture sequence.
  [V. Pitsikalis, A. Katsamanis, S. Theodorakis & P. Maragos, "Multimodal Gesture Recognition via Multiple Hypotheses Rescoring", JMLR 2015]
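The N-best rescoring-and-fusion idea above can be sketched as follows. This is a toy illustration, not the JMLR 2015 implementation: stream names, the scoring function, and the weights are all assumptions made for the example.

```python
# Hypothetical sketch of N-best multimodal hypothesis rescoring: each
# single-stream model emits an N-best list of (gesture sequence, log-score);
# every pooled hypothesis is rescored by all streams and the list is re-sorted.

def rescore_nbest(nbest_per_stream, score_fn, weights):
    """Pool hypotheses from all streams, rescore each with every stream model."""
    pooled = {seq for nbest in nbest_per_stream.values() for seq, _ in nbest}
    rescored = []
    for seq in pooled:
        # weighted sum of per-stream log-scores for this hypothesis
        total = sum(w * score_fn(stream, seq) for stream, w in weights.items())
        rescored.append((seq, total))
    rescored.sort(key=lambda x: x[1], reverse=True)
    return rescored

# toy example: two streams with tiny N-best lists of Italian gesture labels
nbest = {
    "audio":    [(("ok", "basta"), -10.0), (("ok", "furbo"), -12.0)],
    "skeleton": [(("ok", "furbo"), -9.0),  (("vattene",), -15.0)],
}

def toy_score(stream, seq):
    # illustrative stand-in for a model score: agreement with the stream's 1-best
    best_seq = nbest[stream][0][0]
    return -sum(a != b for a, b in zip(seq, best_seq)) - abs(len(seq) - len(best_seq))

ranked = rescore_nbest(nbest, toy_score, {"audio": 1.0, "skeleton": 2.0})
best = ranked[0][0]   # hypothesis favoured after cross-stream rescoring
```

In this toy run the skeleton stream's 1-best wins after rescoring because both streams rank it highly, which is the intuition behind letting every stream vote on every hypothesis.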

  7. Audio-Visual Fusion & Recognition
  • Audio and visual modalities for an A-V gesture word sequence.
  • Ground-truth transcriptions ("REF") and decoding results for audio and 3 different A-V fusion schemes.
  • Results in the top rank of the ChaLearn ACM 2013 Gesture Challenge (50 teams; 22 users x 20 gesture phrases x 20 repeats).
  [V. Pitsikalis, A. Katsamanis, S. Theodorakis & P. Maragos, JMLR 2015]

  8. Visual Activity Recognition
  Examples - Action: sit to stand; Sign (GSL): Europe; Gestures: come here, come near

  9. Visual Action Recognition Pipeline
  Video → Visual Feature Extraction → Temporal Sliding Window → Classifier → Post-processing → Recognized Sequence (e.g., Sit to Stand, Walk)
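The sliding-window stage of the pipeline can be sketched as below. This is a minimal stand-in: the 1-D "feature" and the threshold classifier are toy assumptions; a real system would pool dense-trajectory encodings per window and feed them to an SVM.

```python
import numpy as np

# Toy sketch of temporal sliding-window classification: per-frame features are
# pooled over an overlapping window, the window is classified, and the label is
# assigned to the window's centre frame.

def sliding_window_labels(features, win, classify):
    """Classify each overlapping window; assign its label to the centre frame."""
    half = win // 2
    n = len(features)
    labels = np.empty(n, dtype=int)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        window_feat = features[lo:hi].mean(axis=0)   # pool features over window
        labels[t] = classify(window_feat)
    return labels

# toy data: a 1-D "feature" that jumps when the action changes (sit=0, walk=1)
feat = np.concatenate([np.zeros(20), np.ones(20)])[:, None]
classify = lambda f: int(f[0] > 0.5)
labels = sliding_window_labels(feat, win=5, classify=classify)
```

Note how window pooling already smooths the boundary: frames near the transition are labelled by the majority content of their window.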

  10. Visual Front-End
  Video → Optical Flow → Dense Trajectories → Feature Descriptors

  11. Features: Dense Trajectories
  1. Feature points are sampled on a regular grid at multiple scales.
  2. Feature points are tracked through consecutive video frames.
  3. Descriptors are computed in space-time volumes along the trajectories.
  [Wang et al., IJCV 2013]
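Steps 1-3 can be sketched numerically. This is a toy version under a synthetic constant flow field; real dense trajectories track grid points with median-filtered optical flow at multiple spatial scales, and also attach HOG/HOF/MBH descriptors. Only the trajectory-shape descriptor (displacements normalised by total magnitude, as in Wang et al.) is shown.

```python
import numpy as np

# Toy sketch of dense-trajectory extraction under a known optical-flow field.

def track_trajectories(points, flows):
    """Propagate grid points through per-frame flow vectors; return (P, L+1, 2)."""
    traj = [points]
    for flow in flows:                   # one displacement per frame pair
        traj.append(traj[-1] + flow)     # move every point by the local flow
    return np.stack(traj, axis=1)

def trajectory_shape(traj):
    """Shape descriptor: displacement vectors normalised by total magnitude."""
    disp = np.diff(traj, axis=1)                               # (P, L, 2)
    norm = np.linalg.norm(disp, axis=2).sum(axis=1, keepdims=True)
    return (disp / norm[:, :, None]).reshape(len(traj), -1)    # (P, 2L)

# dense grid of points, tracked for L=3 frames under constant flow (+2, 0)
xs, ys = np.meshgrid(np.arange(0, 20, 5), np.arange(0, 20, 5))
pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
flows = [np.array([2.0, 0.0])] * 3
traj = track_trajectories(pts, flows)        # (16, 4, 2) tracked positions
desc = trajectory_shape(traj)                # (16, 6) shape descriptors
```

The normalisation makes the descriptor invariant to motion magnitude, so a slow and a fast version of the same movement map to similar shapes.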

  12. K-means Clustering and Dictionary
  Feature Samples → K-means → Dictionary

  13. Feature Encoding
  • BoF histogram - size: K
  • VLAD - size: K × D
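The two encodings can be sketched side by side: BoF counts nearest-word assignments (length K), while VLAD sums residuals of descriptors from their assigned centre (length K×D). The random dictionary and descriptors are stand-ins, and the power/L2 normalisation is the commonly used variant, assumed here rather than taken from the tutorial.

```python
import numpy as np

# Sketch of BoF vs. VLAD encoding of one video's local descriptors.

def bof_encode(desc, centers):
    """Bag-of-features: histogram of nearest-word counts, length K."""
    nearest = np.argmin(((desc[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    return np.bincount(nearest, minlength=len(centers)).astype(float)

def vlad_encode(desc, centers):
    """VLAD: per-word sum of residuals (descriptor minus centre), length K*D."""
    nearest = np.argmin(((desc[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    K, D = centers.shape
    v = np.zeros((K, D))
    for k in range(K):
        if np.any(nearest == k):
            v[k] = (desc[nearest == k] - centers[k]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))      # power normalisation
    return v / (np.linalg.norm(v) + 1e-12)   # L2 normalisation

rng = np.random.default_rng(1)
centers = rng.normal(size=(8, 16))           # K=8 words, D=16 dims
desc = rng.normal(size=(100, 16))            # one video's descriptors
bof, vlad = bof_encode(desc, centers), vlad_encode(desc, centers)
```

VLAD's extra factor of D in the encoding length is why it typically outperforms plain BoF with the same vocabulary size: it keeps first-order statistics of the residuals instead of only counts.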

  14. Visual Action Classification
  Labeled videos train the classifier; unlabeled videos receive test labels.
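The train/test split above can be sketched with scikit-learn. The synthetic data stand in for per-video BoF/VLAD encodings, and the class labels, cluster separation, and kernel choice are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of action classification: labelled video encodings train an SVM,
# unlabelled encodings are classified. Clusters are well separated so the
# toy problem is trivially solvable.

rng = np.random.default_rng(2)
K = 64                                              # encoding length
train_X = np.vstack([rng.normal(loc=c, size=(30, K)) for c in (0.0, 3.0)])
train_y = np.repeat(["sit_to_stand", "walk"], 30)   # action labels

clf = SVC(kernel="linear", probability=True).fit(train_X, train_y)

test_X = rng.normal(loc=3.0, size=(5, K))           # unlabelled "walk"-like videos
pred = clf.predict(test_X)
scores = clf.predict_proba(test_X)                  # per-class probabilities
```

Keeping `probability=True` matters for the later pipeline stages: the per-class SVM scores (not just hard labels) feed the temporal post-processing.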

  15. Temporal Segmentation Results
  Classes: Sit, Walk, Stand - B.M.
  Comparison of segmentations: Ground Truth vs. SVM vs. SVM + Filtering + HMM Viterbi decoding.
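The SVM + HMM-Viterbi smoothing can be sketched as below: frame-wise classifier scores are decoded under a "sticky" transition model, which removes short spurious segments that a raw frame classifier produces. The self-transition probability is a hand-set assumption, not a value from the tutorial.

```python
import numpy as np

# Sketch of Viterbi smoothing of frame-wise classifier scores.

def viterbi_smooth(log_scores, stay=0.95):
    """log_scores: (T, C) frame log-likelihoods -> smoothed label path."""
    T, C = log_scores.shape
    log_A = np.full((C, C), np.log((1 - stay) / (C - 1)))  # switching penalty
    np.fill_diagonal(log_A, np.log(stay))                  # cheap to stay put
    delta = log_scores[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + log_A          # (prev state, current state)
        back[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) + log_scores[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):               # backtrack the best path
        path[t - 1] = back[t, path[t]]
    return path

# noisy frame scores: mostly class 0, one weakly flipped frame at t=5
scores = np.full((12, 2), np.log(0.2))
scores[:, 0] = np.log(0.8)
scores[5] = [np.log(0.4), np.log(0.6)]
smoothed = viterbi_smooth(scores)   # the single-frame flip is smoothed away
```

The two state switches needed for a one-frame excursion cost more than the small emission gain, so the decoder keeps the long segment intact, which is exactly the effect shown in the slide's SVM vs. SVM+Filter+HMM comparison.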

  16. Action Recognition Results (Task 4a, 6 patients): Descriptors + Post-processing Smoothing
  • Dense Trajectories + BoF Encoding
  • Results improve by adding depth and/or more advanced encodings.

  17. Gesture Recognition

  18. Gesture Recognition Challenges
  Challenging task of recognizing human gestural movements:
  • Large variability in gesture performance.
  • Some gestures can be performed with either the left or the right hand.
  Example gestures: I want to Sit Down; Come Closer; I want to Perform a Task; Park

  19. Visual Gesture Classification Pipeline
  Output: class probabilities (SVM scores)

  20. Applying Dense Trajectories on Gesture Data

  21. Extended Results on Gesture Recognition
  MOBOT-I, comparisons: multiple descriptors, multiple encodings; Task 6a (8 gestures, 8 patients), mean accuracy (%) over patients.
  [Bar chart comparing descriptors - trajectory shape, HOG, HOF, MBH, combined - under BoVW, VLAD, and Fisher encodings.]
