Anticipating the Unseen and Unheard for Embodied Perception
Kristen Grauman, University of Texas at Austin and Facebook AI Research


  1. Anticipating the Unseen and Unheard for Embodied Perception Kristen Grauman University of Texas at Austin Facebook AI Research

  2. Visual recognition: significant recent progress, driven by big labeled datasets, deep learning, and GPU technology. [Plot: ImageNet top-5 error (%)] Kristen Grauman

  3. The Web photo perceptual experience: a “disembodied,” well-curated moment in time. Datasets: BSD (2001), PASCAL (2007-12), Caltech 101 (2004), Caltech 256 (2006), LabelMe (2007), ImageNet (2009), SUN (2010), Places (2014), MS COCO (2014), Visual Genome (2016)

  4. Egocentric perceptual experience: a tangle of relevant and irrelevant multi-sensory information. Kristen Grauman

  5. Big picture goal: Embodied perception. Status quo: learning and inference with “disembodied” snapshots. On the horizon: visual learning in the context of action, motion, and multi-sensory observations. Kristen Grauman

  6. Anticipating the unseen and unheard. Towards embodied perception: look-around policies, affordance learning, audio-visual learning. Kristen Grauman

  7. Active perception From learning representations to learning policies Bajcsy 1985, Aloimonos 1988, Ballard 1991, Wilkes 1992, Dickinson 1997, Schiele & Crowley 1998, Tsotsos 2001, Denzler 2002, Soatto 2009, Ramanathan 2011, Borotschnig 2011, … Kristen Grauman

  8. End-to-end active recognition. Main idea: a deep reinforcement learning approach that anticipates visual changes as a function of egomotion. [Diagram: perception, action selection, and evidence fusion loop over label hypotheses such as mug? bowl? pan?] Jayaraman and Grauman, ECCV 2016, PAMI 2018. Kristen Grauman
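
A rough sketch of that loop (an illustration, not the authors' architecture): the agent alternates perception, evidence fusion, and action selection at each glimpse. All module choices below, including the GRU-based fusion and the feature sizes, are assumptions.

```python
# Minimal sketch (not the published model): an active-recognition loop that
# fuses per-view evidence and samples the next egomotion. Module names and
# sizes are illustrative assumptions.
import torch
import torch.nn as nn

N_CLASSES, N_ACTIONS, FEAT = 10, 8, 128

class ActiveRecognizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.perception = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, FEAT), nn.ReLU())
        self.fuser = nn.GRUCell(FEAT, FEAT)           # evidence fusion across glimpses
        self.classifier = nn.Linear(FEAT, N_CLASSES)  # label belief (mug? bowl? pan?)
        self.policy = nn.Linear(FEAT, N_ACTIONS)      # action selection (next egomotion)

    def forward(self, views):                         # views: (T, B, 3, 32, 32)
        h = torch.zeros(views.size(1), FEAT)
        log_probs, logits = [], None
        for v in views:                               # one glimpse per time step
            h = self.fuser(self.perception(v), h)     # fuse new evidence into the belief
            logits = self.classifier(h)
            dist = torch.distributions.Categorical(logits=self.policy(h))
            a = dist.sample()                         # next egomotion (here the glimpses are
            log_probs.append(dist.log_prob(a))        # pre-recorded, so `a` is only logged)
        return logits, torch.stack(log_probs)         # final label belief + policy log-probs

model = ActiveRecognizer()
views = torch.randn(3, 4, 3, 32, 32)                  # T=3 glimpses, batch of 4
logits, log_probs = model(views)
print(logits.shape, log_probs.shape)
```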

  9. End-to-end active recognition. [Example: predicted label refined over glimpses T=1, T=2, T=3] [Jayaraman and Grauman, ECCV 2016, PAMI 2018] Kristen Grauman

  10. Goal: Learn to “look around.” Unlike reconnaissance, where the recognition task is predefined, in search and rescue the task unfolds dynamically. Can we learn look-around policies for visual agents that are curiosity-driven, exploratory, and generic? Kristen Grauman

  11. Key idea: active observation completion. Completion objective: learn a policy for efficiently inferring (pixels of) all yet-unseen portions of the environment. The agent must choose where to look before looking there. Jayaraman and Grauman, CVPR 2018. Kristen Grauman

  12. Completing unseen views: an encoder-decoder model infers unseen viewpoints. Output: a viewgrid; “supervision”: the actual 360° scene. Jayaraman and Grauman, CVPR 2018; Ramakrishnan & Grauman, ECCV 2018. Kristen Grauman
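
A minimal sketch of this completion objective, assuming a toy fully connected encoder-decoder, a 4x6 viewgrid of 32x32 views, and mean pooling over the observed glimpses; the published architecture differs.

```python
# Minimal sketch (illustrative assumptions throughout): map a few observed
# glimpses to a full viewgrid and supervise with per-pixel MSE against the
# actual 360-degree scene.
import torch
import torch.nn as nn
import torch.nn.functional as F

GRID_H, GRID_W, C, GLIMPSE = 4, 6, 3, 32        # assumed 4x6 viewgrid of 32x32 RGB views

class ViewgridCompleter(nn.Module):
    def __init__(self, feat=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(C * GLIMPSE * GLIMPSE, feat), nn.ReLU())
        self.decoder = nn.Linear(feat, GRID_H * GRID_W * C * GLIMPSE * GLIMPSE)

    def forward(self, glimpses):                 # glimpses: (B, K, C, 32, 32)
        B, K = glimpses.shape[:2]
        feats = self.encoder(glimpses.view(B * K, C, GLIMPSE, GLIMPSE)).view(B, K, -1)
        pooled = feats.mean(dim=1)               # aggregate the K observed views
        return self.decoder(pooled).view(B, GRID_H * GRID_W, C, GLIMPSE, GLIMPSE)

model = ViewgridCompleter()
glimpses = torch.randn(2, 3, C, GLIMPSE, GLIMPSE)              # 3 observed views
target = torch.randn(2, GRID_H * GRID_W, C, GLIMPSE, GLIMPSE)  # full viewgrid ("supervision")
loss = F.mse_loss(model(glimpses), target)                     # completion objective
loss.backward()
```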

  13. Actively selecting observations. [Diagram: encoder, decoder, and actor; belief model visualized] Reward for fast completion. Non-myopic: trained to target a budget of observation time. Jayaraman and Grauman, CVPR 2018; Ramakrishnan & Grauman, ECCV 2018. Kristen Grauman
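
One hedged way to read “reward for fast completion” under a fixed observation budget is a REINFORCE-style update whose per-step reward is the drop in completion error; this is an illustrative formulation, not necessarily the exact training rule used in the papers.

```python
# Sketch of a policy-gradient signal (assumed REINFORCE-style): reward each
# selected observation by how much it reduced the completion error, within a
# fixed budget of T glimpses.
import torch

def policy_loss(log_probs, completion_errors):
    """log_probs: (T, B) action log-probs; completion_errors: (T+1, B) per-pixel
    MSE after 0..T glimpses. The step-t reward is the error reduction it caused."""
    rewards = completion_errors[:-1] - completion_errors[1:]              # (T, B)
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])  # reward-to-go
    returns = (returns - returns.mean()) / (returns.std() + 1e-6)         # crude baseline
    return -(log_probs * returns.detach()).mean()

T, B = 4, 8                                   # non-myopic: train for a budget of T observations
log_probs = torch.randn(T, B, requires_grad=True)
errors = torch.rand(T + 1, B)                 # stand-in completion errors from the decoder
policy_loss(log_probs, errors).backward()
```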

  14. Two scenarios Kristen Grauman

  15. Active “look around” results. [Plots: per-pixel MSE (x1000) vs. time on ModelNet (seen classes), ModelNet (unseen classes), and SUN360, comparing 1-view, random, large-action, large-action+, peek-saliency*, and ours] The learned active look-around policy quickly grasps the environment independent of a specific task. Jayaraman and Grauman, CVPR 2018. *Saliency: Harel et al., Graph-Based Visual Saliency, NIPS’07

  16. Active “look around” results

  17. Active “look around”: agent’s mental model for the 360° scene evolves with actively accumulated glimpses. Jayaraman and Grauman, CVPR 2018; Ramakrishnan & Grauman, ECCV 2018

  18. Active “look around”: agent’s mental model for the 3D object evolves with actively accumulated glimpses. Jayaraman and Grauman, CVPR 2018; Ramakrishnan & Grauman, ECCV 2018

  19. Look-around policy transfer. [Diagram: unsupervised look-around policy, look-around encoder, and decoder vs. supervised task-specific policy, task-specific encoder, and predictor (e.g. “beach”)] Plug the observation completion policy in for a new task. Kristen Grauman
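
A small sketch of that transfer recipe: freeze the unsupervised look-around policy and encoder and train only a new task-specific head on the features gathered from the glimpses the frozen policy selects. The module shapes and the 26-way scene label here are placeholders.

```python
# Sketch of look-around policy transfer (placeholders throughout): the frozen
# exploration policy/encoder are reused as-is; only the new task head trains.
import torch
import torch.nn as nn

look_around_policy = nn.Linear(128, 8)             # stand-in for the pretrained exploration policy
look_around_encoder = nn.Linear(3 * 32 * 32, 128)  # stand-in for the pretrained encoder
for p in list(look_around_policy.parameters()) + list(look_around_encoder.parameters()):
    p.requires_grad_(False)                        # plug in unchanged, no task-specific fine-tuning

task_head = nn.Linear(128, 26)                     # new task, e.g. scene category ("beach")
optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-3)

glimpse = torch.randn(4, 3 * 32 * 32)              # glimpses chosen by the frozen policy (elided here)
label = torch.randint(0, 26, (4,))
feat = look_around_encoder(glimpse)
loss = nn.functional.cross_entropy(task_head(feat), label)
loss.backward()
optimizer.step()
```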

  20. Look-around policy transfer: SUN 360 scenes and ModelNet objects. Plug the observation completion policy in for the active recognition task: the unsupervised exploratory policy approaches supervised task-specific policy accuracy! Jayaraman and Grauman, CVPR 2018. Kristen Grauman

  21. Look-around policy transfer Multiple perception tasks Kristen Grauman Ramakrishnan et al. 2019

  22. Look-around policy transfer. The agent navigates a 3D environment, leveraging active exploration. Kristen Grauman

  23. Extreme relative pose from RGB-D scans. Input: a pair of RGB-D scans with little or no overlap. Output: the rigid transformation (R, t) that relates them. [Diagram: scan 1 and scan 2 with the transform between them] Approach: alternate between completion and matching. Yang et al., CVPR 2019. Kristen Grauman
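
The alternation could be organized roughly as below; `complete_scan` and `match_scans` are placeholders standing in for the learned completion and matching modules of Yang et al., not implementations of them.

```python
# Schematic sketch of "alternate between completion and matching" for two
# barely overlapping scans; both steps below are placeholders.
import numpy as np

def complete_scan(scan, pose_estimate):
    """Hallucinate unseen geometry given the current pose estimate
    (placeholder: returns the scan unchanged)."""
    return scan

def match_scans(scan1, scan2):
    """Estimate a rigid transform (R, t) between the (completed) scans
    (placeholder: identity rotation plus a centroid offset)."""
    R = np.eye(3)
    t = scan2.mean(axis=0) - scan1.mean(axis=0)
    return R, t

scan1, scan2 = np.random.rand(100, 3), np.random.rand(100, 3)
R, t = np.eye(3), np.zeros(3)
for _ in range(5):                           # alternate the two steps
    full1 = complete_scan(scan1, (R, t))
    full2 = complete_scan(scan2, (R, t))
    R, t = match_scans(full1, full2)         # refine the relative pose on completed scans
print(R, t)
```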

  24. Extreme relative pose from RGB-D scans. [Qualitative comparison: GT, ours, 4PCS] Outperforms existing methods on SUNCG / Matterport / ScanNet, particularly in the small-overlap case (10% to 50%). Yang et al., CVPR 2019. Kristen Grauman

  25. 360° video: a “look around” problem for people. Control by mouse: where to look when? Kristen Grauman

  26. AutoCam: input 360° video, output NFOV (normal field-of-view) video. Automatically select the FOV and viewing direction. [Su & Grauman, ACCV 2016, CVPR 2017] Kristen Grauman
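
A simplified reading of that selection step (an assumption, not the published AutoCam pipeline): score candidate NFOV viewing directions in each frame with a capture-worthiness model and prefer temporally smooth choices. The `scorer` and the smoothness penalty below are stand-ins.

```python
# Sketch of per-frame viewing-direction selection (stand-in scorer, assumed
# smoothness penalty): pick the most "capture-worthy" NFOV glimpse per frame.
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))  # stand-in capture-worthiness net

def pick_direction(nfov_candidates, prev_idx, smooth=0.1):
    """nfov_candidates: (K, 3, 64, 64) glimpses extracted at K viewing directions."""
    scores = scorer(nfov_candidates).squeeze(1)                  # score per candidate direction
    penalty = smooth * (torch.arange(len(scores)) - prev_idx).abs().float()
    return int(torch.argmax(scores - penalty))                   # prefer directions near the last one

prev = 0
for frame in torch.randn(5, 12, 3, 64, 64):                      # 5 frames, 12 candidate directions
    prev = pick_direction(frame, prev)
    print("chosen viewing direction:", prev)
```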

  27. Anticipating the unseen and unheard. Towards embodied perception: look-around policies, affordance learning, audio-visual learning. Kristen Grauman

  28. Object interaction: turn on, increase height, move lamp, replace lightbulb. From an embodied perception system to object manipulation. Kristen Grauman

  29. What actions does an object afford? Adjustable, toggle-able, replaceable, movable. From an embodied perception system to object manipulation. Kristen Grauman

  30. Current approaches: affordance as semantic segmentation. Label “holdable” regions; captures annotators’ expectations of what is important. Sawatzky et al. (CVPR 17), Nguyen et al. (IROS 17), Roy et al. (ECCV 16), Myers et al. (ICRA 15), … Kristen Grauman

  31. …but real human behavior is complex Kristen Grauman

  32. How to learn object affordances? Manually curated affordances vs. real human interactions? Sawatzky et al. (CVPR 17), Nguyen et al. (IROS 17), Roy et al. (ECCV 16), Myers et al. (ICRA 15), … Kristen Grauman

  33. Our idea: Learn directly by watching people (video) [Nagarajan et al. 2019] Kristen Grauman

  34. Learning affordances from video. [Diagram: object at rest (t=0…T), anticipation network, aggregated state for the action, action LSTM, classifier, action label “open”] [Nagarajan et al. 2019] Kristen Grauman
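
A minimal sketch of this setup under assumed shapes and losses: an LSTM aggregates the interaction and is supervised only by the clip's action label, while an anticipation head maps the object-at-rest frame toward that aggregated interaction state. The backbones and exact losses in the paper differ.

```python
# Sketch (assumed architecture): weakly supervised action classification over
# a clip, plus an anticipation head tied to the object-at-rest frame (t=0).
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, N_ACTIONS = 128, 20
frame_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, FEAT), nn.ReLU())
lstm = nn.LSTM(FEAT, FEAT, batch_first=True)  # aggregates the interaction over time
classifier = nn.Linear(FEAT, N_ACTIONS)       # weak supervision: action label, e.g. "open"
anticipator = nn.Linear(FEAT, FEAT)           # object at rest -> anticipated interaction state

clip = torch.randn(2, 8, 3, 64, 64)           # B=2 clips, T=8 frames
action = torch.randint(0, N_ACTIONS, (2,))

feats = frame_enc(clip.view(-1, 3, 64, 64)).view(2, 8, FEAT)
agg, _ = lstm(feats)
agg_last = agg[:, -1]                                    # aggregated state for the action
cls_loss = F.cross_entropy(classifier(agg_last), action)
ant_loss = F.mse_loss(anticipator(feats[:, 0]), agg_last.detach())  # t=0 frame -> interaction state
(cls_loss + ant_loss).backward()
```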

  35. Extracting interaction hotspot maps. Hypothesize an action a = “pullable,” then use activation mapping over the anticipation network’s activations and gradients (through the action classifier) to identify the responsible spatial regions, yielding a hotspot map for “pullable.” [Nagarajan et al. 2019] Kristen Grauman
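
A hedged sketch of that activation-mapping step, in the style of Grad-CAM: weight the spatial activations by the gradients of the hypothesized action's score and keep the positive part. The conv backbone and action head below are stand-ins, not the paper's networks.

```python
# Grad-CAM-style hotspot sketch (stand-in networks): gradients of a
# hypothesized action score weight the spatial feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Conv2d(3, 16, 3, padding=1)     # stand-in spatial feature extractor
action_head = nn.Linear(16, 20)               # stand-in action classifier

image = torch.randn(1, 3, 64, 64)             # static image of an object at rest
fmap = backbone(image)                        # (1, 16, 64, 64) spatial activations
fmap.retain_grad()
score = action_head(fmap.mean(dim=(2, 3)))[0, 5]   # score of hypothesized action a = "pullable"
score.backward()

weights = fmap.grad.mean(dim=(2, 3), keepdim=True)           # channel importance from gradients
hotspot = F.relu((weights * fmap).sum(dim=1, keepdim=True))  # (1, 1, 64, 64) hotspot map
hotspot = hotspot / (hotspot.max() + 1e-8)                   # normalize for visualization
print(hotspot.shape)
```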

  36. Wait, is this just action recognition? [Comparison: action recognition + Grad-CAM vs. ours] No: the hotspot anticipation model maps an object at rest to its potential for interaction. Kristen Grauman

  37. Evaluating interaction hotspots. Datasets: OPRA (Fang et al., CVPR 18), EPIC Kitchens (Damen et al., ECCV 18), MS COCO (Lin et al., ECCV 14). Train on video datasets, generate heatmaps on novel images, even from unseen categories. Kristen Grauman

  38. Results: interaction hotspots. Given a static image of an object at rest, infer affordance regions. [Charts: OPRA data and EPIC data vs. weakly supervised and strongly supervised baselines] Up to 24% increase vs. weakly supervised methods. [Nagarajan et al. 2019]

  39. Results: interaction hotspots Kristen Grauman

  40. Results: hotspots for recognition Better low-shot object recognition by anticipating object function Kristen Grauman

  41. Anticipating the unseen and unheard. Towards embodied perception: look-around policies, affordance learning, audio-visual learning. Kristen Grauman

  42. Listening to learn (woof, meow, clatter, ring). Goal: a repertoire of objects and their sounds. Challenge: a single audio channel mixes the sounds of multiple objects. Kristen Grauman

  43. Learning to separate object sounds. Our idea: leverage visual objects to learn from unlabeled video with multiple audio sources. [Diagram: unlabeled video, disentangle, object sound models for violin, dog, cat] Apply to separate simultaneous sounds in novel videos. [Gao, Feris, & Grauman, ECCV 2018] Kristen Grauman
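
For intuition only, here is a simplified mask-based sketch of how per-object visual features can pull apart a mixed spectrogram; the published system learns object sound models differently (from unlabeled multi-source video), so treat the `mask_net` and all shapes as assumptions.

```python
# Simplified sketch of visually guided separation (not the published method):
# predict a frequency mask per detected visual object and apply it to the
# mixture's magnitude spectrogram.
import torch
import torch.nn as nn

FREQ, TIME, OBJ_FEAT = 256, 100, 128
mask_net = nn.Sequential(nn.Linear(OBJ_FEAT, FREQ), nn.Sigmoid())  # stand-in per-object mask net

mixture = torch.rand(1, FREQ, TIME)         # magnitude spectrogram of the mixed audio track
objects = torch.randn(2, OBJ_FEAT)          # visual features, e.g. detected violin and dog

masks = mask_net(objects).unsqueeze(-1)     # (2, FREQ, 1): one frequency mask per object
separated = masks * mixture                 # (2, FREQ, TIME): one spectrogram per object
print(separated.shape)
```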

  44. Results: audio-visual source separation Train on 100,000 unlabeled multi-source video clips, then separate audio for novel video Dataset: AudioSet [Gemmeke et al. 2017] Kristen Grauman [Gao et al. ECCV 2018]

  45. Results: audio-visual source separation Train on 100,000 unlabeled multi-source video clips, then separate audio for novel video Kristen Grauman [Gao et al. ECCV 2018]

  46. Spatial effects in audio Spatial effects absent in monaural audio Cues for spatial hearing: • Interaural time difference (ITD) • Interaural level difference (ILD) • Spectral detail (from pinna reflections) Kristen Grauman Image Credit: Michael Mandel

  47. Our idea: 2.5D visual sound. “Lift” mono audio to spatial audio via visual cues. [Diagram: monaural audio + video frame, “lift,” binaural audio] [Gao & Grauman, CVPR 2019] Kristen Grauman
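
A minimal sketch of the “lifting” idea: assuming the mono track is the sum of the two channels, predicting the left/right difference from the audio plus a visual cue recovers both channels. The linear predictor and feature sizes below are stand-ins for the actual audio-visual network.

```python
# Sketch of mono-to-binaural "lifting" (stand-in nets and sizes): predict the
# left/right difference signal from the mono audio plus a visual feature.
import torch
import torch.nn as nn

FREQ, TIME, VIS_FEAT = 256, 100, 128
predictor = nn.Linear(FREQ + VIS_FEAT, FREQ)  # stand-in for the audio-visual predictor

mono = torch.rand(1, FREQ, TIME)              # mono spectrogram, treated here as left + right
frame_feat = torch.randn(1, VIS_FEAT)         # visual cue: where the sound sources are

inp = torch.cat([mono.mean(dim=2), frame_feat], dim=1)  # (1, FREQ + VIS_FEAT)
diff = predictor(inp).unsqueeze(-1)           # predicted (left - right), broadcast over time
left, right = (mono + diff) / 2, (mono - diff) / 2      # recover the two binaural channels
print(left.shape, right.shape)
```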
