DATA DRIVEN AFFORDANCE LEARNING IN BOTH 2D AND 3D SCENES

Sifei Liu, NVIDIA Research; Xueting Li, University of California, Merced. March 19, 2019.


  1. DATA DRIVEN AFFORDANCE LEARNING IN BOTH 2D AND 3D SCENES. Sifei Liu, NVIDIA Research; Xueting Li, University of California, Merced. March 19, 2019.

  2. UNDERSTANDING SCENE AND HUMAN: scene image segmentation; human pose estimation. (Semantic segmentation from the Cityscapes dataset; pose estimation via OpenCV.)

  3. CREATING SCENE OR HUMAN? Instance placement ✘; human placement ✘. (Semantic segmentation from the Cityscapes dataset; rendered scene from the SUNCG dataset.)

  4. LET’S MAKE IT MORE CHALLENGING! Shape synthesis. (Semantic segmentation from the Cityscapes dataset.)

  5. LET’S MAKE IT MORE CHALLENGING! Shape synthesis?

  6. LET’S MAKE IT MORE CHALLENGING! Placement in the real world. (Videos from: Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation.)

  7. WHAT IS AFFORDANCE? Where are they? (Figure labels: scene image, indoor environment, sitting, standing, human, car.)

  8. WHAT IS AFFORDANCE? What do they look like? (Figure labels: scene image, indoor environment.)

  9. WHAT IS AFFORDANCE? How do they interact with the others? (Figure labels: input image, generated poses.)

  10. OUTLINE: (1) Context-Aware Synthesis and Placement of Object Instances, NeurIPS 2018. Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz. (2) Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments, CVPR 2019. Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz.

  11. QUIZ: Which object is a fake one?

  12. SEQUENTIAL EDITING: Insert new objects one by one.

  13. PROBLEM DEFINITION: Semantic map manipulation by inserting objects. Example: add a person.

  14. WHY SEMANTIC MAP? • Editing an RGB image is difficult (image-to-image translation, image editing, ...). (Figure labels: Image 1, Image 2.)

  15. WHY SEMANTIC MAP? • We don’t have real RGB images when using a simulator, playing a game, or experiencing a virtual world. (Figure labels: semantic map, rendering, visualization. Image from Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun, “Playing for Benchmarks”, ICCV 2017.)

  16. MAIN GOALS: 1. Learn “where” and “what” jointly. 2. End-to-end trainable network. 3. Diverse outputs given the same input.

  17. “WHERE” MODULE: How can we learn where to put a new object?

  18. “WHERE” MODULE: Pixel-wise annotation is almost impossible to get. (Figure labels: p=0, p=0.8, p=0.2.)

  19. “WHERE” MODULE: Existing objects: need to remove and inpaint objects. (Figure label: object.)

  20. “WHERE” MODULE: Existing objects: need to remove and inpaint objects. (Figure label: removed object.)

  21. “WHERE” MODULE: Existing objects: need to remove and inpaint objects. (Figure label: inpainting?)

  22. “WHERE” MODULE: Our approach: put a box and see if it is reasonable (bad box vs. good box). Why a box? 1) We don’t want to care about the object shape for now. 2) All objects can be covered by a bounding box.

  23. “WHERE” MODULE: How to put a box? Apply an affine transform to a unit box. Why not use (x, y, w, h) directly? Placing a box by pixel indices is not differentiable.
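
The affine-transform idea on slide 23 can be illustrated with a small plain-Python sketch. It is not the authors' code: the function names `place_unit_box` and `to_xywh`, the corner convention (±1, ±1) for the unit box, and the 2×3 layout of `theta` are all assumptions made for this example. The point it shows is that every box corner is a linear function of the affine parameters, so a network that predicts `theta` can receive gradients, whereas writing a box into a map by integer indices gives none.

```python
def place_unit_box(theta):
    """Map the corners of a unit box through a 2x3 affine matrix.

    theta = [[a, b, tx], [c, d, ty]] (assumed layout). With b = c = 0 the
    result is an axis-aligned box with half-width a and half-height d,
    centred at (tx, ty), in normalized [-1, 1] image coordinates.
    """
    unit_corners = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
    (a, b, tx), (c, d, ty) = theta
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in unit_corners]


def to_xywh(corners):
    """Recover (x_min, y_min, width, height) from transformed corners."""
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


# Example: a box a quarter of the image wide and half as tall, shifted right.
theta = [[0.25, 0.0, 0.5], [0.0, 0.5, 0.0]]
box = to_xywh(place_unit_box(theta))  # (0.25, -0.5, 0.5, 1.0)
```

In a full model the same `theta` would typically drive a spatial transformer that resamples a unit-box mask onto the semantic map; that resampling step is omitted here.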

  24. “WHERE” MODULE (diagram labels): affine transform, bbox.

  25. “WHERE” MODULE (diagram labels): concat, STN, tile, bbox.
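
Slide 25 shows only the labels "tile" and "concat". One common reading, assumed here and not confirmed by the slides, is that a random vector is tiled over every spatial position and concatenated with the input map along the channel axis, so that different random vectors can lead to different box placements. The helper name `tile_and_concat` and the nested-list layout are hypothetical.

```python
import random


def tile_and_concat(feature_map, z):
    """Tile vector z over every spatial position and concatenate on channels.

    feature_map: list of C channels, each an H x W nested list.
    z: list of D floats. Returns C + D channels, each H x W.
    """
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    # Each entry of z becomes one constant H x W channel.
    tiled = [[[value] * width for _ in range(height)] for value in z]
    return feature_map + tiled


rng = random.Random(0)
semantic_map = [[[0.0, 1.0], [1.0, 0.0]]]     # 1 channel, 2 x 2 (toy input)
z = [rng.gauss(0.0, 1.0) for _ in range(3)]   # 3-dimensional random vector
stacked = tile_and_concat(semantic_map, z)    # 4 channels, 2 x 2
```

Sampling a new `z` and repeating the call is how a single input would yield many candidate placements under this assumption.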

  26. “WHERE” MODULE (diagram labels): concat, fake, STN, real/fake loss, tile, bbox, real.
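
The slides name a "real/fake loss" without giving its form. A minimal sketch, assuming the standard binary cross-entropy GAN objective with a non-saturating generator loss (the actual loss used in the talk may differ), looks like this. All function names are illustrative.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for a single probability, clamped for stability."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_fake):
    """Push D(real placement) toward 1 and D(generated placement) toward 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(d_fake):
    """Non-saturating generator loss: push D(generated placement) toward 1."""
    return bce(d_fake, 1.0)
```

Here `d_real` and `d_fake` stand for the discriminator's probability that a semantic map with a real or a generated box is real; a generated box that fools the discriminator gives a small `generator_loss`.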

  27. “WHERE” MODULE: Results with 100 different random vectors.

  28. “WHERE” MODULE (diagram labels): concat, fake, STN, real/fake loss, tile, bbox, ignored, real.

  29. “WHERE” MODULE (diagram labels): concat, fake, STN, real/fake loss, tile, bbox, real.

  30. “WHERE” MODULE (diagram labels): concat, fake, STN, real/fake loss, tile, bbox, real.

  31. “WHERE” MODULE: Results with 100 different random vectors.

  32. “WHERE” MODULE (diagram labels): concat, fake, STN, real/fake loss, tile, bbox, lazy, lazy, z1, real, z2.

  33. “WHERE” MODULE (diagram labels): concat, fake, STN, real/fake loss, tile, bbox, real, bbox.

  34. “WHERE” MODULE (diagram labels): concat, fake, STN, real/fake loss, tile, bbox, (shared), (shared), real, concat, STN, bbox, tile.

  35. “WHERE” MODULE (diagram labels): concat, fake, STN, encoder-decoder, tile, bbox, unsupervised path, (shared), (shared), supervised path, + reconstruct, real, concat, STN, bbox, tile, + supervision.

  36. “WHERE” MODULE: Results with 100 different random vectors (red: person, blue: car).

  37. “WHERE” MODULE: Results from epoch 0 to 30.

  38. MAIN GOALS: 1. Learn “where” and “what” jointly. 2. End-to-end trainable network. 3. Diverse outputs given the same input.

  39. “WHAT” MODULE (diagram labels): concat, tile.

  40. “WHAT” MODULE (diagram labels): “where” module, concat, tile.

  41. “WHAT” MODULE (diagram labels): “where” module, concat, tile, fake, unsupervised path, encoder-decoder, (shared), supervised path, concat, real, + supervision, tile.

  42. OVERALL ARCHITECTURE: Forward pass. (Diagram labels: input, affine prediction, unit box, bounding box, object shape generation, output.)

  43. OVERALL ARCHITECTURE: Backward pass for the “where” loss. (Diagram labels: input, affine prediction, unit box, bounding box, object shape generation, output, “where” discriminator.)

  44. OVERALL ARCHITECTURE: Backward pass for the “what” loss. (Diagram labels: input, affine prediction, unit box, bounding box, object shape generation, output, “what” discriminator.)

  45. “WHAT” MODULE: Fix “where”, change “what”.

  46. EXPERIMENTS: Synthesized RGB (pix2pixHD). (Figure labels: input, generated.)

  47. EXPERIMENTS: Synthesized RGB (nearest-neighbor). (Figure labels: nearest neighbor, generated.)

  48. EXPERIMENTS: Synthesized RGB (pix2pixHD). (Figure labels: input, generated.)

  49. EXPERIMENTS: Synthesized RGB (nearest-neighbor). (Figure labels: nearest neighbor, generated.)

  54. USER STUDY: Ideal: 50%. Our result: 43%.

  55. BASELINES (diagram labels for Baseline 1 and Baseline 2): input, encoder-decoder, encoder, generator, STN, generated object, real, result.

  56. CONCLUSION: Learning affordance in 2D. Where are they? What do they look like?

  57. PUTTING HUMANS IN A SCENE: LEARNING AFFORDANCE IN 3D INDOOR ENVIRONMENTS. Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

  58. WHAT IS AFFORDANCE IN 3D? • General definition: ➢ opportunities for interaction in the scene, i.e. what actions an object can be used for. Examples: the floor can be used for standing; the desk can be used for sitting. • Applications: ➢ Robot navigation ➢ Game development. Image credit: David F. Fouhey et al., In Defense of the Direct Perception of Affordances, CoRR abs/1505.01085 (2015).

  59. AFFORDANCE IN 3D WORLD • Given a single image of a 3D scene, generate reasonable human poses in that 3D scene.

  60. LEARNING 3D AFFORDANCE: How to define a “reasonable” human pose in indoor scenes? • Semantically plausible: the human should take actions that are common in indoor environments. • Physically stable: the human should be well supported by the surrounding objects.

  61. LEARNING 3D AFFORDANCE: Fuse semantic knowledge and geometry knowledge. A data-driven way?

  62. LEARNING 3D AFFORDANCE • Stage I: Build a fully-automatic 3D pose synthesizer. • Stage II: Use the dataset synthesized by Stage I to train a data-driven, end-to-end 3D pose prediction model. (Diagram labels: semantic knowledge, geometry knowledge, pose synthesizer, where, what, …)

  63. LEARNING 3D AFFORDANCE • Stage I: Build a fully-automatic 3D pose synthesizer. • Stage II: Use the dataset synthesized by Stage I to train a data-driven, end-to-end 3D pose prediction model. (Diagram labels: semantic knowledge, geometry knowledge, pose synthesizer, where, what, …)

  64. A FULLY-AUTOMATIC 3D POSE SYNTHESIZER: Fusing semantic & geometry knowledge. Semantic knowledge: the Sitcom [1] dataset (no 3D annotations). Geometry knowledge: the SUNCG [2] dataset (no human poses). Combine? [1] Wang X, Girdhar R, Gupta A. Binge watching: Scaling affordance learning from sitcoms. CVPR, 2017. [2] Song S, Yu F, Zeng A, et al. Semantic scene completion from a single depth image. CVPR, 2017.

  65. A FULLY-AUTOMATIC 3D POSE SYNTHESIZER: semantic knowledge adaptation; geometry adjustment. (Diagram labels: input image, location heat map, generated poses, domain adaptation, mapping from image to voxel, mapped pose, adjusted pose; the figure also shows image and voxel coordinate axes.)

  66. A FULLY-AUTOMATIC 3D POSE SYNTHESIZER: semantic knowledge adaptation; geometry adjustment. (Diagram labels: input image, location heat map, generated poses, domain adaptation, mapping from image to voxel, mapped pose, adjusted pose; the figure also shows image and voxel coordinate axes.)

  67. A FULLY-AUTOMATIC 3D POSE SYNTHESIZER: semantic knowledge adaptation. (Diagram labels: input image, ResNet-18, convolution, deconvolution, location heat map.)

  68. A FULLY-AUTOMATIC 3D POSE SYNTHESIZER: semantic knowledge adaptation. (Diagram labels: input image, location heat map, generated poses. Source: Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al., CVPR 2017.)

  69. A FULLY-AUTOMATIC 3D POSE SYNTHESIZER: semantic knowledge adaptation. (Diagram labels: input image, location heat map, generated poses, domain adaptation.)

  70. A FULLY-AUTOMATIC 3D POSE SYNTHESIZER: semantic knowledge adaptation; geometry adjustment. (Diagram labels: input image, location heat map, generated poses, domain adaptation, mapping from image to voxel, mapped pose, adjusted pose; the figure also shows image and voxel coordinate axes.)
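
The slides name a "mapping from image to voxel" step without giving details. A minimal sketch of one way such a mapping could work is shown below, assuming a pinhole camera with known intrinsics, a known depth for the pixel, and a voxel grid aligned with the camera axes. The function name `pixel_to_voxel`, every parameter, and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def pixel_to_voxel(u, v, depth, fx, fy, cx, cy, voxel_size, origin):
    """Back-project pixel (u, v) at a given depth and return its voxel index.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), and a voxel grid aligned with the camera axes whose
    minimum corner sits at `origin` (all assumed conventions).
    """
    # Pinhole back-projection into camera coordinates (metres).
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    z = depth
    # Quantize each coordinate to a grid index.
    return tuple(
        math.floor((coord - start) / voxel_size)
        for coord, start in zip((x, y, z), origin)
    )


# A joint seen at pixel (448, 240), 2 m from the camera.
index = pixel_to_voxel(
    u=448, v=240, depth=2.0,
    fx=512.0, fy=512.0, cx=320.0, cy=240.0,
    voxel_size=0.25, origin=(-1.0, -1.0, 0.0),
)  # (6, 4, 8)
```

Applying this to every joint of a 2D pose would give a "mapped pose" in the voxel grid, which a later geometry adjustment step could then refine against the scene.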
