DATA-DRIVEN AFFORDANCE LEARNING IN BOTH 2D AND 3D SCENES Sifei Liu, NVIDIA Research Xueting Li, University of California, Merced March 19, 2019
UNDERSTANDING SCENE AND HUMAN scene image segmentation, human pose estimation (semantic segmentation from the Cityscapes dataset; pose estimation via OpenCV) 2
CREATING SCENE OR HUMAN? instance placement human placement ✘ ✘ (semantic segmentation from the Cityscapes dataset; rendered scene from the SUNCG dataset) 3
LET’S MAKE IT MORE CHALLENGING! shape synthesis semantic segmentation from cityscapes dataset 4
LET’S MAKE IT MORE CHALLENGING! shape synthesis ? 5
LET’S MAKE IT MORE CHALLENGING! placement in the real world videos from: Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation 6
WHAT IS AFFORDANCE? Where are they? scene image indoor environment sitting standing human car 7
WHAT IS AFFORDANCE? What do they look like? scene image indoor environment 8
WHAT IS AFFORDANCE? How do they interact with the others? Input Image Generated Poses 9
OUTLINE Context-Aware Synthesis and Placement of Object Instances, NeurIPS 2018 Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments, CVPR 2019 Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz 10
QUIZ Which object is a fake one? 11
SEQUENTIAL EDITING Insert new objects one by one 12
PROBLEM DEFINITION Semantic map manipulation by inserting objects Add a person 13
WHY SEMANTIC MAP? • Editing an RGB image is difficult (image-to-image translation, image editing, ...) Image 1 Image 2 14
WHY SEMANTIC MAP? • We don’t have real RGB images when using a simulator, playing a game, or experiencing a virtual world Semantic map Rendering Visualization Image from Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun, “Playing for Benchmarks”, ICCV 2017 15
MAIN GOALS 1. Learn “where” and “what” jointly 2. End-to-end trainable network 3. Diverse outputs given the same input 16
“WHERE” MODULE How can we learn where to put a new object? 17
“WHERE” MODULE Pixel-wise annotation: almost impossible to get p=0 p=0.8 p=0.2 18
“WHERE” MODULE Existing objects: need to remove and inpaint objects Object 19
“WHERE” MODULE Existing objects: need to remove and inpaint objects Removed Object 20
“WHERE” MODULE Existing objects: need to remove and inpaint objects Inpainting ? 21
“WHERE” MODULE Our approach: put a box and see if it is reasonable Bad box Good box Why a box? 1) We do not need to model the object shape at this stage. 2) Any object can be covered by a bounding box. 22
“WHERE” MODULE How to put a box? Affine transform Unit box Why not use (x, y, w, h) directly? Placing a box via integer indices is not differentiable. 23
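To make the affine-transform idea concrete, here is a minimal PyTorch sketch (not the paper’s released code; shapes and the `place_box` name are assumptions): the network predicts affine parameters, and `affine_grid`/`grid_sample` warp a canonical unit box into the semantic map, so the placement stays differentiable, unlike indexing with (x, y, w, h).

```python
import torch
import torch.nn.functional as F

def place_box(affine_params, out_hw=(128, 256)):
    """affine_params: (N, 6) predicted by the "where" network."""
    n = affine_params.shape[0]
    theta = affine_params.view(n, 2, 3)              # one 2x3 affine matrix per sample
    unit_box = torch.ones(n, 1, 8, 8)                # canonical unit box (all ones)
    grid = F.affine_grid(theta, (n, 1, *out_hw), align_corners=False)
    # Sampling the unit box through the warped grid yields a soft box mask at the
    # predicted location and scale; gradients flow back into affine_params.
    return F.grid_sample(unit_box, grid, align_corners=False)
```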
“WHERE” MODULE Affine transform bbox 24
“WHERE” MODULE concat STN tile bbox 25
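A minimal sketch of the “concat / tile” step in the diagram (shapes assumed, not the authors’ exact code): the random vector z is tiled over the spatial grid and concatenated with the encoded semantic map before the STN predicts the box transform.

```python
import torch

def concat_tile(feat, z):
    """feat: (N, C, H, W) encoded semantic map; z: (N, D) random vector."""
    n, _, h, w = feat.shape
    z_tiled = z.view(n, -1, 1, 1).expand(n, z.shape[1], h, w)  # tile z over the spatial grid
    return torch.cat([feat, z_tiled], dim=1)                   # channel-wise concatenation
```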
“WHERE” MODULE concat fake STN Real/fake loss tile bbox real 26
“WHERE” MODULE Results with 100 different random vectors 27
“WHERE” MODULE concat fake STN Real/fake loss tile bbox Ignored real 28
“WHERE” MODULE concat fake STN Real/fake loss tile bbox real 29
“WHERE” MODULE concat fake STN Real/fake loss tile bbox real 30
“WHERE” MODULE Results with 100 different random vectors 31
“WHERE” MODULE concat fake STN Real/fake loss tile bbox Lazy Lazy z1 real z2 32
“WHERE” MODULE concat fake STN Real/fake loss tile bbox real bbox 33
“WHERE” MODULE concat fake STN Real/fake loss tile bbox (shared) (shared) real concat STN bbox tile 34
“WHERE” MODULE Unsupervised path (shared): concat, tile, STN → fake bbox; Supervised path (shared, encoder-decoder): concat, tile, STN → real bbox + reconstruct + supervision 35
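To make the two-path diagram concrete, a hedged sketch of how they might be trained together (the exact loss terms and weights are assumptions, not the paper’s values): the unsupervised path receives an adversarial real/fake loss, while the supervised path reconstructs a ground-truth box, which discourages the generator from ignoring z.

```python
import torch
import torch.nn.functional as F

def where_losses(d_on_fake, pred_box_sup, gt_box, lambda_rec=10.0):
    """d_on_fake: discriminator logits on the generated (fake) layout;
    pred_box_sup / gt_box: supervised-path box prediction and ground truth."""
    adv = F.binary_cross_entropy_with_logits(
        d_on_fake, torch.ones_like(d_on_fake))   # unsupervised path: fool the discriminator
    rec = F.l1_loss(pred_box_sup, gt_box)        # supervised path: reconstruct the real box
    return adv + lambda_rec * rec
```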
“WHERE” MODULE Results with 100 different random vectors (red: person, blue: car) 36
“WHERE” MODULE Results from epoch 0 to 30 37
MAIN GOALS 1. Learn “where” and “what” jointly 2. End-to-end trainable network 3. Diverse outputs given the same input 38
“WHAT” MODULE concat tile 39
“Where” module “WHAT” MODULE concat tile 40
“WHAT” MODULE “Where” module → Unsupervised path: concat, tile → fake; Supervised path (shared encoder-decoder): concat, tile → real + supervision 41
OVERALL ARCHITECTURE Forward pass: Input → affine prediction (unit box → bounding box) → object shape generation → Output 42
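A high-level sketch of this forward pass (module names, the class id, and the compositing step are placeholders, not the released implementation; it reuses the `place_box` helper from the earlier sketch): the “where” module maps the unit box to a bounding box, and the “what” module fills that box with an object shape that is composited back into the semantic map.

```python
import torch

def insert_object(sem_map, z_where, z_what, where_net, what_net, new_class_id=24):
    """sem_map: (N, 1, H, W) label map; where_net / what_net are the two modules
    (placeholders here); new_class_id: label of the inserted object (e.g. person)."""
    affine = where_net(sem_map, z_where)           # "where": affine parameters for the box
    box = place_box(affine, sem_map.shape[-2:])    # differentiable box mask (see sketch above)
    shape = what_net(sem_map, box, z_what)         # "what": object shape mask inside the box
    return torch.where(shape > 0.5,
                       torch.full_like(sem_map, new_class_id),
                       sem_map)                    # composite the new object into the map
```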
OVERALL ARCHITECTURE Backward pass for “where” loss: Input → affine prediction (unit box → bounding box) → object shape generation → Output; “Where” discriminator 43
OVERALL ARCHITECTURE Backward pass for “what” loss: Input → affine prediction (unit box → bounding box) → object shape generation → Output; “What” discriminator 44
“WHAT” MODULE Fix “where”, change “what” 45
EXPERIMENTS Synthesized RGB (pix2pixHD) Input Generated 46
EXPERIMENTS Synthesized RGB (nearest-neighbor) Nearest Neighbor Generated 47
EXPERIMENTS Synthesized RGB (pix2pixHD) Input Generated 48
EXPERIMENTS Synthesized RGB (nearest-neighbor) Nearest Neighbor Generated 49
USER STUDY Ideal: 50% Our result: 43% 54
BASELINES [figure: two baselines compared — Baseline 1 (encoder-decoder + STN producing the generated object) and Baseline 2 (encoder + generator) — with input, real, and result examples for each] 55
CONCLUSION Learning Affordance in 2D where are they? what do they look like? 56
PUTTING HUMANS IN A SCENE: LEARNING AFFORDANCE IN 3D INDOOR ENVIRONMENTS Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz
WHAT IS AFFORDANCE IN 3D? • General definition: ➢ opportunities for interaction in the scene, i.e., what actions an object can be used for. The floor can be used for standing; the desk can be used for sitting. • Applications: ➢ Robot navigation ➢ Game development Image Credit: David F. Fouhey et al. In Defense of the Direct Perception of Affordances, CoRR abs/1505.01085 (2015)
AFFORDANCE IN 3D WORLD • Given a single image of a 3D scene, generate reasonable human poses in the scene. ?
LEARNING 3D AFFORDANCE How to define a “reasonable” human pose in indoor scenes? • Semantically plausible: the human should take common actions in the indoor environment. • Physically stable: the human should be well supported by its surrounding objects.
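As a toy illustration of the second criterion only (not the paper’s actual test), one could check a candidate pose against the voxelized scene: reject it if any joint penetrates an occupied voxel, or if a support joint (hips when sitting, feet when standing) has nothing beneath it.

```python
import numpy as np

def is_physically_stable(joints, occupancy, support_ids, eps=2):
    """joints: (J, 3) integer voxel coords (x, y, z) with z pointing up;
    occupancy: 3D boolean grid of the scene; support_ids: indices of support joints."""
    x, y, z = joints[:, 0], joints[:, 1], joints[:, 2]
    if occupancy[x, y, z].any():
        return False                                   # a joint is inside an object
    for j in support_ids:                              # hips for sitting, feet for standing
        xj, yj, zj = joints[j]
        if not occupancy[xj, yj, max(zj - eps, 0):zj].any():
            return False                               # nothing supports this joint
    return True
```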
LEARNING 3D AFFORDANCE semantic knowledge fuse geometry knowledge A data-driven way?
LEARNING 3D AFFORDANCE • Stage I: Build a fully-automatic 3D pose synthesizer (fusing semantic knowledge and geometry knowledge). • Stage II: Use the dataset synthesized by Stage I to train a data-driven, end-to-end 3D pose prediction model. pose synthesizer where what …
LEARNING 3D AFFORDANCE • Stage I: Build a fully-automatic 3D pose synthesizer (fusing semantic knowledge and geometry knowledge). • Stage II: Use the dataset synthesized by Stage I to train a data-driven, end-to-end 3D pose prediction model. pose synthesizer where what …
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER Fusing semantic & geometry knowledge: semantic knowledge from the Sitcom [1] dataset (no 3D annotations); geometry knowledge from the SUNCG [2] dataset (no human poses). Combine? [1] Wang X, Girdhar R, Gupta A. Binge watching: Scaling affordance learning from sitcoms. CVPR 2017. [2] Song S, Yu F, Zeng A, et al. Semantic scene completion from a single depth image. CVPR 2017.
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER semantic knowledge adaptation, geometry adjustment (mapping from image to voxel; domain adaptation): input image → location heat map → generated poses → mapped pose → adjusted pose
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER semantic knowledge adaptation, geometry adjustment (mapping from image to voxel; domain adaptation): input image → location heat map → generated poses → mapped pose → adjusted pose
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER semantic knowledge adaptation: input image → location heat map (ResNet-18 encoder, convolution + deconvolution)
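A minimal sketch of the kind of network this slide describes (layer widths and the output resolution are assumptions): a ResNet-18 convolutional encoder followed by deconvolution layers that upsample to a single-channel location heat map.

```python
import torch.nn as nn
import torchvision

class LocationHeatmapNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool & fc
        self.decoder = nn.Sequential(                                  # deconvolution decoder
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),                           # per-pixel location logits
        )

    def forward(self, image):                     # image: (N, 3, H, W)
        return self.decoder(self.encoder(image))  # heat map at 1/4 of the input resolution
```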
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER semantic knowledge adaptation: input image → location heat map → generated poses. Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al., CVPR 2017
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER semantic knowledge adaptation Domain adaptation input image location heat map generated poses
A FULLY-AUTOMATIC 3D POSE SYNTHESIZER semantic knowledge adaptation, geometry adjustment (mapping from image to voxel; domain adaptation): input image → location heat map → generated poses → mapped pose → adjusted pose
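For the “mapping from image to voxel” step, a hedged sketch of standard pinhole back-projection (the depth source, intrinsics K, and voxel-grid parameters are assumptions): each 2D joint is lifted to a 3D camera-frame point using its depth, then quantized into the voxel grid.

```python
import numpy as np

def pose_2d_to_voxel(joints_uv, depth, K, voxel_origin, voxel_size):
    """joints_uv: (J, 2) pixel coords; depth: (J,) metric depth per joint;
    K: 3x3 camera intrinsics; voxel_origin: (3,) grid origin; voxel_size: scalar."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (joints_uv[:, 0] - cx) * depth / fx          # back-project u to camera X
    y = (joints_uv[:, 1] - cy) * depth / fy          # back-project v to camera Y
    points_cam = np.stack([x, y, depth], axis=1)     # (J, 3) joints in the camera frame
    return np.round((points_cam - voxel_origin) / voxel_size).astype(int)
```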