Scene Navigation by Knowledge Graph and Interaction Mohammad Rastegari ICCV, Oct, 2019
Task: Navigate to Television. Example action sequence: Move Forward, Move Forward, Rotate Right, Done.
• 120 Scenes • Room types • Kitchen • Living room • Bedroom • Bathroom • Each room class has 30 scenes • Training: 20 rooms/class • Testing: 5 rooms/class
Challenges • Normally we relocate a seen object in a seen scene • The main challenges are: • Generalizing to unseen scenes • Generalizing to unseen objects
Using Prior Knowledge (figure labels: Apple, Coffee machine, Cup, Mango)
Knowledge Graph
Scene Prior (graph figure; nodes: Plate, Table, Sandwich, Sink, Painting, Remote, Coffee Machine, Cabinet, TV, Mug, Bowl, Counter, Microwave, Laptop, Box, Toaster; edge labels: next to, on)
Scene Prior Graph (example edge: Remote, next to, Television)
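Architecture Flow (figure) • History frames: ResNet-50, then FC (512) • Word “Television”: Word Embedding, then FC (512) • Scene prior graph (Remote, next to, Television): Graph Convolutional Network, then FC (512) • The three FC outputs form the Joint Embedding • Actor-Critic Model (MLP) outputs Policy and Value • Action Sampler draws an action from the Policy and sends it to the Environment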
Architecture Flow with Scene Prior Graph (figure): the architecture flow above, highlighting the branch in which the scene prior graph (Remote, next to, Television) passes through the Graph Convolutional Network and an FC (512) layer into the Joint Embedding
Graph Convolutional Network (GCN) $H^{(l+1)} = f(\hat{A} H^{(l)} W^{(l)})$ • $\hat{A}$: Normalized Adjacency Matrix • $H^{(l)}$: Node features at the $l$th layer • $W^{(l)}$: Learnable parameters at the $l$th layer • $f$: Activation Function (e.g. ReLU)
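The propagation rule above can be sketched in a few lines of NumPy. This is only an illustration: the slide does not say which normalization is used, so the symmetric form with self-loops, $D^{-1/2}(A+I)D^{-1/2}$, is an assumption here, and the toy graph, feature sizes, and function names are hypothetical.

```python
import numpy as np

def normalize_adjacency(adj):
    """Assumed normalization: D^{-1/2} (A + I) D^{-1/2} (self-loops added)."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_tilde.sum(axis=1)                 # node degrees (all >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_hat, h, w):
    """One layer: H^{(l+1)} = ReLU(A_hat @ H^{(l)} @ W^{(l)})."""
    return np.maximum(a_hat @ h @ w, 0.0)

# Hypothetical toy graph: Remote (0) is next to Television (1); Mug (2) is isolated.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
a_hat = normalize_adjacency(adj)

rng = np.random.default_rng(0)
h0 = rng.normal(size=(3, 4))   # 3 nodes, 4 input features each
w0 = rng.normal(size=(4, 2))   # maps 4 features to 2
h1 = gcn_layer(a_hat, h0, w0)  # node features after one layer, shape (3, 2)
```

In this toy graph Remote and Television have identical rows in the normalized adjacency matrix, so after one layer they end up with the same features, while the isolated Mug node only sees itself.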
GCN for Scene Navigation (figure) • Graph nodes are object categories (e.g. “Fridge”, “Toaster”) • Inputs: the node's word and the ResNet-50 1000-class score, each passed through FC (512) • The two 512-d features are concatenated • 3 GCN layers • The knowledge graph is updated over time according to the recent observations
Action Space • Move Ahead • Move Back • Rotate Right • Rotate Left • Stop We include the Stop action and expect the agent to issue it when it reaches the target. This makes learning more challenging.
Seen Scenes, Novel Objects
Bedroom | Mirror
Living room | Painting
Kitchen | Toaster
Kitchen | Microwave
Unseen Scenes, Known Objects
Bathroom | Soap
Bedroom | Lamp
Bedroom | Light Switch
Kitchen | Cabinet
Unseen Scenes, Novel Objects
Bathroom | Towel
Kitchen | Microwave
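Evaluation Metrics • Success Rate (SR) • The ratio of successful navigations toward the object over N episodes • Success weighted by Path Length (SPL) • The ratio of successful navigations toward the object weighted by the path length over N episodes, considering both Success Rate and path length • SPL is defined as $\frac{1}{N} \sum_{i=1}^{N} S_i \frac{L_i}{\max(P_i, L_i)}$, where $N$ is the number of episodes, $S_i$ is a binary indicator of success in episode $i$, $P_i$ represents the length of the path taken, and $L_i$ is the shortest path length to the target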
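A minimal sketch of both metrics, assuming each episode is summarized by a success flag, the length of the path taken, and the shortest path length. The function names and the four example episodes are made up for illustration, and path lengths are assumed to be positive.

```python
def success_rate(successes):
    """SR: fraction of episodes that ended in success."""
    return sum(successes) / len(successes)

def spl(successes, path_lengths, shortest_lengths):
    """SPL: mean over episodes of S_i * L_i / max(P_i, L_i)."""
    total = 0.0
    for s, p, l in zip(successes, path_lengths, shortest_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Hypothetical episodes (not results from the talk).
S = [1, 1, 0, 1]      # success indicators
P = [10, 20, 15, 8]   # lengths of the paths actually taken
L = [10, 10, 5, 8]    # shortest path lengths

sr_value = success_rate(S)   # 3 of 4 episodes succeed
spl_value = spl(S, P, L)     # (1 + 0.5 + 0 + 1) / 4
```

The second episode succeeds but takes a path twice as long as the shortest one, so it contributes 1 to SR and only 0.5 to SPL. That gap is why SPL is never larger than SR.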
(SPL / SR) without STOP action (250 episodes)
Setting                      | Method | Kitchen     | Living room | Bedroom     | Bathroom    | Avg.
Seen scenes, Known objects   | Random | 17.9 / 33.1 | 12.1 / 30.5 | 16.8 / 51.2 | 24.5 / 34.6 | 17.8 / 37.3
Seen scenes, Known objects   | A3C    | 79.9 / 86.7 | 38.8 / 57.6 | 87.8 / 89.5 | 93.7 / 96.6 | 75.0 / 82.5
Seen scenes, Known objects   | Ours   | 83.5 / 88.2 | 46.4 / 64.4 | 90.6 / 92.7 | 93.6 / 96.5 | 78.5 / 85.5
Seen scenes, Novel objects   | Random | 10.0 / 23.1 | 8.0 / 18.5  | 17.3 / 35.2 | 11.2 / 32.2 | 11.6 / 27.2
Seen scenes, Novel objects   | A3C    | 20.2 / 38.8 | 24.2 / 46.5 | 23.5 / 35.8 | 50.2 / 74.6 | 29.5 / 48.9
Seen scenes, Novel objects   | Ours   | 22.9 / 53.6 | 39.5 / 66.5 | 26.1 / 38.9 | 50.5 / 78.6 | 34.7 / 59.4
Unseen scenes, Known objects | Random | 27.3 / 45.2 | 5.6 / 16.6  | 13.1 / 34.5 | 36.0 / 49.1 | 20.5 / 36.3
Unseen scenes, Known objects | A3C    | 39.5 / 56.2 | 12.0 / 31.8 | 22.5 / 49.2 | 47.4 / 60.2 | 30.3 / 49.3
Unseen scenes, Known objects | Ours   | 46.2 / 62.5 | 13.8 / 40.6 | 26.5 / 58.6 | 51.5 / 65.8 | 34.5 / 56.9
Unseen scenes, Novel objects | Random | 21.3 / 44.3 | 3.3 / 22.9  | 25.8 / 47.8 | 25.5 / 48.9 | 19.0 / 41.0
Unseen scenes, Novel objects | A3C    | 26.1 / 56.3 | 9.4 / 25.1  | 28.2 / 54.0 | 33.8 / 90.7 | 24.4 / 56.5
Unseen scenes, Novel objects | Ours   | 38.5 / 62.5 | 13.7 / 40.3 | 30.1 / 63.1 | 39.2 / 93.6 | 30.4 / 64.9
Table 2: Results without termination (stop) action. SPL / Success rate (%) is shown.
(SPL / SR) with STOP action
Setting                      | Method | Kitchen     | Living room | Bedroom     | Bathroom    | Avg.
Seen scenes, Known objects   | Random | 2.4 / 3.5   | 1.1 / 1.7   | 1.8 / 2.7   | 3.2 / 4.8   | 2.1 / 3.1
Seen scenes, Known objects   | A3C    | 38.5 / 51.0 | 9.7 / 15.1  | 6.8 / 11.5  | 69.1 / 81.0 | 31.1 / 39.6
Seen scenes, Known objects   | Ours   | 58.6 / 72.7 | 12.4 / 18.6 | 41.6 / 52.4 | 71.3 / 83.0 | 46.0 / 56.7
Seen scenes, Novel objects   | Random | 0.9 / 1.3   | 0.8 / 1.2   | 2.3 / 3.4   | 1.4 / 2.1   | 1.4 / 2.0
Seen scenes, Novel objects   | A3C    | 2.1 / 4.9   | 3.2 / 4.8   | 0.5 / 1.7   | 17.1 / 28.5 | 5.7 / 9.9
Seen scenes, Novel objects   | Ours   | 3.2 / 6.1   | 9.8 / 16.2  | 6.2 / 8.6   | 24.7 / 37.3 | 11.0 / 17.1
Unseen scenes, Known objects | Random | 4.1 / 5.9   | 0.9 / 1.3   | 1.6 / 2.4   | 4.2 / 6.2   | 2.7 / 3.9
Unseen scenes, Known objects | A3C    | 11.5 / 18.8 | 0.5 / 2.5   | 2.2 / 3.8   | 8.6 / 18.7  | 5.7 / 10.4
Unseen scenes, Known objects | Ours   | 12.7 / 20.5 | 1.0 / 4.0   | 4.5 / 11.0  | 8.7 / 21.1  | 6.7 / 13.4
Unseen scenes, Novel objects | Random | 2.0 / 2.8   | 0.6 / 1.0   | 2.0 / 2.8   | 2.7 / 3.9   | 1.8 / 2.6
Unseen scenes, Novel objects | A3C    | 2.2 / 7.5   | 2.5 / 4.4   | 1.3 / 4.4   | 3.4 / 9.3   | 2.4 / 5.9
Unseen scenes, Novel objects | Ours   | 3.3 / 12.7  | 2.8 / 5.3   | 2.0 / 6.3   | 4.1 / 12.2  | 3.1 / 8.5
Table 1: Results using termination (stop) action. SPL / Success rate (%) is shown.
Traditional: Training, then Inference • Learning to Adapt: Training, then Adaptation During Inference
Training flow (figure) • Initial Model Parameters • Initialize Model • Take k steps • Compute Self-Supervised Interaction Loss • Compute Adapted Parameters • Complete Navigation Episode • Compute Supervised Navigation Loss • Backprop to Update Initialization
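The loop in this flow (adapt with an interaction loss, score the adapted parameters with a navigation loss, backpropagate to the initialization) can be illustrated with a one-parameter toy problem. This is a sketch only, not the model from the talk: both losses are hypothetical quadratics, there is a single adaptation step rather than k environment steps, and the names `a`, `b`, `alpha`, and `beta` are invented for the example.

```python
def adapt(theta, a, alpha=0.5):
    """Inner step: one gradient step on the toy interaction loss 0.5 * (theta - a)^2."""
    interaction_grad = theta - a
    return theta - alpha * interaction_grad

def meta_train(theta, a, b, alpha=0.5, beta=1.0, iters=200):
    """Outer loop: update the initialization so the ADAPTED parameters
    minimize the toy navigation loss 0.5 * (theta_adapted - b)^2."""
    for _ in range(iters):
        theta_adapted = adapt(theta, a, alpha)
        nav_grad = theta_adapted - b            # d(nav loss) / d(theta_adapted)
        # Chain rule through the inner step: d(theta_adapted)/d(theta) = 1 - alpha.
        theta -= beta * (1.0 - alpha) * nav_grad
    return theta

a, b = 2.0, 5.0                        # hypothetical optima of the two toy losses
theta_init = meta_train(0.0, a, b)     # learned initialization
theta_test = adapt(theta_init, a)      # adaptation step, as done at inference
```

The learned initialization is not the minimizer of either loss on its own. It is the point from which one interaction-loss step lands on the navigation optimum, and this is the property the outer update trains for.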
Learning to Learn how to Learn (figure labels: Inference; Navigation Gradient (supervised); Learned Interaction Gradient (self-supervised))
Training flow with a learned loss (figure) • Initial Model Parameters • Initialize Model • Take k steps • Compute Self-Supervised Interaction Loss via Neural Network • Compute Adapted Parameters • Complete Navigation Episode • Compute Supervised Navigation Loss • Other figure labels: Loss Parameters
Model architecture (figure; tensor dimensions not recoverable) • Legend: Forward Pass; Navigation-Gradient (Training only); Interaction-Gradient (Training and Inference) • Current observation image: ResNet18 (Frozen), then Pointwise Conv to an image feature • Target Object Class (e.g. Laptop): GloVe Embedding, FC, Tile • Pointwise Conv over the combined features, then LSTM unrolled over time steps 0, 1, 2, … • Policy over actions (e.g. Turn Left, Look Down, Move Forward) • Concatenated policy and hidden states feed a 1D Temporal Conv
Results (bar charts of SPL and Success for Baseline, Handcrafted Loss, and Learned Loss) • Training Scenes: 80 • Validation Scenes: 20 • Test Scenes: 20 • Equal Split of Kitchen, Living Room, Bedroom, Bathroom
Goal: Navigate to Book
Thank you !!!!!