3D Attention-Driven Depth Acquisition for Object Identification
Kai Xu, Yifei Shi, Lintao Zheng, Junyu Zhang, Min Liu, Hui Huang, Hao Su, Daniel Cohen-Or and Baoquan Chen
National University of Defense Technology · Shandong University · Shenzhen University · SIAT · Stanford University · Tel-Aviv University
Background & motivation
• Robotic indoor scene modeling
• Perception of objects
Background & motivation
• Indoor environment acquisition and modeling
• Dense reconstruction [Nießner et al. 2013]
• Object extraction [Xu et al. 2015]
Background & motivation
• What are these objects?
Active object recognition
Problem setting
• A robot actively acquires new observations to gradually increase the confidence of object recognition
• Two key components:
• Object classification: estimate the object class based on the observations acquired so far
• View planning: predict the Next-Best-View that maximizes the information gain
The main challenge
• Observation is partial and progressive
• Shape description/matching with partial data is hard
• Observations come from varying views
The main challenge
• Observation is partial and progressive
• View planning: how can you know which view is better without knowing its observation? (Figure: one observed view vs. several unobserved views)
The main challenge
• Real indoor scenes are often cluttered
• Clutter degrades recognition accuracy
• Clutter invalidates the offline-learned viewing policy
Related work
Related work
• Online scene analysis and modeling
• SemanticPaint [Valentin et al. 2015]
• Plane/object extraction [Zhang et al. 2014]
Related work
• Active reconstruction and recognition
• Next-best-view for reconstruction [Wu et al. 2014]
• Next-best-view for recognition [Wu et al. 2015]
Method
The general framework
The general framework
(Diagram: a perception–action loop driven by the recognition goal — view planning produces an action, the robot observes, recognition updates the belief, and the belief feeds back into view planning)
An attentional formulation
"Humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene"
–– Ronald Rensink
• Handwriting recognition [Mnih et al. 2014]
• Image caption generation [Xu et al. 2015]
Recurrent Attention Model
• Recurrent Neural Networks (RNN) aggregate information over time:
  h_t = f(W_hh · h_{t−1} + W_xh · x_t),  y_t = W_hy · h_t
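The RNN update above can be sketched in a few lines of NumPy. This is a minimal vanilla-RNN step, assuming the standard weight names W_hh, W_xh, W_hy; the toy dimensions and random weights below are illustrative only, not the network used in the paper:

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh, W_hy):
    """One step of a vanilla RNN: the hidden state h aggregates
    information from all inputs seen so far; y is the per-step output."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)  # update hidden state
    y = W_hy @ h                           # read out from hidden state
    return h, y

# Toy dimensions (hypothetical, for illustration only).
rng = np.random.default_rng(0)
H, X, Y = 4, 3, 2
W_hh = rng.normal(size=(H, H)) * 0.1
W_xh = rng.normal(size=(H, X)) * 0.1
W_hy = rng.normal(size=(Y, H)) * 0.1

h = np.zeros(H)
for t in range(5):            # feed a sequence of 5 observations
    x = rng.normal(size=X)
    h, y = rnn_step(h, x, W_hh, W_xh, W_hy)

print(h.shape, y.shape)
```

Because h is carried from step to step, the final hidden state depends on the whole observation sequence — this is the "aggregate information" role the slide assigns to the RNN.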
View-based observation
(Diagram: starting from an initial view v_0 with depth image d^(0), at each step t the robot moves to view v_t and acquires a new depth image d^(t))
3D Recurrent Attention Model
• Feature extraction: each depth image d^(t) is encoded into a feature vector
• View aggregation and classification: the first recurrent layer (h_1) aggregates features over views and classifies the object
• NBV emission: the second recurrent layer (h_2) emits the next-best view v_{t+1}
• Starting from an initial view v_0 with depth d^(0), the loop repeats: view selection → observation → feature extraction → aggregation → classification → NBV emission
3D Recurrent Attention Model
• Feature extraction is implemented with a Multi-View CNN [Su et al. 2015]: a shared first-stage CNN (CNN_1) encodes each depth image, view pooling (element-wise max-pooling across views) merges the per-view features, and a second CNN (CNN_2) produces the aggregated descriptor
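The view-pooling layer of Multi-View CNN reduces any number of per-view descriptors to a single descriptor by an element-wise max. A minimal sketch (the feature values below are made up for illustration):

```python
import numpy as np

def view_pool(per_view_features):
    """Element-wise max over views, as in the view-pooling layer of
    Multi-View CNN [Su et al. 2015]. The pooled descriptor has the
    same dimension regardless of how many views have been seen."""
    return np.max(per_view_features, axis=0)

# Hypothetical per-view CNN_1 descriptors: 3 views, 8-D features.
feats = np.array([[0.1, 0.9, 0.2, 0.0, 0.5, 0.3, 0.7, 0.1],
                  [0.4, 0.2, 0.8, 0.1, 0.6, 0.2, 0.1, 0.9],
                  [0.3, 0.5, 0.1, 0.7, 0.2, 0.8, 0.4, 0.2]])
pooled = view_pool(feats)
print(pooled)  # → [0.4 0.9 0.8 0.7 0.6 0.8 0.7 0.9]
```

The max is taken per feature dimension, so the result is invariant to the order of the views — convenient when views arrive progressively.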
Network training
• The CNN and recurrent layers are trained with back-propagation
• The depth-rendering step between the emitted NBV (v_t, d^(t)) and the next observation is non-differentiable, so the view-planning part is trained with reinforcement learning
Reinforcement learning
• Agent: performs depth acquisition (action) and decides whether to stop
• Environment: returns the new state and a reward measuring how good the acquired depth is
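The agent–environment loop above can be trained with the REINFORCE policy-gradient rule, which needs only sampled rewards and therefore sidesteps the non-differentiable rendering step. A toy sketch on a made-up 4-view bandit (all names, rewards, and hyperparameters are hypothetical, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical setup: pick one of 4 candidate views; view 2 happens
# to yield the highest recognition reward. REINFORCE adjusts the
# policy from sampled rewards alone — no gradient flows through
# the environment (here, the rendering/acquisition step).
true_rewards = np.array([0.1, 0.2, 1.0, 0.3])
logits = np.zeros(4)          # policy parameters
baseline, lr = 0.0, 0.1

for _ in range(2000):
    p = softmax(logits)
    a = rng.choice(4, p=p)                 # sample a view from the policy
    r = true_rewards[a]                    # reward from the environment
    grad_logp = -p
    grad_logp[a] += 1.0                    # d log pi(a) / d logits
    logits += lr * (r - baseline) * grad_logp  # REINFORCE update
    baseline += 0.05 * (r - baseline)          # running-mean baseline

p_final = softmax(logits)
print(int(np.argmax(p_final)))
```

The running-mean baseline reduces gradient variance; after training, the policy concentrates on the highest-reward view.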
Reward
  r_t = A_t(v_t, c) + I_t(v_t, v_{t−1}) − C_t
        prediction accuracy + information gain − movement cost
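A hedged sketch of how such a three-term reward might be computed, using the drop in class-belief entropy for the information gain and a simple view-to-view distance for the movement cost. The weight lam, the helper names, and all input values are illustrative assumptions, not the paper's exact formula:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a class-probability vector."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log(p + 1e-12))

def reward(p_prev, p_curr, true_class, v_prev, v_curr, lam=0.1):
    """Sketch of r_t = accuracy + information gain - movement cost.
    All three terms are stand-ins chosen for illustration."""
    acc = p_curr[true_class]                 # prediction accuracy term
    gain = entropy(p_prev) - entropy(p_curr) # information gain: entropy drop
    cost = lam * np.linalg.norm(np.asarray(v_curr) - np.asarray(v_prev))
    return acc + gain - cost                 # movement cost is subtracted

p_prev = [0.25, 0.25, 0.25, 0.25]  # uniform belief before the new view
p_curr = [0.70, 0.10, 0.10, 0.10]  # sharper belief after observing it
r = reward(p_prev, p_curr, true_class=0, v_prev=(0.0, 0.0), v_curr=(1.0, 0.0))
print(round(r, 3))  # → 1.046
```

A view that sharpens the belief toward the correct class earns a large reward, while long camera moves are penalized — matching the three labels on the slide.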
Part-level attention
• Occlusion can hide the informative parts
• How to distinguish these two chairs?
Attention extraction
• Part-level attention is extracted from the mid-level kernels of the Convolutional Neural Network
Attention extraction
• One wing vs. two wings
Results and evaluation
Database
• ShapeNet: 57,452 models, 57 categories, 52 sampled views, rendered models
• ModelNet40: 12,311 models, 40 categories, 260 sampled views, rendered with jittering
Timing
• ShapeNet: MV-RNN training 49 hr., testing 0.1 sec.
• ModelNet40: MV-RNN training 22 hr., testing 0.1 sec.
Visualization of attentions
• Part-level attention
• View sequences
NBV estimation
(Plot: classification accuracy over the 40 classes)
NBV estimation under occlusion
(Plot: classification accuracy under occlusion)
Results on real scenes
Limitations
• Limited to recognizable (trained) object categories
• No contextual information is used
Future work: Multi-modal recognition
• What is this?
• Image database + shape database
Future work: Multi-robot scene reconstruction & understanding
• AscTec Pelican, PR2, Turtlebot
Future work: Multi-robot attention model
• Attention based on a shared internal representation?
Thank you — Q & A
More details: kevinkaixu.net & yifeishi.net