Depth Sensing and Deep Learning: Grasping and Segmenting 3D Objects from Real Depth Images using Synthetic Data Mike D e Dan aniel elczuk uk, Jeffrey Mahler, Matthew Matl, Saurabh Gupta, Andrew Lee, Andrew Li, Vishal Satish, Bill DeRose, Stephen McKinley, Ken Goldberg
Black Grouse
Imagenet: 14M labeled images, 20K categories
Classical Computer Vision Pipeline. CV experts 1. Select / develop features: SURF, HoG, SIFT, RIFT, … 2. Add Machine Learning for multi-class recognition and train classifier Feat eatur ure Detection De on, Extracti Ex tion: Cl Clas assificat ation SIFT, T, Ho HoG.. ... Recogn ognition on Classical CV feature definition is domain- specific and time-consuming Slide from Boris Ginzburg
Imagenet Classification 2012 Slide from Boris Ginzburg
Imagenet 2012 Leaderboard http://www.image-net.org/challenges/LSVRC/2012/results.html N Error-5 Algorithm Team Authors 1 0.153 Deep Conv. Neural Network Univ. of Toronto Krizhevsky et al 2 0.262 Features + Fisher Vectors + ISI Gunji et al Linear classifier 3 0.270 Features + FV + SVM OXFORD_VGG Simonyan et al 4 0.271 SIFT + FV + PQ + SVM XRCE/INRIA Perronin et al 5 0.300 Color desc. + SVM Univ. of van de Sande et al Amsterdam Slide from Boris Ginzburg
Imagenet 2013 Leaderboard http://www.image-net.org/challenges/LSVRC/2013/results.php N Error-5 Algorithm Team Authors 1 0.117 Deep Convolutional Neural Clarifi Zeiler Network 2 0.129 Deep Convolutional Neural Nat.Univ Min LIN Networks Singapore 3 0.135 Deep Convolutional Neural NYU Zeiler Networks Fergus 4 0.135 Deep Convolutional Neural Andrew Howard Networks 5 0.137 Deep Convolutional Neural Overfeat Pierre Sermanet et al Networks NYU Slide from Boris Ginzburg
Imagenet Classification 2013 Slide from Boris Ginzburg
Today’s Lecture • (Brief) Intro to Convolutional Neural Networks (CNNs) • Learning Instance Specific Grasping • Learning Grasp Quality CNNs • Learning Instance Segmentation CNNs
Today’s Lecture • (B (Bri rief) I ) Intr tro t to Convolutional l Neura ral l Netw tworks (C (CNNs) • Learning Instance Specific Grasping • Learning Grasp Quality CNNs • Learning Instance Segmentation CNNs
Why is Convolution good? • Shared Parameters • Only store filter parameters: 5*5*3 (for kernel) + 1 (for bias) for example • Local Connectivity • Each neuron only corresponds to a local patch of the image (not the whole thing)
Idea: Learn Features
Today’s Lecture • (Brief) Intro to Convolutional Neural Networks (CNNs) • Lear arning ng I Ins nstan ance-Spec pecific G Grasping • Learning Grasp Quality CNNs • Learning Instance Segmentation CNNs
Grand Challenge The ability to “grasp sp m millio ions of of differe rent si sized an and sh shape ped o obj bjects… would have significant impact on deployment of robots in fact ctorie ries, in warehouses, and in homes es.” - Rod Brooks
33
34
Universal Picking: diversely shaped and sized objects
Motivation: Instance-Specific Grasping Desired Object: from Dasani Drops
Grasping under Uncertainty
Physics UNCERTAINTY Perception Control
First Wave: R ( x , u ) ∈ {0, 1} Analytic Methods u *=π( x ) = argmax R ( x , u ) KRUGER ET AL., 2012 SHIMOGA, 1996 NGUYEN, 1988 POKORNY ET AL., BICCHI & KUMAR, REAULEAUX, 1876 FERRARI & CANNY, 2013 HANAFUSA & ASADA, 1977 2001 1992 HAAS-HEGER ET AL., LI & SASTRY, 1988 ROA & SUAREZ, BICCHI, 1994 2006 2006
Data-driven Grasping under Uncertainty Uncertainty in object pose: green grasps are robust to pose errors, red ones are not. Uncertainty in object identity: aggregate pre-computed grasp information from multiple objects in database for robustness Computations performed over database to recognition errors. of ~200 household objects. Ciocarlie, Pantofaru, Hsiao, Bradski, Brook, and Dreyfuss. “A Side of Data with My Robot: Three Datasets for Mobile Manipulation in Human Environments.”, R&A Magazine, 2011
Image-Based Grasp Planning Filter Bank Regressor Input Image Proposals Grasp Col olor Imag ages Point C Poi Clouds [Fishchinger & Vincze, 2012] [Saxena et al., 2008] [Saxena et al., 2008] [Herzog et al., 2014] [Stark et al., 2008] [Detry et al., 2009] [Oberlin & Tellex, 2015] [Bohg & Kragic, 2010] [Hubner & Kragic, 2010] [ten Pas et al., 2015] [Le et al., 2010] [Boularis et al., 2011] Image sources: Saxena et al., 2008, Pinto & Gutpa, 2016
Deep Grasp Planning Convolutional Neural Network Input Image Grasp Col olor Imag ages Poi Point C Clouds [Lenz et al., 2015] [Kappler et al., 2015] [Redmon & Angelova, 2016] [Gualtieri et al., 2016] [Pinto et al., 2016] [Johns et al, 2016] [Levine et al., 2017] [Viereck et al., 2017] Image source: Pinto & Gutpa, 2016
Data Sources Human labeling Self-supervision 1035 images: 80% success ~1 robot year: 80-90% success Lenz et al. “Deep Learning for Detecting Robotic Grasps.” S. Levine et al.. “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection.” IJRR 2015. IJRR 2017.
Dexterity Network (Dex- Net)
Synthetic Point Clouds Synthetic LIDAR Images
Dex-Net 6.7 million examples Positive Negative Negative
Grasp Quality CNN
GQ-CNN for Bins Iter 0 Iter 1 Plan
Mahler, Jeffrey, Matthew Matl, Vishal Satish, Michael Danielczuk, Bill DeRose, Stephen McKinley, and Ken Goldberg. "Learning ambidextrous robot grasping policies." Science Robotics 4, no. 26 (2019): eaau4984.
Motivation: Instance-Specific Grasping Desired Object: from Dasani Drops
One Approach Segm Segment all objects in the scene , Classify all segments, Gr Grasp segment that matches the target!
Slide from: Kaiming He
Mask R-CNN
Mask R-CNN • State-of-the-art instance e segm egmentation on network • Requires mas assive hand hand- label eled ed d dataset ets for training • Does not generalize to unseen classes He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).
Automated Dataset Generation • WISDOM-Sim: 50,000 images with 320,000 object labels in just 3. 3.5 h hour urs • 80/20 train/test split for both images and objects • 1600 unique objects, 10,000 unique heaps
Sampling Distributions
Data Augmentation
Modal Segmentation Masks
Amodal Segmentation Masks
SD Mask R-CNN • State-of-the-art performance for objec ect i ins nstanc nce s seg egment entation on real depth images • No No ha hand nd-labelin ling required, trained solely on synt nthet hetic d dep epth i h images es • Gener Generalizes to unseen objects • Outperforms point cloud clustering baselines by 15% in average precision and 20% in average recall
COCO Metrics
COCO Metrics Prec eciso son: TP / (TP + FP) Recal all: TP / (TP + FN)
Application: Instance-Specific Grasping Target: Dasani Drops
Thank You! Pro roje ject W t Website, Code: e: Sup upplementar ary M Mat ater erial, https://github.com/ an and D Dat atas asets: BerkeleyAutomation/ https:/ ://b /bit.l .ly/2 /2letCuE sd-maskrcnn
Recommend
More recommend