Real-Time Human Pose Recognition in Parts from Single Depth Images
Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake
CVPR 2011
PRESENTER: AHSAN ABDULLAH
PROBLEM
APPROACH
• Partitioning the body into parts helps localize the joints
[Figure: body-part labels such as hand, neck, shoulder, and elbow, with left/right variants]
Shotton et al. CVPR 2011
PIPELINE
Design goals
• Efficiency
• Robustness
Steps
1. capture depth image & remove background
2. infer body parts per pixel
3. cluster pixels to hypothesize body joint positions
4. fit model & track skeleton
BODY PART CLASSIFICATION
• Compute $P(c_i \mid w_i)$
  - pixel $i = (x, y)$
  - body part $c_i$
  - image window $w_i$ (image windows move with the classifier)
• Discriminative approach: learn the classifier $P(c_i \mid w_i)$ from training data
LEARNING DATA
• synthetic (train & test)
• real (test)
LEARNING – DATA SYNTHESIS
1. Record MoCap: 500k frames distilled to 100k poses
2. Retarget to several models
3. Render (depth, body parts) pairs
FEATURE SET
• Depth comparisons
  - very fast to compute
• Feature response at image coordinate $\mathbf{x}$ of the input depth image $I$, comparing the depth at $\mathbf{x}$ with the depth at an offset $\Delta$:
  $f(I, \mathbf{x}; \Delta) = d_I(\mathbf{x}) - d_I(\mathbf{x} + \Delta)$
• Offset $\Delta = \mathbf{v} / d_I(\mathbf{x})$ scales inversely with depth
• Background pixels: $d$ = large constant
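A minimal sketch of the depth-comparison feature, in plain Python. It assumes the depth image is a list of rows, that background pixels are stored as 0 in the input, and that `BACKGROUND_DEPTH` stands in for the slide's "large constant"; the names `depth_at`, `depth_feature`, and `v` are illustrative, not taken from the paper.

```python
BACKGROUND_DEPTH = 1e6  # assumed value for the "large constant"


def depth_at(depth, x, y):
    """Depth at pixel (x, y); background (stored as 0) and off-image
    pixels both return the large constant."""
    height = len(depth)
    width = len(depth[0])
    if 0 <= y < height and 0 <= x < width and depth[y][x] > 0:
        return depth[y][x]
    return BACKGROUND_DEPTH


def depth_feature(depth, x, y, v):
    """f(I, x; Delta) = d_I(x) - d_I(x + Delta), with Delta = v / d_I(x).

    Dividing the offset v by the depth at x makes the pixel offset
    shrink for far-away bodies and grow for near ones.
    """
    d = depth_at(depth, x, y)
    dx = int(round(v[0] / d))
    dy = int(round(v[1] / d))
    return d - depth_at(depth, x + dx, y + dy)


# Two rows of a toy depth image: body at depth 2.0, background in column 2.
toy_depth = [[2.0, 2.0, 0.0],
             [2.0, 2.0, 0.0]]
inside = depth_feature(toy_depth, 0, 0, (2.0, 0.0))  # probe lands on the body
edge = depth_feature(toy_depth, 1, 0, (2.0, 0.0))    # probe lands on background
```

A probe that stays on the body gives a response near zero, while a probe that falls onto the background gives a large negative response, which is what lets a threshold on the feature pick out silhouette edges.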
DECISION FORESTS Aggregation of decision trees
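TRAINING DECISION TREES [Breiman et al. 84]
• Node $n$ holds the training pixels $Q_n = \{(I, \mathbf{x})\}$ that reach it (the root sees all pixels), with distribution $P_n(c)$ over body part $c$
• Split test: $f(I, \mathbf{x}; \Delta_n) > \theta_n$
  - no: left child $l$ with distribution $P_l(c)$
  - yes: right child $r$ with distribution $P_r(c)$
• A good split reduces entropy
• Take $(\Delta, \theta)$ that maximises information gain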
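The split selection can be sketched as an exhaustive search over candidate (Δ, θ) pairs. This is a simplified illustration using Shannon entropy over label lists; it assumes the feature responses for each candidate offset have already been computed, and `best_split` and its arguments are hypothetical names.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of body-part labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(labels, responses, theta):
    """Entropy reduction from splitting on 'response > theta'."""
    left = [c for c, f in zip(labels, responses) if not f > theta]   # "no" branch
    right = [c for c, f in zip(labels, responses) if f > theta]      # "yes" branch
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def best_split(labels, responses_by_offset, thresholds):
    """Return (gain, offset, theta) for the candidate with the highest gain.

    responses_by_offset maps each candidate offset to the list of feature
    responses f(I, x; offset), one per training pixel.
    """
    best = None
    for offset, responses in responses_by_offset.items():
        for theta in thresholds:
            gain = information_gain(labels, responses, theta)
            if best is None or gain > best[0]:
                best = (gain, offset, theta)
    return best


labels = ["L", "L", "R", "R"]
candidates = {
    "offset_a": [0.1, 0.2, 0.9, 0.8],  # separates L from R at theta = 0.5
    "offset_b": [0.5, 0.9, 0.4, 0.8],  # mixes L and R on both sides
}
gain, offset, theta = best_split(labels, candidates, [0.5])
```

In practice the candidates would be sampled at random for each node rather than enumerated, and the same search is repeated recursively on the left and right subsets.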
DECISION TREE CLASSIFICATION
• Toy example: distinguish left (L) and right (R) sides of the body
• Image window centred at $\mathbf{x}$
• The root node tests $f(I, \mathbf{x}; \Delta_1) > \theta_1$ and a child node tests $f(I, \mathbf{x}; \Delta_2) > \theta_2$; each test branches no / yes
• Each leaf stores a distribution $P(c)$ over L and R
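The toy example can be sketched as a walk from the root to a leaf. The tree layout, thresholds, and leaf distributions below are made up for illustration, and `toy_feature` is a one-dimensional stand-in for the depth-comparison feature, not the real one.

```python
def classify(node, feature_fn, image, x):
    """Descend the tree until a leaf is reached and return its P(c)."""
    while "posterior" not in node:
        response = feature_fn(image, x, node["delta"])
        node = node["yes"] if response > node["theta"] else node["no"]
    return node["posterior"]


def toy_feature(depth_row, x, delta):
    """1D stand-in for f(I, x; Delta): depth at x minus depth at x + delta.
    Probes that leave the row read a large constant, like background."""
    j = x + delta
    neighbour = depth_row[j] if 0 <= j < len(depth_row) else 1e6
    return depth_row[x] - neighbour


# Hypothetical two-level tree: internal nodes hold (delta, theta),
# leaves hold a distribution over the L and R sides of the body.
toy_tree = {
    "delta": 1, "theta": 0.0,
    "no": {"posterior": {"L": 0.2, "R": 0.8}},
    "yes": {
        "delta": -1, "theta": 0.0,
        "no": {"posterior": {"L": 0.5, "R": 0.5}},
        "yes": {"posterior": {"L": 0.9, "R": 0.1}},
    },
}

leaf = classify(toy_tree, toy_feature, [1.0, 3.0, 2.0], 1)
```

Each pixel costs only one feature evaluation per tree level, so classification time grows with tree depth rather than with the number of leaves.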
DECISION FOREST CLASSIFIER [Amit & Geman 97] [Breiman 01] [Geurts et al. 06]
• Each input $(I, \mathbf{x})$ is passed through tree 1 ... tree $T$, giving posteriors $P_1(c), \ldots, P_T(c)$
• Each tree is trained on a different random subset of images
  - “bagging” helps avoid over-fitting
• Average tree posteriors:
  $P(c \mid I, \mathbf{x}) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid I, \mathbf{x})$
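Averaging the tree posteriors is a short computation. The sketch below assumes each tree has already returned its leaf distribution as a dict from body-part label to probability; `forest_posterior` is an illustrative name.

```python
def forest_posterior(tree_posteriors):
    """Average the per-tree distributions P_t(c | I, x) over T trees.

    A class missing from one tree's leaf counts as probability 0 there.
    """
    num_trees = len(tree_posteriors)
    classes = set().union(*tree_posteriors)
    return {
        c: sum(p.get(c, 0.0) for p in tree_posteriors) / num_trees
        for c in classes
    }


per_tree = [
    {"L": 0.9, "R": 0.1},
    {"L": 0.5, "R": 0.5},
    {"L": 0.7, "R": 0.3},
]
averaged = forest_posterior(per_tree)
most_likely = max(averaged, key=averaged.get)
```

Because every tree votes with a full distribution rather than a hard label, one uncertain tree (the 0.5 / 0.5 leaf here) softens the result without overturning it.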
NUMBER OF TREES
[Figure: average per-class accuracy (axis 40% to 55%) against number of trees (1 to 6), alongside ground truth and inferred body parts (most likely) for 1, 3, and 6 trees]
TREE DEPTH
[Figure: average per-class accuracy (axes 30% to 65%) against depth of trees, in two panels: synthetic test data and real test data]
Body parts to joint hypotheses
• Define a 3D world space density over the 3D coordinate, summing over pixel index i
  - each term is centred on the 3D coordinate of the i-th pixel, with a bandwidth
  - the pixel weight combines the inferred probability and the depth at the i-th pixel
• Mean shift for mode detection
• Pipeline step 3: hypothesize body joints
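The formula itself does not survive in the slide text. The sketch below assumes the form given in the paper: a Gaussian-kernel density $f_c(\hat{\mathbf{x}}) \propto \sum_i w_{ic} \exp(-\|(\hat{\mathbf{x}} - \hat{\mathbf{x}}_i)/b_c\|^2)$ with pixel weight $w_{ic} = P(c \mid I, \mathbf{x}_i) \cdot d_I(\mathbf{x}_i)^2$. Function names, the starting point, and the stopping rule are illustrative choices.

```python
import math


def pixel_weight(prob, depth):
    """Assumed weight: inferred body-part probability times squared depth,
    so that far-away pixels (which cover more surface) count for more."""
    return prob * depth ** 2


def mean_shift_mode(points, weights, start, bandwidth, iters=50, tol=1e-6):
    """Climb to a mode of the weighted Gaussian-kernel density.

    points: 3D coordinates of the pixels, weights: one weight per pixel,
    start: initial 3D guess, bandwidth: kernel width b.
    """
    x = list(start)
    for _ in range(iters):
        numerator = [0.0, 0.0, 0.0]
        denominator = 0.0
        for p, w in zip(points, weights):
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            k = w * math.exp(-sq_dist / bandwidth ** 2)
            denominator += k
            for j in range(3):
                numerator[j] += k * p[j]
        if denominator == 0.0:
            break  # no pixel is close enough to pull the estimate
        new_x = [n / denominator for n in numerator]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        x = new_x
        if shift < tol:
            break
    return x


# Three pixels clustered around (0, 0, 2) plus one distant outlier.
points = [(0.0, 0.0, 2.0), (0.02, 0.0, 2.0), (-0.02, 0.0, 2.0), (1.0, 1.0, 3.0)]
weights = [pixel_weight(0.9, p[2]) for p in points]
mode = mean_shift_mode(points, weights, start=(0.01, 0.0, 2.0), bandwidth=0.06)
```

The outlier sits many bandwidths away from the starting point, so its kernel value is effectively zero and the mode settles on the cluster. This is why a mode-seeking estimate is less sensitive to misclassified pixels than a plain weighted mean.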
[Figure: input depth, inferred body parts, and inferred joint positions (front view, side view, top view)]
No tracking or smoothing
JOINT PREDICTION ACCURACY
[Figure: average precision (0.0 to 1.0) per joint, from Center Head to Right Foot, with mean AP]
JOINT PREDICTION ACCURACY
[Figure: average precision (0.0 to 1.0) per joint, comparing joint prediction from inferred body parts with joint prediction from ground truth body parts. Joints: Center Head, Center Neck, Left Shoulder, Right Shoulder, Left Elbow, Right Elbow, Left Wrist, Right Wrist, Left Hand, Right Hand, Left Knee, Right Knee, Left Ankle, Right Ankle, Left Foot, Right Foot; plus mean AP]
ANALYSIS
• No temporal information
  - frame-by-frame
• Very fast
  - simple depth image feature
  - parallel decision forest classifier
KINECT SYSTEM
Uses…
• 3D joint hypotheses
• kinematic constraints
• temporal coherence
… to give
• full skeleton
• higher accuracy
• invisible joints
• multi-player
Pipeline step 4: track skeleton
SUMMARY
• Frame-by-frame gives robustness
• Body parts representation for efficiency
• Fast, simple machine learning
• Significant engineering to scale to a massive, varied training data set
QUESTIONS