

  1. Gesture Recognition: Hand Pose Estimation Adrian Spurr Ubiquitous Computing Seminar FS2014 27.05.2014 1

  2. What is hand pose estimation? Input → Computer-usable form 2

  3. Applications: Augmented Reality, Gaming, PC Control, Robot Control 3

  4. Data glove • Utilizes optical flex sensors to measure finger bending. • Advantages: high accuracy, can provide haptic feedback. • Disadvantages: invasive, long calibration time, unnatural feeling, heavily instrumented. 4

  5. Thanks to cheap depth cameras... (figure: depth camera vs. RGB camera) 5

  6. ...and increase in GPU Power 6

  7. Problems occurring • Segmentation • Noisy data 7

  8. Problems occurring • Self-occlusion and viewpoint change: 8

  9. Problems occurring • 27 degrees of freedom per hand -> 280 trillion hand poses: 9

  10. Problems occurring • Performance: for practical use, must be real time. 10

  11. Principle of operation (diagram: input → Algorithm → output) 11

  12. Existing schools of thought • Model-based: keeps internal track of the current pose; updates the pose according to the current pose and observation. • Discriminative: maps directly from observation to pose; “learns” from training data and applies that knowledge to unseen data. 12

  13. Short intro to Random Forests • Ensemble learning • Classification and regression • Consists of decision trees (figure: a decision tree) 13

  14. Short intro to Random Forests Data in feature space Features = «Properties» of data 14


  19. Building a classification tree 19


  22. Random feature sampling • Choose the feature 𝑈_𝑘 which splits the data with maximum information gain. 22
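The split-selection step on this slide can be sketched as follows. This is a generic illustration using entropy-based information gain over raw feature thresholds, not the seminar's actual feature set:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, left, right):
    """Entropy reduction achieved by splitting `labels` into `left`/`right`."""
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

def best_split(samples):
    """Pick the (feature index, threshold) with maximum information gain.
    `samples` is a list of (feature_vector, label) pairs."""
    labels = [y for _, y in samples]
    best = (None, None, -1.0)
    n_features = len(samples[0][0])
    for k in range(n_features):  # in a random forest, k would range over a random feature subset
        for x, _ in samples:
            t = x[k]
            left = [y for f, y in samples if f[k] <= t]
            right = [y for f, y in samples if f[k] > t]
            if not left or not right:
                continue
            gain = information_gain(labels, left, right)
            if gain > best[2]:
                best = (k, t, gain)
    return best
```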

  23. Bagging 23
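Bagging (bootstrap aggregating) means each tree in the ensemble is trained on a bootstrap sample of the training set; a minimal sketch:

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) samples with replacement (one bootstrap sample)."""
    return [rng.choice(data) for _ in range(len(data))]

def bag(data, n_trees, seed=0):
    """Produce one bootstrap sample per tree in the ensemble."""
    rng = random.Random(seed)
    return [bootstrap_sample(data, rng) for _ in range(n_trees)]
```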

  24. Prediction 24
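At prediction time the forest averages the per-tree class distributions (equivalently, the trees vote); a sketch assuming each tree is a callable returning a class-probability dict, which is an illustrative interface rather than any particular library's API:

```python
def forest_predict(trees, x):
    """Average the class-probability dicts returned by each tree for input x."""
    totals = {}
    for tree in trees:
        for label, p in tree(x).items():
            totals[label] = totals.get(label, 0.0) + p
    n = len(trees)
    probs = {label: p / n for label, p in totals.items()}
    # Predicted class is the mode of the averaged distribution.
    return max(probs, key=probs.get), probs
```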

  25. RF for pose estimation • Why Random Forests? Robust, fast, thoroughly studied. • How should we use them? Must choose what to split on; what should the labels be? 25

  26. Advanced body pose recognition [Shotton2011] 26

  27. Advanced body pose recognition • Discriminative approach. • Used in the Kinect. • First paper to use synthetic training data. • Basis for many future papers. [Shotton2011] 27

  28. Creating synthetic data [Shotton2011] 28

  29. Split function: d(x) = depth at position x 29 [Shotton2011]
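In [Shotton2011] the split function compares the depth at two offsets around a pixel, with the offsets scaled by 1/d(x) so that the feature is invariant to the distance of the subject from the camera. A sketch; the large background constant and the nested-list image format are assumptions of this illustration:

```python
BACKGROUND = 1e6  # large depth returned for out-of-silhouette pixels (illustrative choice)

def depth(image, x):
    """Depth at pixel x = (row, col); out-of-bounds pixels get a large value."""
    r, c = x
    if 0 <= r < len(image) and 0 <= c < len(image[0]):
        return image[r][c]
    return BACKGROUND

def split_feature(image, x, u, v):
    """f(I, x) = d(x + u / d(x)) - d(x + v / d(x)): depth-normalized offset difference."""
    d = depth(image, x)
    xu = (x[0] + int(u[0] / d), x[1] + int(u[1] / d))
    xv = (x[0] + int(v[0] / d), x[1] + int(v[1] / d))
    return depth(image, xu) - depth(image, xv)
```

A decision tree thresholds this scalar at each node; because the offsets shrink for far-away (large-depth) pixels, the same (u, v) probes comparable physical locations at any distance.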

  30. Joint prediction [Shotton2011] 30

  31. Per-class accuracy vs. tree depth • Accuracy increases as the depth of the tree increases. • Overfitting occurs for 15k training images. • More training images lead to higher accuracy and less overfitting. [Shotton2011] 31

  32. Negative Results • Failure due to self-occlusion: • Failure due to unseen pose: [Shotton2011] 32

  34. Unresolved issues • To capture all possible poses, need to generate huge amount of training data. • Training RF on big training set means more trees and deeper trees. • Big amount of memory needed. • Solution: Divide training data into sub-sets and solve classification for each set separately. 34

  35. Multi-layered Random Forest • Cluster training data based on similarity. • Train an RF for each cluster on that cluster's data. • First layer assigns the input to the proper cluster. • Second layer gives the final hand part label distribution. [Keskin2012] 35

  36. Clustering training data • Cluster based on weighted differences. • Penalize differences of viewpoint and finger positions. • Label each cluster; labels refer to hand shape. • Train a Random Forest on the clusters. 36

  37. Experts • Use hand part labels. • Train a separate Random Forest for each cluster. • Each such forest is called an expert. 37

  38. Two prediction methods • Global Expert Network: feed the input to the first-layer Random Forest and average the votes to get a single hand shape label; then feed the input to the corresponding expert to get the hand part distribution. 38

  39. Two prediction methods • Local Expert Network: feed the input to the first-layer Random Forest to get a hand shape label for each pixel; then feed each pixel to its corresponding expert to get the hand part distribution. 39
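The two routing strategies above differ only in where the shape decision is made; a schematic sketch in which `first_layer` and the `experts` dict are hypothetical callables standing in for the trained forests:

```python
from collections import Counter

def global_expert(first_layer, experts, pixels):
    """Global Expert Network: one shape label for the whole image,
    decided by majority vote over per-pixel first-layer predictions."""
    votes = Counter(first_layer(p) for p in pixels)
    shape = votes.most_common(1)[0][0]
    return [experts[shape](p) for p in pixels]

def local_expert(first_layer, experts, pixels):
    """Local Expert Network: each pixel is routed to its own expert."""
    return [experts[first_layer(p)](p) for p in pixels]
```

The global variant commits to a single hand shape (cheaper, but a wrong shape label misroutes every pixel); the local variant pays per-pixel routing cost but degrades more gracefully.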

  40. Parts distribution to pose • The RDF returns the hand part distribution. • Get the centre of each distribution using mean shift. 40
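Mean shift finds a density mode by repeatedly moving an estimate to the kernel-weighted mean of the surrounding points; a minimal 2-D sketch (the Gaussian bandwidth is an illustrative choice):

```python
import math

def mean_shift(points, start, bandwidth=1.0, iters=50):
    """Shift `start` toward the local density mode of 2-D `points`."""
    x, y = start
    for _ in range(iters):
        wsum = wx = wy = 0.0
        for px, py in points:
            d2 = (px - x) ** 2 + (py - y) ** 2
            w = math.exp(-d2 / (2 * bandwidth ** 2))  # Gaussian kernel weight
            wsum += w
            wx += w * px
            wy += w * py
        nx, ny = wx / wsum, wy / wsum
        if abs(nx - x) < 1e-9 and abs(ny - y) < 1e-9:
            break  # converged to a mode
        x, y = nx, ny
    return x, y
```

Run once per hand part, over that part's pixel votes, to turn each per-part distribution into a single joint position estimate.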

  41. American Sign Language 41

  42. First layer accuracy on ASL • 2-fold cross-validation: 97.8% • Confusion occurs for (m,n), (m,t) and (n,t) 42

  43. Confusions • Confusion occurs for (m,n), (m,t) and (n,t) 43

  44. Second layer accuracy Q = Number of clusters 44

  46. Problems • Not feasible to capture all possible variations of the hand with synthetic data. • Methods using only synthetic data suffer from synthetic-realistic discrepancies. • But: using real training data is expensive, as it must be manually labelled. • Solution: transductive learning. 46

  47. Transductive Random Forest • Transductive learning: learn from labelled data and transfer that knowledge to related unlabelled data. • Estimate the pose based on knowledge gained from both labelled and unlabelled data. 47

  48. Overview 48

  49. Training data • Training data consists of labelled real and synthetic data, and unlabelled real data. • Labelled elements are image patches, not pixels. • Each label is a tuple (a, p, v): a = viewpoint (e.g. «Front»), p = label of the closest joint (e.g. «Thumb»), v = vector containing all joint positions (3×16 coordinates). 49

  50. Quality Function • Randomly choose between the two terms: the transductive term and the classification-regression term. 50

  51. Quality Function • 𝑅_𝑏: measures the quality of a split with respect to the viewpoint a • 𝑅_𝑞: measures the quality of a split with respect to the joint label p • 𝑅_𝑤: measures the compactness of the vote vectors v 51

  52. Quality Function • Measures the “purity” of the node with respect to either the viewpoint a or the joint label p 52

  53. Quality Function • 𝑅_𝑢: measures image similarity between real data patches • 𝑅_𝑣: measures purity based on the association between the labelled and unlabelled data 53

  54. Kinematic Refinement • The hand is biomechanically constrained in the poses it can reach. • Use this to our advantage. • Apply kinematic refinement to enforce these constraints. 54
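One simple way to enforce such constraints is to clamp each estimated joint angle into its anatomically valid range. The joint names and limits below are illustrative placeholders, not the seminar's actual values:

```python
# Illustrative flexion limits in degrees, per joint type (hypothetical values)
JOINT_LIMITS = {
    "mcp_flexion": (-10.0, 90.0),   # metacarpophalangeal joint
    "pip_flexion": (0.0, 110.0),    # proximal interphalangeal joint
    "dip_flexion": (0.0, 80.0),     # distal interphalangeal joint
}

def refine_pose(pose):
    """Clamp each joint angle of `pose` (dict: joint name -> degrees) to its valid range."""
    refined = {}
    for joint, angle in pose.items():
        lo, hi = JOINT_LIMITS[joint]
        refined[joint] = min(max(angle, lo), hi)
    return refined
```

Real kinematic refinement also models inter-joint dependencies (e.g. DIP flexion coupled to PIP flexion), but per-joint clamping already rejects the grossest invalid estimates.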

  55. Some results 55

  56. Joint prediction accuracy 56

  57. Estimating pose of two hands? • Just apply a single-hand pose estimator twice? • What if both hands are strongly interacting? • The additional occlusion must be accounted for. 57

  58. Dual hand pose estimation • Model-based approach. • Set up a parameter space representing all degrees of freedom of both hands. • Employ PSO (Particle Swarm Optimization) to find the parameters best fitting the observation and current configuration with respect to a cost function. 58

  59. Sample parameter space (axes: z = yaw, y = pitch, x = roll) 59

  60. Cost function over param. space 60

  61. Initialization Random sample of n particles with random velocities. 61

  62. Iterating over parameter space • Update particle velocities with regard to: current velocity, local best position, global best position. • Update particle positions according to their velocities. 62

  63. Tracking • Use the RGB image to create a skin map. • Segment the depth image according to the skin map. 63

  64. Tracking • Cost function to optimize: • P(h): penalizes invalid finger positions. • D(O,h,C): penalizes discrepancies between hypothesis h and observation O. 64

  65. Applying PSO • Change particle velocity according to the PSO update rule, where 𝑃_{𝑖,𝑘} = best known position of particle i in generation k and 𝐺_𝑘 = best known position of all particles in generation k. • Apply PSO for each observation O. Exploit temporal information by sampling particles around the previous hypothesis. 65
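The update on this slide is the standard PSO velocity rule; a minimal sketch minimizing a toy cost function, where the hand-tracking objective D(O,h,C) + P(h) would take the place of `cost` and the search bounds and coefficients are illustrative defaults:

```python
import random

def pso(cost, dim, n_particles=30, iters=100, w=0.72, c1=1.5, c2=1.5, seed=0):
    """Standard PSO: v <- w*v + c1*r1*(P_i - x) + c2*r2*(G - x), then x <- x + v."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                    # P_i: best position seen by particle i
    pcost = [cost(x) for x in xs]
    g = pbest[min(range(n_particles), key=lambda i: pcost[i])][:]  # G: global best
    gcost = cost(g)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (g[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            c = cost(xs[i])
            if c < pcost[i]:
                pbest[i], pcost[i] = xs[i][:], c
                if c < gcost:
                    g, gcost = xs[i][:], c
    return g, gcost
```

The slide's point about temporal information corresponds to replacing the uniform initialization of `xs` with samples drawn around the previous frame's best hypothesis.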

  66. Some results 66

  67. Accuracy 67

  68. Future of hand pose estimation • Academically solved. • Further research in recovering more than pose, such as hand models or 3D skin models. • Including the RGB image in prediction increases accuracy. • Use of real data reduces synthetic-realistic discrepancies. 68

  69. Thank you for your attention! 69
