Understanding humans: identity, communication, state, and more. Yang Wu, Nara Institute of Science and Technology. NAIST International Collaborative Laboratory for Robotics Vision. 1


  1. Collaborative Representation for Re-ID  Non-sparse CR  CRNP (Collaboratively Regularized Nearest Points) • Classification. As in sparse/collaborative representation models for single-instance recognition, the set-specific coefficients β* = [β_1*, ..., β_n*] are implicitly made to have some discrimination power. We therefore design the classification model as
C(Q) = arg min_i d_CRNP(Q, X_i), where d_CRNP(Q, X_i) = ||Q α* − X_i β_i*||_2^2 / ||β_i*||_2^2.
Recall that RNP does not directly use the coefficients, which are actually also discriminative:
d_RNP(Q, X_i) = ||Q α* − X_i β_i*||_2^2.
Yang Wu, Michihiko Minoh, Masayuki Mukunoki, "Collaboratively Regularized Nearest Points for Set Based Recognition", In Proc. of the 24th British Machine Vision Conference (BMVC), 2013. 34
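Not on the original slide: a minimal numpy sketch of the CRNP classification rule above. It assumes the coefficients α* and β_i* have already been obtained by solving the CRNP optimization (the solver itself is omitted); all names are illustrative.

import numpy as np

def crnp_classify(Q, alpha, gallery_sets, betas):
    """Assign the query set Q to the gallery class with the smallest CRNP distance.

    Q            : (d, m) query set (columns are feature vectors)
    alpha        : (m,)   coefficients alpha* for the query set
    gallery_sets : list of (d, n_i) gallery sets X_i
    betas        : list of (n_i,) set-specific coefficients beta_i*
    """
    q = Q @ alpha                                    # collaborative "nearest point" of the query set
    dists = []
    for X_i, beta_i in zip(gallery_sets, betas):
        num = np.linalg.norm(q - X_i @ beta_i) ** 2  # reconstruction residual
        den = np.linalg.norm(beta_i) ** 2 + 1e-12    # coefficient energy, used for discrimination
        dists.append(num / den)
    return int(np.argmin(dists))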

  2. Collaborative Representation for Re-ID  Sparse CR  LCSA (Locality-constrained Collaborative Sparse Approximation). Figure: illustration of how a probe set X^p is approximated by the gallery sets X_1^g, ..., X_i^g, ..., X_n^g under (a) SANP, (b) CSA, (c) LCSAwNN, and (d) LCSAwMPD. Yang Wu, Michihiko Minoh, Masayuki Mukunoki, "Locality-constrained Collaborative Sparse Approximation for Multiple-shot Person Re-identification", In Proc. of the Asian Conference on Pattern Recognition (ACPR), 2013. 35

  3. Collaborative Representation for Re-ID  Sparse CR  Experimental Results. Figure: performance changes on the iLIDS-AA dataset; accuracy at rank top 10% (y-axis, roughly 0.45 to 0.7) plotted against the locality ratio (0.1 to 1.0) for LCSAwNN and LCSAwMPD with N = 10, 23, 46. Yang Wu, Michihiko Minoh, Masayuki Mukunoki, "Locality-constrained Collaborative Sparse Approximation for Multiple-shot Person Re-identification", In Proc. of the Asian Conference on Pattern Recognition (ACPR), 2013. 36

  4. Collaborative Representation for Re-ID  Non-sparse CR  LCRNP (Locality-constrained Collaboratively Regularized Nearest Points). Figure: the sparse variants (a) LCSAwNN and (b) LCSAwMPD versus the non-sparse variants (c) LCRNPwNN and (d) LCRNPwMPD, each approximating the probe set X^p with locality-constrained subsets of the gallery sets X_1^g, ..., X_n^g. Yang Wu, et al., "Locality-constrained Collaboratively Regularized Nearest Points for Multiple-shot Person Re-identification", FCV 2014. 37

  5. Collaborative Representation for Re-ID  Non-sparse CR  Experimental results for LCRNP, in comparison with the others. Figure: CMC curves (recognition percentage vs. rank) comparing CSA, LCSAwNN, LCSAwMPD, CRNP, LCRNPwNN, and LCRNPwMPD on iLIDS-MA (N = 10, 23, 46), iLIDS-AA (N = 10, 23, 46), and CAVIAR4REID (N = 5, N = 10, and N = 10 unspecified); the parenthesized legend scores show the LCRNP variants performing best in most settings. Yang Wu, et al., "Locality-constrained Collaboratively Regularized Nearest Points for Multiple-shot Person Re-identification", FCV 2014. 38

  6. 1. Parametric ( Set-to-set distance + metric learning ) 2. Non-parametric ( Collaborative representation ) How about combining them? 39

  7. Related work  Parametric methods  Background: Dictionary Learning. A matrix of training samples X (feature vectors of dimension d, N samples) is approximated by a dictionary D (d × k) times a coefficient matrix α (k × N), i.e. X ≈ Dα, learned on training data with regularizers on the coefficients and the dictionary (e.g. sparsity, discrimination). A small sketch follows. 40
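Not from the slides: a small scikit-learn sketch of the X ≈ Dα model described above, using DictionaryLearning with a sparsity penalty on the coefficients. The discriminative regularizer mentioned on the slide is not modeled here, and the data are synthetic; note that scikit-learn stores samples as rows, whereas the slide draws them as columns.

import numpy as np
from sklearn.decomposition import DictionaryLearning

# toy training data: N = 200 samples of dimension d = 30 (rows are samples in scikit-learn)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))

# learn a dictionary with k = 50 atoms and sparse coefficients (sparsity via the alpha penalty)
dl = DictionaryLearning(n_components=50, alpha=1.0, max_iter=100, random_state=0)
A = dl.fit_transform(X)        # coefficients alpha, shape (N, k)
D = dl.components_             # dictionary D, shape (k, d)

reconstruction = A @ D         # X is approximated by alpha * D
print(np.linalg.norm(X - reconstruction) / np.linalg.norm(X))   # relative reconstruction error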

  8. Parametric (Collaborative representation + Dictionary learning)  Discriminative Collaborative Representation (DCR). Probe samples X_p (d × N_p) and gallery samples X_g (d × N_g) are both coded over a learned dictionary D, with coefficients α_p and α_g; strong and costly regularization terms were used. Yang Wu, et al., "Discriminative Collaborative Representation for Classification", ACCV 2014. 41

  9. New proposal: dictionary co-learning  Dictionary Collaborative Learning (DCL). Camera-specific dictionaries are learned collaboratively: gallery samples X_i^g are coded with coefficients β_i^g over the gallery-camera dictionary D_g, and probe samples X_i^p with coefficients β_i^p over the probe-camera dictionary D_p. 42

  10. Experiments  Results Experimental results: Effectiveness Rank 1 accuracy Nonparametric Parametric 43

  11. Experiments  Results  Experimental results: Efficiency • Running time in milliseconds/person, using MATLAB on a normal CPU. Nonparametric vs. Parametric: 10-100x speedup. 44

  12. Video-based ReID: Perspectives of Set and Sequence. Figure: the same frames viewed as an unordered set versus as an ordered sequence. 45

  13. Sequence : the order matters! 46

  14. Proposal: Temporal Convolution [AAAI 2018] Wu et al., "Temporal-Enhanced Convolutional Network for Person Re-identification". 47

  15. NAIST International Collaborative Laboratory for Robotics Vision. Identity (Who?). Communication (What does he/she want? How does he/she feel?): explicit expression. State, Action, ... (What is he/she doing? How does he/she do it?): implicit expression. 48

  16. People communicate to understand each other What if machines understand them? 49

  17. Our goal: automatic recognition of spontaneous head gestures 50

  18. Targeted head gestures: Nod, Ticks, Jerk, Up, Down, Tilt, Shake, Turn, Forward, Backward. 51

  19. Benefits of understanding communication Human-robot interaction Communication assistance [Maatman et al. 2005] [Asakawa 2015] 52

  20. Importance of non-verbal information. Communication information divides into verbal information (audio) and non-verbal information (expression, hand gesture, head gesture, ...), carried by the audio and visual channels. Non-verbal information has a significant influence, e.g. Mehrabian's rule (the 7%-38%-55% rule for verbal/audio/visual). We focus on head gesture detection: it appears frequently [Hadar et al. 1983] and plays an important role [Kousidis et al. 2013, McClave 2000]. 53

  21. Our contributions and novelties  Contributions:  built a novel dataset;  evaluated representative automatic recognition models.  Novelties (in comparison with existing work):  Dataset: closer to real applications; better suited for deeper and further research.  Solution: a general hand-crafted feature; a comparative study of representative recognition algorithms. 54

  22. Previous studies on head gesture detection: recognized head gestures. Nod: [Morency et al. 2007], [Nakamura et al. 2013], [Chen et al. 2015]. Nod, Shake: [Kawato et al. 2000], [Kapoor et al. 2001], [Tan et al. 2003], [Morency et al. 2005], [Wei et al. 2013]. Nod, Shake, Turn: [Saiga et al. 2010]. Nod, Shake, Tilt, Still: [Fujie et al. 2004]. Only Nod and Shake are widely handled gestures; Nod is the common concern. 55

  23. Previous studies on head gesture detection: recording conditions. No interlocutors: [Kawato et al. 2000], [Kapoor et al. 2001], [Tan et al. 2003], [Wei et al. 2013]. Against a robot: [Fujie et al. 2004]. Speaker-listener style: [Morency et al. 2005], [Morency et al. 2007], [Nakamura et al. 2013]. Mutual conversations: [Chen et al. 2015], [Saiga et al. 2010]. Few people have worked on spontaneous head gestures in human conversations. 56

  24. Dataset Construction 57

  25. Recording • 30 sequences of approx. 10 min. from 15 participants • Includes familiar/unfamiliar pairs and indoor/outdoor recordings • Conversations with topics chosen beforehand • The purpose of the recording was announced • Setup: each participant wore a wearable camera and a microphone, and two fixed cameras recorded the pair. 58

  26. Annotation. The freeware Anvil 5 [Kipp 2014] was used for manual annotation (up to 3 overlapping gestures were allowed). 3 naive annotators annotated all the data independently, after a quick training with guidelines and examples. 59

  27. Ground-Truth Inference • IoU: Intersection over Union. For two time intervals T_1 and T_2 with intersection I_{1,2} and union U_{1,2}: IoU(T_1, T_2) = length(I_{1,2}) / length(U_{1,2}). A helper sketch follows. 60
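A small helper (not from the slides) implementing the temporal IoU defined above, for intervals given as (start, end) times.

def temporal_iou(t1, t2):
    """IoU of two time intervals t1 = (s1, e1), t2 = (s2, e2)."""
    s1, e1 = t1
    s2, e2 = t2
    intersection = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - intersection
    return intersection / union if union > 0 else 0.0

# example: two annotations of the same nod
print(temporal_iou((10.0, 12.0), (10.5, 12.5)))   # 0.6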

  28. Ground-Truth Inference. Each annotation has a gesture type and a strength. Example: Annotator A: Nod 2, Nod 3, Turn 1, Shake 3; Annotator B: Tilt 1, Nod 2, Shake 2; Annotator C: Up 2, Shake 3, Down 3. With IoU_th = 0.5, annotations from different annotators whose intervals overlap with IoU > IoU_th (e.g. A&B IoU = 0.6, A&C IoU = 0.8, B&C IoU = 0.6) are merged with non-maximum suppression; inferred ground truth here: Nod 2.5 and Shake 3. A simplified sketch of this merging follows. 61
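A simplified sketch (an illustration under stated assumptions, not the authors' exact procedure), reusing the temporal_iou helper above: overlapping annotations are grouped by the IoU threshold, a group is kept only when at least two annotations agree on the gesture type, and the strength is averaged; the non-maximum suppression step is omitted.

from collections import Counter

def infer_ground_truth(annotations, iou_th=0.5):
    """annotations: list of (start, end, gesture_type, strength) from all annotators."""
    remaining = list(annotations)
    inferred = []
    while remaining:
        seed = remaining.pop(0)
        # collect every remaining annotation that overlaps enough with the seed interval
        matched = [a for a in remaining if temporal_iou(a[:2], seed[:2]) > iou_th]
        remaining = [a for a in remaining if temporal_iou(a[:2], seed[:2]) <= iou_th]
        group = [seed] + matched
        best_type, votes = Counter(a[2] for a in group).most_common(1)[0]
        if votes >= 2:                       # at least two annotators agree on the type
            same = [a for a in group if a[2] == best_type]
            start = min(a[0] for a in same)
            end = max(a[1] for a in same)
            strength = sum(a[3] for a in same) / len(same)   # e.g. Nod 2 and Nod 3 -> Nod 2.5
            inferred.append((start, end, best_type, strength))
    return inferred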

  29. Statistics (Inferred Ground-truth with IoU=0.5) Total No. of Samples: 4147 62

  30. Type Distribution per Subject 63

  31. Strength Distribution per Subject 64

  32. Familiar vs. Unfamiliar Ticks Nod 65

  33. Length Distribution Median 66

  34. Recognition tasks. Goal: to detect varied head gestures in spontaneous conversations. Detection: given a sequence, infer when and which gestures appear. To understand the problem better, we also work on the task of Classification: given a segmented gesture clip, infer which type it belongs to. 67

  35. General framework: Head pose → Features → Classifier or Detector 68

  36. Head pose estimation. Head pose (and position) were estimated with ZFace [Jeni et al. 2015], giving per-frame Pitch, Roll, Yaw, X, Y, and Scale signals. 69

  37. A general hand-crafted feature: Histogram of Velocity and Acceleration (HoVA). Each head-pose signal and its 1st derivative are split into temporal windows. 70

  38. Histogram of Velocity and Acceleration (HoVA). Within each window of the 1st derivative (velocity), the positive and the negative values are summed separately (e.g. +: 1.4 / -: 2.0, +: 2.4 / -: 2.6, +: 4.3 / -: 1.8). 71

  39. Histogram of Velocity and Acceleration (HoVA). The same positive/negative sums are computed for the 2nd derivative (acceleration); together, the per-window sums over the 1st and 2nd derivatives of the original signals form the HoVA descriptor (a rough sketch follows). 72
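A rough numpy sketch of the HoVA computation as read from the slides above (not the authors' code): per temporal window, the positive and the negative values of the 1st and 2nd derivatives of each pose signal are summed separately. Assuming the six ZFace signals (pitch, roll, yaw, X, Y, scale), this gives 6 signals x 2 derivatives x 2 signs = 24 values per window, matching the n x 24 input on the later LSTM slide; the window length is an assumed parameter.

import numpy as np

def hova(pose, win=10):
    """pose: (T, 6) array of per-frame head-pose signals.
    Returns an (n_windows, 24) HoVA feature: per window, the sums of the
    positive and of the negative values of the 1st and 2nd derivatives."""
    vel = np.diff(pose, n=1, axis=0)              # velocity,     (T-1, 6)
    acc = np.diff(pose, n=2, axis=0)              # acceleration, (T-2, 6)
    feats = []
    for start in range(0, len(acc) - win + 1, win):
        parts = []
        for sig in (vel[start:start + win], acc[start:start + win]):
            parts.append(np.sum(np.maximum(sig, 0), axis=0))    # sum of positive values
            parts.append(np.sum(np.maximum(-sig, 0), axis=0))   # magnitude of negative values
        feats.append(np.concatenate(parts))       # 6 signals x 2 derivatives x 2 signs = 24
    return np.asarray(feats)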

  40. Existing classification models. Rule-based: [Kawato et al. 2000], [Saiga et al. 2010], [Nakamura et al. 2013]. SVM: [Morency et al. 2005], [Chen et al. 2015]. HMM: [Kapoor et al. 2001], [Tan et al. 2003], [Fujie et al. 2004], [Wei et al. 2013]. LDCRF: [Morency et al. 2007]. 73

  41. We evaluate the following models  Non-graphical:  SVM  Graphical:  Hidden-state Conditional Random Field (HCRF) for classification  Latent-Dynamic Conditional Random Field (LDCRF) for detection  Long Short-Term Memory (LSTM) 74

  42. LDCRF (Latent-Dynamic Conditional Random Field) [Morency et al. 2007]: a Conditional Random Field enhanced for action detection. Weights are learned between the data and the hidden states, each label has its own set of hidden states, and the sequence of hidden states is optimized over the temporal data (e.g. labels A A A B B A A realized by hidden states A1 A1 A2 B2 B1 A1 A2). 75

  43. LSTMs (Long Short-Term Memory). Architecture: input temporal data (n × 24) → LSTM (n × 64) → bidirectional LSTM (n × 64) → max pooling (192) → dense + ReLU (32) → dense + softmax (10 outputs). A sketch follows. 76
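A hedged PyTorch sketch in the spirit of the architecture above; the exact pooling that yields the 192-dimensional vector is an assumption. Here the per-frame outputs of the unidirectional LSTM (64) and the bidirectional LSTM (128) are concatenated and max-pooled over time, giving 64 + 128 = 192 features before the 32-unit ReLU layer and the 10-way classifier.

import torch
import torch.nn as nn

class HeadGestureLSTM(nn.Module):
    def __init__(self, in_dim=24, hidden=64, n_classes=10):
        super().__init__()
        self.lstm1 = nn.LSTM(in_dim, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(hidden + 2 * hidden, 32)   # 64 + 128 = 192 pooled features
        self.fc2 = nn.Linear(32, n_classes)

    def forward(self, x):                # x: (batch, n_frames, 24) HoVA sequences
        h1, _ = self.lstm1(x)            # (batch, n, 64)
        h2, _ = self.lstm2(h1)           # (batch, n, 128)
        h = torch.cat([h1, h2], dim=-1)  # (batch, n, 192)
        h = h.max(dim=1).values          # global max pooling over time -> (batch, 192)
        h = torch.relu(self.fc1(h))
        return self.fc2(h)               # class logits (softmax applied in the loss)

logits = HeadGestureLSTM()(torch.randn(8, 50, 24))
print(logits.shape)   # torch.Size([8, 10])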

  44. Results – Classification (Accuracy, F-score)
• Accuracy (averaged):
  Method        | Training Set | Training-Val Set | Validation Set | Test Set
  SVM           | 0.68 ± 0.02  | 0.74 ± 0.04      | 0.62 ± 0.11    | 0.60 ± 0.12
  SVM_weighted  | 0.65 ± 0.02  | 0.76 ± 0.01      | 0.59 ± 0.11    | 0.57 ± 0.13
  HCRF          | 0.88 ± 0.04  | 0.83 ± 0.03      | 0.66 ± 0.14    | 0.64 ± 0.10
  LSTMs         | 0.79 ± 0.02  | 0.84 ± 0.06      | 0.63 ± 0.14    | 0.61 ± 0.15
• F-score (averaged):
  Method        | Training Set | Training-Val Set | Validation Set | Test Set
  SVM           | 0.483        | 0.318            | 0.387          | 0.307
  SVM_weighted  | 0.493        | 0.324            | 0.408          | 0.388
  HCRF          | 0.799        | 0.386            | 0.433          | 0.382
  LSTMs         | 0.600        | 0.394            | 0.386          | 0.391
77

  45. Results – Classification (Confusion Matrix) • Test set only, overall accumulation SVM SVM_weighted HCRF LSTMs 78

  46. Results – Classification (Class-specific) 79

  47. Simulated Human Performance -- Classification. Frame-wise confusion matrices, without the "None" class and with the "None" class. 80

  48. Results – Detection (PR-curve, AP) 81

  49. Results – Detection (AP). Figure: per-class average precision (0 to 1) for Nod, Jerk, Up, Down, Ticks, Tilt, Shake, Turn, Forward, Backward, and overall, comparing SVM and LDCRF. • Poorer results when fewer samples are available • LDCRF can better model classes with more diversity, e.g. Ticks. 82

  50. Conclusions and discussions  Spontaneous head gesture recognition is a hard problem • Hard for humans, but even harder for automatic recognition  Gesture types are not equally hard for automatic recognition  Larger models are stronger  Deep learning is more promising, but more data is needed. 83

  51. NAIST International Collaborative Laboratory for Robotics Vision. Identity (Who?). Communication (What does he/she want? How does he/she feel?): explicit expression. State, Action, ... (What is he/she doing? How does he/she do it?): implicit expression. 84

  52. Proposal of a Wrist-mounted Depth Camera for Finger Gesture Recognition. Kai Akiyama, Yang Wu, Nara Institute of Science and Technology. A wrist-mounted Time-of-Flight camera retrieves depth images; target applications include AR/VR control and daily activity recognition. 85

  53. Hand pose estimation - Applications: playing games, driving assistance, surgery assistance, etc. 86

  54. Background – Depth-based 3D hand pose estimation benchmark: Hands In the Million Challenge (HIM2017) (S. Yuan, et al. 2017). Training data: 957K frames. Testing data: Single frame (296K), handled by a pose estimator; Tracking (295K), handled by a hand detector + pose estimator; Interaction (2K). 87

  55. Proposed 3D hand pose estimator architecture (1). The network takes thickened cloud points of the hand, passes them through Blocks 1-4 and a 1024-unit dense layer, and produces per-finger output branches (Output_T: 27; Output_I, Output_M, Output_R, Output_P: 24 each) plus Output_hand (63), the 3D coordinates of the hand joints. 88

  56. Proposed 3D hand pose estimator architecture (2). Detail view of Blocks 1-4 and the 1024-unit dense layer feeding the 24/27-dimensional branch outputs and the 63-dimensional hand-joint output (3D coordinates of hand joints). 89

  57. Pipeline of the pose estimator (single-frame pose estimation): extract the hand based on a given bounding box → represent the data as a 50x50x50 volume → estimate the 3D hand pose with the pose estimator and transform it back to the original coordinates. A toy sketch of the volumetric step follows. 90
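Not from the paper: a toy numpy sketch of the 50x50x50 volumetric representation step, assuming the hand has already been cropped to a set of 3D points. The normalization used by the authors is not specified on the slide, so a simple cube around the hand center is assumed, and the center/extent are returned so a predicted pose can be mapped back to the original coordinates.

import numpy as np

def voxelize_hand(points, grid=50):
    """points: (N, 3) 3D points of the cropped hand (e.g. in mm).
    Returns a (grid, grid, grid) occupancy volume centered on the hand."""
    center = points.mean(axis=0)
    half = np.abs(points - center).max() + 1e-6           # half-size of a cube around the hand
    idx = ((points - center) / (2 * half) + 0.5) * grid   # map points into [0, grid)
    idx = np.clip(idx.astype(int), 0, grid - 1)
    vol = np.zeros((grid, grid, grid), dtype=np.float32)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vol, center, half   # center/half allow mapping predictions back to original coordinates

vol, center, half = voxelize_hand(np.random.rand(500, 3) * 200)
print(vol.shape, vol.sum())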

  58. Qualitative results of 3D hand pose estimator 91

  59. Evaluation on the 3D hand pose estimation task of HIM2017 benchmark 92

  60. Utilizing a hand detector for the tracking and interaction tasks. Testing data: Single frame (pose estimator); Tracking (hand detector + pose estimator); Interaction. We need a hand detector to find where the hand is in a real application. 93

  61. Architecture of the 3D hand pose tracking system: hand detector + hand verifier + pose estimator. On success, the pose estimator runs on the detected hand; on failure, the pose from the previous frame is taken. Hand verifier: 1. compared with the previous frame, whether the center of the detected hand area shifts by more than 150 mm; 2. whether the detected hand area contains more than 1000 pixels. A sketch follows. 94
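A direct Python sketch (not the authors' code) of the two verifier checks listed above: a detection is rejected if the hand center moved more than 150 mm since the previous frame or if the detected hand area has 1000 pixels or fewer, in which case the previous frame's pose is reused. estimate_pose is a placeholder for the pose estimator.

import numpy as np

def verify_hand(curr_center, prev_center, hand_mask,
                max_shift_mm=150.0, min_pixels=1000):
    """Return True if the detected hand passes both checks from the slide."""
    shift = np.linalg.norm(np.asarray(curr_center) - np.asarray(prev_center))
    return shift <= max_shift_mm and int(hand_mask.sum()) > min_pixels

def track_frame(detected_center, prev_center, hand_mask, estimate_pose, prev_pose):
    """If verification fails, fall back to the pose from the previous frame."""
    if verify_hand(detected_center, prev_center, hand_mask):
        return estimate_pose(hand_mask)   # run the pose estimator on the verified detection
    return prev_pose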

  62. Qualitative results of 3D hand pose tracking. Figure: sequential frames showing the depth image, the hand mask, and the estimated hand pose overlaid on the depth image. 95

  63. Evaluation on the 3D hand tracking task of HIM2017 benchmark 96

  64. Applying the modified tracking system to hand-object interaction: hand detector → pose estimator. 97

  65. Qualitative results of 3D hand-object interaction pose estimation Depth image Hand mask Estimated hand pose and depth image 98

  66. Evaluation on the hand object interaction task of HIM2017 benchmark 99

  67. Evaluation results on all tasks of HIM2017 benchmark 100
