and more Yang Wu Nara Institute of Science and Technology - PowerPoint PPT Presentation

Collaborative Representation for Re-ID  Non-sparse CR Collaboratively Regularized Nearest Points • Classification Like sparse/collaborative representation models for single-instance based   β β β recognition, here the set-specific coefficients is implicitly made * * * [ , , ] 1 n to have some discrimination power. Therefore, we design our classification model as follows.      i C Q arg min d , CRNP i   2 2    Q α X β β i * * * where Q X d · / . CRNP i i i i * * 2 2 Recall that RNP doesn’t directly use the coefficients themselves which are actually also discriminative.   2    Q α X β i * * Q X d · , RNP i i * * 2 Yang Wu, Michihiko Minoh, Masayuki Mukunoki, " Collaboratively Regularized Nearest Points for Set Based Recognition ", In Proc. of The 24th British Machine Vision Conference ( BMVC ), 2013 . 34

Collaborative Representation for Re-ID  Sparse CR LCSA (Locality-constrained Collaborative Sparse Approximation) g X g g g X X X 1 1 1 1 p p p X X X g X i p X g g g X X X p i i i X p X g g g g X X X X n n n n (a) SANP (b) CSA (c) LCSAwNN (d) LCSAwMPD Yang Wu, Michihiko Minoh, Masayuki Mukunoki, " Locality-constrained Collaborative Sparse Approximation for Multiple-shot Person Re-identification ", In Proc. of The Asian Conference on Pattern Recognition ( ACPR ), 2013 . 35

Collaborative Representation for Re-ID  Sparse CR Experimental Results Performance changes on the "iLIDS-AA" dataset 0.7 Accuracy at rank top 10% 0.65 0.6 LCSAwNN, N=10 0.55 LCSAwNN, N=23 LCSAwNN, N=46 0.5 LCSAwMPD, N=10 LCSAwMPD, N=23 LCSAwMPD, N=46 0.45 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Locality ratio Yang Wu, Michihiko Minoh, Masayuki Mukunoki, " Locality-constrained Collaborative Sparse Approximation for Multiple-shot Person Re-identification ", In Proc. of The Asian Conference on Pattern Recognition ( ACPR ), 2013 . 36

Collaborative Representation for Re-ID  Non-sparse CR LCRNP (Locality-constrained Collaboratively Regularized Nearest Points) g g X X 1 1 p p X X Sparse g g X X i i g g X X n n (a) LCSAwNN (b) LCSAwMPD g g X X 1 1 p p X X Non-sparse g g X X i i g g X X n n (c) LCRNPwNN (d) LCRNPwMPD Yang Wu, et al., " Locality-constrained Collaboratively Regularized Nearest Points for Multiple-shot Person Re-identification ", FCV 2014. 37

Collaborative Representation for Re-ID  Non-sparse CR Experimental results for LCRNP , in comparison with the others CMC on the "iLIDS-MA" dataset with N=10 CMC on the "iLIDS-MA" dataset with N=23 CMC on the "iLIDS-MA" dataset with N=46 0.9 0.9 0.9 0.8 0.8 0.8 Recognition percentage Recognition percentage Recognition percentage 0.7 0.7 0.7 CSA (0.700) CSA (0.732) CSA (0.725) 0.6 0.6 0.6 LCSAwNN (0.750) LCSAwNN (0.768) LCSAwNN (0.800) LCSAwMPD (0.825) LCSAwMPD (0.780) LCSAwMPD (0.787) 0.5 CRNP (0.775) 0.5 CRNP (0.777) 0.5 CRNP (0.790) LCRNPwNN (0.787) LCRNPwNN (0.815) LCRNPwNN (0.850) LCRNPwMPD (0.798) LCRNPwMPD (0.838) LCRNPwMPD (0.875) 0.4 0.4 0.4 1 2 3 4 1 2 3 4 1 2 3 4 Rank Rank Rank CMC on the "iLIDS-AA" dataset with N=10 CMC on the "iLIDS-AA" dataset with N=23 CMC on the "iLIDS-AA" dataset with N=46 0.8 0.8 0.8 0.7 0.7 0.7 Recognition percentage Recognition percentage Recognition percentage 0.6 0.6 0.6 0.5 0.5 0.5 0.4 0.4 0.4 CSA (0.613) CSA (0.554) CSA (0.578) LCSAwNN (0.655) LCSAwNN (0.694) LCSAwNN (0.688) 0.3 0.3 0.3 LCSAwMPD (0.604) LCSAwMPD (0.676) LCSAwMPD (0.673) CRNP (0.707) CRNP (0.734) CRNP (0.713) 0.2 0.2 0.2 LCRNPwNN (0.722) LCRNPwNN (0.745) LCRNPwNN (0.759) LCRNPwMPD (0.721) LCRNPwMPD (0.737) LCRNPwMPD (0.714) 0.1 0.1 0.1 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 Rank Rank Rank CMC on the "CAVIAR4REID" dataset with N=5 CMC on the "CAVIAR4REID" dataset with N=10 CMC on the "CAVIAR4REID" dataset with N=10, unspecified 0.8 0.8 0.8 0.7 0.7 0.7 Recognition percentage Recognition percentage Recognition percentage 0.6 0.6 0.6 0.5 0.5 0.5 0.4 0.4 0.4 CSA (0.446) CSA (0.540) CSA (0.652) LCSAwNN (0.588) LCSAwNN (0.720) LCSAwNN (0.760) 0.3 0.3 0.3 LCSAwMPD (0.544) LCSAwMPD (0.660) LCSAwMPD (0.704) CRNP (0.624) CRNP (0.700) CRNP (0.674) 0.2 0.2 0.2 LCRNPwNN (0.642) LCRNPwNN (0.740) LCRNPwNN (0.734) LCRNPwMPD (0.638) LCRNPwMPD (0.700) LCRNPwMPD (0.734) 0.1 0.1 0.1 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 Rank Rank Rank Yang Wu, et al., " Locality-constrained Collaboratively Regularized Nearest Points for Multiple-shot Person Re-identification ", FCV 2014. 38

1. Parametric ( Set-to-set distance + metric learning ) 2. Non-parametric ( Collaborative representation ) How about combining them? 39

Related work  Parametric methods Background: Dictionary Learning Coefficients   α Samples Dictionary …     D X … … …   … … … … d … ... ... ... ... ... ... ... ... ... ... ... ... … … ... ... ... ... ... Feature vector … N k    … k N Training Regularizer Regularizer (e.g. Sparsity) (e.g. Discrimination) 40

Parametric (Collaborative representation + Dictionary learning) Discriminative Collaborative Representation (DCR) Coefficients Samples Dictionary   α     X D p p … … … … …   … … … … … d … ... ... ... ... ... ... ... ... ... ... … … … N p ... ... ... ... ... ... ... N g  … …  … … … … d … ... ... ... ... ... … … Strong and costly … ... ... ... ... ... … … regularization terms … were used.   α   X g g Yang Wu, et al., " Discriminative Collaborative Representation for Classification ", ACCV 2014. 41

New proposal: dictionary co-learning Dictionary Collaborative Learning (DCL) Coefficients Samples Dictionary       β X D g g g … … … … … …   … … X α i i … , … g g … ... ... ... ... ... ... ... …  d α i i 1 N . g i g ... ... ... ... … N ,1 ... ... ... ... g … … Learning Camera-specific c Dictionaries Collaboratively c … X α i i  , … … … p p  …  … … α i i d 1 N . p i p … ... ... ... ... N ,1 p … … ... ... ... ... ... ... ... X p … ... ... ... ... …   … … …   D β p p 42

Experiments  Results Experimental results: Effectiveness Rank 1 accuracy Nonparametric Parametric 43

Experiments  Results Experimental results: Efficiency • Running time in milliseconds/person, using matlab with a normal CPU . Nonparametric Parametric 10-100x Speedup 44

Video-based ReID : Perspectives of Set and Sequence Set Sequence … … 45

Sequence : the order matters! 46

Proposal : Temporal Convolution [AAAI 2018] Wu et . al., “Temporal -Enhanced Convolutional Network for Person Re- identification”. 47

NAIST International Collaborative Laboratory for Robotics Vision Identity (Who?) Communication State, Action, ... (What [does he/she want] ? (What [is he/she doing] ? How [does he/she feel] ?) How [does he/she do it] ?) Explicit expression Implicit expression 48

People communicate to understand each other What if machines understand them? 49

Our goal: automatic recognition of spontaneous head gestures 50

Targeted head gestures Nod Ticks Jerk Up Down Tilt Shake Turn Forward Backward 51

Benefits of understanding communication Human-robot interaction Communication assistance [Maatman et al. 2005] [Asakawa 2015] 52

Importance of non-verbal information Verbal information Audio Communication information Non-verbal information Expression Visual information Hand gesture Head gesture Non-verbal information influences significantly ⋮ e.g.) Mehrabian’s rule (Rule of 7%-38%-55%) Verbal Audio Visual We focus on head gesture detection • Appears frequently [Hadar et al. 1983] • Takes important role [Kousidis et al. 2013, McClave 2000] 53

Our contributions and novelties  Contributions  Built a novel dataset  Evaluated representative automatic recognition models  Novelties (with comparison to existing work)  Dataset :  closer to real applications  better for deeper and further researches  Solution :  a general hand-crafted feature  a comparative study of representative recognition algorithms 54

Previous studies on head gesture detection Recognized head gestures P r e v i o u s s t u d i e s [Morency et al. 2007] Nod [Nakamura et al. 2013] Only Nod and Shake are [Chen et al. 2015] widely handled gestures. [Kawato et al. 2000] [Kapoor et al. 2001] Nod, Shake [Tan et al. 2003] Nod is commonly [Morency et al. 2005] concerned. [Wei et al. 2013] Nod, Shake, [Saiga et al. 2010] Turn Nod, Shake, [Fujie et al. 2004] Tilt, Still 55

Previous studies on head gesture detection Recording conditions P r e v i o u s s t u d i e s [Kawato et al. 2000] [Kapoor et al. 2001] No interlocutors [Tan et al. 2003] [Wei et al. 2013] [Fujie et al. 2004] Against a robot [Morency et al. 2005] [Morency et al. 2007] Speaker-listener style [Nakamura et al. 2013] Few people have worked on spontaneous head gestures [Chen et al. 2015] Mutual conversations in human conversations [Saiga et al. 2010] 56

Dataset Construction 57

Recording • 30 sequences of approx. 10 min. from 15 participant • Includes familiar/unfamiliar pairs, indoor/outdoor records • Conversations with topics chosen beforehand • Purpose of the recording is announced Wearable camera Wearable camera Microphone Fixed camera Fixed camera Microphone 58

Annotation A freeware Anvil5 [Kipp 2014] was used for manual annotation. (up to 3 overlapping gestures were allowed) 3 naive annotators annotated all the data independently, after a quick training with guideline and examples. 59

Ground-Truth Inference • IoU: Interaction over Union Interval 1: 𝑇 1 Interval 2: 𝑇 2 time Intersection: 𝐽 1,2 Union: 𝑉 1,2 𝐽𝑝𝑉 𝑇 1 , 𝑇 2 = 𝑚𝑓𝑜𝑕𝑢ℎ(𝐽 1,2 ) 𝑚𝑓𝑜𝑕𝑢ℎ(𝑉 1,2 ) 60

Ground-Truth Inference Gesture Type Strength Annotator A: Nod, 2 Nod, 3 Turn, 1 Shake, 3 Annotator B: Tilt, 1 Nod, 2 Shake, 2 Annotator C: Up, 2 Shake, 3 Down, 3 Suppose IoU_th = 0.5 Inferred: Non- maximum Nod, 2.5 Shake, 3 Suppression A&B: IoU=0.6 > IoU_th A&B: IoU=0.65 > IoU_th A&B: IoU=0.65 > IoU_th A&C: IoU=0.8 > IoU_th A&C: IoU=0.8 > IoU_th B&C: IoU=0.6 > IoU_th B&C: IoU=0.6 > IoU_th 61

Statistics (Inferred Ground-truth with IoU=0.5) Total No. of Samples: 4147 62

Type Distribution per Subject 63

Strength Distribution per Subject 64

Familiar vs. Unfamiliar Ticks Nod 65

Length Distribution Median 66

Recognition tasks To detect varied head gestures from spontaneous conversations Detection ： Given a sequence, to infer when and which gestures appear. Nod Nod Shake To understand the problem better, we also work on the task of Classification ： Given a segmented gesture clip, to infer which type it belongs to. Tilt Shake classifier ? ? ? Nod Turn 67

General framework Head pose Features Classifier or Detector 68

Head pose estimation Head pose (and position) were estimated with ZFace [Jeni et al. 2015] Pitch Roll Yaw X Y Scale Frame number 69

A general hand-crafted feature Histogram of Velocity and Acceleration (HoVA) ⌇ ⌇ ⌇ ⌇ Original 1 st derivative ⌇ ⌇ ⌇ ⌇ 70

Histogram of Velocity and Acceleration (HoVA) Original 1 st derivative ＋： 1.4 ＋： 2.4 ＋： 4.3 －： 2.0 －： 2.6 －： 1.8 71

Histogram of Velocity and Acceleration (HoVA) Original 2 nd derivative ＋： 1.4 ＋： 2.4 ＋： 4.3 －： 2.0 －： 2.6 －： 1.8 ＋： 2.2 ＋： 2.4 ＋： 4.3 －： 2.0 －： 2.6 －： 1.8 72

Existing classification models Learning model P r e v i o u s s t u d i e s [Kawato et al. 2000] (rule-base) [Saiga et al. 2010] [Nakamura et al. 2013] [Morency et al. 2005] SVM [Chen et al. 2015] [Kapoor et al. 2001] [Tan et al. 2003] HMM [Fujie et al. 2004] [Wei et al. 2013] LDCRF [Morency et al. 2007] 73

We evaluate the following models  Non-graphical  SVM  Graphical  Hidden-state Conditional Random Field (HCRF) for classification  Latent-Dynamic Conditional Random Field (LDCRF) for detection  Long-Short Term Memory (LSTM) LDCRF 74

LDCRF (Latent-Dynamic Conditional Random Field) [Morency et al. 2007] Conditional Random Field enhanced for action detection Hidden state Learn weights between each and Hidden state Hidden state Data Optimize the order of hidden states throughout temporal data A A A B B A A Label Hidden state A1 A1 A2 B2 B1 A1 A2 Data −2 −1 3 10 12 −1 −2 75

LSTMs (Long-Short Term Memory) Output 10 Dense + Softmax ⋰ ⋮ ⋱ 32 Dense + ReLU ⋰ ⋮ ⋱ 192 Max pooling n x 64 Bidirectional LSTM n x 64 n x 64 LSTM Input temporal data n x 24 76

Results – Classification (Accuracy, F-score) • Accuracy (Averaged) Method Training Set Training-Val Set Validation Set Test Set 0.68 ± 0.02 0.74 ± 0.04 0.62 ± 0.11 0.60 ± 0.12 SVM 0.65 ± 0.02 0.76 ± 0.01 0.59 ± 0.11 0.57 ± 0.13 SVM_weighted 0.88 ± 0.04 0.83 ± 0.03 0.66 ± 0.14 0.64 ± 0.10 HCRF 0.84 ± 0.06 0.63 ± 0.14 0.61 ± 0.15 LSTMs 0.79 ± 0.02 • F-score (Averaged) Method Training Set Training-Val Set Validation Set Test Set SVM 0.483 0.318 0.387 0.307 SVM_weighted 0.493 0.324 0.408 0.388 HCRF 0.799 0.386 0.433 0.382 LSTMs 0.600 0.394 0.386 0.391 77

Results – Classification (Confusion Matrix) • Test set only, overall accumulation SVM SVM_weighted HCRF LSTMs 78

Results – Classification (Class-specific) 79

Simulated Human Performance -- Classification Frame-wise confusion matrix Frame-wise confusion matrix (without “None” class) (with “None” class) 80

Results – Detection (PR-curve, AP) 81

Results – Detection (AP) 1 Nod Jerk 0.75 Up Down Ticks Tilt 0.5 Shake Turn 0.25 Forward Backward 0 全体 SVM LDCRF • Poorer results when fewer samples are available • LDCRF can better model classes with more diversities, e.g. Ticks. 82

Conclusions and discussions  Spontaneous head gesture recognition is a hard problem • Hard for humans, but even harder for automatic recognition  Gestures types are not equally hard for automatic recognition  Larger model is stronger  Deep learning is more promising, but more data is needed. 83

NAIST International Collaborative Laboratory for Robotics Vision Identity (Who?) Communication State, Action, ... (What [does he/she want] ? (What [is he/she doing] ? How [does he/she feel] ?) How [does he/she do it] ?) Explicit expression Implicit expression 84

Proposal of a Wrist-mounted Depth Camera for Finger Gesture Recognition Kai Akiyama, Yang Wu Nara Institute of Science and Technology Time-of-Flight Retrieved camera depth images AR/VR controller Daily activity recognition 85

Hand pose estimation - Applications Playing games Driving assistant Surgery assistant etc. 86

Background – Depth-based 3D hand pose estimation benchmark Hands In the Million Challenge (HIM2017) Testing data Training data (957K) Pose Single frame estimator (296K) Tracking Hand detector (295K) + Pose estimator Interaction (2K) (S. Yuan, et al. 2017) 87

Proposed 3D hand pose estimator architecture (1) Output_P 24 Output_R 24 Output_M Output_hand 24 63 3D coordinates of Output_I hand joints 24 Thickened cloud points of hand Output_T 27 Block 1 Block 2 1024 dense Block 3 Block 4 88

Proposed 3D hand pose estimator architecture (2) 24 24 24 63 3D coordinates of hand joints 24 Thickened cloud points of hand 27 89 Block 1 Block 2 1024 dense Block 3 Block 4

Pipeline of Pose estimator Single frame pose estimation Extracting hand Estimate 3D hand pose and Represent data by based by given transform back to original 50x50x50 volume bounding box coordinates Pose Estimato r 90

Qualitative results of 3D hand pose estimator 91

Evaluation on the 3D hand pose estimation task of HIM2017 benchmark 92

Utilizing a hand detector for tracking and interaction task Testing data Single frame Pose estimator Tracking Hand detector + Pose estimator Interaction We need a hand detector to find where is the hand in real application 93

Architecture of the 3D hand pose tracking system Hand detector + Hand verifier + Pose estimator X Success Hand Hand Pose Detector Verifier Estimator Fail Taking pose from previous frame Hand verifier: 1. Comparing with the previous frame, whether the center of detected hand area shift more than 150 mm; 2. Whether the number of pixels for detected hand area is more than 1000. 94

Qualitative results of 3D hand pose tracking Sequential Frames … … … Depth image … … … Hand mask Estimated … … … hand pose and depth image 95

Evaluation on the 3D hand tracking task of HIM2017 benchmark 96

Applying modified tracking system on Hand object interaction X Hand Pose Detector Estimator 97

Qualitative results of 3D hand-object interaction pose estimation Depth image Hand mask Estimated hand pose and depth image 98

Evaluation on the hand object interaction task of HIM2017 benchmark 99

Evaluation results on all tasks of HIM2017 benchmark 100

and more Yang Wu Nara Institute of Science and Technology - PowerPoint PPT Presentation

Understanding humans : identity, communication, state, and more Yang Wu Nara Institute of Science and Technology 1 NAIST International Collaborative Laboratory for Robotics Vision For

Learn more Do more Be more Learn more Do more Be more UNITY Learn more Do

Defect Detection Thomas Zimmermann The First Bug September 9, 1947 More Bugs More Bugs More

Why Transformers Work. More info blablabla More info blablabla More info blablabla More

Housing Choices Creating more housing options, for more people, in more places Affordable

last chance for mail service ? DKIM TFMC2 01/2006 Mail service status More and more spam,

others to dream more, learn more, do more, and become more, you are a leader. John Adams

Introducing more people Introducing more people Introducing more people Introducing more people

More Words = More Words = More Fluency: More Fluency: Reading Reading Interventions

Trends and drivers More and more of our systems are critical for our business We get more

Reading for Pleasure The more that you read, the more things you will know. The more you learn,

Making Utah a better place to ride More Planning More Public Support More Building A Better

Developing Rising Stars If your actions inspire others to dream more, learn more, do more

Welcome to the Year 7 Curriculum Evening The more that you read, the more things you will

Mini Bookfairs in Schools/Universities More than 50 publishers More than 50 publishers More than

no- more PowerPoint Examples: Notes To Slides no- more | Nytorv 3. 1 st floor, 1450 Kbenhavn

Cypher for Gremlin And more... And more... And more... MATCH

Sit in groups of exactly 4 Early feedback ck #1 Number yourselves: 1-4 No screens Say your

One of three programming paradigms We can identify three paradigms: functional programming,

RandomPad: Usability of Randomized Mobile Keypads for Defeating Inference Attacks Saturday 29th

CAVE2 Unity Tutorial CAVE2 unity tutorial on github Omicron Cave example unity scene Cave2

Haptic Rendering of Textures Katherine J. Kuchenbecker and Heather Culbertson Mechanical

Course Projects CPSC 599.86 / 601.86 Sonny Chan University of Calgary Course Project Requirements

Design Based Research: research, capable of producing generalizable results? The tension

Social Seniors Social Media & Digital Storytelling for Seniors Introduction Introduction

and more Yang Wu Nara Institute of Science and Technology - PowerPoint PPT Presentation

Understanding humans : identity, communication, state, and more Yang Wu Nara Institute of Science and Technology 1 NAIST International Collaborative Laboratory for Robotics Vision For

Learn more Do more Be more Learn more Do more Be more UNITY Learn more Do

Defect Detection Thomas Zimmermann The First Bug September 9, 1947 More Bugs More Bugs More

Why Transformers Work. *More info blablabla *More info blablabla *More info blablabla *More

Housing Choices Creating more housing options, for more people, in more places Affordable

last chance for mail service ? DKIM TFMC2 01/2006 Mail service status More and more spam,

others to dream more, learn more, do more, and become more, you are a leader. John Adams

Introducing more people Introducing more people Introducing more people Introducing more people

More Words = More Words = More Fluency: More Fluency: Reading Reading Interventions

Trends and drivers More and more of our systems are critical for our business We get more

Reading for Pleasure The more that you read, the more things you will know. The more you learn,

Making Utah a better place to ride More Planning More Public Support More Building A Better

Developing Rising Stars If your actions inspire others to dream more, learn more, do more

Welcome to the Year 7 Curriculum Evening The more that you read, the more things you will

Mini Bookfairs in Schools/Universities More than 50 publishers More than 50 publishers More than

no- more PowerPoint Examples: Notes To Slides no- more | Nytorv 3. 1 st floor, 1450 Kbenhavn

Cypher for Gremlin And more... And more... And more... MATCH

Sit in groups of exactly 4 Early feedback ck #1 Number yourselves: 1-4 No screens Say your

One of three programming paradigms We can identify three paradigms: functional programming,

RandomPad: Usability of Randomized Mobile Keypads for Defeating Inference Attacks Saturday 29th

CAVE2 Unity Tutorial CAVE2 unity tutorial on github Omicron Cave example unity scene Cave2

Haptic Rendering of Textures Katherine J. Kuchenbecker and Heather Culbertson Mechanical

Course Projects CPSC 599.86 / 601.86 Sonny Chan University of Calgary Course Project Requirements

Design Based Research: research, capable of producing generalizable results? The tension

Social Seniors Social Media &amp; Digital Storytelling for Seniors Introduction Introduction

Why Transformers Work. More info blablabla More info blablabla More info blablabla More

Social Seniors Social Media & Digital Storytelling for Seniors Introduction Introduction