Towards the subjective-ness in facial expression analysis
Jiabei Zeng, Ph.D. August 21, 2019 @ VALSE Webinar
Recognizing facial expressions is subjective for human beings: different individuals have different understandings of the same facial expression.
The six basic emotions: universal across cultures. Example: "He was about to fight." → angry
The six basic emotions: universal across cultures. Example: "His child had just died." → sad
Universal ≠ 100% consistent. Reference: Elfenbein H A, Ambady N. On the universality and cultural specificity of emotion recognition: a meta-analysis. Psychological Bulletin, 2002, 128(2): 203.
Humans' annotations are subjective. How do we make the machines objective? The subjective-ness of humans propagates to the subjective-ness of the machines: the training dataset has annotation bias, so the trained system has recognition bias.
Humans' annotations are subjective. How do we make the machines objective? "兼听则明，偏信则暗" (listen to many sides and you will be enlightened; heed only one side and you will be misled): first, learn the classifier from multiple datasets instead of only one; second, describe facial expressions in a more objective way, using the Facial Action Coding System (FACS).
Challenge. How do we evaluate the machine? By a consistent performance boost on diverse test datasets. How do we train the machine? More data obtained by merging multiple training datasets does not guarantee a better trained system: a model trained on the merged set A+R can perform worse than one trained on R alone (A+R < R) and worse than one trained on A alone (A+R < A), where A and R denote the AffectNet and RAF training sets.
Learn from datasets with annotation biases: the Inconsistent Pseudo Annotations to Latent Truth (IPA2LT) framework. Inputs: multiple datasets with inconsistent annotations (Data A with human labels labelA, e.g., happy/disgust; Data B with human labels labelB, e.g., sad/fear) plus unlabeled data. Step 1: train machine coders, one model per annotated dataset (Model A on Data A, Model B on Data B).
Step 2: predict pseudo labels. Each machine coder annotates the images it was not trained on: Model B predicts pseudo labels predB for Data A, Model A predicts predA for Data B, and both models predict pseudo labels for the unlabeled Data U. Every image thus carries multiple, possibly inconsistent, annotations (e.g., labelA: disgust vs. predB: angry).
Step 3: train the Latent Truth Net (LTNet). From the multiple inconsistent annotations on each image (human labels and pseudo labels), LTNet estimates the latent truth (LT), e.g., an image with labelA: disgust and predB: angry may be assigned the latent truth angry.
Learn from datasets with annotation biases: the IPA2LT framework in three steps. Step 1: train machine coders. Step 2: predict pseudo labels. Step 3: train the Latent Truth Net.
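A minimal sketch of Steps 1 and 2 in PyTorch. The helper names (train_coder, predict_pseudo_labels), loaders, and hyperparameters are placeholders of this sketch, not the authors' released IPA2LT code:

```python
import torch

def train_coder(model, loader, epochs=10, lr=1e-4):
    """Step 1: train one machine coder on one annotated dataset."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = ce(model(images), labels)
            loss.backward()
            opt.step()
    return model

@torch.no_grad()
def predict_pseudo_labels(model, loader):
    """Step 2: use a trained coder to predict pseudo labels for other images.
    The loader should iterate in a fixed order (shuffle=False)."""
    model.eval()
    preds = []
    for images, _ in loader:
        preds.append(model(images).argmax(dim=1))
    return torch.cat(preds)

# Usage sketch (model_A/model_B are CNN classifiers; loaders yield (image, label)
# pairs, with dummy labels for the unlabeled Data U):
# model_A = train_coder(model_A, loader_A)                 # coder for Data A
# model_B = train_coder(model_B, loader_B)                 # coder for Data B
# pred_B_on_A = predict_pseudo_labels(model_B, loader_A)   # Model B labels Data A
# pred_A_on_B = predict_pseudo_labels(model_A, loader_B)   # Model A labels Data B
# pred_A_on_U = predict_pseudo_labels(model_A, loader_U)   # both label Data U
# pred_B_on_U = predict_pseudo_labels(model_B, loader_U)
```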
Conventional architecture vs. Latent Truth Net. In the conventional architecture, p is the predicted probability of each facial expression and y is the single ground-truth label.
Conventional architecture vs. Latent Truth Net. LTNet learns from samples with inconsistent annotations: it outputs the latent truth probability, and a transition matrix for each coder maps the latent truth to the predicted annotation for that coder.
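The core idea can be sketched as follows: a shared backbone predicts the latent truth distribution p, each coder c has a learnable transition matrix T^c mapping the latent truth to that coder's annotation distribution, and the loss is the cross-entropy between each coder's predicted and observed annotations. Class and variable names, the identity-based initialisation, and the masking scheme are assumptions of this sketch, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LTNetHead(nn.Module):
    """Latent-truth head: one transition matrix per coder on top of a backbone."""
    def __init__(self, feat_dim, num_classes, num_coders):
        super().__init__()
        self.latent = nn.Linear(feat_dim, num_classes)   # latent truth logits
        # One (num_classes x num_classes) transition matrix per coder,
        # initialised near the identity so every coder starts out "reliable".
        eye = torch.eye(num_classes).unsqueeze(0).repeat(num_coders, 1, 1)
        self.trans = nn.Parameter(eye * 4.0)

    def forward(self, feats):
        p = F.softmax(self.latent(feats), dim=1)   # latent truth probability
        T = F.softmax(self.trans, dim=2)           # row-stochastic T^c
        # Predicted annotation distribution for each coder: q_c = p @ T^c
        q = torch.einsum('bi,cij->bcj', p, T)
        return p, q

def ltnet_loss(q, annotations, mask):
    """Sum of cross-entropies over coders.
    annotations: (B, num_coders) class indices (fill unannotated entries with 0);
    mask: (B, num_coders), 1 where coder c actually annotated image b."""
    logq = torch.log(q.clamp_min(1e-8))
    nll = F.nll_loss(logq.permute(0, 2, 1), annotations, reduction='none')  # (B, C)
    return (nll * mask).sum() / mask.sum().clamp_min(1.0)
```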
Experiments on synthetic data. Setup: make 3 copies of the CIFAR-10 training set and randomly add 20%, 30%, and 40% label noise, respectively; evaluate the methods on the clean CIFAR-10 test set. Finding: LTNet can reveal the true labels (figure comparing the LTNet-learned latent truth with the ground truth).
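One way to reproduce this synthetic setup, assuming uniform symmetric label noise (the slide does not specify the noise protocol):

```python
import numpy as np

def add_label_noise(labels, noise_rate, num_classes=10, seed=0):
    """Replace a `noise_rate` fraction of labels with a different random class."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    idx = rng.choice(len(noisy), size=int(noise_rate * len(noisy)), replace=False)
    offsets = rng.integers(1, num_classes, size=len(idx))  # never the original class
    noisy[idx] = (noisy[idx] + offsets) % num_classes
    return noisy

# Three inconsistent "coders" over the same CIFAR-10 training images:
# labels_a = add_label_noise(clean_labels, 0.20, seed=1)
# labels_b = add_label_noise(clean_labels, 0.30, seed=2)
# labels_c = add_label_noise(clean_labels, 0.40, seed=3)
```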
Experiments on synthetic data. Evaluation: LTNet performs comparably to the CNN trained on clean data. (Table: test accuracy of different methods.)
Experiments on FER datasets.
Training data: Dataset A: AffectNet (training part); Dataset B: RAF (training part).
Unlabeled data: the un-annotated part of AffectNet (~700,000 images) and unlabeled facial images downloaded from Bing (~500,000 images).
Test data: in-the-lab: CK+, MMI, CFEE, Oulu-CASIA; in-the-wild: SFEW, AffectNet (validation part), RAF (test part).
Experiments on FER datasets. Evaluation: Table 1. Test accuracy of different methods (Bold: best, Underline: second best).
Experiments on FER datasets. LTNet-learned transition matrices T for the 4 coders: machine coder (AffectNet-trained model), machine coder (RAF-trained model), human coder (AffectNet), human coder (RAF). The human coder of RAF is the most reliable, which is expected since the labels in RAF are derived from ~40 annotations per image.
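A learned transition matrix can be read off directly: the more mass it keeps on the diagonal, the more often that coder's annotation matches the latent truth. The summary statistic below is only a reading aid of this sketch, not a metric reported in the paper:

```python
import numpy as np

def coder_reliability(T):
    """Mean diagonal mass of a row-stochastic transition matrix T."""
    T = np.asarray(T, dtype=float)
    return float(np.mean(np.diag(T)))

# Hypothetical 3-class illustration: a more diagonal matrix = a more reliable coder.
# T_example = np.array([[0.90, 0.05, 0.05],
#                       [0.10, 0.80, 0.10],
#                       [0.05, 0.10, 0.85]])
# coder_reliability(T_example)  # -> 0.85
```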
Experiments on FER datasets. Statistics of the samples: for the majority of the samples, the latent truth agrees with the human annotation, the model prediction, or both; for a few samples, the latent truth differs from both the human annotation and the model prediction (cases 2 and 3).
Experiments on FER datasets. Examples in the 5 cases show that the LTNet-learned latent truth is reasonable. Notation: H: human annotation; G: LTNet-learned latent truth; A: predictions by the AffectNet-trained model; R: predictions by the RAF-trained model.
Humans' annotations are subjective. How do we make the machines objective? "兼听则明，偏信则暗" (listen to many sides and you will be enlightened; heed only one side and you will be misled): first, learn the classifier from multiple datasets instead of only one; second, describe facial expressions in a more objective way, using the Facial Action Coding System (FACS).
From subjective-ness to objective-ness: from emotional categories to the Facial Action Coding System.
Facial Action Coding System (FACS): taxonomizes facial muscle movements by their appearance, using human-defined facial action units (AUs). Examples: AU1: inner brow raiser; AU2: outer brow raiser; AU4: brow lowerer; AU5: upper lid raiser; AU7: lid tightener. (Pictures are from "Facial Action Coding System, Manual" by P. Ekman, W. V. Friesen, J. C. Hager.)
What did we usually do? Supervised learning on manually annotated data (BP4D, DISFA) with CNNs such as AlexNet and VGGNet, up to the state of the art (e.g., JAA-Net, ECCV'18).
Can we learn from unlabeled videos? Learn from the changes! Facial actions appear as local changes of the face between frames, and such changes are easy to detect without manual annotations.
Can we learn from unlabeled videos? The change of the face between frames is composed of the change of facial actions and the change of head poses.
Can we learn from unlabeled videos? Supervisory task: change the facial actions or the head pose of the source frame to those of the target frame by predicting the related movements separately.
Self-supervised learning from videos. Given a source image and a target image, the AU feature re-generates the facial action changes and the pose feature re-generates the pose changes; applying both changes to the source approximately reproduces the target.
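A rough sketch of this two-branch idea: an encoder looks at the (source, target) pair and predicts one displacement field for facial-action changes and one for head-pose changes, which can then warp the source image. Layer sizes, names, and the use of dense displacement fields with grid_sample warping are illustrative assumptions, not the released TCAE implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEncoder(nn.Module):
    """Predict AU-related and pose-related displacement fields from an image pair."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Two heads, each a per-pixel (dx, dy) displacement field.
        self.au_head = nn.Conv2d(feat_ch, 2, 3, padding=1)
        self.pose_head = nn.Conv2d(feat_ch, 2, 3, padding=1)

    def forward(self, source, target):
        h = self.backbone(torch.cat([source, target], dim=1))
        au_disp = self.au_head(h)      # local, subtle facial-action motion
        pose_disp = self.pose_head(h)  # global head-pose motion
        size = source.shape[-2:]       # back to image resolution
        au_disp = F.interpolate(au_disp, size=size, mode='bilinear', align_corners=False)
        pose_disp = F.interpolate(pose_disp, size=size, mode='bilinear', align_corners=False)
        return au_disp, pose_disp

def warp(image, disp):
    """Warp `image` with a dense displacement field via grid_sample."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1).to(image)
    grid = grid + disp.permute(0, 2, 3, 1)
    return F.grid_sample(image, grid, align_corners=False)
```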
TwinCycle AutoEncoder (TCAE): feature disentanglement. From the target and source frames, the encoder predicts AU-related displacements, which are constrained to be sparse (facial actions are local) and small in value (facial actions are subtle).
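The "sparse and small" constraint on the AU-related displacements can be encouraged with a simple penalty; an L1 term with an assumed weight is one straightforward choice, not necessarily the paper's exact regulariser:

```python
import torch

def au_displacement_penalty(au_disp, weight=0.01):
    """Encourage AU-related displacements to be sparse (local) and small (subtle)."""
    return weight * au_disp.abs().mean()
```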
TwinCycle AutoEncoder: the cycle with the AU changed, supervised by a pixel consistency loss between the generated face and the corresponding real frame.
TwinCycle AutoEncoder: the cycle with the pose changed, also supervised by a pixel consistency loss.
TwinCycle AutoEncoder: target reconstruction. Applying both the AU-related and pose-related changes to the source reconstructs the target, again supervised by a pixel consistency loss.
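Each of the three branches (AU-changed cycle, pose-changed cycle, target reconstruction) is supervised by a pixel consistency loss. A hedged sketch reusing the `warp` and `au_displacement_penalty` helpers from the earlier sketches; the exact pairing of generated and reference frames, the loss weights, and the choice of L1 are assumptions, not the paper's formulation:

```python
import torch.nn.functional as F

def pixel_consistency(generated, reference):
    """Pixel consistency loss between a generated face and its reference frame."""
    return F.l1_loss(generated, reference)

# One plausible combination of the three objectives named on the slides:
# au_changed    = warp(source, au_disp)               # cycle with AU changed
# pose_changed  = warp(source, pose_disp)             # cycle with pose changed
# reconstructed = warp(source, au_disp + pose_disp)   # target reconstruction
# loss = (pixel_consistency(reconstructed, target)
#         + pixel_consistency(warp(au_changed, pose_disp), target)
#         + pixel_consistency(warp(pose_changed, au_disp), target)
#         + au_displacement_penalty(au_disp))
```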