

  1. Towards the subjective-ness in facial expression analysis Jiabei Zeng, Ph.D. August 21, 2019 @ VALSE Webinar

  2. It is subjective for human beings to recognize facial expressions: different individuals understand the same facial expression differently.

  3. The six basic emotions: universal across cultures. Example image: “He was about to fight.” → angry

  4. The six basic emotions: universal across cultures. Example image: “His child had just died.” → sad

  5. Universal ≠ 100% consistent. Elfenbein H. A., Ambady N. On the universality and cultural specificity of emotion recognition: a meta-analysis. Psychological Bulletin, 2002, 128(2): 203.

  6. Humans’ annotations are subjective. How do we make the machines objective? Subjective-ness of humans: the training dataset has annotation bias. Subjective-ness of the machines: the trained system has recognition bias.

  7. Humans’ annotations are subjective. How do we make the machines objective? “兼听则明，偏信则暗” (“Listen to many sides and you will be enlightened; heed only one side and you will be kept in the dark”): learn the classifier from multiple datasets instead of only one. And describe facial expressions in a more objective way: the Facial Action Coding System (FACS).

  8. (The same outline, repeated.) Learn the classifier from multiple datasets instead of only one; describe facial expressions more objectively with FACS.

  9. Challenge. How to evaluate the machine? A consistent performance boost on diverse test datasets. How to train the machine? More data from merging multiple training datasets ≠ better performance of the trained system: training on AffectNet+RAF performs worse than training on RAF alone (A+R &lt; R) and worse than training on AffectNet alone (A+R &lt; A).

  10. Learn from datasets with annotation biases: the Inconsistent Pseudo Annotations to Latent Truth (IPA2LT) framework uses multiple inconsistent annotations plus unlabeled data. Step 1: train machine coders. Dataset A (labelA: happy, disgust, …) trains Model A, and Dataset B (labelB: sad, fear, …) trains Model B.

  11. Learn from datasets with annotation biases. Step 2: predict pseudo labels, so every sample carries an annotation from each coder. Samples in Data A keep their human label and get a prediction from Model B, e.g. (labelA: happy, predB: happy) or (labelA: disgust, predB: angry); samples in Data B keep labelB and get predA, e.g. (labelB: sad, predA: sad) or (labelB: fear, predA: angry); unlabeled samples in Data U get both predA and predB, e.g. (predA: sad, predB: disgust).

  12. Learn from datasets with annotation biases. Step 3: train the Latent Truth Net to estimate the latent truth (LT) behind the inconsistent pseudo annotations, e.g. (labelA: disgust, predB: angry) → LT: disgust; (labelB: fear, predA: angry) → LT: angry; (predA: sad, predB: disgust) → LT: sad.

  13. Learn from datasets with annotation biases. The full Inconsistent Pseudo Annotations to Latent Truth framework, given multiple inconsistent annotations and unlabeled data: Step 1: train machine coders; Step 2: predict pseudo labels; Step 3: train the Latent Truth Net.
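To make the three steps concrete, here is a minimal Python sketch of the pipeline. The helper names (`train_classifier`, `train_ltnet`, `model.predict`) and the data layout are illustrative assumptions, not the authors' code:

```python
# Hypothetical sketch of the IPA2LT three-step pipeline (names are illustrative).

def ipa2lt_pipeline(data_a, data_b, data_u, train_classifier, train_ltnet):
    # Step 1: train one "machine coder" per annotated dataset.
    model_a = train_classifier(data_a)   # absorbs Dataset A's annotation bias
    model_b = train_classifier(data_b)   # absorbs Dataset B's annotation bias

    # Step 2: predict pseudo labels, so every sample carries one
    # annotation per coder (human or machine).
    samples = []
    for img, label_a in data_a:
        samples.append((img, {"A": label_a, "B": model_b.predict(img)}))
    for img, label_b in data_b:
        samples.append((img, {"A": model_a.predict(img), "B": label_b}))
    for img in data_u:                    # the unlabeled pool (Data U)
        samples.append((img, {"A": model_a.predict(img), "B": model_b.predict(img)}))

    # Step 3: train LTNet on the inconsistent annotations to
    # estimate the latent truth behind them.
    return train_ltnet(samples)
```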

  14. Conventional architecture vs. Latent Truth Net. In the conventional architecture, p is the predicted probability of each facial expression and y is the ground-truth label.
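The slide does not write out the objective, but for this conventional setup the standard choice is the cross-entropy between the prediction p and the one-hot label y (assuming C expression classes):

```latex
\mathcal{L}_{\mathrm{CE}} \;=\; -\sum_{c=1}^{C} y_c \log p_c
```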

  15. Conventional architecture vs. Latent Truth Net. LTNet learns from samples with inconsistent annotations: it keeps a latent truth probability, and the predicted annotation for each coder is obtained by applying that coder's transition matrix to the latent truth.
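A minimal PyTorch sketch of that structure, assuming C expression classes and one C×C transition matrix per coder; layer names, sizes, and the initialization are my assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LTNetHead(nn.Module):
    """Latent-truth head: one transition matrix per coder on top of a backbone."""

    def __init__(self, feat_dim, num_classes, num_coders):
        super().__init__()
        self.latent = nn.Linear(feat_dim, num_classes)
        # Per-coder logits of a C x C transition matrix, initialized near the
        # identity so every coder starts out trusted (assumed initialization).
        init = torch.eye(num_classes).repeat(num_coders, 1, 1) * 5.0
        self.transition_logits = nn.Parameter(init)

    def forward(self, features):
        p_latent = self.latent(features).softmax(dim=-1)     # (B, C) latent truth
        T = self.transition_logits.softmax(dim=-1)           # (K, C, C), rows sum to 1
        # Coder k's predicted annotation: p_k[d] = sum_c p_latent[c] * T[k, c, d]
        p_coders = torch.einsum("bc,kcd->bkd", p_latent, T)  # (B, K, C)
        return p_latent, p_coders
```

Training would then minimize the negative log-likelihood of each coder's observed annotation under its slice of `p_coders`, while `p_latent` is what you keep at test time.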

  16. Experiments on synthetic data. Setup: make 3 copies of the CIFAR-10 training set and randomly add 20%, 30%, and 40% label noise, respectively; evaluate the methods on the clean CIFAR-10 test set. Finding: LTNet can reveal the true labels (figure: LTNet-learned latent truth vs. ground truth).
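A sketch of that corruption protocol; the noise rates and the 3-copy setup come from the slide, everything else (array layout, seeding, flip scheme) is illustrative:

```python
import numpy as np

def corrupt_labels(labels, noise_rate, num_classes=10, seed=0):
    """Return a copy of `labels` with `noise_rate` of entries replaced
    by a random *different* class (synthetic annotation bias)."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    # Offset 1..num_classes-1 guarantees the new label differs from the old.
    offsets = rng.integers(1, num_classes, size=flip.sum())
    noisy[flip] = (noisy[flip] + offsets) % num_classes
    return noisy

# Three inconsistent "coders" over the same CIFAR-10 training labels:
# labels = ...  (the clean CIFAR-10 training labels)
# coders = [corrupt_labels(labels, r, seed=i) for i, r in enumerate([0.2, 0.3, 0.4])]
```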

  17. Experiments on synthetic data. Evaluations on synthetic data: LTNet's test accuracy is comparable to that of the CNN trained on clean data (figure: test accuracy of different methods).

  18. Experiments on FER datasets. Training data: Dataset A: AffectNet (training part); Dataset B: RAF (training part). Unlabeled data: the un-annotated part of AffectNet (~700,000 images) and unlabeled facial images downloaded from Bing (~500,000 images). Test data: in-the-lab: CK+, MMI, CFEE, Oulu-CASIA; in-the-wild: SFEW, AffectNet (validation part), RAF (test part).

  19. Experiments on FER datasets. Evaluation on the FER datasets: Table 1 reports the test accuracy of different methods (bold: best, underline: second best).

  20. Experiments on FER datasets. LTNet-learned transition matrices T for the 4 coders: the human coder (AffectNet), the human coder (RAF), the machine coder (AffectNet-trained model), and the machine coder (RAF-trained model). The human coder (RAF) is the most reliable, as labels in RAF are derived from ~40 annotations per image.
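One way to read coder reliability off a learned transition matrix (an illustrative convention, not the paper's metric): a reliable coder's matrix is close to the identity, so its mean diagonal mass is high.

```python
import numpy as np

def coder_reliability(T):
    """Average probability that a coder reports the latent-truth class itself,
    i.e. the mean of the transition matrix's diagonal."""
    return float(np.mean(np.diag(T)))
```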

  21. Experiments on FER datasets. Statistics of the samples: for the majority of samples, the latent truth agrees with the human annotation, the model prediction, or both; for a few samples, the latent truth differs from both the human annotation and the model prediction (cases 2 and 3).

  22. Experiments on FER datasets. Examples in the 5 cases show that the LTNet-learned latent truth is reasonable. Legend: H: human annotation; G: LTNet-learned latent truth; A: predictions by the AffectNet-trained model; R: predictions by the RAF-trained model.

  23. Humans’ annotations are subjective. How do we make the machines objective? Having learned the classifier from multiple datasets (“兼听则明，偏信则暗”), we now turn to describing facial expressions in a more objective way: the Facial Action Coding System (FACS).

  24. From subjective-ness to objective-ness: from emotional categories to the Facial Action Coding System.

  25. Facial Action Coding System (FACS): taxonomizes facial muscle movements by their appearance, using human-defined facial action units (AUs). AU1: inner brow raiser; AU2: outer brow raiser; AU4: brow lowerer; AU5: upper lid raiser; AU7: lid tightener. (Pictures are from the “Facial Action Coding System, Manual” by P. Ekman, W. V. Friesen, and J. C. Hager.)
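For quick reference, the AUs named on the slide as a small lookup table (only the five shown; FACS defines many more):

```python
# The five AUs listed on the slide (not the full FACS inventory).
ACTION_UNITS = {
    1: "Inner brow raiser",
    2: "Outer brow raiser",
    4: "Brow lowerer",
    5: "Upper lid raiser",
    7: "Lid tightener",
}
```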

  26. What did we usually do? Manually annotated data (BP4D, DISFA) + supervised learning with deep networks (AlexNet, VGGNet) → state of the art (e.g., JAA-Net, ECCV’18).
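AU detection is conventionally cast as multi-label binary classification (one sigmoid per AU). A minimal sketch of that supervised baseline; the backbone stand-in, sizes, and names are my assumptions, not any of the cited models:

```python
import torch
import torch.nn as nn

NUM_AUS = 12                        # assumed; depends on the dataset's AU set
backbone = nn.Sequential(           # toy stand-in for AlexNet/VGG-style features
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, NUM_AUS),
)
criterion = nn.BCEWithLogitsLoss()  # independent binary target per AU

def training_step(images, au_labels, optimizer):
    """images: (B, 3, 224, 224); au_labels: (B, NUM_AUS) floats in {0, 1}."""
    logits = backbone(images)
    loss = criterion(logits, au_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```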

  27. Can we learn from the unlabeled videos? Learn from the changes! Facial actions appear as local changes of the face between frames, and such changes can be detected without manual annotations.

  28. Can we learn from the unlabeled videos? The change of the face between two frames decomposes into the change of facial actions and the change of head pose.
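Written as per-pixel displacement fields (my notation, and an assumed additive split; the slide only shows the decomposition schematically):

```latex
\Delta_{\mathrm{face}}(x, y) \;=\; \Delta_{\mathrm{AU}}(x, y) \;+\; \Delta_{\mathrm{pose}}(x, y)
```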

  29. Can we learn from the unlabeled videos? Supervisory task: change the facial actions or the head pose of the source frame to those of the target frame by predicting the respective movements.

  30. Self-supervised learning from videos: sample a source image and a target image (two frames of the same video).

  31. Self-supervised learning from videos: an AU feature captures the facial action changes between source and target, and is used to re-generate them; the result should approximate the target's facial actions (≈).

  32. Self-supervised learning from videos: likewise, a pose feature captures the head pose changes and is used to re-generate them (≈).

  33. Self-supervised learning from videos: the AU changes and the pose changes are re-generated jointly, and each re-generation should approximate the target image (≈).

  34. Twin-Cycle AutoEncoder (TCAE): feature disentanglement between source and target. The AU-related displacements are sparse (facial actions are local) and small in value (facial actions are subtle).

  35. Twin-Cycle AutoEncoder: the cycle with the AUs changed (source → AU-changed target → back to source), trained with a pixel consistency loss.

  36. Twin-Cycle AutoEncoder: the cycle with the pose changed (source → pose-changed target → back to source), also trained with a pixel consistency loss.

  37. Twin-Cycle AutoEncoder: target reconstruction; the target is re-generated from the source using both the AU and pose movements, again with a pixel consistency loss.
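Putting slides 34-37 together as a hedged sketch: an encoder predicts AU and pose displacement fields between two frames; warping the source with both should reconstruct the target, the twin cycles enforce pixel consistency, and the AU displacements are regularized to stay sparse and small. The function names, the additive warp, the negated-flow inverse, and the L1 losses are all my assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp `image` (B,3,H,W) by a displacement field `flow` (B,2,H,W),
    with displacements expressed in normalized [-1, 1] coordinates."""
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=image.device),
        torch.linspace(-1, 1, W, device=image.device),
        indexing="ij")
    base = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)
    grid = base + flow.permute(0, 2, 3, 1)
    return F.grid_sample(image, grid, align_corners=True)

def tcae_losses(encoder, source, target, lam_sparse=0.1):
    """Illustrative self-supervised losses for one source/target frame pair."""
    au_flow, pose_flow = encoder(source, target)        # each (B,2,H,W)

    # Target reconstruction: apply both movements to the source.
    loss_recon = F.l1_loss(warp(source, au_flow + pose_flow), target)

    # Twin cycles: changing only the AUs (or only the pose) and undoing the
    # same displacement should return (approximately) to the source.
    loss_cycle = (
        F.l1_loss(warp(warp(source, au_flow), -au_flow), source)
        + F.l1_loss(warp(warp(source, pose_flow), -pose_flow), source))

    # AU displacements are local and subtle: penalize their magnitude.
    loss_sparse = au_flow.abs().mean()

    return loss_recon + loss_cycle + lam_sparse * loss_sparse
```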
