Model Architectures and Training Techniques for High-Precision Landmark Localization



  1. Model Architectures and Training Techniques for High-Precision Landmark Localization Sina Honari Pavlo Molchanov Jason Yosinski Stephen Tyree Pascal Vincent Jan Kautz Christopher Pal

  2. KEYPOINT DETECTION / LANDMARK LOCALIZATION: the problem of localizing important points on images. Keypoints for a human face can be: left/right eye, nose, left/right mouth corners. Applications include face alignment/rectification, emotion recognition, head pose estimation, and person identification.

  3. MOTIVATION: Conventional ConvNets alternate convolutional and max-pooling layers. Max-pooled features lose precise spatial information but are robust; networks of only convolutional layers keep spatial information but produce many false positives. Can we take robust pooled features and keep positional information?

  4. SUMMATION-BASED NETWORKS (SUMNETS): sums features of different granularity (FCN [1], Hypercolumns [2]). Coarse-to-fine branches; C = convolution, P = pooling, U = upsampling; a branch is a horizontal sequence of C and U layers. [1] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. [2] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
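A minimal PyTorch-style sketch of the SumNet idea (layer counts, channel sizes, and names are illustrative assumptions, not the paper's exact architecture): each branch produces per-keypoint maps at its own resolution, the coarser maps are upsampled, and the pre-softmax maps of all branches are summed before a spatial softmax.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SumNetSketch(nn.Module):
    """Toy SumNet: branch outputs of different granularity are upsampled and summed."""
    def __init__(self, n_keypoints=5):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)   # finest branch features
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)  # after one pooling (P)
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)  # after two poolings (P)
        # Each branch ends with a 1x1 conv producing one heatmap per keypoint.
        self.head1 = nn.Conv2d(16, n_keypoints, 1)
        self.head2 = nn.Conv2d(32, n_keypoints, 1)
        self.head3 = nn.Conv2d(64, n_keypoints, 1)

    def forward(self, x):                              # x: (batch, 3, H, W), H and W divisible by 4
        f1 = F.relu(self.conv1(x))                     # full resolution
        f2 = F.relu(self.conv2(F.max_pool2d(f1, 2)))   # 1/2 resolution
        f3 = F.relu(self.conv3(F.max_pool2d(f2, 2)))   # 1/4 resolution
        size = x.shape[-2:]
        # Upsample (U) every branch's pre-softmax maps to the input size and sum them.
        h = (self.head1(f1)
             + F.interpolate(self.head2(f2), size=size, mode='bilinear', align_corners=False)
             + F.interpolate(self.head3(f3), size=size, mode='bilinear', align_corners=False))
        b, k, H, W = h.shape
        # Softmax over all spatial locations, independently for each keypoint.
        return F.softmax(h.view(b, k, -1), dim=-1).view(b, k, H, W)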

  5-6. SUMMATION-BASED NETWORKS (SUMNETS): [Figure: pre-softmax activations and softmax probabilities of each branch, coarse to fine, and of the sum of the branches.]

  7. Montreal Institute for Learning Algorithms. Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation (CVPR 2016). Sina Honari, Jason Yosinski, Pascal Vincent, Christopher Pal

  8. Recombinator Networks (RCNs): the model feeds coarse features into finer layers early in their computation. Coarse-to-fine branches; C = convolution, P = pooling, U = upsampling, K = concatenation; a branch is a horizontal sequence of C and U layers.
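A minimal sketch of the Recombinator idea (PyTorch-style; channel sizes and depth are assumptions): instead of summing branch outputs at the end, each coarser branch is upsampled and concatenated with the next finer branch before that branch's convolutions, so finer layers can condition on coarse context early.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RCNSketch(nn.Module):
    """Toy Recombinator: coarse features are upsampled (U) and concatenated (K)
    into the next finer branch before that branch's convolutions."""
    def __init__(self, n_keypoints=5):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 16, 3, padding=1)
        self.enc2 = nn.Conv2d(16, 32, 3, padding=1)
        self.coarse = nn.Conv2d(32, 32, 3, padding=1)              # coarsest branch
        self.mid = nn.Conv2d(32 + 32, 32, 3, padding=1)            # upsampled coarse + skip (K)
        self.fine = nn.Conv2d(32 + 16, n_keypoints, 3, padding=1)  # upsampled mid + skip (K)

    def forward(self, x):                              # x: (batch, 3, H, W), H and W divisible by 4
        f1 = F.relu(self.enc1(x))                      # full resolution, 16 channels
        f2 = F.relu(self.enc2(F.max_pool2d(f1, 2)))    # 1/2 resolution, 32 channels
        c = F.relu(self.coarse(F.max_pool2d(f2, 2)))   # 1/4 resolution, 32 channels
        # Finer branches receive the coarser branch's features early in their computation.
        m = F.relu(self.mid(torch.cat([F.interpolate(c, scale_factor=2.0), f2], dim=1)))
        h = self.fine(torch.cat([F.interpolate(m, scale_factor=2.0), f1], dim=1))
        b, k, H, W = h.shape
        return F.softmax(h.view(b, k, -1), dim=-1).view(b, k, H, W)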

  9. Recombinator Networks (RCNs) A Convolutional Encoder-Decoder Network with Skip Connections

  10. SumNets vs. RCNs: [Figure: side-by-side architecture diagrams of Summation-Based Networks (SumNets) and Recombinator Networks (RCNs).]

  11. SumNets vs. RCNs: [Figure: pre-softmax activations of SumNet and RCN.]

  12. SumNets vs. RCNs: [Figure: pre-softmax activations and softmax probabilities for SumNet and for RCN.]

  13. Prediction Samples: [Figure: sample predictions from TCDCN, SumNet, and RCN. Red: model prediction; green: ground-truth keypoints.]

  14. Comparison with SOTA: landmark estimation error (as a percentage; lower is better) on the 300W and MTFL datasets. Error = Euclidean distance between ground-truth and predicted keypoints, normalized by the inter-ocular distance.
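A small helper for the error metric used throughout (NumPy; the eye indices and function name are illustrative, they depend on the dataset's keypoint ordering):

import numpy as np

def landmark_error(pred, gt, left_eye_idx=0, right_eye_idx=1):
    """Mean Euclidean distance between predicted and ground-truth keypoints,
    normalized by the inter-ocular distance (multiply by 100 to report percent).

    pred, gt: arrays of shape (n_keypoints, 2) holding (x, y) coordinates;
    left_eye_idx / right_eye_idx are whatever indices the dataset uses for the eyes.
    """
    per_point = np.linalg.norm(pred - gt, axis=1)                       # one distance per keypoint
    inter_ocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    return per_point.mean() / inter_ocular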

  15. Improving Landmark Localization with Semi-Supervised Learning (CVPR 2018). Sina Honari, Pavlo Molchanov, Jan Kautz, Stephen Tyree, Pascal Vincent, Christopher Pal

  16. MOTIVATION: manual landmark localization is a tedious task (for building datasets). Labeling the landmarks of an image takes roughly 60 s (very time-consuming), while labeling an attribute such as "smiling" or "looking straight" takes about 1 s (very fast).

  17. MOTIVATION: manual landmark localization is a tedious task (for building datasets); landmark labeling takes roughly 60 s per image, attribute labeling about 1 s. Can we use an attribute (e.g. head pose) to guide landmark localization?

  18. SEMI-SUPERVISED LEARNING 1: using CNNs with sequential multitasking. (1) First predict landmarks; (2) use the predicted landmarks to predict the attribute.

  19. SEMI-SUPERVISED LEARNING 1: using CNNs with sequential multitasking. (1) First predict landmarks; (2) use the predicted landmarks to predict the attribute; (3) pass the gradient from the attribute loss back to the landmark localization network. Forward pass: landmarks help predict the attribute; backward pass: the attribute supervision reaches the landmark network.
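A simplified sketch of the sequential multitasking setup (PyTorch-style; as an assumption to keep the example short, the attribute head here sees only the predicted landmark coordinates): because the attribute is predicted from the landmark predictions, the attribute loss back-propagates through the landmark localization network.

import torch
import torch.nn as nn

class SequentialMultitask(nn.Module):
    def __init__(self, landmark_net, n_keypoints=5, n_attr_classes=3):
        super().__init__()
        self.landmark_net = landmark_net   # any network returning (batch, n_keypoints, 2) coordinates
        # Attribute head stacked *after* the landmarks (sequential, not parallel, multitasking).
        self.attr_head = nn.Sequential(
            nn.Linear(n_keypoints * 2, 64), nn.ReLU(),
            nn.Linear(64, n_attr_classes))

    def forward(self, image):
        landmarks = self.landmark_net(image)                 # step 1: predict landmarks
        attr_logits = self.attr_head(landmarks.flatten(1))   # step 2: predict attribute from landmarks
        # Step 3 happens during training: the attribute loss's gradient flows
        # through attr_head back into landmark_net.
        return landmarks, attr_logits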

  20. SEMI-SUPERVISED LEARNING 1: soft-argmax. To train the entire network end-to-end we use a soft-argmax to predict landmarks. The soft-argmax estimates the location of the scaled center of mass of the heatmap: it is continuous rather than discrete, and it is differentiable.
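A minimal soft-argmax sketch (PyTorch; the temperature beta and the shapes are assumptions): it returns the expected (x, y) position under a spatial softmax, i.e. a scaled center of mass, which is continuous and differentiable unlike a hard argmax.

import torch
import torch.nn.functional as F

def soft_argmax(heatmaps, beta=10.0):
    """heatmaps: (batch, n_keypoints, H, W) pre-softmax activations.
    Returns (batch, n_keypoints, 2) expected (x, y) positions in pixels."""
    b, k, H, W = heatmaps.shape
    # Spatial softmax; beta sharpens the distribution around its mode.
    probs = F.softmax(beta * heatmaps.view(b, k, -1), dim=-1).view(b, k, H, W)
    xs = torch.arange(W, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(H, dtype=probs.dtype, device=probs.device)
    # Expected coordinate = probability-weighted average of pixel indices (center of mass).
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # marginalize over rows, weight column indices
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # marginalize over columns, weight row indices
    return torch.stack([x, y], dim=-1)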

  21-22. SEMI-SUPERVISED LEARNING 2: Equivariant Landmark Transformation (ELT). Ask the model to make equivariant predictions with respect to image transformations: predict landmarks L on an image I; apply a transformation T to image I; predict landmarks L′ on the transformed image I′; apply the transformation T to the landmarks L; compare L′ with T ⊙ L; and back-propagate the gradient of the ELT loss.
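A sketch of the ELT loss under stated assumptions: model returns landmark coordinates, and transform_image / transform_points are hypothetical helpers that apply the same spatial transformation T to the image and to 2D points. Since no ground truth is required, this term can be computed on unlabeled images.

import torch

def elt_loss(model, image, transform_image, transform_points):
    """Equivariant Landmark Transformation loss (illustrative sketch)."""
    landmarks = model(image)                        # L  = predictions on I
    landmarks_t = model(transform_image(image))     # L' = predictions on I' = T(I)
    # Equivariance: L' should match the transformation T applied to L.
    return ((landmarks_t - transform_points(landmarks)) ** 2).mean()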

  23. LEARNING LANDMARKS: making use of all the data. The training loss combines a loss from ground-truth landmarks, a loss from attributes (A) using sequential multitasking, and a loss from the Equivariant Landmark Transformation (ELT); here S << M <= N (S: images with landmark labels, M: images with attribute labels, N: all images).
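One way the three terms could be combined during training (a sketch only; landmark_loss, attribute_loss, elt_term, the batch fields, and the loss weights are all hypothetical placeholders, not the paper's exact formulation): the supervised landmark term applies to the small labeled subset S, the attribute term to the larger subset M, and the ELT term to every image.

def total_loss(batch, model, attr_head, landmark_loss, attribute_loss, elt_term,
               w_landmark=1.0, w_attr=1.0, w_elt=1.0):
    """Combine the three loss terms; every argument name here is a hypothetical placeholder."""
    loss = 0.0
    if batch.has_landmarks:                    # only the S images with landmark labels
        loss = loss + w_landmark * landmark_loss(model(batch.image), batch.landmarks)
    if batch.has_attribute:                    # the M images with attribute labels (S << M <= N)
        landmarks = model(batch.image)
        loss = loss + w_attr * attribute_loss(attr_head(landmarks), batch.attribute)
    loss = loss + w_elt * elt_term(model, batch.image)   # ELT needs no labels: all N images
    return loss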

  24-26. RESULTS (Faces): [Figure: prediction results by method and training data: just a CNN trained on 100% of the labels, just a CNN trained on 5%, and the semi-supervised CNN trained on 5%.]

  27-30. COMPARISON WITH SOTA (Faces): [Bar chart: landmark localization error on the AFLW dataset for models trained with 1%, 5%, and 100% of the labeled data versus prior methods trained with 100%; lower is better.] AFLW dataset: ~25k images, 19 landmarks; head pose attribute: yaw, pitch, roll.

  31. COMPARISON WITH SOTA (Faces): [Figure: sample predictions from RCN+ (L) trained on 1% of the labels, RCN+ (L+ELT+A) trained on 1%, and RCN+ (L+ELT+A) trained on 100%.]

  32. CONCLUSIONS: Fuse predictions from multiple granularities. Let the network learn the fusing method. Additional attributes and priors improve results with semi-supervised learning. The approach works for landmarks on hands as well. Read more in our CVPR 2016 paper (https://arxiv.org/abs/1511.07356) and our CVPR 2018 paper (https://arxiv.org/abs/1709.01591).

  33. THANK YOU! Pavlo Molchanov pmolchanov@nvidia.com Sina Honari honaris@iro.umontreal.ca

  34. Comparison with SOTA

  35. Loss & Error. Loss: softmax loss on the keypoint heatmaps + an L2-norm term. Error: Euclidean distance between ground-truth and predicted keypoints, normalized by the inter-ocular distance.
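A sketch of the softmax keypoint loss (PyTorch; each keypoint's heatmap is treated as a classification over the H*W pixel locations, with the ground-truth pixel as the target class). The additional L2-norm term mentioned above is not shown; treating it as a separate regularization term is an assumption.

import torch
import torch.nn.functional as F

def keypoint_softmax_loss(heatmaps, gt_coords):
    """heatmaps:  (batch, n_keypoints, H, W) pre-softmax activations.
    gt_coords: (batch, n_keypoints, 2) integer ground-truth (x, y) pixel locations."""
    b, k, H, W = heatmaps.shape
    logits = heatmaps.view(b * k, H * W)
    targets = (gt_coords[..., 1] * W + gt_coords[..., 0]).view(b * k).long()  # flat index y*W + x
    return F.cross_entropy(logits, targets)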

  36. Masking Branches. Mask: 0 = branch is omitted, 1 = branch is included (branches ordered coarse to fine).

    Mask (coarse -> fine)   SumNet AFLW   SumNet AFW   RCN AFLW   RCN AFW
    1, 0, 0, 0                  10.54         10.63        10.61      10.89
    0, 1, 0, 0                  11.28         11.43        11.56      11.87
    1, 1, 0, 0                   9.47          9.65         9.31       9.44
    0, 0, 1, 0                  16.14         16.35        15.78      15.91
    0, 0, 0, 1                  45.39         47.97        46.87      48.61
    0, 0, 1, 1                  13.90         14.14        12.67      13.53
    0, 1, 1, 1                   7.91          8.22         7.62       7.95
    1, 0, 0, 1                   6.91          7.51         6.79       7.27
    1, 1, 1, 1                   6.44          6.78         6.37       6.43

  37. Adding More Branches: [Figure: SumNet and RCN architectures extended with additional coarse-to-fine branches.]
