Model Architectures and Training Techniques for High-Precision Landmark Localization
Sina Honari, Pavlo Molchanov, Jason Yosinski, Stephen Tyree, Pascal Vincent, Jan Kautz, Christopher Pal
KEYPOINT DETECTION / LANDMARK LOCALIZATION
The problem of localizing important points on images.
Keypoints for a human face can be:
• left/right eye
• nose
• left/right mouth corner
Applications include:
• Face alignment/rectification
• Emotion recognition
• Head pose estimation
• Person identification
MOTIVATION
Conventional ConvNets: alternating convolutional and max-pooling layers.
• Max-pooled features: lose precise spatial information, but gain robustness.
• Networks of only conv layers: keep spatial information, but produce lots of false positives.
Can we take robust pooled features and keep positional information?
SUMMATION-BASED NETWORKS (SUMNETS)
Sum features of different granularity (FCN [1], HyperColumns [2]).
Coarse-to-fine branches: C = convolution, P = pooling, U = upsampling; a branch = a horizontal sequence of C, U layers.
[1] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[2] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
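The legend above maps onto a small implementation. Below is a minimal PyTorch sketch (layer counts and channel sizes are mine, not the paper's): each branch produces per-keypoint score maps at its own resolution, coarser maps are upsampled (U) back to full resolution, and the branches are summed before a spatial softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SumNetSketch(nn.Module):
    """Minimal SumNet sketch: branches at several resolutions are
    upsampled to full resolution and summed (sizes are illustrative)."""
    def __init__(self, n_keypoints=5, ch=16):
        super().__init__()
        self.conv1 = nn.Conv2d(3, ch, 3, padding=1)    # full resolution (C)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)   # after 1 pooling (P, C)
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1)   # after 2 poolings
        # each branch maps its features to per-keypoint score maps
        self.out1 = nn.Conv2d(ch, n_keypoints, 1)
        self.out2 = nn.Conv2d(ch, n_keypoints, 1)
        self.out3 = nn.Conv2d(ch, n_keypoints, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        f1 = F.relu(self.conv1(x))                      # fine branch
        f2 = F.relu(self.conv2(F.max_pool2d(f1, 2)))    # coarser branch
        f3 = F.relu(self.conv3(F.max_pool2d(f2, 2)))    # coarsest branch
        up = lambda t: F.interpolate(t, size=(h, w))    # U layers
        # sum pre-softmax score maps from all granularities
        scores = self.out1(f1) + up(self.out2(f2)) + up(self.out3(f3))
        # softmax over spatial locations, independently per keypoint
        b, k = scores.shape[:2]
        return F.softmax(scores.view(b, k, -1), dim=-1).view(b, k, h, w)
```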
SUMMATION-BASED NETWORKS (SUMNETS)
[Figure: pre-softmax activations and softmax probabilities of the coarse-to-fine branches, and their sum]
Montreal Institute for Learning Algorithms
Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation (CVPR 2016)
Sina Honari, Jason Yosinski, Pascal Vincent, Christopher Pal
Recombinator Networks (RCNs)
The model feeds coarse features into finer layers early in their computation.
Coarse-to-fine branches: C = convolution, P = pooling, U = upsampling, K = concatenation; a branch = a horizontal sequence of C, U layers.
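In contrast to summing final score maps, the RCN injects coarse features into the next finer branch before that branch's convolutions, via concatenation (K). A minimal sketch under the same illustrative assumptions as above (not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCNSketch(nn.Module):
    """Minimal Recombinator Network sketch. The coarse branch is
    upsampled (U) and concatenated (K) into the finer branch *before*
    that branch's convolutions, so fine features are computed with
    coarse context already injected. Channel counts are illustrative."""
    def __init__(self, n_keypoints=5, ch=16):
        super().__init__()
        self.enc1 = nn.Conv2d(3, ch, 3, padding=1)        # fine branch (C)
        self.enc2 = nn.Conv2d(ch, ch, 3, padding=1)       # coarse branch (C)
        self.dec2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.dec1 = nn.Conv2d(2 * ch, ch, 3, padding=1)   # after concat (K)
        self.out = nn.Conv2d(ch, n_keypoints, 1)          # per-keypoint maps

    def forward(self, x):
        f1 = F.relu(self.enc1(x))                         # full resolution
        f2 = F.relu(self.enc2(F.max_pool2d(f1, 2)))       # half resolution (P)
        c2 = F.relu(self.dec2(f2))
        u2 = F.interpolate(c2, size=f1.shape[-2:])        # upsample (U)
        # K: concatenate coarse context with fine features, then convolve
        f1 = F.relu(self.dec1(torch.cat([u2, f1], dim=1)))
        return self.out(f1)                               # pre-softmax maps
```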
Recombinator Networks (RCNs) A Convolutional Encoder-Decoder Network with Skip Connections
SumNets vs. RCNs
[Figure: architecture diagrams of Summation-Based Networks (SumNets) and Recombinator Networks (RCNs) side by side]
SumNets vs. RCNs
[Figure: pre-softmax activations and softmax probabilities compared between SumNet and RCN]
Prediction Samples
[Figure: sample predictions from TCDCN, SumNet, and RCN]
Red is the model prediction; green is the GT keypoints.
Comparison with SOTA
Landmark estimation error (as percent; lower is better) on the 300W and MTFL datasets.
Error = Euclidean distance between GT and predicted keypoints, normalized by the inter-ocular distance.
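Written out (my notation, with $K$ landmarks, predicted points $\hat{p}_k$, ground-truth points $p_k$, and inter-ocular distance $d_{\mathrm{IOD}}$):

```latex
\mathrm{Error} \;=\; \frac{100}{K} \sum_{k=1}^{K}
\frac{\lVert \hat{p}_k - p_k \rVert_2}{d_{\mathrm{IOD}}}
```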
Improving Landmark Localization with Semi-Supervised Learning (CVPR 2018)
Sina Honari, Pavlo Molchanov, Jan Kautz, Stephen Tyree, Pascal Vincent, Christopher Pal
MOTIVATION
Manual landmark localization is a tedious task (when building datasets):
• Landmark labeling: ~60 s per image (very time consuming).
• Attribute labeling, e.g. "smiling" or "looking straight": ~1 s per image (fast).
Can we use an attribute (e.g. head pose) to guide landmark localization?
SEMI-SUPERVISED LEARNING 1
Using CNNs with sequential multitasking:
1. First predict landmarks.
2. Use the predicted landmarks to predict the attribute.
3. Get gradient from the attribute loss back into the landmark localization network.
Forward pass: landmarks help the attribute. Backward pass: the attribute loss refines the landmarks.
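As the sketch below shows, the key design choice is that the attribute head consumes only the predicted landmark coordinates, so the attribute loss can only decrease by improving the landmarks. This is a minimal sketch under assumed shapes; the `backbone` interface and head sizes are hypothetical:

```python
import torch
import torch.nn as nn

class SequentialMultitask(nn.Module):
    """Sketch of sequential multitasking: the attribute is predicted
    *from* the predicted landmarks, so attribute labels supervise the
    landmark network through the backward pass. `backbone` is assumed
    to be any network producing (B, K, 2) landmark coordinates."""
    def __init__(self, backbone, n_keypoints, n_attr_classes):
        super().__init__()
        self.backbone = backbone
        self.attr_head = nn.Sequential(
            nn.Linear(n_keypoints * 2, 64), nn.ReLU(),
            nn.Linear(64, n_attr_classes))

    def forward(self, image):
        landmarks = self.backbone(image)                  # (B, K, 2)
        attr_logits = self.attr_head(landmarks.flatten(1))
        return landmarks, attr_logits
```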
SEMI-SUPERVISED LEARNING 1
Soft-argmax
To train the entire network end-to-end we use soft-argmax to predict landmarks.
Soft-argmax estimates the location of the scaled center of mass:
✓ Continuous, not discrete
✓ Differentiable
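Concretely, soft-argmax takes a spatial softmax of the scaled score map and returns the expected coordinate. A minimal sketch (`beta` is my name for the scaling factor):

```python
import torch
import torch.nn.functional as F

def soft_argmax(score_maps, beta=10.0):
    """score_maps: (B, K, H, W) pre-softmax heatmaps.
    Returns (B, K, 2) expected (x, y) locations: a softmax over
    spatial positions followed by the probability-weighted mean of
    the coordinates, i.e. a differentiable 'center of mass'."""
    b, k, h, w = score_maps.shape
    flat = F.softmax(beta * score_maps.view(b, k, -1), dim=-1)
    probs = flat.view(b, k, h, w)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # marginal over rows, then E[x]
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # marginal over cols, then E[y]
    return torch.stack([x, y], dim=-1)
```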
SEMI-SUPERVISED LEARNING 2
Equivariant Landmark Transformation
Ask the model to make equivariant predictions w.r.t. image transformations:
• Predict landmarks L on an image I.
• Apply a transformation T to image I.
• Predict landmarks L′ on the image I′.
• Apply the transformation T to landmarks L.
• Compare L′ with T ∘ L.
• Get gradient from the ELT loss.
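The steps above can be written as a small loss function. A minimal sketch, assuming a differentiable landmark predictor and helper functions `transform_image` / `transform_points` (my names) that apply the same geometric transform T to pixels and to coordinates:

```python
import torch

def elt_loss(model, image, transform_image, transform_points):
    """Equivariant Landmark Transformation loss sketch.
    No ground truth is needed, so this loss also applies to
    unlabeled images."""
    landmarks = model(image)                       # L  on I,  (B, K, 2)
    landmarks_t = model(transform_image(image))    # L' on I' = T(I)
    # equivariance: predicting then transforming should match
    # transforming then predicting
    return torch.mean((landmarks_t - transform_points(landmarks)) ** 2)
```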
LEARNING LANDMARKS
Making use of all data:
• Loss from GT landmarks (on the S landmark-labeled images).
• Loss from attributes (A) using sequential multitasking (on the M attribute-labeled images).
• Loss from the Equivariant Landmark Transformation (ELT) (on all N images).
Here S << M <= N.
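In equation form (my notation; the weights $\lambda_A$ and $\lambda_E$ stand in for whatever balancing of the terms the paper uses):

```latex
\mathcal{L} \;=\;
\sum_{i \in S} \mathcal{L}_{\mathrm{lmk}}\!\left(\hat{L}_i, L_i\right)
\;+\; \lambda_A \sum_{i \in M} \mathcal{L}_{\mathrm{attr}}\!\left(\hat{A}_i, A_i\right)
\;+\; \lambda_E \sum_{i \in N} \big\lVert \hat{L}'_i - T \circ \hat{L}_i \big\rVert^2
```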
RESULTS
Faces
Training data / method:
• 100% of data, just CNN
• 5% of data, just CNN
• 5% of data, semi-supervised CNN
[Figure: sample predictions for each setting]
COMPARISON WITH SOTA
Faces
[Bar chart: landmark error for SOTA methods trained on 100% labeled data vs. RCN+ variants trained with 1%, 5%, and 100% labels; reported errors include 5.43, 4.35, 4.25, 4.05, 3.92, 3.73, 2.88, 2.72, 2.46, 2.17, 2.03, and 1.59]
• AFLW dataset: 19 landmarks, ~25k images
• Head pose attribute: yaw, pitch, roll
COMPARISON WITH SOTA
Faces
[Figure: qualitative predictions of RCN+ (L) trained with 1% labels vs. RCN+ (L+ELT+A) trained with 1% and with 100% labels]
CONCLUSIONS
• Fuse predictions from multiple granularities.
• Let the network learn the fusion method.
• Additional attributes and an equivariance prior improve results with semi-supervised learning.
• Works with landmarks on hands as well.
• Read more in our CVPR 2016 paper: https://arxiv.org/abs/1511.07356 , and our CVPR 2018 paper: https://arxiv.org/abs/1709.01591
THANK YOU!
Pavlo Molchanov, pmolchanov@nvidia.com
Sina Honari, honaris@iro.umontreal.ca
Comparison with SOTA
Loss & Error
Loss: softmax loss on keypoints + L2-norm.
Error: Euclidean distance between GT and predicted keypoints, normalized by the inter-ocular distance.
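A minimal sketch of this combined loss under assumed output shapes (the names and the square-map assumption are mine): a spatial softmax turned into a negative log-likelihood at the ground-truth pixel, plus an L2 term on the predicted coordinates.

```python
import torch
import torch.nn.functional as F

def keypoint_loss(log_probs, coords, gt_coords, l2_weight=1.0):
    """log_probs: (B, K, H*W) log of the spatial softmax per keypoint.
    coords / gt_coords: (B, K, 2) predicted / ground-truth (x, y).
    Softmax loss (NLL of the GT pixel) + L2 on the coordinates."""
    b, k, hw = log_probs.shape
    w = int(hw ** 0.5)  # assume square score maps for this sketch
    gt_idx = (gt_coords[..., 1].round() * w + gt_coords[..., 0].round()).long()
    nll = -log_probs.gather(2, gt_idx.unsqueeze(-1)).mean()
    l2 = F.mse_loss(coords, gt_coords)
    return nll + l2_weight * l2
```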
Masking Branches
Error when masking branches. Mask is ordered coarse → fine; 0 = branch omitted, 1 = branch included.

Mask (coarse → fine) | SumNet AFLW | SumNet AFW | RCN AFLW | RCN AFW
1, 0, 0, 0           | 10.54       | 10.63      | 10.61    | 10.89
0, 1, 0, 0           | 11.28       | 11.43      | 11.56    | 11.87
1, 1, 0, 0           |  9.47       |  9.65      |  9.31    |  9.44
0, 0, 1, 0           | 16.14       | 16.35      | 15.78    | 15.91
0, 0, 0, 1           | 45.39       | 47.97      | 46.87    | 48.61
0, 0, 1, 1           | 13.90       | 14.14      | 12.67    | 13.53
0, 1, 1, 1           |  7.91       |  8.22      |  7.62    |  7.95
1, 0, 0, 1           |  6.91       |  7.51      |  6.79    |  7.27
1, 1, 1, 1           |  6.44       |  6.78      |  6.37    |  6.43
Adding More Branches
[Figure: SumNet and RCN coarse-to-fine architectures extended with additional branches]