Matching Guided Distillation (ECCV 2020)
Kaiyu Yue, Jiangfan Deng, and Feng Zhou. Algorithm Research, Aibee Inc.



  1. Matching Guided Distillation ECCV 2020 Kaiyu Yue, Jiangfan Deng, and Feng Zhou Algorithm Research, Aibee Inc.

  2. Motivation

  3. Motivation
     Distillation obstacle
     • The gap in semantic feature structure between the intermediate features of the teacher and the student
     Classic scheme
     • Transform the intermediate features by adding adaptation modules, such as a conv layer
     Problems
     • 1) The adaptation module brings more parameters into training
     • 2) An adaptation module with random initialization or a special transformation is not friendly for distilling a pre-trained student

  4. Matching Guided Distillation Framework

  5. Matching Guided Distillation – Matching. Given the two feature sets from the teacher and the student, we use the Hungarian method to obtain the flow-guided matrix M, which encodes the matched channel relationships.
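A minimal sketch of the matching step, assuming the cost between a teacher channel and a student channel is the L2 distance between their flattened activation maps and using SciPy's Hungarian solver; the column-repeating trick for many-to-one assignment is a simplification, and the paper's exact cost and assignment formulation may differ.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_channels(f_t, f_s):
    """f_t: teacher features (N, C_t, H, W); f_s: student features (N, C_s, H, W).
    Returns a binary flow-guided matrix M of shape (C_t, C_s), where M[i, j] = 1
    means teacher channel i is assigned to student channel j."""
    C_t, C_s = f_t.shape[1], f_s.shape[1]
    # Flatten each channel over batch and space: (C, N*H*W).
    t = f_t.transpose(0, 1).reshape(C_t, -1)
    s = f_s.transpose(0, 1).reshape(C_s, -1)
    # Pairwise L2 cost between every teacher and student channel (assumed cost).
    cost = torch.cdist(t, s)                                   # (C_t, C_s)
    # Hungarian assignment; repeating the student columns lets every teacher
    # channel be matched while one student channel may receive several teachers.
    reps = (C_t + C_s - 1) // C_s
    rows, cols = linear_sum_assignment(cost.repeat(1, reps).detach().cpu().numpy())
    M = torch.zeros(C_t, C_s)
    M[rows, cols % C_s] = 1.0
    return M
```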

  6. Matching Guided Distillation – Channels Reduction. One student channel can match multiple teacher channels, so we reduce the matched teacher channels into one tensor for guiding the student.

  7. Matching Guided Distillation – Distillation. After reducing the teacher channels, we distill the student with a partial distance training loss, such as the L2 loss.
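A minimal sketch of the distance term, assuming the reduced teacher tensor already has the student's shape and that a plain L2 (MSE) distance is used, as the slide suggests; any feature normalization applied in the paper is omitted.

```python
import torch.nn.functional as F

def distance_loss(f_s, f_t_reduced):
    """f_s, f_t_reduced: (N, C_s, H, W). The teacher tensor is detached so that
    gradients flow only into the student."""
    return F.mse_loss(f_s, f_t_reduced.detach())
```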

  8. Matching Guided Distillation – Coordinate Descent Optimization. The overall training follows a coordinate-descent scheme that alternates between two objectives: updating the flow-guided matrix M and updating the student parameters with SGD.
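A minimal sketch of the coordinate-descent schedule, reusing match_channels and distance_loss from the sketches above; reduce_channels, task_loss, the .features/.head accessors, and the update_every interval are hypothetical placeholders for however the intermediate features, task head, and matching-update frequency are actually wired up.

```python
import torch

def train(teacher, student, loader, optimizer, update_every=1000):
    M = None
    for step, (x, y) in enumerate(loader):
        with torch.no_grad():
            f_t = teacher.features(x)        # frozen teacher features (placeholder API)
        f_s = student.features(x)
        # Coordinate descent, step 1: periodically re-solve the matching.
        if M is None or step % update_every == 0:
            M = match_channels(f_t, f_s)
        # Coordinate descent, step 2: SGD on the student with M held fixed.
        f_t_red = reduce_channels(f_t, M)    # any of the three reduction methods
        loss = task_loss(student.head(f_s), y) + distance_loss(f_s, f_t_red)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```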

  9. Matching Guided Distillation Reduction Methods

  10. Matching Guided Distillation – Channels Reduction We propose three efficient methods for reducing teacher channels: Sparse Matching, Random Drop and Absolute Max Pooling.

  11. Matching Guided Distillation – Sparse Matching. Each student channel matches only its most related teacher channel; unmatched teacher channels are ignored.
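A minimal sketch of Sparse Matching, assuming the flow-guided matrix M is one-hot per student channel (exactly one teacher channel kept per student channel) and that unmatched teacher channels are simply dropped.

```python
import torch

def reduce_sparse(f_t, M):
    """f_t: (N, C_t, H, W); M: (C_t, C_s) one-hot per column. Returns (N, C_s, H, W)."""
    # Index of the single matched teacher channel for each student channel.
    t_idx = M.argmax(dim=0)                  # (C_s,)
    return f_t[:, t_idx]
```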

  12. Matching Guided Distillation – Random Drop. For each student channel, we sample one teacher channel at random from the teacher channels matched to it.
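A minimal sketch of Random Drop, assuming M is a binary many-to-one matching and that one teacher channel is sampled uniformly for each student channel.

```python
import torch

def reduce_random_drop(f_t, M):
    """f_t: (N, C_t, H, W); M: (C_t, C_s) binary matrix. Returns (N, C_s, H, W)."""
    # Sample one teacher row per student column, proportional to M's 0/1 entries.
    t_idx = torch.multinomial(M.t(), num_samples=1).squeeze(1)   # (C_s,)
    return f_t[:, t_idx]
```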

  13. Matching Guided Distillation – Absolute Max Pooling. To keep both the positive and the negative feature information of the teacher, we propose a novel pooling mechanism that reduces features according to the largest absolute value at each feature location.
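A minimal sketch of Absolute Max Pooling, assuming the pooling is taken per spatial location over the teacher channels matched to each student channel, so the signed value with the largest magnitude is kept.

```python
import torch

def reduce_abs_max(f_t, M):
    """f_t: (N, C_t, H, W); M: (C_t, C_s) binary matrix with at least one match
    per student channel. Returns (N, C_s, H, W)."""
    N, C_t, H, W = f_t.shape
    C_s = M.shape[1]
    out = f_t.new_empty(N, C_s, H, W)
    for j in range(C_s):
        group = f_t[:, M[:, j].bool()]                    # matched teacher channels (N, k_j, H, W)
        idx = group.abs().argmax(dim=1, keepdim=True)     # location-wise winner by |value|
        out[:, j:j + 1] = group.gather(1, idx)            # keep the signed value
    return out
```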

  14. Matching Guided Distillation Main Results

  15. Results – Fine-Grained Recognition on CUB-200: +3.97% top-1 and +5.44% top-1.

  16. Results – Large-Scale Classification on ImageNet-1K: +1.83% top-1 and +2.6% top-1.

  17. Results – Object Detection and Instance Segmentation on COCO

  18. Summary
     • MGD is lightweight and efficient across various tasks
     • MGD removes the channel-number constraint between teacher and student, making it flexible to plug into any network
     • MGD is friendly for distilling a pre-trained student
     • Project webpage: http://kaiyuyue.com/mgd
