Matching Guided Distillation
ECCV 2020
Kaiyu Yue, Jiangfan Deng, and Feng Zhou
Algorithm Research, Aibee Inc.
Motivation
Motivation
Distillation Obstacle
• The gap in semantic feature structure between the intermediate features of the teacher and the student
Classic Scheme
• Transform intermediate features by adding adaptation modules, such as a conv layer
Problems
• 1) The adaptation module brings more parameters into training
• 2) An adaptation module with random initialization or a special transformation is not friendly for distilling a pre-trained student
Matching Guided Distillation Framework
Matching Guided Distillation – Matching
Given two feature sets from the teacher and the student, we use the Hungarian method to compute the flow-guided matrix M. The flow-guided matrix M indicates the matched channel relationships, as sketched below.
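A minimal sketch of the matching step, assuming the cost is the pairwise L2 distance between flattened per-channel teacher and student responses and that student channels are tiled so every teacher channel receives an assignment; the function name `match_channels` and these details are illustrative, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_channels(f_t, f_s):
    """Match teacher channels to student channels with the Hungarian method (sketch).

    f_t: teacher features, shape (Ct, HW)  -- per-channel responses, flattened
    f_s: student features, shape (Cs, HW), with Cs <= Ct

    Returns M, a (Cs, Ct) binary matrix where M[i, j] = 1 means teacher
    channel j is routed to student channel i.
    """
    Ct, Cs = f_t.shape[0], f_s.shape[0]
    reps = int(np.ceil(Ct / Cs))
    f_s_tiled = np.tile(f_s, (reps, 1))[:Ct]          # copies of student channels, (Ct, HW)

    # cost[j, k]: L2 distance between teacher channel j and tiled student channel k
    cost = np.linalg.norm(f_t[:, None, :] - f_s_tiled[None, :, :], axis=-1)

    rows, cols = linear_sum_assignment(cost)          # one teacher channel per tiled slot
    M = np.zeros((Cs, Ct), dtype=np.float32)
    M[cols % Cs, rows] = 1.0                          # fold tiled slots back to real student channels
    return M
```

Tiling the student channels turns the many-to-one matching into a balanced assignment that `linear_sum_assignment` can solve directly; building the full (Ct, Ct, HW) difference tensor is fine for a sketch but would be chunked in practice.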
Matching Guided Distillation – Channels Reduction
One student channel can match multiple teacher channels. We reduce the matched teacher channels into a single tensor for guiding the student.
Matching Guided Distillation – Distillation
After reducing the teacher channels, we distill the student with a partial distance training loss, such as an L2 loss (see the sketch below).
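A minimal sketch of the distance loss, assuming a plain mean-squared L2 distance between the reduced teacher features and the student features; the normalization and the name `mgd_l2_loss` are assumptions, not the paper's exact loss.

```python
import torch.nn.functional as F

def mgd_l2_loss(f_t_reduced, f_s):
    """Partial distance loss between reduced teacher and student features (sketch).

    f_t_reduced: (N, Cs, H, W) teacher features after channel reduction,
                 aligned channel-by-channel with the student.
    f_s:         (N, Cs, H, W) student features.
    """
    # element-wise mean-squared (L2) distance; the teacher side is detached
    # so gradients only flow into the student
    return F.mse_loss(f_s, f_t_reduced.detach())
```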
Matching Guided Distillation – Coordinate Descent Optimization
The overall training takes a coordinate-descent approach that alternates between two optimization objectives: updating the flow-guided matrix M, and updating the student parameters with SGD on the distance loss (a sketch of the loop follows).
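A minimal sketch of the coordinate-descent loop, assuming a PyTorch-style setup. `collect_features`, `reduce_channels`, `teacher.features`, `student.features`, and `student.head` are hypothetical helpers; `match_channels` and `mgd_l2_loss` refer to the earlier sketches, and `update_every` (how often M is re-solved) is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def train_with_mgd(teacher, student, loader, optimizer, epochs, update_every=10):
    """Coordinate-descent training loop (sketch, not the paper's exact recipe).

    Alternates between (a) re-solving the flow-guided matrix M from the current
    teacher/student features, and (b) training the student with SGD while M is
    held fixed.
    """
    M = None
    for epoch in range(epochs):
        if epoch % update_every == 0:
            # (a) update the flow-guided matrix M.
            # collect_features is a hypothetical helper gathering per-channel
            # teacher/student responses over a few batches (torch/numpy
            # conversions omitted for brevity).
            f_t, f_s = collect_features(teacher, student, loader)
            M = match_channels(f_t, f_s)               # Hungarian matching, see earlier sketch

        # (b) optimize student parameters with M fixed
        for images, labels in loader:
            with torch.no_grad():
                feat_t = teacher.features(images)       # intermediate teacher features
            feat_s = student.features(images)           # intermediate student features
            logits = student.head(feat_s)

            feat_t_reduced = reduce_channels(feat_t, M) # e.g. absolute max pooling, sketched later
            loss = F.cross_entropy(logits, labels) + mgd_l2_loss(feat_t_reduced, feat_s)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```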
Matching Guided Distillation Reduction Methods
Matching Guided Distillation – Channels Reduction We propose three efficient methods for reducing teacher channels: Sparse Matching, Random Drop and Absolute Max Pooling.
Matching Guided Distillation – Sparse Matching
Each student channel matches only the single most related teacher channel; unmatched teacher channels are ignored.
Matching Guided Distillation – Random Drop
For each student channel, we sample one random teacher channel from the set of teacher channels matched to it.
Matching Guided Distillation – Absolute Max Pooling
To keep both positive and negative feature information from the teacher, we propose a novel pooling mechanism that reduces the matched teacher channels according to the absolute value at each spatial location (sketched below).
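A minimal sketch of absolute max pooling over matched teacher channels, assuming M is a binary (Cs x Ct) torch tensor; the per-channel loop is for clarity, not efficiency.

```python
import torch

def absolute_max_pool(f_t, M):
    """Absolute max pooling over matched teacher channels (sketch).

    f_t: (N, Ct, H, W) teacher features.
    M:   (Cs, Ct) binary matching matrix (torch tensor); M[i, j] = 1 means
         teacher channel j is matched to student channel i.

    At every spatial location of every student channel, keep the matched
    teacher value with the largest absolute magnitude, sign preserved.
    """
    N, Ct, H, W = f_t.shape
    Cs = M.shape[0]
    out = f_t.new_zeros(N, Cs, H, W)
    for i in range(Cs):
        idx = M[i].nonzero(as_tuple=True)[0]            # teacher channels matched to student channel i
        if idx.numel() == 0:
            continue
        group = f_t[:, idx]                             # (N, k, H, W)
        amax = group.abs().argmax(dim=1, keepdim=True)  # index of the max |value| per location
        out[:, i] = group.gather(1, amax).squeeze(1)    # take the signed value at that index
    return out
```

Keeping the signed value of the largest-magnitude response preserves strong negative as well as strong positive teacher activations, which a plain max pooling over the matched channels would discard.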
Matching Guided Distillation Main Results
Results – Fine-Grained Recognition on CUB-200
+3.97% top-1 and +5.44% top-1 improvements
Results – Large-Scale Classification on ImageNet-1K
+1.83% top-1 and +2.6% top-1 improvements
Results – Object Detection and Instance Segmentation on COCO
Summary
• MGD is lightweight and efficient for various tasks
• MGD removes the channel-number constraint between teacher and student, so it is flexible to plug into a network
• MGD is friendly for distilling a pre-trained student
• Project webpage: http://kaiyuyue.com/mgd