Disentanglement of Visual Concepts from Classifying and Synthesizing Scenes
Bolei Zhou, The Chinese University of Hong Kong
Representation Learning
The purpose of representation learning: "To identify and disentangle the underlying explanatory factors hidden in the observed milieu of low-level sensory data."
Bengio et al., Representation Learning: A Review and New Perspectives.
Sources of Deep Representations
• Image Classification: Object Recognition, Scene Recognition
• Self-Supervised Learning: Audio prediction (ECCV'16), Colorization (ECCV'16 and CVPR'17)
• Image Generation
Outline • Disentanglement of Concepts from Classifying Scenes • Sanity Check Experiment: Mixture of MNIST • Disentanglement of Visual Concepts from Synthesizing Scenes • Future Directions
My Previous Talks • On the importance of single units CVPR’18 Tutorial talk: https://www.youtube.com/watch?v=1aSS5GEH58U • Interpretable representation learning for visual intelligence MIT thesis defense: https://www.youtube.com/watch?v=J7Zz_33ZeJc
Neural Networks for Scene Classification http://places2.csail.mit.edu/demo.html https://github.com/CSAILVision/places365
What are the internal units for classifying scenes?
Convolutional Neural Network (CNN): Cafeteria (0.9)
Units as concept detectors: Unit 22 at Layer 5: Face · Unit 2 at Layer 4: Lamp · Unit 42 at Layer 3: Trademark · Unit 57 at Layer 4: Windows
What is a unit doing? Visualize the unit
• Back-propagation [Simonyan et al., ICLR'15] [Springenberg et al., ICLR'15] [Selvaraju et al., ICCV'17]
• Image Synthesis [Nguyen et al., NIPS'16] [Dosovitskiy et al., CVPR'16]
• Deconvolution [Zeiler et al., ECCV'14] [Mahendran et al., CVPR'15] [Girshick et al., CVPR'14]
Data-Driven Visualization (Layer 5)
Unit 1 / Unit 2 / Unit 3: top activated images
https://github.com/metalbubble/cnnvisualizer
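The data-driven visualization above amounts to ranking the dataset by each unit's response. A minimal sketch, assuming activations have already been spatially pooled into a (num_images, num_units) matrix (the function name and toy data are illustrative, not from the talk's codebase):

```python
import numpy as np

def top_activated_images(activations, k=5):
    """For each unit, return the indices of the k images with the
    highest activation (hypothetical data-driven visualization)."""
    # activations: (num_images, num_units), e.g. max-pooled conv5 responses
    order = np.argsort(-activations, axis=0)  # sort images per unit, descending
    return order[:k].T                        # (num_units, k) image indices

# Toy example: 6 images, 2 units
acts = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.3, 0.7],
                 [0.9, 0.1],
                 [0.5, 0.5],
                 [0.2, 0.6]])
print(top_activated_images(acts, k=3))  # unit 0 -> images [3, 1, 4]
```

In practice one would then crop each returned image to the unit's most activated receptive-field region before showing it to annotators.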
Annotating the Interpretation of Units (Amazon Mechanical Turk)
Word/description to summarize the images: Lamp
Which category the description belongs to: Scene · Region or surface · Object · Object part · Texture or material · Simple elements or colors
[Zhou, Khosla, Lapedriza, Oliva, Torralba. ICLR 2015]
Interpretable Representations for Objects and Scenes
59 units as objects at conv5 of AlexNet on ImageNet; 151 units as objects at conv5 of AlexNet on Places
[Figure: example unit concepts such as dog, bird, face, tie (ImageNet) and building, windows, baseball field (Places)]
Quantify the Interpretability of Networks: Network Dissection
[Zhou*, Bau*, et al. CVPR 2017, TPAMI'18]
[Figure: interpretable units at conv5 matched to concepts, e.g. object units for road, car, airplane and texture units for honeycombed, banded, waffled, each with IoU around 0.12-0.16; in total the interpretable units cover 32 objects, 6 scenes, 6 parts, 2 materials, 25 textures, and 1 color]
Evaluate Unit for Semantic Segmentation
Testing dataset: 60,000 images annotated with 1,200 concepts
Unit 1: top activated images from the testing dataset. Top concept: Lamp, Intersection over Union (IoU) = 0.23
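The IoU score above compares a unit's thresholded activation mask against a concept's segmentation mask. A minimal sketch of that computation (the function name and threshold choice are illustrative; Network Dissection picks the threshold from the unit's activation distribution over the dataset):

```python
import numpy as np

def unit_concept_iou(activation_map, concept_mask, threshold):
    """IoU between a unit's binarized activation map and a concept's
    segmentation mask, in the spirit of Network Dissection."""
    unit_mask = activation_map > threshold          # binarize the unit's response
    intersection = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return intersection / union if union > 0 else 0.0

# Toy example: 2x2 activation map vs. a lamp mask
act = np.array([[0.9, 0.1],
                [0.8, 0.2]])
lamp = np.array([[True, False],
                 [False, False]])
print(unit_concept_iou(act, lamp, threshold=0.5))  # 1 / 2 = 0.5
```

A unit is then reported with the concept that maximizes this IoU across the annotated dataset.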
Layer 5 unit 79: car (object), IoU = 0.13 · Layer 5 unit 107: road (object), IoU = 0.15
118/256 units covering 72 unique concepts
[Figure: comparison across architectures (AlexNet, ResNet, GoogLeNet, VGG) of units detecting House and Airplane]
More results in the TPAMI extension paper: comparison of different network architectures, and comparison of supervisions (supervised vs. self-supervised).
Interpreting Deep Visual Representations via Network Dissection
https://arxiv.org/pdf/1711.05611.pdf
Sanity Check Experiment for Disentanglement • How to quantitatively evaluate the solution reached by CNN? • What are the hidden factors in object recognition and scene recognition? Object Recognition Scene Recognition
Sanity Check Experiment for Disentanglement A controlled classification experiment: Mixture of MNIST 10 digits from MNIST Pairwise combination of digits Class 1 (3,6) Class 2 (0,2) Class 3 (4,5) … Class N With Wentao Zhu (PKU)
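The Mixture-of-MNIST construction above defines one class per unordered pair of digits. A minimal sketch of building the label space and composing an image from two digits (the side-by-side layout is an assumption for illustration; the talk's exact composition may differ):

```python
import numpy as np
from itertools import combinations

def make_mixture_classes(digits=range(10)):
    """One class per unordered digit pair: C(10, 2) = 45 classes."""
    return list(combinations(digits, 2))

def compose_image(img_a, img_b):
    """Place two 28x28 digit images side by side (hypothetical layout)."""
    return np.concatenate([img_a, img_b], axis=1)  # -> 28x56

classes = make_mixture_classes()
print(len(classes))  # 45
```

Class 1 = (3, 6), Class 2 = (0, 2), and so on, matching the slide's enumeration.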
Solving Mixture of MNIST
To classify a given image into one of the 45 classes (all C(10, 2) = 45 digit pairs)
• Training data: ~20,000 images
• Accuracy on validation set: 91.7%
A simple convnet for classification: Layer 1 (10 units) → Layer 2 (20 units) → Layer 3 (10 units) → Global average pooling → Softmax (45 classes)
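The head of the simple convnet collapses each of the 10 conv3 feature maps to a scalar via global average pooling, then maps those 10 features to 45 class scores. A NumPy sketch of those two steps (a sketch, not the talk's actual implementation):

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse each unit's spatial map to one scalar: (units, H, W) -> (units,)."""
    return feature_maps.mean(axis=(1, 2))

def softmax(logits):
    """Numerically stable softmax over the 45 class scores."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Toy forward pass of the head: 10 conv3 maps -> 45 class probabilities
fm = np.ones((10, 4, 4))                 # pretend conv3 output
W = np.zeros((45, 10))                   # untrained linear classifier
probs = softmax(W @ global_average_pool(fm))
print(probs.shape)  # (45,)
```

Because each class score is a weighted sum of per-unit averages, a unit that fires only on one digit contributes a clean, interpretable signal, which is what makes the digit detectors on the next slide visible.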
Digit Detectors Emerge from Solving Mixture of MNIST
Unit 03 for detecting digit 0: top activated images and activation maps.
Precision: @100 = 1.00, @300 = 1.00, @500 = 1.00, @700 = 0.99; @(recall=0.25) = 0.99, @(recall=0.50) = 0.98, @(recall=0.75) = 0.90
Digit Detectors Emerge from Solving Mixture of MNIST Two metrics for unit importance: alignment score and ablation effect
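Of the two importance metrics above, the ablation effect is the simpler to state: zero out one unit and measure how much validation accuracy drops. A hedged sketch (the alignment score is not shown; `predict_fn`, and the toy data, are illustrative):

```python
import numpy as np

def ablation_effect(predict_fn, features, labels, unit):
    """Accuracy drop when a single unit's activation is zeroed out:
    one way to score how much the classifier relies on that unit."""
    base_acc = (predict_fn(features) == labels).mean()
    ablated = features.copy()
    ablated[:, unit] = 0.0                    # knock out one unit
    ablated_acc = (predict_fn(ablated) == labels).mean()
    return base_acc - ablated_acc

# Toy classifier: predict the argmax feature
predict = lambda f: f.argmax(axis=1)
feats = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
labels = np.array([0, 1])
print(ablation_effect(predict, feats, labels, unit=1))  # 0.5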
Dropout Affects the Unit as Digit Detector Baseline Baseline + Dropout on the conv3
Layer Width Affects the Unit as Digit Detector • Wider network performs better at disentanglement • Less reliance on single units Baseline Baseline with tripling the number of units at conv3
Wider Layer + Dropout
Baseline · Baseline with wider layer · Baseline with wider layer + dropout
Usefulness Experiment
• Take 8 and 9 as redundant digits (randomly shown in all classes)
• Effective digits: 0-7
• Number of classes: 28 (all C(8, 2) pairs of the effective digits)
Deep Neural Networks for Synthesizing Scenes: Generative Adversarial Networks
Goodfellow et al., NIPS'14 · Radford et al., ICLR'16 · T. Karras et al., 2017 · A. Brock et al., 2018
[Figure: progressively grown GAN samples, T. Karras et al., 2017]
How to Add or Modify Contents? Input: Output: Random noise Synthesized image Add trees Add domes
Understanding the Internal Units in GANs Output: Synthesized image Input: Random noise What are they doing? David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, J. Tenenbaum, W. Freeman, A. Torralba. GAN Dissection: Visualizing and Understanding GANs. ICLR’19. https://arxiv.org/pdf/1811.10597.pdf
Framework of GAN Dissection David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, J. Tenenbaum, W. Freeman, A. Torralba. GAN Dissection: Visualizing and Understanding GANs. https://arxiv.org/pdf/1811.10597.pdf
Units Emerge as Drawing Objects Unit 365 draws trees. Unit 43 draws domes. Unit 14 draws grass. Unit 276 draws towers.
Manipulating the Images
Unit 4 draws Lamp: synthesized images vs. synthesized images with Unit 4 removed
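Removing Unit 4 above is an intervention on an intermediate feature map: zero the chosen channels, then run the rest of the generator. A minimal sketch of that editing step (function name is illustrative; GAN Dissection also supports *inserting* activations to add objects):

```python
import numpy as np

def ablate_units(feature_maps, units):
    """Zero out selected channels of an intermediate GAN feature map.
    Re-running the remaining generator layers on the edited maps then
    removes the objects those units draw (GAN Dissection-style edit)."""
    edited = feature_maps.copy()          # leave the original activations intact
    edited[:, units] = 0.0                # (batch, channels, H, W)
    return edited

fm = np.ones((1, 8, 2, 2))                # pretend intermediate activations
out = ablate_units(fm, [4])
print(out[0, 4].sum(), out[0, 0].sum())   # 0.0 4.0
```

The converse edit, writing a tree unit's typical activation into a spatial region, is what drives the "Add trees / Add domes" manipulation shown earlier.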
Interactive Image Manipulation All the code and paper are available at http://gandissect.csail.mit.edu
Latest Work on Using GANs to Manipulate Real Images
• Challenge: invert the hidden code z for any given image
Input: Hidden code z → Output: Synthesized image
Future Directions
Interpretable Deep Learning connects to:
• Adversarial samples: defense and attack
• Generalization and overfitting
• Network compression
• GAN and deep RL
• Plasticity and transfer learning
Why Care About Interpretability? ‘Alchemy’ of Deep Learning ‘Chemistry’ of Deep Learning Scientific Understanding