Deep Learning: Review & Discussion Chiyuan Zhang, CSAIL, CBMM 2015.07.22
Overview • What has been done? • Applications • Main Challenges • Empirical Analysis • Theoretical Analysis • What is to be done?
Deep Learning What has been done?
Applications • Computer Vision • ConvNets, dominating • Speech Recognition • Deep Nets, Recurrent Neural Networks (RNNs), dominating, industrial deployment • Natural Language Processing • Matched previous state of the art, but no revolutionary results yet • Reinforcement Learning, Structured Prediction, Graphical Models, Unsupervised Learning, … • “Unrolling” iterations as NN layers
Image Classification • ImageNet Large Scale Visual Recognition Challenge (ILSVRC) http://image-net.org/challenges/LSVRC/ • Tasks • Classification: 1000-way multiclass learning • Detection: classify and locate (bounding box) • State-of-the-art • ConvNets since 2012 • Olga Russakovsky, …, Andrej Karpathy, Li Fei-Fei, et al. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575 [cs.CV].
Surpassing “Human Level” Performance • Try it yourself: http://cs.stanford.edu/people/karpathy/ilsvrc/ • For humans • Difficult & painful task (1000 classes) • One guy trained himself with 500 images and tested on 1500 (!!) images • ~1 minute to classify 1 image: ~25 hours… • ~5% error, the so-called “human level” performance • Humans and machines make different kinds of errors; for details see http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
ConvNets on ImageNet • Models pre-trained on ImageNet turn out to be very good feature extractors or initialization models for many other vision-related tasks, even on different datasets; popular in both academia and industry (startups) • Typically takes ~1 week to train on a decent GPU node • ImageNet challenge training set: ~1.2M images (p > N) • e.g. Google “Inception”, 27 layers, ~7M parameters; VGG ~100M parameters (table 2, arXiv:1409.1556)
[Figure: GoogLeNet “Inception” architecture diagram, from the input convolution/pooling layers through stacked Inception (DepthConcat) modules to the auxiliary and final softmax classifiers.]
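A minimal sketch of the “pre-trained ConvNet as feature extractor” idea, using torchvision’s VGG-16 purely for illustration (the 2015-era pipelines typically used Caffe); the image path `example.jpg` is a hypothetical placeholder:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(pretrained=True)   # weights learned on ImageNet
vgg.eval()

# Drop the final 1000-way classifier; keep the 4096-d fc7-style activations as features.
feature_extractor = torch.nn.Sequential(
    vgg.features,
    torch.nn.Flatten(),
    *list(vgg.classifier.children())[:-1],
)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg")).unsqueeze(0)   # hypothetical input image
with torch.no_grad():
    feat = feature_extractor(img)     # shape (1, 4096); feed to an SVM / linear model
```

These fixed features (or the full pre-trained weights as initialization) are then reused for the target task instead of training from scratch.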
Fancier Applications: Image Captioning • Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015. • Kelvin Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015. • Remi Lebret et al. Phrase-based Image Captioning. ICML 2015. • …
Unrolling Iterative Algorithms as Layers of Deep Nets • Zheng, Shuai et al. “Conditional Random Fields as Recurrent Neural Networks.” arXiv.org cs.CV (2015).
Unrolling Multiplicative NMF Iterations • Jonathan Le Roux et al. Deep NMF for Speech Separation. ICASSP 2015.
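A minimal sketch of the unrolling idea for NMF, assuming the standard multiplicative update H ← H ⊙ (WᵀV)/(WᵀWH); the per-layer bases are untied so they could be trained discriminatively, as in Deep NMF, but the values below are illustrative only:

```python
import numpy as np

def unrolled_nmf(V, W_layers, H0, eps=1e-8):
    """V: (F, T) nonnegative spectrogram; W_layers: list of K basis matrices (F, R);
    H0: (R, T) initial activations. Each unrolled iteration plays the role of a layer."""
    H = H0
    for W in W_layers:                      # one multiplicative update per "layer"
        num = W.T @ V
        den = W.T @ (W @ H) + eps
        H = H * num / den                   # keeps H nonnegative by construction
    return H

F, T, R, K = 64, 100, 16, 5
rng = np.random.default_rng(0)
V = rng.random((F, T))
W_layers = [rng.random((F, R)) for _ in range(K)]   # untied per-layer parameters
H = unrolled_nmf(V, W_layers, rng.random((R, T)))
```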
Speech Recognition • RNNs: handle variable-length input, using context / memory for the current prediction • Very deep neural network when unfolded in time, hard to train • Real-time conversation translation • Image source: Li Deng and Dong Yu. Deep Learning: Methods and Applications.
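A minimal sketch of why an RNN unfolded in time is a very deep network: the same weights are applied once per time step, so a length-T input passes through T nonlinearities (toy dimensions, not a speech model):

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b):
    """xs: sequence of input vectors; the loop body is effectively one 'layer' per step."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b)   # gradients must flow back through every step
        hs.append(h)
    return hs

T_steps, d_in, d_h = 50, 8, 16
rng = np.random.default_rng(0)
hs = rnn_forward(rng.normal(size=(T_steps, d_in)),
                 0.1 * rng.normal(size=(d_h, d_in)),
                 0.1 * rng.normal(size=(d_h, d_h)),
                 np.zeros(d_h))
```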
Reinforcement Learning & more • Google DeepMind. Human-level control through deep reinforcement learning. Nature, Feb. 2015. • Google DeepMind. Neural Turing Machines. ArXiv 2014.
[Figure: per-game scores of the DQN agent vs. the best linear learner on 49 Atari 2600 games (Video Pinball, Boxing, Breakout, …, Montezuma's Revenge), normalized to human-level performance (100%); games are marked as at/above or below human level.]
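A minimal sketch of the regression target behind deep Q-learning: the network Q(s, a; θ) is trained toward r + γ·maxₐ′ Q(s′, a′; θ⁻). The `next_q_values` would come from the frozen target network; nothing here reproduces the architecture or hyperparameters of the Nature paper:

```python
import numpy as np

def dqn_targets(rewards, next_q_values, terminal, gamma=0.99):
    """rewards: (B,); next_q_values: (B, num_actions) from the frozen target net;
    terminal: (B,) booleans marking end-of-episode transitions."""
    bootstrap = gamma * next_q_values.max(axis=1)      # max over actions in the next state
    return rewards + np.where(terminal, 0.0, bootstrap)

targets = dqn_targets(np.array([1.0, 0.0]),
                      np.array([[0.2, 0.5], [1.0, 0.3]]),
                      np.array([False, True]))
```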
Deep Learning What are the challenges?
Convergence of Optimization • Gradients diminish, lower layers are hard to train • ReLU: empirically faster convergence • Gradients explode or vanish • Clever initialization (preserve variance / scale in each layer) • Xavier and variants: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv 2015. • Identity: Q. V. Le, N. Jaitly, G. E. Hinton. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. ArXiv 2015. • Memory gates: LSTM, Highway Networks (Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber. Highway Networks. ArXiv 2015), etc. • Batch normalization: Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. • Many more tricks out there…
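A minimal sketch of the variance-preserving initializations mentioned above (Xavier/Glorot uniform and the He et al. ReLU variant); dimensions are illustrative:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    # Glorot bound sqrt(6 / (fan_in + fan_out)) keeps activation variance roughly constant
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    # the "Delving Deep into Rectifiers" variant: variance 2 / fan_in for ReLU layers
    return np.random.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W1 = xavier_uniform(784, 256)
W2 = he_normal(256, 128)
```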
Regularization Overfitting problems do exist in deep learning • “Baidu Overfitting Imagenet”: http://www.image-net.org/challenges/LSVRC/announcement-June-2-2015 • Data augmentation commonly used in • computer vision (random translation, rotation, cropping, mirroring…) • speech recognition • e.g. Andrew Y. Ng et al. Deep Speech: Scaling up end-to-end speech recognition. ArXiv 2015. ~100,000 hours (~11 years) of augmented speech data
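A minimal sketch of the vision-style augmentations listed above (random crop plus horizontal mirroring); crop size and input shape are illustrative:

```python
import numpy as np

def augment(img, crop=224, rng=None):
    """img: (H, W, C) array with H, W >= crop; returns a randomly cropped, maybe mirrored patch."""
    rng = rng or np.random.default_rng()
    H, W, _ = img.shape
    top = rng.integers(0, H - crop + 1)
    left = rng.integers(0, W - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]            # horizontal mirror
    return patch

patch = augment(np.random.rand(256, 256, 3))
```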
Regularization Overfitting problems do exist in deep learning • Dropout • Intuition: forced to be robust; model averaging. • Justification • Wager, Stefan, Sida Wang, and Percy S. Liang. “Dropout Training as Adaptive Regularization.” NIPS 2013. • David McAllester. A PAC-Bayesian Tutorial with A Dropout Bound. ArXiv 2013. • Variations: DropConnect, DropLabel… [Figure: dropout results on MNIST and TIMIT; source: http://winsty.net/talks/dropout.pptx]
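A minimal sketch of (inverted) dropout at training time: units are zeroed with probability p and the survivors are rescaled so no change is needed at test time:

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Zero each unit with probability p; scale survivors by 1/(1-p) so E[output] = x."""
    if not train or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

h = dropout(np.ones(10), p=0.5)
```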
Regularization Overfitting problems do exist in deep learning • (Structured) sparsity comes into play • Computer vision: ConvNets — sparse connections with weight sharing • Speech recognition: RNNs — time-index correspondence, weight sharing • Unrolling: structure from algorithms • Behnam Neyshabur, Ryota Tomioka, Nathan Srebro. Norm-Based Capacity Control in Neural Networks. COLT 2015. • Q: is the sparsity pattern learnable?
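A back-of-the-envelope illustration of the structured-sparsity point: a conv layer connects inputs to outputs through a small shared kernel, so it has orders of magnitude fewer free parameters than an unconstrained dense layer (sizes are illustrative):

```python
def dense_params(in_units, out_units):
    return in_units * out_units

def conv_params(in_channels, out_channels, k):
    return in_channels * out_channels * k * k     # weights shared across spatial positions

# Mapping a 32x32x3 input to a 32x32x16 output:
print(dense_params(32 * 32 * 3, 32 * 32 * 16))    # ~50M weights, unconstrained
print(conv_params(3, 16, 3))                       # 432 weights, shared 3x3 kernels
```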
Computation • Hashing • e.g. K. Q. Weinberger et al. Compressing Neural Networks with the Hashing Trick. ICML 2015. • Limited numerical precision computing with stochastic rounding • Suyog Gupta et al. Deep Learning with Limited Numerical Precision. ICML 2015.
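A minimal sketch of the stochastic rounding used in limited-precision training: values are rounded up with probability equal to their fractional part, so rounding is unbiased in expectation (the fixed-point scale is illustrative):

```python
import numpy as np

def stochastic_round(x, scale=256, rng=None):
    """Round x to multiples of 1/scale, rounding up with probability equal to the remainder."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(x, dtype=float) * scale
    floor = np.floor(scaled)
    frac = scaled - floor
    rounded = floor + (rng.random(np.shape(scaled)) < frac)
    return rounded / scale

w_low_precision = stochastic_round(np.array([0.1234, -0.5678]))
```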
Deep Learning Existing Empirical Analysis
Network Visualization • Visualizing the learned filters • Visualizing high-response input images • Adversarial images • Reconstruction (what kind of information is preserved) Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
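A minimal sketch of the simplest visualization above, tiling first-layer filters as small images; the 11×11×3 filter bank below is a random stand-in, not weights from the cited model:

```python
import numpy as np
import matplotlib.pyplot as plt

def show_filters(W, cols=8):
    """W: (num_filters, k, k, 3) first-layer weights, displayed as a grid of patches."""
    n = W.shape[0]
    rows = int(np.ceil(n / cols))
    for i in range(n):
        w = W[i]
        w = (w - w.min()) / (w.max() - w.min() + 1e-8)   # rescale to [0, 1] for display
        plt.subplot(rows, cols, i + 1)
        plt.imshow(w)
        plt.axis("off")
    plt.show()

show_filters(np.random.rand(32, 11, 11, 3))   # random weights stand in for learned ones
```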
Matthew D. Zeiler, Rob Fergus. Visualizing and Understanding Convolutional Networks. ECCV 2014. [Figure; caption from the paper: “the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach.”]
Adversarial images for a trained CNN (or any classifier) • 1st column: original images • 2nd column: perturbations • 3rd column: perturbed images, all classified as “ostrich, Struthio camelus” • Christian Szegedy, …, Rob Fergus. Intriguing properties of neural networks. ICLR 2014.
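A minimal sketch of generating such a perturbation with a single gradient-sign step; note the cited paper uses a box-constrained L-BFGS search, so this simpler variant only illustrates the idea, and `grad_loss_wrt_input` stands in for backpropagation through the classifier:

```python
import numpy as np

def adversarial_example(x, grad_loss_wrt_input, epsilon=0.007):
    """x: input image in [0, 1]; grad_loss_wrt_input: dLoss/dx for the true label."""
    perturbation = epsilon * np.sign(grad_loss_wrt_input)   # tiny, imperceptible step
    return np.clip(x + perturbation, 0.0, 1.0), perturbation

x = np.random.rand(224, 224, 3)               # stand-in image
fake_grad = np.random.randn(224, 224, 3)      # stand-in for the backpropagated gradient
x_adv, delta = adversarial_example(x, fake_grad)
```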
Anh Nguyen, Jason Yosinski, Jeff Clune. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. CVPR 2015. http://www.evolvingai.org/fooling • See also supernormal stimuli for humans and animals: https://imgur.com/a/ibMUn
Reconstruction from each layer of a CNN • Aravindh Mahendran, Andrea Vedaldi. Understanding Deep Image Representations by Inverting Them. CVPR 2015. • Jonathan Long, Ning Zhang, Trevor Darrell. Do Convnets Learn Correspondence? NIPS 2014.
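A minimal sketch of the inversion idea: starting from noise, optimize an image x so that its representation φ(x) matches a target, plus a simple regularizer; the linear φ below is a toy stand-in for a CNN layer, not the regularizers or networks of the cited papers:

```python
import numpy as np

def invert_representation(target_feat, grad_loss, shape,
                          steps=200, lr=0.1, weight_decay=1e-4):
    """grad_loss(x, target_feat) returns d||phi(x) - target_feat||^2 / dx."""
    x = np.random.normal(0.0, 0.1, size=shape)
    for _ in range(steps):
        g = grad_loss(x, target_feat)
        x -= lr * (g + weight_decay * x)       # gradient step plus a quadratic regularizer
    return x

# Toy linear "network" phi(x) = A x so the sketch runs end to end.
A = np.random.randn(16, 64)
phi = lambda x: A @ x
grad_loss = lambda x, t: 2.0 * A.T @ (phi(x) - t)
x_rec = invert_representation(phi(np.random.randn(64)), grad_loss, shape=(64,))
```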