  1. Computer Vision and Deep Learning
  Introduction to Data Science 2019, University of Helsinki
  Mats Sjöberg, mats.sjoberg@csc.fi, CSC – IT Center for Science
  September 23, 2019

  2. Computer vision
  Giving computers the ability to understand visual information. Examples:
  ◮ A robot that can move around obstacles by analysing the input of its camera(s)
  ◮ A computer system finding images of cats among millions of images (e.g., on the Internet)

  3. From picture to pixels
  ◮ The camera image needs to be digitised for computer processing
  ◮ Turning it into millions of discrete picture elements, or pixels (a small code sketch follows below)
  [Figure: a photo captioned “There’s a cat among some flowers in the grass”, with a zoomed-in patch shown as a grid of pixel intensities, e.g. 0.4941, 0.5098, 0.4745, ...]
  ◮ How do we get from pixels to understanding?
  ◮ ... or even some kind of useful/actionable interpretation.
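As a minimal sketch of this digitisation step, assuming Pillow and NumPy are available (the file name is hypothetical), an image can be loaded into a normalised array of pixel intensities like the ones shown on the slide:

```python
import numpy as np
from PIL import Image

# Load an image and convert to grayscale (file name is hypothetical).
img = Image.open("cat_in_flowers.jpg").convert("L")

# Digitise: a 2D array of pixel intensities, scaled to [0, 1],
# like the 0.49..0.52 values shown on the slide.
pixels = np.asarray(img, dtype=np.float32) / 255.0
print(pixels.shape)    # (height, width)
print(pixels[:2, :5])  # a small patch of pixel values
```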

  4. Deep learning
  Before:
  ◮ Hand-crafted features, e.g., colour distributions, edge histograms
  ◮ Complicated feature selection mechanisms
  ◮ “Classical” machine learning, e.g., kernel methods (SVM)
  About 5 years ago: deep learning
  ◮ End-to-end learning, i.e., the network itself learns the features
  ◮ Each layer typically learns a higher level of representation
  ◮ However: entirely data-driven, features can be hard to interpret
  Computer vision was one of the first breakthroughs of deep learning.

  5. Deep learning = neural networks
  Fully connected (or dense) layer: each output neuron is a weighted sum of all inputs, passed through a non-linearity $f$:
  $y_j = f\left(\sum_{i=1}^{n} w_{ji}\, x_i\right)$
  or, in matrix form,
  $\mathbf{y} = f\!\left(\begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & & \ddots & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}\right) = f(W^{\mathsf{T}}\mathbf{x})$
  (we’re ignoring the bias term here; a minimal code sketch follows below)
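A minimal NumPy sketch of this layer (the tanh activation and the shapes are illustrative assumptions, not from the slide):

```python
import numpy as np

def dense(x, W, f=np.tanh):
    """Fully connected layer: y = f(W.T @ x), bias omitted as on the slide."""
    return f(W.T @ x)

rng = np.random.default_rng(0)
n, m = 4, 3                      # n inputs, m outputs
W = rng.normal(size=(n, m))      # one weight per input-output pair: n*m in total
x = rng.normal(size=n)
y = dense(x, W)                  # shape (m,)
print(y)
```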

  6. Learning in neural networks
  ◮ A feedforward network has a huge number of parameters that need to be learned
  ◮ Each output node interacts with every input node via the weights in W: n × m weights (and that’s just one layer!)
  ◮ Learning is typically done with stochastic gradient descent (http://ruder.io/optimizing-gradient-descent/); a toy sketch follows below
  ◮ Gradients for each neuron are obtained with backpropagation
  ◮ Given enough time and data, the network can in theory learn to model any complex phenomenon (universal approximation theorem)
  ◮ In practice, we often use domain knowledge to restrict the number of parameters that need to be learned.
  http://playground.tensorflow.org/
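To make the mechanics concrete, here is a toy sketch of stochastic gradient descent on a one-parameter linear model (the data, learning rate, and loss are all illustrative assumptions; with a single linear unit the backpropagated gradient can be written by hand):

```python
import numpy as np

# Toy SGD: fit y = w*x + b to noisy samples of y = 3x + 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
Y = 3.0 * X + 0.5 + rng.normal(scale=0.1, size=200)

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(X)):      # one sample at a time: "stochastic"
        err = (w * X[i] + b) - Y[i]        # prediction error
        w -= lr * err * X[i]               # gradient of 0.5*err**2 w.r.t. w
        b -= lr * err                      # ... and w.r.t. b
print(w, b)  # should approach 3.0 and 0.5
```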

  7. Deep learning for vision
  While we don’t hand-craft features anymore, in practice we still apply some “expert knowledge” to make learning feasible:
  ◮ Neighbouring pixels are probably related (convolutions)
  ◮ There are common image features which can appear anywhere, such as edges, corners, etc. (weight sharing)
  ◮ Often the exact location of a feature isn’t important (max pooling)
  ⇒ Convolutional neural networks (CNN, ConvNet).

  8. Feedforward to convolutional net
  [Figure: the network changes from a fully connected layer, with a separate weight w_ji for every input–output pair (w_11 ... w_77), to a convolutional layer where each output connects only to a few neighbouring inputs and the same weights w_1, w_2 are shared across all positions]

  9. Convolution in 2D
  ◮ We arrange the input and output neurons in 2D
  ◮ The output is the result of a weighted sum of a small local area in the previous layer – a convolution:
  $S(i, j) = \sum_{m}\sum_{n} I(i+m,\, j+n)\, K(m, n)$
  ◮ The weights $K(m, n)$ are what is learned (a NumPy sketch of this operation follows below).
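A direct NumPy implementation of the slide’s formula (strictly speaking this is cross-correlation, the form usually implemented in deep learning libraries; the edge-detector kernel is an illustrative choice, not from the slide):

```python
import numpy as np

def conv2d(I, K):
    """Valid 2D convolution as on the slide:
    S(i, j) = sum_m sum_n I(i+m, j+n) * K(m, n)."""
    kh, kw = K.shape
    H = I.shape[0] - kh + 1
    W = I.shape[1] - kw + 1
    S = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            S[i, j] = np.sum(I[i:i+kh, j:j+kw] * K)
    return S

# A tiny vertical-edge detector: dark-to-bright transitions respond strongly.
I = np.zeros((6, 6)); I[:, 3:] = 1.0   # image: left half dark, right half bright
K = np.array([[1.0, -1.0]])            # kernel: difference of neighbours
print(conv2d(I, K))                    # non-zero only at the edge column
```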


  11. Learning in layers
  ◮ The convolutional layer learns several sets of weights, each a kind of feature detector
  ◮ These are built up in layers
  ◮ ... until we get our end result, e.g., an object detector: “cat”

  12. Visualising convolutional layers
  [Figure: learned filters, Krizhevsky et al. 2012]

  13. Deconvnet: map activations back to the image space
  Zeiler and Fergus 2014, https://arxiv.org/abs/1311.2901

  14. Real convolutional neural nets
  ◮ What we call CNNs actually also contain other types of operations/layers: fully connected layers, non-linearities
  ◮ Modern CNNs have a huge bag of tricks: pooling, various training shortcuts, 1x1 convolutions, inception modules, residual connections, etc. (one classic network is sketched in code below)
  [Figure: LeNet5 (LeCun et al. 1998) – input 32x32; C1: 6 feature maps 28x28; S2 (subsampling): 6 maps 14x14; C3: 16 maps 10x10; S4: 16 maps 5x5; C5: 120; F6: 84; output: 10 (Gaussian connections)]
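As a sketch of how such a network looks in code, here is a LeNet-5-style model in PyTorch. The layer sizes follow the figure, but the tanh activations, max pooling, and plain linear output are simplifying assumptions; the original used subsampling and Gaussian connections.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet-5-style CNN (LeCun et al. 1998), simplified."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 6 feature maps, 28x28
            nn.Tanh(), nn.MaxPool2d(2),        # S2: 6 maps, 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # C3: 16 maps, 10x10
            nn.Tanh(), nn.MaxPool2d(2),        # S4: 16 maps, 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),        # C5
            nn.Tanh(),
            nn.Linear(120, 84),                # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),        # output
        )

    def forward(self, x):                      # x: (batch, 1, 32, 32)
        return self.classifier(self.features(x))

print(LeNet5()(torch.zeros(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```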

  15. Examples of real CNNs: AlexNet (Krizhevsky et al. 2012)

  16. Examples of real CNNs: GoogLeNet (Szegedy et al. 2014)

  17. Examples of real CNNs: Inception v3 (Szegedy et al. 2015)

  18. Examples of real CNNs: ResNet-152 (He et al. 2015), https://github.com/KaimingHe/deep-residual-networks

  19. Object recognition challenge
  ImageNet benchmark:
  ◮ ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
  ◮ More than 1 million images
  ◮ Task: classify into 1000 object categories.

  20. Object recognition challenge
  ◮ First won by a CNN in 2012 (Krizhevsky et al.)
  ◮ By a wide margin: top-5 error rate dropped from 26% to 16%
  ◮ CNNs have ruled ever since.

  21. Accuracy vs model complexity
  ◮ Accuracy vs number of inference operations
  ◮ Circle size represents the number of parameters
  ◮ Newer nets tend to be more accurate and faster, and have fewer parameters.
  Image from https://arxiv.org/pdf/1605.07678.pdf

  22. Computer vision applications

  23. Object detection and localisation
  Rich feature hierarchies for accurate object detection and semantic segmentation. Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. CVPR 2014.
  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. arXiv:1506.01497

  24. Semantic segmentation
  Learning Deconvolution Network for Semantic Segmentation. Hyeonwoo Noh, Seunghoon Hong, Bohyung Han. arXiv:1505.04366

  25. Object detection and localisation
  https://github.com/facebookresearch/Detectron

  26. Describing an image
  Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. arXiv:1411.4555

  27. Describing an image
  DenseCap: Fully Convolutional Localization Networks for Dense Captioning. Justin Johnson, Andrej Karpathy, Li Fei-Fei. CVPR 2016.

  28. Visual question answering
  VQA: Visual Question Answering. Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh. ICCV 2015.

  29. Generative Adversarial Networks (GANs)
  “The coolest idea in machine learning in the last twenty years” – Yann LeCun
  ◮ We have two networks: a generator and a discriminator
  ◮ The generator produces samples, while the discriminator tries to distinguish between real data items and the generated samples
  ◮ The discriminator tries to learn to classify correctly, while the generator in turn tries to learn to fool the discriminator (see the sketch below).
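A minimal sketch of this adversarial training loop in PyTorch on toy 1-D data (the architectures, data distribution, and hyperparameters are all illustrative assumptions, not from the slide):

```python
import torch
import torch.nn as nn

# G maps noise to samples; D scores how "real" a sample looks.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 2.0   # "real" data: N(2, 0.5)
    fake = G(torch.randn(64, 8))            # generated samples

    # Discriminator: classify real as 1, generated as 0.
    d_loss = (bce(D(real), torch.ones(64, 1)) +
              bce(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator into outputting 1 for fakes.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # should drift towards ~2.0
```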

  30. GAN examples
  Generated bedrooms, https://arxiv.org/abs/1511.06434v2

  31. GAN examples
  Generated “celebrities”. Progressive Growing of GANs for Improved Quality, Stability, and Variation. Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen. arXiv:1710.10196

  32. GAN examples
  CycleGAN: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. https://junyanz.github.io/CycleGAN/

  33. GAN examples
  Generative Adversarial Text to Image Synthesis. https://arxiv.org/pdf/1605.05396.pdf

  34. Neural style
  A Neural Algorithm of Artistic Style. https://arxiv.org/pdf/1508.06576.pdf, https://github.com/jcjohnson/neural-style

  35. AI vs humans?

  36. AI vs humans?
  Recall our ImageNet benchmark ... where do humans stand?
  http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

  37. AI better than humans?
  ◮ Don’t confuse classification accuracy with understanding!
  ◮ Neural nets learn to optimize for a particular problem pretty well
  ◮ But in the end it’s just pixel statistics
  ◮ Humans can generalize and understand the context.

  38. AI better than humans?
  Microsoft CaptionBot: “I think it’s a group of people standing next to a man in a suit and tie.”
  https://karpathy.github.io/2012/10/22/state-of-computer-vision/


  40. Adversarial examples
  ◮ Deep nets can be fooled by deliberately crafted inputs (one common crafting method is sketched below)
  ◮ Revealing: what deep nets learn is quite different from what humans learn
  https://blog.openai.com/adversarial-example-research/
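One common crafting method, not named on the slide, is the fast gradient sign method (FGSM, Goodfellow et al.): nudge every input pixel slightly in the direction that increases the model’s loss. A minimal sketch, assuming `model` is any PyTorch image classifier:

```python
import torch

def fgsm(model, x, label, eps=0.03):
    """Craft an adversarial version of input x (shape: batch, C, H, W).
    `model` and `eps` are illustrative assumptions, not from the slide."""
    x = x.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), label)
    loss.backward()
    # Move each pixel a tiny step in the direction that increases the loss;
    # the result usually looks identical to humans but can flip the prediction.
    return (x + eps * x.grad.sign()).detach().clamp(0, 1)
```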

  41. Conclusion
  ◮ Deep learning has been a big leap for computer vision
  ◮ We can solve some specific problems really well
  ◮ We are still far away from true understanding of visual information.

  42. About CSC
