Computer Vision and Deep Learning. Introduction to Data Science 2019, University of Helsinki. Mats Sjöberg, mats.sjoberg@csc.fi, CSC – IT Center for Science. September 23, 2019
Computer vision
Giving computers the ability to understand visual information. Examples:
◮ A robot that can move around obstacles by analysing the input of its camera(s)
◮ A computer system finding images of cats among millions of images (e.g., on the Internet). 2/45
From picture to pixels
◮ The camera image needs to be digitised for computer processing
◮ Turning it into millions of discrete picture elements, or pixels
[Figure: a photo captioned "There's a cat among some flowers in the grass" next to a grid of its pixel intensity values (0.4941, 0.4745, 0.5098, . . . )]
◮ How do we get from pixels to understanding?
◮ . . . or even some kind of useful/actionable interpretation. 3/45
Deep learning
Before:
◮ Hand-crafted features, e.g., colour distributions, edge histograms
◮ Complicated feature selection mechanisms
◮ "Classical" machine learning, e.g., kernel methods (SVM)
About 5 years ago: deep learning
◮ End-to-end learning, i.e., the network itself learns the features
◮ Each layer typically learns a higher level of representation
◮ However: entirely data-driven, features can be hard to interpret
Computer vision was one of the first breakthroughs of deep learning. 4/45
Deep learning = neural networks
Fully connected or dense layer: every output neuron is connected to every input through a weight matrix $W$ and an activation function $f$:
$$y_j = f\Big(\sum_{i=1}^{n} w_{ji}\, x_i\Big), \qquad \mathbf{y} = f(W^{\mathsf T}\mathbf{x})$$
(we're ignoring the bias term here . . . ) 5/45
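The dense-layer formula above can be written out in a few lines of NumPy (a minimal illustrative sketch, not from the slides; the bias term is again ignored, and the weights and inputs are made-up numbers):

```python
import numpy as np

def dense_layer(x, W, f):
    """Fully connected layer: y = f(W^T x), ignoring the bias term."""
    return f(W.T @ x)

# toy example: 3 inputs, 2 outputs, ReLU activation
relu = lambda z: np.maximum(z, 0.0)
W = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [0.0,  2.0]])   # shape (n_inputs, n_outputs)
x = np.array([1.0, 2.0, 3.0])
y = dense_layer(x, W, relu)   # array([2., 6.])
```

Each output is one weighted sum over all inputs, which is why a single layer already has n × m weights.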
Learning in neural networks
◮ A feedforward network has a huge number of parameters that need to be learned
◮ Each output node interacts with every input node via the weights in W
◮ n × m weights (and that's just one layer!)
◮ Learning is typically done with stochastic gradient descent (http://ruder.io/optimizing-gradient-descent/)
◮ Gradients for each neuron are obtained with backpropagation
◮ Given enough time and data the network can in theory learn to model any complex phenomenon (universal approximation theorem)
◮ In practice, we often use domain knowledge to restrict the number of parameters that need to be learned.
http://playground.tensorflow.org/ 6/45
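As a toy illustration of stochastic gradient descent (not from the slides), here is a single-weight linear model fitted by repeatedly stepping against the gradient of the squared loss on one randomly chosen sample at a time:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: y = 3x + noise (the "true" weight is 3)
X = rng.normal(size=200)
Y = 3.0 * X + 0.1 * rng.normal(size=200)

w, lr = 0.0, 0.05
for step in range(2000):
    i = rng.integers(len(X))             # pick one sample at random
    grad = 2 * (w * X[i] - Y[i]) * X[i]  # gradient of (w*x - y)^2 w.r.t. w
    w -= lr * grad                       # one stochastic gradient step
# w should now be close to the true value 3
```

A real network does the same thing, except the gradient for every weight in every layer is computed by backpropagation rather than by hand.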
Deep learning for vision
While we don't hand-craft features anymore, in practice we still apply some "expert knowledge" to make learning feasible:
◮ Neighbouring pixels are probably related (convolutions)
◮ There are common image features which can appear anywhere, such as edges, corners, etc. (weight sharing)
◮ Often the exact location of a feature isn't important (max pooling)
⇒ Convolutional neural networks (CNN, ConvNet). 7/45
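For instance, max pooling (the third point above) keeps only the strongest response in each local window, which makes the result insensitive to the exact position of a feature within that window. A minimal 2x2 max pooling in NumPy (an illustrative sketch, not from the slides):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array (odd edges are dropped)."""
    h, w = x.shape
    blocks = x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))   # maximum over each 2x2 block

x = np.array([[1, 2, 0, 1],
              [4, 3, 1, 0],
              [0, 0, 2, 2],
              [1, 0, 0, 3]], dtype=float)
max_pool_2x2(x)   # [[4., 1.], [1., 3.]]
```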
Feedforward to convolutional net
[Figure: the network changes from a fully connected layer, where every input–output pair has its own weight (w_11 . . . w_77), to a convolutional layer, where each output connects only to its neighbouring inputs and the same small set of weights (w_1, w_2) is shared across all positions.] 8/45
Convolution in 2D
◮ We arrange the input and output neurons in 2D
◮ The output is the result of a weighted sum over a small local area of the previous layer – a convolution:
$$S(i,j) = \sum_m \sum_n I(i+m,\, j+n)\, K(m,n)$$
◮ The weights $K(m,n)$ are what is learned. 9/45
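The formula above is, strictly speaking, cross-correlation, which is what deep learning libraries usually implement under the name "convolution". A direct NumPy translation (an illustrative sketch, not from the slides; the input and kernel values are made up):

```python
import numpy as np

def conv2d(I, K):
    """Valid 2D cross-correlation: S(i,j) = sum_m sum_n I(i+m, j+n) K(m,n)."""
    kh, kw = K.shape
    oh, ow = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    S = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # weighted sum over a small local patch of the input
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

I = np.arange(16.0).reshape(4, 4)   # a 4x4 "image"
K = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # sums the anti-diagonal of each 2x2 patch
conv2d(I, K)                        # 3x3 output
```

The same small kernel K slides over every position of the image, which is exactly the weight sharing motivated on the previous slide.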
Learning in layers
◮ The convolutional layer learns several sets of weights, each a kind of feature detector
◮ These are built up in layers
◮ Until we get our end result, e.g., an object detector: "cat" 11/45
Visualising convolutional layers Krizhevsky et al 2012 12/45
Deconvnet: map activations back to the image space. Zeiler and Fergus 2014, https://arxiv.org/abs/1311.2901 13/45
Real convolutional neural nets
◮ What we call CNNs actually also contain other types of operations/layers: fully connected layers, non-linearities
◮ Modern CNNs have a huge bag of tricks: pooling, various training shortcuts, 1x1 convolutions, inception modules, residual connections, etc.
[Figure: LeNet5 architecture (LeCun et al 1998): 32x32 input → C1 convolutions (6 feature maps @ 28x28) → S2 subsampling (6 @ 14x14) → C3 convolutions (16 @ 10x10) → S4 subsampling (16 @ 5x5) → C5 full connection (120) → F6 layer (84) → output (10, Gaussian connections)] 14/45
Examples of real CNNs AlexNet (Krizhevsky et al 2012) 15/45
Examples of real CNNs GoogLeNet (Szegedy et al 2014) 16/45
Examples of real CNNs Inception v3 (Szegedy et al 2015) 17/45
Examples of real CNNs ResNet-152 (He et al 2015) https://github.com/KaimingHe/deep-residual-networks 18/45
Object recognition challenge
ImageNet benchmark
◮ ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
◮ More than 1 million images
◮ Task: classify into 1000 object categories. 19/45
Object recognition challenge
◮ First time won by a CNN in 2012 (Krizhevsky et al)
◮ Wide margin: top-5 error rate dropped from 26% to 16%
◮ CNNs have ruled ever since. 20/45
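The top-5 error used in ILSVRC counts a prediction as correct if the true class is among the model's five highest-scoring classes. A sketch of how it can be computed (illustrative, not from the slides; the scores here are random, not a real model's output):

```python
import numpy as np

def top5_error(scores, labels):
    """scores: (n_samples, n_classes); labels: (n_samples,) true class ids."""
    top5 = np.argsort(scores, axis=1)[:, -5:]      # five best classes per sample
    hits = (top5 == labels[:, None]).any(axis=1)   # true class among them?
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 1000))   # random scores over 1000 classes
labels = rng.integers(1000, size=100)
top5_error(scores, labels)              # close to 1 - 5/1000 for random scores
```

Random guessing over 1000 classes leaves a top-5 error of about 99.5%, which puts the drop from 26% to 16% in perspective.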
Accuracy vs model complexity
◮ Accuracy vs number of inference operations
◮ Circle size represents number of parameters
◮ Newer nets are better and faster, and have fewer parameters.
Image from https://arxiv.org/pdf/1605.07678.pdf 21/45
Computer vision applications 22/45
Object detection and localisation Rich feature hierarchies for accurate object detection and semantic segmentation. Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. CVPR 2014. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. arXiv:1506.01497 23/45
Semantic segmentation Learning Deconvolution Network for Semantic Segmentation. Hyeonwoo Noh, Seunghoon Hong, Bohyung Han. arXiv: 1505.04366 24/45
Object detection and localisation https://github.com/facebookresearch/Detectron 25/45
Describing an image Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. arXiv:1411.4555 26/45
Describing an image DenseCap: Fully Convolutional Localization Networks for Dense Captioning, Justin Johnson, Andrej Karpathy, Li Fei-Fei, CVPR 2016. 27/45
Visual question answering VQA: Visual Question Answering. Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh. ICCV 2015. 28/45
Generative Adversarial Networks (GANs)
"The coolest idea in machine learning in the last twenty years" – Yann LeCun
◮ We have two networks: a generator and a discriminator
◮ The generator produces samples, while the discriminator tries to distinguish between real data items and the generated samples
◮ The discriminator tries to learn to classify correctly, while the generator in turn tries to learn to fool the discriminator. 29/45
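The adversarial loop described above can be sketched in one dimension with hand-derived gradients (a toy illustration, not from the slides: the generator is a line, the discriminator a logistic unit, and all data and learning rates are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Generator G(z) = a*z + b tries to mimic real data ~ N(3, 0.5);
# discriminator D(x) = sigmoid(c*x + d) tries to tell real from fake.
a, b = 1.0, 0.0
c, d = 0.1, 0.0
lr = 0.005

for step in range(20000):
    x_real = 3.0 + 0.5 * rng.normal()
    z = rng.normal()
    x_fake = a * z + b

    # --- discriminator step: maximise log D(real) + log(1 - D(fake)) ---
    p_real, p_fake = sigmoid(c * x_real + d), sigmoid(c * x_fake + d)
    g_real, g_fake = (1.0 - p_real), -p_fake   # gradients w.r.t. the two logits
    c += lr * (g_real * x_real + g_fake * x_fake)
    d += lr * (g_real + g_fake)

    # --- generator step: maximise log D(fake) (non-saturating loss) ---
    p_fake = sigmoid(c * x_fake + d)
    g_x = (1.0 - p_fake) * c                   # gradient w.r.t. the fake sample
    a += lr * g_x * z
    b += lr * g_x

# b should drift towards the real mean (3.0) as the generator
# learns to fool the discriminator
```

Note the alternation: the discriminator climbs its objective, then the generator climbs the discriminator's score on fake samples. In real GANs both players are deep networks and the gradients come from backpropagation, but the loop is the same.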
GAN examples Generated bedrooms https://arxiv.org/abs/1511.06434v2 30/45
GAN examples Generated “celebrities” Progressive Growing of GANs for Improved Quality, Stability, and Variation. Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen. arXiv: 1710.10196 31/45
GAN examples CycleGAN Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks https://junyanz.github.io/CycleGAN/ 32/45
GAN examples Generative Adversarial Text to Image Synthesis https://arxiv.org/pdf/1605.05396.pdf 33/45
Neural style A Neural Algorithm of Artistic Style https://arxiv.org/pdf/1508.06576.pdf https://github.com/jcjohnson/neural-style 34/45
AI vs humans? 35/45
AI vs humans? Recall our ImageNet benchmark . . . where do humans stand? http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/ 36/45
AI better than humans?
◮ Don't confuse classification accuracy with understanding!
◮ Neural nets learn to optimise for a particular problem pretty well
◮ But in the end it's just pixel statistics
◮ Humans can generalise and understand the context. 37/45
AI better than humans? Microsoft CaptionBot: “I think it’s a group of people standing next to a man in a suit and tie.” https://karpathy.github.io/2012/10/22/state-of-computer-vision/ 38/45
Adversarial examples
◮ Deep nets can be fooled by deliberately crafted inputs
◮ Revealing: what deep nets learn is quite different from what humans learn
https://blog.openai.com/adversarial-example-research/ 40/45
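The best-known recipe for crafting such inputs is the fast gradient sign method: nudge the input a small step in the direction of the sign of the loss gradient, x' = x + ε·sign(∇ₓL). A sketch on a simple logistic model (illustrative, not from the slides; the weights and input are made-up numbers, not a trained network):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def logistic_loss(w, x, y):
    """y in {-1, +1}; loss = log(1 + exp(-y * w.x))."""
    return np.log1p(np.exp(-y * (w @ x)))

# a fixed "trained" model and a correctly classified input
w = np.array([1.0, -2.0, 0.5])
x = np.array([2.0, -1.0, 1.0])
y = 1

# gradient of the loss with respect to the *input* x (not the weights)
grad_x = -y * w * sigmoid(-y * (w @ x))

eps = 0.25
x_adv = x + eps * np.sign(grad_x)   # fast gradient sign method

# the bounded perturbation strictly increases the loss
logistic_loss(w, x, y) < logistic_loss(w, x_adv, y)   # True
```

The key point matches the slide: the attack moves each pixel only a tiny, imperceptible amount, yet it is chosen precisely along the model's own gradient, exploiting pixel statistics no human relies on.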
Conclusion
◮ Deep learning has been a big leap for computer vision
◮ We can solve some specific problems really well
◮ Still far away from true understanding of visual information 41/45
About CSC 42/45