15-780 – Graduate Artificial Intelligence: Convolutional and recurrent networks. J. Zico Kolter (this lecture) and Ariel Procaccia. Carnegie Mellon University, Spring 2017 1
Outline Convolutional neural networks Applications of convolutional networks Recurrent networks Applications of recurrent networks 2
Outline Convolutional neural networks Applications of convolutional networks Recurrent networks Applications of recurrent networks 3
The problem with fully-connected networks A 256x256 (RGB) image ⇒ ~200K dimensional input x A fully connected network would need a very large number of parameters and would be very likely to overfit the data A generic deep network also does not capture the "natural" invariances we expect in images (translation, scale) [Figure: fully connected layer mapping z_i to z_{i+1}] 4
Convolutional neural networks To create architectures that can handle large images, we restrict the weights in two ways: 1. Require that activations between layers only occur in a "local" manner 2. Require that all activations share the same weights [Figure: locally connected layer from z_i to z_{i+1} with shared weights W_i] These two restrictions lead to an architecture known as a convolutional neural network 5
Convolutions Convolutions are a basic primitive in many computer vision and image processing algorithms Idea is to "slide" the weights w (called a filter) over the image to produce a new image, written y = z ∗ w [Figure: a 3x3 filter w slid over a 5x5 image z to produce a 3x3 output y, e.g. y_11 = z_11 w_11 + z_12 w_12 + z_13 w_13 + z_21 w_21 + ..., y_12 = z_12 w_11 + z_13 w_12 + z_14 w_13 + z_22 w_21 + ..., etc.] 6
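To make the sliding-filter picture concrete, here is a minimal sketch (using NumPy; not from the slides) of the operation y = z ∗ w for a 5x5 image and a 3x3 filter with no padding:

```python
import numpy as np

def conv2d_valid(z, w):
    """Slide the filter w over the image z (no padding):
    y[i, j] = sum_{a, b} z[i + a, j + b] * w[a, b]."""
    H, W = z.shape
    k, l = w.shape
    y = np.zeros((H - k + 1, W - l + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(z[i:i + k, j:j + l] * w)
    return y

z = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
w = np.ones((3, 3)) / 9.0                      # 3x3 averaging filter
print(conv2d_valid(z, w).shape)                # -> (3, 3), as in the figure above
```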
Convolutions in image processing Convolutions (typically with prespecified filters) are a common operation in many computer vision applications [Figure: original image z; Gaussian blur z ∗ g with the 5x5 kernel [1 4 7 4 1; 4 16 26 16 4; 7 26 41 26 7; 4 16 26 16 4; 1 4 7 4 1]/273; image gradient formed from the two Sobel filters g_x = [-1 0 1; -2 0 2; -1 0 1] and g_y = [-1 -2 -1; 0 0 0; 1 2 1]] 7
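As an illustration, here is a hedged sketch (assuming NumPy and SciPy; the test image is a random stand-in) that applies the Gaussian blur kernel and Sobel-style gradient filters above. Note that scipy.signal.convolve2d performs a true (flipped) convolution, which does not matter for the symmetric Gaussian kernel and only changes the sign of the Sobel responses:

```python
import numpy as np
from scipy.signal import convolve2d

gauss = np.array([[1,  4,  7,  4, 1],
                  [4, 16, 26, 16, 4],
                  [7, 26, 41, 26, 7],
                  [4, 16, 26, 16, 4],
                  [1,  4,  7,  4, 1]], dtype=float) / 273.0

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

z = np.random.rand(64, 64)                      # stand-in grayscale image
blurred = convolve2d(z, gauss, mode='same', boundary='symm')
gx = convolve2d(z, sobel_x, mode='same', boundary='symm')
gy = convolve2d(z, sobel_y, mode='same', boundary='symm')
gradient = np.abs(gx) + np.abs(gy)              # one way to combine the two responses
```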
Convolutional neural networks The idea of a convolutional neural network, in some sense, is to let the network "learn" the right filters for the specified task In practice we actually use "3D" convolutions, which apply a separate convolution to each channel (layer) of the image, then add the results together [Figure: input z_i convolved with filters (W_i)_1, (W_i)_2 to produce the channels of z_{i+1}] 8
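A minimal sketch (NumPy; shapes are illustrative assumptions) of the "3D" convolution described above: one 2D filter per input channel, with the per-channel results summed into a single output map. A real layer would stack many such filters to produce many output channels:

```python
import numpy as np

def conv3d_channels(z, w):
    """z: (C, H, W) multi-channel input, w: (C, k, k) filter.
    Applies a separate 2D convolution to each channel and sums the results."""
    C, H, W = z.shape
    k = w.shape[1]
    out = np.zeros((H - k + 1, W - k + 1))
    for c in range(C):                           # separate convolution per channel
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] += np.sum(z[c, i:i + k, j:j + k] * w[c])
    return out

z = np.random.rand(3, 32, 32)                    # e.g. an RGB image
w = np.random.rand(3, 5, 5)                      # one 5x5 filter per channel
print(conv3d_channels(z, w).shape)               # -> (28, 28)
```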
Additional notes on convolutions For anyone with a signal processing background: this is actually not what you would call a convolution; it is a correlation (a convolution with the filter flipped upside-down and left-right) It's common to "zero pad" the input image so that the resulting image is the same size Also common to use a max-pooling operation that shrinks the image by taking the max over each region (strided convolutions are also common) [Figure: 2x2 max pooling from z_i to z_{i+1}] 9
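For concreteness, a small sketch (NumPy; sizes are illustrative) of the two operations mentioned above, zero-padding so the output keeps the input size, and 2x2 max-pooling that shrinks each dimension by half:

```python
import numpy as np

def zero_pad(z, p):
    """Pad p zeros on every side of the image."""
    return np.pad(z, p, mode='constant')

def max_pool(z, s=2):
    """Shrink the image by taking the max over each s x s block."""
    H, W = z.shape
    return z[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s).max(axis=(1, 3))

z = np.random.rand(6, 6)
print(zero_pad(z, 1).shape)    # -> (8, 8): "same" padding for a 3x3 filter
print(max_pool(z, 2).shape)    # -> (3, 3): each 2x2 region reduced to its max
```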
Number of parameters Consider a convolutional network that takes as input color (RGB) 32x32 images, and uses the layers (all convolutional layers use zero-padding) 1. 5x5x64 convolution 2. 2x2 Maxpooling 3. 3x3x128 convolution 4. 2x2 Maxpooling 5. Fully-connected to 10-dimensional output How many parameters does this network have? 1. O(10^3) 2. O(10^4) 3. O(10^5) 4. O(10^6) 10
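A hedged sketch of the parameter-count arithmetic for the network above, assuming "5x5x64" means 64 filters of size 5x5 over all input channels, that each convolution and the dense layer have bias terms, and that max-pooling layers have no parameters:

```python
conv1 = 5 * 5 * 3 * 64 + 64        # 5x5 filters over 3 input channels, 64 maps
conv2 = 3 * 3 * 64 * 128 + 128     # 3x3 filters over 64 channels, 128 maps
# two 2x2 max-poolings: 32 -> 16 -> 8, so the final feature map is 8 x 8 x 128
fc = 8 * 8 * 128 * 10 + 10         # fully-connected layer to the 10 outputs
print(conv1 + conv2 + fc)          # ~1.6e5
```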
Learning with convolutions How do we apply backpropagation to neural networks with convolutions? z_{i+1} = f_i(z_i ∗ w_i + b_i) Remember that for a dense layer z_{i+1} = f_i(W_i z_i + b_i), the forward pass required multiplication by W_i and the backward pass required multiplication by W_i^T We're going to show that convolution is a type of (highly structured) matrix multiplication, and show how to compute multiplication by its transpose 11
Convolutions as matrix multiplication Consider initially a 1D convolution z_i ∗ w_i for w_i ∈ R^3, z_i ∈ R^6 Then z_i ∗ w_i = W_i z_i for

W_i = [ w_1  w_2  w_3  0    0    0
        0    w_1  w_2  w_3  0    0
        0    0    w_1  w_2  w_3  0
        0    0    0    w_1  w_2  w_3 ]

So how do we multiply by W_i^T? 12
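A small sketch (NumPy) verifying this: build the banded matrix W_i above for a particular w_i ∈ R^3 and check that W_i z_i matches the sliding-filter operation (np.correlate slides w over z without flipping, which matches the "convolution" used in these slides):

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])                    # w_i in R^3
z = np.random.rand(6)                            # z_i in R^6

W = np.zeros((4, 6))                             # rows are shifted copies of w
for i in range(4):
    W[i, i:i + 3] = w

assert np.allclose(W @ z, np.correlate(z, w, mode='valid'))
```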
Convolutions as matrix multiplication, cont Multiplication by the transpose is just

g^(i) = W_i^T g^(i+1) = [ w_1  0    0    0
                          w_2  w_1  0    0
                          w_3  w_2  w_1  0
                          0    w_3  w_2  w_1
                          0    0    w_3  w_2
                          0    0    0    w_3 ] g^(i+1) = pad(g^(i+1)) ∗ w̃_i

where w̃_i is just the flipped version of w_i In other words, the transpose of a convolution is just a (zero-padded) convolution with the flipped filter (an actual convolution, for signal processing people) The same property holds for 2D convolutions, so backprop just convolves with flipped filters 13
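And a matching sketch (NumPy, same setup as the previous one) checking this claim: multiplying a backward-pass vector by W_i^T equals zero-padding it and correlating with the flipped filter, i.e. a full convolution with w_i:

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])
W = np.zeros((4, 6))
for i in range(4):
    W[i, i:i + 3] = w

g_next = np.random.rand(4)                          # gradient from layer i+1
lhs = W.T @ g_next                                  # multiplication by the transpose

padded = np.pad(g_next, 2)                          # zero-pad by (filter size - 1)
rhs = np.correlate(padded, w[::-1], mode='valid')   # correlate with flipped filter
assert np.allclose(lhs, rhs)
assert np.allclose(lhs, np.convolve(g_next, w))     # i.e., a true (full) convolution
```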
Outline Convolutional neural networks Applications of convolutional networks Recurrent networks Applications of recurrent networks 14
LeNet network, digit classification The network that started it all (and then stopped for ~14 years) [Figure: LeNet-5 architecture — INPUT 32x32 → (convolutions) C1: feature maps 6@28x28 → (subsampling) S2: f. maps 6@14x14 → (convolutions) C3: f. maps 16@10x10 → (subsampling) S4: f. maps 16@5x5 → (full connection) C5: layer 120 → (full connection) F6: layer 84 → (Gaussian connections) OUTPUT 10] LeNet-5 (LeCun et al., 1998) architecture achieves 1% error in MNIST digit classification 15
Image classification Recent ImageNet classification challenges 16
Using intermediate layers as features Increasingly common to use later-stage layers of pre-trained image classification networks as features for new image classification tasks https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html Classify dogs/cats based upon 2000 images (1000 of each class): Approach 1: convolutional network from scratch: 80% Approach 2: final-layer features from the VGG network -> dense net: 90% Approach 3: also fine-tune the last convolutional features: 94% 17
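A hedged sketch of "Approach 2" in the Keras blog post linked above: use a pre-trained VGG16 network (classification head removed) as a frozen feature extractor and train a small dense classifier on top. The image size, layer sizes, and optimizer here are illustrative assumptions, not taken from the slides or the post:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))
base.trainable = False                        # freeze the convolutional features

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),    # dogs vs. cats
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=..., validation_data=...)
# "Approach 3" would additionally unfreeze the last convolutional block and
# fine-tune it with a small learning rate.
```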
Playing Atari games 18
Neural style Adjust the input image to make its feature activations (really, inner products of feature activations) match those of target (art) images (Gatys et al., 2016) 19
Detecting cancerous cells in images https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html 20
Outline Convolutional neural networks Applications of convolutional networks Recurrent networks Applications of recurrent networks 21
Predicting temporal data So far, the models we have discussed apply to independent inputs x^(1), …, x^(m) In practice, we often want to predict a sequence of outputs given a sequence of inputs (predicting each output independently would miss correlations) [Figure: input sequence x^(1), x^(2), x^(3), ... with corresponding outputs y^(1), y^(2), y^(3), ...] Examples: time series forecasting, sentence labeling, speech to text, etc 22
Recurrent neural networks Maintain a hidden state over time; the hidden state is a function of the current input and the previous hidden state:

z_t = f_z(W_xz x_t + W_zz z_{t-1} + b_z)
ŷ_t = f_y(W_zy z_t + b_y)

[Figure: unrolled RNN — inputs x^(1), x^(2), x^(3) feed hidden states z^(1), z^(2), z^(3) through W_xz; hidden states are connected over time through W_zz; outputs ŷ^(1), ŷ^(2), ŷ^(3) are produced through W_zy] 23
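A minimal sketch (NumPy) of the forward pass defined by these equations, using tanh for f_z and an identity f_y as illustrative choices, with small, arbitrary dimensions:

```python
import numpy as np

n_x, n_z, n_y, T = 4, 8, 2, 10
W_xz = 0.1 * np.random.randn(n_z, n_x)
W_zz = 0.1 * np.random.randn(n_z, n_z)
W_zy = 0.1 * np.random.randn(n_y, n_z)
b_z, b_y = np.zeros(n_z), np.zeros(n_y)

xs = [np.random.randn(n_x) for _ in range(T)]    # input sequence x_1, ..., x_T
z = np.zeros(n_z)                                # initial hidden state
y_hats = []
for x in xs:
    z = np.tanh(W_xz @ x + W_zz @ z + b_z)       # hidden state update
    y_hats.append(W_zy @ z + b_y)                # per-step prediction yhat_t
```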
Training recurrent networks Most common training approach is to "unroll" the RNN on some dataset and minimize the loss function

minimize_{W_xz, W_zz, W_zy}  Σ_{t=1}^T ℓ(ŷ_t, y_t)

Note that the network will have the "same" parameters in a lot of places (e.g., the same W_zz matrix occurs at each step); the advantage of the computation graph approach is that it's easy to compute these complex gradients Some issues: initializing the first hidden layer (just set it to all zeros), and how long a sequence to unroll (pick something big, like >100) 24
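The objective above can be sketched directly as an "unrolled" loss; note that the same W_xz, W_zz, W_zy are reused at every time step. This sketch (NumPy, with squared error as a stand-in for ℓ) only computes the loss; in practice a computation-graph framework would differentiate it automatically:

```python
import numpy as np

def unrolled_loss(params, xs, ys):
    W_xz, W_zz, W_zy, b_z, b_y = params
    z = np.zeros(W_zz.shape[0])              # initialize the first hidden state to zeros
    total = 0.0
    for x_t, y_t in zip(xs, ys):             # the same parameters appear at every step
        z = np.tanh(W_xz @ x_t + W_zz @ z + b_z)
        y_hat = W_zy @ z + b_y
        total += np.sum((y_hat - y_t) ** 2)  # squared error as a stand-in for l(yhat, y)
    return total
```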