Deep Neural Networks
Convolutional Networks II
Bhiksha Raj
Story so far
Pattern classification tasks such as "does this picture contain a cat" or "does this recording include HELLO" are best performed by scanning for the target pattern
– Scanning is equivalent to scanning with individual neurons
– First-level neurons scan the input
– Higher-level neurons scan the "maps" formed by lower-level neurons
– A final "decision" unit or layer makes the final decision
– What is the neural process from eye to recognition?
– Early understanding was largely based on behavioral studies and on gestalt theory
– But there was no real understanding of how the brain processed images
Hubel and Wiesel, 1959: "Receptive Fields in Cat Striate Cortex"
– "Striate" is defined by structure; "V1" is a functional definition
– Anaesthetized with truth serum; electrodes into brain
– Defines the immediate (20 ms) response of these cells
– The regions of the retina that drove individual units were called receptive fields
– These fields were usually subdivided into excitatory and inhibitory regions
– A light stimulus covering the whole receptive field, or diffuse illumination of the whole retina, was ineffective in driving most units, as excitatory regions cancelled inhibitory regions
– Receptive fields could be oriented in a vertical, horizontal or oblique manner
– A spot of light gave greater response for some directions of movement than others
[Figures: receptive fields in mice and monkey (from Huberman and Niell, 2011) and of cat striate cortex neurons (from Hubel and Wiesel)]
– Composition was inferred because lower-level neurons responding to a slit also responded to patterns of spots if they were aligned with the same orientation as the slit
– In the striate cortex, two levels of processing could be identified
– Neurons referred to as simple S-cells and complex C-cells
– Both types responded to oriented slits of light, but complex cells were not "confused" by spots of light while simple cells could be
– Transform from circular retinal receptive fields to elongated fields for simple cells; the simple cells are susceptible to fuzziness and noise
– Composition of complex receptive fields from simple cells: the C-cell responds to the largest output from a bank of S-cells, achieving an oriented response that is robust to distortion
– The S-cells "tune" the response: the C-cell has a response similar to the simple cells', going from the noisy response of simple cells to the cleaner response of complex cells
A model of early neural responses:
– Successive transformations through Simple-Complex combination layers
– Too horrible to recall
The Neocognitron (Kunihiko Fukushima)
– Each stage comprises a layer of "S-cells" followed by a layer of "C-cells"
– $U_{S,l}$ is the $l$th layer of S-cells; $U_{C,l}$ is the $l$th layer of C-cells
– S-cells learn their response to the previous layer; C-cells have a fixed response
Figures from Fukushima, ‘80
– All the cells within an S-plane have identical learned responses
– One C-plane per S-plane; all C-cells have an identical fixed response
– Each cell in a plane "looks" at a slightly shifted region of the input to the plane than the adjacent cells in the plane
– S-cells detect specific patterns in the previous layer (C layer or retina)
– C-cells pool the responses of the corresponding planes of the S layers
– These strange functions could simply be replaced with a ReLU and a max
Learning in the Neocognitron is unsupervised and Hebbian:
– The update is the product of input and output: $\Delta w_{jk} = y_j z_k$
– Only the maximum-valued S-cell in each plane is selected for update
– Also viewed as the max-valued cell from each S column
– This ensures that only one of the planes picks up any given feature
– But across all positions, multiple planes will be selected
– E.g. Given many examples of the character “A” the different cell planes in the S-C layers may learn the patterns shown
– Going up the layers goes from local to global receptive fields
– Produces a class-label output
– All the S-cells within an S-plane have the same weights, in every layer
– C-cells are not updated
– Assuming square receptive fields, rather than elliptical ones
– Receptive field of S-cells in the $l$th layer is $K_l \times K_l$
– Receptive field of C-cells in the $l$th layer is $L_l \times L_l$

$U_{S,l,n}(j,k) = \sigma\left( \sum_p \sum_{i=1}^{K_l} \sum_{m=1}^{K_l} w_{S,l,n}(p,i,m)\, U_{C,l-1,p}(j+i,\, k+m) \right)$

$U_{C,l,n}(j,k) = \max_{i \in (j,\, j+L_l),\ m \in (k,\, k+L_l)} U_{S,l,n}(i,m)$
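To make these two operations concrete, here is a minimal NumPy sketch of one S-plane followed by one C-plane, using the ReLU-and-max simplification suggested above. The names and shapes (`prev_maps`, `w`, the 16×16 retina) are illustrative assumptions, not Fukushima's original formulation.

```python
import numpy as np

def s_plane(prev_maps, w):
    """One S-plane: ReLU of a K x K weighted sum over all input planes.
    prev_maps: (P, H, W) stack of C-plane (or retina) outputs
    w:         (P, K, K) learned weights shared by all cells in this plane"""
    P, K, _ = w.shape
    _, H, W = prev_maps.shape
    out = np.zeros((H - K + 1, W - K + 1))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            # each cell looks at a slightly shifted region of the input
            out[j, k] = max(0.0, np.sum(w * prev_maps[:, j:j+K, k:k+K]))
    return out

def c_plane(s_map, L):
    """One C-plane: max over L x L windows of the corresponding S-plane."""
    H, W = s_map.shape
    out = np.zeros((H - L + 1, W - L + 1))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            out[j, k] = s_map[j:j+L, k:k+L].max()
    return out

retina = np.random.rand(1, 16, 16)   # single input plane (toy "retina")
w = np.random.randn(1, 5, 5)         # one S-plane's learned weights
u_s = s_plane(retina, w)             # 12 x 12 S response
u_c = c_plane(u_s, 3)                # 10 x 10 C response, robust to small shifts
```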
A convolutional neural network comprises "convolution" and "down-sampling" layers
– The two may occur in any sequence, but typically they alternate
– They are followed by a multi-layer perceptron that produces the output
– Their parameters must be learned from training data for the target classification task
Each convolution layer computes, from the maps of the previous layer:
– A linear map, obtained by convolution over maps in the previous layer
– An activation that operates on the output of the convolution
Example 5x5 image with binary pixels
Example: 3×3 filter

$A(j,k) = \sum_{l=1}^{3} \sum_{m=1}^{3} f(l,m)\, I(j+l,\, k+m) + b \quad (b = \text{bias})$
– At each location, the filter and the underlying map values are multiplied component-wise, and the products are summed, along with the bias
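A minimal NumPy sketch of this scanning computation. The function name `conv2d_scan` is mine; the filter is applied exactly as written above, with a single shared bias.

```python
import numpy as np

def conv2d_scan(image, filt, bias, stride=1):
    """Slide `filt` over `image`; at each position, multiply component-wise,
    sum the products, and add the bias."""
    K = filt.shape[0]
    H, W = image.shape
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.zeros((out_h, out_w))
    for j in range(out_h):
        for k in range(out_w):
            patch = image[j*stride:j*stride+K, k*stride:k*stride+K]
            out[j, k] = np.sum(filt * patch) + bias
    return out

image = np.random.randint(0, 2, (5, 5))    # 5x5 binary image, as in the example
filt = np.random.randn(3, 3)               # 3x3 filter
print(conv2d_scan(image, filt, bias=0.1))  # 3x3 output map
```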
– The filter may proceed by more than 1 pixel at a time
– E.g. with a "stride" of two pixels per shift (see the sketch below)
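With the `conv2d_scan` sketch from above, a stride of two pixels per shift just changes the step size, and the output map shrinks accordingly.

```python
# Stride 2: the 3x3 filter lands on every other pixel,
# giving a 2x2 output map from the 5x5 image above.
print(conv2d_scan(image, filt, bias=0.1, stride=2))
```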
A convolution layer operates over all the maps in the previous layer
– We change the visualization of all the maps in a layer from a vertical arrangement to a stacked arrangement
– The filter is applied to the entire stack of maps (convolutive component plus bias)
$A(j,k) = \sum_p \sum_{l=1}^{L} \sum_{m=1}^{L} w_p(l,m)\, Y_p(j+l,\, k+m) + b \quad (b = \text{bias})$

– The inner sums apply one $L \times L$ slice of the filter to one map; the outer sum over $p$ accumulates the contributions of all maps in the previous layer
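A hedged NumPy sketch of this stacked-map version of the equation: the filter now carries one $L \times L$ slice per input map, with a single bias per output map. The name `conv_over_maps` is illustrative.

```python
import numpy as np

def conv_over_maps(maps, filt, bias):
    """maps: (P, H, W) stack of input maps; filt: (P, L, L); scalar bias.
    Implements A(j,k) = sum_p sum_l sum_m w_p(l,m) Y_p(j+l, k+m) + b."""
    P, L, _ = filt.shape
    _, H, W = maps.shape
    out = np.zeros((H - L + 1, W - L + 1))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            # component-wise product over the whole (P, L, L) block, then sum
            out[j, k] = np.sum(filt * maps[:, j:j+L, k:k+L]) + bias
    return out

maps = np.random.rand(4, 8, 8)     # 4 maps from the previous layer
filt = np.random.randn(4, 3, 3)    # one 3x3 filter slice per map
print(conv_over_maps(maps, filt, bias=0.0).shape)  # (6, 6)
```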
– Assuming you're not allowed to go beyond the edge of the input, the output map is smaller than the input
– If the stride is 1, the output map should ideally be the same size as the input
– For stride 1, the result of the convolution is the same size as the original image, provided the image is zero-padded appropriately
[Figure: the image is zero-padded before being scanned by the filter (plus bias)]
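One way to realize this "same size" behavior in NumPy is to zero-pad before scanning; a sketch, reusing the illustrative `conv2d_scan` helper and `image`/`filt` arrays from earlier.

```python
import numpy as np

K = 3                       # filter width (odd)
pad = (K - 1) // 2          # zeros added on each side of the image
padded = np.pad(image, pad, mode='constant')   # 5x5 -> 7x7
same = conv2d_scan(padded, filt, bias=0.1)     # output is 5x5 again
```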
Convolution is equivalent to scanning the input with an MLP
– Just run the entire MLP on each block separately, and combine results
– Even computationally, the number of operations in both computations is identical
$z(j,k) = \sum_m \sum_n w(m,n)\, y(j+m,\, k+n)$

– This is, strictly speaking, a correlation rather than a convolution
– $N^2$ multiplications at each of $M^2$ positions, for a filter with $N^2$ taps scanned over $M^2$ output positions
– Expensive, for large filters
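As noted, the "convolution" used in these layers is strictly a correlation; a true convolution flips the filter. A quick check with SciPy, assuming it is available:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

y = np.random.randn(6, 6)
w = np.random.randn(3, 3)

corr = correlate2d(y, w, mode='valid')          # what CNN layers compute
conv = convolve2d(y, np.flip(w), mode='valid')  # true convolution of the flipped filter
assert np.allclose(corr, conv)                  # identical results
```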
The network also comprises "downsampling" (or "pooling") layers
– Often, they alternate with convolution, though this is not necessary
Max pool with 2×2 filters and stride 2
Mean pool with 2×2 filters and stride 2 [figure: example outputs 3.25, 5.25, 2, 2]
– The pooling applies to each 2×2 block and strides by 2 in this example
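A hedged NumPy sketch of both pooling variants with 2×2 blocks and stride 2 (the function name is mine):

```python
import numpy as np

def pool2x2(y, mode="max"):
    """Pool non-overlapping 2x2 blocks with stride 2."""
    H, W = y.shape
    # group the map into (H/2, 2, W/2, 2) blocks, then reduce each block
    blocks = y[:H//2*2, :W//2*2].reshape(H//2, 2, W//2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

y = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(y, "max"))    # 2x2 map of block maxima
print(pool2x2(y, "mean"))   # 2x2 map of block means
```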
"Network in network": filters extend over all maps of the previous layer, so even 1×1 filters are meaningful
For an $I \times I$ input image, typical convolution layer choices:
– Filters are "3-D" (the third dimension is colour); convolution is typically followed by a ReLU activation
– Typically $K_1$ (the number of filters) is a power of 2, e.g. 2, 4, 8, 16, 32, ...
– Filters are typically 5×5(×3), 3×3(×3), or even 1×1(×3) ("what on earth is this?"), small enough to capture fine features (particularly important for scaled-down images)
– Typical stride: 1 or 2
The first convolution layer: $K_1$ filters of size $L_1 \times L_1 \times 3$ operate on the $I \times I$ RGB input image, producing $K_1$ maps $Y_1^{(1)}, \dots, Y_{K_1}^{(1)}$:

$z_n^{(1)}(j,k) = \sum_{c \in \{R,G,B\}} \sum_{l=1}^{L_1} \sum_{m=1}^{L_1} w_n^{(1)}(c,l,m)\, I_c(j+l,\, k+m) + b_n^{(1)}$

$Y_n^{(1)}(j,k) = f\big(z_n^{(1)}(j,k)\big)$

The layer thus includes a convolution operation followed by an activation (typically ReLU).
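Putting the first layer together: a sketch of $K_1$ filters of size $L_1 \times L_1 \times 3$ scanning an RGB image, followed by ReLU. All shapes and names here are illustrative assumptions.

```python
import numpy as np

def conv_layer1(image, filters, biases):
    """image: (3, I, I) RGB; filters: (K1, 3, L, L); biases: (K1,).
    Returns Y: (K1, I-L+1, I-L+1), the ReLU of the affine maps z_n."""
    K1, C, L, _ = filters.shape
    _, I, _ = image.shape
    out = np.zeros((K1, I - L + 1, I - L + 1))
    for n in range(K1):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # affine term: sum over colour channels and filter extent, plus bias
                z = np.sum(filters[n] * image[:, j:j+L, k:k+L]) + biases[n]
                out[n, j, k] = max(0.0, z)   # ReLU activation
    return out

rgb = np.random.rand(3, 32, 32)
W1 = np.random.randn(8, 3, 5, 5) * 0.1   # K1 = 8 filters of 5x5x3
Y1 = conv_layer1(rgb, W1, np.zeros(8))   # 8 maps of 28 x 28
```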
The first pooling layer operates on each map $Y_n^{(1)}$ independently: it pools $P \times P$ blocks, with a stride of $D$ between adjacent blocks, producing $K_1$ pooled maps $U_1^{(1)}, \dots, U_{K_1}^{(1)}$ of size $(I/D) \times (I/D)$:

$U_n^{(1)}(j,k) = \max_{l \in \{(j-1)D+1, \dots, jD\},\ m \in \{(k-1)D+1, \dots, kD\}} Y_n^{(1)}(l,m)$

– For max pooling, during training keep track of which position had the highest value
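A sketch of a pooling layer that also records, for each output, which input position held the maximum, as the bullet above requires for training. The `pos` index array is my own bookkeeping choice.

```python
import numpy as np

def maxpool_with_argmax(y, P, D):
    """Pool P x P blocks of one map with stride D.
    Returns the pooled map and the (row, col) of each maximum."""
    H, W = y.shape
    out_h, out_w = (H - P) // D + 1, (W - P) // D + 1
    out = np.zeros((out_h, out_w))
    pos = np.zeros((out_h, out_w, 2), dtype=int)   # stored argmax positions
    for j in range(out_h):
        for k in range(out_w):
            block = y[j*D:j*D+P, k*D:k*D+P]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            out[j, k] = block[r, c]
            pos[j, k] = (j*D + r, k*D + c)         # position in the full map
    return out, pos

U1, pos1 = maxpool_with_argmax(np.random.rand(28, 28), P=2, D=2)  # 14 x 14
```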
Pooling parameters to choose: the size of the pooling block $P$ and the pooling stride $D$
Choices: max pooling or mean pooling? Or learned pooling?
For max pooling, the position of the maximum is stored for later use in backpropagation:

$U_n^{(1)}(j,k) = Y_n^{(1)}\big(\pi_n^{(1)}(j,k)\big), \qquad \pi_n^{(1)}(j,k) = \underset{l \in \{(j-1)D+1,\dots,jD\},\ m \in \{(k-1)D+1,\dots,kD\}}{\arg\max}\, Y_n^{(1)}(l,m)$

(Here $\pi_n^{(1)}(j,k)$ denotes the stored position of the maximum.)
The subsequent convolution layer operates on the stack of pooled maps: $K_2$ filters $w_n^{(2)}$, each of size $K_1 \times L_2 \times L_2$, $n = 1 \dots K_2$
[Diagram: $K_1$ maps of size $I \times I$ are pooled ($P \times P$ blocks, stride $D$) into $K_1$ maps of size $I/D \times I/D$, which are then convolved into $K_2$ maps]

In general, for the $o$th convolution layer:

$z_n^{(o)}(j,k) = \sum_{s=1}^{K_{o-1}} \sum_{l=1}^{L_o} \sum_{m=1}^{L_o} w_n^{(o)}(s,l,m)\, U_s^{(o-1)}(j+l,\, k+m) + b_n^{(o)}$

$Y_n^{(o)}(j,k) = f\big(z_n^{(o)}(j,k)\big)$
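Composing the pieces: a self-contained sketch of a generic conv-then-pool stage and a small two-stage forward pass. All layer sizes and names are assumptions for illustration.

```python
import numpy as np

def conv_layer(U_prev, W, b):
    """U_prev: (K_prev, H, W); W: (K, K_prev, L, L); b: (K,).
    Implements z_n = sum_s sum_l sum_m w_n(s,l,m) U_s(j+l,k+m) + b_n, then ReLU."""
    K, _, L, _ = W.shape
    _, H, Wd = U_prev.shape
    Y = np.zeros((K, H - L + 1, Wd - L + 1))
    for n in range(K):
        for j in range(Y.shape[1]):
            for k in range(Y.shape[2]):
                z = np.sum(W[n] * U_prev[:, j:j+L, k:k+L]) + b[n]
                Y[n, j, k] = max(0.0, z)
    return Y

def pool_layer(Y, P, D):
    """Max-pool each map independently over P x P blocks with stride D."""
    K, H, Wd = Y.shape
    oh, ow = (H - P) // D + 1, (Wd - P) // D + 1
    U = np.zeros((K, oh, ow))
    for n in range(K):
        for j in range(oh):
            for k in range(ow):
                U[n, j, k] = Y[n, j*D:j*D+P, k*D:k*D+P].max()
    return U

# Two conv+pool stages on a toy RGB input (all sizes illustrative)
U0 = np.random.rand(3, 32, 32)
Y1 = conv_layer(U0, np.random.randn(8, 3, 5, 5) * 0.1, np.zeros(8))    # 8 x 28 x 28
U1 = pool_layer(Y1, P=2, D=2)                                          # 8 x 14 x 14
Y2 = conv_layer(U1, np.random.randn(16, 8, 3, 3) * 0.1, np.zeros(16))  # 16 x 12 x 12
U2 = pool_layer(Y2, P=2, D=2)                                          # 16 x 6 x 6
```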
Pooling at the $o$th layer proceeds identically on each map, with block size and stride chosen per layer:

$U_n^{(o)}(j,k) = Y_n^{(o)}\big(\pi_n^{(o)}(j,k)\big), \qquad \pi_n^{(o)}(j,k) = \underset{l \in \{(j-1)D_o+1,\dots,jD_o\},\ m \in \{(k-1)D_o+1,\dots,kD_o\}}{\arg\max}\, Y_n^{(o)}(l,m)$
Parameters to choose: size of pooling block $P_2$, pooling stride $D_2$
– With appropriate zero padding, convolution maintains the size of its input; if performed without zero padding it will decrease the size of the input
– The filter size may vary with layer; similarly for pooling, the block size and stride may vary with layer

The architecture is defined by the number of layers and their arrangement (the order in which they follow one another), plus, for each convolution layer:
– Number of filters $K_j$
– Spatial extent of filter $L_j \times L_j$
– The stride $S_j$
For each pooling layer:
– Spatial extent of the pooling block $P_j \times P_j$
– The stride $D_j$
For the final MLP:
– Number of layers, and number of neurons in each layer

The parameters to be learned:
– The weights of the neurons in the final MLP
– The (weights and biases of the) filters for every convolutional layer
Counting the parameters of the convolutional layers:
– $K_0$ is the number of maps (colours) in the input
– The $j$th convolutional layer has $K_j$ filters of size $K_{j-1} \times L_j \times L_j$, plus one bias each, i.e. $K_j \left( K_{j-1} L_j^2 + 1 \right)$ filter parameters
– The total is $\sum_{j \in \text{convolutional layers}} K_j \left( K_{j-1} L_j^2 + 1 \right)$
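A quick sanity check of this count in Python, for an assumed toy configuration ($K_0 = 3$ input colours, then hypothetical layer sizes):

```python
# K[j]: number of maps at layer j (K[0] = 3 input colours); L[j]: filter width
K = [3, 8, 16, 32]
L = [None, 5, 3, 3]   # no filter for the input "layer"

# sum over convolutional layers of K_j * (K_{j-1} * L_j^2 + 1)
total = sum(K[j] * (K[j-1] * L[j]**2 + 1) for j in range(1, len(K)))
print(total)          # 8*(3*25+1) + 16*(8*9+1) + 32*(16*9+1) = 6416
```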
Training: the network is trained through standard gradient descent and backpropagation
– The only difference is in the structure of the network
– The divergence between the network's output and the desired output is computed in response to any input, and its derivatives are propagated backward
Backpropagation: the derivative of the divergence, $\frac{d\,Div(Y,d)}{d\,z}$, is propagated backward from the output
– Conventional backprop applies until the input of the final MLP (the flattened output of the last convolution/pooling stage)
– Adjustments are needed from there on, to pass derivatives through the pooling and convolution layers
Backpropagation through the max pooling layer: recall that

$U_n^{(o)}(j,k) = Y_n^{(o)}\big(\pi_n^{(o)}(j,k)\big), \qquad \pi_n^{(o)}(j,k) = \underset{l \in \{(j-1)D_o+1,\dots,jD_o\},\ m \in \{(k-1)D_o+1,\dots,kD_o\}}{\arg\max}\, Y_n^{(o)}(l,m)$

The derivative with respect to each $Y_n^{(o)}(l,m)$ can be computed from the stored positions of the maxima:

$\frac{d\,Div(Y,d)}{d\,Y_n^{(o)}(l,m)} = \begin{cases} \dfrac{d\,Div(Y,d)}{d\,U_n^{(o)}(j,k)} & \text{if } (l,m) = \pi_n^{(o)}(j,k) \\ 0 & \text{otherwise} \end{cases}$
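A sketch of the backward pass this rule implies: the incoming derivative for each pooled output is routed to the stored argmax position, and every other position receives zero. It assumes the `pos` bookkeeping array from the earlier pooling sketch.

```python
import numpy as np

def maxpool_backward(dU, pos, input_shape):
    """dU: (oh, ow) derivative w.r.t. the pooled outputs U.
    pos: (oh, ow, 2) stored argmax positions. Returns dY of `input_shape`."""
    dY = np.zeros(input_shape)
    oh, ow = dU.shape
    for j in range(oh):
        for k in range(ow):
            r, c = pos[j, k]
            dY[r, c] += dU[j, k]   # '+=' also handles overlapping pool windows
    return dY
```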
Backpropagation through the convolution layer: recall that

$Y_n^{(o)}(j,k) = f\big(z_n^{(o)}(j,k)\big), \qquad z_n^{(o)}(j,k) = \sum_{s=1}^{K_{o-1}} \sum_{l=1}^{L_o} \sum_{m=1}^{L_o} w_n^{(o)}(s,l,m)\, U_s^{(o-1)}(j+l,\, k+m) + b_n^{(o)}$
– The derivative of the divergence must be computed with respect to every map and every free parameter (filter weights) in the network
[Figure: original data and augmented data]
The patterns that individual neurons respond to are not directly specified and must be calculated
– What patterns in the input do the neurons actually respond to?
– We estimate it by setting the output of the neuron to 1, and learning the input by backpropagation
– Conv1: 6 5x5 filters in first conv layer (no zero pad), stride 1
– Pool1: 2x2 max pooling, stride 2
– Conv2: 16 5x5 filters in second conv layer, stride 1, no zero pad
– Pool2: 2x2 max pooling with stride 2 for second conv layer
– FC: final MLP with 3 layers
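This architecture is essentially LeNet-5. A minimal PyTorch rendering of the bullets above, as a hedged sketch: the 32×32 single-channel input, the tanh activations, and the 120/84/10 MLP sizes are the classic LeNet-5 choices, assumed here rather than stated in the bullets.

```python
import torch
import torch.nn as nn

# 32x32 single-channel input, as in classic LeNet-5 (assumption)
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # Conv1: 6 5x5 filters, no zero pad -> 6 x 28 x 28
    nn.Tanh(),
    nn.MaxPool2d(2, stride=2),         # Pool1: 2x2 max pooling, stride 2 -> 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5),   # Conv2: 16 5x5 filters, no zero pad -> 16 x 10 x 10
    nn.Tanh(),
    nn.MaxPool2d(2, stride=2),         # Pool2: 2x2 max pooling, stride 2 -> 16 x 5 x 5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),        # FC: 3-layer MLP
    nn.Tanh(),
    nn.Linear(120, 84),
    nn.Tanh(),
    nn.Linear(84, 10),                 # 10 digit classes
)

print(lenet(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])
```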
– 1.2 million pictures
– 1000 categories
Krizhevsky, A., Sutskever, I. and Hinton, G. E. “ImageNet Classification with Deep Convolutional Neural Networks” NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada
– Final MLP: 4096 neurons, 4096 neurons, 1000 output neurons
– At test time, predictions are averaged over 10 patches of the image (the four corner patches and the center patch, plus their horizontal reflections)
– ReLU activations made a large difference in convergence
– The learning rate was reduced when the validation error plateaus
– The final system combined multiple networks
– Lowest prior error using conventional classifiers: > 25%
Figure 3 (from the paper): 96 convolutional kernels of size 11×11×3 learned by the first convolutional layer on the 224×224×3 input images. The top 48 kernels were learned on GPU 1, and the bottom 48 kernels were learned on GPU 2.
(From the same paper) Left: eight ILSVRC-2010 test images and the five labels considered most probable by the model. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5). Right: five ILSVRC-2010 test images in the first column, alongside training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test image.
VGGNet: using 13 conv layers and 3 FC layers
– Combining 7 classifiers; subsequent to the paper, the error was reduced to 6.8% using only two classifiers
– Configuration: 64 conv, 64 conv, 64 pool, 128 conv, 128 conv, 128 pool, 256 conv, 256 conv, 256 conv, 256 pool, 512 conv, 512 conv, 512 conv, 512 pool, 512 conv, 512 conv, 512 conv, 512 pool, FC with 4096, 4096, 1000 neurons
ResNet: madness!
– Current top-5 error: < 3.5%
– Over 150 layers, with "skip" connections..
– A skip connection adds the input to the module directly to the module's output
[Diagram: Input image → Convolution (learned) → Non-linearity → Pooling → Feature maps]