Machine Learning for Signal Processing
Neural Networks (Continued)
Instructor: Bhiksha Raj. Slides by Najim Dehak. 1 Dec 2016
So what are neural networks?
N.Net: voice signal → transcription; image → text caption; game state → next move
18797/11755
– An old question, dating back to Plato and Aristotle.
– They represent Boolean functions over linear boundaries
– They can represent arbitrary boundaries
– They detect patterns in the input
– Higher-level perceptrons may also be viewed as feature detectors
– Can model any function to arbitrary precision
– The network will fire if the combination of the detected basic features matches an “acceptable” pattern for a desired class of signal
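The Boolean-formula view above can be made concrete. As an illustrative sketch (the weights below are hand-picked, not from the slides), a one-hidden-layer net of threshold perceptrons computes XOR, a function no single linear boundary can represent:

```python
import numpy as np

def perceptron(x, w, b):
    """Threshold unit: fires (1) when w.x + b >= 0."""
    return float(np.dot(w, x) + b >= 0)

def xor_mlp(x1, x2):
    """XOR as a Boolean formula over two linear boundaries:
    XOR = (x1 OR x2) AND NOT (x1 AND x2)."""
    x = np.array([x1, x2])
    h_or  = perceptron(x, np.array([1.0, 1.0]), -0.5)   # fires for x1 OR x2
    h_and = perceptron(x, np.array([1.0, 1.0]), -1.5)   # fires for x1 AND x2
    # Output fires when OR is on and AND is off.
    return perceptron(np.array([h_or, h_and]), np.array([1.0, -1.0]), -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))   # prints the XOR truth table
```

Each hidden unit detects one linear pattern; the output unit fires only when the combination of detected features matches the "acceptable" pattern.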
– They represent arbitrary Boolean functions over arbitrary linear boundaries
– MLPs are Boolean formulae over these patterns
– Can model any function to arbitrary precision
– Training data are generally many orders of magnitude too few
– Even with optimal architectures, we could get rubbish
– Depth helps greatly!
– Can learn functions that regular classifiers cannot
– Not just classification/Boolean functions
– Left: a net with a pair of units can create a pulse of any width at any location
– Right: a network of N such pairs approximates the function with N scaled pulses
[Figure: a step at T1 minus a step at T2 forms a unit pulse on (T1, T2); scaled pulses are summed (+) to build f(x).]
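The pulse construction can be sketched in code. This is an illustrative numpy implementation (function and parameter names are my own), assuming hard-threshold units:

```python
import numpy as np

def step(x):
    """Hard-threshold unit: fires 1 when its input is non-negative."""
    return (x >= 0).astype(float)

def pulse(x, t1, t2):
    """A pair of threshold units: a step at T1 minus a step at T2
    yields a pulse of height 1 on [T1, T2)."""
    return step(x - t1) - step(x - t2)

def approximate(f, x, n=50, lo=0.0, hi=1.0):
    """Approximate f on [lo, hi) with n scaled, non-overlapping pulses."""
    edges = np.linspace(lo, hi, n + 1)
    y = np.zeros_like(x)
    for t1, t2 in zip(edges[:-1], edges[1:]):
        center = 0.5 * (t1 + t2)
        y += f(center) * pulse(x, t1, t2)   # scale each pulse by f at its center
    return y

x = np.linspace(0, 1, 1000, endpoint=False)
y_hat = approximate(np.sin, x, n=100)
print(np.max(np.abs(y_hat - np.sin(x))))   # small; shrinks as n grows
```

Increasing n narrows the pulses and drives the error toward zero, which is the sense in which the network models the function to arbitrary precision.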
– Will retain all the significant components of the signal
DIGIT OR NOT?
[Figure: autoencoder — the ENCODER maps input X to a code Z; the DECODER maps Z back to a reconstruction X̂.]
Z = WX,  X̂ = WᵀZ,  E = ‖X − WᵀWX‖²
Find W to minimize Avg[E]
– “Non-linear” PCA
– Deeper networks can capture more complicated manifolds
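As a hedged sketch of the linear case ("Find W to minimize Avg[E]"): a tied-weight linear autoencoder trained by plain gradient descent recovers a low-dimensional subspace, i.e. it behaves like PCA. All names and the toy data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points in 5-D that lie exactly in a 2-D subspace.
basis = rng.standard_normal((5, 2))
X = basis @ rng.standard_normal((2, 100))        # shape (5, 100)

# Tied-weight linear autoencoder: code Z = W X, reconstruction Xhat = W.T Z.
W = 0.1 * rng.standard_normal((2, 5))
lr = 0.01
for _ in range(2000):
    Z = W @ X                                    # encode
    R = W.T @ Z - X                              # reconstruction residual
    grad = Z @ R.T + (W @ R) @ X.T               # dE/dW (up to a factor of 2)
    W -= lr * grad / X.shape[1]                  # average over samples

err = np.mean((W.T @ (W @ X) - X) ** 2)
print(err)   # near zero: the 2-unit bottleneck recovers the 2-D subspace
```

With a non-linear activation and more layers, the same objective lets the network capture curved manifolds rather than only flat subspaces.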
Cut the AE: after training, discard the encoder and keep only the DECODER.
[Figures: the retained decoder acts as a "dictionary" for the signal class it was trained on — e.g., a sax dictionary, a clarinet dictionary.]
– They can model any decision boundary
– They can model any regression
– What if we only care about the presence of the pattern, not its exact position?
– Moving it by one component results in an entirely different input that the MLP won’t recognize
History
Hubel and Wiesel: 1959 (biological model); Fukushima: 1980 (computational model); Atlas: 1988; LeCun: 1989 (backprop in convnets). [Photos: Yann LeCun, Kunihiko Fukushima]
properties from an image.
algorithm.
A convolution layer has a much smaller number of parameters, via local connection and weight sharing. [Figure: fully connected (all different weights) vs. locally connected (all different weights) vs. convolutional (shared weights).]
Example: 200x200 image, 40K hidden units, fully connected → ~2B parameters!!!
Far more than we could ever cover with training samples anyway..
(Slide credit: Ranzato)
Example: 200x200 image, 40K hidden units, filter size 10x10, locally connected → 4M parameters
Note: this parameterization is good when the input image is registered (e.g., face recognition).
STATIONARITY? Statistics are similar at different locations.
Example: 200x200 image, 40K hidden units, filter size 10x10 → 4M parameters
Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels
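A minimal numpy sketch of such a shared-kernel ("valid") convolution — one 10x10 kernel, 100 weights, reused at every location (names are illustrative):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over every location ('valid' correlation):
    the same few weights are reused everywhere instead of a huge
    location-specific weight matrix."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product of the kernel with the patch at (i, j).
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.random.default_rng(0).standard_normal((200, 200))
kernel = np.random.default_rng(1).standard_normal((10, 10))
print(conv2d_valid(image, kernel).shape)   # (191, 191)
print(kernel.size)                          # 100 shared parameters
```

Stationarity is exactly what justifies the sharing: if statistics are similar at different locations, one learned kernel is the right detector everywhere.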
Learn multiple filters. E.g.: 200x200 image, 100 filters, filter size 10x10 → 10K parameters.
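The parameter counts in the three examples above are just arithmetic; a quick check (biases ignored):

```python
pixels = 200 * 200                       # 40,000 input values
hidden = 40_000                          # hidden units
fsize = 10 * 10                          # one 10x10 filter

fully_connected = pixels * hidden        # every unit sees every pixel
locally_connected = hidden * fsize       # each unit sees its own 10x10 patch
convolutional = 100 * fsize              # 100 filters, shared across locations

print(fully_connected)                   # 1600000000  (~2B)
print(locally_connected)                 # 4000000     (4M)
print(convolutional)                     # 10000       (10K)
```

Weight sharing is what collapses the count from millions to thousands: the number of parameters no longer depends on the image size at all.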
[Figure: before — each hidden unit connects to the entire input layer; now — each hidden unit connects to a local patch with shared weights.]
A 32x32x3 image: width 32, height 32, depth 3.
Convolve a 5x5x3 filter with the 32x32x3 image, i.e. “slide over the image spatially, computing dot products”. Filters always extend the full depth of the input volume.
Each location yields 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias).
Convolving (sliding) over all spatial locations produces a 28x28 activation map.
Consider a second, green filter: convolving it over all spatial locations gives a second 28x28 activation map.
For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps. We stack these up to get a “new image” of size 28x28x6!
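The 32x32x3 → 28x28x6 walkthrough can be checked with a naive numpy convolution layer (an illustrative, unoptimized sketch):

```python
import numpy as np

def conv_layer(volume, filters, biases):
    """Forward pass of a convolution layer ('valid', stride 1).
    volume: (H, W, D) input; filters: (K, kh, kw, D); biases: (K,).
    Each filter extends the full input depth and yields one activation map."""
    H, W, D = volume.shape
    K, kh, kw, _ = filters.shape
    out = np.empty((H - kh + 1, W - kw + 1, K))
    for k in range(K):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = volume[i:i+kh, j:j+kw, :]           # 5x5x3 chunk
                out[i, j, k] = np.sum(patch * filters[k]) + biases[k]
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
filters = rng.standard_normal((6, 5, 5, 3))     # 6 filters of size 5x5x3
maps = conv_layer(image, filters, np.zeros(6))
print(maps.shape)   # (28, 28, 6): six stacked activation maps
```

Each output position is exactly the 75-dimensional dot product (plus bias) described above; 32 − 5 + 1 = 28 gives the spatial size.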
Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions:
32x32x3 → [CONV, ReLU: 6 5x5x3 filters] → 28x28x6 → [CONV, ReLU: 10 5x5x6 filters] → 24x24x10 → [CONV, ReLU] → …
Let us assume the filter is an “eye” detector. Q: how can we make the detection robust to the exact location of the eye?
By “pooling” (e.g., taking max) filter responses at different locations, we gain robustness to the exact spatial location.
Single depth slice (x, y):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
max pool with 2x2 filters and stride 2 →
6 8
3 4
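The pooling example above, in code (a minimal sketch; the function name is my own):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max-pool a single depth slice with a size x size window."""
    H, W = x.shape
    out = np.empty((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Keep only the strongest response in each window.
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool(x))   # [[6. 8.]
                     #  [3. 4.]]
```

Shifting a strong response by one pixel inside a window leaves the pooled output unchanged, which is exactly the spatial robustness being claimed.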
[Figure: one ConvNet stage (zoom): convolution → pooling. Courtesy of Ranzato.]
Krizhevsky, A., Sutskever, I. and Hinton, G. E. “ImageNet Classification with Deep Convolutional Neural Networks” NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada
Figure 3: 96 convolutional kernels of size 11×11×3 learned by the first convolutional layer on the 224×224×3 input images. The top 48 kernels were learned on GPU 1, while the bottom 48 kernels were learned on GPU 2.
(Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column; the remaining columns show training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test image.
Typical ConvNet stage structure (bottom to top): input image → convolution (learned) → non-linearity → pooling → feature maps.
Recurrent neural networks (RNNs) introduce cycles and a notion of time. Given an input sequence, they can produce sequences of outputs z_1, …, z_n.
[Figure: RNN cell — input y_t, output z_t; the hidden state h_t is fed back through a one-step delay as h_{t−1}.]
Elman Nets (1990) – Simple Recurrent Neural Networks
– The recurrence gives the network a sense of time
– The state consists of a single “hidden” vector h
RNNs can be unrolled across multiple time steps. This produces a DAG which supports backpropagation. But its size depends on the input sequence length.
[Figure: the RNN unrolled over three time steps — (y_0, h_0, z_0), (y_1, h_1, z_1), (y_2, h_2, z_2); the one-step delay becomes an edge from each h_t to h_{t+1}.]
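The unrolled computation is easy to sketch: a minimal Elman-style forward pass in numpy, reusing the same weight matrices (here named U, W, V — illustrative names, not from the slides) at every time step:

```python
import numpy as np

rng = np.random.default_rng(0)
din, dh, dout = 3, 4, 2                     # illustrative sizes

U = 0.5 * rng.standard_normal((dh, din))    # input  -> hidden
W = 0.5 * rng.standard_normal((dh, dh))     # hidden -> hidden (one-step delay)
V = 0.5 * rng.standard_normal((dout, dh))   # hidden -> output

def rnn_forward(ys):
    """Unroll h_t = tanh(U y_t + W h_{t-1}), z_t = V h_t over a sequence.
    The same three matrices are reused at every time step."""
    h = np.zeros(dh)
    zs = []
    for y in ys:
        h = np.tanh(U @ y + W @ h)
        zs.append(V @ h)
    return np.array(zs)

ys = rng.standard_normal((5, din))   # a length-5 input sequence
zs = rnn_forward(ys)
print(zs.shape)                      # (5, 2): one output per time step
```

The unrolled graph is a DAG whose depth equals the sequence length, which is why its size depends on the input.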
– Speech, video, text, market data
– One output after the entire specific input sequence is seen. Applications: speech recognition
– An output at each step, while the network sees only part of the sequence. Applications: time series prediction (stock market, sun spots, etc.)
– An output sequence produced in response to a specific input sequence. Applications: speech generation
Often layers are stacked vertically (deep RNNs):
[Figure: two RNN layers unrolled over time — the horizontal axis is time, the vertical axis abstraction; higher layers compute higher-level features. The same parameters are shared within each level across all time steps.]
Backprop still works: it is called Backpropagation Through Time (BPTT).
[Figures (animated build): in the unrolled two-layer network, activations propagate forward in time and upward through the layers; gradients then propagate backward along the same paths.]
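BPTT itself is short to write down. A hedged sketch for a vanilla RNN with a loss on the final hidden state only (all names illustrative), including a finite-difference check of one gradient entry:

```python
import numpy as np

rng = np.random.default_rng(0)
din, dh = 2, 3

U = 0.5 * rng.standard_normal((dh, din))   # input  -> hidden
W = 0.5 * rng.standard_normal((dh, dh))    # hidden -> hidden

def loss_and_grads(ys, target):
    """Forward through time, then backprop through time (BPTT) for
    h_t = tanh(U y_t + W h_{t-1}), loss = 0.5 ||h_T - target||^2."""
    hs = [np.zeros(dh)]
    for y in ys:                            # forward pass: activations in time
        hs.append(np.tanh(U @ y + W @ hs[-1]))
    diff = hs[-1] - target
    loss = 0.5 * diff @ diff

    dU, dW = np.zeros_like(U), np.zeros_like(W)
    dh_next = diff                          # gradient arriving at h_T
    for t in range(len(ys) - 1, -1, -1):    # backward pass: walk back in time
        da = dh_next * (1 - hs[t + 1] ** 2) # through the tanh
        dU += np.outer(da, ys[t])           # same U accumulates at every step
        dW += np.outer(da, hs[t])           # same W accumulates at every step
        dh_next = W.T @ da                  # gradient flows on to h_{t-1}
    return loss, dU, dW

ys = rng.standard_normal((4, din))
target = rng.standard_normal(dh)
loss, dU, dW = loss_and_grads(ys, target)

# Check one entry of dW against a finite difference.
eps = 1e-5
W[0, 0] += eps; lp, _, _ = loss_and_grads(ys, target)
W[0, 0] -= 2 * eps; lm, _, _ = loss_and_grads(ys, target)
W[0, 0] += eps
print(abs((lp - lm) / (2 * eps) - dW[0, 0]))   # tiny: analytic matches numeric
```

Because the same U and W appear at every unrolled step, their gradients are sums over time — the source of the vanishing/exploding behavior LSTMs address.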
Standard LSTM
The LSTM cell takes the following inputs: the current input, the previous hidden state, and the previous cell state c_{t−1} (all vectors).
– Input gate: controls how much new information will be let through to the memory cell.
– Forget gate: controls what should be thrown away from the memory cell.
– Output gate: controls what will be passed (exposed) to the next time step.
– New memory: the candidate content written into the cell state.
[Figure: overall picture — a plain RNN unit vs. the LSTM memory cell.]
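The gates described above can be sketched as one numpy step (the weight names and the gate ordering in the stacked matrices are a common convention, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One LSTM step. Wx: (4d, din), Wh: (4d, d), b: (4d,).
    Gate order in the stacked matrices: input, forget, output, candidate."""
    d = h_prev.size
    a = Wx @ x + Wh @ h_prev + b
    i = sigmoid(a[0*d:1*d])     # input gate: how much new information to let in
    f = sigmoid(a[1*d:2*d])     # forget gate: what to throw away from the cell
    o = sigmoid(a[2*d:3*d])     # output gate: what to expose to the next step
    g = np.tanh(a[3*d:4*d])     # candidate new memory
    c = f * c_prev + i * g      # updated cell state
    h = o * np.tanh(c)          # hidden state passed to the next time step
    return h, c

rng = np.random.default_rng(0)
din, d = 3, 4
Wx = 0.1 * rng.standard_normal((4*d, din))
Wh = 0.1 * rng.standard_normal((4*d, d))
b = np.zeros(4*d)

h, c = np.zeros(d), np.zeros(d)
for x in rng.standard_normal((5, din)):   # run a short input sequence
    h, c = lstm_step(x, h, c, Wx, Wh, b)
print(h.shape, c.shape)
```

The additive update c = f·c_prev + i·g is the key design choice: memory persists unless the forget gate actively discards it, easing gradient flow across time.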
– Language modeling
– Sentiment analysis / text classification
– Machine translation and conversation modeling
– Sentence skip-thought vectors
Translating Videos to Natural Language Using Deep Recurrent Neural Networks Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko North American Chapter of the Association for Computational Linguistics, Denver, Colorado, June 2015.
http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/
Combining LSTMs with Convolutional Neural Nets (CNNs) gives huge gains (state of the art):
Sainath et al., “Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks,” ICASSP 2015.
Cortana
– They represent Boolean functions over linear boundaries
– They can represent arbitrary boundaries
– They detect patterns in the input
– Higher-level perceptrons may also be viewed as feature detectors
– Can model any function to arbitrary precision
– Non-linear PCA
– CNN
– RNN, LSTM