Image: Jose-Luis Olivares CMP784 DEEP LEARNING Lecture #03 – Multi-layer Perceptrons Aykut Erdem // Hacettepe University // Spring 2018
Breaking news! • Practical 1 is out! — Learning neural word embeddings — Due Friday, Mar. 16, 23:59:59 • Paper presentations and quizzes will start next week! — Discuss your slides with me 3-4 days prior to your presentation — Submit your final slides by the night before the class — We don’t have any code walker or demonstrator. 2
Previously on CMP784 • Learning problem • Parametric vs. non-parametric models • Nearest-neighbor classifier • Linear classification • Linear regression • Capacity • Hyperparameters • Underfitting • Overfitting • Bias-variance tradeoff • Model selection • Cross-validation 3
Lecture overview • the perceptron • the multi-layer perceptron • stochastic gradient descent • backpropagation • shallow yet very powerful: word2vec • Disclaimer: Much of the material and slides for this lecture were borrowed from — Hugo Larochelle’s Neural networks slides — Nick Locascio’s MIT 6.S191 slides — Efstratios Gavves and Max Welling’s UvA deep learning class — Leonid Sigal’s CPSC532L class — Richard Socher’s CS224d class — Dan Jurafsky’s CS124 class 4
A Brief History of Neural Networks [timeline figure, ending at “today” — Image: VUNI Inc.] 5
The Perceptron 6
The Perceptron [diagram: inputs x_0, x_1, x_2, …, x_n with weights w_0, w_1, w_2, …, w_n, plus a constant bias input 1 with weight b, feeding a sum ∑ followed by a non-linearity] 7
Perceptron Forward Pass • Neuron pre-activation (or input activation): a(x) = b + Σ_i w_i x_i = b + w^⊤ x • Neuron output activation: h(x) = g(a(x)) = g(b + Σ_i w_i x_i) where w are the weights (parameters), b is the bias term, and g(·) is called the activation function 8
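A minimal NumPy sketch of this forward pass (the function and variable names are my own, not from the slides); it computes the pre-activation b + w^⊤ x and passes it through an activation g, here taken to be the sigmoid introduced below:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def perceptron_forward(x, w, b, g=sigmoid):
    """Single-neuron forward pass: h(x) = g(b + w^T x)."""
    a = b + np.dot(w, x)   # pre-activation a(x) = b + sum_i w_i x_i
    return g(a)            # output activation h(x) = g(a(x))
```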
Output Activation of The Neuron • h(x) = g(a(x)) = g(b + Σ_i w_i x_i) • Range is determined by g(·) • Bias only changes the position of the riff (Image credit: Pascal Vincent) 9
Linear Activation Function • h(x) = g(a(x)) = g(b + Σ_i w_i x_i) • g(a) = a • No nonlinear transformation • No input squashing 10
Sigmoid Activation Function • h(x) = g(a(x)) = g(b + Σ_i w_i x_i) • g(a) = sigm(a) = 1 / (1 + exp(−a)) • Squashes the neuron’s output between 0 and 1 • Always positive • Bounded • Strictly increasing 11
Perceptron Forward Pass • h(x) = g(a(x)) = g(b + Σ_i w_i x_i) [diagram: inputs 2, 3, −1, 5 with weights 0.1, 0.5, 2.5, 0.2, and a constant bias input 1 with weight 3.0] 12
Perceptron Forward Pass • h(x) = g((2·0.1) + (3·0.5) + (−1·2.5) + (5·0.2) + (1·3.0)) 13
Perceptron Forward Pass • h(x) = g(3.2) = σ(3.2) = 1 / (1 + e^(−3.2)) = 0.96 14
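The same numbers can be checked in a few lines of NumPy (a sketch, assuming the sigmoid activation used on the slide):

```python
import numpy as np

x = np.array([2.0, 3.0, -1.0, 5.0])
w = np.array([0.1, 0.5, 2.5, 0.2])
b = 3.0                       # weight on the constant bias input 1
a = b + np.dot(w, x)          # (2*0.1) + (3*0.5) + (-1*2.5) + (5*0.2) + 3.0 = 3.2
h = 1.0 / (1.0 + np.exp(-a))  # sigmoid(3.2) ≈ 0.96
print(a, h)
```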
Hyperbolic Tangent (tanh) Activation Function • h(x) = g(a(x)) = g(b + Σ_i w_i x_i) • g(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = (exp(2a) − 1) / (exp(2a) + 1) • Squashes the neuron’s output between −1 and 1 • Can be positive or negative • Bounded • Strictly increasing 15
Rectified Linear (ReLU) Activation Function • h(x) = g(a(x)) = g(b + Σ_i w_i x_i) • g(a) = reclin(a) = max(0, a) • Bounded below by 0 (always non-negative) • Not upper bounded • Monotonically increasing (flat for a < 0) • Tends to produce units with sparse activities 16
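The four activation functions above, written as a short NumPy sketch (the function names are mine):

```python
import numpy as np

def linear(a):
    return a                           # no squashing, no nonlinearity

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))    # output in (0, 1), always positive

def tanh(a):
    return np.tanh(a)                  # output in (-1, 1), can be negative

def relu(a):
    return np.maximum(0.0, a)          # non-negative, not bounded above
```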
Decision Boundary of a Neuron • Could do binary classification: — with sigmoid, one can interpret the neuron as estimating p(y = 1 | x) — also known as a logistic regression classifier • Decision boundary is linear: — if the activation is greater than 0.5, predict 1 — otherwise predict 0 • Same idea can be applied to a tanh activation (Image credit: Pascal Vincent) 17
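For instance, turning the sigmoid output into a binary decision is just a threshold at 0.5, which is equivalent to thresholding the pre-activation at 0 (a small sketch, not from the slides):

```python
import numpy as np

def predict(x, w, b):
    p = 1.0 / (1.0 + np.exp(-(b + np.dot(w, x))))  # estimate of p(y = 1 | x)
    return 1 if p > 0.5 else 0   # same as: 1 if b + np.dot(w, x) > 0 else 0
```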
Capacity of Single Neuron • Can solve linearly separable problems [plots over the (x1, x2) plane showing linear decision boundaries for AND(x1, x2) and OR(x1, x2)] 18
Capacity of Single Neuron • Cannot solve non-linearly separable problems [plot: XOR(x1, x2) over the (x1, x2) plane — no single line separates the classes] • Need to transform the input into a better representation (e.g., using AND(x1, x2) as an intermediate feature) • Remember basis functions! 19
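As an illustration of such a transformation (the weights below are hand-picked by me, not from the slide), two threshold units computing OR and AND make XOR linearly separable in the resulting feature space:

```python
def step(a):
    return 1.0 if a > 0 else 0.0     # hard-threshold activation

def xor_net(x1, x2):
    h_or  = step(x1 + x2 - 0.5)      # OR(x1, x2): fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # AND(x1, x2): fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # "OR but not AND" = XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, int(xor_net(x1, x2)))  # prints the XOR truth table
```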
Perceptron Diagram Simplified [diagram: inputs x_0 … x_n with weights w_0 … w_n and a constant bias input 1 with weight b, feeding a sum ∑ followed by a non-linearity] 20
Perceptron Diagram Simplified [simplified diagram: inputs x_0, x_1, x_2, …, x_n connected directly to a single output node o_0] 21
Multi-Output Perceptron • Remember multi-way classification: — We need multiple outputs (1 output per class) — We need to estimate the conditional probability p(y = c | x) — Discriminative learning • Softmax activation function at the output: o(a) = softmax(a) = [exp(a_1)/Σ_c exp(a_c), …, exp(a_C)/Σ_c exp(a_c)]^⊤ — strictly positive — sums to one • Predict the class with the highest estimated class conditional probability. 22
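A minimal softmax sketch (with the usual max-subtraction for numerical stability, which the slide does not show):

```python
import numpy as np

def softmax(a):
    """o(a)_c = exp(a_c) / sum_c' exp(a_c'): strictly positive, sums to one."""
    e = np.exp(a - np.max(a))   # subtracting max(a) avoids overflow; result unchanged
    return e / e.sum()

def predict_class(a):
    return int(np.argmax(softmax(a)))   # class with highest estimated probability
```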
Multi-Layer Perceptron 23
Single Hidden Layer Neural Network • Hidden layer pre-activation: a(x) = b^(1) + W^(1) x, i.e. a(x)_i = b^(1)_i + Σ_j W^(1)_{i,j} x_j • Hidden layer activation: h(x) = g(a(x)) • Output layer activation: o(x) = o(b^(2) + (w^(2))^⊤ h^(1)(x)) [diagram: inputs x_0 … x_n → hidden units h_0 … h_n → outputs o_0 … o_n] 24
Multi-Layer Perceptron (MLP) • Consider a network with L hidden layers. — layer pre-activation for k > 0: a^(k)(x) = b^(k) + W^(k) h^(k−1)(x) (with h^(0)(x) = x) — hidden layer activation for k from 1 to L: h^(k)(x) = g(a^(k)(x)) — output layer activation (k = L+1): h^(L+1)(x) = o(a^(L+1)(x)) = f(x) 25
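The full forward pass for L hidden layers is a short loop. This is a sketch under my own assumptions: W and b are lists of per-layer weight matrices and bias vectors, the hidden activation g is tanh, and the output activation o is softmax:

```python
import numpy as np

def mlp_forward(x, W, b, g=np.tanh):
    """Forward pass of an MLP with len(W) - 1 hidden layers."""
    h = x                              # h^(0)(x) = x
    for k in range(len(W) - 1):        # hidden layers k = 1 .. L
        h = g(W[k] @ h + b[k])         # h^(k) = g(b^(k) + W^(k) h^(k-1))
    a_out = W[-1] @ h + b[-1]          # output pre-activation a^(L+1)(x)
    e = np.exp(a_out - np.max(a_out))  # softmax output activation
    return e / e.sum()                 # f(x)
```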
Deep Neural Network [diagram: inputs x_0 … x_n, two hidden layers of units h_0 … h_n, outputs o_0 … o_n] 26
Capacity of Neural Networks • Consider a single layer neural network [figure: two hidden units y_1, y_2 each define a linear boundary in the (x_1, x_2) plane; the output unit z_1 combines them to carve out a region — Image credit: Pascal Vincent] 27
Capacity of Neural Networks • Consider a single layer neural network [figure: four hidden units y_1 … y_4 whose linear boundaries combine into the region recognized by the output z_1 — Image credit: Pascal Vincent] 28
Capacity of Neural Networks • Consider a single layer neural network (Image credit: Pascal Vincent) 29
Universal Approximation • Universal Approximation Theorem (Hornik, 1991): — “a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units” • This applies for sigmoid, tanh and many other activation functions. • However, this does not mean that there is a learning algorithm that can find the necessary parameter values. 30
Applying Neural Networks 31
Example Problem: Will my flight be delayed? [diagram: Temperature: −20 F and Wind Speed: 45 mph form the input vector [−20, 45], fed through hidden units h_0, h_1, h_2 to output o_0 — Predicted: 0.05] 32
Example Problem: Will my flight be delayed? [same network: input [−20, 45] → Predicted: 0.05] 33
Example Problem: Will my flight be delayed? [same network, now compared against the ground truth: Predicted: 0.05, Actual: 1] 34
Quantifying Loss [diagram: input [−20, 45] → network → Predicted: 0.05, Actual: 1] • The loss ℓ(f(x^(i); θ), y^(i)) compares the network’s prediction f(x^(i); θ) with the actual label y^(i) 35
Total Loss • Input → Predicted / Actual: [−20, 45] → 0.05 / 1; [80, 0] → 0.02 / 0; [4, 15] → 0.96 / 1; [45, 60] → 0.35 / 1 • Total loss over N examples: J(θ) = (1/N) Σ_i ℓ(f(x^(i); θ), y^(i)) 36
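Computing the total loss for the four examples above could look like the sketch below; binary cross-entropy is an assumed choice for the per-example loss ℓ, which the slide does not specify:

```python
import numpy as np

preds   = np.array([0.05, 0.02, 0.96, 0.35])  # f(x^(i); theta) for each input
actuals = np.array([1.0, 0.0, 1.0, 1.0])      # y^(i)

def bce(p, y):
    """Per-example binary cross-entropy loss (assumed form of ell)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

J = np.mean(bce(preds, actuals))  # J(theta) = (1/N) sum_i ell(f(x^(i); theta), y^(i))
print(J)
```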