Introduction to Neural Networks
Philipp Koehn
24 September 2020
Linear Models

• So far we have used a weighted linear combination of feature values h_j and weights λ_j

  score(λ, d_i) = Σ_j λ_j h_j(d_i)

• Such models can be illustrated as a "network"
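A minimal sketch of such a linear score in Python (the feature values and weights here are made up for illustration, not taken from the slides):

```python
# Weighted linear combination of feature values (illustrative values).
feature_values = [1.0, 0.0, 3.0]   # h_j(d_i) for one candidate d_i
weights = [0.5, -1.2, 0.8]         # λ_j

# score(λ, d_i) = Σ_j λ_j h_j(d_i)
score = sum(w * h for w, h in zip(weights, feature_values))
print(score)  # 2.9
```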
Limits of Linearity

• We can give each feature a weight

• But we cannot model more complex value relationships, e.g.,
  – any value in the range [0;5] is equally good
  – values over 8 are bad
  – higher than 10 is not worse
XOR

• Linear models cannot model XOR
  [Diagram: the four binary input points, labeled good/bad in an XOR pattern, cannot be separated by a straight line]
Multiple Layers

• Add an intermediate ("hidden") layer of processing (each arrow is a weight)
  [Diagram: input layer x, hidden layer h, output layer y, fully connected by weighted arrows]
• Have we gained anything so far?
Non-Linearity

• Instead of computing a linear combination

  score(λ, d_i) = Σ_j λ_j h_j(d_i)

• Add a non-linear function

  score(λ, d_i) = f( Σ_j λ_j h_j(d_i) )

• Popular choices

  tanh(x)        sigmoid(x) = 1 / (1 + e^{-x})        relu(x) = max(0, x)

  (sigmoid is also called the "logistic function")
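A short sketch of these activation functions in Python:

```python
import math

def sigmoid(x):
    # logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # maps any real number into (-1, 1)
    return math.tanh(x)

def relu(x):
    # zero for negative inputs, identity for positive inputs
    return max(0.0, x)

print(sigmoid(2.2))   # ~0.90
print(tanh(2.2))      # ~0.98
print(relu(-3.0))     # 0.0
```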
Deep Learning

• More layers = deep learning
What Depth Holds

• Each layer is a processing step

• Having multiple processing steps allows complex functions

• Metaphor: NN and computing circuits
  – computer = sequence of Boolean gates
  – neural computer = sequence of layers

• Deep neural networks can implement complex functions,
  e.g., sorting of input values
example
Simple Neural Network

  [Diagram: two input nodes and a bias node feed two hidden nodes (weights 3.7, 3.7, bias −1.5 into h0; weights 2.9, 2.9, bias −4.5 into h1); the hidden nodes and a bias node feed one output node (weights 4.5, −5.2, bias −2.0)]

• One innovation: bias units (no inputs, always value 1)
Sample Input

  [Diagram: the same network with input values x0 = 1.0 and x1 = 0.0]

• Try out two input values

• Hidden unit computation

  sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^{-2.2}) = 0.90

  sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.5) = sigmoid(−1.6) = 1 / (1 + e^{1.6}) = 0.17
Computed Hidden

  [Diagram: the computed hidden values are h0 = 0.90 and h1 = 0.17]

• Try out two input values

• Hidden unit computation

  sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^{-2.2}) = 0.90

  sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.5) = sigmoid(−1.6) = 1 / (1 + e^{1.6}) = 0.17
Compute Output

  [Diagram: hidden values h0 = 0.90 and h1 = 0.17 feed the output node]

• Output unit computation

  sigmoid(0.90 × 4.5 + 0.17 × −5.2 + 1 × −2.0) = sigmoid(1.17) = 1 / (1 + e^{-1.17}) = 0.76
Computed Output

  [Diagram: the computed output value is y = 0.76]

• Output unit computation

  sigmoid(0.90 × 4.5 + 0.17 × −5.2 + 1 × −2.0) = sigmoid(1.17) = 1 / (1 + e^{-1.17}) = 0.76
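A short Python sketch of this forward pass, using the weights from the diagram (the function and variable names are my own, not from the slides):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Weights of the example network (bias weight listed last).
w_h0 = [3.7, 3.7, -1.5]
w_h1 = [2.9, 2.9, -4.5]
w_y  = [4.5, -5.2, -2.0]

def forward(x0, x1):
    # hidden layer: weighted sum of inputs plus bias, then sigmoid
    h0 = sigmoid(w_h0[0] * x0 + w_h0[1] * x1 + w_h0[2])
    h1 = sigmoid(w_h1[0] * x0 + w_h1[1] * x1 + w_h1[2])
    # output layer: weighted sum of hidden values plus bias, then sigmoid
    y = sigmoid(w_y[0] * h0 + w_y[1] * h1 + w_y[2])
    return h0, h1, y

print(forward(1.0, 0.0))  # approximately (0.90, 0.17, 0.76)
```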
Output for all Binary Inputs

  Input x0   Input x1   Hidden h0   Hidden h1   Output y0
     0          0         0.12        0.02       0.18 → 0
     0          1         0.88        0.27       0.74 → 1
     1          0         0.73        0.12       0.74 → 1
     1          1         0.99        0.73       0.33 → 0

• Network implements XOR
  – hidden node h0 is OR
  – hidden node h1 is AND
  – final layer operation is h0 − h1 (the output is high only when h0 is on and h1 is off)

• Power of deep neural networks: chaining of processing steps,
  just as more Boolean gates allow more complex computations
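Continuing the forward-pass sketch above, the four binary inputs can be enumerated directly (with the exact weights from the diagram the hidden values differ slightly from the table on the slide, but the thresholded output is still XOR):

```python
# Enumerate all binary inputs with the forward() sketch defined earlier.
for x0 in (0.0, 1.0):
    for x1 in (0.0, 1.0):
        h0, h1, y = forward(x0, x1)
        print(x0, x1, round(h0, 2), round(h1, 2), round(y, 2), "->", int(y > 0.5))
```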
why "neural" networks?
Neuron in the Brain

• The human brain is made up of about 100 billion neurons
  [Diagram of a neuron: dendrites, soma, nucleus, axon, axon terminals]

• Neurons receive electric signals at the dendrites and send them to the axon
Neural Communication

• The axon of the neuron is connected to the dendrites of many other neurons
  [Diagram of a synapse: axon terminal, synaptic vesicles, neurotransmitters and their transporters, voltage-gated Ca++ channel, synaptic cleft, receptors, postsynaptic density, dendrite]
The Brain vs. Artificial Neural Networks

• Similarities
  – neurons, connections between neurons
  – learning = change of connections, not change of neurons
  – massive parallel processing

• But artificial neural networks are much simpler
  – computation within neuron vastly simplified
  – discrete time steps
  – typically some form of supervised learning with massive number of stimuli
back-propagation training
Error

  [Diagram: the example network with computed output 0.76]

• Computed output: y = 0.76

• Correct output: t = 1.0

⇒ How do we adjust the weights?
Key Concepts

• Gradient descent (see the sketch below)
  – error is a function of the weights
  – we want to reduce the error
  – gradient descent: move towards the error minimum
  – compute gradient → get direction to the error minimum
  – adjust weights towards direction of lower error

• Back-propagation
  – first adjust last set of weights
  – propagate error back to each previous layer
  – adjust their weights
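A minimal gradient descent sketch for a single weight, assuming a toy one-dimensional error function (the function and learning rate are made up for illustration):

```python
# Toy example: error(w) = (w - 3)^2 has its minimum at w = 3.
def error(w):
    return (w - 3.0) ** 2

def gradient(w):
    # derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

w = 0.0             # current weight
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * gradient(w)   # move against the gradient

print(w, error(w))  # w approaches 3, error approaches 0
```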
Gradient Descent

  [Plot: error(λ) as a function of λ, showing the gradient at the current λ and the optimal λ at the error minimum]
Gradient Descent

  [Plot: error surface over two weights, showing the current point, the gradient for w1, the gradient for w2, the combined gradient, and the optimum]
Derivative of Sigmoid

• Sigmoid:  sigmoid(x) = 1 / (1 + e^{-x})

• Reminder: quotient rule

  ( f(x) / g(x) )' = ( g(x) f'(x) − f(x) g'(x) ) / g(x)^2

• Derivative

  d sigmoid(x) / dx = d/dx [ 1 / (1 + e^{-x}) ]
                    = ( 0 × (1 + e^{-x}) − (−e^{-x}) ) / (1 + e^{-x})^2
                    = ( 1 / (1 + e^{-x}) ) ( e^{-x} / (1 + e^{-x}) )
                    = ( 1 / (1 + e^{-x}) ) ( 1 − 1 / (1 + e^{-x}) )
                    = sigmoid(x) (1 − sigmoid(x))
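A quick numerical check of this identity, sketched with a finite-difference approximation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # closed form derived on the slide
    return sigmoid(x) * (1.0 - sigmoid(x))

x = 0.7
eps = 1e-6
finite_difference = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(sigmoid_derivative(x), finite_difference)  # the two values agree closely
```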
Final Layer Update

• Linear combination of weights:  s = Σ_k w_k h_k

• Activation function:  y = sigmoid(s)

• Error (L2 norm):  E = 1/2 (t − y)^2

• Derivative of error with regard to one weight w_k

  dE/dw_k = dE/dy · dy/ds · ds/dw_k
Final Layer Update (1)

• Linear combination of weights:  s = Σ_k w_k h_k

• Activation function:  y = sigmoid(s)

• Error (L2 norm):  E = 1/2 (t − y)^2

• Derivative of error with regard to one weight w_k

  dE/dw_k = dE/dy · dy/ds · ds/dw_k

• Error E is defined with respect to y

  dE/dy = d/dy [ 1/2 (t − y)^2 ] = −(t − y)
Final Layer Update (2)

• Linear combination of weights:  s = Σ_k w_k h_k

• Activation function:  y = sigmoid(s)

• Error (L2 norm):  E = 1/2 (t − y)^2

• Derivative of error with regard to one weight w_k

  dE/dw_k = dE/dy · dy/ds · ds/dw_k

• y with respect to s is sigmoid(s)

  dy/ds = d sigmoid(s) / ds = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)
Final Layer Update (3)

• Linear combination of weights:  s = Σ_k w_k h_k

• Activation function:  y = sigmoid(s)

• Error (L2 norm):  E = 1/2 (t − y)^2

• Derivative of error with regard to one weight w_k

  dE/dw_k = dE/dy · dy/ds · ds/dw_k

• s is the weighted linear combination of hidden node values h_k

  ds/dw_k = d/dw_k [ Σ_k w_k h_k ] = h_k
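Putting the three factors together gives dE/dw_k = −(t − y) · y(1 − y) · h_k. A short sketch applying this update to the example network's output layer (the variable names and the learning rate are my own choices, not from the slides):

```python
# Gradient of the error with respect to each output-layer weight:
# dE/dw_k = dE/dy * dy/ds * ds/dw_k = -(t - y) * y * (1 - y) * h_k
h = [0.90, 0.17, 1.0]        # hidden values plus the bias unit
w = [4.5, -5.2, -2.0]        # output-layer weights (bias weight last)
y = 0.76                     # computed output
t = 1.0                      # correct output

dE_dy = -(t - y)             # -(1.0 - 0.76) = -0.24
dy_ds = y * (1 - y)          # 0.76 * 0.24 ≈ 0.18

learning_rate = 1.0          # illustrative choice
for k in range(len(w)):
    dE_dwk = dE_dy * dy_ds * h[k]
    w[k] -= learning_rate * dE_dwk   # gradient descent step

print(w)  # the weights move so that the output increases towards the target
```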