Computation Graphs

Philipp Koehn

29 September 2020
Neural Network Cartoon

• A common way to illustrate a neural network

[Figure: network diagram with input nodes x, a hidden layer h, and output y]
Neural Network Math

• Hidden layer: h = sigmoid(W1 x + b1)

• Final layer: y = sigmoid(W2 h + b2)
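• The two formulas map one-to-one onto code; a minimal numpy sketch (names mirror the slide's notation; this is an illustration, not a particular toolkit's API):

    import numpy as np

    def sigmoid(x):
        # elementwise logistic function
        return 1.0 / (1.0 + np.exp(-x))

    def forward(x, W1, b1, W2, b2):
        h = sigmoid(W1 @ x + b1)   # hidden layer
        y = sigmoid(W2 @ h + b2)   # final layer
        return y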
Computation Graph

[Figure: the two formulas unfolded into a graph. x and W1 feed a prod node, a sum node adds b1, then a sigmoid node follows; its output and W2 feed a second prod node, a sum node adds b2, then a final sigmoid node produces y]
Simple Neural Network

[Figure: example network whose weights correspond to W1 = [3.7, 3.7; 2.9, 2.9], b1 = [-1.5, -4.6], W2 = [4.5, -5.2], b2 = [-2.0], as spelled out on the next slide]
Computation Graph

[Figure: the computation graph from before, annotated with the example network's parameters]

• W1 = [3.7, 3.7; 2.9, 2.9], b1 = [-1.5, -4.6]

• W2 = [4.5, -5.2], b2 = [-2.0]
Processing Input

[Figure sequence: the computation graph is filled in bottom-up for the input x = [1.0, 0.0], one value per slide]

• prod: W1 x = [3.7, 2.9]

• sum: W1 x + b1 = [2.2, -1.6]

• sigmoid: h = sigmoid(W1 x + b1) = [.900, .168]

• prod: W2 h = [3.18]

• sum: W2 h + b2 = [1.18]

• sigmoid: y = sigmoid(W2 h + b2) = [.765]
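• The same forward pass in a few lines of numpy reproduces the values filled into the graph (rounded as on the slides):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    W1 = np.array([[3.7, 3.7], [2.9, 2.9]])
    b1 = np.array([-1.5, -4.6])
    W2 = np.array([[4.5, -5.2]])
    b2 = np.array([-2.0])
    x  = np.array([1.0, 0.0])

    h = sigmoid(W1 @ x + b1)  # prod [3.7, 2.9] -> sum [2.2, -1.6] -> [.900, .168]
    y = sigmoid(W2 @ h + b2)  # prod [3.18]     -> sum [1.18]      -> [.765]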
Error Function

• For training, we need a measure of how well we are doing
⇒ an objective function, also known as loss or cost function

• For instance, the L2 norm

    error = 1/2 (t - y)^2
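• A one-line sketch of this error; the target value t = 1.0 is taken from the backward-pass slides further down:

    def l2_error(t, y):
        # L2 norm error from the slide: 1/2 (t - y)^2
        return 0.5 * (t - y) ** 2

    print(l2_error(1.0, 0.765))  # 0.0276..., the error value seen in the backward pass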
Gradient Descent

• We view the error as a function of the trainable parameters: error(λ)

• We want to optimize error(λ) by moving λ towards its optimum

[Figure: three plots of error(λ), showing the current λ at points with gradient 2, 1, and 0.2; the gradient shrinks as λ approaches the optimum]

• Why not just set λ to its optimum?
  – we are updating based on one training example and do not want to overfit to it
  – we are also changing all the other parameters, so the curve will look different
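• A minimal sketch of a single gradient-descent step under the usual minimization convention (the learning rate mu = 0.1 is an assumed example value, not given on the slide):

    def gradient_descent_step(lmbda, gradient, mu=0.1):
        # move lambda a small step against the gradient of the error
        return lmbda - mu * gradient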
Calculus Refresher: Chain Rule

• Formula for computing the derivative of a composition of two or more functions
  – functions f and g
  – composition f ∘ g maps x to f(g(x))

• Chain rule

    (f ∘ g)' = (f' ∘ g) · g'    or    F'(x) = f'(g(x)) g'(x)

• Leibniz's notation

    dz/dx = dz/dy · dy/dx

  if z = f(y) and y = g(x), then

    dz/dx = dz/dy · dy/dx = f'(y) g'(x) = f'(g(x)) g'(x)
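• The rule is easy to check numerically; a small sketch with example functions of my own choosing (f = sin and g = squaring are not from the slide):

    import math

    def g(x): return x * x          # inner function
    def f(y): return math.sin(y)    # outer function

    def chain_rule(x):
        # F'(x) = f'(g(x)) * g'(x)
        return math.cos(g(x)) * (2 * x)

    def finite_difference(x, eps=1e-6):
        return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

    print(chain_rule(1.5), finite_difference(1.5))  # both approx. -1.8846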
Final Layer Update

• Linear combination of weights: s = Σk wk hk

• Activation function: y = sigmoid(s)

• Error (L2 norm): E = 1/2 (t - y)^2

• Derivative of the error with respect to one weight wk:

    dE/dwk = dE/dy · dy/ds · ds/dwk
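• Multiplying out the three factors gives the classic delta-rule gradient; a sketch that keeps the slides' sign convention, where the factor t - y means the eventual update is added to the weights:

    import numpy as np

    def final_layer_gradient(t, y, h):
        dE_dy = t - y            # from E = 1/2 (t - y)^2, slides' sign convention
        dy_ds = y * (1 - y)      # sigmoid'(s), expressed via y = sigmoid(s)
        ds_dw = h                # s = sum_k w_k h_k, so ds/dw_k = h_k
        return dE_dy * dy_ds * ds_dw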
Error Computation in Computation Graph

[Figure: the computation graph extended by two nodes, the target value t and an L2 node that compares the final sigmoid output against t]
Error Propagation in Computation Graph

[Figure: chain of nodes A → B → ... → E]

• Compute derivative at node A: dE/dA = dE/dB · dB/dA

• Assume that we already computed dE/dB (during the backward pass through the graph)

• So now we only have to get the formula for dB/dA

• For instance, B is a square node
  – forward computation: B = A^2
  – backward computation: dB/dA = d(A^2)/dA = 2A
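• This forward/backward contract is all a node needs to implement; a minimal sketch of the square-node example (the class layout is my own illustration, not a specific toolkit's API):

    class SquareNode:
        def forward(self, A):
            self.A = A                   # remember the input for the backward pass
            return A * A                 # B = A^2

        def backward(self, dE_dB):
            return dE_dB * 2 * self.A    # chain rule: dE/dA = dE/dB * dB/dA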
Derivatives for Each Node

[Figure sequence: the computation graph annotated, node by node, with each node's local derivative]

• L2 node: d L2/di = d/di 1/2 (t - i)^2 = t - i

• sigmoid node: d sigmoid/di = d/di σ(i) = σ(i)(1 - σ(i))

• sum node: d sum/di1 = d/di1 (i1 + i2) = 1, likewise d sum/di2 = 1

• prod node: d prod/di1 = d/di1 (i1 i2) = i2, likewise d prod/di2 = i1
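• The four rules translate into one small class per node type; a sketch for scalar or elementwise inputs (again my own illustration of the interface):

    import numpy as np

    class Sum:                          # o = i1 + i2
        def forward(self, i1, i2):
            return i1 + i2
        def backward(self, grad):       # do/di1 = do/di2 = 1
            return grad, grad

    class Prod:                         # o = i1 * i2
        def forward(self, i1, i2):
            self.i1, self.i2 = i1, i2
            return i1 * i2
        def backward(self, grad):       # do/di1 = i2, do/di2 = i1
            return grad * self.i2, grad * self.i1

    class Sigmoid:                      # o = sigma(i)
        def forward(self, i):
            self.o = 1.0 / (1.0 + np.exp(-i))
            return self.o
        def backward(self, grad):       # do/di = sigma(i)(1 - sigma(i))
            return grad * self.o * (1 - self.o)

    class L2:                           # o = 1/2 (t - i)^2
        def forward(self, t, i):
            self.t, self.i = t, i
            return 0.5 * (t - i) ** 2
        def backward(self, grad=1.0):   # slides' convention: do/di = t - i
            return grad * (self.t - self.i)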
Backward Pass: Derivative Computation

[Figure sequence: error derivatives are filled in top-down through the annotated graph, one node per slide, for target t = [1.0]]

• L2 node: error value [.0277], derivative t - y = [.235]

• output sigmoid: σ'(i) = [.180]; chained: [.235] × [.180] = [.0424]

• output sum: passes [.0424] unchanged on to the prod node and to b2

• output prod: derivative w.r.t. W2 is [.0382, .00712], w.r.t. h is [.191, -.220]

• hidden sigmoid: chained with σ'(i): [.0171, -.0308]

• hidden sum: passes [.0171, -.0308] unchanged on to the prod node and to b1

• hidden prod: derivative w.r.t. W1 is [.0171, 0; -.0308, 0], w.r.t. x is [-.0260, -.0260]
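• The full backward pass fits in a few lines of numpy and reproduces the slides' numbers (with the t - y sign convention, so the eventual update is added):

    import numpy as np

    def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

    W1 = np.array([[3.7, 3.7], [2.9, 2.9]]); b1 = np.array([-1.5, -4.6])
    W2 = np.array([[4.5, -5.2]]);            b2 = np.array([-2.0])
    x  = np.array([1.0, 0.0]);               t  = np.array([1.0])

    # forward pass
    h = sigmoid(W1 @ x + b1)      # [.900, .168]
    y = sigmoid(W2 @ h + b2)      # [.765]

    # backward pass
    d_y  = t - y                  # [.235]
    d_s2 = d_y * y * (1 - y)      # [.0424], through the output sigmoid
    d_W2 = np.outer(d_s2, h)      # [[.0382, .00712]]
    d_b2 = d_s2                   # the sum node passes the gradient through
    d_h  = W2.T @ d_s2            # [.191, -.220]
    d_s1 = d_h * h * (1 - h)      # [.0171, -.0308], through the hidden sigmoid
    d_W1 = np.outer(d_s1, x)      # [[.0171, 0], [-.0308, 0]]
    d_b1 = d_s1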
Gradients for Parameter Update

[Figure: the computation graph with only the gradients needed for the parameter update retained]

• dE/dW1 = [.0171, 0; -.0308, 0]

• dE/db1 = [.0171, -.0308]

• dE/dW2 = [.0382, .00712]

• dE/db2 = [.0424]
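• A sketch of the update itself, continuing the variables from the previous block; mu = 1.0 is an assumed example learning rate, and the gradients are added because of the t - y sign convention above:

    # continues from the backward-pass sketch above
    mu = 1.0                      # assumed example learning rate
    W1 = W1 + mu * d_W1;  b1 = b1 + mu * d_b1
    W2 = W2 + mu * d_W2;  b2 = b2 + mu * d_b2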