Backpropagation. Matt Gormley, Lecture 12, Oct 10, 2018.

  1. 10-601 Introduction to Machine Learning, Machine Learning Department, School of Computer Science, Carnegie Mellon University. Backpropagation, Matt Gormley, Lecture 12, Oct 10, 2018. 1

  2. Q&A 3

  3. BACKPROPAGATION 4

  4. A Recipe for Machine Learning (Background). 1. Given training data. 2. Choose each of these: a decision function and a loss function. 3. Define goal. 4. Train with SGD (take small steps opposite the gradient). 5
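
Step 4 of the recipe is worth seeing as code. Below is a minimal, illustrative SGD loop; the names grad_loss, lr, and epochs are my own, not from the slides, and grad_loss stands in for whatever gradient routine you have (for neural networks, that routine is backpropagation, the subject of this lecture).

```python
import random

def sgd(theta, data, grad_loss, lr=0.1, epochs=10):
    """Train with SGD: take small steps opposite the gradient.

    theta: parameter vector (e.g. a NumPy array); data: list of (x, y) pairs;
    grad_loss(theta, x, y): assumed to return the gradient of the loss on one
    example -- computing this efficiently is exactly what backprop provides.
    """
    for _ in range(epochs):
        random.shuffle(data)                                  # visit examples in random order
        for x, y in data:
            theta = theta - lr * grad_loss(theta, x, y)       # step opposite the gradient
    return theta
```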

  5. Approaches to Differentiation (Training). Question 1: When can we compute the gradients of the parameters of an arbitrary neural network? Question 2: When can we make the gradient computation efficient? 6

  6. Approaches to Differentiation (Training).
     1. Finite Difference Method. Pro: great for testing implementations of backpropagation. Con: slow for high-dimensional inputs / outputs. Required: the ability to call the function f(x) on any input x.
     2. Symbolic Differentiation. Note: the method you learned in high school; used by Mathematica / Wolfram Alpha / Maple. Pro: yields easily interpretable derivatives. Con: leads to exponential computation time if not carefully implemented. Required: a mathematical expression that defines f(x).
     3. Automatic Differentiation, Reverse Mode. Note: called backpropagation when applied to neural nets. Pro: computes the partial derivatives of one output f_i(x) with respect to all inputs x_j in time proportional to the computation of f(x). Con: slow for high-dimensional outputs (e.g. vector-valued functions). Required: an algorithm for computing f(x).
     4. Automatic Differentiation, Forward Mode. Note: easy to implement; uses dual numbers. Pro: computes the partial derivatives of all outputs f_i(x) with respect to one input x_j in time proportional to the computation of f(x). Con: slow for high-dimensional inputs (e.g. vector-valued x). Required: an algorithm for computing f(x). 7
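
Since the slide notes that forward mode "uses dual numbers", here is a small sketch of that idea (my own illustration, not code from the lecture): each value carries a second component that propagates the derivative with respect to one chosen input.

```python
import math

class Dual:
    """Dual number val + d*eps with eps^2 = 0; d carries df/dx alongside f."""
    def __init__(self, val, d=0.0):
        self.val, self.d = val, d

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.d + other.d)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.d * other.val + self.val * other.d)
    __rmul__ = __mul__

def sin(u): return Dual(math.sin(u.val), math.cos(u.val) * u.d)
def cos(u): return Dual(math.cos(u.val), -math.sin(u.val) * u.d)

# One forward pass yields f(x) and df/dx for a single input x
# (reverse mode, by contrast, gets all inputs' derivatives in one backward pass):
x = Dual(2.0, 1.0)                    # seed dx/dx = 1
J = cos(sin(x * x) + 3 * (x * x))     # J = cos(sin(x^2) + 3x^2), the example from slide 25
print(J.val, J.d)
```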

  7. Finite Difference Method (Training). Notes: suffers from issues of floating point precision in practice; typically only appropriate to use on small examples with an appropriately chosen epsilon. 8
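
A minimal sketch of the centered finite-difference check described above; the helper name and the test function are illustrative, not from the lecture.

```python
import numpy as np

def finite_difference_grad(f, x, epsilon=1e-5):
    """Approximate dJ/dx_i by (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps).

    Good for checking a backprop implementation on small examples; it needs
    2 * len(x) evaluations of f, so it is slow in high dimensions, and an
    epsilon that is too small runs into floating point precision problems.
    """
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = epsilon
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * epsilon)
    return grad

# Example: J(x) = sum_i x_i^2 has gradient 2x.
f = lambda x: float(np.sum(x ** 2))
x = np.array([1.0, -2.0, 3.0])
print(finite_difference_grad(f, x))   # roughly [2, -4, 6]
```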

  8. Symbolic Differentiation (Training). Differentiation Quiz #1: Suppose x = 2 and z = 3; what are dy/dx and dy/dz for the function below? 9

  9. Symbolic Differentiation (Training). Differentiation Quiz #2: … 11

  10. Chain Rule (Training). Whiteboard – Chain Rule of Calculus. 12

  11. Chain Rule (Training). Given: quantities composed in the manner y = g(u) and u = h(x). Chain Rule: dy_i/dx_k = Σ_{j=1}^{J} (dy_i/du_j)(du_j/dx_k), for all i, k. … 13

  12. Chain Rule (Training). Given: quantities composed in the manner y = g(u) and u = h(x). Chain Rule: dy_i/dx_k = Σ_{j=1}^{J} (dy_i/du_j)(du_j/dx_k), for all i, k. Backpropagation is just repeated application of the chain rule from Calculus 101. 14

  13. Error Back-Propagation 15 Slide from (Stoyanov & Eisner, 2012)

  14. Error Back-Propagation 16 Slide from (Stoyanov & Eisner, 2012)

  15. Error Back-Propagation 17 Slide from (Stoyanov & Eisner, 2012)

  16. Error Back-Propagation 18 Slide from (Stoyanov & Eisner, 2012)

  17. Error Back-Propagation 19 Slide from (Stoyanov & Eisner, 2012)

  18. Error Back-Propagation 20 Slide from (Stoyanov & Eisner, 2012)

  19. Error Back-Propagation 21 Slide from (Stoyanov & Eisner, 2012)

  20. Error Back-Propagation 22 Slide from (Stoyanov & Eisner, 2012)

  21. Error Back-Propagation 23 Slide from (Stoyanov & Eisner, 2012)

  22. Error Back-Propagation: p(y | x^(i)), θ, z, y^(i). 24 Slide from (Stoyanov & Eisner, 2012)

  23. Backpropagation (Training). Whiteboard – Example: Backpropagation for Chain Rule #1. Differentiation Quiz #1: Suppose x = 2 and z = 3; what are dy/dx and dy/dz for the function below? 25

  24. Backpropagation (Training). Automatic Differentiation, Reverse Mode (aka. backpropagation).
     Forward Computation: 1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph"). 2. Visit each node in topological order. For variable u_i with inputs v_1, …, v_N: (a) compute u_i = g_i(v_1, …, v_N); (b) store the result at the node.
     Backward Computation: 1. Initialize all partial derivatives dy/du_j to 0 and set dy/dy = 1. 2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N): (a) we already know dy/du_i; (b) increment dy/dv_j by (dy/du_i)(du_i/dv_j). (The choice of algorithm ensures that computing du_i/dv_j is easy.)
     Return the partial derivatives dy/du_i for all variables. 26
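
A compact sketch of this two-pass procedure, under my own (illustrative) representation of the computation graph: each node stores its local function g_i and its local partial derivatives du_i/dv_j, the forward pass visits nodes in topological order, and the backward pass accumulates dy/dv_j += (dy/du_i)(du_i/dv_j) in reverse order.

```python
class Node:
    """One variable u_i in the computation graph."""
    def __init__(self, fn, dfn, parents):
        self.fn = fn              # u_i = fn(v_1, ..., v_N)
        self.dfn = dfn            # dfn(j, v_1, ..., v_N) = du_i/dv_j
        self.parents = parents    # input nodes v_1, ..., v_N
        self.value, self.grad = None, 0.0   # grad accumulates dy/du_i

def forward_backward(nodes):
    """`nodes` must be in topological order; the last node is the output y."""
    # Forward: compute and store each value in topological order.
    for u in nodes:
        u.value = u.fn(*[v.value for v in u.parents])
    # Backward: set dy/dy = 1, then increment dy/dv_j by (dy/du_i)(du_i/dv_j)
    # while visiting in reverse topological order.
    for u in nodes:
        u.grad = 0.0
    nodes[-1].grad = 1.0
    for u in reversed(nodes):
        for j, v in enumerate(u.parents):
            v.grad += u.grad * u.dfn(j, *[p.value for p in u.parents])
    return nodes[-1].value

# Tiny usage: y = a*b + a with a = 2, b = 3, so dy/da = b + 1 = 4 and dy/db = a = 2.
a = Node(lambda: 2.0, None, [])
b = Node(lambda: 3.0, None, [])
prod = Node(lambda u, v: u * v, lambda j, u, v: (v, u)[j], [a, b])
y = Node(lambda p, u: p + u, lambda j, p, u: 1.0, [prod, a])
print(forward_backward([a, b, prod, y]), a.grad, b.grad)   # 8.0 4.0 2.0
```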

  25. Backpropagation (Training). Simple Example: the goal is to compute J = cos(sin(x^2) + 3x^2) on the forward pass and the derivative dJ/dx on the backward pass. Forward: J = cos(u), u = u_1 + u_2, u_1 = sin(t), u_2 = 3t, t = x^2. 27

  26. Backpropagation (Training). Simple Example: the goal is to compute J = cos(sin(x^2) + 3x^2) on the forward pass and the derivative dJ/dx on the backward pass.
     Forward: J = cos(u), u = u_1 + u_2, u_1 = sin(t), u_2 = 3t, t = x^2.
     Backward: dJ/du = −sin(u); dJ/du_1 = (dJ/du)(du/du_1), du/du_1 = 1; dJ/du_2 = (dJ/du)(du/du_2), du/du_2 = 1; dJ/dt = (dJ/du_1)(du_1/dt) + (dJ/du_2)(du_2/dt), du_1/dt = cos(t), du_2/dt = 3; dJ/dx = (dJ/dt)(dt/dx), dt/dx = 2x. 28
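
The same table written out in straight-line code as a sanity check; this is a sketch that assumes x = 2 and compares the result against the closed-form derivative.

```python
import math

x = 2.0

# Forward pass (topological order)
t  = x ** 2             # t  = x^2
u1 = math.sin(t)        # u1 = sin(t)
u2 = 3.0 * t            # u2 = 3t
u  = u1 + u2            # u  = u1 + u2
J  = math.cos(u)        # J  = cos(u)

# Backward pass (reverse topological order)
dJ_du  = -math.sin(u)                          # dJ/du
dJ_du1 = dJ_du * 1.0                           # du/du1 = 1
dJ_du2 = dJ_du * 1.0                           # du/du2 = 1
dJ_dt  = dJ_du1 * math.cos(t) + dJ_du2 * 3.0   # du1/dt = cos(t), du2/dt = 3
dJ_dx  = dJ_dt * 2.0 * x                       # dt/dx = 2x

# Closed-form check for J = cos(sin(x^2) + 3x^2)
analytic = -math.sin(math.sin(x**2) + 3*x**2) * (math.cos(x**2) * 2*x + 6*x)
print(dJ_dx, analytic)   # the two values agree
```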

  27. Backpropagation (Training). Case 1: Logistic Regression (output y, parameters θ_1, θ_2, θ_3, …, θ_M, inputs x).
     Forward: J = y* log y + (1 − y*) log(1 − y), y = 1 / (1 + exp(−a)), a = Σ_{j=0}^{D} θ_j x_j.
     Backward: dJ/dy = y*/y + (1 − y*)/(y − 1); dJ/da = (dJ/dy)(dy/da), dy/da = exp(−a) / (exp(−a) + 1)^2; dJ/dθ_j = (dJ/da)(da/dθ_j), da/dθ_j = x_j; dJ/dx_j = (dJ/da)(da/dx_j), da/dx_j = θ_j. 29
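
A hedged sketch of this forward and backward pass in code; the function name and the vectorized treatment of x and θ are my own, but each line mirrors one equation above.

```python
import numpy as np

def logreg_forward_backward(x, y_star, theta):
    """Forward/backward pass for the logistic regression example (slide 27).

    x: feature vector (with x[0] = 1 as the bias feature, an assumption here),
    theta: weight vector of the same length, y_star: true label in {0, 1}.
    """
    # Forward
    a = float(np.dot(theta, x))           # a = sum_j theta_j x_j
    y = 1.0 / (1.0 + np.exp(-a))          # y = sigmoid(a)
    J = y_star * np.log(y) + (1.0 - y_star) * np.log(1.0 - y)

    # Backward
    dJ_dy = y_star / y + (1.0 - y_star) / (y - 1.0)
    dy_da = np.exp(-a) / (np.exp(-a) + 1.0) ** 2   # equivalently y * (1 - y)
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = dJ_da * x                 # da/dtheta_j = x_j
    dJ_dx = dJ_da * theta                 # da/dx_j = theta_j
    return J, dJ_dtheta, dJ_dx
```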

  28. Backpropagation (Training). Network architecture, from output to input: (F) Loss; (E) Output (sigmoid): y = 1 / (1 + exp(−b)); (D) Output (linear): b = Σ_{j=0}^{D} β_j z_j; (C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀ j; (B) Hidden (linear): a_j = Σ_{i=0}^{M} α_ji x_i, ∀ j; (A) Input: given x_i, ∀ i. 30

  29. Backpropagation (Training). The same network with the loss written out: (F) Loss: J = (1/2)(y − y*)^2; (E) Output (sigmoid): y = 1 / (1 + exp(−b)); (D) Output (linear): b = Σ_{j=0}^{D} β_j z_j; (C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀ j; (B) Hidden (linear): a_j = Σ_{i=0}^{M} α_ji x_i, ∀ j; (A) Input: given x_i, ∀ i. 31
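
Putting slides 28 and 29 together, here is a hedged sketch of one forward and backward pass through the one-hidden-layer network with the squared-error loss; the exact handling of the bias/intercept units (a constant x_0 = 1 and z_0 = 1) is an assumption on my part, since the slides only show the sums.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def nn_forward_backward(x, y_star, alpha, beta):
    """One-hidden-layer network (slides 28-29) with J = 0.5 * (y - y*)^2.

    x: inputs of length M+1 with x[0] = 1; alpha: hidden weights, shape (D, M+1);
    beta: output weights of length D+1 (matching z, which gets a bias unit z_0 = 1).
    """
    # Forward: (B) hidden linear, (C) hidden sigmoid, (D) output linear,
    #          (E) output sigmoid, (F) loss
    a = alpha @ x                               # a_j = sum_i alpha_ji x_i
    z = np.concatenate(([1.0], sigmoid(a)))     # z_0 = 1 (assumed bias unit)
    b = float(beta @ z)                         # b = sum_j beta_j z_j
    y = sigmoid(b)
    J = 0.5 * (y - y_star) ** 2

    # Backward: repeated chain rule, layer by layer
    dJ_dy = y - y_star
    dJ_db = dJ_dy * y * (1.0 - y)               # sigmoid derivative (slide 32)
    dJ_dbeta = dJ_db * z
    dJ_dz = dJ_db * beta
    dJ_da = dJ_dz[1:] * z[1:] * (1.0 - z[1:])   # drop the bias unit's slot
    dJ_dalpha = np.outer(dJ_da, x)
    return J, dJ_dalpha, dJ_dbeta
```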

  30. Backpropagation (Training). Case 2: Neural Network. … 32

  31. Backpropagation (Training). Case 2: Neural Network, with layers labeled Loss, Sigmoid, Linear, Sigmoid, Linear. … 33

  32. Derivative of a Sigmoid 34
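
The identity behind slide 32's title is the standard derivative of the sigmoid; writing it out here (my own restatement, not copied from the slide):

```latex
\[
\sigma(a) = \frac{1}{1 + e^{-a}}
\quad\Longrightarrow\quad
\frac{d\sigma}{da}
  = \frac{e^{-a}}{(1 + e^{-a})^{2}}
  = \frac{1}{1 + e^{-a}} \cdot \frac{e^{-a}}{1 + e^{-a}}
  = \sigma(a)\,\bigl(1 - \sigma(a)\bigr).
\]
```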

  33. Backpropagation (Training). Case 2: Neural Network, with layers labeled Loss, Sigmoid, Linear, Sigmoid, Linear. … 35

  34. Backpropagation (Training). Case 2: Neural Network, with layers labeled Loss, Sigmoid, Linear, Sigmoid, Linear. … 36

  35. Backpropagation (Training). Whiteboard – SGD for Neural Network – Example: Backpropagation for Neural Network. 37

  36. Backpropagation (Training). Backpropagation (Automatic Differentiation, Reverse Mode).
     Forward Computation: 1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph"). 2. Visit each node in topological order: (a) compute the corresponding variable's value; (b) store the result at the node.
     Backward Computation: 1. Initialize all partial derivatives dy/du_j to 0 and set dy/dy = 1. 2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N): (a) we already know dy/du_i; (b) increment dy/dv_j by (dy/du_i)(du_i/dv_j). (The choice of algorithm ensures that computing du_i/dv_j is easy.)
     Return the partial derivatives dy/du_i for all variables. 38

  37. A Recipe for Machine Learning (Background: Gradients). 1. Given training data. 2. Choose each of these: a decision function and a loss function. 3. Define goal. 4. Train with SGD (take small steps opposite the gradient). Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently! 39

  38. Summary 1. Neural Networks … – provide a way of learning features – are highly nonlinear prediction functions – (can be) a highly parallel network of logistic regression classifiers – discover useful hidden representations of the input 2. Backpropagation … – provides an efficient way to compute gradients – is a special case of reverse-mode automatic differentiation 40
