
Neural Networks and Computation Graphs. CS 6956: Deep Learning for NLP. Based on slides and material from Geoffrey Hinton, Richard Socher, Yoav Goldberg, and others. The computation graph slides are based on the tutorial Practical Neural Networks for NLP.


  1. Let's see some functions as graphs. Expression: $\mathbf{y}^\top \mathbf{B} \mathbf{y}$. Graph node functions: $f(\mathbf{M}, \mathbf{w}) = \mathbf{M}\mathbf{w}$, $f(\mathbf{V}, \mathbf{W}) = \mathbf{V}\mathbf{W}$, $f(\mathbf{v}) = \mathbf{v}^\top$. Inputs: $\mathbf{B}$ and $\mathbf{y}$.

  2. Let's see some functions as graphs. Expression: $\mathbf{y}^\top \mathbf{B} \mathbf{y}$. This time a single node computes $f(\mathbf{v}, \mathbf{M}) = \mathbf{v}^\top \mathbf{M} \mathbf{v}$ directly from the inputs $\mathbf{y}$ and $\mathbf{B}$. We could have written the same function with a different graph: computation graphs are not necessarily unique for a function.

  3. Let's see some functions as graphs. Expression: $\mathbf{y}^\top \mathbf{B} \mathbf{y}$, computed with the single node $f(\mathbf{v}, \mathbf{M}) = \mathbf{v}^\top \mathbf{M} \mathbf{v}$ on inputs $\mathbf{y}$ and $\mathbf{B}$. Remember: the nodes also know how to compute derivatives with respect to each of their inputs.

  4. Let's see some functions as graphs. For the node $f(\mathbf{v}, \mathbf{M}) = \mathbf{v}^\top \mathbf{M} \mathbf{v}$, the derivative with respect to the vector input is $\frac{\partial g}{\partial \mathbf{v}} = (\mathbf{M}^\top + \mathbf{M})\,\mathbf{v}$.

  5. Let's see some functions as graphs. For the same node, the derivative with respect to the matrix input is $\frac{\partial g}{\partial \mathbf{M}} = \mathbf{v}\mathbf{v}^\top$.

  6. Let's see some functions as graphs. Evaluated at the graph's inputs, these become $\frac{\partial g}{\partial \mathbf{y}} = (\mathbf{B}^\top + \mathbf{B})\,\mathbf{y}$ and $\frac{\partial g}{\partial \mathbf{B}} = \mathbf{y}\mathbf{y}^\top$ (the node derivatives $\frac{\partial g}{\partial \mathbf{v}} = (\mathbf{M}^\top + \mathbf{M})\,\mathbf{v}$ and $\frac{\partial g}{\partial \mathbf{M}} = \mathbf{v}\mathbf{v}^\top$ with $\mathbf{v} = \mathbf{y}$, $\mathbf{M} = \mathbf{B}$). Together, we can compute derivatives of any function with respect to all of its inputs, for any value of the input.
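
To make this tangible, here is a small NumPy sketch (not part of the slides; the sizes and random values are my own) that compares the two node derivatives above against finite differences:

```python
# Check dg/dv = (M^T + M)v and dg/dM = v v^T for g(v, M) = v^T M v numerically.
import numpy as np

def g(v, M):
    return v @ M @ v

rng = np.random.default_rng(0)
v = rng.standard_normal(4)
M = rng.standard_normal((4, 4))
eps = 1e-6

dg_dv = (M.T + M) @ v          # analytical gradient w.r.t. the vector input
dg_dM = np.outer(v, v)         # analytical gradient w.r.t. the matrix input

# Central finite differences for dg/dv.
num_dv = np.array([(g(v + eps * e, M) - g(v - eps * e, M)) / (2 * eps)
                   for e in np.eye(4)])

# Central finite differences for dg/dM, one entry at a time.
num_dM = np.zeros_like(M)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(M)
        E[i, j] = eps
        num_dM[i, j] = (g(v, M + E) - g(v, M - E)) / (2 * eps)

print(np.allclose(dg_dv, num_dv), np.allclose(dg_dM, num_dM))  # True True
```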

  7. Let's see some functions as graphs. Expression: $\mathbf{y}^\top \mathbf{B} \mathbf{y} + \mathbf{c}^\top \mathbf{y} + d$. Graph node functions: $f(y_1, y_2, y_3) = \sum_j y_j$, $f(\mathbf{M}, \mathbf{w}) = \mathbf{M}\mathbf{w}$, $f(\mathbf{V}, \mathbf{W}) = \mathbf{V}\mathbf{W}$, $f(\mathbf{v}, \mathbf{w}) = \mathbf{v}^\top \mathbf{w}$, $f(\mathbf{v}) = \mathbf{v}^\top$. Inputs: $\mathbf{B}$, $\mathbf{c}$, $d$, $\mathbf{y}$.

  8. Let's see some functions as graphs. The same graph, with the output node named: $z = \mathbf{y}^\top \mathbf{B} \mathbf{y} + \mathbf{c}^\top \mathbf{y} + d$.

  9. Let's see some functions as graphs. We can name variables by labeling nodes: here $z$ labels the sum node at the root, so $z = \mathbf{y}^\top \mathbf{B} \mathbf{y} + \mathbf{c}^\top \mathbf{y} + d$.
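
As a sanity check on this graph, here is a minimal sketch that evaluates $z$ by composing the node functions listed on the slide; the concrete values of $\mathbf{y}$, $\mathbf{B}$, $\mathbf{c}$, and $d$ are my own:

```python
# Evaluate z = y^T B y + c^T y + d by chaining the slide's node functions.
import numpy as np

f_Mw  = lambda M, w: M @ w          # f(M, w) = Mw
f_VW  = lambda V, W: V @ W          # f(V, W) = VW
f_dot = lambda v, w: v.T @ w        # f(v, w) = v^T w
f_T   = lambda v: v.T               # f(v) = v^T
f_sum = lambda *ys: sum(ys)         # f(y1, y2, y3) = sum_j y_j

y = np.array([[1.0], [2.0]])        # column vector (values are made up)
B = np.array([[2.0, 0.0], [1.0, 3.0]])
c = np.array([[0.5], [-1.0]])
d = 4.0

yT   = f_T(y)                       # y^T (a row vector)
yTB  = f_VW(yT, B)                  # y^T B
yTBy = f_Mw(yTB, y)                 # y^T B y
cTy  = f_dot(c, y)                  # c^T y
z    = f_sum(yTBy, cTy, d)          # the root node

print(z.item())                     # 18.5, same as computing the formula directly
```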

  10. Why are computation graphs interesting? 1. For starters, we can write neural networks as computation graphs. 2. We can write loss functions as computation graphs, or rather, the loss computed in the innermost loop of stochastic gradient descent. 3. They are plug-and-play: we can construct a graph and use it in a program that someone else wrote. For example, we can write down a neural network and plug it into a loss function and a minimization routine from a library (see the sketch below). 4. They allow efficient gradient computation.
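
A rough illustration of the plug-and-play point (item 3): a small network graph handed to a library loss function and optimizer. This is a hedged PyTorch sketch, not the course's code; the layer sizes and random data are made up:

```python
# A small network graph combined with a library loss and optimizer.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 4), nn.Tanh(), nn.Linear(4, 2))
loss_fn = nn.MSELoss()                               # a loss from the library
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3)                                # a made-up batch of inputs
target = torch.randn(8, 2)                           # made-up targets

prediction = model(x)                                # forward pass over the graph
loss = loss_fn(prediction, target)                   # plug the graph into the loss
loss.backward()                                      # gradients for all parameters
optimizer.step()                                     # one SGD update
```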

  11. An example two-layer neural network: $\mathbf{h} = \tanh(\mathbf{X}\mathbf{y} + \mathbf{c})$, $\mathbf{z} = \mathbf{W}\mathbf{h} + \mathbf{b}$.

  12. An example two-layer neural network: $\mathbf{h} = \tanh(\mathbf{X}\mathbf{y} + \mathbf{c})$, $\mathbf{z} = \mathbf{W}\mathbf{h} + \mathbf{b}$. Graph for the first layer: $g(\mathbf{M}, \mathbf{w}) = \mathbf{M}\mathbf{w}$ applied to $\mathbf{X}$ and $\mathbf{y}$, then $g(\mathbf{v}, \mathbf{w}) = \mathbf{v} + \mathbf{w}$ with $\mathbf{c}$, then $g(\mathbf{w}) = \tanh(\mathbf{w})$, producing $\mathbf{h}$.

  13. An example two-layer neural network: $\mathbf{h} = \tanh(\mathbf{X}\mathbf{y} + \mathbf{c})$, $\mathbf{z} = \mathbf{W}\mathbf{h} + \mathbf{b}$. The full graph adds the second layer: $g(\mathbf{M}, \mathbf{w}) = \mathbf{M}\mathbf{w}$ applied to $\mathbf{W}$ and $\mathbf{h}$, then $g(\mathbf{v}, \mathbf{w}) = \mathbf{v} + \mathbf{w}$ with $\mathbf{b}$, producing $\mathbf{z}$.
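
A minimal NumPy sketch of the forward computation through this graph, reusing the slide's node functions; the shapes and random values are assumptions of mine:

```python
# Forward computation of h = tanh(Xy + c), z = Wh + b via the node functions.
import numpy as np

g_matvec = lambda M, w: M @ w        # g(M, w) = Mw
g_add    = lambda v, w: v + w        # g(v, w) = v + w
g_tanh   = lambda w: np.tanh(w)      # g(w) = tanh(w)

rng = np.random.default_rng(1)
y = rng.standard_normal(3)           # input vector
X = rng.standard_normal((4, 3))      # first-layer weights
c = rng.standard_normal(4)           # first-layer bias
W = rng.standard_normal((2, 4))      # second-layer weights
b = rng.standard_normal(2)           # second-layer bias

h = g_tanh(g_add(g_matvec(X, y), c)) # h = tanh(Xy + c)
z = g_add(g_matvec(W, h), b)         # z = Wh + b
print(h, z)
```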

  14. Exercises. Write the following functions as computation graphs:
      • $g(y) = y^3 - \log(y)$
      • $g(y) = \frac{1}{1 + \exp(-y)}$
      • $g(\mathbf{w}, \mathbf{x}, z) = \max(0,\ 1 - z\,\mathbf{w}^\top\mathbf{x})$
      • $\min_{\mathbf{w}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max(0,\ 1 - z_i\,\mathbf{w}^\top\mathbf{x}_i)$

  15. Where are we? • What is a neural network? • Computation graphs • Algorithms over computation graphs: the forward pass, the backward pass

  16. Three computational questions. 1. Forward propagation – Given inputs to the graph, compute the value of the function expressed by the graph. (Something to think about: given a node, can we say which nodes are inputs? Which nodes are outputs?) 2. Backpropagation – After computing the function value for an input, compute the gradient of the function at that input. Or equivalently: how does the output change if I make a small change to the input? 3. Constructing graphs – We need an easy-to-use framework to construct graphs. The size of the graph may be input dependent, so we want a templating language that creates graphs on the fly. TensorFlow and PyTorch are the most popular frameworks today.

  17. Forward propagation

  18. Three computational questions (outline repeated; forward propagation is up next): 1. Forward propagation. 2. Backpropagation. 3. Constructing graphs.

  19. Forward pass: an example. The graph has nodes computing $\sum_i v_i$, $\log v$, $v \cdot w$, $v^2$, and $v + w$, with inputs $y$ and $z$. Conventions: 1. Any expression next to a node is the function it computes. 2. All the variables in the expression are inputs to the node, from left to right.

  20. Forward pass. What function does this compute? (Same graph: nodes $\sum_i v_i$, $\log v$, $v \cdot w$, $v^2$, $v + w$; inputs $y$ and $z$.)

  21. Forward pass. What function does this compute? Suppose we shade nodes whose values we know (i.e., we have computed them).

  22. Forward pass. Shaded (computed) so far: the input $y$.

  23. Forward pass. Shaded so far: the inputs $y$ and $z$.

  24. Forward pass. We can only compute the value of a node if we know the values of all its inputs.

  25. Forward pass. Computed so far: $y$, $z$, and $y + z$.

  26. Forward pass. Computed so far: $y$, $z$, $y + z$, and $z^2$.

  27. Forward pass. Computed so far: $y$, $z$, $y + z$, $z^2$, and $y(y + z)$.

  28. Forward pass. Computed so far: $y$, $z$, $y + z$, $z^2$, $y(y + z)$, and $\log(y + z)$.

  29. Forward pass. Finally, the root node: $y(y + z) + \log(y + z) + z^2$.

  30. Forward pass. This gives us the function: $f(y, z) = y(y + z) + \log(y + z) + z^2$.
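
A short sketch of the forward pass just traced, with my own values $y = 2$ and $z = 3$; each node is evaluated only after all of its inputs are known:

```python
# Forward pass for f(y, z) = y(y + z) + log(y + z) + z^2, node by node.
import math

y, z = 2.0, 3.0                      # input nodes (values are my choice)

s  = y + z                           # node v + w
sq = z ** 2                          # node v^2
p  = y * s                           # node v * w   -> y(y + z)
lg = math.log(s)                     # node log v   -> log(y + z)
f  = p + lg + sq                     # node sum_i v_i (the root)

print(f, y * (y + z) + math.log(y + z) + z ** 2)   # both print the same value
```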

  31. A second example. The graph for $\mathbf{y}^\top\mathbf{B}\mathbf{y} + \mathbf{c}^\top\mathbf{y} + d$ again: node functions $f(y_1, y_2, y_3) = \sum_j y_j$ (producing $z$), $f(\mathbf{M}, \mathbf{w}) = \mathbf{M}\mathbf{w}$, $f(\mathbf{V}, \mathbf{W}) = \mathbf{V}\mathbf{W}$, $f(\mathbf{v}, \mathbf{w}) = \mathbf{v}^\top\mathbf{w}$, $f(\mathbf{v}) = \mathbf{v}^\top$; inputs $\mathbf{B}$, $\mathbf{c}$, $d$, $\mathbf{y}$.

  32. A second example. To compute the function, we need the values of the leaves of this DAG.

  33. A second example. The leaves $\mathbf{B}$, $\mathbf{c}$, $d$, and $\mathbf{y}$ are now known.

  34. A second example. Let's also highlight which nodes can be computed using what we know so far.

  35. A second example. Computed: $\mathbf{y}^\top$.

  36. A second example. Computed: $\mathbf{y}^\top$ and $\mathbf{c}^\top\mathbf{y}$.

  37. A second example. Computed: $\mathbf{y}^\top$, $\mathbf{c}^\top\mathbf{y}$, and $\mathbf{y}^\top\mathbf{B}$.

  38. A second example. Computed: $\mathbf{y}^\top$, $\mathbf{c}^\top\mathbf{y}$, $\mathbf{y}^\top\mathbf{B}$, and $\mathbf{y}^\top\mathbf{B}\mathbf{y}$.

  39. A second example. Finally, the root: $\mathbf{y}^\top\mathbf{B}\mathbf{y} + \mathbf{c}^\top\mathbf{y} + d$.

  40. Forward propagation. Given a computation graph G and values of its input nodes: for each node in the graph, in topological order, compute the value of that node. Why topological order: it ensures that children are computed before parents.

  41. Forward propagation. Given a computation graph G and values of its input nodes: for each node in the graph, in topological order, compute the value of that node. Why topological order: it ensures that children are computed before parents.
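
A generic sketch of this algorithm (my own `Node` representation, not from any framework): nodes store their input nodes and a function, and we visit them in a caller-supplied topological order:

```python
# Forward propagation over an arbitrary DAG of nodes, in topological order.
import math

class Node:
    def __init__(self, fn, inputs=()):
        self.fn, self.inputs, self.value = fn, list(inputs), None

def forward(graph_inputs, nodes):
    """graph_inputs: {leaf_node: value}; nodes: all nodes in topological order,
    with the root last."""
    for node, value in graph_inputs.items():
        node.value = value
    for node in nodes:
        if node.value is None:                       # leaves are already set
            node.value = node.fn(*[p.value for p in node.inputs])
    return nodes[-1].value                           # value at the root

# The earlier example: f(y, z) = y(y + z) + log(y + z) + z^2
y, z = Node(None), Node(None)
s  = Node(lambda a, b: a + b, [y, z])
sq = Node(lambda a: a * a, [z])
p  = Node(lambda a, b: a * b, [y, s])
lg = Node(math.log, [s])
f  = Node(lambda *vs: sum(vs), [p, lg, sq])

print(forward({y: 2.0, z: 3.0}, [y, z, s, sq, p, lg, f]))
```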

  42. Backpropagation with computation graphs

  43. Three computational questions (outline repeated; backpropagation is up next): 1. Forward propagation. 2. Backpropagation. 3. Constructing graphs.

  44. Calculus refresher: the chain rule. Suppose we have two functions $g$ and $h$, and we wish to compute the gradient of $z = g(h(y))$. We know that $\frac{dz}{dy} = g'(h(y)) \cdot h'(y)$. Or equivalently: if $u = h(y)$ and $z = g(u)$, then $\frac{dz}{dy} = \frac{dz}{du} \cdot \frac{du}{dy}$.
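
A tiny numeric check of the chain rule; the particular functions $h(y) = \sin(y)$ and $g(u) = e^u$ are my own choices:

```python
# Verify dz/dy = dz/du * du/dy for z = g(h(y)) against a finite difference.
import math

h,  dh = math.sin, math.cos                      # h and h'
g,  dg = math.exp, math.exp                      # g and g'

y = 0.7
u = h(y)
dz_dy_chain = dg(u) * dh(y)                      # dz/du * du/dy

eps = 1e-6                                       # finite-difference check
dz_dy_numeric = (g(h(y + eps)) - g(h(y - eps))) / (2 * eps)
print(dz_dy_chain, dz_dy_numeric)                # these agree to many digits
```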

  45. Or equivalently, in terms of computation graphs: the input $y$ feeds a node $g$ that produces $u$, which feeds a node $f$ that produces $z$. The forward pass gives us $u$ and $z$.

  46. Or equivalently, in terms of computation graphs. The forward pass gives us $u$ and $z$. Remember that each node knows not only how to compute its value given its inputs, but also how to compute gradients.

  47. Or equivalently, in terms of computation graphs. Start from the root of the graph and work backwards; at the node $u$ we have $\frac{dz}{du}$.

  48. Or equivalently, in terms of computation graphs. When traversing an edge backwards to a new node, the gradient of the root with respect to that node is the product of the gradient at the parent with the derivative along that edge: $\frac{dz}{dy} = \frac{dz}{du} \cdot \frac{du}{dy}$.
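
A small sketch of this backward traversal for a two-node chain $z = f(g(y))$; the specific functions ($g$ the square, $f$ the logarithm) and their local derivatives are my own choices:

```python
# Backward traversal: start at the root with gradient 1, multiply by the local
# derivative each time we follow an edge backwards.
import math

g,  dg = (lambda v: v * v),  (lambda v: 2 * v)   # g and its local derivative
f,  df = math.log,           (lambda v: 1.0 / v) # f and its local derivative

y = 3.0
u = g(y)                        # forward pass
z = f(u)

dz_dz = 1.0                     # gradient at the root
dz_du = dz_dz * df(u)           # traverse the edge z -> u: multiply by f'(u)
dz_dy = dz_du * dg(y)           # traverse the edge u -> y: multiply by g'(y)

print(dz_dy, 2.0 / y)           # d/dy log(y^2) = 2/y, so both print 2/3
```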

  49. A concrete example: $z = \frac{1}{y^2}$. Graph: the input $y$ feeds a node $h(v) = v^2$ producing an intermediate value $u$, which feeds a node $g(v) = \frac{1}{v}$ producing $z$.

  50. A concrete example. Let's also explicitly write down the derivatives: $\frac{dg}{dv} = -\frac{1}{v^2}$ and $\frac{dh}{dv} = 2v$.

  51. A concrete example. We start with $\frac{dz}{dz} = 1$. Now we can proceed backwards from the output. At each step, we compute the gradient of the function represented by the graph with respect to the node that we are at.

  52. A concrete example. $\frac{dz}{du} = \frac{dz}{dz} \cdot \frac{dg}{dv}\big|_{v=u} = 1 \cdot \left(-\frac{1}{u^2}\right) = -\frac{1}{u^2}$: the product of the gradient so far and the derivative computed at this step.

  53. A concrete example. $\frac{dz}{dy} = \frac{dz}{du} \cdot \frac{dh}{dv}\big|_{v=y} = -\frac{1}{u^2} \cdot 2y = -\frac{2y}{y^4}$.

  54. A concrete example. We can simplify this to get $-\frac{2}{y^3}$.
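
The same backward steps with a concrete value plugged in ($y = 2$ is my choice):

```python
# Backward pass for z = 1/y^2 through the graph y -> u = y^2 -> z = 1/u.
y = 2.0

u = y ** 2                       # forward pass
z = 1.0 / u

dz_dz = 1.0
dz_du = dz_dz * (-1.0 / u ** 2)  # derivative of 1/v at v = u
dz_dy = dz_du * (2.0 * y)        # derivative of v^2 at v = y

print(dz_dy, -2.0 / y ** 3)      # both give -2/y^3 = -0.25
```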

  55. A concrete example with multiple outgoing edges: $z = \frac{1}{y}$. Graph: the input $y$ feeds a node $h(v) = v^2$ producing $u$, and both $u$ and $y$ feed a node $g(v, w) = \frac{w}{v}$ producing $z$ (with $v = u$ and $w = y$).

  56. A concrete example with multiple outgoing edges. Let's also explicitly write down the derivatives: $\frac{\partial g}{\partial v} = -\frac{w}{v^2}$, $\frac{\partial g}{\partial w} = \frac{1}{v}$, and $\frac{dh}{dv} = 2v$. Note that $g$ has two derivatives because it has two inputs.

  57. A concrete example with multiple outgoing edges. As before, we start at the root: $\frac{dz}{dz} = 1$.

  58. A concrete example with multiple outgoing edges. At this point, we can compute the gradient of $z$ with respect to $u$ by following the edge from $z$ to $u$. But we cannot yet follow the edge from $z$ to $y$, because not all of $y$'s descendants are marked as done.

  59. A concrete example with multiple outgoing edges. $\frac{dz}{du} = \frac{dz}{dz} \cdot \frac{\partial g}{\partial v}\big|_{v=u,\,w=y} = 1 \cdot \left(-\frac{y}{u^2}\right) = -\frac{y}{u^2}$: the product of the gradient so far and the derivative computed at this step.

  60. A concrete example with multiple outgoing edges. Now we can get to $y$. There are multiple backward paths into $y$. The general rule: add the gradients along all the paths.

  61. A concrete example with multiple outgoing edges. $\frac{dz}{dy} = \frac{dz}{du} \cdot \frac{dh}{dv}\big|_{v=y} + \frac{dz}{dz} \cdot \frac{\partial g}{\partial w}\big|_{v=u}$. There are multiple backward paths into $y$; the general rule is to add the gradients along all the paths.

  62. A concrete example with multiple outgoing edges. Again: $\frac{dz}{dy} = \frac{dz}{du} \cdot \frac{dh}{dv}\big|_{v=y} + \frac{dz}{dz} \cdot \frac{\partial g}{\partial w}\big|_{v=u}$, one term per backward path into $y$.

  63. A concrete example with multiple outgoing edges. The same sum over paths, written once more before plugging in the node derivatives.

  64. A concrete example with multiple outgoing edges. Plugging in, with $u = y^2$: $\frac{dz}{dy} = -\frac{y}{u^2} \cdot 2y + 1 \cdot \frac{1}{u} = -\frac{2y^2}{y^4} + \frac{1}{y^2} = -\frac{1}{y^2}$.
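
The multiple-path example with a concrete value ($y = 2$ is my choice), showing the gradient at $y$ as the sum of the contributions from both backward paths:

```python
# Backward pass for z = y / y^2 = 1/y, where y feeds two different nodes.
y = 2.0

u = y ** 2                        # forward pass
z = y / u                         # this equals 1/y

dz_dz = 1.0
dz_du = dz_dz * (-y / u ** 2)     # through the "v" input of g(v, w) = w / v
path_through_u = dz_du * (2.0 * y)    # continue down to y via the square node
path_direct    = dz_dz * (1.0 / u)    # through the "w" input of g(v, w) = w / v

dz_dy = path_through_u + path_direct  # add the gradients along all paths
print(dz_dy, -1.0 / y ** 2)           # both give -1/y^2 = -0.25
```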

  65. A neural network example: $\mathbf{h} = \tanh(\mathbf{X}\mathbf{y} + \mathbf{c})$, $\mathbf{z} = \mathbf{W}\mathbf{h} + \mathbf{b}$, $\ell = \frac{1}{2}\lVert \mathbf{z} - \mathbf{z}^* \rVert^2$. This is the same two-layer network we saw before, but this time we have added a loss term at the end. Suppose our goal is to compute the derivative of the loss with respect to $\mathbf{X}$, $\mathbf{W}$, $\mathbf{b}$, and $\mathbf{c}$.

  66. A neural network. The full graph: $g(\mathbf{v}, \mathbf{w}) = \frac{1}{2}\lVert \mathbf{v} - \mathbf{w} \rVert^2$ producing $\ell$ from $\mathbf{z}$ and $\mathbf{z}^*$; $g(\mathbf{v}, \mathbf{w}) = \mathbf{v} + \mathbf{w}$ producing $\mathbf{z}$ from $\mathbf{W}\mathbf{h}$ and $\mathbf{b}$; $g(\mathbf{M}, \mathbf{w}) = \mathbf{M}\mathbf{w}$ applied to $\mathbf{W}$ and $\mathbf{h}$; $g(\mathbf{w}) = \tanh(\mathbf{w})$ producing $\mathbf{h}$; $g(\mathbf{v}, \mathbf{w}) = \mathbf{v} + \mathbf{w}$ with $\mathbf{c}$; $g(\mathbf{M}, \mathbf{w}) = \mathbf{M}\mathbf{w}$ applied to $\mathbf{X}$ and $\mathbf{y}$. Inputs: $\mathbf{z}^*$, $\mathbf{b}$, $\mathbf{W}$, $\mathbf{c}$, $\mathbf{X}$, $\mathbf{y}$.

  67. A neural network. To simplify notation, let us name all the nodes; the intermediate nodes are labeled $\mathbf{a}_2$ through $\mathbf{a}_5$.

  68. A neural network. We begin the backward pass at the loss: $\frac{d\ell}{d\ell} = 1$. Let us highlight nodes that are done.
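
A hedged PyTorch sketch of this last example, letting autograd do the backward pass for the same network and squared-error loss; all sizes and values are my own choices, and the slides do not prescribe this particular framework usage:

```python
# Gradients of l = 1/2 ||z - z*||^2 w.r.t. X, W, b, c via autograd.
import torch

torch.manual_seed(0)
y      = torch.randn(3)                         # input (made-up size and values)
z_star = torch.randn(2)                         # target z*

X = torch.randn(4, 3, requires_grad=True)       # parameters we differentiate w.r.t.
c = torch.randn(4, requires_grad=True)
W = torch.randn(2, 4, requires_grad=True)
b = torch.randn(2, requires_grad=True)

h    = torch.tanh(X @ y + c)                    # h = tanh(Xy + c)
z    = W @ h + b                                # z = Wh + b
loss = 0.5 * torch.sum((z - z_star) ** 2)       # l = 1/2 ||z - z*||^2

loss.backward()                                 # backpropagation through the graph
print(X.grad.shape, W.grad.shape, b.grad, c.grad)
```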
