Let's see some functions as graphs

Expression: $\mathbf{x}^\top A \mathbf{x}$

Graph: one node transposes its input ($f(\mathbf{v}) = \mathbf{v}^\top$), one computes a matrix-vector product ($f(A, \mathbf{u}) = A\mathbf{u}$), and one multiplies its two inputs ($f(\mathbf{v}, \mathbf{u}) = \mathbf{v}\mathbf{u}$). The leaves are the inputs $A$ and $\mathbf{x}$.
Let's see some functions as graphs

Expression: $\mathbf{x}^\top A \mathbf{x}$

We could have written the same function with a different graph: a single node $f(\mathbf{v}, A) = \mathbf{v}^\top A \mathbf{v}$ with inputs $\mathbf{x}$ and $A$.

Computation graphs are not necessarily unique for a function.
Let's see some functions as graphs

Expression: $\mathbf{x}^\top A \mathbf{x}$, computed by the single node $f(\mathbf{v}, A) = \mathbf{v}^\top A \mathbf{v}$ with inputs $\mathbf{x}$ and $A$.

Remember: the nodes also know how to compute derivatives with respect to each of their inputs.

Derivative with respect to the vector input:
$$\frac{\partial f}{\partial \mathbf{v}} = (A^\top + A)\,\mathbf{v}$$

Derivative with respect to the matrix input:
$$\frac{\partial f}{\partial A} = \mathbf{v}\mathbf{v}^\top$$

For this graph, that gives $\frac{\partial f}{\partial \mathbf{x}} = (A^\top + A)\,\mathbf{x}$ and $\frac{\partial f}{\partial A} = \mathbf{x}\mathbf{x}^\top$.

Together, we can compute derivatives of any function with respect to all its inputs, for any value of the input.
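As a concrete illustration (not from the slides; the function names and the 3-dimensional shapes are arbitrary), here is a minimal sketch of a node for $f(\mathbf{v}, A) = \mathbf{v}^\top A \mathbf{v}$ that knows both its value and its derivative with respect to each input, plus a finite-difference sanity check of the vector derivative:

```python
# Sketch of a node for f(v, A) = v^T A v that knows its value and its
# local derivatives. Names and shapes are illustrative, not from the lecture.
import numpy as np

def f_value(v, A):
    return v @ A @ v                 # scalar v^T A v

def f_grad_v(v, A):
    return (A.T + A) @ v             # derivative w.r.t. the vector input

def f_grad_A(v, A):
    return np.outer(v, v)            # derivative w.r.t. the matrix input: v v^T

# Finite-difference check of df/dv at a random point.
rng = np.random.default_rng(0)
v, A = rng.normal(size=3), rng.normal(size=(3, 3))
eps = 1e-6
numeric = np.array([(f_value(v + eps * np.eye(3)[i], A) - f_value(v, A)) / eps
                    for i in range(3)])
print(np.allclose(numeric, f_grad_v(v, A), atol=1e-4))   # True
```

A node like this, paired with a value/derivative pair for every other primitive, is all that the backward pass described later will need.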
Let's see some functions as graphs

Expression: $z = \mathbf{x}^\top A \mathbf{x} + \mathbf{b}^\top \mathbf{x} + c$

Graph: the top node computes $f(y_1, y_2, y_3) = \sum_i y_i$. Its three inputs are built from the primitive nodes $f(A, \mathbf{u}) = A\mathbf{u}$, $f(\mathbf{v}, \mathbf{u}) = \mathbf{v}^\top\mathbf{u}$, and $f(\mathbf{v}) = \mathbf{v}^\top$, applied to the leaves $A$, $\mathbf{b}$, $c$, and $\mathbf{x}$.

We can name variables by labeling nodes: here the output node is labeled $z$.
Why are computation graphs interesting?

1. For starters, we can write neural networks as computation graphs.
2. We can write loss functions as computation graphs, i.e., the function evaluated in the innermost loop of stochastic gradient descent.
3. They are plug-and-play: we can construct a graph and use it in a program that someone else wrote. For example, we can write down a neural network and plug it into a loss function and a minimization routine from a library.
4. They allow efficient gradient computation.
An example two layer neural network

$$\mathbf{h} = \tanh(W\mathbf{x} + \mathbf{b}), \qquad \mathbf{y} = V\mathbf{h} + \mathbf{a}$$

As a graph, the first layer uses the nodes $f(M, \mathbf{u}) = M\mathbf{u}$ (inputs $W$ and $\mathbf{x}$), $f(\mathbf{v}, \mathbf{u}) = \mathbf{v} + \mathbf{u}$ (adding $\mathbf{b}$), and $f(\mathbf{v}) = \tanh(\mathbf{v})$, producing $\mathbf{h}$.

The second layer reuses the same node types: $f(M, \mathbf{u}) = M\mathbf{u}$ applied to $V$ and $\mathbf{h}$, then $f(\mathbf{v}, \mathbf{u}) = \mathbf{v} + \mathbf{u}$ adding $\mathbf{a}$, producing $\mathbf{y}$.
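A small sketch of this network assembled from exactly these primitive node functions (the dimensions below and the use of NumPy are assumptions for illustration, not part of the lecture):

```python
# The two-layer network h = tanh(Wx + b), y = Vh + a, built only from the
# primitive nodes on the slide: matrix-vector product, addition, and tanh.
import numpy as np

def matvec(M, u):   # node f(M, u) = Mu
    return M @ u

def add(v, u):      # node f(v, u) = v + u
    return v + u

def tanh(v):        # node f(v) = tanh(v)
    return np.tanh(v)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                # input
W, b = rng.normal(size=(5, 4)), rng.normal(size=5)    # first-layer parameters
V, a = rng.normal(size=(3, 5)), rng.normal(size=3)    # second-layer parameters

h = tanh(add(matvec(W, x), b))   # hidden layer
y = add(matvec(V, h), a)         # output layer
print(y.shape)                   # (3,)
```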
Exercises

Write the following functions as computation graphs:
• $f(x) = x^3 - \log(x)$
• $f(x) = \dfrac{1}{1 + \exp(-x)}$
• $f(\mathbf{w}, \mathbf{x}, y) = \max(0,\ 1 - y\,\mathbf{w}^\top\mathbf{x})$
• $\min_{\mathbf{w}} \ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C \sum_i \max(0,\ 1 - y_i\,\mathbf{w}^\top\mathbf{x}_i)$
Where are we?

• What is a neural network?
• Computation graphs
• Algorithms over computation graphs
  – The forward pass
  – The backward pass
Three computational questions

1. Forward propagation
   – Given inputs to the graph, compute the value of the function expressed by the graph.
   – Something to think about: Given a node, can we say which nodes are inputs? Which nodes are outputs?
2. Backpropagation
   – After computing the function value for an input, compute the gradient of the function at that input.
   – Or equivalently: How does the output change if I make a small change to the input?
3. Constructing graphs
   – We need an easy-to-use framework to construct graphs.
   – The size of the graph may be input dependent, so we want a templating language that creates graphs on the fly.
   – TensorFlow and PyTorch are the most popular frameworks today.
Forward propagation
Three computational questions (recap)

1. Forward propagation: given inputs, compute the value of the function expressed by the graph.
2. Backpropagation: compute the gradient of the function at that input.
3. Constructing graphs: an easy-to-use framework (e.g., TensorFlow, PyTorch) that builds graphs on the fly.

This section addresses the first question.
Forward pass: An example

Graph: the top node computes $\sum_i v_i$; below it are nodes computing $\log v$, $vw$, $v^2$, and $v + w$; the leaves are the inputs $x$ and $y$.

Conventions:
1. Any expression next to a node is the function it computes.
2. The variables in the expression are the node's inputs, read from left to right.
Forward pass

What function does this compute?

Suppose we shade nodes whose values we know (i.e., we have computed them). At the start, only the inputs $x$ and $y$ are known.

We can only compute the value of a node if we know the values of all its inputs. Proceeding through the graph, the nodes become computable in this order:

• $x + y$ (the addition node)
• $y^2$ (the squaring node)
• $x(x + y)$ (the product node)
• $\log(x + y)$ (the log node)
• $x(x + y) + \log(x + y) + y^2$ (the sum node at the top)

This gives us the function: $f(x, y) = x(x + y) + \log(x + y) + y^2$.
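A minimal sketch of this forward pass in code (the variable names are made up), computing one node per line in an order where every input is available before it is used:

```python
# Forward pass for f(x, y) = x(x + y) + log(x + y) + y^2, one node at a time.
import math

def forward(x, y):
    s = x + y            # addition node
    p = x * s            # product node
    l = math.log(s)      # log node
    q = y ** 2           # squaring node
    return p + l + q     # sum node at the top

print(forward(2.0, 3.0))   # 2*5 + log(5) + 9 = 20.609...
```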
A second example

Expression: $z = \mathbf{x}^\top A \mathbf{x} + \mathbf{b}^\top \mathbf{x} + c$, with the top node $f(y_1, y_2, y_3) = \sum_i y_i$.

To compute the function, we need the values of the leaves of this DAG: $A$, $\mathbf{b}$, $c$, and $\mathbf{x}$.

Let's also highlight which nodes can be computed using what we know so far. Proceeding through the graph:

• $\mathbf{x}^\top$
• $\mathbf{b}^\top \mathbf{x}$
• $\mathbf{x}^\top A$
• $\mathbf{x}^\top A \mathbf{x}$
• $\mathbf{x}^\top A \mathbf{x} + \mathbf{b}^\top \mathbf{x} + c$ (the sum node at the top)
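A short sketch of the same computation with NumPy, following the node order above (the concrete values of $A$, $\mathbf{b}$, $c$, and $\mathbf{x}$ are made up):

```python
# Forward pass for z = x^T A x + b^T x + c in the same order as the slides.
import numpy as np

def forward(A, b, c, x):
    xT = x.T             # f(v) = v^T (a no-op for 1-D NumPy arrays)
    bTx = b @ x          # b^T x
    xTA = xT @ A         # x^T A
    xTAx = xTA @ x       # x^T A x
    return xTAx + bTx + c   # the sum node

A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
x = np.array([2.0, 3.0])
print(forward(A, b, 5.0, x))   # 25 + (-1) + 5 = 29.0
```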
Forward propagation

Given a computation graph G and the values of its input nodes:
  For each node in the graph, in topological order:
    Compute the value of that node

Why topological order? It guarantees that every node's inputs are computed before the node itself is.
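A minimal sketch of this procedure (not the lecture's code; the node representation is an assumption): each node stores its function and the names of its inputs, and the list is given in topological order.

```python
# Forward propagation over a small graph stored in topological order.
import math

# Each entry: (name, function or None for an input node, names of its inputs).
graph = [
    ("x", None, []),
    ("y", None, []),
    ("s", lambda x, y: x + y,        ["x", "y"]),
    ("p", lambda x, s: x * s,        ["x", "s"]),
    ("l", lambda s: math.log(s),     ["s"]),
    ("q", lambda y: y * y,           ["y"]),
    ("f", lambda p, l, q: p + l + q, ["p", "l", "q"]),
]

def forward(graph, inputs):
    values = dict(inputs)
    for name, fn, args in graph:       # topological order
        if fn is not None:             # skip input nodes
            values[name] = fn(*(values[a] for a in args))
    return values

print(forward(graph, {"x": 2.0, "y": 3.0})["f"])   # 20.609...
```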
Backpropagation with computation graphs
Three computational questions (recap)

1. Forward propagation: given inputs, compute the value of the function expressed by the graph.
2. Backpropagation: compute the gradient of the function at that input.
3. Constructing graphs: an easy-to-use framework (e.g., TensorFlow, PyTorch) that builds graphs on the fly.

This section addresses the second question.
Calculus refresher: The chain rule

Suppose we have two functions $f$ and $g$, and we wish to compute the gradient of $y = f(g(x))$.

We know that $\dfrac{dy}{dx} = f'(g(x)) \cdot g'(x)$.

Or equivalently: if $z = g(x)$ and $y = f(z)$, then
$$\frac{dy}{dx} = \frac{dy}{dz} \cdot \frac{dz}{dx}.$$
Or equivalently: in terms of computation graphs

The graph is $x \xrightarrow{\,g\,} z \xrightarrow{\,f\,} y$. The forward pass gives us $z$ and $y$.

Remember that each node knows not only how to compute its value given its inputs, but also how to compute gradients.

Start from the root of the graph and work backwards. One edge back from the root we get $\dfrac{dy}{dz}$.

When traversing an edge backwards to a new node, the gradient of the root with respect to that node is the product of the gradient already computed at the node we came from and the derivative along that edge:
$$\frac{dy}{dx} = \frac{dy}{dz} \cdot \frac{dz}{dx}.$$
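A tiny numeric check of this rule, with hypothetical choices $g(x) = x^2$ and $f(z) = \sin(z)$ (neither is from the slides):

```python
# Chain rule on the graph x --g--> z --f--> y, checked against a finite difference.
import math

def g(x):  return x * x
def dg(x): return 2 * x
def f(z):  return math.sin(z)
def df(z): return math.cos(z)

x = 1.3
z = g(x)                 # forward pass
y = f(z)

dy_dz = 1.0 * df(z)      # gradient one edge back from the root
dy_dx = dy_dz * dg(x)    # multiply by the derivative along the next edge

numeric = (f(g(x + 1e-6)) - f(g(x))) / 1e-6
print(abs(dy_dx - numeric) < 1e-4)   # True
```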
A concrete example

$y = \dfrac{1}{x^2}$, built from two nodes: $z = g(x)$ with $g(u) = u^2$, and $y = f(z)$ with $f(u) = \dfrac{1}{u}$.

Let's also explicitly write down the local derivatives:
$$\frac{\partial f}{\partial u} = -\frac{1}{u^2}, \qquad \frac{\partial g}{\partial u} = 2u.$$

Now we can proceed backwards from the output. At each step, we compute the gradient of the function represented by the graph with respect to the node that we are at.

At the output: $\dfrac{dy}{dy} = 1$.

One edge back (the product of the gradient so far and the derivative computed at this step):
$$\frac{dy}{dz} = \frac{dy}{dy} \cdot \left.\frac{\partial f}{\partial u}\right|_{u=z} = 1 \cdot \left(-\frac{1}{z^2}\right) = -\frac{1}{z^2}.$$

One more edge back:
$$\frac{dy}{dx} = \frac{dy}{dz} \cdot \left.\frac{\partial g}{\partial u}\right|_{u=x} = -\frac{1}{z^2} \cdot 2x = -\frac{2x}{z^2}.$$

Since $z = x^2$, we can simplify this to get $\dfrac{dy}{dx} = -\dfrac{2}{x^3}$.
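The same steps as a sketch in code (the value $x = 2$ is arbitrary):

```python
# Backward pass for y = 1/x^2, built from g(u) = u^2 and f(u) = 1/u.
x = 2.0
z = x ** 2                          # forward: z = g(x)
y = 1.0 / z                         # forward: y = f(z)

dy_dy = 1.0
dy_dz = dy_dy * (-1.0 / z ** 2)     # local derivative of f, evaluated at u = z
dy_dx = dy_dz * (2.0 * x)           # local derivative of g, evaluated at u = x

print(dy_dx, -2.0 / x ** 3)         # both are -0.25
```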
A concrete example with multiple outgoing edges

$y = \dfrac{1}{x}$, built from two nodes: $z = g(x)$ with $g(u) = u^2$, and $y = f(z, x)$ with $f(u, v) = \dfrac{v}{u}$. Note that $x$ has two outgoing edges: it feeds both $g$ and the second argument of $f$.

Let's also explicitly write down the local derivatives. Note that $f$ has two derivatives because it has two inputs:
$$\frac{\partial f}{\partial u} = -\frac{v}{u^2}, \qquad \frac{\partial f}{\partial v} = \frac{1}{u}, \qquad \frac{\partial g}{\partial u} = 2u.$$

At the output: $\dfrac{dy}{dy} = 1$.

At this point, we can compute the gradient of $y$ with respect to $z$ by following the edge from $y$ to $z$. But we cannot yet follow the edge from $y$ to $x$, because not all of the nodes that depend on $x$ are marked as done.

Following the edge from $y$ to $z$ (the product of the gradient so far and the derivative computed at this step):
$$\frac{dy}{dz} = \frac{dy}{dy} \cdot \left.\frac{\partial f}{\partial u}\right|_{u=z,\,v=x} = 1 \cdot \left(-\frac{x}{z^2}\right) = -\frac{x}{z^2}.$$

Now we can get to $x$. There are multiple backward paths into $x$. The general rule: add the gradients along all the paths.
$$\frac{dy}{dx} = \frac{dy}{dz} \cdot \left.\frac{\partial g}{\partial u}\right|_{u=x} + \frac{dy}{dy} \cdot \left.\frac{\partial f}{\partial v}\right|_{u=z,\,v=x} = -\frac{x}{z^2} \cdot 2x + 1 \cdot \frac{1}{z}.$$

Since $z = x^2$, this simplifies to $\dfrac{dy}{dx} = -\dfrac{2}{x^2} + \dfrac{1}{x^2} = -\dfrac{1}{x^2}$, as expected for $y = \dfrac{1}{x}$.
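The same computation as a sketch in code ($x = 2$ is arbitrary); note the sum of the two backward contributions into $x$:

```python
# Backward pass for y = x / x^2 = 1/x, where x has two outgoing edges.
x = 2.0
z = x ** 2                      # forward: z = g(x)
y = x / z                       # forward: y = f(z, x) = x / z

dy_dy = 1.0
dy_dz = dy_dy * (-x / z ** 2)   # df/du evaluated at u = z, v = x

# Two paths into x: through z (via g) and directly into f's second argument.
dy_dx = dy_dz * (2 * x) + dy_dy * (1.0 / z)

print(dy_dx, -1.0 / x ** 2)     # both are -0.25
```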
A neural network example

$$\mathbf{h} = \tanh(W\mathbf{x} + \mathbf{b}), \qquad \mathbf{y} = V\mathbf{h} + \mathbf{a}, \qquad L = \frac{1}{2}\,\lVert \mathbf{y} - \mathbf{y}^* \rVert^2$$

This is the same two-layer network we saw before, but this time we have added a loss term at the end. Suppose our goal is to compute the derivative of the loss with respect to the parameters $W$, $\mathbf{b}$, $V$, and $\mathbf{a}$.

As a graph, the loss adds one node on top of the network, $f(\mathbf{v}, \mathbf{u}) = \frac{1}{2}\lVert \mathbf{v} - \mathbf{u} \rVert^2$, whose inputs are $\mathbf{y}$ and a new leaf $\mathbf{y}^*$.

To simplify notation, let us name all the nodes (say $z_1, z_2, \ldots$ for the intermediate values and $L$ for the loss), and let us highlight nodes that are done.

As before, we start at the output: $\dfrac{\partial L}{\partial L} = 1$.
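Rather than deriving each local gradient by hand, a framework like PyTorch will run this backward pass for us. A sketch (the shapes, the random parameter values, and the target $\mathbf{y}^*$ are all made up for illustration):

```python
# The two-layer network with a squared-error loss, differentiated by autograd.
import torch

torch.manual_seed(0)
x = torch.randn(4)
y_star = torch.randn(3)                   # target y*
W = torch.randn(5, 4, requires_grad=True)
b = torch.randn(5, requires_grad=True)
V = torch.randn(3, 5, requires_grad=True)
a = torch.randn(3, requires_grad=True)

h = torch.tanh(W @ x + b)                 # h = tanh(Wx + b)
y = V @ h + a                             # y = Vh + a
L = 0.5 * torch.sum((y - y_star) ** 2)    # L = 1/2 ||y - y*||^2

L.backward()                              # backpropagation through the graph
print(W.grad.shape, b.grad.shape, V.grad.shape, a.grad.shape)
```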