

  1. Natural Language Processing with Deep Learning, CS224N/Ling284. Christopher Manning. Lecture 4: Backpropagation and Computation Graphs

  2. Lecture Plan
     Lecture 4: Backpropagation and computation graphs
     1. Matrix gradients for our simple neural net and some tips [15 mins]
     2. Computation graphs and backpropagation [40 mins]
     3. Stuff you should know [15 mins]
        a. Regularization to prevent overfitting
        b. Vectorization
        c. Nonlinearities
        d. Initialization
        e. Optimizers
        f. Learning rates

  3. 1. Derivative wrt a weight matrix
     Let's look carefully at computing $\frac{\partial s}{\partial W}$, using the chain rule again: $\frac{\partial s}{\partial W} = \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z}\,\frac{\partial z}{\partial W}$, where
       $s = u^\top h$
       $h = f(z)$
       $z = W x + b$
       $x = [\,x_{\text{museums}}\;\; x_{\text{in}}\;\; x_{\text{Paris}}\;\; x_{\text{are}}\;\; x_{\text{amazing}}\,]$

  4. Deriving gradients for backprop
     For this function (following on from last time): $\frac{\partial s}{\partial W} = \delta\,\frac{\partial z}{\partial W}$, where $\delta = \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z}$.
     Let's consider the derivative of a single weight $W_{ij}$. $W_{ij}$ only contributes to $z_i$; for example, $W_{23}$ is only used to compute $z_2$, not $z_1$.
     $\frac{\partial z_i}{\partial W_{ij}} = \frac{\partial}{\partial W_{ij}}\big(W_{i\cdot}\, x + b_i\big) = \frac{\partial}{\partial W_{ij}} \sum_{k=1}^{d} W_{ik}\, x_k = x_j$
     [Figure: a one-hidden-layer net with inputs $x_1, x_2, x_3, +1$, hidden units $h_1 = f(z_1)$ and $h_2 = f(z_2)$, output $s = u^\top h$, and the weight $W_{23}$ feeding only $z_2$.]

  5. Deriving gradients for backprop
     So for the derivative of a single weight $W_{ij}$: $\frac{\partial s}{\partial W_{ij}} = \delta_i\, x_j$, i.e., the error signal from above times the local input signal.
     We want the gradient for the full $W$, but each case is the same.
     Overall answer, as an outer product: $\frac{\partial s}{\partial W} = \delta^\top x^\top$, a matrix with the same shape as $W$.
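To make the outer-product form concrete, here is a small NumPy check. The sizes (n = 4 hidden units, m = 6 inputs) and the random δ and x are illustrative assumptions, not values from the lecture; the point is only that the element-wise rule and the matrix rule agree.

```python
import numpy as np

n, m = 4, 6                         # hidden size, input size (illustrative)
rng = np.random.default_rng(0)
delta = rng.normal(size=(1, n))     # error signal delta, a 1 x n row vector
x = rng.normal(size=(m, 1))         # input x, an m x 1 column vector

# Element-wise rule from the slide: ds/dW_ij = delta_i * x_j
grad_elementwise = np.empty((n, m))
for i in range(n):
    for j in range(m):
        grad_elementwise[i, j] = delta[0, i] * x[j, 0]

# Whole-matrix rule: the outer product delta^T x^T, shaped n x m like W
grad_outer = delta.T @ x.T

print(np.allclose(grad_elementwise, grad_outer))   # True
```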

  6. Deriving gradients: Tips
     Tip 1: Carefully define your variables and keep track of their dimensionality!
     Tip 2: Chain rule! If $y = f(u)$ and $u = g(x)$, i.e., $y = f(g(x))$, then $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$. Keep straight what variables feed into what computations.
     Tip 3: For the top softmax part of a model: first consider the derivative wrt $f_c$ when $c = y$ (the correct class), then the derivative wrt $f_c$ when $c \neq y$ (all the incorrect classes).
     Tip 4: Work out element-wise partial derivatives if you're getting confused by matrix calculus!
     Tip 5: Use the Shape Convention. Note: the error message $\delta$ that arrives at a hidden layer has the same dimensionality as that hidden layer.
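Working Tip 3 out explicitly, under the usual assumption (not stated on this slide) that the loss is cross entropy on the softmax output, $J = -\log p_y$ with $p_c = \exp(f_c) / \sum_k \exp(f_k)$:

```latex
\frac{\partial J}{\partial f_c} =
\begin{cases}
  p_c - 1, & c = y \ \text{(the correct class)}\\[2pt]
  p_c,     & c \neq y \ \text{(an incorrect class)}
\end{cases}
\qquad\Longleftrightarrow\qquad
\frac{\partial J}{\partial \mathbf{f}} = \mathbf{p} - \mathbf{y}_{\text{one-hot}}
```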

  7. Deriving gradients wrt words for window model
     The gradient that arrives at and updates the word vectors can simply be split up for each word vector.
     Let $\delta_{\text{window}}$ be the gradient that arrives at the window of word vectors.
     With $x_{\text{window}} = [\,x_{\text{museums}}\;\; x_{\text{in}}\;\; x_{\text{Paris}}\;\; x_{\text{are}}\;\; x_{\text{amazing}}\,]$,
     we have $\delta_{\text{window}} = \Big[\,\frac{\partial s}{\partial x_{\text{museums}}}\;\; \frac{\partial s}{\partial x_{\text{in}}}\;\; \frac{\partial s}{\partial x_{\text{Paris}}}\;\; \frac{\partial s}{\partial x_{\text{are}}}\;\; \frac{\partial s}{\partial x_{\text{amazing}}}\,\Big]$.
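As a concrete picture of that splitting, here is a minimal NumPy sketch; the word-vector dimension d = 4 and the random gradient are assumptions made purely for illustration.

```python
import numpy as np

d = 4                                    # word-vector dimension (assumed)
words = ["museums", "in", "Paris", "are", "amazing"]
grad_x_window = np.random.randn(5 * d)   # gradient wrt the concatenated window

# Each word vector simply receives its own d-dimensional slice of the gradient.
per_word_grads = dict(zip(words, np.split(grad_x_window, 5)))
print(per_word_grads["Paris"].shape)     # (4,)
```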

  8. Updating word gradients in window model
     This will push word vectors around so that they will (in principle) be more helpful in determining named entities.
     For example, the model can learn that seeing $x_{\text{in}}$ as the word just before the center word is indicative of the center word being a location.

  9. A pitfall when retraining word vectors
     Setting: we are training a logistic regression classification model for movie review sentiment using single words.
     In the training data we have "TV" and "telly"; in the testing data we have "television".
     The pre-trained word vectors place all three words close together. [Figure: TV, telly, and television near each other in vector space.]
     Question: what happens when we update the word vectors?

  10. A pitfall when retraining word vectors
      Question: what happens when we update the word vectors?
      Answer: the words that are in the training data ("TV" and "telly") move around, while words not in the training data ("television") stay where they were.
      This can be bad! [Figure: "telly" and "TV" have moved away together, leaving "television" behind.]

  11. So what should I do?
      Question: should I use available "pre-trained" word vectors?
      Answer: almost always, yes! They are trained on a huge amount of data, so they will know about words not in your training data and will know more about the words that are in your training data. If you have 100s of millions of words of data, it is okay to start from random vectors.
      Question: should I update ("fine-tune") my own word vectors?
      Answer: if you only have a small training data set, don't train the word vectors. If you have a large dataset, it probably will work better to train = update = fine-tune the word vectors to the task.
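A hedged PyTorch sketch of the two options above; the vocabulary size, dimension, and the random stand-in for the pre-trained matrix are all assumptions for illustration.

```python
import torch
import torch.nn as nn

vocab_size, d = 10_000, 300
pretrained = torch.randn(vocab_size, d)   # stand-in for e.g. GloVe vectors

# Small training set: keep the pre-trained word vectors frozen.
emb_frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Large training set: fine-tune the word vectors along with the rest of the model.
emb_finetuned = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(emb_frozen.weight.requires_grad, emb_finetuned.weight.requires_grad)  # False True
```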

  12. Backpropagation
      We've almost shown you backpropagation: it's taking derivatives and using the (generalized) chain rule.
      The other trick: we re-use derivatives computed for higher layers when computing derivatives for lower layers, so as to minimize computation.

  13. 2. Computation Graphs and Backpropagation
      We represent our neural net equations as a graph.
      Source nodes: inputs. Interior nodes: operations.
      [Figure: the computation graph, with · and + operation nodes.]

  14. Computation Graphs and Backpropagation
      We represent our neural net equations as a graph.
      Source nodes: inputs. Interior nodes: operations. Edges pass along the result of the operation.
      [Figure: the computation graph, with · and + operation nodes.]

  15. Computation Graphs and Backpropagation
      Representing our neural net equations as a graph gives us "Forward Propagation":
      source nodes are inputs, interior nodes are operations, and edges pass along the result of each operation.
      [Figure: the forward pass through the computation graph.]

  16. Backpropagation
      Go backwards along the edges, passing along gradients.
      [Figure: gradients flowing backwards through the computation graph.]

  17. Backpropagation: Single Node
      A node receives an "upstream gradient"; the goal is to pass on the correct "downstream gradient".
      [Figure: a single node, with the downstream gradient on its input edge and the upstream gradient on its output edge.]

  18. Backpropagation: Single Node
      Each node has a local gradient: the gradient of its output with respect to its input.
      [Figure: a single node, labelled with downstream, local, and upstream gradients.]

  19. Backpropagation: Single Node
      Each node has a local gradient: the gradient of its output with respect to its input. Chain rule!
      [Figure: a single node, labelled with downstream, local, and upstream gradients.]

  20. Backpropagation: Single Node
      Each node has a local gradient: the gradient of its output with respect to its input.
      [downstream gradient] = [upstream gradient] x [local gradient]
      [Figure: a single node, labelled with downstream, local, and upstream gradients.]

  21. Backpropagation: Single Node
      What about nodes with multiple inputs?
      [Figure: a * node with two inputs.]

  22. Backpropagation: Single Node
      Multiple inputs mean multiple local gradients, and so one downstream gradient per input.
      [Figure: a * node with two inputs, a single upstream gradient, and a local and downstream gradient for each input.]
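A tiny sketch of that rule for one multiply node with two inputs (the input values 3 and 2 are arbitrary): each input gets its own downstream gradient, the upstream gradient times that input's local gradient.

```python
a, b = 3.0, 2.0
out = a * b                 # forward: 6.0

upstream = 1.0              # gradient arriving from the node above
local_wrt_a = b             # d(a*b)/da = b
local_wrt_b = a             # d(a*b)/db = a

downstream_a = upstream * local_wrt_a   # 2.0 flows back toward a
downstream_b = upstream * local_wrt_b   # 3.0 flows back toward b
print(out, downstream_a, downstream_b)
```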

  23. An Example

  24. An Example
      Forward prop steps. [Figure: a computation graph with a + node, a max node, and a * node.]

  25. An Example
      Forward prop steps: with inputs 1 and 2, the + node outputs 3; with inputs 2 and 0, the max node outputs 2; the * node multiplies 3 and 2 to give the output 6.

  26. An Example
      Forward prop steps; local gradients. [Figure: the same graph, with local gradients being filled in.]

  27. An Example
      Forward prop steps; local gradients. [Figure: further local gradients filled in.]

  28. An Example
      Forward prop steps; local gradients. [Figure: further local gradients filled in.]

  29. An Example
      Forward prop steps; local gradients. [Figure: all local gradients filled in.]

  30. An Example
      Backward pass at the * node, with upstream gradient 1: upstream * local = downstream, giving 1*2 = 2 along the + branch and 1*3 = 3 along the max branch.

  31. An Example
      Backward pass at the max node: upstream * local = downstream, giving 3*1 = 3 to the input that was the max (value 2) and 3*0 = 0 to the other input (value 0).

  32. An Example
      Backward pass at the + node: upstream * local = downstream, giving 2*1 = 2 to each of its two inputs.

  33. An Example
      [Figure: the completed example, with all forward values and downstream gradients filled in.] A code version of this example follows below.
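Putting the whole example into code. The slide images are not in this text, so the graph f(x, y, z) = (x + y) * max(y, z) with x = 1, y = 2, z = 0 is an assumption, chosen to match the numbers shown above (forward values 3, 2, 6 and backward values 2, 3, 0, 2).

```python
x, y, z = 1.0, 2.0, 0.0

# Forward pass
a = x + y                 # 3
b = max(y, z)             # 2
f = a * b                 # 6

# Backward pass: downstream = upstream * local at every node
df = 1.0
da = df * b               # * node: 1 * 2 = 2
db = df * a               # * node: 1 * 3 = 3
dx = da * 1.0             # + node distributes: 2
dy_from_add = da * 1.0    # + node distributes: 2
dy_from_max = db * (1.0 if y > z else 0.0)   # max routes: 3
dz = db * (1.0 if z > y else 0.0)            # max routes: 0
dy = dy_from_add + dy_from_max               # gradients sum at branches: 5

print(f, dx, dy, dz)      # 6.0 2.0 5.0 0.0
```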

  34. Gradients sum at outward branches
      [Figure: a variable whose output edge branches out to two different downstream nodes.]

  35. Gradients sum at outward branches
      When a variable feeds into more than one downstream node, the gradients arriving back along each branch are added together.
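A quick PyTorch autograd check of the same point, again assuming the example graph f = (x + y) * max(y, z): y feeds both the + node and the max node, so its gradient is the sum of the two branch contributions, 2 + 3 = 5.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(0.0, requires_grad=True)

f = (x + y) * torch.maximum(y, z)
f.backward()

print(x.grad, y.grad, z.grad)   # tensor(2.) tensor(5.) tensor(0.)
```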

  36. Node Intuitions
      + "distributes" the upstream gradient.
      [Figure: in the example, the + node passes its upstream gradient of 2 on to each of its inputs.]

  37. Node Intuitions
      + "distributes" the upstream gradient to each summand.
      max "routes" the upstream gradient: it all goes to the input that achieved the max.
      [Figure: in the example, the max node sends 3*1 = 3 to one input and 3*0 = 0 to the other.]

  38. Node Intuitions
      + "distributes" the upstream gradient.
      max "routes" the upstream gradient.
      * "switches" the upstream gradient: each input's downstream gradient is the upstream gradient scaled by the other input's value.
      [Figure: the completed example graph.]

  39. Efficiency: compute all gradients at once
      An incorrect way of doing backprop: first compute the gradient with respect to one of the parameters on its own.
      [Figure: the computation graph.]

  40. Efficiency: compute all gradients at once
      An incorrect way of doing backprop: first compute the gradient with respect to one parameter, then independently compute the gradient with respect to another. Duplicated computation!
      [Figure: the computation graph.]

  41. Efficiency: compute all gradients at once
      The correct way: compute all the gradients at once, analogous to using $\delta$ when we computed gradients by hand.
      [Figure: the computation graph.]

  42. Back-Prop in General Computation Graph
      Assume a single scalar output $z$.
      1. Fprop: visit nodes in topological sort order; compute the value of each node given its predecessors.
      2. Bprop: initialize the output gradient to 1, then visit nodes in reverse order, computing the gradient wrt each node using the gradients wrt its successors:
         $\frac{\partial z}{\partial x} \;=\; \sum_{y \,\in\, \text{successors of } x} \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}$
      Done correctly, the big O() complexity of fprop and bprop is the same.
      In general our nets have a regular layer structure, so we can use matrices and Jacobians...
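A minimal sketch of that algorithm (not the course's code): a toy Node class supporting only add and multiply, a forward sweep in topological order, then a reverse sweep that accumulates each node's gradient from its successors.

```python
class Node:
    def __init__(self, op, inputs=()):
        self.op, self.inputs = op, inputs   # op: name of the operation
        self.value, self.grad = None, 0.0

    def forward(self):
        if self.op == "add":
            self.value = sum(n.value for n in self.inputs)
        elif self.op == "mul":
            self.value = self.inputs[0].value * self.inputs[1].value
        # "input" nodes keep the value they were given

    def backward(self):
        # push this node's gradient back to its predecessors (chain rule)
        if self.op == "add":
            for n in self.inputs:
                n.grad += self.grad * 1.0
        elif self.op == "mul":
            a, b = self.inputs
            a.grad += self.grad * b.value
            b.grad += self.grad * a.value

def run(graph_in_topological_order):
    for node in graph_in_topological_order:
        node.forward()
    graph_in_topological_order[-1].grad = 1.0   # d(output)/d(output) = 1
    for node in reversed(graph_in_topological_order):
        node.backward()

# x * y + x, with x = 2, y = 3  ->  value 8, dx = y + 1 = 4, dy = x = 2
x, y = Node("input"), Node("input")
x.value, y.value = 2.0, 3.0
m = Node("mul", (x, y))
s = Node("add", (m, x))
run([x, y, m, s])
print(s.value, x.grad, y.grad)   # 8.0 4.0 2.0
```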

  43. Automatic Differentiation
      The gradient computation can be automatically inferred from the symbolic expression of the fprop.
      Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.
      Modern DL frameworks (TensorFlow, PyTorch, etc.) do backpropagation for you, but mainly leave it to the layer/node writer to hand-calculate the local derivative.
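To illustrate that division of labour, a hedged PyTorch example: torch.autograd.Function lets the framework handle the graph bookkeeping while the node author writes the local derivative by hand (sigmoid is chosen here arbitrarily).

```python
import torch

class MySigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out = torch.sigmoid(x)
        ctx.save_for_backward(out)              # cache what backward will need
        return out

    @staticmethod
    def backward(ctx, grad_output):
        (out,) = ctx.saved_tensors
        return grad_output * out * (1 - out)    # hand-written local gradient

x = torch.randn(3, requires_grad=True)
MySigmoid.apply(x).sum().backward()
print(x.grad)                                   # matches torch.sigmoid's own gradient
```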

  44. Backprop Implementations

  45. Implementation: forward/backward API

  46. Implementation: forward/backward API
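The implementation shown on these slides is not part of the extracted text; the following is only a sketch of the kind of forward/backward API the title describes, using a multiply gate as the node.

```python
class MultiplyGate:
    def forward(self, x, y):
        # compute the output and cache the inputs for use in the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, upstream_grad):
        # downstream gradient = upstream gradient * local gradient, per input
        dx = upstream_grad * self.y
        dy = upstream_grad * self.x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, 2.0)     # 6.0
dx, dy = gate.backward(1.0)      # (2.0, 3.0)
print(out, dx, dy)
```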
