Multi-Layer Networks & Back-Propagation. M. Soleymani, Deep Learning, Sharif University of Technology, Spring 2019. Most slides have been adapted from Bhiksha Raj, 11-785, CMU 2019, and some from Fei-Fei Li et al., cs231n, Stanford 2017.


1. Choosing cost function: Examples • Regression problem – SSE: $E = \sum_{t=1}^{T} E^{(t)}$, with $E^{(t)} = \frac{1}{2}\big(y^{(t)} - d^{(t)}\big)^2$ for a one-dimensional output and $E^{(t)} = \frac{1}{2}\big\|\boldsymbol{y}^{(t)} - \boldsymbol{d}^{(t)}\big\|^2 = \frac{1}{2}\sum_{j=1}^{N}\big(y_j^{(t)} - d_j^{(t)}\big)^2$ for a multi-dimensional output • Classification problem – Cross-entropy: for binary classification, $\text{loss}^{(t)} = -d^{(t)}\log y^{(t)} - \big(1 - d^{(t)}\big)\log\big(1 - y^{(t)}\big)$, where the output layer uses the sigmoid activation function

2. Multi-class output: One-hot representations • Consider a network that must distinguish if an input is a cat, a dog, a camel, a hat, or a flower • For inputs of each of the five classes the desired output is: cat: [1 0 0 0 0]^T, dog: [0 1 0 0 0]^T, camel: [0 0 1 0 0]^T, hat: [0 0 0 1 0]^T, flower: [0 0 0 0 1]^T • For an input of any class, the desired output is a five-dimensional vector with four zeros and a single 1 at the position of that class • This is a one-hot vector
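As a quick illustration of the one-hot encoding described above, here is a minimal NumPy sketch; the class list and helper name are chosen for this example, not taken from the slides.

```python
import numpy as np

CLASSES = ["cat", "dog", "camel", "hat", "flower"]  # illustrative class order

def one_hot(label, classes=CLASSES):
    """Return a one-hot vector with a single 1 at the label's position."""
    v = np.zeros(len(classes))
    v[classes.index(label)] = 1.0
    return v

print(one_hot("camel"))  # [0. 0. 1. 0. 0.]
```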

3. Multi-class networks (input layer, hidden layers, output layer) • For a multi-class classifier with N classes, the one-hot representation will have N binary outputs – An N-dimensional binary vector • The neural network's output too must ideally be binary (N−1 zeros and a single 1 in the right place) • More realistically, it will be a probability vector – N probability values that sum to 1

4. Vector activation example: Softmax • Example: Softmax vector activation – the parameters are the weights and biases that feed each output $y_i$

5. Vector Activations • We can also have neurons with multiple coupled outputs: $\big(y_1, y_2, \ldots, y_N\big) = f\big(x_1, x_2, \ldots, x_M; \boldsymbol{W}\big)$ – The function $f(\cdot)$ operates on the set of inputs to produce the set of outputs – Modifying a single parameter in $\boldsymbol{W}$ will affect all of the outputs

6. Multi-class classification: Output • Softmax vector activation is often used at the output of multi-class classifier nets: $z_i = \sum_j w_{ji}\, y_j^{[k-1]}, \qquad y_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$ • This can be viewed as the probability $y_i = P(\text{class} = i \mid \boldsymbol{x})$
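A minimal NumPy sketch of the softmax formula above; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something stated on the slide.

```python
import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), computed stably by shifting by max(z)."""
    z = z - np.max(z)          # does not change the result, avoids overflow
    e = np.exp(z)
    return e / e.sum()

y = softmax(np.array([2.0, 1.0, 0.1]))
print(y, y.sum())              # probabilities that sum to 1
```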

7. For multi-class classification • Desired output $\boldsymbol{d}$ is a one-hot vector $[0\ 0\ \ldots\ 1\ \ldots\ 0\ 0]$ with the 1 in the $c$-th position (for class $c$) • Actual output will be a probability distribution $[y_1, y_2, \ldots, y_N]$ • The cross-entropy between the desired one-hot output and the actual output: $L(\boldsymbol{d}, \boldsymbol{y}) = -\sum_i d_i \log y_i = -\log y_c$ • Derivative: $\frac{\partial L(\boldsymbol{d}, \boldsymbol{y})}{\partial y_i} = -\frac{1}{y_c}$ for the $c$-th component and $0$ for the remaining components, i.e. $\nabla_{\boldsymbol{y}} L(\boldsymbol{d}, \boldsymbol{y}) = \big[0\ 0\ \ldots\ {-\tfrac{1}{y_c}}\ \ldots\ 0\ 0\big]$ • The slope is negative w.r.t. $y_c$, indicating that increasing $y_c$ will reduce the divergence

8. For multi-class classification (continued) • Same setup as the previous slide • Note: when $\boldsymbol{d} = \boldsymbol{y}$ the derivative is not 0, since $-\frac{1}{y_c} = -1 \neq 0$ at $y_c = 1$
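The derivative above can be checked numerically. A small sketch, with the probability vector y and one-hot target d invented here for illustration:

```python
import numpy as np

def cross_entropy(d, y):
    """L(d, y) = -sum_i d_i log y_i = -log y_c for one-hot d."""
    return -np.sum(d * np.log(y))

y = np.array([0.1, 0.2, 0.6, 0.1])   # example probability vector
d = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot target, class c = 2

# Analytic gradient: zeros everywhere except -1/y_c at position c
grad = -d / y
print(grad)                           # [ 0.  0. -1.6667  0. ]

# Finite-difference check on the c-th component
eps = 1e-6
y_plus = y.copy(); y_plus[2] += eps
print((cross_entropy(d, y_plus) - cross_entropy(d, y)) / eps)  # ~ -1.6667
```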

9. Choosing cost function: Examples • Regression problem – SSE: $E = \sum_{t=1}^{T} E^{(t)}$, with $E^{(t)} = \frac{1}{2}\big(y^{(t)} - d^{(t)}\big)^2$ (one-dimensional output) or $E^{(t)} = \frac{1}{2}\big\|\boldsymbol{y}^{(t)} - \boldsymbol{d}^{(t)}\big\|^2$ (multi-dimensional output) • Classification problem – Cross-entropy: for binary classification, $\text{loss}^{(t)} = -d^{(t)}\log y^{(t)} - \big(1 - d^{(t)}\big)\log\big(1 - y^{(t)}\big)$ with a sigmoid output $y = \frac{1}{1 + e^{-z}}$; for multi-class classification, $\text{loss}^{(t)} = -\log y_c^{(t)}$, where the output is produced by a softmax layer $y_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$
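The three cost functions listed above, written as a short NumPy sketch; the function names are mine, chosen for illustration.

```python
import numpy as np

def sse(y, d):
    """Regression: 1/2 * sum of squared errors between output y and target d."""
    return 0.5 * np.sum((y - d) ** 2)

def binary_cross_entropy(y, d):
    """Binary classification: sigmoid output y in (0, 1), target d in {0, 1}."""
    return -d * np.log(y) - (1 - d) * np.log(1 - y)

def multiclass_cross_entropy(y, c):
    """Multi-class classification: softmax output vector y, true class index c."""
    return -np.log(y[c])

print(sse(np.array([0.2, 0.7]), np.array([0.0, 1.0])))
print(binary_cross_entropy(0.8, 1.0))
print(multiclass_cross_entropy(np.array([0.1, 0.2, 0.7]), 2))
```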

10. Problem setup • Given: the architecture of the network • Training data: a set of input-output pairs $\big(\boldsymbol{x}^{(1)}, \boldsymbol{d}^{(1)}\big), \big(\boldsymbol{x}^{(2)}, \boldsymbol{d}^{(2)}\big), \ldots, \big(\boldsymbol{x}^{(T)}, \boldsymbol{d}^{(T)}\big)$ • We need a loss function that shows how to penalize the obtained output $\boldsymbol{y} = f(\boldsymbol{x}; \boldsymbol{W})$ when the desired output is $\boldsymbol{d}$: $E(\boldsymbol{W}) = \sum_{t=1}^{T} \text{loss}\big(\boldsymbol{y}^{(t)}, \boldsymbol{d}^{(t)}\big) = \sum_{t=1}^{T} \text{loss}\big(f(\boldsymbol{x}^{(t)}; \boldsymbol{W}), \boldsymbol{d}^{(t)}\big)$ • Minimize $E$ w.r.t. $\boldsymbol{W}$, which contains the weights $w_{i,j}^{[k]}$ and biases $b_j^{[k]}$

11. How to adjust weights for multi-layer networks? • We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets? – We need an efficient way of adapting all the weights, not just the last layer – Learning the weights going into hidden units is equivalent to learning features – This is difficult because nobody is telling us directly what the hidden units should do

12. Find the weights by optimizing the cost • Start from random weights and then adjust them iteratively to get a lower cost • Update the weights according to the gradient of the cost function. Source: http://3b1b.co

13. How does the network learn? • Which changes to the weights improve the cost the most? Look at the gradient of the cost, $\nabla_{\boldsymbol{W}} E$ • The magnitude of each element of the gradient shows how sensitive the cost is to that weight or bias. Source: http://3b1b.co

14. Training multi-layer networks • Back-propagation – The training algorithm used to adjust the weights in multi-layer networks (based on the training data) – The back-propagation algorithm is based on gradient descent – It uses the chain rule and dynamic programming to compute gradients efficiently

15. Training Neural Nets through Gradient Descent • Total training error: $E = \sum_{t=1}^{T} \text{loss}\big(\boldsymbol{y}^{(t)}, \boldsymbol{d}^{(t)}\big)$ • Gradient descent algorithm: – Initialize all weights and biases $w_{i,j}^{[k]}$ (using the extended notation, each bias is also represented as a weight) – Do: for every layer $k$, for all $i, j$, update $w_{i,j}^{[k]} = w_{i,j}^{[k]} - \eta\, \frac{\partial E}{\partial w_{i,j}^{[k]}}$ – Until $E$ has converged

16. The derivative • Total training error: $E = \sum_{t=1}^{T} \text{loss}\big(\boldsymbol{y}^{(t)}, \boldsymbol{d}^{(t)}\big)$ • Computing the derivative – total derivative: $\frac{dE}{dw_{i,j}^{[k]}} = \sum_{t=1}^{T} \frac{d\,\text{loss}\big(\boldsymbol{y}^{(t)}, \boldsymbol{d}^{(t)}\big)}{dw_{i,j}^{[k]}}$

17. Training by gradient descent • Initialize all weights $w_{i,j}^{[k]}$ • Do: – For all $i, j, k$, initialize $\frac{dE}{dw_{i,j}^{[k]}} = 0$ – For all $t = 1{:}T$: for every layer $k$, for all $i, j$, compute $\frac{d\,\text{loss}\big(\boldsymbol{y}^{(t)}, \boldsymbol{d}^{(t)}\big)}{dw_{i,j}^{[k]}}$ and accumulate $\frac{dE}{dw_{i,j}^{[k]}} \mathrel{+}= \frac{d\,\text{loss}\big(\boldsymbol{y}^{(t)}, \boldsymbol{d}^{(t)}\big)}{dw_{i,j}^{[k]}}$ – For every layer $k$, for all $i, j$: $w_{i,j}^{[k]} = w_{i,j}^{[k]} - \frac{\eta}{T}\, \frac{dE}{dw_{i,j}^{[k]}}$
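A sketch of this accumulate-then-update loop, assuming a helper compute_sample_gradients — a hypothetical placeholder for the per-sample gradient computation that back-propagation provides later in the lecture.

```python
import numpy as np

def gradient_descent(weights, data, compute_sample_gradients, eta=0.1, n_iters=100):
    """weights: list of weight matrices; data: list of (x, d) pairs.
    compute_sample_gradients(weights, x, d) must return d loss / d W for one sample."""
    T = len(data)
    for _ in range(n_iters):
        # accumulate dE/dW over all training samples
        grads = [np.zeros_like(W) for W in weights]
        for x, d in data:
            sample_grads = compute_sample_gradients(weights, x, d)
            for g, sg in zip(grads, sample_grads):
                g += sg
        # gradient step: W <- W - (eta / T) * dE/dW
        for W, g in zip(weights, grads):
            W -= (eta / T) * g
    return weights
```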

18. The derivative • Total training error: $E = \sum_{t=1}^{T} \text{loss}\big(\boldsymbol{y}^{(t)}, \boldsymbol{d}^{(t)}\big)$ • Total derivative: $\frac{dE}{dw_{i,j}^{[k]}} = \sum_{t=1}^{T} \frac{d\,\text{loss}\big(\boldsymbol{y}^{(t)}, \boldsymbol{d}^{(t)}\big)}{dw_{i,j}^{[k]}}$ • So we must first figure out how to compute the derivative of the divergence of each individual training input

19. Calculus Refresher: Basic rules of calculus • For any differentiable function $y = f(x)$ with derivative $\frac{dy}{dx}$, the following must hold for sufficiently small $\Delta x$: $\Delta y \approx \frac{dy}{dx}\,\Delta x$ • For any differentiable function $y = f(x_1, x_2, \ldots, x_M)$ with partial derivatives $\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_M}$, the following must hold for sufficiently small $\Delta x_1, \ldots, \Delta x_M$: $\Delta y \approx \frac{\partial y}{\partial x_1}\Delta x_1 + \frac{\partial y}{\partial x_2}\Delta x_2 + \cdots + \frac{\partial y}{\partial x_M}\Delta x_M$
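This first-order relation is also the basis of numerical gradient checking. A minimal sketch; the test function f is arbitrary and chosen only for illustration.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Approximate df/dx_i via (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

f = lambda x: np.sum(x ** 2)          # example function, true gradient is 2x
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))        # ~ [ 2. -4.  6.]
```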

20. Calculus Refresher: Chain rule • For any nested function $y = f(g(x))$: $\frac{dy}{dx} = \frac{df}{dg}\cdot\frac{dg}{dx}$ • Check: we can confirm that, for small $\Delta x$, $\Delta y \approx \frac{df}{dg}\,\Delta g \approx \frac{df}{dg}\,\frac{dg}{dx}\,\Delta x$

21. Calculus Refresher: Distributed Chain rule • For a function of several intermediate variables that each depend on $x$, $y = f\big(g_1(x), g_2(x), \ldots, g_M(x)\big)$: $\frac{dy}{dx} = \sum_{i=1}^{M}\frac{\partial f}{\partial g_i}\,\frac{dg_i}{dx}$ • Check: this follows by applying the first-order approximation to each argument

22. Distributed Chain Rule: Influence Diagram • $x$ affects $y$ through each of $g_1, \ldots, g_M$

23. Distributed Chain Rule: Influence Diagram • Small perturbations in $x$ cause small perturbations in each of $g_1, \ldots, g_M$, each of which individually and additively perturbs $y$

24. Simple chain rule • $z = f(g(x))$, with $y = g(x)$ • Then $\frac{dz}{dx} = \frac{dz}{dy}\cdot\frac{dy}{dx}$

25. Multiple paths chain rule

26.–31. Backpropagation: a simple example (worked step by step on the slides; the figures are not reproduced in this transcript)

32.–33. How to propagate the gradients backward • $z = f(x, y)$ • The gradient w.r.t. each input is obtained by the chain rule, i.e. local gradient times upstream gradient: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial x}$, $\ \frac{\partial L}{\partial y} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial y}$

34. Returning to our problem • How do we compute $\frac{\partial\,\text{loss}(\boldsymbol{y}, \boldsymbol{d})}{\partial w_{i,j}^{[k]}}$?

35. A first closer look at the network • Showing a tiny 2-input network for illustration – An actual network would have many more neurons and inputs • Explicitly separating the weighted sum of inputs from the activation $f(\cdot)$

36. A first closer look at the network • Showing a tiny 2-input network for illustration – An actual network would have many more neurons and inputs • Expanded with all weights and activations shown • The overall function is differentiable w.r.t. every weight, bias and input

37. Backpropagation: Notation • $\boldsymbol{y}^{[0]} \leftarrow$ input; output $\leftarrow \boldsymbol{y}^{[M]}$ • At each layer $k$: pre-activation $\boldsymbol{z}^{[k]}$ (a weighted sum of the previous layer's outputs $\boldsymbol{y}^{[k-1]}$), activation $\boldsymbol{y}^{[k]} = f\big(\boldsymbol{z}^{[k]}\big)$

38. Backpropagation: Last layer gradient • For squared error loss: $\text{loss} = \frac{1}{2}\sum_j \big(d_j - y_j^{[M]}\big)^2$, so $\frac{\partial\,\text{loss}}{\partial y_j^{[M]}} = y_j^{[M]} - d_j$ • With $y_j^{[M]} = f\big(z_j^{[M]}\big)$ and $z_j^{[M]} = \sum_{i=0}^{N^{[M-1]}} w_{ij}^{[M]}\, y_i^{[M-1]}$: $\frac{\partial\,\text{loss}}{\partial w_{ij}^{[M]}} = \frac{\partial\,\text{loss}}{\partial y_j^{[M]}}\cdot\frac{\partial y_j^{[M]}}{\partial w_{ij}^{[M]}}$, where $\frac{\partial y_j^{[M]}}{\partial w_{ij}^{[M]}} = f'\big(z_j^{[M]}\big)\,\frac{\partial z_j^{[M]}}{\partial w_{ij}^{[M]}} = f'\big(z_j^{[M]}\big)\, y_i^{[M-1]}$ • Therefore $\frac{\partial\,\text{loss}}{\partial w_{ij}^{[M]}} = \frac{\partial\,\text{loss}}{\partial y_j^{[M]}}\; f'\big(z_j^{[M]}\big)\; y_i^{[M-1]}$

39. Activations and their derivatives • Some popular activation functions and their derivatives (shown graphically on the slide)

40. Previous layers gradients • At layer $k$: $y_j^{[k]} = f\big(z_j^{[k]}\big)$, $z_j^{[k]} = \sum_{i=0}^{N^{[k-1]}} w_{ij}^{[k]}\, y_i^{[k-1]}$ • $\frac{\partial\,\text{loss}}{\partial w_{ij}^{[k]}} = \frac{\partial\,\text{loss}}{\partial y_j^{[k]}}\cdot\frac{\partial y_j^{[k]}}{\partial w_{ij}^{[k]}}$, with $\frac{\partial y_j^{[k]}}{\partial w_{ij}^{[k]}} = \frac{\partial y_j^{[k]}}{\partial z_j^{[k]}}\times\frac{\partial z_j^{[k]}}{\partial w_{ij}^{[k]}} = f'\big(z_j^{[k]}\big)\, y_i^{[k-1]}$ • Propagating to the previous layer: $\frac{\partial\,\text{loss}}{\partial y_i^{[k-1]}} = \sum_{j=1}^{N^{[k]}} \frac{\partial\,\text{loss}}{\partial y_j^{[k]}}\times\frac{\partial y_j^{[k]}}{\partial z_j^{[k]}}\times\frac{\partial z_j^{[k]}}{\partial y_i^{[k-1]}} = \sum_{j=1}^{N^{[k]}} \frac{\partial\,\text{loss}}{\partial y_j^{[k]}}\times w_{ij}^{[k]}\times f'\big(z_j^{[k]}\big)$

41. Backpropagation • $\frac{\partial\,\text{loss}}{\partial w_{ij}^{[k]}} = \frac{\partial\,\text{loss}}{\partial y_j^{[k]}}\times\frac{\partial y_j^{[k]}}{\partial w_{ij}^{[k]}} = \delta_j^{[k]}\times f'\big(z_j^{[k]}\big)\times y_i^{[k-1]}$, where $\delta_j^{[k]} = \frac{\partial\,\text{loss}}{\partial y_j^{[k]}}$ is the sensitivity of the loss to $y_j^{[k]}$ • Sensitivity vectors can be obtained by running a backward process through the network architecture (hence the name back-propagation). We compute $\boldsymbol{\delta}^{[k-1]}$ from $\boldsymbol{\delta}^{[k]}$: $\delta_i^{[k-1]} = \sum_{j=1}^{N^{[k]}} \delta_j^{[k]}\times w_{ij}^{[k]}\times f'\big(z_j^{[k]}\big)$

42. Find and save $\boldsymbol{\delta}^{[M]}$ • Called the error; computed recursively in a backward manner • For the final layer $k = M$: $\delta_j^{[M]} = \frac{\partial\,\text{loss}}{\partial y_j^{[M]}}$

43. Compute $\boldsymbol{\delta}^{[k-1]}$ from $\boldsymbol{\delta}^{[k]}$ • $\delta_j^{[k]} = \frac{\partial\,\text{loss}}{\partial y_j^{[k]}}$ is the sensitivity of the loss to $y_j^{[k]}$ • With $y_j^{[k]} = f\big(z_j^{[k]}\big)$ and $z_j^{[k]} = \sum_i w_{ij}^{[k]}\, y_i^{[k-1]}$: $\delta_i^{[k-1]} = \sum_{j=1}^{N^{[k]}} \delta_j^{[k]}\times w_{ij}^{[k]}\times f'\big(z_j^{[k]}\big)$ • Sensitivity vectors are obtained by running a backward pass through the network architecture (hence the name back-propagation)

44. Backpropagation Algorithm • Initialize all weights to small random numbers • While not satisfied, for each training example do: 1. Feed the training example forward through the network and compute the outputs of all units ($z$ and $y$) and the loss 2. For each unit, find its $\delta$ in the backward step 3. Update each network weight $w_{ij}^{[k]}$ as $w_{ij}^{[k]} \leftarrow w_{ij}^{[k]} - \eta\,\frac{\partial\,\text{loss}}{\partial w_{ij}^{[k]}}$, where $\frac{\partial\,\text{loss}}{\partial w_{ij}^{[k]}} = \delta_j^{[k]}\times f'\big(z_j^{[k]}\big)\times y_i^{[k-1]}$
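To make the three steps concrete, here is a minimal NumPy sketch of per-sample back-propagation for a network with one hidden layer, sigmoid activations, squared-error loss, and no biases. The layer sizes, data, and learning rate are invented for illustration; this is a sketch of the algorithm above, not the slides' own code.

```python
import numpy as np

def f(z):               # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):         # its derivative: sigma(z) * (1 - sigma(z))
    s = f(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 4))   # layer 1 weights (3 inputs -> 4 hidden)
W2 = rng.normal(scale=0.1, size=(4, 2))   # layer 2 weights (4 hidden -> 2 outputs)
eta = 0.5

x = np.array([0.5, -1.0, 2.0])            # one training input
d = np.array([1.0, 0.0])                  # its desired output

for step in range(1000):
    # 1. Forward step: compute z and y for every layer, then the loss
    z1 = x @ W1;  y1 = f(z1)
    z2 = y1 @ W2; y2 = f(z2)
    loss = 0.5 * np.sum((y2 - d) ** 2)

    # 2. Backward step: delta_j^[k] = d loss / d y_j^[k]
    delta2 = y2 - d                            # last layer
    delta1 = (delta2 * f_prime(z2)) @ W2.T     # delta^[k-1] from delta^[k]

    # 3. Weight updates: d loss / d w_ij = delta_j * f'(z_j) * y_i (previous layer)
    W2 -= eta * np.outer(y1, delta2 * f_prime(z2))
    W1 -= eta * np.outer(x,  delta1 * f_prime(z1))

print(loss)   # the loss should have decreased substantially
```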

45.–57. Another example (worked step by step on the slides; the figures are not reproduced in this transcript)

58. Another example • Each input's gradient is [local gradient] × [upstream gradient]: x0: $2 \times 0.2 = 0.4$; w0: $(-1) \times 0.2 = -0.2$

59.–60. Derivative of sigmoid function • $\sigma(x) = \frac{1}{1 + e^{-x}}$, and $\frac{d\sigma(x)}{dx} = \sigma(x)\,\big(1 - \sigma(x)\big)$ (derivation worked on the slides)
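A small sketch verifying the sigmoid derivative numerically; the sample point is arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """d sigma / dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x, eps = 0.7, 1e-6
print(sigmoid_grad(x))
print((sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps))   # should match
```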

61. Patterns in backward flow • add gate: gradient distributor (passes the upstream gradient unchanged to all of its inputs) • max gate: gradient router (sends the full upstream gradient to the input that achieved the max, and zero to the others)
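A tiny sketch of these two patterns, assuming an upstream gradient dz flowing into each gate; the function and variable names are illustrative.

```python
# add gate: z = x + y  ->  distributes the upstream gradient to both inputs unchanged
def add_backward(dz):
    dx, dy = dz, dz
    return dx, dy

# max gate: z = max(x, y)  ->  routes the upstream gradient to the larger input only
def max_backward(x, y, dz):
    dx = dz if x >= y else 0.0
    dy = dz if y > x else 0.0
    return dx, dy

print(add_backward(0.2))             # (0.2, 0.2)
print(max_backward(4.0, 2.0, 0.2))   # (0.2, 0.0)
```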

62.–64. Modularized implementation: forward / backward API • Each gate (or layer) exposes a forward function that computes its output from its inputs, and a backward function that takes the upstream gradient and returns the gradients with respect to each of its inputs (the code shown on the slides is not reproduced in this transcript)
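A sketch of what such a module might look like, in the spirit of the cs231n examples; the class name and choice of a multiply gate are my own illustration, not the slides' code.

```python
class MultiplyGate:
    """Forward pass caches its inputs; backward pass turns the upstream
    gradient into gradients w.r.t. each input (local * upstream)."""

    def forward(self, x, y):
        self.x, self.y = x, y      # cache inputs for use in backward
        return x * y

    def backward(self, dz):
        dx = self.y * dz           # local gradient dz/dx = y, times upstream
        dy = self.x * dz           # local gradient dz/dy = x, times upstream
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)
print(gate.backward(1.0))          # (-4.0, 3.0)
```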

65. Vector formulation • For layered networks it is generally simpler to think of the process in terms of vector operations – Simpler arithmetic – Fast matrix libraries make operations much faster • We can restate the entire process in vector terms – This is what is actually used in any real system

66. The Jacobian • The derivative of a vector function w.r.t. a vector input is called a Jacobian • It is the matrix of partial derivatives: for $\boldsymbol{z} = f(\boldsymbol{y})$ with $\boldsymbol{z} \in \mathbb{R}^{M}$ and $\boldsymbol{y} \in \mathbb{R}^{N}$, $\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}} = \begin{bmatrix} \frac{\partial z_1}{\partial y_1} & \frac{\partial z_1}{\partial y_2} & \cdots & \frac{\partial z_1}{\partial y_N} \\ \frac{\partial z_2}{\partial y_1} & \frac{\partial z_2}{\partial y_2} & \cdots & \frac{\partial z_2}{\partial y_N} \\ \vdots & & \ddots & \vdots \\ \frac{\partial z_M}{\partial y_1} & \frac{\partial z_M}{\partial y_2} & \cdots & \frac{\partial z_M}{\partial y_N} \end{bmatrix}$ • Check: $\Delta \boldsymbol{z} = \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}}\,\Delta \boldsymbol{y}$

67. Matrix calculus • Scalar-by-vector: $\frac{\partial z}{\partial \boldsymbol{y}} = \Big[\frac{\partial z}{\partial y_1}\ \cdots\ \frac{\partial z}{\partial y_N}\Big]$ • Vector-by-vector: $\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}} = \begin{bmatrix}\frac{\partial z_1}{\partial y_1} & \cdots & \frac{\partial z_1}{\partial y_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_M}{\partial y_1} & \cdots & \frac{\partial z_M}{\partial y_N}\end{bmatrix}$ • Scalar-by-matrix: $\frac{\partial z}{\partial \boldsymbol{B}} = \begin{bmatrix}\frac{\partial z}{\partial B_{11}} & \cdots & \frac{\partial z}{\partial B_{1N}} \\ \vdots & \ddots & \vdots \\ \frac{\partial z}{\partial B_{M1}} & \cdots & \frac{\partial z}{\partial B_{MN}}\end{bmatrix}$ • Vector-by-matrix (via the chain rule through the intermediate vector $\boldsymbol{a}$): $\frac{\partial z}{\partial B_{ij}} = \frac{\partial z}{\partial \boldsymbol{a}}\,\frac{\partial \boldsymbol{a}}{\partial B_{ij}}$

68.–69. Vector-by-matrix gradients: examples (worked on the slides; figures not reproduced in this transcript)

70. Jacobians can describe the derivatives of neural activations w.r.t. their input • For scalar (elementwise) activations, the number of outputs is identical to the number of inputs, so the Jacobian is a diagonal matrix: $\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{z}} = \begin{bmatrix}\frac{dy_1}{dz_1} & 0 & \cdots & 0 \\ 0 & \frac{dy_2}{dz_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{dy_M}{dz_M}\end{bmatrix}$ – Diagonal entries are the individual derivatives of outputs w.r.t. inputs – Not showing the layer superscript [k] for brevity

71. Jacobians can describe the derivatives of neural activations w.r.t. their input • For scalar activations $y_i = f(z_i)$ (shorthand notation): $\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{z}} = \begin{bmatrix}f'(z_1) & 0 & \cdots & 0 \\ 0 & f'(z_2) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & f'(z_M)\end{bmatrix}$ – The Jacobian is a diagonal matrix whose diagonal entries are the individual derivatives of the outputs w.r.t. the inputs
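A short check that an elementwise activation indeed has a diagonal Jacobian, using the sigmoid as the example activation; the finite-difference helper is written here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_jacobian(f, z, eps=1e-6):
    """J[i, j] = d f_i / d z_j, approximated with central differences."""
    n = z.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (f(z + e) - f(z - e)) / (2 * eps)
    return J

z = np.array([0.5, -1.0, 2.0])
J = numerical_jacobian(sigmoid, z)
print(np.round(J, 4))                       # off-diagonal entries are ~0
print(sigmoid(z) * (1 - sigmoid(z)))        # matches the diagonal: f'(z_i)
```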
