Optimization and Backpropagation
I2DL: Prof. Niessner, Prof. Leal-Taixé
Lecture 3 Recap
Neural Network
• Linear score function f = Wx (illustrated on CIFAR-10 and on ImageNet)
Credit: Li/Karpathy/Johnson
Neural Network
• Linear score function f = Wx
• Neural network is a nesting of 'functions'
  – 2-layers: f = W₂ max(0, W₁x)
  – 3-layers: f = W₃ max(0, W₂ max(0, W₁x))
  – 4-layers: f = W₄ tanh(W₃ max(0, W₂ max(0, W₁x)))
  – 5-layers: f = W₅ σ(W₄ tanh(W₃ max(0, W₂ max(0, W₁x))))
  – … up to hundreds of layers
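To make the nesting concrete, here is a minimal NumPy sketch of the 2-layer case f = W₂ max(0, W₁x); the dimensions and random weights are illustrative assumptions, not part of the slide.

```python
import numpy as np

# 2-layer score function f = W2 * max(0, W1 * x)
# (input dim 4, hidden dim 5, 3 classes are made-up sizes for illustration)
rng = np.random.default_rng(0)
x  = rng.standard_normal(4)         # input vector
W1 = rng.standard_normal((5, 4))    # first-layer weights
W2 = rng.standard_normal((3, 5))    # second-layer weights

h = np.maximum(0.0, W1 @ x)         # hidden activations (ReLU)
f = W2 @ h                          # class scores, shape (3,)
```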
Neural Network
[Figure: fully-connected network diagram with input layer, hidden layer, and output layer]
Credit: Li/Karpathy/Johnson
Neural Network
[Figure: network with input layer, three hidden layers, and output layer; width = number of neurons per layer, depth = number of layers]
Activation Functions
• Sigmoid: σ(x) = 1 / (1 + e^(−x))
• tanh: tanh(x)
• ReLU: max(0, x)
• Leaky ReLU: max(0.1x, x)
• Parametric ReLU: max(αx, x)
• Maxout: max(w₁ᵀx + b₁, w₂ᵀx + b₂)
• ELU: f(x) = x if x > 0, α(e^x − 1) if x ≤ 0
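For reference, the activation functions above can be written as a few NumPy one-liners; the parameter values (e.g. α) are illustrative defaults, not prescribed by the slide.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh    = np.tanh
relu    = lambda x: np.maximum(0.0, x)
leaky   = lambda x: np.maximum(0.1 * x, x)
prelu   = lambda x, a=0.25: np.maximum(a * x, x)            # a is a learned parameter in practice
elu     = lambda x, a=1.0: np.where(x > 0, x, a * (np.exp(x) - 1.0))
maxout  = lambda x, W1, b1, W2, b2: np.maximum(W1 @ x + b1, W2 @ x + b2)
```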
Loss Functions
• Measure the goodness of the predictions (or equivalently, the network's performance)
• Regression loss
  – L1 loss: L(y, ŷ; θ) = (1/n) Σᵢ ‖yᵢ − ŷᵢ‖₁
  – MSE loss: L(y, ŷ; θ) = (1/n) Σᵢ ‖yᵢ − ŷᵢ‖₂²
• Classification loss (for multi-class classification)
  – Cross-entropy loss: E(y, ŷ; θ) = − Σᵢ Σₖ yᵢₖ ⋅ log ŷᵢₖ
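A possible NumPy sketch of these losses (assuming y and y_hat are arrays of the same shape; for the cross-entropy, y is one-hot encoded and y_hat contains predicted probabilities):

```python
import numpy as np

def l1_loss(y, y_hat):
    # mean absolute error over the n samples
    return np.mean(np.abs(y - y_hat))

def mse_loss(y, y_hat):
    # mean squared error over the n samples
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    # y: one-hot targets (n, num_classes), y_hat: predicted probabilities
    return -np.sum(y * np.log(y_hat + eps))
```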
Computational Graphs
• Neural network is a computational graph
  – It has compute nodes
  – It has edges that connect nodes
  – It is directional
  – It is organized in 'layers'
Backprop
The Importance of Gradients
• Our optimization schemes are based on computing gradients ∇_θ L(θ)
• One can compute gradients analytically, but what if our function is too complex?
• Break down gradient computation: Backpropagation (Rumelhart et al. 1986)
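One way to see what "too complex" means in practice is to sanity-check an analytic gradient against a finite-difference approximation; this is a generic sketch (function and variable names are my own), not part of the lecture.

```python
import numpy as np

def numerical_gradient(L, theta, h=1e-5):
    # central differences: (L(theta + h*e_i) - L(theta - h*e_i)) / (2h)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[i] = h
        grad.flat[i] = (L(theta + e) - L(theta - e)) / (2.0 * h)
    return grad

# example: L(theta) = sum(theta^2) has analytic gradient 2*theta
theta = np.array([1.0, -3.0, 4.0])
print(numerical_gradient(lambda t: np.sum(t ** 2), theta))   # ~[ 2. -6.  8.]
```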
Backprop: Forward Pass
• f(x, y, z) = (x + y) ⋅ z
• Initialization: x = 1, y = −3, z = 4
• Forward pass: sum node d = x + y = −2, mult node f = d ⋅ z = −8
Backprop: Backward Pass
f(x, y, z) = (x + y) ⋅ z, with x = 1, y = −3, z = 4
Forward pass: d = x + y = −2 (sum node), f = d ⋅ z = −8 (mult node)
Local gradients:
  – d = x + y ⇒ ∂d/∂x = 1, ∂d/∂y = 1
  – f = d ⋅ z ⇒ ∂f/∂d = z, ∂f/∂z = d
What is ∂f/∂x, ∂f/∂y, ∂f/∂z?
  – Start at the output: ∂f/∂f = 1
  – ∂f/∂z = d = −2
  – ∂f/∂d = z = 4
  – Chain rule: ∂f/∂y = (∂f/∂d) ⋅ (∂d/∂y) = 4 ⋅ 1 = 4
  – Chain rule: ∂f/∂x = (∂f/∂d) ⋅ (∂d/∂x) = 4 ⋅ 1 = 4
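The worked example above translates directly into a few lines of Python; the variable names are mine, chosen to mirror the notation.

```python
x, y, z = 1.0, -3.0, 4.0

# forward pass
d = x + y                  # d = -2
f = d * z                  # f = -8

# backward pass (chain rule, starting from df/df = 1)
df_df = 1.0
df_dz = d * df_df          # -2
df_dd = z * df_df          #  4
df_dx = df_dd * 1.0        #  4, since dd/dx = 1
df_dy = df_dd * 1.0        #  4, since dd/dy = 1
```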
Compute Graphs -> Neural Networks
• xₖ: input variables
• w_{l,m,n}: network weights (note 3 indices)
  – l: which layer
  – m: which neuron in the layer
  – n: which weight in the neuron
• ŷᵢ: computed output (i = output dimension, 1 … n_out)
• yᵢ: ground truth targets
• L: loss function
Compute Graphs -> Neural Networks
[Figure: compute graph of a single output neuron. Input layer x₀, x₁; weights w₀, w₁ (the unknowns); multiply and sum nodes produce the output ŷ₀; an L2 loss node compares ŷ₀ to the target y₀ (e.g., class label / regression target)]
Compute Graphs -> Neural Networks
[Figure: the same compute graph with a ReLU activation max(0, x) inserted before the output ŷ₀ (btw. I'm not arguing this is the right choice here)]
We want to compute gradients w.r.t. all weights W
Compute Graphs -> Neural Networks
[Figure: compute graph with inputs x₀, x₁, weights w_{0,0}, w_{0,1}, w_{1,0}, w_{1,1}, w_{2,0}, w_{2,1}, and three outputs ŷ₀, ŷ₁, ŷ₂, each compared to its target by its own loss node]
We want to compute gradients w.r.t. all weights W
Compute Graphs -> Neural Networks
Goal: We want to compute gradients of the loss function L w.r.t. all weights W
• L = Σᵢ Lᵢ : sum over the loss per sample, e.g. L2 loss ⟶ simply sum up squares: Lᵢ = (ŷᵢ − yᵢ)²
• ⟶ use chain rule to compute partials: ∂Lᵢ/∂w_{i,k} = (∂Lᵢ/∂ŷᵢ) ⋅ (∂ŷᵢ/∂w_{i,k})
• with ŷᵢ = A(bᵢ + Σₖ xₖ w_{i,k}), where A is the activation function and bᵢ the bias
• We want to compute gradients w.r.t. all weights W AND all biases b
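As a sketch of this chain rule for a single neuron with an L2 loss, assuming sigmoid as the activation A and made-up input values:

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

x = np.array([1.0, -2.0, 0.5])   # inputs (illustrative)
w = np.array([0.3, -0.1, 0.8])   # weights w_{i,k} of neuron i
b = 0.2                          # bias b_i
y = 1.0                          # ground truth target y_i

s     = b + x @ w                # pre-activation
y_hat = sigmoid(s)               # prediction y_hat_i = A(b_i + sum_k x_k w_{i,k})
L_i   = (y_hat - y) ** 2         # per-sample L2 loss

# chain rule: dL_i/dw_{i,k} = dL_i/dy_hat * dy_hat/ds * ds/dw_{i,k}
dL_dyhat = 2.0 * (y_hat - y)
dyhat_ds = y_hat * (1.0 - y_hat)      # sigmoid derivative
dL_dw    = dL_dyhat * dyhat_ds * x    # gradient w.r.t. all weights of the neuron
dL_db    = dL_dyhat * dyhat_ds        # gradient w.r.t. the bias
```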
NNs as Computational Graphs
• We can express any kind of function in a computational graph, e.g.
  f(w, x) = 1 / (1 + e^(−(w₀x₀ + w₁x₁ + b)))
• Sigmoid function: σ(x) = 1 / (1 + e^(−x))
[Figure: the function decomposed into elementary compute nodes: *, +, ⋅(−1), exp(⋅), +1, 1/x]
NNs as Computational Graphs
• f(w, x) = 1 / (1 + e^(−(w₀x₀ + w₁x₁ + b)))
• Forward pass with w₀ = 2, x₀ = −1, w₁ = −3, x₁ = −2, b = −3:
  w₀x₀ = −2, w₁x₁ = 6, sum = 4, + b = 1, ⋅(−1) = −1, exp = 0.37, +1 = 1.37, 1/x = 0.73
NNs as Computational Graphs
• f(w, x) = 1 / (1 + e^(−(w₀x₀ + w₁x₁ + b)))
• Local derivatives of the elementary nodes:
  – f(x) = 1/x ⇒ ∂f/∂x = −1/x²
  – f_c(x) = c + x ⇒ ∂f/∂x = 1
  – f(x) = e^x ⇒ ∂f/∂x = e^x
  – f_a(x) = ax ⇒ ∂f/∂x = a
• Backward pass: start with gradient 1 at the output; at the 1/x node the gradient is 1 ⋅ (−1/1.37²) = −0.53
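Putting the forward values and the local derivatives together, a node-by-node sketch of this example (variable names are my own):

```python
import numpy as np

w0, x0, w1, x1, b = 2.0, -1.0, -3.0, -2.0, -3.0

# forward pass, one elementary node at a time
s   = w0 * x0 + w1 * x1 + b    #  1.0
neg = -1.0 * s                 # -1.0
ex  = np.exp(neg)              #  0.37
den = 1.0 + ex                 #  1.37
f   = 1.0 / den                #  0.73

# backward pass: multiply local derivatives along the graph
df_dden = -1.0 / den ** 2          # 1/x node   -> -0.53
df_dex  = 1.0 * df_dden            # +1 node
df_dneg = ex * df_dex              # exp node
df_ds   = -1.0 * df_dneg           # *(-1) node ->  0.20
df_dw0, df_dx0 = x0 * df_ds, w0 * df_ds
df_dw1, df_dx1 = x1 * df_ds, w1 * df_ds
df_db = 1.0 * df_ds
```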