Scaling Optimization
I2DL: Prof. Niessner, Prof. Leal-Taixé
Lecture 4 Recap
Neural Network
Source: http://cs231n.github.io/neural-networks-1/
Neural Network
[Figure: network with Input Layer, Hidden Layers 1 to 3, and Output Layer; annotated with Width (neurons per layer) and Depth (number of layers).]
Compute Graphs → Neural Networks
[Figure: compute graph of a single neuron. The inputs $x_0, x_1$ (input layer) are multiplied by the weights $w_0, w_1$ (unknowns!) and summed; a ReLU activation $\max(0, x)$ gives the output $\hat{y}_0$ (output layer) (btw. I'm not arguing this is the right choice here); the target $y_0$ (e.g., class label / regression target) is subtracted and the result is squared ($x \ast x$, L2 loss) to give the loss/cost.]
We want to compute gradients w.r.t. all weights $\boldsymbol{W}$.
Compute Graphs → Neural Networks
[Figure: compute graph with several outputs. The inputs $x_0, x_1$ (input layer) are multiplied by the weights $w_{0,0}, w_{0,1}, w_{1,0}, w_{1,1}, w_{2,0}, w_{2,1}$ and summed per output to give $\hat{y}_0, \hat{y}_1, \hat{y}_2$ (output layer); for each output the target is subtracted and the result squared to give a loss/cost term.]
We want to compute gradients w.r.t. all weights $\boldsymbol{W}$.
Compute Graphs → Neural Networks
Goal: we want to compute gradients of the loss function $L$ w.r.t. all weights $w$.
$L = \sum_i L_i$
$L$: sum over the loss per sample, e.g. L2 loss ⟶ simply sum up squares: $L_i = (\hat{y}_i - y_i)^2$
⟶ use the chain rule to compute partials:
$\frac{\partial L}{\partial w_{i,k}} = \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w_{i,k}}$
$\hat{y}_i = A(b_i + \sum_k x_k w_{i,k})$, with activation function $A$ and bias $b_i$
[Figure: inputs $x_0, \ldots, x_k$ (input layer) connected to outputs $\hat{y}_0, \hat{y}_1, \ldots$ with targets $y_0, y_1, \ldots$ (output layer).]
We want to compute gradients w.r.t. all weights $\boldsymbol{W}$ AND all biases $\boldsymbol{b}$.
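The chain rule step above can be sketched for a single output neuron. This is a minimal illustration, not the lecture's code: the ReLU activation, the function names, and all numbers are assumptions chosen for the example, and the analytic result is compared against a finite-difference estimate.

```python
# Sketch: gradient of an L2 loss for one output neuron via the chain rule.
# ReLU as activation and all concrete numbers are assumptions for illustration.

def relu(z):
    return max(0.0, z)

def forward(x, w, b):
    """Return (y_hat, z) with z = b + sum_k x_k * w_k and y_hat = relu(z)."""
    z = b + sum(xk * wk for xk, wk in zip(x, w))
    return relu(z), z

def loss(x, w, b, y):
    y_hat, _ = forward(x, w, b)
    return (y_hat - y) ** 2

def gradients(x, w, b, y):
    """Chain rule: dL/dw_k = dL/dy_hat * dy_hat/dz * dz/dw_k."""
    y_hat, z = forward(x, w, b)
    dL_dyhat = 2.0 * (y_hat - y)
    dyhat_dz = 1.0 if z > 0 else 0.0                 # derivative of ReLU
    grad_w = [dL_dyhat * dyhat_dz * xk for xk in x]  # dz/dw_k = x_k
    grad_b = dL_dyhat * dyhat_dz                     # dz/db = 1
    return grad_w, grad_b

x, w, b, y = [1.0, 2.0], [0.5, -0.25], 0.5, 2.0
grad_w, grad_b = gradients(x, w, b, y)

# Finite-difference check of dL/dw_0.
h = 1e-6
numeric = (loss(x, [w[0] + h, w[1]], b, y) - loss(x, w, b, y)) / h
```

The analytic and numeric values for the first weight should agree up to the finite-difference error.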
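Summary
• We have
– (Directional) compute graph
– Structure graph into layers
– Compute partial derivatives w.r.t. weights (unknowns):
$\nabla_{\boldsymbol{W}} f_{\boldsymbol{x},\boldsymbol{y}}(\boldsymbol{W}) = \left[ \frac{\partial f}{\partial w_{0,0,0}}, \ldots, \frac{\partial f}{\partial w_{l,m,n}}, \ldots, \frac{\partial f}{\partial b_{l,m}} \right]$
• Next: gradient step
– Find weights based on gradients: $\boldsymbol{W}' = \boldsymbol{W} - \alpha \nabla_{\boldsymbol{W}} f_{\boldsymbol{x},\boldsymbol{y}}(\boldsymbol{W})$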
Optimization
Gradient Descent
$x^* = \arg\min f(x)$
[Figure: curve of $f$ with the Initialization point and the Optimum marked.]
Gradient Descent
$x^* = \arg\min f(x)$
Follow the slope of the DERIVATIVE:
$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
[Figure: curve of $f$ with the Initialization point and the Optimum marked.]
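The limit definition of the derivative can be approximated numerically by using a small but finite step. A minimal sketch, where the test function and the step size are assumptions made for illustration:

```python
# Sketch: forward-difference approximation of the derivative.
# The function f and the step size h are illustrative assumptions.

def numerical_derivative(f, x, h=1e-6):
    """Approximate df/dx at x by (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

def f(x):
    return (x - 3.0) ** 2      # analytic derivative: 2 * (x - 3)

slope = numerical_derivative(f, 1.0)   # analytic value at x = 1 is -4
```

A negative slope at the current point means that increasing x decreases the function there.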
Gradient Descent
• From derivative to gradient: $\frac{df(x)}{dx} \rightarrow \nabla_x f(x)$
– Direction of greatest increase of the function
• Gradient steps in direction of negative gradient:
$x' = x - \alpha \nabla_x f(x)$, with learning rate $\alpha$
[Figures: the same update shown once with a SMALL learning rate and once with a LARGE learning rate.]
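The effect of the learning rate can be tried on a one-dimensional toy problem. This sketch assumes the function x**2 and three hand-picked learning rates; none of these values come from the slides.

```python
# Sketch: gradient steps on f(x) = x**2 with different learning rates.
# The function and the learning-rate values are illustrative assumptions.

def grad_f(x):
    return 2.0 * x             # gradient of f(x) = x**2

def gradient_descent(x0, lr, steps):
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)  # step in the direction of the negative gradient
    return x

x_small = gradient_descent(1.0, lr=0.01, steps=20)  # small rate: slow progress
x_moderate = gradient_descent(1.0, lr=0.4, steps=20)
x_large = gradient_descent(1.0, lr=1.1, steps=20)   # large rate: overshoots and diverges
```

With the small rate the iterate is still far from the minimum at 0 after 20 steps, while with the large rate each step overshoots and the distance to the minimum grows.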
Gradient Descent
$\boldsymbol{x}^* = \arg\min f(\boldsymbol{x})$
What is the gradient when we reach this point?
Not guaranteed to reach the global optimum.
[Figure: curve of $f$ with the Initialization point, a local minimum, and the Optimum marked.]
Convergence of Gradient Descent
• Convex function: all local minima are global minima
A function is convex if the line/plane segment between any two points on its graph lies above or on the graph.
Source: https://en.wikipedia.org/wiki/Convex_function#/media/File:ConvexFunction.svg
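The segment condition can be checked numerically on sampled points. The helper below and both test functions are hypothetical; sampling can only find violations, it cannot prove convexity.

```python
# Sketch: sampled check of the segment condition for convexity.
# segment_above_graph and both example functions are illustrative assumptions.

def segment_above_graph(f, x1, x2, samples=101):
    """Check f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2) for sampled t in [0, 1]."""
    for s in range(samples):
        t = s / (samples - 1)
        point = t * x1 + (1 - t) * x2
        chord = t * f(x1) + (1 - t) * f(x2)
        if f(point) > chord + 1e-12:   # small tolerance for rounding
            return False
    return True

def convex(x):
    return x ** 2

def non_convex(x):
    return x ** 4 - 2 * x ** 2         # two minima, at x = -1 and x = 1

convex_ok = segment_above_graph(convex, -2.0, 3.0)
non_convex_ok = segment_above_graph(non_convex, -1.0, 1.0)
```

For the second function the segment between the two minima runs below the graph around x = 0, so the check fails.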
Convergence of Gradient Descent
• Neural networks are non-convex
– many (different) local minima
– no (practical) way to say which is globally optimal
Source: Li, Qi. (2006). Challenging Registration of Geologic Image Data
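How the result depends on the initialization can be seen on a one-dimensional toy function with two minima. This is only a stand-in for a non-convex loss, not a neural network; the function, learning rate, and starting points are assumptions.

```python
# Sketch: gradient descent on a non-convex 1D function from two initializations.
# The function, learning rate, and starting points are illustrative assumptions.

def f(x):
    return x ** 4 - 2.0 * x ** 2 + 0.5 * x

def grad_f(x):
    return 4.0 * x ** 3 - 4.0 * x + 0.5

def gradient_descent(x0, lr=0.05, steps=200):
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

x_left = gradient_descent(-1.5)   # ends in the minimum left of zero
x_right = gradient_descent(1.5)   # ends in the minimum right of zero
```

Both runs stop at a point with (nearly) zero gradient, but the two function values differ, and the run itself gives no indication of which minimum is the better one.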
Convergence of Gradient Descent
[Figure] Source: https://builtin.com/data-science/gradient-descent
[Figure] Source: A. Geron
Gradient Descent: Multiple Dimensions
Various ways to visualize…
[Figure] Source: builtin.com/data-science/gradient-descent
[Figure] Source: http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png
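Gradient Descent for Neural Networks
[Figure: network with inputs $x_0, x_1, x_2$, hidden units $h_0, \ldots, h_3$, outputs $\hat{y}_0, \hat{y}_1$, and targets $y_0, y_1$.]
Loss function: $L_i = (\hat{y}_i - y_i)^2$
$\hat{y}_i = A(b_{1,i} + \sum_j h_j w_{1,i,j})$
$h_j = A(b_{0,j} + \sum_k x_k w_{0,j,k})$
Just simple: $A(x) = \max(0, x)$
$\nabla_{\boldsymbol{W},\boldsymbol{b}} f_{\boldsymbol{x},\boldsymbol{y}}(\boldsymbol{W}) = \left[ \frac{\partial f}{\partial w_{0,0,0}}, \ldots, \frac{\partial f}{\partial w_{l,m,n}}, \ldots, \frac{\partial f}{\partial b_{l,m}} \right]$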
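The forward pass of such a two-layer network can be written out directly. A minimal sketch with 3 inputs, 4 hidden units, and 2 outputs; the parameter names (W0, b0, W1, b1) and all numeric values are made up for illustration.

```python
# Sketch: forward pass and L2 loss of a small two-layer ReLU network.
# Parameter names and all numbers are illustrative assumptions.

def relu(z):
    return max(0.0, z)

def layer(inputs, weights, biases):
    """One fully connected layer: out_j = relu(b_j + sum_k in_k * w[j][k])."""
    return [relu(b_j + sum(x_k * w_jk for x_k, w_jk in zip(inputs, w_j)))
            for w_j, b_j in zip(weights, biases)]

def forward(x, params):
    h = layer(x, params["W0"], params["b0"])      # hidden activations
    return layer(h, params["W1"], params["b1"])   # network outputs

def l2_loss(y_hat, y):
    return sum((p - t) ** 2 for p, t in zip(y_hat, y))

params = {
    "W0": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
    "b0": [0.0, 0.0, 0.0, -1.0],
    "W1": [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 0.0, 0.0]],
    "b1": [0.5, 0.0],
}
y_hat = forward([1.0, 2.0, -3.0], params)
sample_loss = l2_loss(y_hat, [3.0, 1.0])
```

The gradient collects one partial derivative of this loss for every entry of the weight and bias lists.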
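Gradient Descent: Single Training Sample
• Given a loss function $L$ and a single training sample $\{\boldsymbol{x}_i, \boldsymbol{y}_i\}$
• Find best model parameters $\boldsymbol{\theta} = \{\boldsymbol{W}, \boldsymbol{b}\}$
• Cost $L_i(\boldsymbol{\theta}, \boldsymbol{x}_i, \boldsymbol{y}_i)$
– $\boldsymbol{\theta} = \arg\min_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta}, \boldsymbol{x}_i, \boldsymbol{y}_i)$
• Gradient Descent:
– Initialize $\boldsymbol{\theta}^1$ with 'random' values (more on that later)
– $\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^k - \alpha \nabla_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta}^k, \boldsymbol{x}_i, \boldsymbol{y}_i)$
– Iterate until convergence: $\lVert \boldsymbol{\theta}^{k+1} - \boldsymbol{\theta}^k \rVert < \epsilon$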
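The loop above (random initialization, gradient update, stop once the parameters barely change) can be sketched for a tiny model. The linear model with one weight and one bias, the learning rate, the tolerance, and the seed are all assumptions for illustration.

```python
# Sketch: gradient descent on a single training sample with a convergence test.
# The linear model, learning rate, tolerance, and seed are illustrative assumptions.
import math
import random

def loss(theta, x, y):
    w, b = theta
    return (w * x + b - y) ** 2

def grad(theta, x, y):
    w, b = theta
    r = w * x + b - y              # residual
    return [2.0 * r * x, 2.0 * r]  # [dL/dw, dL/db]

def gradient_descent(x, y, lr=0.05, eps=1e-8, max_iters=10_000, seed=0):
    rng = random.Random(seed)
    theta = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]  # 'random' init
    iters = 0
    for _ in range(max_iters):
        g = grad(theta, x, y)
        new_theta = [t - lr * gi for t, gi in zip(theta, g)]
        step = math.sqrt(sum((a - c) ** 2 for a, c in zip(new_theta, theta)))
        theta = new_theta
        iters += 1
        if step < eps:             # parameters barely change: stop
            break
    return theta, iters

theta, iters = gradient_descent(x=2.0, y=3.0)
final_loss = loss(theta, 2.0, 3.0)
```

With a single sample and two parameters the minimizer is not unique; which one is reached depends on the initialization.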
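Gradient Descent: Single Training Sample
– $\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^k - \alpha \nabla_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta}^k, \boldsymbol{x}_i, \boldsymbol{y}_i)$
  $\boldsymbol{\theta}^{k+1}$: weights, biases after update step
  $\boldsymbol{\theta}^k$: weights, biases at step k (current model)
  $\alpha$: learning rate
  $\nabla_{\boldsymbol{\theta}}$: gradient w.r.t. $\boldsymbol{\theta}$
  $L_i$: loss function
  $\{\boldsymbol{x}_i, \boldsymbol{y}_i\}$: training sample
– $\nabla_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta}^k, \boldsymbol{x}_i, \boldsymbol{y}_i)$ computed via backpropagation
– Typically: $\dim(\nabla_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta}^k, \boldsymbol{x}_i, \boldsymbol{y}_i)) = \dim(\boldsymbol{\theta}) \gg 1$ million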
Gradient Descent: Multiple Training Samples
• Given a loss function $L$ and multiple ($n$) training samples $\{\boldsymbol{x}_i, \boldsymbol{y}_i\}$
• Find best model parameters $\boldsymbol{\theta} = \{\boldsymbol{W}, \boldsymbol{b}\}$
• Cost $L = \frac{1}{n} \sum_{i=1}^{n} L_i(\boldsymbol{\theta}, \boldsymbol{x}_i, \boldsymbol{y}_i)$
– $\boldsymbol{\theta} = \arg\min L$
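Gradient Descent: Multiple Training Samples
• Update step for multiple samples:
$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^k - \alpha \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^k, \boldsymbol{x}_{1..n}, \boldsymbol{y}_{1..n})$
• Gradient is average / sum over residuals:
$\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^k, \boldsymbol{x}_{1..n}, \boldsymbol{y}_{1..n}) = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta}^k, \boldsymbol{x}_i, \boldsymbol{y}_i)$
Reminder: this comes from backprop.
• Often people are lazy and just write $\nabla L = \sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} L_i$
– omitting $\frac{1}{n}$ is not 'wrong', it just means rescaling the learning rate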
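That dropping the averaging factor only rescales the learning rate can be checked on a toy problem. A minimal sketch, assuming a scalar parameter and a made-up data set of four samples:

```python
# Sketch: averaged vs. summed gradient differ only by a learning-rate rescaling.
# The scalar model and the data are illustrative assumptions.

def grad_i(theta, x, y):
    """Gradient of the per-sample loss (theta * x - y)**2 w.r.t. theta."""
    return 2.0 * (theta * x - y) * x

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
n = len(xs)
theta, lr = 0.5, 0.01

grad_sum = sum(grad_i(theta, x, y) for x, y in zip(xs, ys))
grad_mean = grad_sum / n

step_mean = theta - lr * grad_mean       # averaged gradient, learning rate lr
step_sum = theta - (lr / n) * grad_sum   # summed gradient, learning rate lr / n
```

Both updates land on the same parameter value, so the two conventions describe the same algorithm with differently scaled learning rates.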
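Side Note: Optimal Learning Rate
Can compute the optimal learning rate $\alpha$ using Line Search (optimal for a given set)
1. Compute gradient: $\nabla_{\boldsymbol{\theta}} L = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} L_i$
2. Optimize for the optimal step $\alpha$: $\arg\min_{\alpha} L(\boldsymbol{\theta}^k - \alpha \nabla_{\boldsymbol{\theta}} L)$, where $\boldsymbol{\theta}^k - \alpha \nabla_{\boldsymbol{\theta}} L$ is the candidate $\boldsymbol{\theta}^{k+1}$
3. $\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^k - \alpha \nabla_{\boldsymbol{\theta}} L$
Not that practical for DL since we need to solve a huge system every step…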
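The three line-search steps can be sketched on a toy problem. The slides do not say which one-dimensional optimizer is used; golden-section search is swapped in here as one possible choice, and the scalar model, the data, and the search interval are assumptions.

```python
# Sketch: one gradient step with the step size chosen by line search.
# Golden-section search, the model, the data, and the interval are assumptions.
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
n = len(xs)

def total_loss(theta):
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / n

def total_grad(theta):
    return sum(2.0 * (theta * x - y) * x for x, y in zip(xs, ys)) / n

def golden_section_min(phi, lo, hi, tol=1e-10):
    """Minimize a unimodal 1D function phi on the interval [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while abs(b - a) > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if phi(c) < phi(d):
            b = d      # minimum lies in [a, d]
        else:
            a = c      # minimum lies in [c, b]
    return (a + b) / 2.0

theta_k = 0.5
g = total_grad(theta_k)                                      # 1. compute gradient
alpha = golden_section_min(lambda a: total_loss(theta_k - a * g), 0.0, 1.0)  # 2. best step
theta_next = theta_k - alpha * g                             # 3. update
```

Each evaluation inside the search is a full pass over the data, which hints at why this becomes expensive for large models and data sets.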
Gradient Descent on Train Set
• Given large train set with $n$ training samples $\{\boldsymbol{x}_i, \boldsymbol{y}_i\}$
– Let's say 1 million labeled images
– Let's say our network has 500k parameters
• Gradient has 500k dimensions
• $n = 1$ million
→ Extremely expensive to compute
Stochastic Gradient Descent (SGD)
• If we have $n$ training samples we need to compute the gradient for all of them, which is $O(n)$
• If we consider the problem as empirical risk minimization, we can express the total loss over the training data as the expectation over all the samples:
$\frac{1}{n} \sum_{i=1}^{n} L_i(\boldsymbol{\theta}, \boldsymbol{x}_i, \boldsymbol{y}_i) = \mathbb{E}_{i \sim [1,\ldots,n]} \left[ L_i(\boldsymbol{\theta}, \boldsymbol{x}_i, \boldsymbol{y}_i) \right]$
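The identity between the averaged loss and an expectation over a uniformly drawn sample index can be checked numerically. A minimal sketch; the scalar model, the data, the number of draws, and the seed are assumptions for illustration.

```python
# Sketch: mean training loss as an expectation over a uniformly drawn index.
# The scalar model, the data, the draw count, and the seed are assumptions.
import random

def loss_i(theta, x, y):
    return (theta * x - y) ** 2

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
n = len(xs)
theta = 0.5

# Averaged loss over the whole training set.
full_loss = sum(loss_i(theta, x, y) for x, y in zip(xs, ys)) / n

# Expectation under a uniform index: every sample has probability 1/n.
expected_loss = sum((1.0 / n) * loss_i(theta, xs[i], ys[i]) for i in range(n))

# Monte Carlo estimate from randomly drawn indices (fixed seed for repeatability).
rng = random.Random(0)
draws = [rng.randrange(n) for _ in range(20_000)]
mc_estimate = sum(loss_i(theta, xs[i], ys[i]) for i in draws) / len(draws)
```

The Monte Carlo estimate only approximates the full average, with an error that shrinks as more indices are drawn.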