Advanced Machine Learning - Exercise 3: Deep Learning Essentials
Introduction: What's the plan?
– Exercise overview
– Deep learning in a nutshell
– Backprop in (painful) detail
Introduction: Exercise overview
Goal: implement a simple DL framework.
Tasks:
– Compute derivatives (Jacobians)
– Write code
You'll need some help...
Introduction: Deep learning in a nutshell

Given:
– Training data $X = \{x_i\}_{i=1..N}$ with $x_i \in I$, usually arranged as $X \in \mathbb{R}^{N \times N_I}$.
– Training labels $T = \{t_i\}_{i=1..N}$ with $t_i \in O$.

Choose:
– A parametrized, (sub-)differentiable function $F(x, \theta): I \times P \to O$, with:
  typically, input space $I = \mathbb{R}^{N_I}$ (generic data), $I = \mathbb{R}^{3 \times H \times W}$ (images), ...
  typically, output space $O = \mathbb{R}^{N_O}$ (regression), $O = [0, 1]^{N_O}$ (probabilistic classification), ...
  typically, parameter space $P = \mathbb{R}^{N_P}$.
– A (sub-)differentiable criterion/loss $L(T, F(X, \theta)): O \times O \to \mathbb{R}$.

Find:
$$\theta^* = \operatorname*{argmin}_{\theta \in P} L(T, F(X, \theta))$$

Assumption: the loss decomposes into a mean over per-sample losses,
$$L(T, F(X, \theta)) = \frac{1}{N} \sum_{i=1}^{N} \ell(t_i, F(x_i, \theta))$$
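To make the notation concrete, here is a minimal numpy sketch of this setup; the toy linear model, the squared-error loss, and all variable names are illustrative, not part of the exercise.

    import numpy as np

    N, N_I, N_O = 100, 5, 3                 # samples, input dim N_I, output dim N_O
    X = np.random.randn(N, N_I)             # training data, X in R^{N x N_I}
    T = np.random.randn(N, N_O)             # training labels, t_i in O = R^{N_O}
    theta = np.random.randn(N_I, N_O)       # parameters, theta in P

    def F(x, theta):                        # a toy parametrized, differentiable model
        return x @ theta

    def loss(t, y):                         # per-sample loss l(t_i, F(x_i, theta)): squared error
        return 0.5 * np.sum((y - t) ** 2)

    # L(T, F(X, theta)) = (1/N) * sum_i l(t_i, F(x_i, theta))
    L = np.mean([loss(T[i], F(X[i], theta)) for i in range(N)])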
Backprop

Differentiating the empirical risk with respect to θ and applying the chain rule:
$$D_\theta \left[ \frac{1}{N} \sum_{i=1}^{N} \ell(t_i, F(x_i, \theta)) \right] = \frac{1}{N} \sum_{i=1}^{N} D_\theta\, \ell(t_i, F(x_i, \theta)) = \frac{1}{N} \sum_{i=1}^{N} D_F\, \ell(t_i, F(x_i, \theta)) \circ D_\theta F(x_i, \theta)$$

Assumption: F is hierarchical,
$$F(x_i, \theta) = f_1(f_2(f_3(\ldots x_i \ldots, \theta_3), \theta_2), \theta_1)$$

so the parameter Jacobians factorize layer by layer:
$$D_{\theta_1} F(x_i, \theta) = D_{\theta_1} f_1(f_2, \theta_1)$$
$$D_{\theta_2} F(x_i, \theta) = D_{f_2} f_1(f_2, \theta_1) \circ D_{\theta_2} f_2(f_3, \theta_2)$$
$$D_{\theta_3} F(x_i, \theta) = D_{f_2} f_1(f_2, \theta_1) \circ D_{f_3} f_2(f_3, \theta_2) \circ D_{\theta_3} f_3(\ldots, \theta_3)$$
where $f_2 = f_2(f_3(\ldots x_i \ldots, \theta_3), \theta_2)$, etc.
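A quick numerical sanity check of this factorization with two scalar toy modules (all names and the tanh/linear choice are illustrative): the analytic chain-rule product matches a central finite difference.

    import numpy as np

    x, theta1, theta2 = 0.7, 1.3, -0.4

    def F(x, theta1, theta2):
        # F(x, theta) = f1(f2(x, theta2), theta1) with f2(x, theta2) = tanh(theta2*x), f1(z, theta1) = theta1*z
        return theta1 * np.tanh(theta2 * x)

    # Chain rule: D_{theta2} F = D_{f2} f1 o D_{theta2} f2
    z2 = np.tanh(theta2 * x)
    D_f2_f1 = theta1                       # df1/dz evaluated at z = f2(x, theta2)
    D_th2_f2 = (1.0 - z2 ** 2) * x         # df2/dtheta2
    analytic = D_f2_f1 * D_th2_f2

    # Central finite difference as a reference.
    h = 1e-6
    numeric = (F(x, theta1, theta2 + h) - F(x, theta1, theta2 - h)) / (2 * h)
    print(abs(analytic - numeric))         # should be around 1e-10 or smaller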
Backprop: Jacobians

The loss:
$$D_F\, \ell(t_i, F(x_i, \theta)) = \begin{pmatrix} \partial_{F_1} \ell & \ldots & \partial_{F_{N_F}} \ell \end{pmatrix} \in \mathbb{R}^{1 \times N_F}$$

The functions (modules):
$$f(z, \theta) = \begin{pmatrix} f_1\big((z_1 \ldots z_{N_z}), \theta\big) \\ \vdots \\ f_{N_f}\big((z_1 \ldots z_{N_z}), \theta\big) \end{pmatrix}, \qquad D_z f(z, \theta) = \begin{pmatrix} \partial_{z_1} f_1 & \ldots & \partial_{z_{N_z}} f_1 \\ \vdots & & \vdots \\ \partial_{z_1} f_{N_f} & \ldots & \partial_{z_{N_z}} f_{N_f} \end{pmatrix} \in \mathbb{R}^{N_f \times N_z}$$
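As a small illustration of these two shapes (the squared-error loss and the elementwise tanh module are example choices, not prescribed by the exercise):

    import numpy as np

    # Row Jacobian of a squared-error loss l(t, F) = 0.5*||F - t||^2 with respect to F: shape (1, N_F).
    t = np.array([[1.0, 0.0, 0.0]])
    F = np.array([[0.8, 0.1, 0.1]])
    D_F_loss = F - t                        # in R^{1 x N_F}

    # Jacobian of a parameter-free elementwise-tanh module with respect to its input z:
    # shape (N_f, N_z); it is diagonal because output i depends only on input i.
    z = np.array([0.2, -1.0, 0.5])
    D_z_f = np.diag(1.0 - np.tanh(z) ** 2)  # in R^{N_f x N_z}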
Backprop: Modules

Looking at module $f_2$:
$$D_{\theta_3} F(x_i, \theta) = \underbrace{\underbrace{D_{f_2} f_1(f_2, \theta_1)}_{\text{grad output}} \circ \underbrace{D_{f_3} f_2(f_3, \theta_2)}_{\text{Jacobian wrt. input}}}_{\text{grad input (= grad output of } f_3)} \circ\ D_{\theta_3} f_3(\ldots, \theta_3)$$

Three (core) functions per module:
– fprop: compute the output f_i(z, θ_i) given the input z and the current parametrization θ_i.
– grad input: compute grad_output · D_z f_i(z, θ_i).
– grad param: compute ∇θ_i = grad_output · D_{θ_i} f_i(z, θ_i).

Typically:
– fprop caches its input and/or output for later reuse.
– grad input and grad param are combined into a single bprop function to share computation (see the module sketch below).
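A minimal numpy sketch of such a module; the fprop/bprop interface follows the slides, while the Tanh module itself is just one illustrative instance:

    import numpy as np

    class Tanh:
        """Parameter-free module: f(z) = tanh(z), applied elementwise to each row of z."""

        def fprop(self, z):
            self.out = np.tanh(z)            # cache the output for reuse in bprop
            return self.out

        def bprop(self, grad_output):
            # grad_input = grad_output . D_z f(z); the Jacobian of an elementwise
            # nonlinearity is diagonal, so the product reduces to an elementwise multiply.
            return grad_output * (1.0 - self.out ** 2)

        def params(self):
            return []                        # no parameters

        def grads(self):
            return []                        # hence no parameter gradients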
Backprop: (Mini-)Batching

Remember:
$$\frac{1}{N} \sum_{i=1}^{N} D_F\, \ell(t_i, F(x_i, \theta)) \circ D_\theta F(\ldots), \qquad \text{where } D_F\, \ell = \begin{pmatrix} \partial_{F_1} \ell & \ldots & \partial_{F_{N_F}} \ell \end{pmatrix} \in \mathbb{R}^{1 \times N_F}$$

Reformulating this as matrix operations allows computing the whole batch in a single pass, by stacking the per-sample row gradients:
$$\frac{1}{N} \begin{pmatrix} \partial_{F_1} \ell\big(t_1, F(x_1, \theta)\big) & \ldots & \partial_{F_{N_F}} \ell\big(t_1, F(x_1, \theta)\big) \\ \vdots & & \vdots \\ \partial_{F_1} \ell\big(t_N, F(x_N, \theta)\big) & \ldots & \partial_{F_{N_F}} \ell\big(t_N, F(x_N, \theta)\big) \end{pmatrix} \in \mathbb{R}^{N \times N_F}$$
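For instance, with a squared-error criterion the whole stacked matrix can be formed in one vectorized numpy operation (a sketch; the criterion choice is illustrative):

    import numpy as np

    N, N_F = 4, 3
    T = np.random.randn(N, N_F)        # targets, one row per sample
    F = np.random.randn(N, N_F)        # network outputs for the whole batch
    # For l(t, F) = 0.5*||F - t||^2, row i of the stacked matrix is D_F l(t_i, F(x_i, theta)) = F_i - t_i,
    # so the full (1/N)-scaled matrix in R^{N x N_F} is:
    grad_output = (F - T) / N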
Backprop: Usage/training

    import numpy as np

    net = [f1, f2, ...]                          # the modules of the network
    loss = criterion                             # the criterion/loss module

    for Xb, Tb in batches(X, T):                 # mini-batches of size N_B
        # Forward pass through all modules.
        z = Xb
        for module in net:
            z = module.fprop(z)
        costs = loss.fprop(z, Tb)

        # Backward pass, seeded with d(mean cost)/d(cost_i) = 1/N_B per sample.
        dz = loss.bprop(np.full(N_B, 1.0 / N_B))
        for module in reversed(net):
            dz = module.bprop(dz)

        # Gradient-descent step: theta <- theta - lambda * dtheta, for every module.
        for module in net:
            for theta, dtheta in zip(module.params(), module.grads()):
                theta -= lr * dtheta
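The batches iterator is not specified on the slide; a minimal version (no shuffling, names assumed) could look like this:

    def batches(X, T, batch_size=100):
        """Yield successive mini-batches (Xb, Tb) of the training set."""
        for start in range(0, len(X), batch_size):
            yield X[start:start + batch_size], T[start:start + batch_size]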
Backprop: Example: Linear (fully-connected) module

f(z, W, b) = z · W + b,
where z ∈ R^{1 × N_z}, W ∈ R^{N_z × N_f}, and b ∈ R^{1 × N_f}.

The gradients are:
– grad_W = z^T · grad_output ∈ R^{N_z × N_f}
– grad_b = grad_output ∈ R^{1 × N_f}
– grad_input = grad_output · W^T ∈ R^{1 × N_z}
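A minimal numpy sketch of this module, batched so that each row of z is one sample (grad_b then sums over the batch dimension); the class name and the initialization, taken from the rule-of-thumb slide, are illustrative:

    import numpy as np

    class Linear:
        """Fully-connected module: f(z, W, b) = z.W + b."""

        def __init__(self, n_in, n_out):
            # Initialization as suggested on the rule-of-thumb slide: N(0, sqrt(2/(M+N))), b = 0.
            self.W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))
            self.b = np.zeros((1, n_out))

        def fprop(self, z):
            self.z = z                                   # cache the input for bprop
            return z @ self.W + self.b

        def bprop(self, grad_output):
            self.grad_W = self.z.T @ grad_output                   # z^T . grad_output
            self.grad_b = grad_output.sum(axis=0, keepdims=True)   # per-sample grad_b rows, summed
            return grad_output @ self.W.T                          # grad_input = grad_output . W^T

        def params(self):
            return [self.W, self.b]

        def grads(self):
            return [self.grad_W, self.grad_b]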
Backprop: Gradient checking

A crucial debugging method! Compare the Jacobian computed by finite differences (using the fprop function) to the Jacobian computed by the bprop function.

Advice: use a (small) random input x, and step sizes h_i = √eps · max(x_i, 1).

Finite differences give the Jacobian one column at a time, e.g. the first column:
$$x^- = (x_1 - h_1,\ x_2,\ \ldots,\ x_{N_x}), \qquad x^+ = (x_1 + h_1,\ x_2,\ \ldots,\ x_{N_x})$$
$$J_{\bullet,1} = \frac{\text{fprop}(x^+) - \text{fprop}(x^-)}{2 h_1}$$

Backprop gives the Jacobian one row at a time, e.g. the first row (call fprop(x) first to refresh the cache):
$$J_{1,\bullet} = \text{bprop}\big((1,\ 0,\ \ldots,\ 0)\big)$$
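A sketch of such a check for the fprop/bprop module interface above (the function name and the expected tolerance are illustrative):

    import numpy as np

    def check_gradients(module, n_in, n_out):
        """Return the largest difference between the finite-difference and bprop Jacobians."""
        x = 0.1 * np.random.randn(1, n_in)                     # small random input
        h = np.sqrt(np.finfo(float).eps) * np.maximum(x, 1.0)  # h_i = sqrt(eps)*max(x_i, 1)

        # Finite differences: build the Jacobian column by column from fprop.
        J_fd = np.zeros((n_out, n_in))
        for j in range(n_in):
            x_plus, x_minus = x.copy(), x.copy()
            x_plus[0, j] += h[0, j]
            x_minus[0, j] -= h[0, j]
            J_fd[:, j] = (module.fprop(x_plus) - module.fprop(x_minus)).ravel() / (2 * h[0, j])

        # Backprop: build the Jacobian row by row from bprop.
        J_bp = np.zeros((n_out, n_in))
        for i in range(n_out):
            module.fprop(x)                                    # refresh the cached input/output
            e_i = np.zeros((1, n_out))
            e_i[0, i] = 1.0
            J_bp[i, :] = module.bprop(e_i).ravel()

        return np.max(np.abs(J_fd - J_bp))

    # E.g. check_gradients(Tanh(), 5, 5) should return something around 1e-8 or smaller.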
Backprop: Rule-of-thumb results on MNIST

– Linear(28*28, 10), SoftMax should give about 750 errors.
– Linear(28*28, 200), Tanh, Linear(200, 10), SoftMax should give about 250 errors.
– Typical learning rates: λ ∈ [0.01, 0.1].
– Typical batch sizes: N_B ∈ [100, 1000].
– Initialize the weights of a layer W ∈ R^{M × N} as W_{ij} ~ N(0, σ) with σ = √(2 / (M + N)), and b = 0.
– Don't forget data pre-processing: here, at least divide the values by 255 (the maximum pixel value).
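Putting these rules of thumb together (a sketch: Linear and Tanh as in the earlier sketches, SoftMax left to the exercise, and X_train/T_train assumed to be the loaded MNIST images and one-hot labels):

    X_train = X_train / 255.0                    # pre-processing: divide by the max pixel value

    net = [Linear(28 * 28, 200), Tanh(), Linear(200, 10), SoftMax()]
    lr, batch_size = 0.1, 100                    # lambda and N_B within the suggested ranges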
Merry Christmas and a happy New Year! Also, good luck with the exercise =)