Advanced Machine Learning - Exercise 3: Deep learning essentials

  1. Advanced Machine Learning - Exercise 3 Deep learning essentials

  2. Introduction - What's the plan?
     - Exercise overview
     - Deep learning in a nutshell
     - Backprop in (painful) detail

  3. Introduction - Exercise overview
     Goal: implement a simple DL framework.
     Tasks:
     - Compute derivatives (Jacobians)
     - Write code
     You'll need some help...

  4. Introduction - Deep learning in a nutshell
     Given:
     - Training data $X = \{x_i\}_{i=1..N}$ with $x_i \in I$, usually stored as $X \in \mathbb{R}^{N \times N_I}$
     - Training labels $T = \{t_i\}_{i=1..N}$ with $t_i \in O$.
     Choose:
     - A parametrized, (sub-)differentiable function $F(X, \theta) : I \times P \mapsto O$, with:
       typically, input space $I = \mathbb{R}^{N_I}$ (generic data), $I = \mathbb{R}^{3 \times H \times W}$ (images), ...
       typically, output space $O = \mathbb{R}^{N_O}$ (regression), $O = [0, 1]^{N_O}$ (probabilistic classification), ...
       typically, parameter space $P = \mathbb{R}^{N_P}$.
     - A (sub-)differentiable criterion/loss $L(T, F(X, \theta)) : O \times O \mapsto \mathbb{R}$
     Find:
       $\theta^* = \operatorname{argmin}_{\theta \in P} L(T, F(X, \theta))$
     Assumption:
       $L(T, F(X, \theta)) = \frac{1}{N} \sum_{i=1}^{N} \ell(t_i, F(x_i, \theta))$
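
To make the notation above concrete, here is a small NumPy sketch (illustrative only, not part of the exercise framework) with a linear $F$ and a squared-error per-sample loss $\ell$; all names and sizes are made up for the example.

    import numpy as np

    # Illustrative sizes: N samples with N_I input features and N_O output values.
    N, N_I, N_O = 100, 5, 1

    rng = np.random.default_rng(0)
    X = rng.normal(size=(N, N_I))        # training data, X in R^{N x N_I}
    T = rng.normal(size=(N, N_O))        # training labels, one t_i in O per sample

    theta = rng.normal(size=(N_I, N_O))  # parameters, theta in P

    def F(x, theta):
        """A very simple parametrized, differentiable function: F(x, theta) = x . theta."""
        return x @ theta

    def ell(t, y):
        """Per-sample squared-error criterion."""
        return 0.5 * np.sum((y - t) ** 2)

    # L(T, F(X, theta)) = (1/N) * sum_i ell(t_i, F(x_i, theta))
    L = np.mean([ell(T[i], F(X[i], theta)) for i in range(N)])
    print(L)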

  5. Backprop
     $D_\theta \, \frac{1}{N} \sum_{i=1}^{N} \ell(t_i, F(x_i, \theta))
        = \frac{1}{N} \sum_{i=1}^{N} D_\theta \, \ell(t_i, F(x_i, \theta))
        = \frac{1}{N} \sum_{i=1}^{N} D_F \, \ell(t_i, F(x_i, \theta)) \circ D_\theta F(x_i, \theta)$
     Assumption: $F$ is hierarchical:
       $F(x_i, \theta) = f_1(f_2(f_3(\ldots x_i \ldots, \theta_3), \theta_2), \theta_1)$
     Then:
       $D_{\theta_1} F(x_i, \theta) = D_{\theta_1} f_1(f_2, \theta_1)$
       $D_{\theta_2} F(x_i, \theta) = D_{f_2} f_1(f_2, \theta_1) \circ D_{\theta_2} f_2(f_3, \theta_2)$
       $D_{\theta_3} F(x_i, \theta) = D_{f_2} f_1(f_2, \theta_1) \circ D_{f_3} f_2(f_3, \theta_2) \circ D_{\theta_3} f_3(\ldots, \theta_3)$
     where $f_2 = f_2(f_3(\ldots x_i \ldots, \theta_3), \theta_2)$ etc.
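
The chained-Jacobian identity can be checked numerically. The following illustrative NumPy sketch composes two small made-up modules and verifies that the product of their Jacobians (here taken with respect to the input; the parameter case works the same way) matches a finite-difference estimate.

    import numpy as np

    rng = np.random.default_rng(1)

    # Two stacked modules: F(x) = f1(f2(x, theta2), theta1), with
    # f2(x, theta2) = tanh(theta2 @ x) and f1(z, theta1) = theta1 @ z.
    theta1 = rng.normal(size=(2, 3))
    theta2 = rng.normal(size=(3, 4))
    x = rng.normal(size=4)

    def f2(x):
        return np.tanh(theta2 @ x)

    def f1(z):
        return theta1 @ z

    def F(x):
        return f1(f2(x))

    # Module Jacobians at the current point.
    z2 = f2(x)
    D_f2_f1 = theta1                            # D_{f2} f1, shape (2, 3)
    D_x_f2 = np.diag(1.0 - z2 ** 2) @ theta2    # D_x  f2,   shape (3, 4)

    # Chain rule: the composed Jacobian is the product of the module Jacobians.
    J_chain = D_f2_f1 @ D_x_f2                  # D_x F, shape (2, 4)

    # Finite-difference check of the same Jacobian, one column per input entry.
    h = 1e-6
    J_fd = np.stack([(F(x + h * e) - F(x - h * e)) / (2.0 * h) for e in np.eye(4)],
                    axis=1)
    print(np.allclose(J_chain, J_fd, atol=1e-6))   # should print True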

  6. Backprop - Jacobians
     The loss:
       $D_F \, \ell(t_i, F(x_i, \theta)) = \begin{pmatrix} \partial_{F_1} \ell & \ldots & \partial_{F_{N_F}} \ell \end{pmatrix} \in \mathbb{R}^{1 \times N_F}$
     The functions (modules):
       $f(z, \theta) = \begin{pmatrix} f_1((z_1 \ldots z_{N_z}), \theta) \\ \vdots \\ f_{N_f}((z_1 \ldots z_{N_z}), \theta) \end{pmatrix}$
       $D_z f(z, \theta) = \begin{pmatrix} \partial_{z_1} f_1 & \ldots & \partial_{z_{N_z}} f_1 \\ \vdots & & \vdots \\ \partial_{z_1} f_{N_f} & \ldots & \partial_{z_{N_z}} f_{N_f} \end{pmatrix} \in \mathbb{R}^{N_f \times N_z}$
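
As an example of such a module Jacobian, here is a hedged NumPy sketch for a parameter-free softmax module, whose Jacobian is the full $N_f \times N_z$ matrix with entries $\partial_{z_j} f_i = f_i(\delta_{ij} - f_j)$; the function names are illustrative, not part of the exercise interface.

    import numpy as np

    def softmax(z):
        """Parameter-free module f(z): R^{N_z} -> R^{N_z}."""
        e = np.exp(z - np.max(z))       # shift for numerical stability
        return e / np.sum(e)

    def softmax_jacobian(z):
        """Full Jacobian D_z f(z): d f_i / d z_j = f_i * (delta_ij - f_j)."""
        f = softmax(z)
        return np.diag(f) - np.outer(f, f)

    z = np.array([0.5, -1.0, 2.0])
    J = softmax_jacobian(z)
    print(J.shape)                      # (3, 3); here N_f = N_z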

  7. Backprop - Modules
     Looking at module $f_2$ (its output feeds $f_1$, its input comes from $f_3$):
       $D_{\theta_3} F(x_i, \theta) = \underbrace{\underbrace{D_{f_2} f_1(f_2, \theta_1)}_{\text{grad output}} \circ \underbrace{D_{f_3} f_2(f_3, \theta_2)}_{\text{Jacobian wrt. input}}}_{\text{grad input}} \circ \; D_{\theta_3} f_3(\ldots, \theta_3)$
     Three (core) functions per module:
     - fprop: compute the output $f_i(z, \theta_i)$ given the input $z$ and the current parametrization $\theta_i$.
     - grad_input: compute grad_output $\cdot\, D_z f_i(z, \theta_i)$.
     - grad_param: compute $\nabla_{\theta_i} =$ grad_output $\cdot\, D_{\theta_i} f_i(z, \theta_i)$.
     Typically:
     - fprop caches its input and/or output for later reuse.
     - grad_input and grad_param are combined into a single bprop function to share computation.
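
One possible shape for such a module in NumPy, following the fprop/bprop split described above; the class and method names are suggestions only, not a prescribed interface for the exercise.

    import numpy as np

    class Tanh:
        """Parameter-free module: fprop caches what bprop needs later."""

        def fprop(self, z):
            self.out = np.tanh(z)       # cache the output for reuse in bprop
            return self.out

        def bprop(self, grad_output):
            # grad_input = grad_output . D_z f(z); for an elementwise function the
            # Jacobian is diagonal, so the product is an elementwise multiplication.
            return grad_output * (1.0 - self.out ** 2)

        def params(self):
            return []                   # no theta_i for this module

        def grads(self):
            return []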

  8. Backprop - (Mini-)Batching
     Remember:
       $\frac{1}{N} \sum_{i=1}^{N} D_F \, \ell(t_i, F(x_i, \theta)) \circ D_\theta F(\ldots)$,
       where $D_F \, \ell = \begin{pmatrix} \partial_{F_1} \ell & \ldots & \partial_{F_{N_F}} \ell \end{pmatrix} \in \mathbb{R}^{1 \times N_F}$.
     Reformulating this as a matrix-vector operation allows computing it in a single pass:
       $\begin{pmatrix} \frac{1}{N} & \ldots & \frac{1}{N} \end{pmatrix} \cdot \begin{pmatrix} \partial_{F_1} \ell(t_1, F(x_1, \theta)) & \ldots & \partial_{F_{N_F}} \ell(t_1, F(x_1, \theta)) \\ \vdots & & \vdots \\ \partial_{F_1} \ell(t_N, F(x_N, \theta)) & \ldots & \partial_{F_{N_F}} \ell(t_N, F(x_N, \theta)) \end{pmatrix}$,
       where the stacked per-sample gradients form a matrix in $\mathbb{R}^{N \times N_F}$.
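
A tiny NumPy sketch of this single-pass formulation, assuming (for illustration only) a squared-error criterion so that the per-sample loss gradients are easy to write down.

    import numpy as np

    rng = np.random.default_rng(2)
    N, N_F = 4, 3
    F_out = rng.normal(size=(N, N_F))   # F(x_i, theta) for each sample, stacked
    T = rng.normal(size=(N, N_F))       # targets t_i

    # Per-sample loss l(t_i, F_i) = 0.5 * ||F_i - t_i||^2, so D_F l = (F_i - t_i).
    per_sample_grads = F_out - T        # one 1 x N_F row per sample -> R^{N x N_F}

    # The averaging weights (1/N, ..., 1/N) contract the batch dimension in one pass.
    weights = np.full(N, 1.0 / N)
    D_F_L = weights @ per_sample_grads  # = (1/N) * sum_i D_F l(t_i, F_i)

    assert np.allclose(D_F_L, np.mean(per_sample_grads, axis=0))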

  9. Backprop - Usage/training
     net = [f1, f2, ...], l = criterion
     for Xb, Tb in batched(X, T):
         z = Xb
         for module in net:
             z = module.fprop(z)
         costs = l.fprop(z, Tb)
         ∂z = l.bprop([1/N_B ... 1/N_B])
         for module in reversed(net):
             ∂z = module.bprop(∂z)
         for module in net:
             θ, ∂θ = module.params(), module.grads()
             θ = θ - λ·∂θ

  10. Backprop - Example: Linear aka fully-connected module
      $f(z, W, b) = z \cdot W + b$
      where $z \in \mathbb{R}^{1 \times N_z}$, $W \in \mathbb{R}^{N_z \times N_f}$, and $b \in \mathbb{R}^{1 \times N_f}$.
      The gradients are:
      - $\mathbb{R}^{N_z \times N_f} \ni \text{grad\_W} = z^T \cdot \text{grad\_output}$
      - $\mathbb{R}^{1 \times N_f} \ni \text{grad\_b} = \text{grad\_output}$
      - $\mathbb{R}^{1 \times N_z} \ni \text{grad\_input} = \text{grad\_output} \cdot W^T$
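
A minimal per-sample (unbatched) NumPy sketch of such a Linear module, using the gradients above; the initialization follows the rule-of-thumb slide below, and the class layout is only one possible design, not the exercise's required one.

    import numpy as np

    class Linear:
        """Fully-connected module f(z, W, b) = z . W + b for a row vector z in R^{1 x N_z}."""

        def __init__(self, n_in, n_out, rng=None):
            if rng is None:
                rng = np.random.default_rng(0)
            self.W = rng.normal(scale=np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))
            self.b = np.zeros((1, n_out))

        def fprop(self, z):
            self.z = z                      # cache the input; grad_W needs it in bprop
            return z @ self.W + self.b

        def bprop(self, grad_output):
            # grad_param: grad_W = z^T . grad_output  and  grad_b = grad_output
            self.grad_W = self.z.T @ grad_output
            self.grad_b = grad_output.copy()
            # grad_input: grad_output . W^T, handed to the module below
            return grad_output @ self.W.T

        def params(self):
            return [self.W, self.b]

        def grads(self):
            return [self.grad_W, self.grad_b]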

  11. Backprop - Gradient checking
      Crucial debugging method! Compare the Jacobian computed by finite differences (using the fprop function) to the Jacobian computed by the bprop function.
      Advice: use a (small) random input $x$, and step sizes $h_i = \sqrt{\text{eps}} \cdot \max(x_i, 1)$.
      Finite differences give the first column of the Jacobian as:
        $x^- = (x_1 - h_1, x_2, \ldots, x_{N_x})$
        $x^+ = (x_1 + h_1, x_2, \ldots, x_{N_x})$
        $J_{\bullet,1} = \dfrac{\text{fprop}(x^+) - \text{fprop}(x^-)}{2 h_1}$
      Backprop gives the first row of the Jacobian as:
        fprop(x); $\; J_{1,\bullet} = \text{bprop}((1, 0, \ldots, 0))$
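
A possible NumPy implementation of this check, written against a generic fprop/bprop pair; the function signature and return value are choices of this sketch, not prescribed by the exercise.

    import numpy as np

    def check_jacobian(fprop, bprop, x):
        """Compare the finite-difference Jacobian of fprop with the one implied by bprop.

        bprop(v) is assumed to return v . D_x fprop(x) for a row vector v (the usual
        grad_output convention), with fprop having been called last on this exact x.
        """
        y = fprop(x)
        n_out, n_in = y.size, x.size
        h = np.sqrt(np.finfo(float).eps) * np.maximum(x, 1.0)   # h_i = sqrt(eps) * max(x_i, 1)

        # Finite differences: build the Jacobian one column at a time.
        J_fd = np.zeros((n_out, n_in))
        for j in range(n_in):
            x_plus, x_minus = x.copy(), x.copy()
            x_plus[j] += h[j]
            x_minus[j] -= h[j]
            J_fd[:, j] = (fprop(x_plus) - fprop(x_minus)) / (2.0 * h[j])

        # Backprop: build the Jacobian one row at a time from one-hot grad_outputs.
        fprop(x)                                # refresh any cached state before bprop
        J_bp = np.stack([bprop(e) for e in np.eye(n_out)])

        return np.max(np.abs(J_fd - J_bp))

For a stateful module m (e.g. the Tanh sketch above) this could be used as check_jacobian(m.fprop, m.bprop, x); when bprop is correct, the returned maximum absolute difference is typically many orders of magnitude below 1.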

  12. Backprop - Rule-of-thumb results on MNIST
      - Linear(28*28, 10), SoftMax should give ±750 errors.
      - Linear(28*28, 200), Tanh, Linear(200, 10), SoftMax should give ±250 errors.
      - Typical learning rates: $\lambda \in [0.01, 0.1]$.
      - Typical batch sizes: $N_B \in [100, 1000]$.
      - Initialize weights as $W \in \mathbb{R}^{M \times N}$ with $W \sim \mathcal{N}\!\left(0, \sigma = \sqrt{\tfrac{2}{M + N}}\right)$ and $b = 0$.
      - Don't forget data pre-processing; here, at least divide the values by 255 (the max pixel value).
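
A short NumPy sketch of the initialization and pre-processing advice above; the helper names are illustrative only.

    import numpy as np

    def init_linear(m, n, rng):
        """W in R^{M x N} drawn from N(0, sigma) with sigma = sqrt(2 / (M + N)), and b = 0."""
        W = rng.normal(scale=np.sqrt(2.0 / (m + n)), size=(m, n))
        b = np.zeros((1, n))
        return W, b

    def preprocess(images):
        """Scale raw MNIST pixel values from [0, 255] down to [0, 1]."""
        return images.astype(np.float64) / 255.0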

  13. Merry Christmas and a happy New Year! Also, good luck with the exercise =)
