Newton Methods for Neural Networks: Gauss-Newton Matrix-Vector Product

Chih-Jen Lin
National Taiwan University

Last updated: June 1, 2020
Outline

1. Backward setting
   - Jacobian evaluation
   - Gauss-Newton matrix-vector products
2. Forward + backward settings
   - R operator
   - Gauss-Newton matrix-vector product
Backward setting
Jacobian evaluation
Jacobian Evaluation: Convolutional Layer I

For an instance $i$, the Jacobian can be partitioned into $L$ blocks according to layers:
$$
J^i = \begin{bmatrix} J^{1,i} & J^{2,i} & \cdots & J^{L,i} \end{bmatrix},
\quad m = 1, \ldots, L, \qquad (1)
$$
where
$$
J^{m,i} = \begin{bmatrix}
\dfrac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T} &
\dfrac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{b}^m)^T}
\end{bmatrix}.
$$
The calculation seems to be very similar to that for the gradient.
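A quick way to check such a layer-wise partition numerically is a finite-difference Jacobian. The sketch below is only an illustration: the helper `numerical_jacobian`, the toy function, and the step size `eps` are our own assumptions, not part of the slides.

```python
import numpy as np

def numerical_jacobian(f, theta, eps=1e-6):
    """Finite-difference approximation of the Jacobian of f at theta."""
    z0 = f(theta)
    J = np.zeros((z0.size, theta.size))
    for k in range(theta.size):
        t = theta.copy()
        t[k] += eps
        J[:, k] = (f(t) - z0) / eps
    return J

# Toy network output z^{L+1,i} as a function of all parameters theta;
# if theta stacks the layer parameters in order, the column blocks of J
# line up with J^{1,i}, ..., J^{L,i} in (1).
f = lambda theta: np.tanh(theta[:4] * 2.0) + theta[4:8]
J = numerical_jacobian(f, np.random.randn(8))
print(J.shape)   # (4, 8): n_{L+1} rows, one column per parameter
```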
Jacobian Evaluation: Convolutional Layer II

For the convolutional layers, recall that for the gradient we have
$$
\frac{\partial f}{\partial W^m} = \frac{1}{C} W^m + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial W^m}
$$
and
$$
\frac{\partial \xi_i}{\partial \operatorname{vec}(W^m)^T}
= \operatorname{vec}\left(\frac{\partial \xi_i}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T.
$$
Jacobian Evaluation: Convolutional Layer III

Now we have
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
= \begin{bmatrix}
\frac{\partial z_1^{L+1,i}}{\partial \operatorname{vec}(W^m)^T} \\
\vdots \\
\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
\end{bmatrix}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T \\
\vdots \\
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T
\end{bmatrix}.
$$
Jacobian Evaluation: Convolutional Layer IV

If $\boldsymbol{b}^m$ is considered, the result is
$$
\begin{bmatrix}
\dfrac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T} &
\dfrac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{b}^m)^T}
\end{bmatrix}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}
\begin{bmatrix} \phi(\operatorname{pad}(Z^{m,i}))^T & \boldsymbol{1}_{a^m_{\text{conv}} b^m_{\text{conv}}} \end{bmatrix}\right)^T \\
\vdots \\
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}
\begin{bmatrix} \phi(\operatorname{pad}(Z^{m,i}))^T & \boldsymbol{1}_{a^m_{\text{conv}} b^m_{\text{conv}}} \end{bmatrix}\right)^T
\end{bmatrix}.
$$
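As a concrete NumPy illustration of stacking these rows, with the all-ones column appended so that the same product also yields the bias part. All sizes and the row-major `vec` convention below are assumptions for the sketch.

```python
import numpy as np

d_out, p, q, nL1 = 4, 9, 12, 3           # d^{m+1}, a_conv*b_conv, h*h*d^m, n_{L+1}
D = np.random.randn(nL1, d_out, p)       # D[j] stands for dz_j^{L+1,i}/dS^{m,i}
Phi = np.random.randn(q, p)              # stands for phi(pad(Z^{m,i}))

# Append the all-ones column so the same product also covers b^m.
Phi_aug = np.hstack([Phi.T, np.ones((p, 1))])   # [phi(pad(Z))^T  1]

# Row j of the Jacobian block is vec(D[j] @ Phi_aug)^T (vec taken row-major).
J_block = np.stack([(D[j] @ Phi_aug).ravel() for j in range(nL1)])
print(J_block.shape)                     # (nL1, d_out * (q + 1))
```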
Jacobian Evaluation: Convolutional Layer V

We can see that this is more complicated than the gradient: the gradient is a vector, while the Jacobian is a matrix with $n_{L+1}$ rows.
Jacobian Evaluation: Backward Process I

For the gradient, earlier we needed a backward process to calculate
$$
\frac{\partial \xi_i}{\partial S^{m,i}}.
$$
Now what we need are
$$
\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}, \ldots, \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}.
$$
The process is similar.
Jacobian Evaluation: Backward Process II

With the ReLU activation function and max pooling, for the gradient we had
$$
\frac{\partial \xi_i}{\partial \operatorname{vec}(S^{m,i})^T}
= \left(\frac{\partial \xi_i}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \operatorname{vec}(I[Z^{m+1,i}])^T\right) P^{m,i}_{\text{pool}}.
$$
Jacobian Evaluation: Backward Process III

Assume that
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m+1,i})^T}
$$
is available. Then
$$
\frac{\partial z_j^{L+1,i}}{\partial \operatorname{vec}(S^{m,i})^T}
= \left(\frac{\partial z_j^{L+1,i}}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \operatorname{vec}(I[Z^{m+1,i}])^T\right) P^{m,i}_{\text{pool}},
\quad j = 1, \ldots, n_{L+1}.
$$
Jacobian Evaluation: Backward Process IV

These row vectors can be written together as a matrix:
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(S^{m,i})^T}
= \left(\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \left(\boldsymbol{1}_{n_{L+1}} \operatorname{vec}(I[Z^{m+1,i}])^T\right)\right) P^{m,i}_{\text{pool}}.
$$
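In code, the product $\boldsymbol{1}_{n_{L+1}} \operatorname{vec}(I[Z^{m+1,i}])^T$ is just broadcasting of the mask over rows. A minimal sketch, where the random data and the dense 0/1 stand-in for $P^{m,i}_{\text{pool}}$ are assumptions:

```python
import numpy as np

nL1, nz, ns = 3, 8, 16                   # n_{L+1}, |vec(Z^{m+1,i})|, |vec(S^{m,i})|
dZ = np.random.randn(nL1, nz)            # dz^{L+1,i}/dvec(Z^{m+1,i})^T, one row per output
Z_next = np.random.randn(nz)             # stands for vec(Z^{m+1,i})

# Dense 0/1 stand-in for the pooling selection matrix P_pool (one 1 per row).
P_pool = np.zeros((nz, ns))
P_pool[np.arange(nz), np.random.choice(ns, nz, replace=False)] = 1.0

mask = (Z_next > 0).astype(dZ.dtype)     # vec(I[Z^{m+1,i}]) for ReLU
dS = (dZ * mask) @ P_pool                # all nL1 rows handled at once
print(dS.shape)                          # (nL1, ns)
```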
Jacobian Evaluation: Backward Process V

For the gradient, we use $\partial \xi_i / \partial S^{m,i}$ to obtain
$$
\frac{\partial \xi_i}{\partial \operatorname{vec}(Z^{m,i})^T}
= \operatorname{vec}\left((W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}}
$$
and pass it to the previous layer.
Jacobian Evaluation: Backward Process VI

Now we need to generate
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m,i})^T}
$$
and pass it to the previous layer. We have
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m,i})^T}
= \begin{bmatrix}
\operatorname{vec}\left((W^m)^T \frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}} \\
\vdots \\
\operatorname{vec}\left((W^m)^T \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}}
\end{bmatrix}.
$$
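Row-wise, this is the same $(W^m)^T(\cdot)$ product followed by the two reindexing operators. A sketch with dense 0/1 stand-ins for $P^m_\phi$ and $P^m_{\text{pad}}$; all sizes are illustrative assumptions:

```python
import numpy as np

d_out, p, q, nL1 = 4, 9, 12, 3
W = np.random.randn(d_out, q)                 # W^m
D = np.random.randn(nL1, d_out, p)            # D[j] stands for dz_j^{L+1,i}/dS^{m,i}
n_pad = 20                                    # |vec(pad(Z^{m,i}))|, illustrative
P_phi = np.random.rand(q * p, n_pad) < 0.1    # dense 0/1 stand-ins for the
P_pad = np.random.rand(n_pad, 15) < 0.3       # reindexing operators

# Row j of dz^{L+1,i}/dvec(Z^{m,i})^T is vec((W^m)^T D[j])^T P_phi P_pad.
dZ_prev = np.stack([(W.T @ D[j]).ravel() @ P_phi @ P_pad for j in range(nL1)])
print(dZ_prev.shape)                          # (nL1, 15)
```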
Jacobian Evaluation: Fully-connected Layer I

We do not discuss details, but list all results below:
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial \boldsymbol{s}^{m,i}} (\boldsymbol{z}^{m,i})^T\right) &
\cdots &
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \boldsymbol{s}^{m,i}} (\boldsymbol{z}^{m,i})^T\right)
\end{bmatrix}^T.
$$
Jacobian Evaluation: Fully-connected Layer II

$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{b}^m)^T}
= \frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{s}^{m,i})^T},
$$
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{s}^{m,i})^T}
= \frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{z}^{m+1,i})^T}
\odot \left(\boldsymbol{1}_{n_{L+1}} I[\boldsymbol{z}^{m+1,i}]^T\right),
$$
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{z}^{m,i})^T}
= \frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{s}^{m,i})^T} W^m.
$$
Jacobian Evaluation: Fully-connected Layer III

For layer $L+1$, if using the squared loss and the linear activation function, we have
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{s}^{L,i})^T} = I_{n_{L+1}},
$$
the identity matrix, which initializes the backward process.
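Putting the fully-connected formulas together, the backward recursion can be sketched as below, starting from the identity at the top layer. The toy sizes, random weights, and placeholder forward values are assumptions; a real implementation would reuse the values stored during the forward pass.

```python
import numpy as np

sizes = [6, 5, 4, 3]                         # n_m for the FC layers; n_{L+1} = 3
L = len(sizes) - 1
W = [np.random.randn(sizes[m + 1], sizes[m]) for m in range(L)]
z = [np.abs(np.random.randn(n)) for n in sizes]   # placeholder forward values z^{m,i}

nL1 = sizes[-1]
dS = np.eye(nL1)                             # dz^{L+1,i}/d(s^{L,i})^T = I_{n_{L+1}}
J_W, J_b = [None] * L, [None] * L
for m in range(L - 1, -1, -1):
    # Rows of the W^m block: vec(dz_j/ds^{m,i} (z^{m,i})^T)^T.
    J_W[m] = np.stack([np.outer(dS[j], z[m]).ravel() for j in range(nL1)])
    J_b[m] = dS.copy()                       # the b^m block equals dz/d(s^{m,i})^T
    dZ = dS @ W[m]                           # dz^{L+1,i}/d(z^{m,i})^T
    if m > 0:
        dS = dZ * (z[m] > 0)                 # apply I[z^{m,i}] (ReLU indicator)
```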
Gradient versus Jacobian I

Operations for the gradient:
$$
\frac{\partial \xi_i}{\partial \operatorname{vec}(S^{m,i})^T}
= \left(\frac{\partial \xi_i}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \operatorname{vec}(I[Z^{m+1,i}])^T\right) P^{m,i}_{\text{pool}},
$$
$$
\frac{\partial \xi_i}{\partial W^m}
= \frac{\partial \xi_i}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T,
$$
$$
\frac{\partial \xi_i}{\partial \operatorname{vec}(Z^{m,i})^T}
= \operatorname{vec}\left((W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}}.
$$
Gradient versus Jacobian II

For the Jacobian we have
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(S^{m,i})^T}
= \left(\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \left(\boldsymbol{1}_{n_{L+1}} \operatorname{vec}(I[Z^{m+1,i}])^T\right)\right) P^{m,i}_{\text{pool}},
$$
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T \\
\vdots \\
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T
\end{bmatrix}.
$$
Gradient versus Jacobian III

$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m,i})^T}
= \begin{bmatrix}
\operatorname{vec}\left((W^m)^T \frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}} \\
\vdots \\
\operatorname{vec}\left((W^m)^T \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}}
\end{bmatrix}.
$$
Implementation I

For the gradient we did
$$
\Delta \leftarrow \operatorname{mat}\left(\operatorname{vec}(\Delta)^T P^{m,i}_{\text{pool}}\right),
$$
$$
\frac{\partial \xi_i}{\partial W^m} = \Delta \cdot \phi(\operatorname{pad}(Z^{m,i}))^T,
$$
$$
\Delta \leftarrow \operatorname{vec}\left((W^m)^T \Delta\right)^T P^m_\phi P^m_{\text{pad}},
$$
$$
\Delta \leftarrow \Delta \odot I[Z^{m,i}].
$$
For the Jacobian we have similar settings, but there are some differences.
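The four gradient steps translate almost line by line into code. A sketch, where the shapes, the dense 0/1 stand-in `P_phi` (representing the combined $P^m_\phi P^m_{\text{pad}}$), and the random inputs are assumptions:

```python
import numpy as np

d_out, p, q = 4, 9, 12
nz, ns = 8, d_out * p                    # |vec(Z^{m+1,i})|, |vec(S^{m,i})|
Delta = np.random.randn(nz)              # arrives as the (masked) upstream quantity
P_pool = np.zeros((nz, ns))
P_pool[np.arange(nz), np.random.choice(ns, nz, replace=False)] = 1.0
W = np.random.randn(d_out, q)
Phi = np.random.randn(q, p)              # phi(pad(Z^{m,i}))
P_phi = np.random.rand(q * p, 15) < 0.2  # stand-in for P_phi @ P_pad combined
Z_prev = np.random.randn(15)             # stands for vec(Z^{m,i})

Delta = (Delta @ P_pool).reshape(d_out, p)   # Delta <- mat(vec(Delta)^T P_pool)
grad_W = Delta @ Phi.T                       # dxi_i/dW^m
Delta = (W.T @ Delta).ravel() @ P_phi        # Delta <- vec((W^m)^T Delta)^T P_phi P_pad
Delta = Delta * (Z_prev > 0)                 # Delta <- Delta ⊙ I[Z^{m,i}]
```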
Implementation II

We don't really store the Jacobian
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T \\
\vdots \\
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T
\end{bmatrix}.
$$
Recall that the Jacobian is used for matrix-vector products:
$$
G^S \boldsymbol{v} = \frac{1}{C} \boldsymbol{v}
+ \frac{1}{|S|} \sum_{i \in S} (J^i)^T \left(B^i (J^i \boldsymbol{v})\right). \qquad (2)
$$
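With explicit per-instance Jacobians, (2) is just three matrix-vector products per instance. A minimal sketch; the sizes, the random $J^i$, and taking $B^i = I$ (as for the squared loss, up to scaling) are assumptions:

```python
import numpy as np

def gauss_newton_matvec(v, J_list, B_list, C):
    """Compute G^S v = v/C + (1/|S|) sum_i (J^i)^T (B^i (J^i v)), cf. (2)."""
    Gv = v / C
    for Ji, Bi in zip(J_list, B_list):
        Gv = Gv + Ji.T @ (Bi @ (Ji @ v)) / len(J_list)
    return Gv

n, nL1, S = 10, 3, 5                     # #parameters, #outputs, |S|
J_list = [np.random.randn(nL1, n) for _ in range(S)]
B_list = [np.eye(nL1)] * S               # B^i = I, e.g. squared loss (up to scaling)
v = np.random.randn(n)
print(gauss_newton_matvec(v, J_list, B_list, C=1.0).shape)   # (n,)
```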
Implementation III

The form
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T \\
\vdots \\
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T
\end{bmatrix}
$$
is like the product of two things.
Implementation IV

If we have
$$
\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}, \ldots, \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}
\quad \text{and} \quad \phi(\operatorname{pad}(Z^{m,i})),
$$
we can probably do the matrix-vector product without multiplying these two factors out. We will talk about this again later. Thus our Jacobian evaluation focuses solely on obtaining
$$
\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}, \ldots, \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}.
$$
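The reason this works: row $j$ of the block is $\operatorname{vec}(D_j \Phi^T)^T$ with $D_j = \partial z_j^{L+1,i}/\partial S^{m,i}$ and $\Phi = \phi(\operatorname{pad}(Z^{m,i}))$, so for $V = \operatorname{mat}(\boldsymbol{v})$ we get $(J^{m,i}\boldsymbol{v})_j = \langle D_j, V\Phi\rangle_F$; computing $V\Phi$ once replaces forming the whole block. A sketch verifying the identity (the sizes are assumptions, and `vec` is taken row-major for consistency):

```python
import numpy as np

d_out, p, q, nL1 = 4, 9, 12, 3
D = np.random.randn(nL1, d_out, p)       # D[j] = dz_j^{L+1,i}/dS^{m,i}
Phi = np.random.randn(q, p)              # phi(pad(Z^{m,i}))
v = np.random.randn(d_out * q)           # the slice of v for this layer's W block
V = v.reshape(d_out, q)                  # mat(v), shaped like W^m

VPhi = V @ Phi                           # computed once: d_out x p
Jv = np.einsum('jdp,dp->j', D, VPhi)     # (J^{m,i} v)_j = <D[j], V Phi>_F

# Same result as forming the Jacobian block explicitly:
J_block = np.stack([(D[j] @ Phi.T).ravel() for j in range(nL1)])
assert np.allclose(J_block @ v, Jv)
```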
Implementation V

Further, we need to take all data (or the data in the selected subset) into account. In the end we have the following procedure. In the beginning,
$$
\Delta \in \mathbb{R}^{d^{m+1} a^{m+1} b^{m+1} \times n_{L+1} \times l}.
$$
This corresponds to
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \left(\boldsymbol{1}_{n_{L+1}} \operatorname{vec}(I[Z^{m+1,i}])^T\right),
\quad \forall i = 1, \ldots, l.
$$
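Over all $l$ instances, this initialization is a single masked 3-D array. A minimal sketch; the sizes are assumptions:

```python
import numpy as np

n_z, nL1, l = 8, 3, 5                    # d^{m+1}a^{m+1}b^{m+1}, n_{L+1}, #instances
dZ = np.random.randn(n_z, nL1, l)        # dz^{L+1,i}/dvec(Z^{m+1,i}) for all i
Z_next = np.random.randn(n_z, l)         # vec(Z^{m+1,i}) for all i

# Delta[:, j, i] = dz_j^{L+1,i}/dvec(Z^{m+1,i}) ⊙ vec(I[Z^{m+1,i}])
Delta = dZ * (Z_next[:, None, :] > 0)    # broadcast the ReLU mask over n_{L+1}
print(Delta.shape)                       # (n_z, nL1, l)
```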