Newton Methods for Neural Networks: Gauss-Newton Matrix-Vector Product

Chih-Jen Lin
National Taiwan University

Last updated: June 1, 2020
Outline

1. Backward setting
   - Jacobian evaluation
   - Gauss-Newton matrix-vector products
2. Forward + backward settings
   - R operator
   - Gauss-Newton matrix-vector product
Backward setting
Jacobian evaluation
Jacobian Evaluation: Convolutional Layer I

For an instance $i$, the Jacobian can be partitioned into $L$ blocks according to layers:
$$
J^i = \begin{bmatrix} J^{1,i} & J^{2,i} & \cdots & J^{L,i} \end{bmatrix},
\quad m = 1, \ldots, L, \qquad (1)
$$
where
$$
J^{m,i} = \begin{bmatrix}
\dfrac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T} &
\dfrac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{b}^m)^T}
\end{bmatrix}.
$$
The calculation seems to be very similar to that for the gradient.
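A quick way to check such a layer-wise partition numerically is a finite-difference Jacobian. The sketch below is only an illustration: the helper `numerical_jacobian`, the toy function, and the step size `eps` are our own assumptions, not part of the slides.

```python
import numpy as np

def numerical_jacobian(f, theta, eps=1e-6):
    """Finite-difference approximation of the Jacobian of f at theta."""
    z0 = f(theta)
    J = np.zeros((z0.size, theta.size))
    for k in range(theta.size):
        t = theta.copy()
        t[k] += eps
        J[:, k] = (f(t) - z0) / eps
    return J

# Toy network output z^{L+1,i} as a function of all parameters theta;
# if theta stacks the layer parameters in order, the column blocks of J
# line up with J^{1,i}, ..., J^{L,i} in (1).
f = lambda theta: np.tanh(theta[:4] * 2.0) + theta[4:8]
J = numerical_jacobian(f, np.random.randn(8))
print(J.shape)   # (4, 8): n_{L+1} rows, one column per parameter
```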
Jacobian Evaluation: Convolutional Layer II

For the convolutional layers, recall that for the gradient we have
$$
\frac{\partial f}{\partial W^m} = \frac{1}{C} W^m + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial W^m}
$$
and
$$
\frac{\partial \xi_i}{\partial \operatorname{vec}(W^m)^T}
= \operatorname{vec}\left(\frac{\partial \xi_i}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T.
$$
Jacobian Evaluation: Convolutional Layer III

Now we have
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
= \begin{bmatrix}
\frac{\partial z_1^{L+1,i}}{\partial \operatorname{vec}(W^m)^T} \\
\vdots \\
\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
\end{bmatrix}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T \\
\vdots \\
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T
\end{bmatrix}.
$$
Jacobian Evaluation: Convolutional Layer IV

If $\boldsymbol{b}^m$ is considered, the result is
$$
\begin{bmatrix}
\dfrac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T} &
\dfrac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{b}^m)^T}
\end{bmatrix}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}
\begin{bmatrix} \phi(\operatorname{pad}(Z^{m,i}))^T & \boldsymbol{1}_{a^m_{\text{conv}} b^m_{\text{conv}}} \end{bmatrix}\right)^T \\
\vdots \\
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}
\begin{bmatrix} \phi(\operatorname{pad}(Z^{m,i}))^T & \boldsymbol{1}_{a^m_{\text{conv}} b^m_{\text{conv}}} \end{bmatrix}\right)^T
\end{bmatrix}.
$$
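As a concrete NumPy illustration of stacking these rows, with the all-ones column appended so that the same product also yields the bias part. All sizes and the row-major `vec` convention below are assumptions for the sketch.

```python
import numpy as np

d_out, p, q, nL1 = 4, 9, 12, 3           # d^{m+1}, a_conv*b_conv, h*h*d^m, n_{L+1}
D = np.random.randn(nL1, d_out, p)       # D[j] stands for dz_j^{L+1,i}/dS^{m,i}
Phi = np.random.randn(q, p)              # stands for phi(pad(Z^{m,i}))

# Append the all-ones column so the same product also covers b^m.
Phi_aug = np.hstack([Phi.T, np.ones((p, 1))])   # [phi(pad(Z))^T  1]

# Row j of the Jacobian block is vec(D[j] @ Phi_aug)^T (vec taken row-major).
J_block = np.stack([(D[j] @ Phi_aug).ravel() for j in range(nL1)])
print(J_block.shape)                     # (nL1, d_out * (q + 1))
```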
Jacobian Evaluation: Convolutional Layer V

We can see that this is more complicated than the gradient: the gradient is a vector, while the Jacobian is a matrix with $n_{L+1}$ rows.
Jacobian Evaluation: Backward Process I

For the gradient, earlier we needed a backward process to calculate
$$
\frac{\partial \xi_i}{\partial S^{m,i}}.
$$
Now what we need are
$$
\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}, \ldots, \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}.
$$
The process is similar.
Jacobian Evaluation: Backward Process II

With the ReLU activation function and max pooling, for the gradient we had
$$
\frac{\partial \xi_i}{\partial \operatorname{vec}(S^{m,i})^T}
= \left(\frac{\partial \xi_i}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \operatorname{vec}(I[Z^{m+1,i}])^T\right) P^{m,i}_{\text{pool}}.
$$
Jacobian Evaluation: Backward Process III

Assume that
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m+1,i})^T}
$$
is available. Then
$$
\frac{\partial z_j^{L+1,i}}{\partial \operatorname{vec}(S^{m,i})^T}
= \left(\frac{\partial z_j^{L+1,i}}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \operatorname{vec}(I[Z^{m+1,i}])^T\right) P^{m,i}_{\text{pool}},
\quad j = 1, \ldots, n_{L+1}.
$$
Jacobian Evaluation: Backward Process IV

These row vectors can be written together as a matrix:
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(S^{m,i})^T}
= \left(\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \left(\boldsymbol{1}_{n_{L+1}} \operatorname{vec}(I[Z^{m+1,i}])^T\right)\right) P^{m,i}_{\text{pool}}.
$$
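In code, the product $\boldsymbol{1}_{n_{L+1}} \operatorname{vec}(I[Z^{m+1,i}])^T$ is just broadcasting of the mask over rows. A minimal sketch, where the random data and the dense 0/1 stand-in for $P^{m,i}_{\text{pool}}$ are assumptions:

```python
import numpy as np

nL1, nz, ns = 3, 8, 16                   # n_{L+1}, |vec(Z^{m+1,i})|, |vec(S^{m,i})|
dZ = np.random.randn(nL1, nz)            # dz^{L+1,i}/dvec(Z^{m+1,i})^T, one row per output
Z_next = np.random.randn(nz)             # stands for vec(Z^{m+1,i})

# Dense 0/1 stand-in for the pooling selection matrix P_pool (one 1 per row).
P_pool = np.zeros((nz, ns))
P_pool[np.arange(nz), np.random.choice(ns, nz, replace=False)] = 1.0

mask = (Z_next > 0).astype(dZ.dtype)     # vec(I[Z^{m+1,i}]) for ReLU
dS = (dZ * mask) @ P_pool                # all nL1 rows handled at once
print(dS.shape)                          # (nL1, ns)
```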
Jacobian Evaluation: Backward Process V

For the gradient, we use $\partial \xi_i / \partial S^{m,i}$ to obtain
$$
\frac{\partial \xi_i}{\partial \operatorname{vec}(Z^{m,i})^T}
= \operatorname{vec}\left((W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}}
$$
and pass it to the previous layer.
Jacobian Evaluation: Backward Process VI

Now we need to generate
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m,i})^T}
$$
and pass it to the previous layer. We have
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m,i})^T}
= \begin{bmatrix}
\operatorname{vec}\left((W^m)^T \frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}} \\
\vdots \\
\operatorname{vec}\left((W^m)^T \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}}
\end{bmatrix}.
$$
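Row-wise, this is the same $(W^m)^T(\cdot)$ product followed by the two reindexing operators. A sketch with dense 0/1 stand-ins for $P^m_\phi$ and $P^m_{\text{pad}}$; all sizes are illustrative assumptions:

```python
import numpy as np

d_out, p, q, nL1 = 4, 9, 12, 3
W = np.random.randn(d_out, q)                 # W^m
D = np.random.randn(nL1, d_out, p)            # D[j] stands for dz_j^{L+1,i}/dS^{m,i}
n_pad = 20                                    # |vec(pad(Z^{m,i}))|, illustrative
P_phi = np.random.rand(q * p, n_pad) < 0.1    # dense 0/1 stand-ins for the
P_pad = np.random.rand(n_pad, 15) < 0.3       # reindexing operators

# Row j of dz^{L+1,i}/dvec(Z^{m,i})^T is vec((W^m)^T D[j])^T P_phi P_pad.
dZ_prev = np.stack([(W.T @ D[j]).ravel() @ P_phi @ P_pad for j in range(nL1)])
print(dZ_prev.shape)                          # (nL1, 15)
```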
Jacobian Evaluation: Fully-connected Layer I

We do not discuss details, but list all results below:
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial \boldsymbol{s}^{m,i}} (\boldsymbol{z}^{m,i})^T\right) &
\cdots &
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \boldsymbol{s}^{m,i}} (\boldsymbol{z}^{m,i})^T\right)
\end{bmatrix}^T.
$$
Jacobian Evaluation: Fully-connected Layer II

$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{b}^m)^T}
= \frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{s}^{m,i})^T},
$$
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{s}^{m,i})^T}
= \frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{z}^{m+1,i})^T}
\odot \left(\boldsymbol{1}_{n_{L+1}} I[\boldsymbol{z}^{m+1,i}]^T\right),
$$
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{z}^{m,i})^T}
= \frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{s}^{m,i})^T} W^m.
$$
Jacobian Evaluation: Fully-connected Layer III

For layer $L+1$, if using the squared loss and the linear activation function, we have
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial (\boldsymbol{s}^{L,i})^T} = I_{n_{L+1}},
$$
the identity matrix, which initializes the backward process.
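Putting the fully-connected formulas together, the backward recursion can be sketched as below, starting from the identity at the top layer. The toy sizes, random weights, and placeholder forward values are assumptions; a real implementation would reuse the values stored during the forward pass.

```python
import numpy as np

sizes = [6, 5, 4, 3]                         # n_m for the FC layers; n_{L+1} = 3
L = len(sizes) - 1
W = [np.random.randn(sizes[m + 1], sizes[m]) for m in range(L)]
z = [np.abs(np.random.randn(n)) for n in sizes]   # placeholder forward values z^{m,i}

nL1 = sizes[-1]
dS = np.eye(nL1)                             # dz^{L+1,i}/d(s^{L,i})^T = I_{n_{L+1}}
J_W, J_b = [None] * L, [None] * L
for m in range(L - 1, -1, -1):
    # Rows of the W^m block: vec(dz_j/ds^{m,i} (z^{m,i})^T)^T.
    J_W[m] = np.stack([np.outer(dS[j], z[m]).ravel() for j in range(nL1)])
    J_b[m] = dS.copy()                       # the b^m block equals dz/d(s^{m,i})^T
    dZ = dS @ W[m]                           # dz^{L+1,i}/d(z^{m,i})^T
    if m > 0:
        dS = dZ * (z[m] > 0)                 # apply I[z^{m,i}] (ReLU indicator)
```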
Gradient versus Jacobian I

Operations for the gradient:
$$
\frac{\partial \xi_i}{\partial \operatorname{vec}(S^{m,i})^T}
= \left(\frac{\partial \xi_i}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \operatorname{vec}(I[Z^{m+1,i}])^T\right) P^{m,i}_{\text{pool}},
$$
$$
\frac{\partial \xi_i}{\partial W^m}
= \frac{\partial \xi_i}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T,
$$
$$
\frac{\partial \xi_i}{\partial \operatorname{vec}(Z^{m,i})^T}
= \operatorname{vec}\left((W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}}.
$$
Gradient versus Jacobian II

For the Jacobian we have
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(S^{m,i})^T}
= \left(\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \left(\boldsymbol{1}_{n_{L+1}} \operatorname{vec}(I[Z^{m+1,i}])^T\right)\right) P^{m,i}_{\text{pool}},
$$
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T \\
\vdots \\
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T
\end{bmatrix}.
$$
Gradient versus Jacobian III

$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m,i})^T}
= \begin{bmatrix}
\operatorname{vec}\left((W^m)^T \frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}} \\
\vdots \\
\operatorname{vec}\left((W^m)^T \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\right)^T P^m_\phi P^m_{\text{pad}}
\end{bmatrix}.
$$
Implementation I

For the gradient we did
$$
\Delta \leftarrow \operatorname{mat}\left(\operatorname{vec}(\Delta)^T P^{m,i}_{\text{pool}}\right),
$$
$$
\frac{\partial \xi_i}{\partial W^m} = \Delta \cdot \phi(\operatorname{pad}(Z^{m,i}))^T,
$$
$$
\Delta \leftarrow \operatorname{vec}\left((W^m)^T \Delta\right)^T P^m_\phi P^m_{\text{pad}},
$$
$$
\Delta \leftarrow \Delta \odot I[Z^{m,i}].
$$
For the Jacobian we have similar settings, but there are some differences.
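The four gradient steps translate almost line by line into code. A sketch, where the shapes, the dense 0/1 stand-in `P_phi` (representing the combined $P^m_\phi P^m_{\text{pad}}$), and the random inputs are assumptions:

```python
import numpy as np

d_out, p, q = 4, 9, 12
nz, ns = 8, d_out * p                    # |vec(Z^{m+1,i})|, |vec(S^{m,i})|
Delta = np.random.randn(nz)              # arrives as the (masked) upstream quantity
P_pool = np.zeros((nz, ns))
P_pool[np.arange(nz), np.random.choice(ns, nz, replace=False)] = 1.0
W = np.random.randn(d_out, q)
Phi = np.random.randn(q, p)              # phi(pad(Z^{m,i}))
P_phi = np.random.rand(q * p, 15) < 0.2  # stand-in for P_phi @ P_pad combined
Z_prev = np.random.randn(15)             # stands for vec(Z^{m,i})

Delta = (Delta @ P_pool).reshape(d_out, p)   # Delta <- mat(vec(Delta)^T P_pool)
grad_W = Delta @ Phi.T                       # dxi_i/dW^m
Delta = (W.T @ Delta).ravel() @ P_phi        # Delta <- vec((W^m)^T Delta)^T P_phi P_pad
Delta = Delta * (Z_prev > 0)                 # Delta <- Delta ⊙ I[Z^{m,i}]
```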
Implementation II

We don't really store the Jacobian
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T \\
\vdots \\
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T
\end{bmatrix}.
$$
Recall that the Jacobian is used for matrix-vector products:
$$
G^S \boldsymbol{v} = \frac{1}{C} \boldsymbol{v}
+ \frac{1}{|S|} \sum_{i \in S} (J^i)^T \left(B^i (J^i \boldsymbol{v})\right). \qquad (2)
$$
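With explicit per-instance Jacobians, (2) is just three matrix-vector products per instance. A minimal sketch; the sizes, the random $J^i$, and taking $B^i = I$ (as for the squared loss, up to scaling) are assumptions:

```python
import numpy as np

def gauss_newton_matvec(v, J_list, B_list, C):
    """Compute G^S v = v/C + (1/|S|) sum_i (J^i)^T (B^i (J^i v)), cf. (2)."""
    Gv = v / C
    for Ji, Bi in zip(J_list, B_list):
        Gv = Gv + Ji.T @ (Bi @ (Ji @ v)) / len(J_list)
    return Gv

n, nL1, S = 10, 3, 5                     # #parameters, #outputs, |S|
J_list = [np.random.randn(nL1, n) for _ in range(S)]
B_list = [np.eye(nL1)] * S               # B^i = I, e.g. squared loss (up to scaling)
v = np.random.randn(n)
print(gauss_newton_matvec(v, J_list, B_list, C=1.0).shape)   # (n,)
```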
Implementation III

The form
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(W^m)^T}
= \begin{bmatrix}
\operatorname{vec}\left(\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T \\
\vdots \\
\operatorname{vec}\left(\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}\, \phi(\operatorname{pad}(Z^{m,i}))^T\right)^T
\end{bmatrix}
$$
is like the product of two things.
Implementation IV

If we have
$$
\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}, \ldots, \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}
\quad \text{and} \quad \phi(\operatorname{pad}(Z^{m,i})),
$$
we can probably do the matrix-vector product without multiplying these two factors out. We will talk about this again later. Thus our Jacobian evaluation focuses solely on obtaining
$$
\frac{\partial z_1^{L+1,i}}{\partial S^{m,i}}, \ldots, \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial S^{m,i}}.
$$
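The reason this works: row $j$ of the block is $\operatorname{vec}(D_j \Phi^T)^T$ with $D_j = \partial z_j^{L+1,i}/\partial S^{m,i}$ and $\Phi = \phi(\operatorname{pad}(Z^{m,i}))$, so for $V = \operatorname{mat}(\boldsymbol{v})$ we get $(J^{m,i}\boldsymbol{v})_j = \langle D_j, V\Phi\rangle_F$; computing $V\Phi$ once replaces forming the whole block. A sketch verifying the identity (the sizes are assumptions, and `vec` is taken row-major for consistency):

```python
import numpy as np

d_out, p, q, nL1 = 4, 9, 12, 3
D = np.random.randn(nL1, d_out, p)       # D[j] = dz_j^{L+1,i}/dS^{m,i}
Phi = np.random.randn(q, p)              # phi(pad(Z^{m,i}))
v = np.random.randn(d_out * q)           # the slice of v for this layer's W block
V = v.reshape(d_out, q)                  # mat(v), shaped like W^m

VPhi = V @ Phi                           # computed once: d_out x p
Jv = np.einsum('jdp,dp->j', D, VPhi)     # (J^{m,i} v)_j = <D[j], V Phi>_F

# Same result as forming the Jacobian block explicitly:
J_block = np.stack([(D[j] @ Phi.T).ravel() for j in range(nL1)])
assert np.allclose(J_block @ v, Jv)
```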
Implementation V

Further, we need to take all data (or the data in the selected subset) into account. In the end we have the following procedure. In the beginning,
$$
\Delta \in \mathbb{R}^{d^{m+1} a^{m+1} b^{m+1} \times n_{L+1} \times l}.
$$
This corresponds to
$$
\frac{\partial \boldsymbol{z}^{L+1,i}}{\partial \operatorname{vec}(Z^{m+1,i})^T}
\odot \left(\boldsymbol{1}_{n_{L+1}} \operatorname{vec}(I[Z^{m+1,i}])^T\right),
\quad \forall i = 1, \ldots, l.
$$
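Over all $l$ instances, this initialization is a single masked 3-D array. A minimal sketch; the sizes are assumptions:

```python
import numpy as np

n_z, nL1, l = 8, 3, 5                    # d^{m+1}a^{m+1}b^{m+1}, n_{L+1}, #instances
dZ = np.random.randn(n_z, nL1, l)        # dz^{L+1,i}/dvec(Z^{m+1,i}) for all i
Z_next = np.random.randn(n_z, l)         # vec(Z^{m+1,i}) for all i

# Delta[:, j, i] = dz_j^{L+1,i}/dvec(Z^{m+1,i}) ⊙ vec(I[Z^{m+1,i}])
Delta = dZ * (Z_next[:, None, :] > 0)    # broadcast the ReLU mask over n_{L+1}
print(Delta.shape)                       # (n_z, nL1, l)
```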