Measurements of Three-Level Hierarchical Structure in the Outliers in the Spectrum of Deepnet Hessians

Vardan Papyan, Department of Statistics, Stanford University. June 13, 2019.


  1. Measurements of Three-Level Hierarchical Structure in the Outliers in the Spectrum of Deepnet Hessians. Vardan Papyan, Department of Statistics, Stanford University. June 13, 2019.

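  5. Setting
     ◮ C-class classification problem
     ◮ Loss: $\mathcal{L}(\theta) = \operatorname{Ave}_{i,c}\{\ell(f(x_{i,c};\theta), y_c)\}$
     ◮ Hessian: $\operatorname{Hess}(\theta) = \operatorname{Ave}_{i,c}\left\{\frac{\partial^2 \ell(f(x_{i,c};\theta), y_c)}{\partial\theta^2}\right\}$
     ◮ Gauss-Newton decomposition: $\operatorname{Hess} = G + H$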
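For reference, the decomposition named on this slide can be written out by applying the chain rule to $\ell(f(x;\theta), y)$. The slide gives only $\operatorname{Hess} = G + H$, so the explicit forms below are an assumption based on the standard Gauss-Newton construction: $G$ collects the term built from the Jacobian of $f$, and $H$ the term involving second derivatives of $f$. The symbols $z$ (the network output) and $f_k$ (its $k$-th coordinate, one per class) are introduced here for illustration.

```latex
\begin{align*}
\operatorname{Hess}(\theta)
  &= \underbrace{\operatorname{Ave}_{i,c}\left\{
       \frac{\partial f(x_{i,c};\theta)}{\partial\theta}^{T}
       \left.\frac{\partial^2 \ell(z, y_c)}{\partial z^2}\right|_{z = f(x_{i,c};\theta)}
       \frac{\partial f(x_{i,c};\theta)}{\partial\theta}
     \right\}}_{G} \\
  &\quad + \underbrace{\operatorname{Ave}_{i,c}\left\{
       \sum_{k=1}^{C}
       \left.\frac{\partial \ell(z, y_c)}{\partial z_k}\right|_{z = f(x_{i,c};\theta)}
       \frac{\partial^2 f_k(x_{i,c};\theta)}{\partial\theta^2}
     \right\}}_{H}
\end{align*}
```

Under this reading, $H$ vanishes whenever $f$ is linear in $\theta$, which is why a linear toy model is convenient for checking statements about $G$ alone.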

  6. Previous work: LeCun et al. (1998)

  7. Previous work: Dauphin et al. (2014)

  10. Previous work: Sagun et al. (2017)
      ◮ Noticed that the spectrum can be decomposed into:
        ◮ Bulk + outliers
      ◮ Number of outliers ≈ number of classes

  11. This work: What is causing the outliers in the spectrum?

  19. G is a second moment of gradients with structure on indices
      ◮ Define the gradient: $\delta_{i,c,c'}^T = \sqrt{p_{c'}(x_{i,c};\theta)}\,(y_{c'} - p(x_{i,c};\theta))^T \frac{\partial f(x_{i,c};\theta)}{\partial\theta}$
      ◮ $\delta_{i,c,c}$: gradient of the i-th example in the c-th class (up to a scalar)
      ◮ $\delta_{i,c,c'}$: gradient of the i-th example in the c-th class, if it belonged to class c′ instead (up to a scalar)
      ◮ These gradients can be indexed by three numbers:
        ◮ i: observation
        ◮ c: true class
        ◮ c′: potential class
      ◮ G is a second moment (not covariance) of these gradients: $G = \operatorname{Ave}_{i,c,c'}\{\delta_{i,c,c'}\delta_{i,c,c'}^T\}$
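The second-moment form of G can be checked numerically on a toy problem. The sketch below assumes a linear softmax model $f(x;\theta) = Wx$ with $\theta$ the row-major flattening of $W$; this is a hypothetical stand-in chosen because its Jacobian is explicit, not one of the networks from the talk. It builds every $\delta_{i,c,c'}$ and compares $\operatorname{Ave}_{i,c,c'}\{\delta\delta^T\}$ with the usual Gauss-Newton matrix $J^T(\operatorname{diag}(p) - pp^T)J$ for cross-entropy loss. Because the latter sums over c′ rather than averaging, the two should differ by a factor of C.

```python
import numpy as np

rng = np.random.default_rng(0)
C, n, d = 3, 4, 5                  # classes, examples per class, input dim (toy values)
X = rng.normal(size=(n, C, d))     # X[i, c] plays the role of x_{i,c}
W = rng.normal(size=(C, d))        # theta = W.reshape(-1); assumed model f(x) = W @ x
Y = np.eye(C)                      # Y[cp] is the one-hot vector y_{c'}

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

delta = np.empty((n, C, C, C * d))           # delta[i, c, cp] is a vector in R^{C*d}
G_direct = np.zeros((C * d, C * d))
for i in range(n):
    for c in range(C):
        x = X[i, c]
        p = softmax(W @ x)
        J = np.kron(np.eye(C), x[None, :])   # Jacobian df/dtheta, shape (C, C*d)
        # Gauss-Newton term for softmax cross-entropy, averaged over (i, c)
        G_direct += J.T @ (np.diag(p) - np.outer(p, p)) @ J / (n * C)
        for cp in range(C):
            # delta = sqrt(p_{c'}) * J^T (y_{c'} - p); here J^T v = outer(v, x).ravel()
            delta[i, c, cp] = np.sqrt(p[cp]) * np.outer(Y[cp] - p, x).ravel()

flat = delta.reshape(-1, C * d)
G = flat.T @ flat / flat.shape[0]            # Ave_{i,c,c'} delta delta^T

print("max abs difference:", np.abs(C * G - G_direct).max())
```

The factor C appears only because "Ave" over c′ divides by the number of classes; it rescales the spectrum without changing its shape.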

  22. Three-level hierarchical structure in gradients
      ◮ Averaging over the index i
      ◮ Averaging over the index c′
      ◮ Averaging over the index c
      [Diagram: tree of the gradients $\delta_{i,c,c'}$ and their successive averages; node labels not recoverable from the extraction]
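The three averaging steps can be sketched with array reductions. The example assumes the gradients are stored in a hypothetical array `delta` of shape `(n, C, C, P)` indexed as `delta[i, c, cp]`, filled here with random numbers purely as a stand-in. It also assumes plain unweighted means that include the case c′ = c, since the slide does not say whether that case is excluded or whether the averages are weighted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, P = 6, 4, 10                      # examples per class, classes, parameters (toy sizes)

# Hypothetical stand-in for the gradients delta_{i,c,c'} (random, for shape illustration only)
delta = rng.normal(size=(n, C, C, P))

delta_c_cp = delta.mean(axis=0)         # average over i   -> delta_{c,c'}, shape (C, C, P)
delta_c = delta_c_cp.mean(axis=1)       # average over c'  -> delta_c,      shape (C, P)
delta_all = delta_c.mean(axis=0)        # average over c   -> global mean,  shape (P,)

print(delta_c_cp.shape, delta_c.shape, delta_all.shape)
```

With equal group sizes, as here, the nested means agree with a single mean over all three indices; with unequal class sizes the order and weighting of the averages would matter.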

  23. Visualization of three-level hierarchical structure in gradients. Figure: ResNet50 trained on ImageNet. Large circles: $\delta_c$. Small circles: $\delta_{c,c'}$.

  24. Visualization of three-level hierarchical structure in gradients. Figure panels: MNIST, Fashion, and CIFAR10, each at 13, 702, and 5000 examples per class.

  25. $\operatorname{Ave}_c\{\delta_c\delta_c^T\}$ causes outliers in G. Figure: ResNet18 trained on CIFAR10, 1351 examples per class. Orange: eigenvalues of $\operatorname{Ave}_c\{\delta_c\delta_c^T\}$.

  26. $\operatorname{Ave}_c\{\delta_c\delta_c^T\}$ causes outliers in G. Figure panels: MNIST, Fashion, and CIFAR10, each at 136, 365, and 702 examples per class, followed by MNIST and Fashion at 2599 and CIFAR10 at 1351 examples per class.
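The mechanism on these two slides can be illustrated on synthetic data. The sketch below does not use real network gradients: it assumes each $\delta_{i,c,c'}$ is a class-dependent mean plus small isotropic noise (the names `class_means` and the noise scale are hypothetical), and takes $\delta_c$ as the plain mean over i and c′. Under those assumptions $\operatorname{Ave}_c\{\delta_c\delta_c^T\}$ has rank at most C, and its eigenvalues sit just below the C largest eigenvalues of G, which separate from the remaining bulk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, P = 50, 5, 40                           # toy sizes: examples per class, classes, parameters

# Synthetic gradients (an assumption): class mean plus small noise
class_means = 3.0 * rng.normal(size=(C, P))
delta = class_means[None, :, None, :] + 0.3 * rng.normal(size=(n, C, C, P))

flat = delta.reshape(-1, P)
G = flat.T @ flat / flat.shape[0]             # Ave_{i,c,c'} delta delta^T

delta_c = delta.mean(axis=(0, 2))             # mean over i and c' -> shape (C, P)
G_class = delta_c.T @ delta_c / C             # Ave_c delta_c delta_c^T

eig_G = np.linalg.eigvalsh(G)[::-1]           # eigenvalues in descending order
eig_class = np.linalg.eigvalsh(G_class)[::-1]

print("top eigenvalues of G:      ", np.round(eig_G[:C + 2], 3))
print("top eigenvalues of G_class:", np.round(eig_class[:C + 2], 3))
```

Here G minus the class term is an average of within-class covariances, so it is positive semidefinite and each leading eigenvalue of G is at least the corresponding eigenvalue of the class term. How closely the two match in a trained network is an empirical question that the figures in the talk address.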
