Measurements of Three-Level Hierarchical Structure in the Outliers - PowerPoint PPT Presentation

Measurements of Three-Level Hierarchical Structure in the Outliers in the Spectrum of Deepnet Hessians
Vardan Papyan, Department of Statistics, Stanford University
June 13, 2019

Setting
◮ C-class classification problem
◮ Loss: $L(\theta) = \mathrm{Ave}_{i,c}\{\ell(f(x_{i,c};\theta), y_c)\}$
◮ Hessian: $\mathrm{Hess}(\theta) = \mathrm{Ave}_{i,c}\left\{\frac{\partial^2 \ell(f(x_{i,c};\theta), y_c)}{\partial\theta^2}\right\}$
◮ Gauss-Newton decomposition: Hess = G + H
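The slide names G and H without writing them out. Applying the chain rule to the loss gives the standard Gauss-Newton split sketched below. The symbols $J_{i,c}$ (the Jacobian of the network output with respect to $\theta$) and $z$ (the network output) are notation introduced here, not taken from the slides.

```latex
\mathrm{Hess}(\theta)
  = \underbrace{\mathrm{Ave}_{i,c}\left\{ J_{i,c}^T \,
      \frac{\partial^2 \ell(z, y_c)}{\partial z^2}\bigg|_{z = f(x_{i,c};\theta)}
      J_{i,c} \right\}}_{G}
  + \underbrace{\mathrm{Ave}_{i,c}\left\{ \sum_{c'=1}^{C}
      \frac{\partial \ell(z, y_c)}{\partial z_{c'}}\bigg|_{z = f(x_{i,c};\theta)}
      \frac{\partial^2 f_{c'}(x_{i,c};\theta)}{\partial\theta^2} \right\}}_{H},
\qquad
J_{i,c} = \frac{\partial f(x_{i,c};\theta)}{\partial\theta}
```

G collects the curvature of the loss seen through a first-order model of the network, while H collects the curvature of the network itself weighted by the loss gradient.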

Previous work: LeCun et al. (1998)

Previous work: Dauphin et al. (2014)

Previous work: Sagun et al. (2017)
◮ Noticed that the spectrum can be decomposed into bulk + outliers
◮ Number of outliers ≈ number of classes

This work: what is causing the outliers in the spectrum?

G is a second moment of gradients with structure on indices
◮ Define the gradient: $\delta_{i,c,c'}^T = \sqrt{p_{c'}(x_{i,c};\theta)}\,(y_{c'} - p(x_{i,c};\theta))^T \frac{\partial f(x_{i,c};\theta)}{\partial\theta}$
◮ $\delta_{i,c,c}$: gradient of the i-th example in the c-th class (up to a scalar)
◮ $\delta_{i,c,c'}$: gradient of the i-th example in the c-th class, if it belonged to class c′ instead (up to a scalar)
◮ These gradients can be indexed by three numbers:
  ◮ i: observation
  ◮ c: true class
  ◮ c′: potential class
◮ G is a second moment (not covariance) of these gradients: $G = \mathrm{Ave}_{i,c,c'}\{\delta_{i,c,c'}\,\delta_{i,c,c'}^T\}$
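A small numerical check of why a second moment of these gradients can reproduce G. It assumes ℓ is softmax cross-entropy (so p is the softmax output and the Hessian of ℓ with respect to the logits is diag(p) − ppᵀ) and that the leading factor in the gradient definition is the square root of p_c′. Under those assumptions, summing the outer products of √p_c′ (y_c′ − p) over c′ should recover diag(p) − ppᵀ. The function names and the logits are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_outer_sum(p):
    """Sum over c' of p[c'] * (y_c' - p)(y_c' - p)^T, with y_c' one-hot."""
    C = len(p)
    total = [[0.0] * C for _ in range(C)]
    for cp in range(C):
        # Residual scaled by sqrt(p[c']) (assumed form of the slide's factor).
        r = [math.sqrt(p[cp]) * ((1.0 if k == cp else 0.0) - p[k]) for k in range(C)]
        for a in range(C):
            for b in range(C):
                total[a][b] += r[a] * r[b]
    return total

def softmax_ce_hessian(p):
    """Hessian of softmax cross-entropy w.r.t. the logits: diag(p) - p p^T."""
    C = len(p)
    return [[(p[a] if a == b else 0.0) - p[a] * p[b] for b in range(C)]
            for a in range(C)]

p = softmax([2.0, -1.0, 0.5])  # arbitrary logits for a 3-class toy case
lhs = weighted_outer_sum(p)
rhs = softmax_ce_hessian(p)
max_gap = max(abs(lhs[a][b] - rhs[a][b]) for a in range(3) for b in range(3))
print(max_gap)  # expected to be at floating-point rounding level
```

Multiplying both sides by the Jacobian ∂f/∂θ on the right and its transpose on the left turns this logit-space identity into the parameter-space statement on the slide, up to the scalar introduced by averaging rather than summing over c′.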

Three-level hierarchical structure in gradients
◮ Averaging over the index i: $\delta_{i,c,c'} \to \delta_{c,c'}$
◮ Averaging over the index c′: $\delta_{c,c'} \to \delta_c$
◮ Averaging over the index c
[Diagram of the averaging hierarchy not recoverable from the extraction.]
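The three averaging steps can be sketched on toy data as follows. The gradient values, sizes, and variable names are made up for illustration. Whether the talk treats the c′ = c terms separately cannot be recovered from the slide, so this sketch simply averages over all c′.

```python
from itertools import product

N, C = 2, 3  # toy sizes: examples per class, number of classes

def mean(vectors):
    """Coordinate-wise average of equal-length vectors."""
    vectors = list(vectors)
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical gradients indexed by (i, c, c'); values are arbitrary.
delta = {
    (i, c, cp): [float(i + 10 * c + 100 * cp), 1.0]
    for i, c, cp in product(range(N), range(C), range(C))
}

# Level 1: average over the observation index i.
delta_c_cp = {
    (c, cp): mean(delta[(i, c, cp)] for i in range(N))
    for c, cp in product(range(C), range(C))
}

# Level 2: average over the potential class c'.
delta_c = {c: mean(delta_c_cp[(c, cp)] for cp in range(C)) for c in range(C)}

# Level 3: average over the true class c.
delta_global = mean(delta_c[c] for c in range(C))

# With balanced classes the nested averages equal the flat average.
flat_mean = mean(delta.values())

print(delta_c_cp[(1, 2)], delta_c[0], delta_global)
```

Each level removes one index, so the N·C·C gradients collapse to C·C cluster centers, then to C class centers, then to a single global mean.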

Visualization of three-level hierarchical structure in gradients
Figure: ResNet50 trained on ImageNet. Large circles: $\delta_c$. Small circles: $\delta_{c,c'}$.

Visualization of three-level hierarchical structure in gradients
Figure panels: MNIST, Fashion, and CIFAR10, each with 13, 702, and 5000 examples per class.

$\mathrm{Ave}_c\{\delta_c \delta_c^T\}$ causes outliers in G
Figure: ResNet18 trained on CIFAR10, 1351 examples per class. Orange: eigenvalues of $\mathrm{Ave}_c\{\delta_c \delta_c^T\}$.

$\mathrm{Ave}_c\{\delta_c \delta_c^T\}$ causes outliers in G
Figure panels: MNIST and Fashion with 136, 365, 702, and 2599 examples per class; CIFAR10 with 136, 365, 702, and 1351 examples per class.
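One reason a term of the form Ave_c{δ_c δ_cᵀ} can only account for a handful of eigenvalues is that it is an average of C rank-one matrices, so its rank is at most C, however many parameters the network has. The sketch below checks this in exact arithmetic on made-up class means. The vectors and helper names are illustrative assumptions, not values from the talk.

```python
from fractions import Fraction

def second_moment(vectors):
    """Ave_c { d_c d_c^T } over the given vectors, as a list of rows."""
    dim = len(vectors[0])
    total = [[Fraction(0)] * dim for _ in range(dim)]
    for d in vectors:
        for a in range(dim):
            for b in range(dim):
                total[a][b] += d[a] * d[b]
    return [[entry / len(vectors) for entry in row] for row in total]

def rank(matrix):
    """Rank by Gaussian elimination; exact because entries are Fractions."""
    rows = [row[:] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((k for k in range(r, len(rows)) if rows[k][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for k in range(r + 1, len(rows)):
            factor = rows[k][col] / rows[r][col]
            rows[k] = [x - factor * y for x, y in zip(rows[k], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

# Hypothetical class means: C = 3 classes in a 5-parameter toy model.
class_means = [
    [Fraction(x) for x in row]
    for row in ([1, 0, 2, 0, 1], [0, 1, 1, 3, 0], [2, 1, 0, 1, 1])
]

moment = second_moment(class_means)  # 5 x 5 matrix
moment_rank = rank(moment)
print(moment_rank)
```

The 5 × 5 matrix has rank equal to the number of classes, so at most C of its eigenvalues are nonzero, which is in line with the earlier observation that the number of outliers is roughly the number of classes.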
