Data-Dependent Sample Complexities for Deep Neural Networks
Tengyu Ma, Colin Wei
Stanford University
How do we design principled regularizers for deep models?
• Many regularizers are designed ad hoc
• A principled approach:
  • Theoretically prove upper bounds on generalization error
  • Empirically regularize the upper bounds
• Bottleneck in prior work:
  • Mostly considers norms of the weights
  • ⇒ Loose/pessimistic bounds (e.g., exponential in depth)
[Bartlett et al. '17, Neyshabur et al. '17, Nagarajan and Kolter '19]
Data-Dependent Generalization Bounds
generalization error ≤ g(weights, training data)
• Add g(⋅) to the loss as an explicit regularizer
• Theorem (informal):
  g(⋅) = (Jacobian norm ⋅ hidden layer norm) / (margin ⋅ √(train set size)) + low-order terms
• Jacobian norm = max norm of the Jacobian of the model w.r.t. the hidden layers on the training data
• Hidden layer norm = max norm of the hidden activations on the training data
• Margin = largest logit − second-largest logit
• g(⋅) measures the stability/Lipschitzness of the network around training examples (see the sketch below)
• Prior works consider worst-case stability over all inputs ⇒ exponential depth dependency [Bartlett et al. '17, Neyshabur et al. '17, etc.]
• Noise stability is also studied in [Arora et al. '19, Nagarajan and Kolter '19], with looser bounds
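To make the quantities in the bound concrete, here is a minimal PyTorch sketch that measures them on a batch. The toy split into `features` and `head`, the layer sizes, and the use of the per-example margin gradient as a cheap stand-in for the full Jacobian are illustrative assumptions, not the paper's exact measurement code.

```python
# Minimal sketch of the quantities in the informal bound, for a toy model
# split into `features` (input -> hidden layer) and `head` (hidden -> logits).
# Sizes and the margin-gradient proxy for the Jacobian are illustrative.
import torch
import torch.nn as nn

features = nn.Sequential(nn.Linear(784, 256), nn.ReLU())  # hypothetical sizes
head = nn.Linear(256, 10)

def bound_quantities(x, y):
    h = features(x)
    h = h.detach().requires_grad_(True)   # treat the hidden layer as the variable
    logits = head(h)

    # Margin = largest logit - second-largest logit (for the true class).
    correct = logits.gather(1, y[:, None]).squeeze(1)
    others = logits.clone()
    others.scatter_(1, y[:, None], float('-inf'))
    margin = correct - others.max(dim=1).values

    # "Jacobian norm": per-example norm of the gradient of the margin w.r.t.
    # the hidden layer -- a cheap proxy for stability around training points.
    grad, = torch.autograd.grad(margin.sum(), h)
    jacobian_norm = grad.norm(dim=1)

    # "Hidden layer norm": per-example norm of the hidden activations.
    hidden_norm = h.norm(dim=1)

    # The bound uses the max/min of these quantities over the training data.
    return jacobian_norm.max(), hidden_norm.max(), margin.min()

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
jn, hn, m = bound_quantities(x, y)
print(f"max Jacobian norm {jn.item():.3f}, "
      f"max hidden norm {hn.item():.3f}, min margin {m.item():.3f}")
```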
Regularizing our Bound
• Penalize the squared Jacobian norm in the loss (see the sketch below)
• Hidden layer norm is controlled by normalization layers (BatchNorm, LayerNorm)
• Helps in a variety of settings that lack regularization, compared to baselines
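A hedged sketch of what penalizing the squared Jacobian norm could look like in PyTorch. The `features`/`head` split, the summed-logits surrogate for the Jacobian, and the penalty weight `lam` are illustrative assumptions rather than the paper's exact regularizer; LayerNorm stands in for the normalization layer that keeps the hidden layer norm in check.

```python
# Hedged sketch of a squared-Jacobian-norm penalty added to the training loss.
# The `features`/`head` split, the summed-logits surrogate for the Jacobian,
# and the weight `lam` are illustrative, not the paper's exact regularizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

features = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.LayerNorm(256))
head = nn.Linear(256, 10)
opt = torch.optim.SGD(list(features.parameters()) + list(head.parameters()), lr=0.1)
lam = 1e-3  # hypothetical penalty weight

def train_step(x, y):
    h = features(x)          # hidden layer (norm kept in check by LayerNorm)
    logits = head(h)
    ce = F.cross_entropy(logits, y)

    # Gradient of the summed logits w.r.t. the hidden layer; create_graph=True
    # so the penalty itself can be backpropagated through.
    grad, = torch.autograd.grad(logits.sum(), h, create_graph=True)
    penalty = (grad ** 2).sum(dim=1).mean()   # squared Jacobian-norm surrogate

    loss = ce + lam * penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(train_step(torch.randn(32, 784), torch.randint(0, 10, (32,))))
```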
Correlation of our Bound with Test Error
• Ours (red) vs. norm-based bound (blue) [Bartlett et al. '17]
[Figure: bound value vs. test error for BN, Fixup [Zhang et al. '19], and no-BN models, each shown with the norm-based bound and ours]
• Our bound correlates better with test error
Conclusion
• Tighter bounds by considering data-dependent properties (stability on the training data)
• Our bound avoids exponential dependencies on depth
• Optimizing this bound improves empirical performance
• Follow-up work: tighter bounds and empirical improvements over strong baselines
  • Works for both robust and clean accuracy
  [Wei and Ma '19, "Improved Sample Complexities for Deep Networks and Robust Classification via an All-Layer Margin"]
Come find our poster: 10:45 AM – 12:45 PM @ East Exhibition Hall B + C, #220!