Disentangling Trainability and Generalization in Deep Neural Networks
Lechao Xiao, Jeffrey Pennington and Samuel S. Schoenholz
Google Brain Team, Google Research
Colab Tutorial
Two Fundamental Theoretical Questions in Deep Learning
• Trainability / Optimization
  • Efficient algorithms to reach global minima
• Generalization
  • Performance on unseen data
• Dream
  • A (model, algorithm) pair with fast training + fantastic generalization
  • Solves AGI
A trade-off between Trainability and Generalization for very deep and very wide NNs
• Trains fast, but does NOT generalize
  • Large weight initialization (Chaotic Phase)
• Trains slowly, but is able to generalize
  • Small weight initialization (Ordered Phase)
[Figure: a deep neural network]
Neural Networks at Initialization
[Figure: the network function f(x) at initialization]
Training Dynamics and NTK
• Gradient descent dynamics with mean squared error, viewed in function space, are governed by the Neural Tangent Kernel (NTK); see the sketch below.
• In the infinite-width limit, the NTK is deterministic and remains constant throughout training (Jacot et al., 2018).
• The resulting ODE has a closed-form solution.
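A sketch of these function-space dynamics, following Jacot et al. (2018), with the closed form as in Lee et al. (2019); here Θ denotes the train–train NTK, η the learning rate, and f_t the network function at training time t:

$$
\dot f_t(X_{\text{train}}) = -\eta\,\Theta\,\big(f_t(X_{\text{train}}) - Y_{\text{train}}\big),
\qquad
f_t(X_{\text{train}}) = Y_{\text{train}} + e^{-\eta\,\Theta\,t}\,\big(f_0(X_{\text{train}}) - Y_{\text{train}}\big).
$$

Diagonalizing Θ shows that the residual along the i-th eigendirection decays as e^{-η λ_i t}, which motivates the condition-number metric below.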
Training and Learning Dynamics
• Training dynamics: the evolution of predictions on the training set
• Learning dynamics: the evolution of predictions on test points
• [Figure: agreement between finite- and infinite-width networks. Credit: Roman Novak]
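As a concrete illustration, a minimal sketch of the closed-form infinite-width dynamics with the neural_tangents library; the architecture, widths, data shapes, and σ values here are placeholder assumptions, not the paper's exact setup:

```python
from jax import random
import neural_tangents as nt
from neural_tangents import stax

# 8-layer Erf FCN; W_std**2 = sigma_w^2, b_std**2 = sigma_b^2 (placeholder values).
layers = []
for _ in range(8):
    layers += [stax.Dense(512, W_std=0.5 ** 0.5, b_std=0.0), stax.Erf()]
layers += [stax.Dense(1, W_std=0.5 ** 0.5, b_std=0.0)]
init_fn, apply_fn, kernel_fn = stax.serial(*layers)

# Random data standing in for flattened CIFAR-10 images.
k1, k2, k3 = random.split(random.PRNGKey(0), 3)
x_train = random.normal(k1, (128, 32 * 32 * 3))
y_train = random.normal(k2, (128, 1))
x_test = random.normal(k3, (32, 32 * 32 * 3))

# Closed-form mean prediction of the fully trained infinite-width network (t=None).
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
y_test_mean = predict_fn(t=None, x_test=x_test, get='ntk')

# Train-train NTK Gram matrix, reused for the metrics on the next slides.
ntk_train_train = kernel_fn(x_train, x_train, 'ntk')
```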
Metric for Trainability: Condition Number
• Training dynamics: eigen-decompose the NTK; the residual along the smallest eigenvalue λ_min converges at rate η λ_min, and with the largest stable learning rate η ∝ 1/λ_max this rate is ∝ λ_min/λ_max.
• Trainability metric: the condition number κ = λ_max / λ_min (a sketch of the computation follows).
• [Figure: 8-layer finite-width FCN on CIFAR-10; blue: σ_w² = 25, orange: σ_w² = 0.5]
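A sketch of the trainability metric, assuming a precomputed train–train NTK matrix (e.g. ntk_train_train above); the NTK is symmetric PSD, so eigvalsh applies:

```python
import jax.numpy as jnp

def condition_number(ntk_train_train):
    """kappa = lambda_max / lambda_min of the symmetric PSD train-train NTK."""
    eigvals = jnp.linalg.eigvalsh(ntk_train_train)  # sorted ascending
    return eigvals[-1] / eigvals[0]

# With the largest stable learning rate eta ~ 2 / lambda_max, the residual along
# the smallest eigendirection decays roughly like exp(-2 t / kappa): large kappa
# means slow training.
```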
Metric for Generalization: Mean Prediction
• Learning dynamics: the mean prediction on test inputs is P(Θ) Y_train, with P(Θ) = Θ(X_test, X_train) Θ(X_train, X_train)⁻¹ (a sketch follows).
• Generalization metric: the network cannot generalize if P(Θ) Y_train becomes completely independent of the test inputs.
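A sketch of the generalization metric under the same assumptions (E[f_0] = 0 at initialization, so the fully trained mean prediction is the kernel-regression predictor); the diag_reg term is an added assumption for numerical stability:

```python
import jax.numpy as jnp

def mean_prediction(ntk_test_train, ntk_train_train, y_train, diag_reg=1e-6):
    """P(Theta) Y_train = Theta(X_test, X_train) Theta(X_train, X_train)^{-1} Y_train."""
    n = ntk_train_train.shape[0]
    weights = jnp.linalg.solve(ntk_train_train + diag_reg * jnp.eye(n), y_train)
    return ntk_test_train @ weights

# If every row Theta(x, X_train) collapses to the same vector (or to ~0), the
# prediction no longer depends on the test input x, so the network cannot generalize.
```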
Evolution of the Metrics with Depth
• Approach: analyze the dynamical system that depth induces on the NTK.
• Neural Networks → NTK → Condition Number / Mean Prediction
Convergence of NTK and Phase Diagram
Convergence of Θ^(l) is governed by χ1(σ_w², σ_b²), a bivariate function that defines a phase diagram on the (σ_w², σ_b²)-plane (see the sketch below).
Ordered Phase (χ1 < 1):
• Θ^(l) → Θ* = C 11ᵀ
• κ^(l) → κ* = ∞
• P(Θ^(l)) Y_train → P(Θ*) Y_train = C_test
Chaotic Phase (χ1 > 1):
• Θ^(l) → ∞
• κ^(l) → κ* = 1
• P(Θ^(l)) Y_train → P(Θ*) Y_train = 0
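For reference, a sketch of how χ1 is defined in the mean-field literature this builds on (Poole et al., 2016; Schoenholz et al., 2017): with activation φ and q* the fixed point of the pre-activation variance map,

$$
q^{(l)} = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal N(0,1)}\!\left[\phi\!\left(\sqrt{q^{(l-1)}}\,z\right)^{2}\right] + \sigma_b^2,
\qquad
\chi_1 = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal N(0,1)}\!\left[\phi'\!\left(\sqrt{q^{*}}\,z\right)^{2}\right].
$$

χ1 measures whether perturbations between nearby inputs shrink (χ1 < 1, ordered) or grow (χ1 > 1, chaotic) with depth.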
Chaotic Phase: χ1 > 1
• Depth dynamics of the NTK entries
• Trainability / generalization metrics
• Easy to train, but not generalizable
Chaotic Phase / Memorization
• 10k train / 2k test from CIFAR-10 (10 classes)
• Full-batch gradient descent
• σ_w² = 25, σ_b² = 0, l = 8
• Easy to train, but does not generalize (a finite-width training sketch follows)
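A minimal finite-width sketch of this setup, assuming the same neural_tangents stax layers as above; the width, data subset, step count, and learning rate are placeholder assumptions, and actual CIFAR-10 loading is omitted:

```python
import jax
import jax.numpy as jnp
from jax import random, grad
from neural_tangents import stax

# 8-layer Erf FCN at this slide's chaotic initialization: sigma_w^2 = 25 (W_std = 5), sigma_b^2 = 0.
layers = []
for _ in range(8):
    layers += [stax.Dense(1024, W_std=5.0, b_std=0.0), stax.Erf()]
layers += [stax.Dense(10, W_std=5.0, b_std=0.0)]
init_fn, apply_fn, _ = stax.serial(*layers)

k1, k2, k3 = random.split(random.PRNGKey(0), 3)
x = random.normal(k1, (256, 32 * 32 * 3))                  # stand-in for CIFAR-10 images
y = jax.nn.one_hot(random.randint(k2, (256,), 0, 10), 10)  # stand-in for one-hot labels

_, params = init_fn(k3, x.shape)
loss = lambda p: 0.5 * jnp.mean((apply_fn(p, x) - y) ** 2)  # mean squared error

lr = 1e-2                                                   # placeholder learning rate
for _ in range(100):                                        # full-batch gradient descent
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grad(loss)(params))
```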
Ordered Phase: χ1 < 1
• Entries of the NTK
• Trainability / generalization metrics
• Difficult to train, but generalizable
Ordered Phase / Generalization
• σ_w² = 0.5: difficult to train, but generalizes
• σ_w² = 25: easy to train, but does not generalize
Summary
• A trade-off between trainability and generalization for deep and wide networks
  • Fast training + memorization (e.g. Chaotic Phase)
  • Slow training + generalization (e.g. Ordered Phase)
• More results
  • Pooling, Dropout, Skip Connections, LayerNorm, etc.
  • Conjugate Kernels
Colab Tutorial