Disentangling Trainability and Generalization in Deep Neural Networks
Lechao Xiao, Jeffrey Pennington and Samuel S. Schoenholz
Google Brain Team, Google Research
Colab Tutorial
Two Fundamental Theoretical Questions in Deep Learning
• Trainability / Optimization
  • Efficient algorithms to reach global minima
• Generalization
  • Performance on unseen data
• Dream
  • A (model, algorithm) pair with fast training + fantastic generalization
  • Solves AGI
A trade-off between Trainability and Generalization for very deep and very wide NNs
• Trains fast, but does NOT generalize
  • Large weight initialization (Chaotic Phase)
• Trains slowly, but is able to generalize
  • Small weight initialization (Ordered Phase)
[Figure: a deep neural network]
Neural Networks at Initialization
[Figure: the network function f(x) at initialization]
Training Dynamics and NTK
• Gradient descent dynamics with mean squared error, viewed in function space, are governed by the Neural Tangent Kernel (NTK); see the sketch below.
• In the infinite-width limit, the NTK is deterministic and remains constant throughout training (Jacot et al., 2018).
• The resulting ODE has a closed-form solution.
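A sketch of these function-space dynamics, following Jacot et al. (2018), with the closed form as in Lee et al. (2019); here Θ denotes the train–train NTK, η the learning rate, and f_t the network function at training time t:

$$
\dot f_t(X_{\text{train}}) = -\eta\,\Theta\,\big(f_t(X_{\text{train}}) - Y_{\text{train}}\big),
\qquad
f_t(X_{\text{train}}) = Y_{\text{train}} + e^{-\eta\,\Theta\,t}\,\big(f_0(X_{\text{train}}) - Y_{\text{train}}\big).
$$

Diagonalizing Θ shows that the residual along the i-th eigendirection decays as e^{-η λ_i t}, which motivates the condition-number metric below.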
Training and Learning Dynamics
• Training dynamics: the evolution of predictions on the training set
• Learning dynamics: the evolution of predictions on test points
• [Figure: agreement between finite- and infinite-width networks. Credit: Roman Novak]
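As a concrete illustration, a minimal sketch of the closed-form infinite-width dynamics with the neural_tangents library; the architecture, widths, data shapes, and σ values here are placeholder assumptions, not the paper's exact setup:

```python
from jax import random
import neural_tangents as nt
from neural_tangents import stax

# 8-layer Erf FCN; W_std**2 = sigma_w^2, b_std**2 = sigma_b^2 (placeholder values).
layers = []
for _ in range(8):
    layers += [stax.Dense(512, W_std=0.5 ** 0.5, b_std=0.0), stax.Erf()]
layers += [stax.Dense(1, W_std=0.5 ** 0.5, b_std=0.0)]
init_fn, apply_fn, kernel_fn = stax.serial(*layers)

# Random data standing in for flattened CIFAR-10 images.
k1, k2, k3 = random.split(random.PRNGKey(0), 3)
x_train = random.normal(k1, (128, 32 * 32 * 3))
y_train = random.normal(k2, (128, 1))
x_test = random.normal(k3, (32, 32 * 32 * 3))

# Closed-form mean prediction of the fully trained infinite-width network (t=None).
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
y_test_mean = predict_fn(t=None, x_test=x_test, get='ntk')

# Train-train NTK Gram matrix, reused for the metrics on the next slides.
ntk_train_train = kernel_fn(x_train, x_train, 'ntk')
```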
Metric for Trainability: Condition Number
• Training dynamics: eigen-decompose the NTK; the residual along the smallest eigenvalue λ_min converges at rate η λ_min, and with the largest stable learning rate η ∝ 1/λ_max this rate is ∝ λ_min/λ_max.
• Trainability metric: the condition number κ = λ_max / λ_min (a sketch of the computation follows).
• [Figure: 8-layer finite-width FCN on CIFAR-10; blue: σ_w² = 25, orange: σ_w² = 0.5]
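A sketch of the trainability metric, assuming a precomputed train–train NTK matrix (e.g. ntk_train_train above); the NTK is symmetric PSD, so eigvalsh applies:

```python
import jax.numpy as jnp

def condition_number(ntk_train_train):
    """kappa = lambda_max / lambda_min of the symmetric PSD train-train NTK."""
    eigvals = jnp.linalg.eigvalsh(ntk_train_train)  # sorted ascending
    return eigvals[-1] / eigvals[0]

# With the largest stable learning rate eta ~ 2 / lambda_max, the residual along
# the smallest eigendirection decays roughly like exp(-2 t / kappa): large kappa
# means slow training.
```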
Metric for Generalization: Mean Prediction
• Learning dynamics: the mean prediction on test inputs is P(Θ) Y_train, with P(Θ) = Θ(X_test, X_train) Θ(X_train, X_train)⁻¹ (a sketch follows).
• Generalization metric: the network cannot generalize if P(Θ) Y_train becomes completely independent of the test inputs.
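A sketch of the generalization metric under the same assumptions (E[f_0] = 0 at initialization, so the fully trained mean prediction is the kernel-regression predictor); the diag_reg term is an added assumption for numerical stability:

```python
import jax.numpy as jnp

def mean_prediction(ntk_test_train, ntk_train_train, y_train, diag_reg=1e-6):
    """P(Theta) Y_train = Theta(X_test, X_train) Theta(X_train, X_train)^{-1} Y_train."""
    n = ntk_train_train.shape[0]
    weights = jnp.linalg.solve(ntk_train_train + diag_reg * jnp.eye(n), y_train)
    return ntk_test_train @ weights

# If every row Theta(x, X_train) collapses to the same vector (or to ~0), the
# prediction no longer depends on the test input x, so the network cannot generalize.
```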
Evolution of the Metrics with Depth
• Approach: analyze the dynamical system that depth induces on the NTK.
• Neural Networks → NTK → Condition Number / Mean Prediction
Convergence of NTK and Phase Diagram
Convergence of Θ^(l) is governed by χ1(σ_w², σ_b²), a bivariate function that defines a phase diagram on the (σ_w², σ_b²)-plane (see the sketch below).
Ordered Phase (χ1 < 1):
• Θ^(l) → Θ* = C 11ᵀ
• κ^(l) → κ* = ∞
• P(Θ^(l)) Y_train → P(Θ*) Y_train = C_test
Chaotic Phase (χ1 > 1):
• Θ^(l) → ∞
• κ^(l) → κ* = 1
• P(Θ^(l)) Y_train → P(Θ*) Y_train = 0
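For reference, a sketch of how χ1 is defined in the mean-field literature this builds on (Poole et al., 2016; Schoenholz et al., 2017): with activation φ and q* the fixed point of the pre-activation variance map,

$$
q^{(l)} = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal N(0,1)}\!\left[\phi\!\left(\sqrt{q^{(l-1)}}\,z\right)^{2}\right] + \sigma_b^2,
\qquad
\chi_1 = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal N(0,1)}\!\left[\phi'\!\left(\sqrt{q^{*}}\,z\right)^{2}\right].
$$

χ1 measures whether perturbations between nearby inputs shrink (χ1 < 1, ordered) or grow (χ1 > 1, chaotic) with depth.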
Chaotic Phase: χ1 > 1
• Depth dynamics of the NTK entries
• Trainability / generalization metrics
• Easy to train, but not generalizable
Chaotic Phase / Memorization
• 10k train / 2k test from CIFAR-10 (10 classes)
• Full-batch gradient descent
• σ_w² = 25, σ_b² = 0, l = 8
• Easy to train, but does not generalize (a finite-width training sketch follows)
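A minimal finite-width sketch of this setup, assuming the same neural_tangents stax layers as above; the width, data subset, step count, and learning rate are placeholder assumptions, and actual CIFAR-10 loading is omitted:

```python
import jax
import jax.numpy as jnp
from jax import random, grad
from neural_tangents import stax

# 8-layer Erf FCN at this slide's chaotic initialization: sigma_w^2 = 25 (W_std = 5), sigma_b^2 = 0.
layers = []
for _ in range(8):
    layers += [stax.Dense(1024, W_std=5.0, b_std=0.0), stax.Erf()]
layers += [stax.Dense(10, W_std=5.0, b_std=0.0)]
init_fn, apply_fn, _ = stax.serial(*layers)

k1, k2, k3 = random.split(random.PRNGKey(0), 3)
x = random.normal(k1, (256, 32 * 32 * 3))                  # stand-in for CIFAR-10 images
y = jax.nn.one_hot(random.randint(k2, (256,), 0, 10), 10)  # stand-in for one-hot labels

_, params = init_fn(k3, x.shape)
loss = lambda p: 0.5 * jnp.mean((apply_fn(p, x) - y) ** 2)  # mean squared error

lr = 1e-2                                                   # placeholder learning rate
for _ in range(100):                                        # full-batch gradient descent
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grad(loss)(params))
```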
Ordered Phase: χ1 < 1
• Entries of the NTK
• Trainability / generalization metrics
• Difficult to train, but generalizable
Ordered Phase / Generalization
• σ_w² = 0.5: difficult to train, but generalizes
• σ_w² = 25: easy to train, but does not generalize
Summary
• A trade-off between trainability and generalization for deep and wide networks
  • Fast training + memorization (e.g. Chaotic Phase)
  • Slow training + generalization (e.g. Ordered Phase)
• More results
  • Pooling, Dropout, Skip Connections, LayerNorm, etc.
  • Conjugate Kernels
Colab Tutorial