

  1. Disentangling Trainability and Generalization in Deep Neural Networks. Lechao Xiao, Jeffrey Pennington and Samuel S. Schoenholz, Google Brain Team, Google Research. Colab Tutorial

  2. Two Fundamental Theoretical Questions in Deep Learning
 • Trainability / Optimization: is there an efficient algorithm that reaches a global minimum?
 • Generalization: does the trained model perform well on unseen data?
 • The dream: a (model, algorithm) pair with fast training and fantastic generalization
 • Solves AGI


  3. A trade-off between trainability and generalization for very deep and very wide neural networks
 • Large weight initialization (Chaotic Phase): trains fast, but does NOT generalize
 • Small weight initialization (Ordered Phase): trains slowly, but is able to generalize

  4. Setup: a neural network f(x) and its initialization, parameterized by the weight and bias variances (σ²_w, σ²_b).
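A sketch of the standard mean-field / NTK parameterization in which these phases are usually stated (the exact convention on the original slide is an assumption): each pre-activation is

\[
h^{(l+1)}_i(x) \;=\; \sum_{j=1}^{n_l} W^{(l)}_{ij}\,\phi\bigl(h^{(l)}_j(x)\bigr) + b^{(l)}_i,
\qquad
W^{(l)}_{ij} \sim \mathcal{N}\!\Bigl(0,\ \tfrac{\sigma_w^2}{n_l}\Bigr),
\quad
b^{(l)}_i \sim \mathcal{N}\!\bigl(0,\ \sigma_b^2\bigr),
\]

so the initialization is summarized by the pair (σ²_w, σ²_b) that appears in the phase diagram later in the talk.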

  5. Training Dynamics and the NTK
 • Gradient descent dynamics with mean squared error, viewed in function space, are governed by the Neural Tangent Kernel (NTK).
 • In the infinite-width limit, the NTK is deterministic and remains constant throughout training (Jacot et al., 2018).
 • The resulting ODE has a closed-form solution (see below).
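For reference, a sketch of the function-space dynamics the slide refers to, in standard NTK notation (Jacot et al., 2018): writing Θ for the train-train NTK, η for the learning rate, and (X_train, Y_train) for the training data, continuous-time gradient descent on the mean squared error gives

\[
\dot f_t(\mathcal{X}_{\mathrm{train}}) \;=\; -\eta\,\Theta\,\bigl(f_t(\mathcal{X}_{\mathrm{train}}) - \mathcal{Y}_{\mathrm{train}}\bigr),
\qquad
f_t(\mathcal{X}_{\mathrm{train}}) \;=\; \mathcal{Y}_{\mathrm{train}} + e^{-\eta\,\Theta\,t}\,\bigl(f_0(\mathcal{X}_{\mathrm{train}}) - \mathcal{Y}_{\mathrm{train}}\bigr),
\]

which is the closed-form solution mentioned on the slide.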

  6. Training and Learning Dynamics
 • Training dynamics: evolution of the network outputs on the training set.
 • Learning dynamics: evolution of the predictions on held-out test points.
 • [Figure: agreement between finite- and infinite-width networks. Credit: Roman Novak]
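The learning dynamics on a test point x take the analogous standard form (a sketch, averaging over initializations so that the initial output averages to zero; Θ(x, X_train) denotes the test-train NTK):

\[
\mu_t(x) \;=\; \Theta(x, \mathcal{X}_{\mathrm{train}})\,\Theta^{-1}\bigl(I - e^{-\eta\,\Theta\,t}\bigr)\,\mathcal{Y}_{\mathrm{train}}.
\]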

  7. Metric for Trainability: Condition Number
 • Eigen-decompose the training dynamics: the mode along the smallest eigenvalue of the NTK converges at a rate set by λ_min.
 • Trainability metric: the condition number κ = λ_max / λ_min of the train-train NTK.
 • [Figure: 8-layer finite-width FCN on CIFAR10; blue: σ²_w = 25, orange: σ²_w = 0.5]
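A sketch of the reasoning behind this metric: decomposing Θ = Σ_i λ_i v_i v_iᵀ with λ_max = λ_1 ≥ … ≥ λ_m = λ_min, each error mode decays independently, and since the largest stable learning rate scales like 1/λ_max, the slowest mode sets the number of steps needed to train:

\[
f_t(\mathcal{X}_{\mathrm{train}}) - \mathcal{Y}_{\mathrm{train}}
\;=\; \sum_i e^{-\eta \lambda_i t}\, v_i v_i^{\top}\bigl(f_0(\mathcal{X}_{\mathrm{train}}) - \mathcal{Y}_{\mathrm{train}}\bigr),
\qquad
\eta \lesssim \frac{2}{\lambda_{\max}}
\;\;\Rightarrow\;\;
t_{\mathrm{converge}} \;\sim\; \frac{\lambda_{\max}}{\lambda_{\min}} \;=\; \kappa.
\]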

  8. Metric for Generalization: Mean Prediction
 • From the learning dynamics, the mean prediction on the test inputs at the end of training is P(Θ) Y_train.
 • Generalization metric: P(Θ) Y_train. The network cannot generalize if this quantity becomes completely independent of the inputs.
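Concretely, P(Θ) is the t → ∞ limit of the mean-prediction map from the learning dynamics above (notation as before):

\[
P(\Theta) \;=\; \Theta(\mathcal{X}_{\mathrm{test}}, \mathcal{X}_{\mathrm{train}})\,\Theta(\mathcal{X}_{\mathrm{train}}, \mathcal{X}_{\mathrm{train}})^{-1},
\qquad
\mu_\infty(\mathcal{X}_{\mathrm{test}}) \;=\; P(\Theta)\,\mathcal{Y}_{\mathrm{train}},
\]

so generalization fails when P(Θ) Y_train collapses to something independent of the data, e.g. to 0.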

  9. Evolution of the Metrics with Depth
 • Analyze the dynamical system that depth induces on the NTK, and track how the condition number and the mean prediction evolve with the layer index.

  10. Convergence of the NTK and the Phase Diagram
 • The depth-wise convergence of Θ(l) is determined by a bivariate quantity χ_1(σ²_w, σ²_b), giving a phase diagram in the (σ²_w, σ²_b)-plane.
 • Ordered Phase (χ_1 < 1): Θ(l) → Θ* = C 11^T, κ(l) → ∞, and P(Θ(l)) Y_train → C_test.
 • Chaotic Phase (χ_1 > 1): the entries of Θ(l) diverge, κ(l) → 1, and P(Θ(l)) Y_train → 0.
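To make the trainability side of this picture concrete, here is a minimal sketch (not the authors' Colab; the architecture, widths, and helper names are illustrative assumptions) that estimates the empirical NTK of a finite-width tanh network from its parameter Jacobian and compares the condition number under an ordered-phase (σ²_w = 0.5) and a chaotic-phase (σ²_w = 25) initialization:

```python
# Minimal sketch: empirical NTK condition number in the ordered vs chaotic phase.
# All names and hyperparameters are illustrative, not the authors' code.
import jax
import jax.numpy as jnp

def init_params(key, widths, sigma_w2, sigma_b2=0.0):
    """W ~ N(0, sigma_w2 / fan_in), b ~ N(0, sigma_b2)  (mean-field parameterization)."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        key, wk, bk = jax.random.split(key, 3)
        params.append((jax.random.normal(wk, (n_in, n_out)) * jnp.sqrt(sigma_w2 / n_in),
                       jax.random.normal(bk, (n_out,)) * jnp.sqrt(sigma_b2)))
    return params

def f(params, x):
    """Scalar-output tanh MLP applied to a single input vector x."""
    h = x
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b)[0]

def empirical_ntk(params, xs):
    """Theta_ij = <df(x_i)/dtheta, df(x_j)/dtheta>, summed over all parameters."""
    def flat_grad(x):
        g = jax.grad(f)(params, x)
        return jnp.concatenate([jnp.ravel(leaf) for leaf in jax.tree_util.tree_leaves(g)])
    J = jax.vmap(flat_grad)(xs)          # shape: (num_points, num_params)
    return J @ J.T

key = jax.random.PRNGKey(0)
xs = jax.random.normal(key, (16, 32))    # 16 toy inputs of dimension 32
widths = [32] + [256] * 8 + [1]          # 8 hidden layers, as in the CIFAR10 example

for sigma_w2 in (0.5, 25.0):             # ordered vs chaotic phase (sigma_b2 = 0)
    theta = empirical_ntk(init_params(key, widths, sigma_w2), xs)
    eigs = jnp.linalg.eigvalsh(theta)    # eigenvalues in ascending order
    print(f"sigma_w^2 = {sigma_w2}: condition number = {float(eigs[-1] / eigs[0]):.2f}")
```

In the infinite-width picture above, κ should grow with depth in the ordered phase and approach 1 in the chaotic phase; the finite-width, small-sample estimate here is only a rough proxy for that behavior.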

  11. Chaotic Phase (χ_1 > 1)
 • [Figure: dynamics of the NTK entries with depth]
 • [Figure: trainability / generalization metrics]
 • Easy to train, but not generalizable

  12. Chaotic Phase / Memorization
 • 10k training / 2k test examples, CIFAR10 (10 classes)
 • Full-batch gradient descent
 • σ²_w = 25, σ²_b = 0, depth l = 8
 • Easy to train, but does not generalize

  13. Ordered Phase (χ_1 < 1)
 • [Figure: entries of the NTK with depth]
 • [Figure: trainability / generalization metrics]
 • Difficult to train, but generalizable

  14. Ordered Phase / Generalization
 • σ²_w = 0.5: difficult to train, but generalizes
 • σ²_w = 25: easy to train, but does not generalize

  15. Summary
 • A trade-off between trainability and generalization for deep and wide networks:
 • Fast training + memorization (e.g. Chaotic Phase)
 • Slow training + generalization (e.g. Ordered Phase)
 • More results: pooling, dropout, skip connections, LayerNorm, etc.; Conjugate Kernels
 • Colab Tutorial
