Using Loss Surface Geometry for Practical Bayesian Deep Learning

Andrew Gordon Wilson (https://cims.nyu.edu/~andrewgw), New York University
Bayesian Deep Learning Workshop, Advances in Neural Information Processing Systems, December 13, 2019


  1. Using Loss Surface Geometry for Practical Bayesian Deep Learning Andrew Gordon Wilson https://cims.nyu.edu/~andrewgw New York University Bayesian Deep Learning Workshop Advances in Neural Information Processing Systems December 13, 2019 Collaborators: Pavel Izmailov, Wesley Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov 1 / 43

  2. Model Selection
[Figure: monthly airline passenger counts (thousands), 1949–1961.] Which model should we choose?
(1): $f_1(x) = a_0 + a_1 x$
(2): $f_2(x) = \sum_{j=0}^{3} a_j x^j$
(3): $f_3(x) = \sum_{j=0}^{10^4} a_j x^j$
2 / 43
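To make the candidate models concrete, here is a minimal least-squares sketch; the airline series itself is not reproduced, so the data below are a hypothetical stand-in.

```python
import numpy as np

def fit_poly(x, y, degree):
    """Least-squares fit of f(x) = sum_j a_j x^j up to the given degree."""
    coeffs = np.polyfit(x, y, degree)      # highest power first
    return np.poly1d(coeffs)

# Hypothetical stand-in data; the slide uses monthly airline passenger counts.
x = np.linspace(0, 1, 50)
y = 100 + 300 * x + 20 * np.random.randn(50)

f1 = fit_poly(x, y, 1)   # (1): linear model
f2 = fit_poly(x, y, 3)   # (2): cubic model
# (3) would be a degree-10^4 polynomial; a direct least-squares fit of that many
# free coefficients to a short series is not well posed as written.
```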

  3. How do we learn?
◮ The ability of a system to learn is determined by its support (which solutions are a priori possible) and its inductive biases (which solutions are a priori likely).
◮ An influx of new, massive datasets provides great opportunities to automatically learn rich statistical structure, leading to new scientific discoveries.
[Figure: p(data|model) over all possible datasets, for simple, medium, and flexible models.]
3 / 43

  4. Bayesian Deep Learning
Why?
◮ A powerful framework for model construction and understanding generalization
◮ Uncertainty representation and calibration (crucial for decision making)
◮ Better point estimates
◮ Interpretably incorporate prior knowledge and domain expertise
◮ It was the most successful approach at the end of the second wave of neural networks (Neal, 1998).
◮ Neural nets are much less mysterious when viewed through the lens of probability theory.
Why not?
◮ Can be computationally intractable (but doesn’t have to be).
◮ Can involve a lot of moving parts (but doesn’t have to).
There has been exciting progress in the last year addressing these limitations.
4 / 43

  5. Wide Optima Generalize Better (Keskar et al., 2017)
◮ Especially in deep learning, Bayesian integration will give predictions that differ substantially from those of a single point estimate!
5 / 43

  6. Bayesian Deep Learning
Sum rule: $p(x) = \sum_y p(x, y)$. Product rule: $p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)$.
$$p(y \mid x_*, \mathbf{y}, X) = \int p(y \mid x_*, w)\, p(w \mid \mathbf{y}, X)\, dw. \qquad (1)$$
◮ Think of each setting of w as a different model. Eq. (1) is a Bayesian model average, an average of infinitely many models weighted by their posterior probabilities.
◮ Automatically calibrated complexity, even with highly flexible models.
◮ Classical training can be viewed as using the approximate posterior $q(w \mid \mathbf{y}, X) = \delta(w = w_{\text{MAP}})$.
◮ We are typically more interested in the induced distribution over functions than in the parameters w; it can be hard to have intuitions for priors p(w).
6 / 43
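As a concrete reading of Eq. (1), here is a minimal Monte Carlo sketch of the Bayesian model average, assuming a PyTorch classifier and a hypothetical list of weight samples drawn from some approximate posterior q(w | y, X); the names are illustrative.

```python
import torch

def bayesian_model_average(model, posterior_samples, x_star):
    """Monte Carlo approximation of Eq. (1): average the predictive
    distribution p(y | x*, w) over weight samples w ~ q(w | y, X).

    `posterior_samples` is a list of state dicts, one per weight sample."""
    probs = []
    with torch.no_grad():
        for w in posterior_samples:
            model.load_state_dict(w)                    # set network to this sample
            probs.append(torch.softmax(model(x_star), dim=-1))
    return torch.stack(probs).mean(dim=0)               # average of sampled predictives
```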

  7. Mode Connectivity Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, A.G. Wilson NeurIPS 2018 7 / 43

  8.–11. Mode Connectivity (a sequence of figure slides) 8–11 / 43

  12. Uncertainty Representation with SWAG
1. Leverage theory showing that SGD with a constant learning rate approximately samples from a Gaussian distribution.
2. Compute the first two moments of the SGD trajectory (SWA computes just the first).
3. Use these moments to construct a Gaussian approximation in weight space.
4. Sample from this Gaussian distribution, pass the samples through the predictive distribution, and form a Bayesian model average.
$$p(y_* \mid \mathcal{D}) \approx \frac{1}{J} \sum_{j=1}^{J} p(y_* \mid w_j), \quad w_j \sim q(w \mid \mathcal{D}), \quad q(w \mid \mathcal{D}) = \mathcal{N}(\bar{w}, K)$$
$$\bar{w} = \frac{1}{T} \sum_t w_t, \qquad K = \frac{1}{2}\left(\frac{1}{T-1} \sum_t (w_t - \bar{w})(w_t - \bar{w})^\top + \frac{1}{T-1} \sum_t \mathrm{diag}\big((w_t - \bar{w})^2\big)\right)$$
A Simple Baseline for Bayesian Uncertainty in Deep Learning. W. Maddox, P. Izmailov, T. Garipov, D. Vetrov, A.G. Wilson. NeurIPS 2019.
12 / 43
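A minimal, diagonal-only sketch of steps 1–4, assuming a PyTorch model, data loader, loss, and constant-learning-rate SGD optimizer; the released SWAG implementation additionally maintains the low-rank deviation term in K above.

```python
import torch

def swag_moments(model, loader, optimizer, loss_fn, num_epochs):
    """Run constant-learning-rate SGD and track running first and second
    moments of the weight iterates (diagonal-only sketch of SWAG)."""
    mean = torch.nn.utils.parameters_to_vector(model.parameters()).detach().clone()
    sq_mean = mean ** 2
    n = 1
    for _ in range(num_epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        # Collect one weight snapshot per epoch and update running moments.
        w = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
        mean = (n * mean + w) / (n + 1)
        sq_mean = (n * sq_mean + w ** 2) / (n + 1)
        n += 1
    diag_var = torch.clamp(sq_mean - mean ** 2, min=1e-8)   # diagonal covariance
    return mean, diag_var

def sample_swag(mean, diag_var):
    """Draw a weight sample from the (diagonal) Gaussian approximation,
    to be plugged into the Bayesian model average of the previous slide."""
    return mean + diag_var.sqrt() * torch.randn_like(mean)
```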

  13. Trajectory in PCA Subspace 13 / 43

  14. Uncertainty Calibration
[Reliability diagrams: confidence minus accuracy versus confidence (max probability) for WideResNet28x10 on CIFAR-100, WideResNet28x10 on CIFAR-10 → STL-10, DenseNet-161 on ImageNet, and ResNet-152 on ImageNet.]
14 / 43
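For reference, a small sketch of the quantity on the vertical axes of these diagrams (confidence minus accuracy within confidence bins); the array names are illustrative.

```python
import numpy as np

def reliability_gaps(confidences, correct, bin_edges):
    """Compute (confidence - accuracy) per confidence bin.
    `confidences` are max predicted probabilities; `correct` is a boolean array."""
    gaps = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gaps.append(confidences[mask].mean() - correct[mask].mean())
        else:
            gaps.append(np.nan)          # empty bin
    return np.array(gaps)
```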

  15. SWAG Regression Uncertainty 15 / 43

  16. SWAG Visualization 16 / 43

  17. Subspace Inference for Bayesian Deep Learning
A modular approach:
◮ Construct a low-dimensional subspace of the network’s high-dimensional parameter space
◮ Perform inference directly in the subspace
◮ Sample from the approximate posterior for Bayesian model averaging
We can approximate the posterior of a WideResNet with 36 million parameters in a 5D subspace and achieve state-of-the-art results!
17 / 43

  18. Subspace Construction
◮ Choose a shift $\hat{w}$ and basis vectors $\{d_1, \ldots, d_k\}$.
◮ Define the subspace $S = \{w \mid w = \hat{w} + z_1 d_1 + \cdots + z_k d_k\}$.
◮ Likelihood: $p(\mathcal{D} \mid z) = p_M(\mathcal{D} \mid w = \hat{w} + Pz)$, where $P = [d_1, \ldots, d_k]$.
◮ Posterior inference: $p(z \mid \mathcal{D}) \propto p(\mathcal{D} \mid z)\, p(z)$.
18 / 43
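A minimal sketch of the subspace mapping, together with one possible way to choose the basis (a PCA of SGD iterates, in the spirit of the paper); tensor shapes and function names are illustrative.

```python
import torch

def subspace_to_weights(w_hat, basis, z):
    """Map a low-dimensional point z to full weight space: w = w_hat + P z,
    where the columns of `basis` (shape: num_params x k) are d_1, ..., d_k."""
    return w_hat + basis @ z

def pca_subspace_from_iterates(iterates, k):
    """Pick the shift and basis from SGD weight snapshots: the mean iterate and
    the top-k principal directions of the deviations around it. A sketch only;
    `iterates` is a (T, num_params) tensor of snapshots."""
    w_hat = iterates.mean(dim=0)
    deviations = iterates - w_hat
    # Low-rank SVD of the deviation matrix gives the leading PCA directions.
    _, _, v = torch.svd_lowrank(deviations, q=k)
    return w_hat, v                      # v has shape (num_params, k)
```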

  19.–33. Curve Subspace Traversal (a sequence of figure slides traversing the curve subspace) 19–33 / 43

  34. Subspace Comparison (Regression) 34 / 43

  35. Subspace Comparison (Classification)
[Table: accuracy and NLL on CIFAR-100.]
Bayesian methods also lead to better point predictions in deep learning!
Subspace Inference for Bayesian Deep Learning. P. Izmailov, W. Maddox, P. Kirichenko, T. Garipov, D. Vetrov, A.G. Wilson. UAI 2019.
35 / 43

  36. Conclusions
◮ Neural networks represent many compelling solutions to a given problem, and are highly underspecified by the available data. This is the perfect situation for Bayesian marginalization.
◮ Even if we cannot perfectly express our priors, or perform full Bayesian inference, we can try our best and obtain much better point predictions as well as improved calibration. We can view standard training as an impoverished Bayesian approximation.
◮ By exploiting information about the loss geometry in training, we can scale Bayesian neural networks to ImageNet with improvements in accuracy and calibration, and essentially no runtime overhead.
36 / 43

  37. Join Us! There is a postdoc opening in my group! Join an energetic and ambitious team of scientists in New York City, looking to address big open questions in core machine learning. 37 / 43

  38. Scalable Gaussian Processes
◮ Run exact GPs on millions of points in minutes.
◮ Outperforms stand-alone deep neural networks by learning deep kernels.
◮ Implemented in our new library GPyTorch: gpytorch.ai
38 / 43
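A minimal exact GP regression model following GPyTorch's standard pattern; this is a sketch of the basic API, not the large-scale setup behind the millions-of-points claim.

```python
import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

# Toy 1-D regression data for illustration.
train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * 6.28) + 0.1 * torch.randn(100)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Fit hyperparameters by maximizing the exact marginal log likelihood.
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for _ in range(50):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```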

  39. Gaussian processes: a function space view
Gaussian processes provide an intuitive function-space perspective on learning and generalization.
$$\underbrace{p(f(x) \mid \mathcal{D})}_{\text{GP posterior}} \propto \underbrace{p(\mathcal{D} \mid f(x))}_{\text{Likelihood}} \; \underbrace{p(f(x))}_{\text{GP prior}}$$
[Figure: sample prior functions and sample posterior functions; outputs f(x) versus inputs x.]
39 / 43
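As a small illustration of the prior panel, functions can be drawn from a zero-mean GP prior; the RBF kernel below is an assumption, since the slide does not name the kernel.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') = s^2 exp(-(x - x')^2 / (2 l^2))."""
    diff = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

x = np.linspace(-10, 10, 200)
K = rbf_kernel(x, x) + 1e-6 * np.eye(len(x))    # jitter for numerical stability
prior_samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
# Conditioning on observed data yields the posterior samples in the right panel:
# p(f(x) | D) is proportional to p(D | f(x)) p(f(x)).
```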

  40. BoTorch: Bayesian Optimization in PyTorch ◮ Probabilistic active learning ◮ Black box objectives, hyperparameter tuning, A/B testing, global optimization. 40 / 43

  41. Probabilistic Reinforcement Learning Robust, sample efficient online decision making under uncertainty. 41 / 43
