Probabilistic Numerics – Part II – Linear Algebra and Nonlinear Optimization

1. Probabilistic Numerics – Part II – Linear Algebra and Nonlinear Optimization. Philipp Hennig, MLSS 2015, 20/07/2015. Emmy Noether Group on Probabilistic Numerics, Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany

2. Probabilistic Numerics – Recap from Saturday
On Saturday:
▸ computation is inference
▸ classic methods for integration and for the solution of differential equations can be interpreted as MAP inference under Gaussian models
▸ customizing the implicit prior gives faster, tailored numerics
▸ the probabilistic formulation allows propagation of uncertainty through composite computations

3. Linear Algebra
Solve $Ax = b$ for $x$, with $A \in \mathbb{R}^{N \times N}$ symmetric positive definite.
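As a point of reference for what follows, here is a minimal non-probabilistic sketch (not from the slides; the matrix and right-hand side are invented) of the classical way to solve such a system, via a Cholesky factorization:

```python
# A classical baseline for A x = b with symmetric positive definite A:
# factor A = L L^T (Cholesky), then solve two triangular systems.
import numpy as np

rng = np.random.default_rng(0)
N = 5
B = rng.standard_normal((N, N))
A = B @ B.T + N * np.eye(N)          # symmetric positive definite by construction
b = rng.standard_normal(N)

L = np.linalg.cholesky(A)            # A = L L^T
x = np.linalg.solve(L.T, np.linalg.solve(L, b))
print(np.allclose(A @ x, b))         # True: x solves the system
```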

4. Why you should care about linear algebra – least-squares: a most basic machine learning task
$\hat{f}(x) = k_{xX}\,(k_{XX} + \sigma^2 I)^{-1} b = k_{xX}\, A^{-1} b$, with $A = k_{XX} + \sigma^2 I$.
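A hedged illustration of this point (the kernel, data, and noise level are invented here, not taken from the slides): kernel least-squares reduces to exactly one symmetric positive definite solve.

```python
# Kernel least-squares / GP regression mean: f_hat(x) = k_xX (k_XX + sigma^2 I)^{-1} b,
# i.e. one solve with the symmetric positive definite matrix A = k_XX + sigma^2 I.
import numpy as np

def rbf(a, b, ell=0.5):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 20)                                    # training inputs
b = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(X.size)    # noisy targets
sigma2 = 0.01

A = rbf(X, X) + sigma2 * np.eye(X.size)              # A = k_XX + sigma^2 I
x_star = np.linspace(0.0, 1.0, 5)                    # test inputs
f_hat = rbf(x_star, X) @ np.linalg.solve(A, b)       # k_xX A^{-1} b
print(f_hat)
```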

5. Inference on Matrix Elements – generic Gaussian priors [Hennig, SIOPT, 2015]
▸ prior on the elements of the inverse $H = A^{-1} \in \mathbb{R}^{N \times N}$, with $\Sigma \in \mathbb{R}^{N^2 \times N^2}$:
  $p(\vec{H}) = \mathcal{N}(\vec{H}; \vec{H}_0, \Sigma) = \frac{1}{(2\pi)^{N^2/2} |\Sigma|^{1/2}} \exp\left[ -\tfrac{1}{2} (\vec{H} - \vec{H}_0)^\intercal \Sigma^{-1} (\vec{H} - \vec{H}_0) \right]$
▸ can collect noise-free observations: $AS = Y \Leftrightarrow S = HY \in \mathbb{R}^{N \times M}$, so $p(S, Y \mid H) = \delta(S - HY)$
▸ this is a linear projection (using the Kronecker product): $S_{km} = \sum_{ij} \delta_{ki} Y_{jm} H_{ij}$, i.e. $\vec{S} = (I \otimes Y^\intercal)\,\vec{H} = C\,\vec{H}$ with $C \in \mathbb{R}^{NM \times N^2}$
▸ posterior: $p(\vec{H} \mid S, Y) = \mathcal{N}\big[\vec{H};\ \vec{H}_0 + \Sigma C^\intercal (C \Sigma C^\intercal)^{-1} (\vec{S} - C \vec{H}_0),\ \Sigma - \Sigma C^\intercal (C \Sigma C^\intercal)^{-1} C \Sigma\big]$
▸ requires $\mathcal{O}(N^3 M)$ operations! Need structure in $\Sigma$
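The following sketch (a toy-sized illustration with the arbitrary choices $\Sigma = I$ and $H_0 = 0$, not the paper's code) spells out this general update; it uses a row-major vectorization so that $C = I \otimes Y^\intercal$ maps $\vec{H}$ to $\vec{S}$ as on the slide.

```python
# Gaussian inference on vec(H) with a generic covariance Sigma.
# Observations A S = Y, i.e. S = H Y, are a linear projection of vec(H):
# vec(S) = (I kron Y^T) vec(H) = C vec(H)   (row-major vec, as on the slide).
import numpy as np

rng = np.random.default_rng(2)
N, M = 4, 2
B = rng.standard_normal((N, N))
A = B @ B.T + N * np.eye(N)

S = rng.standard_normal((N, M))            # search directions
Y = A @ S                                  # observations A S = Y, hence S = H Y

C = np.kron(np.eye(N), Y.T)                # C in R^{NM x N^2}
vec_H0 = np.zeros(N * N)                   # prior mean H_0 = 0 (an arbitrary choice)
Sigma = np.eye(N * N)                      # generic prior covariance (an arbitrary choice)

G = C @ Sigma @ C.T                        # C Sigma C^T  (NM x NM)
gain = Sigma @ C.T @ np.linalg.inv(G)
vec_mean = vec_H0 + gain @ (S.ravel() - C @ vec_H0)
Sigma_post = Sigma - gain @ C @ Sigma      # posterior covariance over vec(H)

H_mean = vec_mean.reshape(N, N)
print(np.allclose(H_mean @ Y, S))          # True: the posterior mean reproduces the data
```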

6. $p(\vec{H} \mid S, Y) = \mathcal{N}\big[\vec{H};\ \vec{H}_0 + \Sigma C^\intercal (C \Sigma C^\intercal)^{-1} (\vec{S} - C \vec{H}_0),\ \Sigma - \Sigma C^\intercal (C \Sigma C^\intercal)^{-1} C \Sigma\big]$
▸ good probabilistic numerical methods must have both
  ▸ low computational cost
  ▸ meaningful prior assumptions

7. A factorization assumption – with support on all matrices
[figure: $H = H_0 + C \cdot D^\intercal$]
▸ $\operatorname{cov}(H_{ij}, H_{k\ell}) = V_{ik} W_{j\ell} \;\Rightarrow\; p(H) = \mathcal{N}(H; H_0, V \otimes W)$
▸ if $V, W \succ 0$, this puts nonzero mass on all $H \in \mathbb{R}^{N \times N}$; $\operatorname{var}(H_{ij}) = V_{ii} W_{jj}$
▸ draw $n$ columns of $C$ i.i.d. from $\mathcal{N}(C_{:i}; 0, V/n)$
▸ draw $n$ columns of $D$ i.i.d. from $\mathcal{N}(D_{:i}; 0, W/n)$
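A small sketch of what the Kronecker-factored prior means (illustrative only; $V$, $W$ and the sample count are invented): drawing $H = H_0 + L_V Z L_W^\intercal$ with $Z$ i.i.d. standard normal and $L_V L_V^\intercal = V$, $L_W L_W^\intercal = W$ yields $\operatorname{cov}(H_{ij}, H_{k\ell}) = V_{ik} W_{j\ell}$, which can be checked empirically.

```python
# Sampling from p(H) = N(H; H_0, V kron W) and checking the factorized covariance
# cov(H_ij, H_kl) = V_ik W_jl empirically.
import numpy as np

rng = np.random.default_rng(3)
N = 3

def random_spd(n):
    B = rng.standard_normal((n, n))
    return B @ B.T + n * np.eye(n)

V, W = random_spd(N), random_spd(N)
H0 = np.zeros((N, N))
LV, LW = np.linalg.cholesky(V), np.linalg.cholesky(W)

K = 200_000
Z = rng.standard_normal((K, N, N))
samples = H0 + LV @ Z @ LW.T                            # each draw: H_0 + L_V Z_k L_W^T

emp_cov = np.cov(samples.reshape(K, -1), rowvar=False)  # empirical N^2 x N^2 covariance
print(np.max(np.abs(emp_cov - np.kron(V, W))))          # -> 0 as K grows (Monte Carlo error)
```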

8. A Structured Prior – computation requires trading expressivity and cost [Hennig, SIOPT, 2015]
▸ the prior $p(\vec{H}) = \mathcal{N}(\vec{H}; \vec{H}_0, V \otimes W)$ gives
  $p(H \mid S, Y) = \mathcal{N}\big[H;\ H_0 + (S - H_0 Y)(Y^\intercal W Y)^{-1} Y^\intercal W,\ V \otimes \big(W - W Y (Y^\intercal W Y)^{-1} Y^\intercal W\big)\big]$
[figure: $H_{\text{true}}$, posterior mean $H_M$, observations $Y = AS$]
▸ two problems:
  ▸ still requires an $\mathcal{O}(M^3)$ inversion just to compute the mean ↝ would like a diagonal $Y^\intercal W Y$ (conjugate observations)
  ▸ how to choose $H_0, V, W$ to get a well-scaled prior? ↝ an 'empirical Bayesian' choice to include $H$
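A minimal sketch of this update (sizes, $A$, $W = I$ and $H_0 = 0$ are arbitrary illustrative choices): the $N^2$-dimensional Gaussian conditioning collapses to small matrix algebra, and the mean reproduces the observations.

```python
# The Kronecker-structured posterior: the update only involves M x M and N x M algebra.
#   H_M = H_0 + (S - H_0 Y)(Y^T W Y)^{-1} Y^T W
import numpy as np

rng = np.random.default_rng(4)
N, M = 6, 3
B = rng.standard_normal((N, N))
A = B @ B.T + N * np.eye(N)

S = rng.standard_normal((N, M))          # search directions
Y = A @ S                                # observations Y = A S, so S = H Y

H0 = np.zeros((N, N))                    # prior mean (an arbitrary choice)
W = np.eye(N)                            # prior Kronecker factor (an arbitrary choice)

G = Y.T @ W @ Y                          # only an M x M system, not NM x NM
H_M = H0 + (S - H0 @ Y) @ np.linalg.solve(G, Y.T @ W)
W_post = W - W @ Y @ np.linalg.solve(G, Y.T @ W)   # column factor of the posterior covariance

print(np.allclose(H_M @ Y, S))           # True: the mean reproduces the observations
```

Note that the factor $V$ does not appear in the mean at all; it only scales the posterior covariance.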

9. A Scaled Prior – probabilistic computation needs meaningful priors [Hennig, SIOPT, 2015]
▸ use $H_0 = \epsilon I$ with $\epsilon \ll 1$. It would be nice to have $W = V = H$:
  $\operatorname{var}(H_{ij}) = V_{ii} W_{jj} = H_{ii} H_{jj}$; for symmetric positive definite matrices, $H_{ii} > 0$ and $H_{ij}^2 \leq H_{ii} H_{jj}$
▸ if $W = V = H$:
  $p(H \mid S, Y) = \mathcal{N}\big[H;\ H_0 + (S - H_0 Y)(Y^\intercal W Y)^{-1} Y^\intercal W,\ V \otimes \big(W - W Y (Y^\intercal W Y)^{-1} Y^\intercal W\big)\big]$

10. A Scaled Prior – probabilistic computation needs meaningful priors [Hennig, SIOPT, 2015]
▸ use $H_0 = \epsilon I$ with $\epsilon \ll 1$. It would be nice to have $W = V = H$:
  $\operatorname{var}(H_{ij}) = V_{ii} W_{jj} = H_{ii} H_{jj}$; for symmetric positive definite matrices, $H_{ii} > 0$ and $H_{ij}^2 \leq H_{ii} H_{jj}$
▸ if $W = V = H$, then $WY = HY = S$, and the posterior becomes
  $p(H \mid S, Y) = \mathcal{N}\big[H;\ H_0 + (S - H_0 Y)(Y^\intercal S)^{-1} S^\intercal,\ W \otimes \big(W - S (Y^\intercal S)^{-1} S^\intercal\big)\big]$
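A quick numerical check of this substitution (problem sizes invented; the true $H = A^{-1}$ is used here only to verify the identity, which the algorithm of course never has access to): since $HY = S$, the mean written with $W = H$ coincides with the purely observable form.

```python
# The "W = H" substitution: since H Y = S, the mean written with W = H equals the
# purely observable form (S - H_0 Y)(Y^T S)^{-1} S^T.
import numpy as np

rng = np.random.default_rng(8)
N, M = 6, 3
B = rng.standard_normal((N, N))
A = B @ B.T + N * np.eye(N)
H = np.linalg.inv(A)                     # the "unknown" truth, used only for this check
S = rng.standard_normal((N, M))
Y = A @ S

H0 = 1e-3 * np.eye(N)                    # H_0 = eps * I

mean_with_W_eq_H = H0 + (S - H0 @ Y) @ np.linalg.solve(Y.T @ H @ Y, Y.T @ H)
mean_observable  = H0 + (S - H0 @ Y) @ np.linalg.solve(Y.T @ S, S.T)
print(np.allclose(mean_with_W_eq_H, mean_observable))   # True
```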

11. A Scaled Prior – probabilistic computation needs meaningful priors [Hennig, SIOPT, 2015]
▸ use $H_0 = \epsilon I$ with $\epsilon \ll 1$. It would be nice to have $W = V = H$:
  $\operatorname{var}(H_{ij}) = V_{ii} W_{jj} = H_{ii} H_{jj}$; for symmetric positive definite matrices, $H_{ii} > 0$ and $H_{ij}^2 \leq H_{ii} H_{jj}$
▸ if $W = V = H$:
  $p(H \mid S, Y) = \mathcal{N}\big[H;\ H_0 + (S - H_0 Y)(Y^\intercal S)^{-1} S^\intercal,\ W \otimes \big(W - S (Y^\intercal S)^{-1} S^\intercal\big)\big]$
▸ can choose conjugate directions, $S^\intercal A S = S^\intercal Y = \operatorname{diag}_i\{g_i\}$, using a Gram–Schmidt process: choose an orthogonal set $\{u_1, \dots, u_N\}$ and set
  $s_i = u_i - \sum_{j=1}^{i-1} s_j \frac{y_j^\intercal u_i}{y_j^\intercal s_j}$,
  then
  $\mathbb{E}_{\mid S,Y}[H] = H_0 + \sum_{m=1}^{M} \frac{(s_m - H_0 y_m)\, s_m^\intercal}{y_m^\intercal s_m}$
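A sketch of the resulting iteration (not the paper's implementation; the test matrix and the orthogonal direction set are generated at random here): Gram–Schmidt conjugation of the directions, one matrix-vector product per step, and rank-one updates of the posterior mean, which after $N$ steps equals $A^{-1}$.

```python
# Conjugate directions by Gram-Schmidt and rank-one updates of the posterior mean:
#   E[H | S, Y] = H_0 + sum_m (s_m - H_0 y_m) s_m^T / (y_m^T s_m).
import numpy as np

rng = np.random.default_rng(5)
N = 8
B = rng.standard_normal((N, N))
A = B @ B.T + N * np.eye(N)

U, _ = np.linalg.qr(rng.standard_normal((N, N)))   # an orthogonal set {u_1, ..., u_N}
H0 = 1e-3 * np.eye(N)                              # H_0 = eps * I as on the slide

S_cols, Y_cols = [], []
H_est = H0.copy()
for i in range(N):
    s = U[:, i].copy()
    for s_j, y_j in zip(S_cols, Y_cols):
        s -= s_j * (y_j @ U[:, i]) / (y_j @ s_j)   # enforce y_j^T s = s_j^T A s = 0 for j < i
    y = A @ s                                      # one matrix-vector "observation" per step
    H_est += np.outer(s - H0 @ y, s) / (y @ s)     # rank-one update of the posterior mean
    S_cols.append(s)
    Y_cols.append(y)

print(np.allclose(H_est, np.linalg.inv(A)))        # True: after N steps the mean is A^{-1}
```

With the unit vectors $e_i$ in place of the random orthogonal set, the same loop is the construction that the next slide identifies with Gaussian elimination.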

12. Active Learning of Matrix Inverses – Gaussian Elimination [C. F. Gauss, 1809]
Which set of orthogonal directions should we choose?
▸ e.g. $\{u_1, \dots, u_N\} = \{e_1, \dots, e_N\}$
[figure: $S$, $Y$, $p(H)$, $A \cdot H_M$, $H_{\text{true}}$]
Gaussian elimination of $A$ is maximum a posteriori estimation of $H$ under a well-scaled Gaussian prior, if the search directions are chosen from the unit vectors.

13. Gaussian elimination as MAP inference:
▸ decide to use a Gaussian prior
▸ a factorization assumption (Kronecker structure) in the covariance gives a simple update
▸ implicitly choosing "$W = H$" gives a well-scaled prior
▸ conjugate directions for efficient bookkeeping
▸ construct projections from unit vectors

14. What about Uncertainty? – calibrating the prior covariance at runtime [Hennig, SIOPT, 2015]
Under "$W = H$",
  $p(H \mid S, Y) = \mathcal{N}\big[H;\ H_0 + (S - H_0 Y)(Y^\intercal S)^{-1} S^\intercal,\ W \otimes \big(W - S (Y^\intercal S)^{-1} S^\intercal\big)\big]$,
we just need $WY = S$. So choose
  $W = S (Y^\intercal S)^{-1} S^\intercal + \big(I - Y (Y^\intercal Y)^{-1} Y^\intercal\big)\, \Omega\, \big(I - Y (Y^\intercal Y)^{-1} Y^\intercal\big)$
[figure: $y_m^\intercal s_m$ over steps $m = 1, \dots, 30$; posterior factor $W_M$ for $W_0 = H$ vs. $W_M$ for $W_0$ estimated]
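A hedged sketch of this choice ($\Omega$ is taken as a scaled identity purely for illustration): whatever $\Omega$ is, the constructed $W$ satisfies $WY = S$ exactly, so the posterior mean is unchanged while the unexplored subspace keeps a free scale.

```python
# The empirical-Bayes covariance factor
#   W = S (Y^T S)^{-1} S^T + (I - Y(Y^T Y)^{-1} Y^T) Omega (I - Y(Y^T Y)^{-1} Y^T)
# satisfies W Y = S for any Omega, so the posterior mean is left untouched.
import numpy as np

rng = np.random.default_rng(6)
N, M = 8, 3
B = rng.standard_normal((N, N))
A = B @ B.T + N * np.eye(N)
S = rng.standard_normal((N, M))
Y = A @ S

P = np.eye(N) - Y @ np.linalg.solve(Y.T @ Y, Y.T)   # projector onto the complement of span(Y)
Omega = 2.0 * np.eye(N)                             # free scale for the unexplored subspace
W = S @ np.linalg.solve(Y.T @ S, S.T) + P @ Omega @ P

print(np.allclose(W @ Y, S))                        # True: the required consistency W Y = S
```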

15. ▸ a scaled, structured prior with exploration along the unit vectors gives Gaussian elimination
▸ empirical Bayesian estimation of the covariance gives scaled posterior uncertainty and retains the classic estimate, at very low cost overhead

16. Can we do better than Gaussian Elimination? – encode symmetry $H = H^\intercal$ [Hennig, SIOPT, 2015]
▸ using the anti-symmetric projection $\Gamma \vec{H} = \tfrac{1}{2}\big(\vec{H} - \overrightarrow{H^\intercal}\big)$, encode symmetry as $p(\text{symm.} \mid H) = \lim_{\beta \to 0} \mathcal{N}(0; \Gamma \vec{H}, \beta)$, giving
  $p(H \mid \text{symm.}) = \mathcal{N}(\vec{H}; \vec{H}_0, W \circledast W)$, where $\circledast$ is the symmetric Kronecker product, $(W \circledast W)_{ij,k\ell} = \tfrac{1}{2}\big(W_{ik} W_{j\ell} + W_{i\ell} W_{jk}\big)$
[figure: samples $H \sim \mathcal{N}(H_0, W \otimes W)$ vs. $H \sim \mathcal{N}(H_0, W \circledast W)$]
▸ $p(S, Y \mid H) = \delta(S - HY)$ now gives (with $\Delta = S - H_0 Y$, $G = Y^\intercal W Y$)
  $p(H \mid S, Y) = \mathcal{N}\big[H;\ H_0 + \Delta G^{-1} Y^\intercal W + W Y G^{-1} \Delta^\intercal - W Y G^{-1} \Delta^\intercal Y G^{-1} Y^\intercal W,\ \big(W - W Y G^{-1} Y^\intercal W\big) \circledast \big(W - W Y G^{-1} Y^\intercal W\big)\big]$
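A minimal sketch of this symmetric update (toy sizes; $W = I$ and $H_0 = \epsilon I$ are stand-in choices, not the calibrated "$W = H$" factor): the mean is symmetric by construction and still reproduces the observations $HY = S$.

```python
# Posterior mean under the symmetry-encoding prior, with Delta = S - H_0 Y, G = Y^T W Y:
#   H_0 + Delta G^{-1} Y^T W + W Y G^{-1} Delta^T - W Y G^{-1} Delta^T Y G^{-1} Y^T W
import numpy as np

rng = np.random.default_rng(7)
N, M = 7, 3
B = rng.standard_normal((N, N))
A = B @ B.T + N * np.eye(N)
S = rng.standard_normal((N, M))
Y = A @ S

H0 = 1e-3 * np.eye(N)                  # symmetric prior mean, H_0 = eps * I
W = np.eye(N)                          # symmetric prior factor (a stand-in choice)

Delta = S - H0 @ Y
G = Y.T @ W @ Y
GinvYtW = np.linalg.solve(G, Y.T @ W)  # G^{-1} Y^T W

H_mean = (H0 + Delta @ GinvYtW + GinvYtW.T @ Delta.T
          - GinvYtW.T @ (Delta.T @ Y) @ GinvYtW)

print(np.allclose(H_mean, H_mean.T))   # True: symmetry is built in
print(np.allclose(H_mean @ Y, S))      # True: the observations are reproduced
```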
