
Understanding and Accelerating Particle-Based Variational Inference - PowerPoint PPT Presentation



  1. Understanding and Accelerating Particle-Based Variational Inference. Chang Liu†, Jingwei Zhuo†, Pengyu Cheng‡, Ruiyi Zhang‡, Jun Zhu†§, Lawrence Carin‡§. ICML 2019. †Tsinghua University; ‡Duke University; §Corresponding authors.

  2. Introduction
Particle-based Variational Inference methods (ParVIs):
• Represent the variational distribution $q$ by particles, and update the particles to minimize $\mathrm{KL}_p(q) := \mathrm{KL}(q \,\|\, p)$.
• More flexible than classical VIs; more particle-efficient than MCMC.
Related work:
• Stein Variational Gradient Descent (SVGD) [3] simulates the gradient flow (the steepest-descent curve) of $\mathrm{KL}_p$ on $\mathcal{P}_{\mathcal{H}}(\mathcal{X})$ [2].
• The Blob and DGF methods [1] simulate the gradient flow of $\mathrm{KL}_p$ on the Wasserstein space $\mathcal{P}_2(\mathcal{X})$.
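As a rough illustration of the ParVI recipe summarized above (not code from the paper), the sketch below runs a generic particle-update loop; `velocity` stands for any ParVI update rule (SVGD, Blob, GFSD, or GFSF), and all names are placeholders.

```python
import numpy as np

def run_parvi(velocity, x0, step_size=1e-2, n_iters=1000):
    """Generic ParVI loop: repeatedly move the particles along a velocity
    field that (approximately) descends KL(q || p).  `velocity` maps an
    (N, D) particle array to an (N, D) array of update directions."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        x = x + step_size * velocity(x)
    return x
```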

  3. ParVIs Approximate the $\mathcal{P}_2(\mathcal{X})$ (Wasserstein) Gradient Flow
Remark 1. Existing ParVI methods approximate the Wasserstein gradient flow by smoothing the density or smoothing functions.
Smoothing the density:
• Blob [1] partially smooths the density: $v_{\mathrm{GF}} = -\nabla \tfrac{\delta}{\delta q} E_q[\log(q/p)] \;\Rightarrow\; v_{\mathrm{Blob}} = -\nabla \tfrac{\delta}{\delta q} E_q[\log(\tilde q/p)]$.
• GFSD fully smooths the density: $v_{\mathrm{GF}} := \nabla \log p - \nabla \log q \;\Rightarrow\; v_{\mathrm{GFSD}} := \nabla \log p - \nabla \log \tilde q$.
Smoothing functions:
• SVGD restricts the optimization domain from $L^2_q$ to $\mathcal{H}^D$.
• GFSF smooths functions in a similar way: $\hat v_{\mathrm{GFSF}} = \hat g + \hat K' \hat K^{-1}$, and $\hat v_{\mathrm{SVGD}} = \hat v_{\mathrm{GFSF}} \hat K$ (note the extra $\hat K$), where $\hat g_{:,i} = \nabla_{x^{(i)}} \log p(x^{(i)})$, $\hat K_{ij} = K(x^{(i)}, x^{(j)})$, and $\hat K'_{:,i} = \sum_j \nabla_{x^{(j)}} K(x^{(j)}, x^{(i)})$.
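To make the matrix notation concrete, here is a minimal NumPy sketch of the SVGD and GFSF velocity fields with an RBF kernel, following the formulas on this slide; the $1/N$ averaging in SVGD and the small ridge term in GFSF are implementation choices assumed here rather than taken from the slide, and `grad_log_p` is user-supplied.

```python
import numpy as np

def rbf_kernel(x, h):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 h^2)) and
    Kp, where Kp[i] = sum_j grad_{x_j} K(x_j, x_i)."""
    diff = x[:, None, :] - x[None, :, :]              # diff[i, j] = x_i - x_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))
    # For the RBF kernel, grad_{x_j} K(x_j, x_i) = (x_i - x_j) K_ji / h^2.
    Kp = np.einsum('ji,jid->id', K, -diff) / h ** 2   # -diff[j, i] = x_i - x_j
    return K, Kp

def svgd_velocity(x, grad_log_p, h):
    """v_SVGD = (g K + K') / N in row convention; the 1/N is the usual
    SVGD averaging and can be absorbed into the step size."""
    K, Kp = rbf_kernel(x, h)
    g = grad_log_p(x)                                 # (N, D): rows grad log p(x_i)
    return (K @ g + Kp) / x.shape[0]

def gfsf_velocity(x, grad_log_p, h, ridge=1e-6):
    """v_GFSF = g + K' K^{-1}; a small ridge keeps the Gram matrix invertible."""
    K, Kp = rbf_kernel(x, h)
    return grad_log_p(x) + np.linalg.solve(K + ridge * np.eye(len(x)), Kp)
```

For example, a standard Gaussian target corresponds to `grad_log_p = lambda x: -x`.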

  4. ParVIs Approximate the $\mathcal{P}_2(\mathcal{X})$ Gradient Flow by Smoothing
• Equivalence: the smoothing-function objective can be written as $E_q[L(v)]$ with $L: L^2_q \to L^2_q$ linear, and
$E_{\tilde q}[L(v)] = E_{q \ast K}[L(v)] = E_q[L(v) \ast K] = E_q[L(v \ast K)]$,
so smoothing the density and smoothing the function yield the same objective.
• Necessity: $\mathrm{grad}\,\mathrm{KL}_p(q)$ is undefined at the empirical distribution $q = \hat q := \tfrac{1}{N}\sum_{i=1}^N \delta_{x^{(i)}}$.
Theorem 2 (Necessity of smoothing for SVGD). For $q = \hat q$ and $v \in L^2_p$, the problem $\max_{v \in L^2_p,\, \|v\|_{L^2_p}=1} \langle v_{\mathrm{GF}}, v \rangle_{L^2_{\hat q}}$ has no optimal solution.
ParVIs rely on the smoothing assumption: no free lunch!
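The middle step of the equivalence chain is a change of variables; as a sketch, assuming a symmetric kernel $K(x, y) = K(y, x)$ (e.g., RBF) and writing convolution with the kernel as $\ast K$,
\[
E_{q \ast K}[f] \;=\; \int\!\!\int q(y)\, K(x, y)\, f(x)\, \mathrm{d}y\, \mathrm{d}x
\;=\; \int q(y) \int K(y, x)\, f(x)\, \mathrm{d}x\, \mathrm{d}y
\;=\; E_q[f \ast K],
\]
applied with $f = L(v)$; the final equality in the chain additionally uses that convolution with $K$ commutes with the linear map $L$.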

  5. Bandwidth Selection via the Heat Equation
Note: under the dynamics $\mathrm{d}x = -\nabla \log q_t(x)\, \mathrm{d}t$, the density $q_t$ evolves following the heat equation (HE): $\partial_t q_t(x) = \Delta q_t(x)$.
Figure 1: Comparison of HE (bottom row) with the median method (top row) for bandwidth selection, for SVGD, Blob, GFSD, and GFSF (columns).
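As a quick sanity check of the Note (not the paper's bandwidth-selection procedure itself), the toy sketch below simulates $\mathrm{d}x = -\nabla\log\tilde q_t(x)\,\mathrm{d}t$ with a Gaussian KDE $\tilde q$ standing in for $q_t$; for a Gaussian initial density, the heat equation predicts the variance to grow from $\sigma_0^2$ to roughly $\sigma_0^2 + 2t$. The particle count, bandwidth, and step size are arbitrary choices.

```python
import numpy as np

def kde_grad_log_q(x, h):
    """Gradient of log of a Gaussian-KDE estimate of q at each particle."""
    diff = x[:, None, :] - x[None, :, :]                  # diff[i, j] = x_i - x_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))
    # grad_{x_i} log q~(x_i) = sum_j K_ij (x_j - x_i) / (h^2 sum_j K_ij)
    num = np.einsum('ij,ijd->id', K, -diff) / h ** 2
    return num / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(scale=1.0, size=(2000, 1))                 # q_0 = N(0, 1)
dt, h, t_end = 1e-3, 0.3, 0.5
for _ in range(int(t_end / dt)):
    x = x - dt * kde_grad_log_q(x, h)                     # dx = -grad log q~ dt
print(x.var())  # heat equation predicts ~ 1 + 2 * t_end = 2.0, approximately,
                # up to KDE bias from the finite bandwidth and particle count
```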

  6. Nesterov's Acceleration Method on Riemannian Manifolds
• Riemannian Accelerated Gradient (RAG) [4] (with simplification):
$q_k = \mathrm{Exp}_{r_{k-1}}(\varepsilon v_{k-1})$,
$r_k = \mathrm{Exp}_{q_k}\!\Big(-\Gamma^{q_k}_{r_{k-1}}\big[\tfrac{k-1}{k}\,\mathrm{Exp}^{-1}_{r_{k-1}}(q_{k-1}) - \tfrac{k+\alpha-2}{k}\,\varepsilon v_{k-1}\big]\Big)$.
• Riemannian Nesterov's method (RNes) [5] (with simplification):
$q_k = \mathrm{Exp}_{r_{k-1}}(\varepsilon v_{k-1})$,
$r_k = \mathrm{Exp}_{q_k}\!\Big(c_1\,\mathrm{Exp}^{-1}_{q_k}\big(\mathrm{Exp}_{r_{k-1}}\big[(1-c_2)\,\mathrm{Exp}^{-1}_{r_{k-1}}(q_{k-1}) + c_2\,\mathrm{Exp}^{-1}_{r_{k-1}}(q_k)\big]\big)\Big)$.
• The inverse exponential map is computationally expensive.
Proposition 3 (Inverse exponential map). For pairwise close samples $\{x^{(i)}\}_i$ of $q$ and $\{y^{(i)}\}_i$ of $r$, we have $\big[\mathrm{Exp}^{-1}_q(r)\big](x^{(i)}) \approx y^{(i)} - x^{(i)}$.
• Parallel transport is hard to implement.
Proposition 4 (Parallel transport). For pairwise close samples $\{x^{(i)}\}_i$ of $q$ and $\{y^{(i)}\}_i$ of $r$, we have $\big[\Gamma^r_q(v)\big](y^{(i)}) \approx v(x^{(i)})$ for all $v \in T_q\mathcal{P}_2$.
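A sketch of how Propositions 3 and 4 reduce the simplified RAG recursion shown above to the per-particle WAG update of Algorithm 1 on the next slide, under the stated pairwise-closeness assumption: write $x^{(i)}_k$ for the samples of $q_k$ and $y^{(i)}_k$ for those of $r_k$, replace $\Gamma^{q_k}_{r_{k-1}}$ by the identity (Proposition 4) and $\mathrm{Exp}^{-1}_{r_{k-1}}(q_{k-1})$ by $x^{(i)}_{k-1} - y^{(i)}_{k-1}$ (Proposition 3). The first RAG line then gives $x^{(i)}_k = y^{(i)}_{k-1} + \varepsilon v(y^{(i)}_{k-1})$, and the second gives
\[
y^{(i)}_k \;\approx\; x^{(i)}_k - \Big[\tfrac{k-1}{k}\big(x^{(i)}_{k-1} - y^{(i)}_{k-1}\big) - \tfrac{k+\alpha-2}{k}\,\varepsilon\, v\big(y^{(i)}_{k-1}\big)\Big]
\;=\; x^{(i)}_k + \tfrac{k-1}{k}\big(y^{(i)}_{k-1} - x^{(i)}_{k-1}\big) + \tfrac{k+\alpha-2}{k}\,\varepsilon\, v\big(y^{(i)}_{k-1}\big),
\]
which is the WAG step of Algorithm 1; the WNes step follows from RNes in the same way.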

  7. Acceleration Framework for ParVIs
Algorithm 1: The acceleration framework with Wasserstein Accelerated Gradient (WAG) and Wasserstein Nesterov's method (WNes)
1: WAG: select acceleration factor $\alpha > 3$; WNes: select or calculate $c_1, c_2 \in \mathbb{R}_+$;
2: Initialize $\{x^{(i)}_0\}_{i=1}^N$ distinctly; let $y^{(i)}_0 = x^{(i)}_0$;
3: for $k = 1, 2, \dots, k_{\max}$ do
4:   for $i = 1, \dots, N$ do
5:     Find $v(y^{(i)}_{k-1})$ by SVGD/Blob/DGF/GFSD/GFSF;
6:     $x^{(i)}_k = y^{(i)}_{k-1} + \varepsilon\, v(y^{(i)}_{k-1})$;
7:     WAG: $y^{(i)}_k = x^{(i)}_k + \tfrac{k-1}{k}\big(y^{(i)}_{k-1} - x^{(i)}_{k-1}\big) + \tfrac{k+\alpha-2}{k}\,\varepsilon\, v(y^{(i)}_{k-1})$;
       WNes: $y^{(i)}_k = x^{(i)}_k + c_1(c_2 - 1)\big(x^{(i)}_k - x^{(i)}_{k-1}\big)$;
8:   end for
9: end for
10: Return $\{x^{(i)}_{k_{\max}}\}_{i=1}^N$.
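Below is a minimal NumPy sketch of Algorithm 1 (one reading of the pseudocode above, not the authors' released code); `velocity` is any base ParVI update rule, such as `svgd_velocity` from the slide 3 sketch, and the default values of `alpha`, `c1`, `c2` are placeholder choices, the slide only requiring $\alpha > 3$ and $c_1, c_2 \in \mathbb{R}_+$.

```python
import numpy as np

def accelerated_parvi(velocity, x0, eps=1e-2, k_max=1000,
                      method='WAG', alpha=3.5, c1=1.0, c2=0.5):
    """Acceleration framework of Algorithm 1: WAG or WNes wrapped around a
    base ParVI velocity field (SVGD/Blob/DGF/GFSD/GFSF)."""
    x_prev = np.array(x0, dtype=float)    # x_{k-1}; y_0 = x_0
    y = x_prev.copy()
    for k in range(1, k_max + 1):
        v = velocity(y)                   # line 5: v(y_{k-1}) from the base ParVI
        x = y + eps * v                   # line 6: gradient step from y_{k-1}
        if method == 'WAG':               # line 7 (WAG)
            y = x + (k - 1) / k * (y - x_prev) + (k + alpha - 2) / k * eps * v
        else:                             # line 7 (WNes)
            y = x + c1 * (c2 - 1) * (x - x_prev)
        x_prev = x
    return x_prev
```

For instance, `accelerated_parvi(lambda z: svgd_velocity(z, grad_log_p, h=0.5), x0)` would run WAG-accelerated SVGD under these assumptions.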

  8. Bayesian Logistic Regression (BLR)
Figure 2: Acceleration effect of WAG and WNes on BLR on the Covertype dataset, measured by prediction accuracy on the test dataset (accuracy vs. iteration). The four panels compare the WGD, PO, WAG, and WNes variants of SVGD, Blob, GFSD, and GFSF. Each curve is averaged over 10 runs.

  9. Latent Dirichlet Allocation (LDA)
Figure 3: Acceleration effect of WAG and WNes on LDA, measured by hold-out perplexity (vs. iteration). The four panels compare the WGD, PO, WAG, and WNes variants of SVGD, Blob, GFSD, and GFSF. Curves are averaged over 10 runs.
Figure 4: Comparison of SVGD and SGNHT (sequential and parallel) on LDA, as representatives of ParVIs and MCMCs, measured by hold-out perplexity. Averaged over 10 runs.

  10. Summary
Contributions (in theory):
• ParVIs approximate the Wasserstein gradient flow and necessarily rely on a smoothing assumption to do so.
• ParVIs either smooth the density or smooth functions, and the two approaches are equivalent.
Contributions (in practice):
• Two new ParVIs: GFSF and GFSD.
• A principled bandwidth-selection method for the smoothing kernel.
• An acceleration framework for general ParVIs.

  11. References
[1] Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659, 2018.
[2] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3118–3126, 2017.
[3] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378–2386, 2016.

  12. References (continued)
[4] Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4875–4884, 2017.
[5] Hongyi Zhang and Suvrit Sra. An estimate sequence for geodesically convex optimization. In Conference on Learning Theory, pages 1703–1723, 2018.
