
Understanding and Accelerating Particle-Based Variational Inference - PowerPoint PPT Presentation



  1. Understanding and Accelerating Particle-Based Variational Inference. Chang Liu†, Jingwei Zhuo†, Pengyu Cheng‡, Ruiyi Zhang‡, Jun Zhu†§, Lawrence Carin‡§. ICML 2019. †Tsinghua University; ‡Duke University; §Corresponding authors.

  2. Introduction
Particle-based Variational Inference methods (ParVIs):
• Represent the variational distribution $q$ by particles, and update the particles to minimize $\mathrm{KL}_p(q) := \mathrm{KL}(q \,\|\, p)$.
• More flexible than classical VIs; more particle-efficient than MCMC.
Related work:
• Stein Variational Gradient Descent (SVGD) [3] simulates the gradient flow (the steepest-descent curve) of $\mathrm{KL}_p$ on $\mathcal{P}_{\mathcal{H}}(\mathcal{X})$ [2].
• The Blob and DGF methods [1] simulate the gradient flow of $\mathrm{KL}_p$ on the Wasserstein space $\mathcal{P}_2(\mathcal{X})$.
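As a rough illustration of the ParVI recipe summarized above (not code from the paper), the sketch below runs a generic particle-update loop; `velocity` stands for any ParVI update rule (SVGD, Blob, GFSD, or GFSF), and all names are placeholders.

```python
import numpy as np

def run_parvi(velocity, x0, step_size=1e-2, n_iters=1000):
    """Generic ParVI loop: repeatedly move the particles along a velocity
    field that (approximately) descends KL(q || p).  `velocity` maps an
    (N, D) particle array to an (N, D) array of update directions."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        x = x + step_size * velocity(x)
    return x
```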

  3. ParVIs Approximate the $\mathcal{P}_2(\mathcal{X})$ (Wasserstein) Gradient Flow
Remark 1. Existing ParVI methods approximate the Wasserstein gradient flow by smoothing the density or smoothing functions.
Smoothing the density:
• Blob [1] partially smooths the density: $v_{\mathrm{GF}} = -\nabla \tfrac{\delta}{\delta q} E_q[\log(q/p)] \;\Rightarrow\; v_{\mathrm{Blob}} = -\nabla \tfrac{\delta}{\delta q} E_q[\log(\tilde q/p)]$.
• GFSD fully smooths the density: $v_{\mathrm{GF}} := \nabla \log p - \nabla \log q \;\Rightarrow\; v_{\mathrm{GFSD}} := \nabla \log p - \nabla \log \tilde q$.
Smoothing functions:
• SVGD restricts the optimization domain from $L^2_q$ to $\mathcal{H}^D$.
• GFSF smooths functions in a similar way: $\hat v_{\mathrm{GFSF}} = \hat g + \hat K' \hat K^{-1}$, and $\hat v_{\mathrm{SVGD}} = \hat v_{\mathrm{GFSF}} \hat K$ (note the extra $\hat K$), where $\hat g_{:,i} = \nabla_{x^{(i)}} \log p(x^{(i)})$, $\hat K_{ij} = K(x^{(i)}, x^{(j)})$, and $\hat K'_{:,i} = \sum_j \nabla_{x^{(j)}} K(x^{(j)}, x^{(i)})$.
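To make the matrix notation concrete, here is a minimal NumPy sketch of the SVGD and GFSF velocity fields with an RBF kernel, following the formulas on this slide; the $1/N$ averaging in SVGD and the small ridge term in GFSF are implementation choices assumed here rather than taken from the slide, and `grad_log_p` is user-supplied.

```python
import numpy as np

def rbf_kernel(x, h):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 h^2)) and
    Kp, where Kp[i] = sum_j grad_{x_j} K(x_j, x_i)."""
    diff = x[:, None, :] - x[None, :, :]              # diff[i, j] = x_i - x_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))
    # For the RBF kernel, grad_{x_j} K(x_j, x_i) = (x_i - x_j) K_ji / h^2.
    Kp = np.einsum('ji,jid->id', K, -diff) / h ** 2   # -diff[j, i] = x_i - x_j
    return K, Kp

def svgd_velocity(x, grad_log_p, h):
    """v_SVGD = (g K + K') / N in row convention; the 1/N is the usual
    SVGD averaging and can be absorbed into the step size."""
    K, Kp = rbf_kernel(x, h)
    g = grad_log_p(x)                                 # (N, D): rows grad log p(x_i)
    return (K @ g + Kp) / x.shape[0]

def gfsf_velocity(x, grad_log_p, h, ridge=1e-6):
    """v_GFSF = g + K' K^{-1}; a small ridge keeps the Gram matrix invertible."""
    K, Kp = rbf_kernel(x, h)
    return grad_log_p(x) + np.linalg.solve(K + ridge * np.eye(len(x)), Kp)
```

For example, a standard Gaussian target corresponds to `grad_log_p = lambda x: -x`.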

  4. ParVIs Approximate the $\mathcal{P}_2(\mathcal{X})$ Gradient Flow by Smoothing
• Equivalence: the smoothing-function objective can be written as $E_q[L(v)]$ with $L: L^2_q \to L^2_q$ linear, and
$E_{\tilde q}[L(v)] = E_{q \ast K}[L(v)] = E_q[L(v) \ast K] = E_q[L(v \ast K)]$,
so smoothing the density and smoothing the function yield the same objective.
• Necessity: $\mathrm{grad}\,\mathrm{KL}_p(q)$ is undefined at the empirical distribution $q = \hat q := \tfrac{1}{N}\sum_{i=1}^N \delta_{x^{(i)}}$.
Theorem 2 (Necessity of smoothing for SVGD). For $q = \hat q$ and $v \in L^2_p$, the problem $\max_{v \in L^2_p,\, \|v\|_{L^2_p}=1} \langle v_{\mathrm{GF}}, v \rangle_{L^2_{\hat q}}$ has no optimal solution.
ParVIs rely on the smoothing assumption: no free lunch!
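The middle step of the equivalence chain is a change of variables; as a sketch, assuming a symmetric kernel $K(x, y) = K(y, x)$ (e.g., RBF) and writing convolution with the kernel as $\ast K$,
\[
E_{q \ast K}[f] \;=\; \int\!\!\int q(y)\, K(x, y)\, f(x)\, \mathrm{d}y\, \mathrm{d}x
\;=\; \int q(y) \int K(y, x)\, f(x)\, \mathrm{d}x\, \mathrm{d}y
\;=\; E_q[f \ast K],
\]
applied with $f = L(v)$; the final equality in the chain additionally uses that convolution with $K$ commutes with the linear map $L$.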

  5. Bandwidth Selection via the Heat Equation
Note: under the dynamics $\mathrm{d}x = -\nabla \log q_t(x)\, \mathrm{d}t$, the density $q_t$ evolves following the heat equation (HE): $\partial_t q_t(x) = \Delta q_t(x)$.
Figure 1: Comparison of HE (bottom row) with the median method (top row) for bandwidth selection, for SVGD, Blob, GFSD, and GFSF (columns).
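As a quick sanity check of the Note (not the paper's bandwidth-selection procedure itself), the toy sketch below simulates $\mathrm{d}x = -\nabla\log\tilde q_t(x)\,\mathrm{d}t$ with a Gaussian KDE $\tilde q$ standing in for $q_t$; for a Gaussian initial density, the heat equation predicts the variance to grow from $\sigma_0^2$ to roughly $\sigma_0^2 + 2t$. The particle count, bandwidth, and step size are arbitrary choices.

```python
import numpy as np

def kde_grad_log_q(x, h):
    """Gradient of log of a Gaussian-KDE estimate of q at each particle."""
    diff = x[:, None, :] - x[None, :, :]                  # diff[i, j] = x_i - x_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))
    # grad_{x_i} log q~(x_i) = sum_j K_ij (x_j - x_i) / (h^2 sum_j K_ij)
    num = np.einsum('ij,ijd->id', K, -diff) / h ** 2
    return num / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(scale=1.0, size=(2000, 1))                 # q_0 = N(0, 1)
dt, h, t_end = 1e-3, 0.3, 0.5
for _ in range(int(t_end / dt)):
    x = x - dt * kde_grad_log_q(x, h)                     # dx = -grad log q~ dt
print(x.var())  # heat equation predicts ~ 1 + 2 * t_end = 2.0, approximately,
                # up to KDE bias from the finite bandwidth and particle count
```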

  6. Nesterov's Acceleration Method on Riemannian Manifolds
• Riemannian Accelerated Gradient (RAG) [4] (with simplification):
$q_k = \mathrm{Exp}_{r_{k-1}}(\varepsilon v_{k-1})$,
$r_k = \mathrm{Exp}_{q_k}\!\Big(-\Gamma^{q_k}_{r_{k-1}}\big[\tfrac{k-1}{k}\,\mathrm{Exp}^{-1}_{r_{k-1}}(q_{k-1}) - \tfrac{k+\alpha-2}{k}\,\varepsilon v_{k-1}\big]\Big)$.
• Riemannian Nesterov's method (RNes) [5] (with simplification):
$q_k = \mathrm{Exp}_{r_{k-1}}(\varepsilon v_{k-1})$,
$r_k = \mathrm{Exp}_{q_k}\!\Big(c_1\,\mathrm{Exp}^{-1}_{q_k}\big(\mathrm{Exp}_{r_{k-1}}\big[(1-c_2)\,\mathrm{Exp}^{-1}_{r_{k-1}}(q_{k-1}) + c_2\,\mathrm{Exp}^{-1}_{r_{k-1}}(q_k)\big]\big)\Big)$.
• The inverse exponential map is computationally expensive.
Proposition 3 (Inverse exponential map). For pairwise close samples $\{x^{(i)}\}_i$ of $q$ and $\{y^{(i)}\}_i$ of $r$, we have $\big[\mathrm{Exp}^{-1}_q(r)\big](x^{(i)}) \approx y^{(i)} - x^{(i)}$.
• Parallel transport is hard to implement.
Proposition 4 (Parallel transport). For pairwise close samples $\{x^{(i)}\}_i$ of $q$ and $\{y^{(i)}\}_i$ of $r$, we have $\big[\Gamma^r_q(v)\big](y^{(i)}) \approx v(x^{(i)})$ for all $v \in T_q\mathcal{P}_2$.
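A sketch of how Propositions 3 and 4 reduce the simplified RAG recursion shown above to the per-particle WAG update of Algorithm 1 on the next slide, under the stated pairwise-closeness assumption: write $x^{(i)}_k$ for the samples of $q_k$ and $y^{(i)}_k$ for those of $r_k$, replace $\Gamma^{q_k}_{r_{k-1}}$ by the identity (Proposition 4) and $\mathrm{Exp}^{-1}_{r_{k-1}}(q_{k-1})$ by $x^{(i)}_{k-1} - y^{(i)}_{k-1}$ (Proposition 3). The first RAG line then gives $x^{(i)}_k = y^{(i)}_{k-1} + \varepsilon v(y^{(i)}_{k-1})$, and the second gives
\[
y^{(i)}_k \;\approx\; x^{(i)}_k - \Big[\tfrac{k-1}{k}\big(x^{(i)}_{k-1} - y^{(i)}_{k-1}\big) - \tfrac{k+\alpha-2}{k}\,\varepsilon\, v\big(y^{(i)}_{k-1}\big)\Big]
\;=\; x^{(i)}_k + \tfrac{k-1}{k}\big(y^{(i)}_{k-1} - x^{(i)}_{k-1}\big) + \tfrac{k+\alpha-2}{k}\,\varepsilon\, v\big(y^{(i)}_{k-1}\big),
\]
which is the WAG step of Algorithm 1; the WNes step follows from RNes in the same way.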

  7. Acceleration Framework for ParVIs
Algorithm 1: The acceleration framework with Wasserstein Accelerated Gradient (WAG) and Wasserstein Nesterov's method (WNes)
1: WAG: select acceleration factor $\alpha > 3$; WNes: select or calculate $c_1, c_2 \in \mathbb{R}_+$;
2: Initialize $\{x^{(i)}_0\}_{i=1}^N$ distinctly; let $y^{(i)}_0 = x^{(i)}_0$;
3: for $k = 1, 2, \dots, k_{\max}$ do
4:   for $i = 1, \dots, N$ do
5:     Find $v(y^{(i)}_{k-1})$ by SVGD/Blob/DGF/GFSD/GFSF;
6:     $x^{(i)}_k = y^{(i)}_{k-1} + \varepsilon\, v(y^{(i)}_{k-1})$;
7:     WAG: $y^{(i)}_k = x^{(i)}_k + \tfrac{k-1}{k}\big(y^{(i)}_{k-1} - x^{(i)}_{k-1}\big) + \tfrac{k+\alpha-2}{k}\,\varepsilon\, v(y^{(i)}_{k-1})$;
       WNes: $y^{(i)}_k = x^{(i)}_k + c_1(c_2 - 1)\big(x^{(i)}_k - x^{(i)}_{k-1}\big)$;
8:   end for
9: end for
10: Return $\{x^{(i)}_{k_{\max}}\}_{i=1}^N$.
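Below is a minimal NumPy sketch of Algorithm 1 (one reading of the pseudocode above, not the authors' released code); `velocity` is any base ParVI update rule, such as `svgd_velocity` from the slide 3 sketch, and the default values of `alpha`, `c1`, `c2` are placeholder choices, the slide only requiring $\alpha > 3$ and $c_1, c_2 \in \mathbb{R}_+$.

```python
import numpy as np

def accelerated_parvi(velocity, x0, eps=1e-2, k_max=1000,
                      method='WAG', alpha=3.5, c1=1.0, c2=0.5):
    """Acceleration framework of Algorithm 1: WAG or WNes wrapped around a
    base ParVI velocity field (SVGD/Blob/DGF/GFSD/GFSF)."""
    x_prev = np.array(x0, dtype=float)    # x_{k-1}; y_0 = x_0
    y = x_prev.copy()
    for k in range(1, k_max + 1):
        v = velocity(y)                   # line 5: v(y_{k-1}) from the base ParVI
        x = y + eps * v                   # line 6: gradient step from y_{k-1}
        if method == 'WAG':               # line 7 (WAG)
            y = x + (k - 1) / k * (y - x_prev) + (k + alpha - 2) / k * eps * v
        else:                             # line 7 (WNes)
            y = x + c1 * (c2 - 1) * (x - x_prev)
        x_prev = x
    return x_prev
```

For instance, `accelerated_parvi(lambda z: svgd_velocity(z, grad_log_p, h=0.5), x0)` would run WAG-accelerated SVGD under these assumptions.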

  8. Bayesian Logistic Regression (BLR)
Figure 2: Acceleration effect of WAG and WNes on BLR on the Covertype dataset, measured by prediction accuracy on the test dataset (accuracy vs. iteration). The four panels compare the WGD, PO, WAG, and WNes variants of SVGD, Blob, GFSD, and GFSF. Each curve is averaged over 10 runs.

  9. Latent Dirichlet Allocation (LDA)
Figure 3: Acceleration effect of WAG and WNes on LDA, measured by hold-out perplexity (vs. iteration). The four panels compare the WGD, PO, WAG, and WNes variants of SVGD, Blob, GFSD, and GFSF. Curves are averaged over 10 runs.
Figure 4: Comparison of SVGD and SGNHT (sequential and parallel) on LDA, as representatives of ParVIs and MCMCs, measured by hold-out perplexity. Averaged over 10 runs.

  10. Summary
Contributions (in theory):
• ParVIs approximate the Wasserstein gradient flow and necessarily rely on a smoothing assumption to do so.
• ParVIs either smooth the density or smooth functions, and the two approaches are equivalent.
Contributions (in practice):
• Two new ParVIs: GFSF and GFSD.
• A principled bandwidth-selection method for the smoothing kernel.
• An acceleration framework for general ParVIs.

  11. References
[1] Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659, 2018.
[2] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3118–3126, 2017.
[3] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378–2386, 2016.

  12. References (continued)
[4] Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4875–4884, 2017.
[5] Hongyi Zhang and Suvrit Sra. An estimate sequence for geodesically convex optimization. In Conference on Learning Theory, pages 1703–1723, 2018.
