Stochastic Optimization for Regularized Wasserstein Estimators (ICML 2020)
  1. Stochastic Optimization for Regularized Wasserstein Estimators (ICML 2020). Marin Ballu, Quentin Berthet, Francis Bach.

  2. Wasserstein Distance: a natural geometry for distributions
     How does one compute the distance between two data distributions?

  3. Wasserstein Distance: a natural geometry for distributions
     How does one compute the distance between two data distributions?
     • Relative entropy and other f-divergences allow classical statistical approaches.

  4. Wasserstein Distance: a natural geometry for distributions
     How does one compute the distance between two data distributions?
     • Relative entropy and other f-divergences allow classical statistical approaches.
     • Optimal transport theory allows us to capture the geometry of the data distributions, with the Wasserstein distance (here in its Monge formulation over transport maps):
       $W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{T \# \mu = \nu} \mathbb{E}_{X \sim \mu}[c(X, T(X))]$

  5. Wasserstein Distance: a natural geometry for distributions
     How does one compute the distance between two data distributions?
     • Relative entropy and other f-divergences allow classical statistical approaches.
     • Optimal transport theory allows us to capture the geometry of the data distributions, with the Wasserstein distance (here in its Kantorovich formulation over couplings):
       $W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)]$
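Between two discrete measures, the Kantorovich problem above is a finite linear program and can be solved directly with an LP solver. Below is a minimal sketch (illustrative only; the names wasserstein_lp, mu, nu, C are ours, not the paper's):

```python
# Minimal sketch: the Kantorovich problem between discrete measures mu (size I)
# and nu (size J) with cost matrix C is a linear program over the coupling pi.
import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(mu, nu, C):
    I, J = C.shape
    # Equality constraints: row sums of pi equal mu, column sums equal nu.
    A_eq = np.zeros((I + J, I * J))
    for i in range(I):
        A_eq[i, i * J:(i + 1) * J] = 1.0   # sum_j pi[i, j] = mu[i]
    for j in range(J):
        A_eq[I + j, j::J] = 1.0            # sum_i pi[i, j] = nu[j]
    b_eq = np.concatenate([mu, nu])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun, res.x.reshape(I, J)    # OT value and optimal coupling

# Example: two small measures on the real line with squared-distance cost.
x, y = np.array([0.0, 1.0, 2.0]), np.array([0.5, 1.5])
mu, nu = np.ones(3) / 3, np.ones(2) / 2
C = (x[:, None] - y[None, :]) ** 2
print(wasserstein_lp(mu, nu, C)[0])
```

This exact LP has $I \times J$ variables, which scales poorly; that is part of what motivates the regularized formulations discussed next.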

  6. Wasserstein distance in machine learning
     • Wasserstein GAN (Arjovsky et al., 2017)
     • Wasserstein Discriminant Analysis (Flamary et al., 2018)
     • Clustered point-matching (Alvarez-Melis et al., 2018)

  7. Wasserstein distance in machine learning
     • Diffeomorphic registration (Feydy et al., 2017)
     • Sinkhorn divergence for generative models (Genevay et al., 2019)
     • Alignment of embeddings (Grave et al., 2019)

  8. Our contribution
     We consider the minimum Kantorovich estimator (Bassetti et al., 2006), or Wasserstein estimator, of the measure $\mu$:
       $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$,
     which is often used for $\mu = \sum_i \delta_{x_i}$ to fit a parametric model $\mathcal{M}$ (as with MLE, where the KL divergence replaces OT).
     [Diagram: $\mu$ projected onto the model $\mathcal{M}$ at $\nu$, along $\mathrm{OT}(\mu, \nu)$.]

  9. Our contribution
     • We add two layers of entropic regularization.
     • We propose a new stochastic optimization scheme to minimize the regularized problem.
     • Time per step is sublinear in the natural dimension of the problem.
     • We provide theoretical guarantees, and simulations.

  10. Regularized Wasserstein Distance
      Wasserstein distance:
        $W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)]$

  11. Regularized Wasserstein Distance
      Wasserstein distance:
        $W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)]$
      Regularized Wasserstein distance:
        $\mathrm{OT}_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)] + \varepsilon \, \mathrm{KL}(\pi, \mu \otimes \nu)$
      Computed at light speed by the Sinkhorn algorithm (Cuturi, 2013), or by SGD on the dual problem (Genevay et al., 2016).
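For a fixed pair of discrete measures, the regularized problem can be solved by Sinkhorn's matrix-scaling iterations. A minimal sketch, assuming strictly positive weights and using the $\mathrm{KL}(\pi, \mu \otimes \nu)$ penalty written above (illustrative code, not the authors' implementation):

```python
# Minimal Sinkhorn sketch for OT_eps with the KL(pi, mu (x) nu) penalty.
import numpy as np

def sinkhorn(mu, nu, C, eps, n_iters=500):
    # Gibbs kernel associated with KL(pi, mu (x) nu); assumes mu, nu > 0.
    K = mu[:, None] * nu[None, :] * np.exp(-C / eps)
    u, v = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v)        # enforce row marginals: sum_j pi[i, j] = mu[i]
        v = nu / (K.T @ u)      # enforce column marginals: sum_i pi[i, j] = nu[j]
    pi = u[:, None] * K * v[None, :]                    # regularized transport plan
    cost = np.sum(pi * C)                               # linear transport cost
    kl = np.sum(pi * np.log(pi / (mu[:, None] * nu[None, :])))
    return cost + eps * kl, pi                          # OT_eps value and plan
```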

  12. Regularized Wasserstein Estimator
      Wasserstein estimator:
        $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$

  13. Regularized Wasserstein Estimator
      Wasserstein estimator:
        $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$
      First layer of regularization:
        $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu)$

  14. Regularized Wasserstein Estimator
      Wasserstein estimator:
        $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$
      First layer of regularization:
        $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu)$
      Second layer of regularization:
        $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu) + \eta \, \mathrm{KL}(\nu, \beta)$

  15. First layer: Gaussian deconvolution
      This is a recent interpretation (Rigollet and Weed, 2018). Let $X_i$ be i.i.d. random variables following $\nu^\star$, let $Z_i \sim \varphi_\varepsilon = \mathcal{N}(0, \varepsilon \, \mathrm{Id})$ be i.i.d. Gaussian noise, and let $Y_i = X_i + Z_i$ be the perturbed observation, with distribution $\mu$.
      [Diagram: $X_i \sim \nu^\star \;\rightarrow\; X_i + Z_i \;\rightarrow\; Y_i \sim \varphi_\varepsilon * \nu^\star$.]

  16. First layer: Gaussian deconvolution
      For $c(x, y) = \|x - y\|^2$, the MLE for $\nu^\star$ is
        $\hat{\nu} := \arg\max_{\nu \in \mathcal{M}} \sum_i \log (\varphi_\varepsilon * \nu)(Y_i) \quad \Longleftrightarrow \quad \hat{\nu} = \arg\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu)$.
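A tiny simulation of this generative model (illustrative values; the support and weights of nu_star and the noise level are made up for the example):

```python
# Samples X_i from a discrete nu_star, observed through Gaussian noise of variance eps.
import numpy as np

rng = np.random.default_rng(0)
support = np.array([-1.0, 0.0, 2.0])           # atoms of nu_star (illustrative)
nu_star = np.array([0.2, 0.5, 0.3])            # weights of nu_star (illustrative)
eps = 0.1                                      # noise variance = regularization level

X = rng.choice(support, size=1000, p=nu_star)  # X_i ~ nu_star
Z = rng.normal(0.0, np.sqrt(eps), size=1000)   # Z_i ~ N(0, eps)
Y = X + Z                                      # observations, distribution phi_eps * nu_star
```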

  17. First layer: adds entropy to the transport matrix
      Figure 1: small regularization, $\varepsilon = 0.01$. Figure 2: big regularization, $\varepsilon = 0.1$.

  18. Second layer: interpolation with likelihood estimators
      Wasserstein estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$
      Maximum likelihood estimator: $\min_{\nu \in \mathcal{M}} \mathrm{KL}(\nu, \beta)$
      Regularized Wasserstein estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu) + \eta \, \mathrm{KL}(\nu, \beta)$

  19. Second layer: adds entropy to the target measure
      Figure 3: small regularization, $\eta = 0.02$. Figure 4: big regularization, $\eta = 0.2$.

  20. Dual formulation of the problem
      The problem
        $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu) + \eta \, \mathrm{KL}(\nu, \beta)$,
      with
        $\mathrm{OT}_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)] + \varepsilon \, \mathrm{KL}(\pi, \mu \otimes \nu)$,
      is
        $\min_{\nu \in \mathcal{M}} \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)] + \varepsilon \, \mathrm{KL}(\pi, \mu \otimes \nu) + \eta \, \mathrm{KL}(\nu, \beta)$.
      We consider the dual of the second min.
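For fixed $\nu$, the inner entropic transport problem admits a standard expectation-form dual (Genevay et al., 2016); we recall it here only as context. The paper's saddle-point function $F$ is obtained after also handling the minimization over $\nu$ and the $\eta \, \mathrm{KL}(\nu, \beta)$ term, which the display below does not include:

  $\mathrm{OT}_\varepsilon(\mu, \nu) = \max_{a, b} \; \mathbb{E}_{X \sim \mu}[a(X)] + \mathbb{E}_{Y \sim \nu}[b(Y)] - \varepsilon \, \mathbb{E}_{(X, Y) \sim \mu \otimes \nu}\!\left[ e^{(a(X) + b(Y) - c(X, Y)) / \varepsilon} - 1 \right]$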

  21. Dual formulation
      The dual problem can be written as a saddle-point problem, where the min and the max can be swapped. The final formulation is of the form
        $\max_{(a, b) \in \mathbb{R}^I \times \mathbb{R}^J} F(a, b)$.

  22. Properties of the function F in the discrete case
      1. $F$ is $\lambda$-strongly concave on the hyperplane $E = \{ \sum_i \mu_i a_i = \sum_j \beta_j b_j \}$.
      2. There exists a solution of $\max_{(a, b) \in \mathbb{R}^I \times \mathbb{R}^J} F(a, b)$, it lies in $E$, and it is unique.
      3. The gradients of $F$ can be written as expectations:
           $\nabla_a F = \mathbb{E}[(1 - D_{i,j}) \, e_i]$, $\nabla_b F = \mathbb{E}[(f_j - D_{i,j}) \, e_j]$,
         with $D_{i,j}(a, b) = \exp\!\left( \frac{a_i + b_j - C_{i,j}}{\varepsilon} \right)$ and $f_j = \nu_j(b) / \beta_j$.

  23. Stochastic Gradient Descent
      We have stochastic gradients for $F$:
        $G_a = (1 - D_{i,j}) \, e_i$, $G_b = (f_j - D_{i,j}) \, e_j$.
      SGD algorithm:
      • Sample $i \in \{1, \ldots, I\}$ with probability $\mu_i$,
      • Sample $j \in \{1, \ldots, J\}$ with probability $\beta_j$,
      • Compute $G_a$ and $G_b$,
      • $a \leftarrow a + \gamma_t G_a$,
      • $b \leftarrow b + \gamma_t G_b$.

  24. Stochastic Gradient Descent
      We only have to update $a$ and $b$ one coefficient at a time:
      • Sample $i \in \{1, \ldots, I\}$ with probability $\mu_i$,
      • Sample $j \in \{1, \ldots, J\}$ with probability $\beta_j$,
      • Compute $f_j$ and $D_{i,j}$,
      • $a_i \leftarrow a_i + \gamma_t (1 - D_{i,j})$,
      • $b_j \leftarrow b_j + \gamma_t (f_j - D_{i,j})$.

  25. The sum memorization trick
      The computation of $D_{i,j}(a, b) = \exp\!\left( \frac{a_i + b_j - C_{i,j}}{\varepsilon} \right)$ is $O(1)$. However, $f_j = \nu_j(b) / \beta_j$ requires
        $\nu_j(b) = \frac{\beta_j e^{-b_j / (\eta - \varepsilon)}}{\sum_k \beta_k e^{-b_k / (\eta - \varepsilon)}}$,
      whose denominator is a sum over all $J$ coefficients. We can still compute it in $O(1)$ per step if we memorize the sum
        $S^{(t)} = \sum_k \beta_k e^{-b_k^{(t)} / (\eta - \varepsilon)}$
      and update it with
        $S^{(t+1)} = S^{(t)} + \beta_j e^{-b_j^{(t+1)} / (\eta - \varepsilon)} - \beta_j e^{-b_j^{(t)} / (\eta - \varepsilon)}$.
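Putting slides 23 to 25 together, here is a minimal sketch of the whole loop in Python (an illustration built from the formulas above; the function and variable names, the stepsize choice, and the final recovery of $\nu$ from $b$ via the softmin formula are our assumptions, not the authors' code):

```python
# Doubly stochastic ascent on F with the sum-memorization trick; assumes eta > eps.
import numpy as np

def regularized_wasserstein_sgd(mu, beta, C, eps, eta, n_steps, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    I, J = C.shape
    a, b = np.zeros(I), np.zeros(J)
    S = np.sum(beta * np.exp(-b / (eta - eps)))      # memorized sum, computed once in O(J)
    for t in range(1, n_steps + 1):
        gamma = 1.0 / (lam * t)                      # stepsize from the convergence slide
        i = rng.choice(I, p=mu)                      # sample i with probability mu_i
        j = rng.choice(J, p=beta)                    # sample j with probability beta_j
        D_ij = np.exp((a[i] + b[j] - C[i, j]) / eps)
        f_j = np.exp(-b[j] / (eta - eps)) / S        # nu_j(b) / beta_j via the memorized sum
        a[i] += gamma * (1.0 - D_ij)                 # ascent step on a_i
        old = beta[j] * np.exp(-b[j] / (eta - eps))
        b[j] += gamma * (f_j - D_ij)                 # ascent step on b_j
        S += beta[j] * np.exp(-b[j] / (eta - eps)) - old   # O(1) update of the sum
    nu = beta * np.exp(-b / (eta - eps))             # recover the estimator from b (softmin formula)
    return nu / nu.sum()
```

Each iteration touches a single coefficient of $a$ and of $b$, so the per-step cost is $O(1)$, as claimed.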

  26. Convergence Bounds
      With stepsize $\gamma_t = \frac{1}{\lambda t}$, the estimator satisfies
        $\mathbb{E}[\mathrm{KL}(\nu^\star, \nu_t)] \le \frac{C_1 (1 + \log t)}{(\eta - \varepsilon) \lambda^2 t}$.
      With stepsize $\gamma_t = \frac{C_2}{\sqrt{t}}$, the estimator satisfies the following bound:
        $\mathbb{E}[\mathrm{KL}(\nu^\star, \nu_t)] \le \frac{C_3 (2 + \log t)}{(\eta - \varepsilon) \lambda \sqrt{t}}$.

  27. Simulations
      Figure 5: convergence of the gradient norm for different dimensions.

  28. Application to Wasserstein barycenters
      Wasserstein barycenter:
        $\min_{\nu} \sum_{k=1}^{K} \theta_k \, \mathrm{OT}(\mu_k, \nu)$.
      Doubly regularized Wasserstein barycenter:
        $\min_{\nu} \sum_{k=1}^{K} \theta_k \, \mathrm{OT}_\varepsilon(\mu_k, \nu) + \eta \, \mathrm{KL}(\nu, \beta)$.
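The per-sample updates extend naturally to the barycenter objective. One plausible adaptation (our assumption for illustration, not necessarily the paper's exact scheme) is to also sample the measure index $k$ with probability $\theta_k$ and to keep one potential vector $a^{(k)}$ per input measure, while $b$ and the memorized sum $S$ stay shared:

```python
# Hedged sketch of a single step for the barycenter objective; names are illustrative.
import numpy as np

def barycenter_sgd_step(a, b, S, mus, beta, Cs, theta, eps, eta, gamma, rng):
    K = len(mus)
    k = rng.choice(K, p=theta)                        # pick an input measure k ~ theta
    i = rng.choice(len(mus[k]), p=mus[k])             # sample i ~ mu_k
    j = rng.choice(len(beta), p=beta)                 # sample j ~ beta
    D_ij = np.exp((a[k][i] + b[j] - Cs[k][i, j]) / eps)
    f_j = np.exp(-b[j] / (eta - eps)) / S
    a[k][i] += gamma * (1.0 - D_ij)                   # ascent step on the k-th potential
    old = beta[j] * np.exp(-b[j] / (eta - eps))
    b[j] += gamma * (f_j - D_ij)                      # shared potential b
    S += beta[j] * np.exp(-b[j] / (eta - eps)) - old  # O(1) sum update, as before
    return S
```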

  29. Conclusion
      Takeaways:
      • Wasserstein estimators are "projections" according to Wasserstein distances,
      • Two layers of entropic regularization are used here,
      • Stochastic gradients can then be computed in $O(1)$ per step for this problem,
      • The results are also valid for Wasserstein barycenters.
      Thank you for your attention!
