Online Sinkhorn: Optimal Transport distances from sample streams

Arthur Mensch
Joint work with Gabriel Peyré
École Normale Supérieure, Département de Mathématiques et Applications, Paris, France
CIRM, 3/12/2020

Optimal transport for machine learning
Density fitting
Distance between points
Distance between distributions: $\alpha \in \mathcal{P}(\mathcal{X})$, $\beta \in \mathcal{P}(\mathcal{X})$
Dependency on the cost $C : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$: $W(\alpha, \beta, C)$

The trouble with optimal transport
Figure: StyleGAN2 (labels: neural network, sample)
In ML, at least one distribution is not discrete: $\alpha \neq \frac{1}{n} \sum_i \delta_{x_i}$
Algorithms for OT work with discrete distributions
Need for consistent estimators of $W(\alpha, \beta)$ and its backward operator
Using streams of samples $(x_t)_t$, $(y_t)_t$ from $\alpha$ and $\beta$

Outline
1. Tractable algorithms for optimal transport
2. Online Sinkhorn

Wasserstein distance (Kantorovich, 1942)
$$\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}, \qquad \beta = \sum_{j=1}^{m} b_j \delta_{y_j}, \qquad C = (C(x_i, y_j))_{i,j}$$
$\alpha \in \mathcal{P}(\mathcal{X})$: positions $x = (x_i)_i \in \mathcal{X}^n$, weights $a = (a_i)_i \in \triangle_n$
$P \in \triangle_{n \times m}$, $P \mathbf{1} = a$, $P^\top \mathbf{1} = b$
Cost: $\sum_{i,j} P_{i,j} C_{i,j} = \langle P, C \rangle$

Wasserstein distance (Kantorovich, 1942)
$$W_C(\alpha, \beta) = \min_{\substack{P \in \triangle_{n \times m} \\ P \mathbf{1} = a,\ P^\top \mathbf{1} = b}} \langle P, C \rangle$$
Entropic regularization [1]
$$W(\alpha, \beta) = W_C^1(\alpha, \beta) = \min_{\substack{P \in \triangle_{n \times m} \\ P \mathbf{1} = a,\ P^\top \mathbf{1} = b}} \langle P, C \rangle + \mathrm{KL}(P \,|\, a \otimes b)$$
[1] M. Cuturi. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. In: Advances in Neural Information Processing Systems. 2013.
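As a small illustration of the entropic objective above, the following sketch evaluates $\langle P, C \rangle + \mathrm{KL}(P \,|\, a \otimes b)$ for one feasible coupling. The squared-distance cost, the uniform weights, and the helper name `entropic_objective` are assumptions made for the example, not part of the slides. The independent coupling $P = a b^\top$ is always feasible and has zero KL term, so its objective reduces to the plain transport cost.

```python
import numpy as np

def entropic_objective(P, C, a, b):
    """Return <P, C> + KL(P | a ⊗ b) for a coupling P with marginals a and b."""
    ref = np.outer(a, b)
    mask = P > 0  # convention: 0 * log(0) = 0
    kl = np.sum(P[mask] * np.log(P[mask] / ref[mask]))
    return np.sum(P * C) + kl

rng = np.random.default_rng(0)
n, m = 4, 3
x, y = rng.normal(size=(n, 1)), rng.normal(size=(m, 1))
C = (x - y.T) ** 2            # squared-distance cost matrix (assumed choice)
a = np.full(n, 1 / n)         # uniform weights (assumed choice)
b = np.full(m, 1 / m)

P = np.outer(a, b)            # independent coupling: feasible, KL term is zero
row_ok = np.allclose(P.sum(axis=1), a)   # checks P 1 = a
col_ok = np.allclose(P.sum(axis=0), b)   # checks P^T 1 = b
value = entropic_objective(P, C, a, b)
```

Any other feasible coupling can be passed to the same function; the minimum over all of them is $W(\alpha, \beta)$.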

Working with continuous objects
$C : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ functions
$\alpha, \beta \in \mathcal{P}(\mathcal{X})$ distributions
$\pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X})$ with marginals
$$\int_y \mathrm{d}\pi(\cdot, y) = \mathrm{d}\alpha(\cdot), \qquad \int_x \mathrm{d}\pi(x, \cdot) = \mathrm{d}\beta(\cdot)$$
$$W(\alpha, \beta) \triangleq \min_{\pi \in \mathcal{U}(\alpha, \beta)} \langle \pi, C \rangle + \mathrm{KL}(\pi \,|\, \alpha \otimes \beta)$$
$$= \int_{x,y} C(x, y)\, \mathrm{d}\pi(x, y) + \int_{x,y} \log \frac{\mathrm{d}\pi}{\mathrm{d}\alpha\, \mathrm{d}\beta}(x, y)\, \mathrm{d}\pi(x, y)$$
Discrete case: $\pi = \sum_{i,j} P_{i,j} \delta_{x_i, y_j}$

Computing W using matrix scaling
$$W(\alpha, \beta) = \min_{\pi \in \mathcal{U}(\alpha, \beta)} \langle \pi, C \rangle + \mathrm{KL}(\pi \,|\, \alpha \otimes \beta)$$
Fenchel-Rockafellar [2] dual (non strongly convex)
$$W(\alpha, \beta) = \max_{f, g \in \mathcal{C}(\mathcal{X})} \langle \alpha, f \rangle + \langle \beta, g \rangle - \langle \alpha \otimes \beta, \exp(f \oplus g - C) \rangle + 1$$
Alternated maximisation: Sinkhorn-Knopp algorithm [3]
$$f_t(\cdot) = T_\beta(g_{t-1})(\cdot) = -\log \int_{y \in \mathcal{X}} \exp(g_{t-1}(y) - C(\cdot, y))\, \mathrm{d}\beta(y)$$
$$g_t(\cdot) = T_\alpha(f_t)(\cdot) = -\log \int_{x \in \mathcal{X}} \exp(f_t(x) - C(x, \cdot))\, \mathrm{d}\alpha(x)$$
Fixed point at the optimum:
$$f^\star(\cdot) = T_\beta(g^\star)(\cdot) \triangleq -\log \int_{y \in \mathcal{X}} \exp(g^\star(y) - C(\cdot, y))\, \mathrm{d}\beta(y)$$
$$g^\star(\cdot) = T_\alpha(f^\star)(\cdot) \triangleq -\log \int_{x \in \mathcal{X}} \exp(f^\star(x) - C(x, \cdot))\, \mathrm{d}\alpha(x)$$
[2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
[3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.
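The update $f = T_\beta(g)$ can be recovered from the dual by a first-order condition. The short derivation below is a sketch: it writes the dual objective as $F(f, g)$ and uses a test function $h$, both notations introduced here for convenience, and it assumes enough regularity to differentiate under the integral.

```latex
\begin{align*}
F(f, g) &= \langle \alpha, f \rangle + \langle \beta, g \rangle
          - \langle \alpha \otimes \beta, \exp(f \oplus g - C) \rangle + 1, \\
\frac{\mathrm{d}}{\mathrm{d}\varepsilon} F(f + \varepsilon h, g) \Big|_{\varepsilon = 0}
        &= \int_x h(x) \Big[ 1 - \int_y \exp\big(f(x) + g(y) - C(x, y)\big)\, \mathrm{d}\beta(y) \Big] \mathrm{d}\alpha(x). \\
\intertext{Setting this to zero for every $h$ gives, for $\alpha$-almost every $x$,}
\exp(f(x)) \int_y \exp\big(g(y) - C(x, y)\big)\, \mathrm{d}\beta(y) &= 1, \\
f(x) &= -\log \int_y \exp\big(g(y) - C(x, y)\big)\, \mathrm{d}\beta(y) = T_\beta(g)(x).
\end{align*}
```

The same computation with the roles of $f$ and $g$ exchanged yields $g = T_\alpha(f)$, so each Sinkhorn step is an exact maximisation of the dual in one potential with the other held fixed.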

Implementing Sinkhorn algorithm
Discrete distributions
$$\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}, \qquad \beta = \sum_{j=1}^{m} b_j \delta_{y_j}, \qquad C = (C(x_i, y_j))_{i,j}$$
Repeat until convergence
$$f_t(x_i) = -\log \sum_j b_j \exp(g_{t-1}(y_j) - C(x_i, y_j))$$
$$g_t(y_j) = -\log \sum_i a_i \exp(f_t(x_i) - C(x_i, y_j))$$
Finite representation of potentials / transportation plan
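The two discrete updates above can be sketched in a few lines of NumPy. This is a minimal log-domain version under assumptions that are not in the slides: a fixed iteration count instead of a convergence test, a squared Euclidean cost, uniform weights, and hand-written helper names (`logsumexp`, `sinkhorn`).

```python
import numpy as np

def logsumexp(z, axis):
    """Numerically stable log(sum(exp(z))) along one axis."""
    zmax = z.max(axis=axis, keepdims=True)
    return np.squeeze(zmax, axis=axis) + np.log(np.exp(z - zmax).sum(axis=axis))

def sinkhorn(C, a, b, n_iter=100):
    """Alternate the f and g updates; returns potentials at x_i and y_j."""
    f, g = np.zeros(len(a)), np.zeros(len(b))
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iter):
        # f_t(x_i) = -log sum_j b_j exp(g_{t-1}(y_j) - C_ij)
        f = -logsumexp(g[None, :] + log_b[None, :] - C, axis=1)
        # g_t(y_j) = -log sum_i a_i exp(f_t(x_i) - C_ij)
        g = -logsumexp(f[:, None] + log_a[:, None] - C, axis=0)
    return f, g

rng = np.random.default_rng(0)
x, y = rng.uniform(size=(5, 2)), rng.uniform(size=(7, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # assumed cost
a, b = np.full(5, 1 / 5), np.full(7, 1 / 7)          # assumed weights

f, g = sinkhorn(C, a, b)
# Transportation plan recovered from the potentials
P = a[:, None] * b[None, :] * np.exp(f[:, None] + g[None, :] - C)
# Dual value; the exp term and the +1 cancel because P sums to one after a g update
W = a @ f + b @ g
```

The vectors `f` and `g` are the finite representation mentioned on the slide: $n + m$ numbers determine the whole $n \times m$ plan `P`.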

Distances between continuous distributions
Classic approach
$$\alpha, \beta \xrightarrow{\text{Sampling}} \hat\alpha, \hat\beta \xrightarrow{\text{Cost}} C = (C(x_i, y_j))_{i,j} \xrightarrow{\text{Sinkhorn}} W(\hat\alpha, \hat\beta)$$
$$\hat\alpha = \frac{1}{b} \sum_{i=1}^{b} \delta_{x_i}, \qquad \hat\beta = \frac{1}{b} \sum_{i=1}^{b} \delta_{y_i}$$
Sampling once, approximation $W(\hat\alpha, \hat\beta) \approx W(\alpha, \beta)$
Our approach
$$\alpha, \beta \xrightarrow{\text{Repeated sampling}} (\hat\alpha_t, \hat\beta_t)_t \xrightarrow{\text{Cost + transform}} (f_t, g_t)_t$$
$$\hat\alpha_t = \frac{1}{b_t} \sum_{i=n_t}^{n_{t+1}} \delta_{x_i}, \qquad \hat\beta_t = \frac{1}{b_t} \sum_{i=n_t}^{n_{t+1}} \delta_{y_i}$$
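The classic approach (sample once, build the cost matrix, run Sinkhorn) can be sketched as a plug-in estimator. Everything specific here is an assumption for illustration: the two Gaussian samplers standing in for $\alpha$ and $\beta$, the squared-distance cost, the batch sizes, and the helper names. The sketch does not implement the online method.

```python
import numpy as np

def logsumexp(z, axis):
    """Numerically stable log(sum(exp(z))) along one axis."""
    zmax = z.max(axis=axis, keepdims=True)
    return np.squeeze(zmax, axis=axis) + np.log(np.exp(z - zmax).sum(axis=axis))

def plugin_estimate(x, y, n_iter=100):
    """Estimate W from two samples of equal size b with uniform weights 1/b."""
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # assumed cost
    log_w = -np.log(len(x))                              # log(1/b)
    f, g = np.zeros(len(x)), np.zeros(len(y))
    for _ in range(n_iter):
        f = -logsumexp(g[None, :] + log_w - C, axis=1)
        g = -logsumexp(f[:, None] + log_w - C, axis=0)
    return f.mean() + g.mean()                           # dual value <α̂, f> + <β̂, g>

rng = np.random.default_rng(0)

def sample_alpha(b):
    # Hypothetical stand-in for a sampler of alpha
    return rng.normal(0.0, 0.2, size=(b, 1))

def sample_beta(b):
    # Hypothetical stand-in for a sampler of beta
    return rng.normal(1.0, 0.2, size=(b, 1))

# One estimate per batch size; each draws its samples once and never refreshes them
estimates = {b: plugin_estimate(sample_alpha(b), sample_beta(b)) for b in (10, 100, 500)}
```

Each entry of `estimates` depends on the particular batch drawn, and memory grows as $b^2$ through the cost matrix, which is the limitation that motivates processing a stream of fresh batches instead.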
