Online Sinkhorn: Optimal Transport distances from sample streams
Arthur Mensch
Joint work with Gabriel Peyré
École Normale Supérieure, Département de Mathématiques et Applications, Paris, France
CIRM, 3/12/2020
Optimal transport for machine learning
Density fitting
Distance between points
Distance between distributions: $\alpha \in \mathcal{P}(\mathcal{X})$, $\beta \in \mathcal{P}(\mathcal{X})$
Dependency on the cost $C : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$: $W(\alpha, \beta, C)$
The trouble with optimal transport
Figure: StyleGAN2 (panel labels: neural network, sample)
In ML, at least one distribution is not discrete: $\alpha \neq \frac{1}{n} \sum_i \delta_{x_i}$
Algorithms for OT work with discrete distributions
Need for consistent estimators of $W(\alpha, \beta)$ and its backward operator
Using streams of samples $(x_t)_t$, $(y_t)_t$ from $\alpha$ and $\beta$
Outline
1. Tractable algorithms for optimal transport
2. Online Sinkhorn
Wasserstein distance (Kantorovich, 1942)
$$\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}, \qquad \beta = \sum_{j=1}^{m} b_j \delta_{y_j}, \qquad C = (C(x_i, y_j))_{i,j}$$
$\alpha \in \mathcal{P}(\mathcal{X})$: positions $x = (x_i)_i \in \mathcal{X}^n$, weights $a = (a_i)_i \in \triangle^n$
Transport plan: $P \in \triangle^{n \times m}$, $P \mathbf{1} = a$, $P^\top \mathbf{1} = b$
Cost: $\sum_{i,j} P_{i,j} C_{i,j} = \langle P, C \rangle$
Wasserstein distance (Kantorovich, 1942)
$$W_C(\alpha, \beta) = \min_{\substack{P \in \triangle^{n \times m} \\ P \mathbf{1} = a,\ P^\top \mathbf{1} = b}} \langle P, C \rangle$$
Entropic regularization [1]
$$W(\alpha, \beta) = W_C^1(\alpha, \beta) = \min_{\substack{P \in \triangle^{n \times m} \\ P \mathbf{1} = a,\ P^\top \mathbf{1} = b}} \langle P, C \rangle + \mathrm{KL}(P \,|\, a \otimes b)$$
[1] M. Cuturi. "Sinkhorn Distances: Lightspeed Computation of Optimal Transport". In: Advances in Neural Information Processing Systems. 2013.
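The regularized objective can be evaluated by hand on a toy problem. The sketch below is only an illustration: the function name `regularized_cost`, the 2 x 2 cost matrix and the two candidate couplings are assumptions made for the example, not taken from the talk. It compares the independent coupling $a \otimes b$ (zero KL term) with the diagonal coupling (zero transport cost).

```python
import math

def regularized_cost(P, a, b, C):
    """Return <P, C> + KL(P | a ⊗ b) for a coupling P with marginals a and b."""
    total = 0.0
    for i, row in enumerate(P):
        for j, p in enumerate(row):
            if p > 0.0:  # convention: 0 * log(0) = 0
                total += p * C[i][j] + p * math.log(p / (a[i] * b[j]))
    return total

# Hypothetical toy problem: two points on each side, unit cost off the diagonal.
a = [0.5, 0.5]
b = [0.5, 0.5]
C = [[0.0, 1.0], [1.0, 0.0]]

# Independent coupling a ⊗ b: the KL term vanishes, only the cost remains.
independent = [[a[i] * b[j] for j in range(2)] for i in range(2)]
# Diagonal coupling: zero transport cost, but a KL penalty.
diagonal = [[0.5, 0.0], [0.0, 0.5]]

value_independent = regularized_cost(independent, a, b, C)
value_diagonal = regularized_cost(diagonal, a, b, C)
```

In this toy example the diagonal coupling minimizes $\langle P, C \rangle$ but pays a KL penalty, while the independent coupling pays none; the entropic minimizer lies between the two.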
Working with continuous objects
$C : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$: functions
$\alpha, \beta \in \mathcal{P}(\mathcal{X})$: distributions
$\pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X})$ with marginals $\int_y d\pi(\cdot, y) = d\alpha(\cdot)$, $\int_x d\pi(x, \cdot) = d\beta(\cdot)$
$$W(\alpha, \beta) \triangleq \min_{\pi \in \mathcal{U}(\alpha, \beta)} \langle \pi, C \rangle + \mathrm{KL}(\pi \,|\, \alpha \otimes \beta) = \min_{\pi \in \mathcal{U}(\alpha, \beta)} \int_{x,y} C(x, y) \, d\pi(x, y) + \int_{x,y} \log \frac{d\pi}{d\alpha \, d\beta}(x, y) \, d\pi(x, y)$$
Discrete case: $\pi = \sum_{i,j} P_{i,j} \delta_{x_i, y_j}$
Computing W using matrix scaling
$$W(\alpha, \beta) = \min_{\pi \in \mathcal{U}(\alpha, \beta)} \langle \pi, C \rangle + \mathrm{KL}(\pi \,|\, \alpha \otimes \beta)$$
Fenchel-Rockafellar [2] dual (non strongly convex)
$$W(\alpha, \beta) = \max_{f, g \in \mathcal{C}(\mathcal{X})} \langle \alpha, f \rangle + \langle \beta, g \rangle - \langle \alpha \otimes \beta, \exp(f \oplus g - C) \rangle + 1$$
Alternated maximisation: Sinkhorn-Knopp algorithm [3]
$$f_t(\cdot) = T_\beta(g_{t-1})(\cdot) = -\log \int_{y \in \mathcal{X}} \exp(g_{t-1}(y) - C(\cdot, y)) \, d\beta(y)$$
$$g_t(\cdot) = T_\alpha(f_t)(\cdot) = -\log \int_{x \in \mathcal{X}} \exp(f_t(x) - C(x, \cdot)) \, d\alpha(x)$$
At the optimum, the potentials are a fixed point of these updates:
$$f^\star(\cdot) = T_\beta(g^\star)(\cdot) \triangleq -\log \int_{y \in \mathcal{X}} \exp(g^\star(y) - C(\cdot, y)) \, d\beta(y)$$
$$g^\star(\cdot) = T_\alpha(f^\star)(\cdot) \triangleq -\log \int_{x \in \mathcal{X}} \exp(f^\star(x) - C(x, \cdot)) \, d\alpha(x)$$
[2] R. T. Rockafellar. "Monotone operators associated with saddle-functions and minimax problems". In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
[3] Richard Sinkhorn. "A relationship between arbitrary positive matrices and doubly stochastic matrices". In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.
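A short sketch of where the update on $f$ comes from: fix $g$ and cancel the derivative of the dual objective with respect to $f$. The name $F$ for the dual objective is notation introduced here for the derivation; the regularization strength is 1, as in the formulas above.

```latex
\begin{align*}
F(f, g) &= \langle \alpha, f \rangle + \langle \beta, g \rangle
          - \langle \alpha \otimes \beta, \exp(f \oplus g - C) \rangle + 1, \\
\frac{\partial F}{\partial f}(x) &= 1 - \exp(f(x)) \int_{y \in \mathcal{X}} \exp(g(y) - C(x, y)) \, d\beta(y) = 0
          \quad (\alpha\text{-almost everywhere}), \\
f(x) &= -\log \int_{y \in \mathcal{X}} \exp(g(y) - C(x, y)) \, d\beta(y) = T_\beta(g)(x).
\end{align*}
```

The update on $g$ follows in the same way by exchanging the roles of $(f, \alpha)$ and $(g, \beta)$.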
Implementing Sinkhorn algorithm
Discrete distributions
$$\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}, \qquad \beta = \sum_{j=1}^{m} b_j \delta_{y_j}, \qquad C = (C(x_i, y_j))_{i,j}$$
Repeat until convergence
$$f_t(x_i) = -\log \sum_{j} b_j \exp(g_{t-1}(y_j) - C(x_i, y_j))$$
$$g_t(y_j) = -\log \sum_{i} a_i \exp(f_t(x_i) - C(x_i, y_j))$$
Finite representation of potentials / transportation plan
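The two discrete updates above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the fixed iteration count (instead of a convergence test), the toy points and the squared-distance cost are all assumptions. The updates are computed in the log domain for numerical stability, with regularization strength 1 as in the formulas.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sinkhorn(a, b, C, n_iter=200):
    """Alternate the f and g updates for a fixed number of iterations."""
    n, m = len(a), len(b)
    log_a = [math.log(w) for w in a]
    log_b = [math.log(w) for w in b]
    f, g = [0.0] * n, [0.0] * m
    for _ in range(n_iter):
        # f_t(x_i) = -log sum_j b_j exp(g_{t-1}(y_j) - C(x_i, y_j))
        f = [-logsumexp([log_b[j] + g[j] - C[i][j] for j in range(m)]) for i in range(n)]
        # g_t(y_j) = -log sum_i a_i exp(f_t(x_i) - C(x_i, y_j))
        g = [-logsumexp([log_a[i] + f[i] - C[i][j] for i in range(n)]) for j in range(m)]
    return f, g

def transport_plan(a, b, C, f, g):
    """Plan recovered from the potentials: P_ij = a_i b_j exp(f_i + g_j - C_ij)."""
    return [[a[i] * b[j] * math.exp(f[i] + g[j] - C[i][j]) for j in range(len(b))]
            for i in range(len(a))]

# Hypothetical toy problem: points on the real line, squared-distance cost.
x, y = [0.0, 1.0, 2.0], [0.5, 1.5]
a, b = [0.2, 0.3, 0.5], [0.6, 0.4]
C = [[(xi - yj) ** 2 for yj in y] for xi in x]

f, g = sinkhorn(a, b, C)
P = transport_plan(a, b, C, f, g)
# After the g update the exponential term of the dual equals 1,
# so the dual value reduces to <a, f> + <b, g>.
dual_value = sum(ai * fi for ai, fi in zip(a, f)) + sum(bj * gj for bj, gj in zip(b, g))
```

The potentials are stored as two finite vectors, which is the finite representation mentioned above; the plan is recovered from them only when needed.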
Distances between continuous distributions
Classic approach
$$\alpha, \beta \xrightarrow{\text{Sampling}} \hat\alpha, \hat\beta \xrightarrow{\text{Cost}} C = (C(x_i, y_j))_{i,j} \xrightarrow{\text{Sinkhorn}} W(\hat\alpha, \hat\beta)$$
$$\hat\alpha = \frac{1}{b} \sum_{i=1}^{b} \delta_{x_i}, \qquad \hat\beta = \frac{1}{b} \sum_{i=1}^{b} \delta_{y_i}$$
Sampling once, approximation $W(\hat\alpha, \hat\beta) \approx W(\alpha, \beta)$
Our approach
$$\alpha, \beta \xrightarrow{\text{Repeated sampling}} (\hat\alpha_t, \hat\beta_t)_t \xrightarrow{\text{Cost + transform}} (f_t, g_t)_t$$
$$\hat\alpha_t = \frac{1}{b_t} \sum_{i=n_t}^{n_{t+1}} \delta_{x_i}, \qquad \hat\beta_t = \frac{1}{b_t} \sum_{i=n_t}^{n_{t+1}} \delta_{y_i}$$
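The classic pipeline (sample once, build the cost matrix, run Sinkhorn) can be sketched as follows. This illustrates only the plug-in estimator $W(\hat\alpha, \hat\beta)$, not the online algorithm. The Gaussian samplers, the squared-distance cost, the batch size and the function names are assumptions chosen for the example.

```python
import math
import random

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sinkhorn_value(x, y, n_iter=100):
    """Entropic OT value between uniform empirical measures on x and y.

    Assumes a squared-distance cost on the real line and regularization 1.
    """
    n, m = len(x), len(y)
    log_a, log_b = -math.log(n), -math.log(m)  # uniform weights 1/n and 1/m
    C = [[(xi - yj) ** 2 for yj in y] for xi in x]
    f, g = [0.0] * n, [0.0] * m
    for _ in range(n_iter):
        f = [-logsumexp([log_b + g[j] - C[i][j] for j in range(m)]) for i in range(n)]
        g = [-logsumexp([log_a + f[i] - C[i][j] for i in range(n)]) for j in range(m)]
    # After the g update the exponential dual term equals 1,
    # so the dual value is <a, f> + <b, g>.
    return sum(f) / n + sum(g) / m

def plug_in_estimate(sample_alpha, sample_beta, b, rng):
    """Sample b points from each distribution once, then run Sinkhorn."""
    x = [sample_alpha(rng) for _ in range(b)]
    y = [sample_beta(rng) for _ in range(b)]
    return sinkhorn_value(x, y)

rng = random.Random(0)
# Hypothetical distributions: unit-variance Gaussians with different means.
w_far = plug_in_estimate(lambda r: r.gauss(0.0, 1.0), lambda r: r.gauss(3.0, 1.0), b=30, rng=rng)
w_near = plug_in_estimate(lambda r: r.gauss(0.0, 1.0), lambda r: r.gauss(0.0, 1.0), b=30, rng=rng)
```

Because the samples are drawn once, the estimate depends on the particular batch; drawing a larger batch reduces this error but increases the size of the cost matrix quadratically.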