Distances Entropic Regularization Sinkhorn Divergences Conclusion

Bridging the gap between Optimal Transport and MMD with Sinkhorn Divergences
Aude Genevay, MIT CSAIL
CIRM Workshop, March 2020
Joint work with Gabriel Peyré, Marco Cuturi, Francis Bach, Lénaïc Chizat

Comparing Probability Measures

[Figure: three settings for comparing two measures $\alpha$ and $\beta$: continuous, semi-discrete, and discrete.]

Discrete Setting (Quantization)

Figure 1 – $\min_{(x_1,\dots,x_k)} D\left(\frac{1}{k}\sum_{i=1}^{k}\delta_{x_i},\ \frac{1}{n}\sum_{j=1}^{n}\delta_{y_j}\right)$

Semi-discrete Setting (Density Fitting)

Figure 2 – $\min_{\theta} D(\alpha_\theta, \beta)$ (figure labels: $\alpha_{\theta^*}$, $\beta$)

1 Notions of Distance between Measures
2 Entropic Regularization of Optimal Transport
3 Sinkhorn Divergences: Interpolation between OT and MMD
4 Conclusion

$\varphi$-divergences (Csiszár ’63)

Definition ($\varphi$-divergence). Let $\varphi$ be a convex, l.s.c. function such that $\varphi(1) = 0$. The $\varphi$-divergence $D_\varphi$ between two measures $\alpha$ and $\beta$ is defined by:
$$D_\varphi(\alpha|\beta) \overset{\text{def.}}{=} \int_{\mathcal{X}} \varphi\left(\frac{d\alpha}{d\beta}(x)\right) d\beta(x).$$

Example (Kullback-Leibler divergence).
$$D_{KL}(\alpha|\beta) = \int_{\mathcal{X}} \log\left(\frac{d\alpha}{d\beta}(x)\right) d\alpha(x) \quad\leftrightarrow\quad \varphi(x) = x\log(x)$$

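For discrete measures on a common finite support, the definition above reduces to a weighted sum. The following is a minimal sketch under that assumption; the function names `phi_divergence` and `phi_kl` and the toy weights are illustrative, not from the talk. Returning $+\infty$ when $\alpha$ puts mass where $\beta$ has none is the right convention for KL, but for other choices of $\varphi$ the value on that part may be finite.

```python
import math

def phi_divergence(phi, a, b):
    """Discrete phi-divergence: sum_i phi(a_i / b_i) * b_i.

    a, b: lists of weights on the same finite support.
    Assumption: +inf is returned as soon as a_i > 0 where b_i == 0,
    which is the correct convention for KL (phi(t) = t log t).
    """
    total = 0.0
    for ai, bi in zip(a, b):
        if bi == 0.0:
            if ai > 0.0:
                return math.inf  # a is not absolutely continuous w.r.t. b
            continue
        total += phi(ai / bi) * bi
    return total

def phi_kl(t):
    # phi(t) = t log t, with the convention 0 log 0 = 0
    return 0.0 if t == 0.0 else t * math.log(t)

a = [0.5, 0.5, 0.0]
b = [0.25, 0.25, 0.5]
kl = phi_divergence(phi_kl, a, b)  # equals sum_i a_i log(a_i / b_i)
print(kl)
```

With these weights the sum is $0.5\log 2 + 0.5\log 2 = \log 2$, which matches the direct KL formula.
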
Weak Convergence of measures

Example. On $\mathbb{R}$, take $\alpha = \delta_0$ and $\alpha_n = \delta_{1/n}$: $D_{KL}(\alpha_n|\alpha) = +\infty$ for every $n$.
[Figure: position of $\delta_{1/n}$ on $[0, 1]$ for $n = 1, \dots, 10$.]

Definition (Weak Convergence). $\alpha_n$ weakly converges to $\alpha$ (denoted $\alpha_n \rightharpoonup \alpha$)
$$\Leftrightarrow\quad \int f(x)\,d\alpha_n(x) \to \int f(x)\,d\alpha(x) \quad \forall f \in \mathcal{C}_b(\mathcal{X}).$$

Let $D$ be a distance between measures. $D$ metrizes weak convergence iff
$$D(\alpha_n, \alpha) \to 0 \ \Leftrightarrow\ \alpha_n \rightharpoonup \alpha.$$

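The example can be checked numerically. The sketch below uses a few bounded continuous test functions chosen here for illustration (any others would do) and compares $\int f\,d\alpha_n = f(1/n)$ with $\int f\,d\alpha = f(0)$, alongside the KL divergence between the two Dirac masses. The helper names are hypothetical.

```python
import math

# A few bounded continuous test functions (illustrative choice).
test_functions = {
    "cos": math.cos,
    "tanh": math.tanh,
    "1/(1+x^2)": lambda x: 1.0 / (1.0 + x * x),
}

def integrate_dirac(f, location):
    # The integral of f against a Dirac mass at `location` is f(location).
    return f(location)

def kl_dirac(x, y):
    # KL(delta_x | delta_y) is 0 if the points coincide, +inf otherwise.
    return 0.0 if x == y else math.inf

gaps = {}
for n in (1, 10, 100, 1000):
    # Largest gap between the integrals over the chosen test functions.
    gaps[n] = max(
        abs(integrate_dirac(f, 1.0 / n) - integrate_dirac(f, 0.0))
        for f in test_functions.values()
    )
    print(n, gaps[n], kl_dirac(1.0 / n, 0.0))
```

The gaps shrink with $n$ while the KL column stays at infinity, which is the sense in which KL fails to metrize weak convergence.
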
Maximum Mean Discrepancies (Gretton ’06)

Definition (RKHS). Let $\mathcal{H}$ be a Hilbert space with kernel $k$. Then $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) iff:
1. $\forall x \in \mathcal{X},\ k(x, \cdot) \in \mathcal{H}$,
2. $\forall f \in \mathcal{H},\ f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$.

Let $\mathcal{H}$ be an RKHS with kernel $k$. The MMD distance between two probability measures $\alpha$ and $\beta$ is defined by:
$$\mathrm{MMD}_k^2(\alpha, \beta) \overset{\text{def.}}{=} \left( \sup_{\{f \,:\, \|f\|_{\mathcal{H}} \le 1\}} \left| \mathbb{E}_\alpha(f(X)) - \mathbb{E}_\beta(f(Y)) \right| \right)^2$$
$$= \mathbb{E}_{\alpha\otimes\alpha}[k(X, X')] + \mathbb{E}_{\beta\otimes\beta}[k(Y, Y')] - 2\,\mathbb{E}_{\alpha\otimes\beta}[k(X, Y)].$$

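For empirical measures, the three expectations in the closed form become averages of kernel evaluations over pairs of samples. Here is a minimal sketch assuming a Gaussian kernel with bandwidth `sigma` (the slide does not fix a kernel); `gaussian_kernel`, `mean_kernel` and `mmd2` are illustrative names.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); the kernel choice is an assumption.
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, k):
    # Average of k over all pairs: estimates E[k(X, Y)] for uniform empirical measures.
    return sum(k(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd2(xs, ys, k=gaussian_kernel):
    # Plug-in value of E[k(X,X')] + E[k(Y,Y')] - 2 E[k(X,Y)].
    return mean_kernel(xs, xs, k) + mean_kernel(ys, ys, k) - 2.0 * mean_kernel(xs, ys, k)

xs = [(0.0, 0.0), (1.0, 0.0)]
ys = [(0.0, 1.0), (1.0, 1.0)]
value = mmd2(xs, ys)
print(value)
```

Each term is a double loop over the samples, so the cost is quadratic in the number of points.
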
Optimal Transport (Monge 1781, Kantorovitch ’42)

• $c(x, y)$: cost of moving a unit of mass from $x$ to $y$
• $\pi(x, y)$ (coupling): how much mass moves from $x$ to $y$

The Wasserstein Distance

Minimal cost of moving all the mass from $\alpha$ to $\beta$?

Let $\alpha \in \mathcal{M}_+^1(\mathcal{X})$ and $\beta \in \mathcal{M}_+^1(\mathcal{Y})$,
$$W_c(\alpha, \beta) = \min_{\pi \in \Pi(\alpha, \beta)} \int_{\mathcal{X}\times\mathcal{Y}} c(x, y)\, d\pi(x, y) \qquad (P)$$

For $c(x, y) = \|x - y\|_2^p$, $W_c(\alpha, \beta)^{1/p}$ is the p-Wasserstein distance.

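In one special case, (P) can be solved by hand: for two uniform empirical measures with the same number of points, an optimal coupling can be taken to be a permutation, and on the real line with a convex cost the sorted (monotone) matching is optimal. The sketch below, restricted to that special case, compares a brute-force search over permutations with the sorted matching; the function names and toy points are illustrative.

```python
import itertools

def cost(x, y, p=2):
    # c(x, y) = |x - y|^p on the real line
    return abs(x - y) ** p

def ot_bruteforce(xs, ys, p=2):
    # Minimum over all one-to-one assignments (only feasible for tiny n).
    n = len(xs)
    best = min(
        sum(cost(xs[i], ys[perm[i]], p) for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best / n

def ot_sorted_1d(xs, ys, p=2):
    # 1D shortcut: match the i-th smallest x to the i-th smallest y.
    n = len(xs)
    return sum(cost(x, y, p) for x, y in zip(sorted(xs), sorted(ys))) / n

xs = [0.0, 2.0, 5.0, 1.0]
ys = [1.5, 0.5, 4.0, 3.0]
w_c = ot_sorted_1d(xs, ys)   # value of (P) for c = |x - y|^2
w_2 = w_c ** 0.5             # 2-Wasserstein distance
print(w_c, ot_bruteforce(xs, ys), w_2)
```

Outside this special case (unequal weights, higher dimension) a general linear-programming or assignment solver is needed.
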
Optimal Transport vs. MMD

                      OT                                        MMD
sample complexity     $O(n^{-1/d})$ (curse of dimension)        $O(1/\sqrt{n})$
computation           $O(n^3 \log(n))$                          $O(n^2)$

Simple example

$$\min_{(x_1,\dots,x_n)} D\left(\frac{1}{n}\sum_{i=1}^{n}\delta_{x_i},\ \frac{1}{n}\sum_{j=1}^{n}\delta_{y_j}\right)$$

[Figure: Discrete gradient flow of MMD]

[Figure: Discrete gradient flow of OT]

Another example

$$\min_{(x_1,\dots,x_n)} D\left(\frac{1}{n}\sum_{i=1}^{n}\delta_{x_i},\ \frac{1}{n}\sum_{j=1}^{n}\delta_{y_j}\right)$$

[Figure: Discrete gradient flow of MMD]

[Figure: Discrete gradient flow of OT]

Optimal Transport vs. MMD

                      OT                                        MMD
sample complexity     $O(n^{-1/d})$ (curse of dimension)        $O(1/\sqrt{n})$
computation           $O(n^3 \log(n))$                          $O(n^2)$

better gradients!

[Figure: $\min_{(x_1,\dots,x_k)} D\left(\frac{1}{k}\sum_{i=1}^{k}\delta_{x_i},\ \frac{1}{n}\sum_{j=1}^{n}\delta_{y_j}\right)$ after 200 steps of gradient descent.]

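To make the gradient-descent experiment concrete, here is a toy sketch of the OT case only, under strong simplifying assumptions that are not in the slides: points on the real line, $k = n$, squared-distance cost, and a hand-picked step size. In that setting the loss is $\frac{1}{n}\sum_i (x_{(i)} - y_{(i)})^2$ over sorted points, and its gradient with respect to each $x_i$ is $\frac{2}{n}(x_i - y_{\sigma(i)})$, where $y_{\sigma(i)}$ is the point matched to $x_i$. The function name `w2_squared_and_grad` is hypothetical.

```python
def w2_squared_and_grad(xs, ys):
    """Squared 2-Wasserstein loss between uniform 1D point clouds, and its gradient in xs.

    Assumes len(xs) == len(ys); the optimal matching is the sorted one.
    """
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])  # indices of xs in increasing order
    ys_sorted = sorted(ys)
    loss = 0.0
    grad = [0.0] * n
    for rank, i in enumerate(order):
        diff = xs[i] - ys_sorted[rank]  # x_i minus its matched target
        loss += diff * diff / n
        grad[i] = 2.0 * diff / n
    return loss, grad

xs = [0.0, 0.1, 0.2, 0.3]   # initial point cloud (illustrative)
ys = [1.0, 2.0, 3.0, 4.0]   # target point cloud (illustrative)
step = 0.5                   # hand-picked step size
losses = []
for _ in range(200):
    loss, grad = w2_squared_and_grad(xs, ys)
    losses.append(loss)
    xs = [x - step * g for x, g in zip(xs, grad)]
print(losses[0], losses[-1], xs)
```

Every point receives a gradient pointing at its matched target regardless of how far away that target is, so the cloud contracts onto the targets at a fixed geometric rate in this toy setting.
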
1 Notions of Distance between Measures
2 Entropic Regularization of Optimal Transport
    The basics
    A magic regularizing tool!
    Sample Complexity
3 Sinkhorn Divergences: Interpolation between OT and MMD
4 Conclusion

Entropic Regularization (Cuturi ’13)

Let $\alpha \in \mathcal{M}_+^1(\mathcal{X})$ and $\beta \in \mathcal{M}_+^1(\mathcal{Y})$,
$$W_c(\alpha, \beta) \overset{\text{def.}}{=} \min_{\pi \in \Pi(\alpha, \beta)} \int_{\mathcal{X}\times\mathcal{Y}} c(x, y)\, d\pi(x, y) \qquad (P)$$

Entropic Regularization (Cuturi ’13)

Let $\alpha \in \mathcal{M}_+^1(\mathcal{X})$ and $\beta \in \mathcal{M}_+^1(\mathcal{Y})$,
$$W_{c,\varepsilon}(\alpha, \beta) \overset{\text{def.}}{=} \min_{\pi \in \Pi(\alpha, \beta)} \int_{\mathcal{X}\times\mathcal{Y}} c(x, y)\, d\pi(x, y) + \varepsilon H(\pi \,|\, \alpha \otimes \beta), \qquad (P_\varepsilon)$$
where
$$H(\pi \,|\, \alpha \otimes \beta) \overset{\text{def.}}{=} \int_{\mathcal{X}\times\mathcal{Y}} \log\left(\frac{d\pi(x, y)}{d\alpha(x)\, d\beta(y)}\right) d\pi(x, y)$$
is the relative entropy of the transport plan $\pi$ with respect to the product measure $\alpha \otimes \beta$.

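For discrete measures, a common way to solve $(P_\varepsilon)$ is Sinkhorn's alternating scaling iterations. The following is a minimal pure-Python sketch, not taken from the slides: the names `sinkhorn` and `regularized_cost`, the toy points, $\varepsilon = 0.5$ and the iteration count are all illustrative assumptions. Because the entropy is relative to $\alpha \otimes \beta$, the plan is sought in the form $\pi_{ij} = a_i u_i K_{ij} v_j b_j$ with $K_{ij} = e^{-C_{ij}/\varepsilon}$.

```python
import math

def sinkhorn(a, b, C, eps, n_iter=500):
    """Approximate minimizer of <pi, C> + eps * KL(pi | a x b) over couplings of a and b."""
    n, m = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Rescale rows so that the first marginal equals a ...
        u = [1.0 / sum(K[i][j] * v[j] * b[j] for j in range(m)) for i in range(n)]
        # ... then columns so that the second marginal equals b.
        v = [1.0 / sum(K[i][j] * u[i] * a[i] for i in range(n)) for j in range(m)]
    return [[a[i] * u[i] * K[i][j] * v[j] * b[j] for j in range(m)] for i in range(n)]

def regularized_cost(pi, a, b, C, eps):
    # Transport cost plus eps times the relative entropy H(pi | a x b).
    transport = sum(pi[i][j] * C[i][j] for i in range(len(a)) for j in range(len(b)))
    entropy = sum(
        pi[i][j] * math.log(pi[i][j] / (a[i] * b[j]))
        for i in range(len(a)) for j in range(len(b)) if pi[i][j] > 0.0
    )
    return transport + eps * entropy

xs, ys = [0.0, 1.0, 2.0], [0.5, 1.5]          # toy supports (illustrative)
a, b = [1.0 / 3] * 3, [0.5, 0.5]               # uniform weights
C = [[(x - y) ** 2 for y in ys] for x in xs]   # squared-distance cost
eps = 0.5
pi = sinkhorn(a, b, C, eps)
value = regularized_cost(pi, a, b, C, eps)
print(value)
```

This direct form underflows when $\varepsilon$ is small compared with the costs; a log-domain variant is the usual remedy in that regime.
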
Entropic Regularization

Figure 3 – Influence of the regularization parameter $\varepsilon$ on the transport plan $\pi$.

Intuition: the entropic penalty ‘smooths’ the problem and avoids overfitting (think of ridge regression for least squares).

Dual Formulation

Contrary to standard OT, the dual of the regularized problem has no constraint. The standard dual reads:
$$(D) \qquad W_c(\alpha, \beta) = \max_{u \in \mathcal{C}(\mathcal{X}),\ v \in \mathcal{C}(\mathcal{Y})} \int_{\mathcal{X}} u(x)\, d\alpha(x) + \int_{\mathcal{Y}} v(y)\, d\beta(y)$$
such that $u(x) + v(y) \le c(x, y)$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$.

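The constraint in (D) can be illustrated on a small discrete example. Given any $u$, the choice $v(y) = \min_x \left[c(x, y) - u(x)\right]$ is the largest $v$ satisfying the constraint, so the pair is feasible and its dual value is a lower bound on $W_c$. The sketch below uses this construction with $u = 0$ on uniform 1D point clouds with squared-distance cost, where the primal value is given by the sorted matching; the helper names and toy points are illustrative assumptions.

```python
def cost(x, y):
    return (x - y) ** 2

def primal_1d(xs, ys):
    # Uniform, equal-size, 1D, convex cost: the sorted matching is optimal.
    return sum(cost(x, y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def c_transform(u, xs, ys):
    # Largest v with u(x_i) + v(y_j) <= c(x_i, y_j) for all i, j.
    return [min(cost(x, y) - ui for x, ui in zip(xs, u)) for y in ys]

def dual_value(u, v):
    # Integrals of u and v against the uniform empirical measures.
    return sum(u) / len(u) + sum(v) / len(v)

xs = [0.0, 1.0, 2.0, 5.0]
ys = [0.1, 0.2, 3.0, 4.0]
u = [0.0] * len(xs)
v = c_transform(u, xs, ys)

feasible = all(
    ui + vj <= cost(x, y) + 1e-12
    for x, ui in zip(xs, u) for y, vj in zip(ys, v)
)
primal = primal_1d(xs, ys)
dual = dual_value(u, v)
print(feasible, dual, primal)
```

Here the feasible pair gives a value strictly below the primal one, so $u = 0$ is not optimal; maximizing over $u$ closes the gap.
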