A Review of Regularized Optimal Transport. Marco Cuturi. Joint work with many people, including: G. Peyré, A. Genevay (ENS), A. Doucet (Oxford), J. Solomon (MIT), J.D. Benamou, N. Bonneel, F. Bach, L. Nenna (INRIA), G. Carlier (Dauphine).
What is Optimal Transport? A geometric toolbox to compare probability measures supported on a metric space. Monge, Kantorovich, Dantzig, Wasserstein, Brenier, Otto, McCann, Villani.
Examples of such measures: color histograms ($h_1$, $h_2$), statistical models ($p_\theta$, $p_{\theta'}$), bags of features, brain activation maps, empirical measures ($\mu$, $\nu$).
Two measures $p_\theta$, $p_{\theta'}$ in $P(\Omega)$ can be compared with the Wasserstein distance $W(p_\theta, p_{\theta'})$ and joined by an interpolant [McCann’95].
Three measures $p_\theta$, $p_{\theta'}$, $p_{\theta''}$ in $P(\Omega)$ can be averaged with a Wasserstein barycenter [Agueh’11].
OT and data-analysis
• Key developments in (applied) maths ~’90s: [McCann’95], [JKO’98], [Benamou’98], [Gangbo’98], [Ambrosio’06], [Villani’03/’09].
• Key developments in TCS / graphics since ’00s: [Rubner’98], [Indyk’03], [Naor’07], [Andoni’15].
• Little to no impact in large-scale data analysis: computationally heavy; Wasserstein distance is not differentiable.
Today’s talk: Entropy Regularized OT
• Very fast compared to usual approaches, GPGPU parallel.
• Differentiable, important if we want to use OT distances as loss functions.
• Can be automatically differentiated: simple iterative process, compatible with DL toolboxes.
• OT can become a building block in ML.
Background: OT Geometry. Consider $(\Omega, D)$, a metric probability space. Let $\mu, \nu$ be probability measures in $P(\Omega)$.
• [Monge’81] problem: find a map $T : \Omega \to \Omega$ solving
$\inf_{T_\# \mu = \nu} \int_\Omega D(x, T(x)) \, \mu(dx)$
[Figure: a point $x$ is sent to $T(x)$.]
[Kantorovich’42] Relaxation
• Instead of maps $T : \Omega \to \Omega$, consider probabilistic maps, i.e. couplings $P \in P(\Omega \times \Omega)$:
$\Pi(\mu, \nu) \overset{\text{def}}{=} \{ P \in P(\Omega \times \Omega) \mid \forall A, B \subset \Omega, \; P(A \times \Omega) = \mu(A), \; P(\Omega \times B) = \nu(B) \}$
Couplings. [Figure: two example couplings $P(x, y)$ with marginals $\mu(x)$ and $\nu(y)$.]
Wasserstein Distance. Def. For $p \geq 1$, the $p$-Wasserstein distance between $\mu, \nu$ in $P(\Omega)$ is
$W_p(\mu, \nu) \overset{\text{def}}{=} \left( \inf_{P \in \Pi(\mu, \nu)} E_P[D(X, Y)^p] \right)^{1/p}$.
Wasserstein between 2 Diracs. For $\delta_x$, $\delta_y$ in $(\Omega, D)$: $W_p^p(\delta_x, \delta_y) = D(x, y)^p$, i.e. $W_p(\delta_x, \delta_y) = D(x, y)$.
Wasserstein on Uniform Measures. In $(\Omega, D)$, let $\mu = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ and $\nu = \frac{1}{n} \sum_{j=1}^n \delta_{y_j}$. For a permutation $\sigma$, define the cost
$C(\sigma) = \frac{1}{n} \sum_{i=1}^n D(x_i, y_{\sigma_i})^p$
Optimal Assignment ⊂ Wasserstein. For $\mu = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ and $\nu = \frac{1}{n} \sum_{j=1}^n \delta_{y_j}$ in $(\Omega, D)$:
$W_p^p(\mu, \nu) = \min_{\sigma \in S_n} C(\sigma)$
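The assignment formula can be checked by brute force on a toy example. This is only a sketch: the points `x`, `y`, the choice $D(x, y) = |x - y|$ on the real line and the function names are assumptions made for illustration, and enumerating all $n!$ permutations is only viable for very small $n$.

```python
from itertools import permutations

def assignment_cost(x, y, sigma, p=2):
    """C(sigma) = (1/n) * sum_i D(x_i, y_sigma(i))^p, with D(x, y) = |x - y|."""
    n = len(x)
    return sum(abs(x[i] - y[sigma[i]]) ** p for i in range(n)) / n

def wasserstein_pp_uniform(x, y, p=2):
    """W_p^p between two uniform measures on n points each, by enumerating S_n."""
    n = len(x)
    return min(assignment_cost(x, y, sigma, p) for sigma in permutations(range(n)))

# Hypothetical toy data: three points on the real line for each measure.
x = [0.0, 1.0, 3.0]
y = [3.5, 0.5, 1.0]
w_pp = wasserstein_pp_uniform(x, y, p=2)
print(w_pp)
```

Practical solvers replace the enumeration with a polynomial-time assignment algorithm; the brute-force version is kept here because it mirrors the formula term by term.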
Wasserstein on Empirical Measures. In $(\Omega, D)$, consider $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j}$.
$M_{XY} \overset{\text{def}}{=} [D(x_i, y_j)^p]_{ij}$
$U(a, b) \overset{\text{def}}{=} \{ P \in \mathbb{R}_+^{n \times m} \mid P \mathbf{1}_m = a, \; P^T \mathbf{1}_n = b \}$
[Figure: the $n \times m$ table of costs $D(x_i, y_j)^p$; rows are indexed by $x_i$ with weights $a_i$ (row sums $P \mathbf{1}_m = a$), columns by $y_j$ with weights $b_j$ (column sums $P^T \mathbf{1}_n = b$).]
Def. Optimal Transport Problem: $W_p^p(\mu, \nu) = \min_{P \in U(a, b)} \langle P, M_{XY} \rangle$
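To make the objects $M_{XY}$, $U(a, b)$ and $\langle P, M_{XY} \rangle$ concrete, here is a small dependency-free sketch. The data, the ground metric $D(x, y) = |x - y|$ and all function names are assumptions for illustration. It does not solve the minimization; it only checks that two candidate couplings are feasible and compares their costs, each of which is an upper bound on $W_p^p(\mu, \nu)$.

```python
def cost_matrix(x, y, p=2):
    """M_XY[i][j] = D(x_i, y_j)^p with D(x, y) = |x - y|."""
    return [[abs(xi - yj) ** p for yj in y] for xi in x]

def in_transport_polytope(P, a, b, tol=1e-9):
    """Check P >= 0, row sums equal to a, column sums equal to b."""
    n, m = len(a), len(b)
    nonneg = all(P[i][j] >= 0 for i in range(n) for j in range(m))
    rows_ok = all(abs(sum(P[i]) - a[i]) <= tol for i in range(n))
    cols_ok = all(abs(sum(P[i][j] for i in range(n)) - b[j]) <= tol for j in range(m))
    return nonneg and rows_ok and cols_ok

def transport_cost(P, M):
    """Frobenius inner product <P, M>."""
    return sum(p_ij * m_ij for p_row, m_row in zip(P, M) for p_ij, m_ij in zip(p_row, m_row))

# Hypothetical toy measures on the real line.
x, a = [0.0, 1.0], [0.5, 0.5]
y, b = [0.0, 1.0, 2.0], [0.25, 0.5, 0.25]
M = cost_matrix(x, y, p=2)

P_indep = [[ai * bj for bj in b] for ai in a]      # independent coupling a b^T
P_better = [[0.25, 0.25, 0.0], [0.0, 0.25, 0.25]]  # a hand-picked cheaper coupling

cost_indep = transport_cost(P_indep, M)
cost_better = transport_cost(P_better, M)
print(cost_indep, cost_better)
```

The independent coupling $a b^T$ always belongs to $U(a, b)$, so the polytope is never empty.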
Discrete OT Problem. [Figure: the polytope $U(a, b)$, the cost direction $M_{XY}$ and an optimal solution $P^\star$.]
Def. Dual OT problem:
$W_p^p(\mu, \nu) = \max_{\alpha \in \mathbb{R}^n, \, \beta \in \mathbb{R}^m, \; \alpha_i + \beta_j \leq D(x_i, y_j)^p} \; \alpha^T a + \beta^T b$
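The easy half of this duality (the dual value never exceeds the primal value) follows in one line from the constraints. The derivation below is a sketch using only the definitions of $U(a, b)$ and $M_{XY}$; equality at the optimum is the strong duality statement of linear programming and is not shown here.

```latex
% For any P in U(a,b) and any dual-feasible (alpha, beta):
\begin{align*}
\alpha^T a + \beta^T b
  &= \sum_{i=1}^n \alpha_i \sum_{j=1}^m P_{ij} + \sum_{j=1}^m \beta_j \sum_{i=1}^n P_{ij}
     && \text{since } P \mathbf{1}_m = a,\ P^T \mathbf{1}_n = b \\
  &= \sum_{i,j} P_{ij} (\alpha_i + \beta_j) \\
  &\leq \sum_{i,j} P_{ij} \, D(x_i, y_j)^p
     && \text{since } P_{ij} \geq 0,\ \alpha_i + \beta_j \leq D(x_i, y_j)^p \\
  &= \langle P, M_{XY} \rangle .
\end{align*}
```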
Network flow solver used in practice: $O(n^3 \log(n))$. Note: flow/PDE formulations [Beckman’61]/[Benamou’98] can be used for p=1/p=2 for a sparse-graph metric/Euclidean metric.
Solution $P^\star$ unstable and not always unique: the optimum can be a whole set $\{P^\star\}$. $W_p^p(\mu, \nu)$ not differentiable.
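A two-point example illustrates both issues. In this sketch (the points, the perturbation `eps` and the function name are assumptions for illustration), each measure is uniform on two corners of the unit square, with $p = 1$ and the Euclidean distance. At `eps = 0` both assignments have the same cost, so the optimum is not unique; an arbitrarily small perturbation of one point makes the optimal assignment jump from one permutation to the other.

```python
import math

def assignment_costs(eps, p=1):
    """Costs of the two possible assignments between two uniform 2-point measures.

    x sits on one diagonal of the unit square, y on the other; y[0] is shifted
    vertically by eps. Returns (cost of identity, cost of swap).
    """
    x = [(0.0, 0.0), (1.0, 1.0)]
    y = [(0.0, 1.0 + eps), (1.0, 0.0)]

    def cost(sigma):
        return sum(math.dist(x[i], y[sigma[i]]) ** p for i in range(2)) / 2

    return cost((0, 1)), cost((1, 0))

for eps in (-0.01, 0.0, 0.01):
    identity, swap = assignment_costs(eps)
    print(eps, identity, swap)
```

At `eps = 0` every convex combination of the two permutation matrices is also optimal, which is the set-valued solution $\{P^\star\}$.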
Entropic Regularization [Wilson’62]. Def. Regularized Wasserstein, $\gamma \geq 0$:
$W_\gamma(\mu, \nu) \overset{\text{def}}{=} \min_{P \in U(a, b)} \langle P, M_{XY} \rangle - \gamma E(P)$
$E(P) \overset{\text{def}}{=} - \sum_{i=1}^{n} \sum_{j=1}^{m} P_{ij} \log P_{ij}$
Note: for $\gamma > 0$ the optimal solution $P_\gamma$ is unique, because of the strong concavity of the entropy.
[Figure: the regularized solution $P_\gamma$, with labels $\mu$, $\nu$, $\gamma$.]
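The first-order optimality conditions of this problem already reveal the structure of its solution. The sketch below assumes $\gamma > 0$ and introduces Lagrange multipliers $\alpha \in \mathbb{R}^n$, $\beta \in \mathbb{R}^m$ for the two marginal constraints; these names are chosen here for illustration. The non-negativity constraint is left out because the entropy term keeps the solution strictly positive.

```latex
\begin{align*}
\mathcal{L}(P, \alpha, \beta)
  &= \sum_{i,j} P_{ij} M_{ij} + \gamma \sum_{i,j} P_{ij} \log P_{ij}
     - \sum_{i} \alpha_i \Big( \sum_{j} P_{ij} - a_i \Big)
     - \sum_{j} \beta_j \Big( \sum_{i} P_{ij} - b_j \Big), \\
\frac{\partial \mathcal{L}}{\partial P_{ij}}
  &= M_{ij} + \gamma (\log P_{ij} + 1) - \alpha_i - \beta_j = 0 \\
\Longrightarrow \quad
P_{ij} &= e^{\alpha_i / \gamma - 1/2} \; e^{- M_{ij} / \gamma} \; e^{\beta_j / \gamma - 1/2} .
\end{align*}
```

Each entry is therefore a row factor times $e^{-M_{ij}/\gamma}$ times a column factor, with both factors positive.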
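Fast & Scalable Algorithm. Prop. If $P_\gamma \overset{\text{def}}{=} \operatorname{argmin}_{P \in U(a, b)} \langle P, M_{XY} \rangle - \gamma E(P)$, then $\exists! \, u \in \mathbb{R}_+^n, v \in \mathbb{R}_+^m$ such that
$P_\gamma = \operatorname{diag}(u) \, K \, \operatorname{diag}(v), \quad K \overset{\text{def}}{=} e^{-M_{XY} / \gamma}$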
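One standard way to find such scalings $u$, $v$ is Sinkhorn's matrix-scaling iteration, which alternately rescales the rows and columns of $K$ to match $a$ and $b$. The slide above does not name a specific algorithm, so treat the following as a minimal sketch under that assumption; the function name, the fixed iteration count and the toy data are illustrative, and plain Python lists are used only to keep the example dependency-free.

```python
import math

def sinkhorn(a, b, M, gamma, n_iter=500):
    """Return (P, u, v) with P = diag(u) K diag(v) and K = exp(-M / gamma).

    Alternates u <- a / (K v) and v <- b / (K^T u) for a fixed number of sweeps.
    """
    n, m = len(a), len(b)
    K = [[math.exp(-M[i][j] / gamma) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Row scaling: make the row sums of diag(u) K diag(v) equal to a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column scaling: make the column sums equal to b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    P = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    return P, u, v

# Hypothetical toy problem: squared distances between points on the real line.
a = [0.5, 0.5]
b = [0.25, 0.5, 0.25]
M = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0]]
P_gamma, u, v = sinkhorn(a, b, M, gamma=1.0)

row_sums = [sum(row) for row in P_gamma]
col_sums = [sum(P_gamma[i][j] for i in range(len(a))) for j in range(len(b))]
print(row_sums, col_sums)
```

Each sweep only needs two matrix-vector products with $K$, which is what makes the scheme easy to parallelize. For very small $\gamma$ the entries of $K$ underflow, and a log-domain variant is usually preferred.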