Data-driven optimal transport Esteban G. Tabak NYU - Courant Institute of Mathematical Sciences Giulio Trigila Technische Universit¨ at M¨ unchen Machine Learning Seminar, November 5 2013
The story in a picture µ (y) ρ (x) Y X π (x,y) x i y j
Optimal transport (Monge formulation) Find a plan to transport material from one location to another that minimizes the total transportation cost. In terms of the probability densities ρ ( x ) and µ ( y ) with x , y ∈ R n , � min y ( x ) M ( y ) = R n c ( x , y ( x )) ρ ( x ) dx . subject to � � ρ ( x ) dx = µ ( y ) dy y − 1 (Ω) Ω for all measurable sets Ω. If y is smooth and one to one, this is equivalent to the point-wise relation ρ ( x ) = µ ( y ( x )) J y ( x ) ,
Kantorovich formulation � π ( x , y ) K ( π ) = min c ( x , y ) π ( x , y ) dxdy subject to � � ρ ( x ) = π ( x , y ) dy , µ ( y ) = π ( x , y ) dx If c ( x , y ) is convex, min M ( y ) = max K ( π ) and π ( S ) = ρ ( { x : ( x , y ( x )) ∈ S } ) .
Dual formulation Maximize � � D ( u , v ) = u ( x ) ρ ( x ) dx + v ( x ) µ ( y ) dy over all continuous functions u and v satisfying u ( x ) + v ( y ) ≤ c ( x , y ) .
The standard example: c ( c , y ) = � y − x � 2 � � min c ( x , y ) π ( x , y ) dxdy → max x · y π ( x , y ) dxdy , with dual � � min u ( x ) ρ ( x ) dx + v ( y ) µ ( y ) dy , u ( x ) + v ( y ) ≥ x · y . y ( x · y − v ( y )) ≡ v ∗ ( x ) u ( x ) = max x ( x · y − u ( x )) ≡ u ∗ ( y ) v ( y ) = max and the optimal map for Monge’s problem is given by y ( x ) = ∇ u ( x ) , with potential u ( x ) satisfying the Monge-Ampere equation ρ ( x ) = µ ( ∇ u ( x )) det ( D 2 u )( x ) .
A data-based formulation In Kantorovich dual formulation, the objective function to maximize is a sum of two expected values: � � D ( u , v ) = u ( x ) ρ ( x ) dx + v ( x ) µ ( y ) dy . Hence, if ρ ( x ) and µ ( y ) are known only through samples, it is natural to pose the problem in terms of empirical means: Maximize m n D ( u , v ) = 1 u ( x i ) + 1 � � v ( y j ) m n i =1 j =1 over functions u and v satisfying u ( x ) + v ( y ) ≤ c ( x , y ) .
A purely discrete reduction Maximize m n D ( u , v ) = 1 u i + 1 � � v j (1) m n i =1 j =1 over vectors u and v satisfying u i + v j ≤ c ij . (2) This is dual to the uncapacitated transportation problem : Minimize � C ( π ) = c ij π ij (3) i , j subject to � = (4) π ij n j � π ij = m . (5) i
Fully constrained scenario m n D ( u , v ) = 1 u ( x i )+1 � � � � v ( y j ) = u ( x ) ρ ( x ) dx + v ( x ) µ ( y ) dy m n i =1 j =1 with m n ρ ( x ) = 1 δ ( x − x i ) , µ ( y ) = 1 � � δ ( y − y j ) , m n i =1 j =1 which brings us back to the purely discrete case.
The Lagrangian � L ( π, u , v ) = c ( x , y ) π ( x , y ) dxdy � � m � ρ ( x ) − 1 � − δ ( x − x i ) u ( x ) dx m i =1 � n µ ( y ) − 1 � v ( y ) dy , − δ ( y − y j ) n j =1 where ρ ( x ) and µ ( y ) are shortcut notations for � ρ ( x ) = π ( x , y ) dy and � µ ( y ) = π ( x , y ) dx .
In terms of the Lagrangian, the problem can be formulated as a game: d : max π ( x , y ) ≥ 0 L ( π, u , v ) , min u ( x ) , v ( y ) with dual p : min u ( x ) , v ( y ) L ( π, u , v ) . max π ( x , y ) ≥ 0 If the functions u ( x ) and v ( y ) are unrestricted, the problem p becomes � p : min c ( x , y ) π ( x , y ) dxdy π ( x , y ) ≥ 0 subject to m n ρ ( x ) = 1 δ ( x − x i ) , µ ( y ) = 1 � � δ ( y − y j ) , m n i =1 j =1 as before.
A suitable relaxation Instead, restrictict the space of functions F from where u and v can be selected. If F is invariant under dilations: u ∈ F , λ ∈ R → λ u ∈ F , then the problem p becomes � p : min c ( x , y ) π ( x , y ) dxdy π ( x , y ) ≥ 0 � � m � ρ ( x ) − 1 � δ ( x − x i ) u ( x ) dx = 0 , m i =1 � n µ ( y ) − 1 � v ( y ) dy δ ( y − y j ) = 0 n j =1 for all u ( x ) , v ( y ) ∈ F , a weak formulation of the constraints.
Some possible choices for F 1. Constants: The constraints just guaranty that ρ and µ integrate to one. 2. Linear Functions: The solution is a rigid displacement that moves the mean of { x i } into the mean of { y j } . 3. Quadratic Functions: A linear transformation mapping the empirical mean and covariance matrix of { x i } into those of { y j } . 4. Smooth functions with appropriate local bandwidth, neither to over nor under-resolve ρ ( x ) and µ ( y ).
A flow-based, primal-dual approach c = 1 2 � x − y � 2 , y ( x ) = ∇ u ( x ) . We think of the map y ( x ) as the endpoint of a time-dependent flow z ( x , t ), such that z ( x , 0) = x , and z ( x , ∞ ) = y ( x ) . z ( x , t ) follows the gradient flow � � δ ˜ D t z = −∇ z ˙ δ u u = 1 2 � x � 2 z ( x , 0) = x where � � ˜ u ∗ ( y ) µ ( y ) dy D t = u ( z ) ρ t ( z ) dz + and ρ t is the evolving probability distribution underlying the points z ( x , t ).
The variational derivative of ˜ D adopts the form δ ˜ D t � � D 2 u � � µ ( ∇ u ) δ u = ρ t − 2 � x � 2 (the potential corresponding to the which, applied at u = 1 identity map), yields the simpler expression δ ˜ D t δ u ( z ) = ρ t ( z ) − µ ( z ) , so � z = −∇ z [ ρ t ( z ) − µ ( z )] ˙ z ( x , 0) = x
The probability density ρ t satisfies the continuity equation ∂ρ t ( z ) + ∇ z · [ ρ t ( z )˙ z ] = 0 , ∂ t yielding ∂ρ t ( z ) − ∇ z · [ ρ t ( z ) ∇ z ( ρ t ( z ) − µ ( z ))] = 0 , ∂ t a nonlinear heat equation that converges to ρ t = ∞ ( z ) = µ ( z ).
In terms of samples, 1. Objetive function: u ∗ ( y ) µ ( y ) dy → 1 u ( z i )+1 � � ˜ � � u ∗ ( y j ) D t = u ( z ) ρ t ( z ) dz + m n i j 2. Test functions: General u ( z ) ∈ F → localized u ( z , β ), with β a finite-dimensional parameter and appropriate bandwidth. u = δ ˜ D t 3. Gradient descent: − ˙ δ u ( z ) = ρ t ( z ) − µ ( z ) → D t ( z ) = 1 ∇ β u ( z i ) − 1 − ˙ β = ∇ β ˜ � � ∇ β u ( y j ) m n i j
Left to discuss: 1. Form of u ( z , β ), determination of bandwidth: u ( z , β ) = 1 � � z − z k � � 2 � x � 2 + � β k F k , α k k � 1 � 1 d α k ∝ . ρ t ( z k ) 2. Enforcement of the map’s optimality.
Some applications and extensions 1. Maps: fluid flow from tracers, effects of a medical treatment, resource allocation, . . . 2. Density estimation: ρ ( x ) 3. Classification and clustering: ρ k ( x ) 4. Regression: π ( x , y ) , dim( x ) � = dim( y ), � y − y k � 2 � c ( x , y ) = w k α 2 k + � x − x k � 2 k 5. Determination of worst scenario under given marginals: π ( x 1 , x 2 , . . . , x n )
Thanks!
Recommend
More recommend