

  1. Advanced Section #2: Optimal Transport
     AC 209B: Data Science 2
     Javier Zazo, Pavlos Protopapas

  2. Lecture Outline
     ◮ Historical overview
     ◮ Definitions and formulations
     ◮ Metric properties of optimal transport
     ◮ Application I: Supervised learning with Wasserstein loss
     ◮ Application II: Domain adaptation

  3. Historical overview

  4. The origins of optimal transport
     ◮ Gaspard Monge proposed the first idea in 1781.
     ◮ How to move dirt from one place (déblais) to another (remblais) with minimal effort?
     ◮ He enunciated the problem of finding a mapping F between two distributions of mass.
     ◮ The optimization is with respect to a displacement cost c(x, y).

  5. Transportation problem I
     ◮ Formulated by Frank Lauren Hitchcock in 1941.
     Factories & warehouses example:
     ◮ A fixed number of factories, each of which produces goods at a fixed output rate.
     ◮ A fixed number of warehouses, each of which has a fixed storage capacity.
     ◮ There is a cost to transport goods from a factory to a warehouse.
     ◮ Goal: find the transportation of goods from factory → warehouse with the lowest possible cost.

  6. Transportation problem II: Example
     Factories:
     ◮ F1 makes 5 units.
     ◮ F2 makes 4 units.
     ◮ F3 makes 6 units.
     Warehouses:
     ◮ W1 can store 5 units.
     ◮ W2 can store 3 units.
     ◮ W3 can store 5 units.
     ◮ W4 can store 2 units.
     Transportation costs:
          W1  W2  W3  W4
     F1    5   4   7   6
     F2    2   5   3   5
     F3    6   3   4   4
     [Figure: bipartite graph of factories and warehouses annotated with supplies, capacities and transport costs.]

  7. Transportation problem III
     ◮ One factory can transport product to multiple warehouses.
     ◮ One warehouse can receive product from multiple factories.
     ◮ The transportation problem can be formulated as an ordinary linear constrained optimization problem (LP), which a standard solver handles directly (see the sketch after this slide):
       \[
       \begin{aligned}
       \min_{x_{ij}} \quad & 5x_{11} + 4x_{12} + 7x_{13} + 6x_{14} + 2x_{21} + 5x_{22} + 3x_{23} + 2x_{24} + 6x_{31} + 3x_{32} + 4x_{33} + 4x_{34} \\
       \text{s.t.} \quad & x_{11} + x_{12} + x_{13} + x_{14} = 5 \\
       & x_{21} + x_{22} + x_{23} + x_{24} = 4 \\
       & x_{31} + x_{32} + x_{33} + x_{34} = 6 \\
       & x_{11} + x_{21} + x_{31} \le 5 \\
       & x_{12} + x_{22} + x_{32} \le 3 \\
       & x_{13} + x_{23} + x_{33} \le 5 \\
       & x_{14} + x_{24} + x_{34} \le 2
       \end{aligned}
       \]
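
A minimal sketch (not part of the original slides) of solving this LP numerically with SciPy; the use of scipy.optimize.linprog and the variable names are my own choices, and the cost matrix mirrors the coefficients of the objective above.

```python
# Sketch: the factory/warehouse transportation LP solved with SciPy.
import numpy as np
from scipy.optimize import linprog

C = np.array([[5, 4, 7, 6],        # cost of shipping one unit from F_i to W_j,
              [2, 5, 3, 2],        # taken from the objective coefficients above
              [6, 3, 4, 4]])
supply = np.array([5, 4, 6])       # units produced by F1, F2, F3
capacity = np.array([5, 3, 5, 2])  # storage limits of W1, W2, W3, W4

n_f, n_w = C.shape
c = C.ravel()                      # decision variables x_ij in row-major order

A_eq = np.kron(np.eye(n_f), np.ones(n_w))   # each factory ships exactly its production
A_ub = np.kron(np.ones(n_f), np.eye(n_w))   # each warehouse receives at most its capacity

res = linprog(c, A_ub=A_ub, b_ub=capacity, A_eq=A_eq, b_eq=supply, bounds=(0, None))
print(res.x.reshape(n_f, n_w))     # optimal shipping plan
print(res.fun)                     # minimal total cost
```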

  8. Definitions and formulations

  9. Definitions
     ◮ Probability simplex:
       \[ \Delta_n = \Big\{ a \in \mathbb{R}^n_+ \;\Big|\; \sum_{i=1}^n a_i = 1 \Big\} \]
     ◮ Discrete probability distribution: p = (p_1, p_2, ..., p_n) ∈ Δ_n.
     ◮ Space X: support of the distribution (coordinate vector/array, temperature, etc.).
     ◮ Discrete measure: given weights p = (p_1, p_2, ..., p_n) and locations x = (x_1, x_2, ..., x_n),
       \[ \alpha = \sum_i p_i \delta_{x_i} \]
     ◮ Radon measure: α ∈ M(X), where X ⊂ R^d is equipped with a distance; integrating α against a continuous function f gives
       \[ \int_X f(x)\, d\alpha(x) = \int_X f(x)\, \rho_\alpha(x)\, dx \]
       when α admits a density ρ_α.

  10. More definitions
      ◮ Set of positive measures: M_+(X), such that \int_X f(x)\, d\alpha(x) \in \mathbb{R}_+.
      ◮ Set of probability measures: M^1_+(X), such that \int_X d\alpha(x) = 1.

  11. Assignment and Monge problems
      ◮ n origin elements (factories),
      ◮ m = n destination elements (warehouses),
      ◮ we look for a permutation (an assignment in the general case) of elements (see the sketch after this slide):
        \[ \min_{\sigma \in \mathrm{Perm}(n)} \; \frac{1}{n} \sum_{i=1}^n C_{i, \sigma(i)} \]
      ◮ The set of n discrete elements has n! possible permutations.
      ◮ Later works, such as Hitchcock (1941) and Kantorovich (1942), aimed to simplify Monge's problem.
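
To make the assignment problem concrete, here is a small sketch (my own illustration, not from the slides) using SciPy's Hungarian-style solver; the cost matrix is arbitrary random data.

```python
# Sketch: min over permutations sigma of (1/n) * sum_i C[i, sigma(i)].
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
C = rng.random((4, 4))                 # cost of assigning origin i to destination j

rows, cols = linear_sum_assignment(C)  # optimal assignment: origin i -> destination cols[i]
print(cols)                            # the optimal permutation sigma
print(C[rows, cols].mean())            # objective value (1/n) * sum_i C[i, sigma(i)]
```

Brute-forcing the same problem would require checking all n! permutations, which is exactly what the relaxations on the following slides avoid.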

  12. Kantorovich relaxation
      ◮ Goal: find a minimal-cost transport plan F such that
        \[ F \in U(p, q) = \{ F \in \mathbb{R}^{n \times n}_+ \mid F\mathbf{1} = p \text{ and } F^T\mathbf{1} = q \} \]
      ◮ F1 = p sums the rows of F → all goods are transported from p.
      ◮ F^T 1 = q sums the columns of F → all goods are received in q.
      ◮ p and q are probability distributions → mass is conserved and equals 1.

  13. Relation to linear programming
      ◮ The Kantorovich problem is an LP:
        \[ L_C(p, q) = \min_{F \ge 0} \; \mathrm{tr}(FC) \quad \text{s.t.} \quad F\mathbf{1} = p, \; F^T\mathbf{1} = q \tag{1} \]
      ◮ LPs can be solved with the simplex method, interior point methods, dual descent methods, etc. The problem is convex.
      ◮ One option is to use LP solvers: Clp, Gurobi, Mosek, SeDuMi, CPLEX, ECOS, etc.
      ◮ Specialized methods exist (with libraries in Python, C, Julia, etc.); see the sketch after this slide:
        – Network simplex
        – Approximate methods: Sinkhorn, smoothed versions, etc.
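
As a reference point (my own illustration, assuming the POT "Python Optimal Transport" package is installed and imported as `ot`), the network simplex and Sinkhorn solvers mentioned above can be called directly; the histograms and ground cost are toy data.

```python
# Sketch: solving the discrete Kantorovich problem with the POT library.
import numpy as np
import ot  # Python Optimal Transport (pip install pot)

n = 5
p = np.ones(n) / n                        # uniform source histogram
q = np.array([0.1, 0.2, 0.3, 0.2, 0.2])   # target histogram
x = np.arange(n, dtype=float).reshape(-1, 1)
C = ot.dist(x, x, metric='euclidean')     # ground cost C[i, j] = |x_i - x_j|

F_exact = ot.emd(p, q, C)                 # exact plan (network simplex)
F_sink = ot.sinkhorn(p, q, C, reg=0.1)    # entropy-regularized (Sinkhorn) plan

print(np.sum(F_exact * C))                # optimal cost L_C(p, q)
print(np.sum(F_sink * C))                 # approximate cost from Sinkhorn
```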

  14. Kantorovich formulation for arbitrary measures
      ◮ Now C needs to be a function c(x, y): X × Y → R_+.
      ◮ Discrete measures α = Σ_i p_i δ_{x_i} and β = Σ_i q_i δ_{y_i}:
        – c(x, y) is still a matrix whose costs depend on the locations of the measures.
      ◮ For arbitrary probability measures:
        – Define a coupling π ∈ M^1_+(X × Y) → a joint probability distribution over X and Y:
          \[ U(\alpha, \beta) = \big\{ \pi \in M^1_+(X \times Y) \;\big|\; P_{X\sharp}\pi = \alpha \text{ and } P_{Y\sharp}\pi = \beta \big\} \]
        – The continuous problem:
          \[ L_c(\alpha, \beta) = \min_{\pi \in U(\alpha, \beta)} \int_{X \times Y} c(x, y)\, d\pi(x, y) = \min \big\{ \mathbb{E}_{(X,Y)}[c(X, Y)] \;\big|\; X \sim \alpha, \; Y \sim \beta \big\} \]

  15. Example of transport maps for arbitrary measures

  16. Metric properties of optimal transport

  17. Metric properties of discrete optimal transport
      ◮ The Wasserstein distance is also referred to as the OT distance or the Earth mover's distance (EMD).
      Discrete Wasserstein distance. Consider p, q ∈ Δ_n and
        \[ C \in \mathcal{C}_n = \Big\{ C \in \mathbb{R}^{n \times n}_+ \;\Big|\; C = C^T, \; \mathrm{diag}(C) = 0 \text{ and } \forall (i, j, k): \; C_{i,j} \le C_{i,k} + C_{k,j} \Big\}. \]
      Then W_p(p, q) = L_{C^p}(p, q)^{1/p} defines a p-Wasserstein distance on Δ_n (C^p denotes the elementwise p-th power of C).
      ◮ Recall that L_C(p, q) refers to the discrete Kantorovich problem:
        \[ L_C(p, q) = \min_{F \ge 0} \; \mathrm{tr}(FC) \quad \text{s.t.} \quad F\mathbf{1} = p, \; F^T\mathbf{1} = q \]
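
A small sketch (my own illustration, again assuming the POT package) of computing W_p from a ground distance matrix D by raising it elementwise to the p-th power inside the Kantorovich problem; the histograms are toy data.

```python
# Sketch: p-Wasserstein distance on the simplex from a ground distance matrix D.
import numpy as np
import ot

def wasserstein_p(a, b, D, p=2):
    """W_p(a, b) = L_{D^p}(a, b)^(1/p), with D a valid distance matrix."""
    cost = ot.emd2(a, b, D ** p)          # optimal value of the Kantorovich LP
    return cost ** (1.0 / p)

x = np.arange(4, dtype=float).reshape(-1, 1)
D = ot.dist(x, x, metric='euclidean')     # symmetric, zero diagonal, triangle inequality
a = np.array([0.4, 0.3, 0.2, 0.1])
b = np.array([0.1, 0.2, 0.3, 0.4])
print(wasserstein_p(a, b, D, p=1))
print(wasserstein_p(a, b, D, p=2))
```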

  18. Proof that the p-Wasserstein distance constitutes a distance
      ◮ We need to show positivity, symmetry and the triangle inequality.
      ◮ Since diag(C) = 0, W_p(p, p) = 0, attained by the optimal plan F* = diag(p).
      ◮ Because of the strict positivity of the off-diagonal elements of C, the optimal cost tr(CF) > 0 for p ≠ q, hence W_p(p, q) > 0.
      ◮ Since the cost tr(CF) is unchanged when both C and F are transposed, and C is symmetric, W_p(p, q) = W_p(q, p).
      ◮ For the triangle inequality, take p, q and t, and let
        F = sol(W_p(p, q)),   G = sol(W_p(q, t)).
      ◮ For simplicity, assume q > 0 (detailed proof in the lecture notes). Define
        S = F diag(1/q) G ∈ R^{n×n}_+.
      ◮ Note that S ∈ U(p, t), i.e., it is a feasible transport plan:
        \[ S\mathbf{1} = F\,\mathrm{diag}(1/q)\,G\mathbf{1} = F\,\mathrm{diag}(1/q)\,q = F\mathbf{1} = p \]
        \[ S^T\mathbf{1} = G^T\,\mathrm{diag}(1/q)\,F^T\mathbf{1} = G^T\,\mathrm{diag}(1/q)\,q = G^T\mathbf{1} = t \]
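
To close the argument (this final step is not written out on the slide; it follows the standard proof via the triangle inequality for C and Minkowski's inequality), feasibility of S gives

\[
W_p(p, t) \le \Big( \sum_{i,k} C_{i,k}^p \, S_{i,k} \Big)^{1/p}
           = \Big( \sum_{i,j,k} C_{i,k}^p \, \tfrac{F_{i,j} G_{j,k}}{q_j} \Big)^{1/p}
           \le \Big( \sum_{i,j,k} (C_{i,j} + C_{j,k})^p \, \tfrac{F_{i,j} G_{j,k}}{q_j} \Big)^{1/p}
\]

and, applying Minkowski's inequality and then the marginal constraints on F and G,

\[
\le \Big( \sum_{i,j} C_{i,j}^p F_{i,j} \Big)^{1/p} + \Big( \sum_{j,k} C_{j,k}^p G_{j,k} \Big)^{1/p}
  = W_p(p, q) + W_p(q, t).
\]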

  19. Wasserstein distance for arbitrary measures
      Wasserstein distance for arbitrary measures. Consider α ∈ M^1_+(X), β ∈ M^1_+(Y) with X = Y and, for some p ≥ 1, a cost satisfying
      ◮ c(x, y) = c(y, x) ≥ 0;
      ◮ c(x, y) = 0 if and only if x = y;
      ◮ ∀(x, y, z) ∈ X^3: c(x, y) ≤ c(x, z) + c(z, y).
      Then W_p(α, β) = L_{c^p}(α, β)^{1/p} defines a p-Wasserstein distance on M^1_+(X).
      ◮ Recall that the Kantorovich problem for arbitrary measures is given by:
        \[ L_c(\alpha, \beta) = \min_{\pi \in U(\alpha, \beta)} \int_{X \times Y} c(x, y)\, d\pi(x, y) \]

  20. Special cases I
      ◮ Binary cost matrix: if C = 11^T − I, then L_C(p, q) = (1/2) ‖p − q‖_1.
      ◮ 1D case of empirical measures:
        – X = R; α = (1/n) Σ_i δ_{x_i}, β = (1/n) Σ_i δ_{y_i};
        – x_1 ≤ x_2 ≤ ... ≤ x_n and y_1 ≤ y_2 ≤ ... ≤ y_n are the ordered observations;
        – then (see the sketch after this slide)
          \[ W_p(\alpha, \beta)^p = \frac{1}{n} \sum_{i=1}^n |x_i - y_i|^p \]
      ◮ Histogram equalization: [figure illustrating histogram equalization as a 1D transport problem]
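
A minimal sketch (my own illustration) of the 1D formula above: sort both samples and match them monotonically; for p = 1 the result can be cross-checked against scipy.stats.wasserstein_distance.

```python
# Sketch: 1-D p-Wasserstein distance between two empirical measures of equal size.
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D W_1 (earth mover's distance)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(2.0, 1.5, size=500)

def wasserstein_1d(x, y, p=1):
    """W_p between empirical measures (1/n) sum_i delta_{x_i} and (1/n) sum_i delta_{y_i}."""
    xs, ys = np.sort(x), np.sort(y)            # monotone (sorted) matching is optimal in 1-D
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

print(wasserstein_1d(x, y, p=1))
print(wasserstein_distance(x, y))              # should agree with the p = 1 result above
```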

  21. Color transfer

  22. Special cases II: Distance between Gaussians
      ◮ If α = N(m_α, Σ_α) and β = N(m_β, Σ_β) are two Gaussians in R^d, the following map
        \[ T: x \mapsto m_\beta + A(x - m_\alpha), \quad \text{where } A = \Sigma_\alpha^{-1/2} \big( \Sigma_\alpha^{1/2} \Sigma_\beta \Sigma_\alpha^{1/2} \big)^{1/2} \Sigma_\alpha^{-1/2}, \]
        constitutes an optimal transport map.
      ◮ Furthermore,
        \[ W_2^2(\alpha, \beta) = \| m_\alpha - m_\beta \|_2^2 + \mathrm{tr}\Big( \Sigma_\alpha + \Sigma_\beta - 2\big( \Sigma_\alpha^{1/2} \Sigma_\beta \Sigma_\alpha^{1/2} \big)^{1/2} \Big). \]
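
A short sketch (my own illustration) evaluating the closed-form W_2 above with scipy.linalg.sqrtm; the means and covariances are arbitrary toy values.

```python
# Sketch: closed-form 2-Wasserstein distance between two Gaussians.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(m_a, S_a, m_b, S_b):
    """W_2 between N(m_a, S_a) and N(m_b, S_b) using the formula above."""
    S_a_half = np.real(sqrtm(S_a))
    cross = np.real(sqrtm(S_a_half @ S_b @ S_a_half))  # (S_a^{1/2} S_b S_a^{1/2})^{1/2}
    bures = np.trace(S_a + S_b - 2.0 * cross)
    return np.sqrt(np.sum((m_a - m_b) ** 2) + bures)

m_a, S_a = np.zeros(2), np.eye(2)
m_b, S_b = np.array([3.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
print(gaussian_w2(m_a, S_a, m_b, S_b))
```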

  23. Application I: Supervised learning with Wasserstein Loss

  24. Learning with Wasserstein Loss
      ◮ A natural metric on the outputs can be used to improve predictions.
      ◮ The Wasserstein distance provides a natural notion of dissimilarity for probability measures → it can encourage smoothness in the predictions.
        – In ImageNet, the 1000 categories may have inherent semantic relationships.
        – In speech recognition systems, outputs correspond to keywords that also have semantic relations → this correlation can be exploited.

  25. Semantic relationships: Flickr dataset

  26. Problem setup
      ◮ Goal: learn a mapping X ⊂ R^d → Y = R^K_+ over a label set K, where |K| = K.
      ◮ Assume K possesses a metric d_K(·, ·), the ground metric.
      ◮ Learning is over a hypothesis space H of predictors h_θ: X → Y, parametrized by θ ∈ Θ.
        – These can be a logistic regression, the output of a neural network, etc.
      ◮ Empirical risk minimization:
        \[ \min_{h_\theta \in \mathcal{H}} \; \mathbb{E}\{ \ell(h_\theta(x), y) \} \approx \min_{h_\theta \in \mathcal{H}} \; \frac{1}{N} \sum_{i=1}^N \ell(h_\theta(x_i), y_i) \]

  27. Discrete Wasserstein loss
      ◮ Assuming h_θ outputs a probability measure (or a discrete probability distribution), and y_i corresponds to the one-hot encoding of the label classes, the loss over the training set is
        \[ W_c(\alpha, \beta) = \sum_{i=1}^N L_C\big( h_\theta(x_i), \, y_i \big), \]
        where C encodes the ground metric given by c(x, y).
      ◮ In order to optimize the loss function, how do we compute gradients?
        – Gradients are easy to compute in the dual domain (see the sketch after this slide).
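
The sketch below (my own illustration, not the exact algorithm from the slides) shows the idea for the entropy-regularized (Sinkhorn) version of the loss: the Sinkhorn iterations yield dual potentials, and the potential attached to the predicted histogram acts as a gradient of the regularized loss with respect to that histogram (up to an additive constant).

```python
# Sketch: entropic OT loss and its gradient w.r.t. the predicted histogram p,
# obtained from the dual (Sinkhorn) potentials.
import numpy as np

def sinkhorn_loss_and_grad(p, q, C, eps=0.1, n_iter=500):
    """Return the transport cost <F, C> of the entropic plan and the dual
    potential f, which serves as a gradient of the regularized loss w.r.t. p."""
    K = np.exp(-C / eps)                   # Gibbs kernel
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iter):                # Sinkhorn fixed-point iterations
        u = p / (K @ v)
        v = q / (K.T @ u)
    F = u[:, None] * K * v[None, :]        # plan with marginals p and q
    f = eps * np.log(u)                    # dual potential for the p-side
    return np.sum(F * C), f - f.mean()     # center f to fix the additive constant

# Toy example: predicted histogram p vs. (smoothed) one-hot target q on 4 classes.
C = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :]).astype(float)  # ground metric
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([1e-6, 1e-6, 1.0, 1e-6]); q /= q.sum()
loss, grad_p = sinkhorn_loss_and_grad(p, q, C)
print(loss, grad_p)
```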

  28. Dual problem formulation
      1. Construct the Lagrangian:
         \[ L(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x). \]
      2. Dual function: the minimum of the Lagrangian over x:
         \[ q(\lambda, \nu) = \min_x L(x, \lambda, \nu). \]
      3. Dual problem: maximization of the dual function over λ_i ≥ 0:
         \[ \max_{\lambda \in \mathbb{R}^m, \, \nu \in \mathbb{R}^p} \; q(\lambda, \nu) \quad \text{s.t.} \quad \lambda_i \ge 0 \;\; \forall i. \tag{2} \]
      [Figure: illustration of weak vs. strong duality.]
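
For concreteness (this is the standard result, not spelled out on the slide), applying this recipe to the Kantorovich LP (1) yields the dual

\[
L_C(p, q) = \max_{f, g \in \mathbb{R}^n} \; f^T p + g^T q
\quad \text{s.t.} \quad f_i + g_j \le C_{i,j} \;\; \forall (i, j),
\]

and the optimal dual variable f provides a (sub)gradient of L_C(p, q) with respect to the histogram p, which is what the Wasserstein-loss optimization exploits.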
