Domain adaptation with optimal transport: from mapping to learning with joint distribution
R. Flamary - Lagrange, OCA, CNRS, Université Côte d'Azur
Joint work with N. Courty, A. Habrard, A. Rakotomamonjy and B. Bhushan Damodaran
Data Science Meetup, January 15, Nice
Table of contents
Introduction
  Supervised learning
  Domain adaptation
  Optimal transport
Optimal transport for domain adaptation
  Learning strategy and mapping estimation
  Discussion: labels and final classifier?
Joint distribution OT for domain adaptation (JDOT)
  Joint distribution and classifier estimation
  Generalization bound
  Learning with JDOT: regression and classification
  Numerical experiments and large scale JDOT
Conclusion
Introduction
Supervised learning

Traditional supervised learning
• We want to learn a predictor f such that y ≈ f(x).
• The true joint distribution P(X, Y) is unknown.
• We have access to a training dataset (x_i, y_i), i = 1, ..., n, i.e. the empirical distribution P̂(X, Y).
• We choose a loss function L(y, f(x)) that measures the prediction discrepancy.

Empirical risk minimization
We seek a predictor f minimizing the empirical risk

$$\min_f \; \mathbb{E}_{(x, y) \sim \hat{P}}\left[ L(y, f(x)) \right] = \frac{1}{n} \sum_{j=1}^{n} L(y_j, f(x_j)) \qquad (1)$$

• Well-known generalization results for prediction on new data.
• The loss is usually L(y, f(x)) = (y − f(x))² for least squares regression and L(y, f(x)) = max(0, 1 − y f(x))² for the squared hinge loss (SVM).
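To make empirical risk minimization concrete, here is a minimal sketch (not from the slides) that fits a linear predictor with the squared loss by plain gradient descent; the toy data, step size and number of iterations are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical toy data: n samples, d features, linear ground truth + noise
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=n)

# Empirical risk with the squared loss: (1/n) * sum_j (y_j - f(x_j))^2
def empirical_risk(w):
    return np.mean((y - X @ w) ** 2)

# Plain gradient descent on the empirical risk
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    grad = -2.0 / n * X.T @ (y - X @ w)  # gradient of the empirical risk
    w -= lr * grad

print("final empirical risk:", empirical_risk(w))
```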
Domain adaptation problem

[Figure: feature extraction from two image domains (Amazon and DSLR), leading to different probability distribution functions over the domains]

Our context
• Classification problem with data coming from different sources (domains).
• Distributions are different but related.
Unsupervised domain adaptation problem

[Figure: labeled source domain (Amazon) and unlabeled target domain (DSLR); the decision function learned on the source does not work on the target]

Problems
• Labels are only available in the source domain, while classification is performed in the target domain.
• A classifier trained on the source domain data performs badly in the target domain.
Domain adaptation: short state of the art

Reweighting schemes [Sugiyama et al., 2008]
• The distribution changes between domains.
• Reweight the samples to compensate for this change.

Subspace methods
• Data is invariant in a common latent subspace.
• Minimization of a divergence between the projected domains [Si et al., 2010].
• Use of additional label information [Long et al., 2014].

Gradual alignment
• Alignment along the geodesic between the source and target subspaces [R. Gopalan and Chellappa, 2014].
• Geodesic flow kernel [Gong et al., 2012].
The origins of optimal transport

Problem [Monge, 1781]
• How to move dirt from one place (déblais) to another (remblais) while minimizing the effort?
• Find a mapping T between the two distributions of mass (transport).
• Optimize with respect to a displacement cost c(x, y) (optimal).
The origins of optimal transport

[Figure: a source distribution μ_s mapped by T(x) onto a target distribution μ_t, with displacement cost c(x, y)]
Optimal transport (Monge formulation)

[Figure: two 1D distributions and the quadratic cost c(x, y) = |x − y|², plotted as c(20, y), c(40, y) and c(60, y)]

• Probability measures μ_s and μ_t on Ω_s and Ω_t, and a cost function c : Ω_s × Ω_t → R⁺.
• The Monge formulation [Monge, 1781] aims at finding a mapping T : Ω_s → Ω_t

$$\inf_{T \# \mu_s = \mu_t} \; \int_{\Omega_s} c(x, T(x))\, \mu_s(x)\, dx \qquad (2)$$

• Non-convex optimization problem; the mapping does not exist in the general case.
• [Brenier, 1991] proved existence and uniqueness of the Monge map for c(x, y) = ‖x − y‖² and distributions with densities.
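As a side illustration (not from the slides), the Monge map is easy to exhibit in one dimension: for the quadratic cost and equal-size empirical samples with uniform weights, the optimal assignment simply matches sorted source points to sorted target points. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, 500)    # samples from mu_s
x_tgt = rng.normal(5.0, 2.0, 500)    # samples from mu_t (same number of points)

# Sorting both samples pairs the i-th source quantile with the i-th target
# quantile, which is the optimal assignment T(x) for c(x, y) = |x - y|^2
order_src = np.argsort(x_src)
T_of_x = np.empty_like(x_src)
T_of_x[order_src] = np.sort(x_tgt)   # T(x_src[i]) for every source sample

cost = np.mean((x_src - T_of_x) ** 2)
print("empirical transport cost:", cost)
```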
Optimal transport (Kantorovich formulation)

[Figure: source marginal μ_s(x), target marginal μ_t(y), the joint distribution γ(x, y) = μ_s(x) μ_t(y), and the transport cost c(x, y) = |x − y|²]

• The Kantorovich formulation [Kantorovich, 1942] seeks a probabilistic coupling γ ∈ P(Ω_s × Ω_t) between Ω_s and Ω_t:

$$\gamma_0 = \mathop{\mathrm{arg\,min}}_{\gamma} \int_{\Omega_s \times \Omega_t} c(x, y)\, \gamma(x, y)\, dx\, dy, \qquad (3)$$

$$\text{s.t.} \quad \gamma \in \mathcal{P} = \left\{ \gamma \ge 0, \; \int_{\Omega_t} \gamma(x, y)\, dy = \mu_s, \; \int_{\Omega_s} \gamma(x, y)\, dx = \mu_t \right\}$$

• γ is a joint probability measure with marginals μ_s and μ_t.
• Linear program that always has a solution.
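For discrete (empirical) distributions, problem (3) becomes a finite linear program over a coupling matrix. A minimal sketch, assuming the POT (Python Optimal Transport) library is installed as `ot`; the toy samples are made up for the example:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

# Two small empirical distributions (made-up 2D samples)
rng = np.random.default_rng(42)
xs = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # source samples
xt = rng.normal(loc=[4, 4], scale=1.0, size=(60, 2))   # target samples

a = ot.unif(xs.shape[0])  # uniform weights on the source samples
b = ot.unif(xt.shape[0])  # uniform weights on the target samples

# Ground cost: squared Euclidean distance between all pairs of samples
M = ot.dist(xs, xt, metric='sqeuclidean')

# Exact Kantorovich coupling (linear program), shape (50, 60)
gamma = ot.emd(a, b, M)

# Marginal constraints hold: rows sum to a, columns sum to b
assert np.allclose(gamma.sum(axis=1), a)
assert np.allclose(gamma.sum(axis=0), b)
```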
Wasserstein distance

$$W_p^p(\mu_s, \mu_t) = \min_{\gamma \in \mathcal{P}} \int_{\Omega_s \times \Omega_t} c(x, y)\, \gamma(x, y)\, dx\, dy = \mathbb{E}_{(x, y) \sim \gamma}\left[ c(x, y) \right] \qquad (4)$$

where c(x, y) = ‖x − y‖^p.

• A.k.a. the Earth Mover's Distance (W₁¹) [Rubner et al., 2000].
• Does not require the distributions to have overlapping supports.
• Subgradients can be computed from the dual variables of the LP.
• Works for continuous and discrete distributions (histograms, empirical).
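Continuing the previous sketch, `ot.emd2` returns the optimal transport cost itself, which gives the p-th power of the Wasserstein distance between two empirical distributions; again this assumes the POT library and made-up data:

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
xs = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
xt = rng.normal(loc=3.0, scale=1.0, size=(100, 2))

a, b = ot.unif(100), ot.unif(100)

# Squared Euclidean ground cost -> emd2 returns W_2^2
M = ot.dist(xs, xt, metric='sqeuclidean')
w2_squared = ot.emd2(a, b, M)
print("W2^2 between the two empirical distributions:", w2_squared)

# Euclidean ground cost -> W_1 (Earth Mover's Distance)
M1 = ot.dist(xs, xt, metric='euclidean')
w1 = ot.emd2(a, b, M1)
print("W1 (EMD):", w1)
```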
Optimal transport for domain adaptation
Optimal transport for domain adaptation

[Figure: three panels (dataset, optimal transport, classification on transported samples) showing class 1 / class 2 source and target samples and the classifiers learned on each]

Assumptions
• There exists a transport T in the feature space between the two domains.
• The transport preserves the conditional distributions: P_s(y | x_s) = P_t(y | T(x_s)).

3-step strategy [Courty et al., 2016a]
1. Estimate the optimal transport between the distributions.
2. Transport the training samples with the barycentric mapping.
3. Learn a classifier on the transported training samples.
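A hedged end-to-end sketch of the 3-step strategy, using POT's domain-adaptation module and a k-NN classifier from scikit-learn; the toy domains, the entropic regularization value and the choice of k are illustrative assumptions, not the exact setup of [Courty et al., 2016a]:

```python
import numpy as np
import ot
from sklearn.neighbors import KNeighborsClassifier

# Toy source/target domains: same two classes, shifted target (made-up data)
rng = np.random.default_rng(0)
Xs = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([4, 0], 1, (50, 2))])
ys = np.array([0] * 50 + [1] * 50)
Xt = Xs + np.array([2.0, 3.0]) + 0.3 * rng.normal(size=Xs.shape)  # unlabeled target

# Step 1: estimate an (entropic) optimal transport plan between the domains
otda = ot.da.SinkhornTransport(reg_e=1.0)
otda.fit(Xs=Xs, ys=ys, Xt=Xt)

# Step 2: transport the source samples onto the target domain (barycentric mapping)
Xs_transported = otda.transform(Xs=Xs)

# Step 3: learn a classifier on the transported (still labeled) source samples
clf = KNeighborsClassifier(n_neighbors=3).fit(Xs_transported, ys)
yt_pred = clf.predict(Xt)  # predictions in the target domain
```

Swapping `SinkhornTransport` for `SinkhornLpl1Transport` in step 1 adds the class-based group-lasso regularization discussed on the next slide.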
OT for domain adaptation: Step 1

[Figure: dataset, optimal transport, and classification on transported samples (as on the previous slide)]

Step 1: Estimate the optimal transport between the distributions.
• Choose the ground metric (squared Euclidean in our experiments).
• Using regularization allows:
  • Large-scale and regular OT with entropic regularization [Cuturi, 2013].
  • Class labels in the transport with a group lasso [Courty et al., 2016a].
• Efficient optimization based on Bregman projections [Benamou et al., 2015] and:
  • Majorization-minimization for the non-convex group lasso.
  • Generalized conditional gradient for general regularizations (convex lasso, Laplacian).
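A sketch of Step 1 with the two regularizers mentioned above, assuming the POT library; the regularization strengths reg_e and reg_cl are arbitrary example values, not the ones used in the experiments:

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
Xs = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([4, 0], 1, (50, 2))])
ys = np.array([0] * 50 + [1] * 50)
Xt = Xs + np.array([2.0, 3.0]) + 0.3 * rng.normal(size=Xs.shape)

a, b = ot.unif(len(Xs)), ot.unif(len(Xt))
M = ot.dist(Xs, Xt, metric='sqeuclidean')   # ground metric: squared Euclidean

# Entropic regularization [Cuturi, 2013]: Sinkhorn iterations, large-scale friendly
gamma_entropic = ot.sinkhorn(a, b, M, reg=1.0)

# Group-lasso (class-label) regularization [Courty et al., 2016a]:
# penalizes couplings that spread one source class over many target points
otda = ot.da.SinkhornLpl1Transport(reg_e=1.0, reg_cl=0.1)
otda.fit(Xs=Xs, ys=ys, Xt=Xt)
gamma_group = otda.coupling_   # regularized transport plan
```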
OT for domain adaptation: Steps 2 & 3

[Figure: dataset, optimal transport, and classification on transported samples (as on the previous slides)]

Step 2: Transport the training samples onto the target distribution.
• The mass of each source sample is spread onto the target samples (one row of γ₀).
• Transport using the barycentric mapping [Ferradans et al., 2014].
• The mapping can be estimated for out-of-sample prediction [Perrot et al., 2016, Seguy et al., 2017].

Step 3: Learn a classifier on the transported training samples.
• Transported samples keep their labels.
• This becomes a classic ML problem when the samples are well transported.
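A minimal sketch of the barycentric mapping of Step 2: each transported source point is the γ-weighted average of the target samples it sends mass to. The toy data and uniform weights are assumptions for illustration:

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
Xs = rng.normal(size=(80, 2))                 # labeled source samples
Xt = Xs + np.array([3.0, 1.0])                # unlabeled target samples (toy shift)
a, b = ot.unif(len(Xs)), ot.unif(len(Xt))

gamma = ot.emd(a, b, ot.dist(Xs, Xt))         # optimal coupling gamma_0

# Barycentric mapping: each source point goes to the weighted mean of the
# target points it sends mass to, i.e. the normalized row i of gamma times Xt
Xs_mapped = (gamma / gamma.sum(axis=1, keepdims=True)) @ Xt

# With uniform source weights this is equivalent to n_s * gamma @ Xt
assert np.allclose(Xs_mapped, len(Xs) * gamma @ Xt)
```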
Visual adaptation datasets

Datasets
• Digit recognition: MNIST vs USPS (10 classes, d=256, 2 domains).
• Face recognition: PIE dataset (68 classes, d=1024, 4 domains).
• Object recognition: Caltech-Office dataset (10 classes, d=800/4096, 4 domains).

Numerical experiments
• Comparison with the state of the art on the 3 datasets.
• OT works very well on digit and object recognition.
• Works well on deep-feature adaptation and extends to semi-supervised DA.
Histogram matching in images

Pixels as an empirical distribution [Ferradans et al., 2014]
Histogram matching in images

Image colorization [Ferradans et al., 2014]
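As an illustration of treating pixels as an empirical distribution, here is a hedged color-transfer sketch using POT's EMDTransport; the random RGB values stand in for real image pixels, which would normally be reshaped to (n_pixels, 3) and subsampled:

```python
import numpy as np
import ot

# Random pixels as stand-ins for two images' RGB values (0-1 range);
# with real images, reshape them to (n_pixels, 3) and subsample a few thousand pixels.
rng = np.random.default_rng(0)
pixels_src = rng.random((1000, 3)) * np.array([1.0, 0.6, 0.3])  # warm-toned "image"
pixels_tgt = rng.random((1000, 3)) * np.array([0.3, 0.6, 1.0])  # cool-toned "image"

# Treat each image's pixels as an empirical distribution in RGB space and
# transport the source pixels onto the target color distribution
color_ot = ot.da.EMDTransport()
color_ot.fit(Xs=pixels_src, Xt=pixels_tgt)
pixels_src_recolored = color_ot.transform(Xs=pixels_src)
```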
Seamless copy in images

Poisson image editing [Pérez et al., 2003]
• Use the color gradients from the source image.
• Use color boundary conditions from the target image.
• Solve a Poisson equation to reconstruct the new image.
Seamless copy in images

Seamless copy with gradient adaptation [Perrot et al., 2016]
• Transport the source gradients onto the target color-gradient distribution.
• Solve the Poisson equation with the mapped source gradients.
• Better respects the color dynamics and limits false colors.
Seamless copy with gradient adaptation