Sinkhorn Barycenters with Free Support via Frank-Wolfe Algorithm


1. Sinkhorn Barycenters with Free Support via Frank-Wolfe Algorithm
Giulia Luise¹, Saverio Salzo², Massimiliano Pontil¹'², Carlo Ciliberto³
¹ Department of Computer Science, University College London, UK
² CSML, Istituto Italiano di Tecnologia, Genova, Italy
³ Department of Electrical and Electronic Engineering, Imperial College London, UK

2. Outline
1. Introduction: Goal and Contributions
2. Setting and problem statement
3. Approach
4. Convergence analysis
5. Experiments

3. Introduction: Goal and Contributions
We propose a novel method to compute the barycenter of a set of probability distributions with respect to the Sinkhorn divergence that:
• does not fix the support beforehand;
• handles both discrete and continuous measures;
• admits a convergence analysis.

4. Introduction: Goal and Contributions
Our analysis hinges on the following contributions:
• We show that the gradient of the Sinkhorn divergence is Lipschitz continuous on the space of probability measures with respect to the total variation norm.
• We characterize the sample complexity of an empirical estimator approximating the Sinkhorn gradients.
• A byproduct of our analysis is a generalization of the Frank-Wolfe algorithm to settings where the objective functional is defined only on a set with empty interior, which is the case for the Sinkhorn divergence barycenter problem.

5. Setting and problem statement: Setting and Notation
• $X \subset \mathbb{R}^d$ is a compact set.
• $c : X \times X \to \mathbb{R}$ is a symmetric cost function, e.g. $c(x, y) = \|x - y\|_2^2$.
• $\mathcal{M}_1^+(X)$ is the space of probability measures on $X$.
• $\mathcal{M}(X)$ is the Banach space of finite signed measures on $X$.

6. Setting and problem statement: Entropic Regularized Optimal Transport
For any $\alpha, \beta \in \mathcal{M}_1^+(X)$, the optimal transport problem with entropic regularization is defined as follows:
$$\mathrm{OT}_\varepsilon(\alpha, \beta) = \min_{\pi \in \Pi(\alpha, \beta)} \int_{X^2} c(x, y)\, d\pi(x, y) + \varepsilon\, \mathrm{KL}(\pi \mid \alpha \otimes \beta), \quad \varepsilon \ge 0, \qquad (1)$$
where:
• $\mathrm{KL}(\pi \mid \alpha \otimes \beta)$ is the Kullback-Leibler divergence between the transport plan $\pi$ and the product distribution $\alpha \otimes \beta$;
• $\Pi(\alpha, \beta) = \{\pi \in \mathcal{M}_1^+(X^2) : P_{1\#}\pi = \alpha,\ P_{2\#}\pi = \beta\}$ is the transport polytope (with $P_i : X \times X \to X$ the projection onto the $i$-th component and $\#$ the push-forward).
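
To make (1) concrete, here is a minimal NumPy sketch (not the authors' code) computing $\mathrm{OT}_\varepsilon$ between two discrete measures via log-domain Sinkhorn iterations; the cost, regularization level, and iteration count are illustrative choices.

```python
import numpy as np
from scipy.special import logsumexp

def cost_matrix(x, y):
    """Squared Euclidean cost c(x, y) = ||x - y||_2^2 between point clouds."""
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

def ot_eps(a, x, b, y, eps=0.1, iters=500):
    """Entropic OT value OT_eps(alpha, beta) for discrete measures
    alpha = sum_i a_i delta_{x_i}, beta = sum_j b_j delta_{y_j}, via
    log-domain Sinkhorn iterations on the dual potentials (f, g)."""
    C = cost_matrix(x, y)
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(iters):
        # Soft-min updates; each is the optimality condition for one potential.
        f = -eps * logsumexp((g[None, :] - C) / eps + np.log(b)[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps + np.log(a)[:, None], axis=0)
    # At convergence the dual value reduces to <a, f> + <b, g>.
    return a @ f + b @ g

# Toy usage: OT_eps between two random point clouds in R^2.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(30, 2)), rng.normal(size=(40, 2)) + 1.0
a, b = np.full(30, 1 / 30), np.full(40, 1 / 40)
print(ot_eps(a, x, b, y, eps=0.1))
```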

7. Setting and problem statement: Sinkhorn Divergences
To remove the bias induced by the KL term, [Genevay et al., 2018] proposed to subtract the autocorrelation terms $\tfrac{1}{2}\mathrm{OT}_\varepsilon(\alpha, \alpha)$ and $\tfrac{1}{2}\mathrm{OT}_\varepsilon(\beta, \beta)$ from $\mathrm{OT}_\varepsilon(\alpha, \beta)$, yielding the divergence
$$S_\varepsilon(\alpha, \beta) = \mathrm{OT}_\varepsilon(\alpha, \beta) - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\alpha, \alpha) - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\beta, \beta), \qquad (2)$$
which is nonnegative, convex, and metrizes weak convergence (see [Feydy et al., 2019]). In the following we study the barycenter problem with respect to this Sinkhorn divergence.
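
A sketch of (2), assuming the `ot_eps` function from the previous snippet:

```python
def sinkhorn_divergence(a, x, b, y, eps=0.1):
    """S_eps(alpha, beta) = OT_eps(alpha, beta)
       - 1/2 OT_eps(alpha, alpha) - 1/2 OT_eps(beta, beta)   (eq. 2).
    The debiasing terms ensure S_eps(alpha, alpha) = 0."""
    return (ot_eps(a, x, b, y, eps)
            - 0.5 * ot_eps(a, x, a, x, eps)
            - 0.5 * ot_eps(b, y, b, y, eps))
```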

8. Setting and problem statement: Barycenter Problem
Barycenters of probability measures are useful in a range of applications, such as texture mixing, Bayesian inference, and imaging. The barycenter problem with respect to the Sinkhorn divergence is formulated as follows: given input measures $\beta_1, \ldots, \beta_m \in \mathcal{M}_1^+(X)$ and weights $\omega_1, \ldots, \omega_m \ge 0$ with $\sum_{j=1}^m \omega_j = 1$, solve
$$\min_{\alpha \in \mathcal{M}_1^+(X)} B_\varepsilon(\alpha), \quad \text{with } B_\varepsilon(\alpha) = \sum_{j=1}^m \omega_j\, S_\varepsilon(\alpha, \beta_j). \qquad (3)$$
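
The objective (3) is then just a weighted sum; a short sketch assuming `sinkhorn_divergence` from the previous snippet, with input measures passed as (weights, points) pairs:

```python
def barycenter_objective(a, x, betas, weights, eps=0.1):
    """B_eps(alpha) = sum_j w_j S_eps(alpha, beta_j)   (eq. 3), for a
    candidate barycenter alpha = (a, x) and betas = [(b_1, y_1), ...]."""
    return sum(w * sinkhorn_divergence(a, x, b, y, eps)
               for (b, y), w in zip(betas, weights))
```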

9. Setting and problem statement: Approach via Frank-Wolfe
Classic methods approach the barycenter problem in one of two ways:
1. fix the support of the barycenter beforehand and optimize the weights only (convergence analysis available), or
2. alternately optimize over weights and support points (no convergence guarantees).
Our approach via Frank-Wolfe:
• iteratively populates the target barycenter, one point at a time;
• does not require the support to be fixed beforehand;
• requires no hyperparameter tuning.

10. Approach: Frank-Wolfe Algorithm on Banach Spaces
Let $W$ be a Banach space with topological dual $W^*$, and let $D \subset W^*$ be a nonempty, convex, closed, bounded set. Let $G : D \to \mathbb{R}$ be convex and satisfy some smoothness properties.
Theorem. Suppose in addition that $\nabla G$ is $L$-Lipschitz continuous with $L > 0$. Let $(w_k)_{k \in \mathbb{N}}$ be obtained according to Algorithm 1. Then, for every integer $k \ge 1$,
$$G(w_k) - \min_D G \le \frac{2}{k+2}\, L\, (\mathrm{diam}\, D)^2 + \Delta_k. \qquad (4)$$
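
For intuition, a self-contained toy sketch of the generic FW scheme behind the theorem, showing the step size $2/(k+2)$ and the linear minimization oracle; the simplex example is illustrative only, not the barycenter problem itself:

```python
import numpy as np

def frank_wolfe(grad, lmo, w0, steps=100):
    """Generic Frank-Wolfe: s_k solves the linear minimization oracle
    min_{s in D} <grad G(w_k), s>, then w_{k+1} = w_k + 2/(k+2) (s_k - w_k),
    which stays in D by convexity."""
    w = w0
    for k in range(steps):
        s = lmo(grad(w))
        w = w + 2.0 / (k + 2) * (s - w)
    return w

# Toy usage: minimize ||w - t||^2 over the probability simplex. The LMO of a
# linear form over the simplex is attained at a vertex, mirroring the Dirac
# deltas that arise as extreme points of M_1^+(X) later in the talk.
t = np.array([0.7, 0.2, 0.1])
w = frank_wolfe(grad=lambda w: 2 * (w - t),
                lmo=lambda g: np.eye(len(g))[np.argmin(g)],
                w0=np.ones(3) / 3)
print(w)  # approaches t as the number of steps grows
```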

11. Approach: Can Frank-Wolfe Be Applied?
• Optimization domain. $\mathcal{M}_1^+(X)$ is convex, closed, and bounded in the Banach space $\mathcal{M}(X)$. ✔
• Objective functional. The objective $B_\varepsilon$ is convex, since it is a convex combination of the $S_\varepsilon(\cdot, \beta_j)$, $j = 1, \ldots, m$. ✔
• Lipschitz continuity of the gradient. This is the most critical condition.

12. Approach: Lipschitz Continuity of Sinkhorn Potentials
This is one of the main contributions of the paper.
Theorem. The gradient $\nabla S_\varepsilon$ is Lipschitz continuous: for all $\alpha, \alpha', \beta, \beta' \in \mathcal{M}_1^+(X)$,
$$\|\nabla S_\varepsilon(\alpha, \beta) - \nabla S_\varepsilon(\alpha', \beta')\|_\infty \lesssim \|\alpha - \alpha'\|_{TV} + \|\beta - \beta'\|_{TV}. \qquad (5)$$
It follows that $\nabla B_\varepsilon$ is also Lipschitz continuous, hence our framework is suitable for applying the FW algorithm.

13. Approach: How the Algorithm Works - I
The inner step in the FW algorithm amounts to
$$\mu_{k+1} \in \operatorname*{argmin}_{\mu \in \mathcal{M}_1^+(X)} \Big\langle \sum_{j=1}^m \omega_j\, \nabla S_\varepsilon(\cdot, \beta_j)(\alpha_k),\ \mu \Big\rangle. \qquad (6)$$
Note that:
• by the Bauer maximum principle, solutions of (6) are achieved at extreme points of the optimization domain;
• the extreme points of $\mathcal{M}_1^+(X)$ are Dirac deltas.
Hence (6) is equivalent to
$$\mu_{k+1} = \delta_{x_{k+1}} \quad \text{with} \quad x_{k+1} \in \operatorname*{argmin}_{x \in X} \sum_{j=1}^m \omega_j\, \nabla S_\varepsilon(\cdot, \beta_j)(\alpha_k)(x). \qquad (7)$$
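
A sketch of this linear minimization step for discrete iterates, restricted for simplicity to a finite candidate grid (the paper minimizes over all of $X$). It uses the potential expression of the gradient, $\nabla S_\varepsilon(\cdot, \beta)(\alpha)(x) = f^{\alpha\beta}(x) - p^{\alpha}(x)$, with the symmetric potential computed by the averaged update of [Feydy et al., 2019]; function names and the grid restriction are ours.

```python
import numpy as np
from scipy.special import logsumexp

def _cost(x, y):
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

def potentials(a, x, b, y, eps, iters=300):
    """Sinkhorn dual potentials (f, g) for OT_eps(alpha, beta)."""
    C = _cost(x, y)
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(iters):
        f = -eps * logsumexp((g[None, :] - C) / eps + np.log(b)[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps + np.log(a)[:, None], axis=0)
    return f, g

def sym_potential(a, x, eps, iters=300):
    """Symmetric potential p = T(p) of OT_eps(alpha, alpha), via the
    standard averaged fixed-point update p <- (p + T(p)) / 2."""
    C = _cost(x, x)
    p = np.zeros(len(a))
    for _ in range(iters):
        p = 0.5 * (p - eps * logsumexp((p[None, :] - C) / eps
                                       + np.log(a)[None, :], axis=1))
    return p

def extend(pot, b, y, eps, query):
    """Evaluate a potential at arbitrary points via its closed form
    f(x) = -eps log sum_j b_j exp((g_j - c(x, y_j)) / eps)."""
    C = _cost(query, y)
    return -eps * logsumexp((pot[None, :] - C) / eps + np.log(b)[None, :], axis=1)

def fw_new_point(a, x, betas, weights, grid, eps=0.1):
    """Step (6)-(7): pick x_{k+1} minimizing the weighted sum of gradients
    sum_j w_j [f^{alpha,beta_j}(x) - p^{alpha}(x)] over a candidate grid."""
    p = sym_potential(a, x, eps)
    score = -extend(p, a, x, eps, grid)  # the -p(x) term (weights sum to 1)
    for (b, y), w in zip(betas, weights):
        _, g = potentials(a, x, b, y, eps)
        score += w * extend(g, b, y, eps, grid)
    return grid[np.argmin(score)]
```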

14. Approach: How the Algorithm Works - II
Once the new support point $x_{k+1}$ has been obtained, the FW update corresponds to
$$\alpha_{k+1} = \alpha_k + \frac{2}{k+2}\,(\delta_{x_{k+1}} - \alpha_k) = \frac{k}{k+2}\,\alpha_k + \frac{2}{k+2}\,\delta_{x_{k+1}}. \qquad (8)$$
Weights and support points are updated simultaneously at each iteration.
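
In code, the update (8) on a discrete iterate reduces to a rescale-and-append on the (weights, points) pair; a hypothetical helper consistent with the snippets above:

```python
import numpy as np

def fw_update(a, x, x_new, k):
    """FW update (8): alpha_{k+1} = k/(k+2) alpha_k + 2/(k+2) delta_{x_new}.
    Old weights shrink by k/(k+2); the new support point enters with mass
    2/(k+2). At k = 0 the iterate collapses to the single new Dirac."""
    a = np.concatenate([a * (k / (k + 2)), [2.0 / (k + 2)]])
    x = np.vstack([x, np.atleast_2d(x_new)])
    return a, x
```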

15. Convergence analysis: Finite Case
Theorem. Suppose that $\beta_1, \ldots, \beta_m \in \mathcal{M}_1^+(X)$ have finite support and let $\alpha_k$ be the $k$-th iterate of our algorithm. Then
$$B_\varepsilon(\alpha_k) - \min_{\alpha \in \mathcal{M}_1^+(X)} B_\varepsilon(\alpha) \le \frac{C_\varepsilon}{k+2}, \qquad (9)$$
where $C_\varepsilon$ is a constant depending on $\varepsilon$ and on the domain $X$.
What if the input measures $\beta_1, \ldots, \beta_m \in \mathcal{M}_1^+(X)$ are continuous and we only have access to samples?

16. Convergence analysis: Sample Complexity of Sinkhorn Potentials
FW can be applied when only an approximation of the gradient is available. Hence we need to quantify the approximation error between $\nabla S_\varepsilon(\cdot, \beta)$ and $\nabla S_\varepsilon(\cdot, \hat\beta)$ in terms of the sample size of $\hat\beta$.
Theorem (Sample Complexity of Sinkhorn Potentials). Suppose that $c$ is smooth. Then, for any $\alpha, \beta \in \mathcal{M}_1^+(X)$ and any empirical measure $\hat\beta$ of a set of $n$ points independently sampled from $\beta$, we have, for every $\tau \in (0, 1]$,
$$\|\nabla_1 S_\varepsilon(\alpha, \beta) - \nabla_1 S_\varepsilon(\alpha, \hat\beta)\|_\infty \le \frac{C_\varepsilon \log\frac{3}{\tau}}{\sqrt{n}} \qquad (10)$$
with probability at least $1 - \tau$.
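
One way to probe this rate numerically; a rough sketch reusing `potentials` and `extend` from the earlier snippet, where a large reference sample stands in for the continuous $\beta$. Since the symmetric term $p^{\alpha}$ cancels in the gradient difference, it suffices to compare the $\beta$-side potentials.

```python
import numpy as np

rng = np.random.default_rng(0)
ref_y = rng.normal(size=(2000, 2))               # proxy for continuous beta
ax, aa = rng.normal(size=(40, 2)), np.full(40, 1 / 40)
grid = rng.uniform(-3.0, 3.0, size=(300, 2))

def beta_potential_on_grid(y, eps=0.1):
    """The beta-side Sinkhorn potential of OT_eps(alpha, beta_hat) on a grid."""
    b = np.full(len(y), 1 / len(y))
    _, g = potentials(aa, ax, b, y, eps)
    return extend(g, b, y, eps, grid)

f_ref = beta_potential_on_grid(ref_y)
for n in (50, 200, 800):
    gap = np.abs(beta_potential_on_grid(ref_y[:n]) - f_ref).max()
    print(n, gap)  # sup-norm gap should shrink roughly like 1/sqrt(n)
```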

17. Convergence analysis: General Case
Using the sample complexity of the Sinkhorn gradient, we can characterize the convergence of our algorithm in the general setting.
Theorem. Suppose that $c$ is smooth. Let $n \in \mathbb{N}$ and let $\hat\beta_1, \ldots, \hat\beta_m$ be empirical distributions with $n$ support points, each independently sampled from $\beta_1, \ldots, \beta_m$. Let $\alpha_k$ be the $k$-th iterate of our algorithm applied to $\hat\beta_1, \ldots, \hat\beta_m$. Then, for any $\tau \in (0, 1]$, the following holds with probability at least $1 - \tau$:
$$B_\varepsilon(\alpha_k) - \min_{\alpha \in \mathcal{M}_1^+(X)} B_\varepsilon(\alpha) \le \frac{C_\varepsilon \log\frac{3m}{\tau}}{\min(k, \sqrt{n})}. \qquad (11)$$

18. Experiments: Barycenter of Nested Ellipses
Barycenter of 30 randomly generated nested ellipses on a 50 × 50 grid, similarly to [Cuturi and Doucet, 2014]. Each image is interpreted as a probability distribution in 2D.

19. Experiments: Barycenters of Continuous Measures
Barycenter of 5 Gaussian distributions with randomly generated means and covariances.
• Scatter plot: output of our method.
• Level sets of the density: true Wasserstein barycenter.
FW recovers both the mean and the covariance of the target barycenter.

20. Experiments: Matching of a Distribution
"Barycenter" of a single measure $\beta \in \mathcal{M}_1^+(X)$. The solution of this problem is $\beta$ itself, so we can interpret the intermediate iterates as compressed versions of the original measure. FW prioritizes the support points with higher weight.
