What is a Bregman Divergence Function?

BDF D_φ(x, y): Let φ: R^d → R be a strictly convex, differentiable function. Then D_φ: R^d × R^d → R is defined as

    D_φ(x, y) = φ(x) − φ(y) − ⟨x − y, ∇φ(y)⟩.

For any x, y ∈ R^d, D_φ(x, y) ≥ 0, with equality iff x = y.
Some examples of BDFs

- φ(x) = x², then D_φ(x, y) = (x − y)².
- Let p := (p_1, ..., p_d) be a probability distribution, ∑_{j=1}^d p_j = 1. Then φ(p) := ∑_{j=1}^d p_j log p_j (the negative Shannon entropy) is strictly convex on the d-simplex.
- Let q = (q_1, ..., q_d) be another probability distribution. Then

      D_φ(p, q) = ∑_{j=1}^d p_j log p_j − ∑_{j=1}^d q_j log q_j − ⟨p − q, ∇φ(q)⟩
                = ∑_{j=1}^d p_j log(p_j / q_j),

  which is the KL-divergence between p and q.
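To make these two examples concrete, here is a minimal numerical sketch (our own illustration, not from the slides; the helper names are hypothetical) that evaluates D_φ directly from the definition and checks that the negative-entropy choice reproduces the KL divergence.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_phi(x, y) = phi(x) - phi(y) - <x - y, grad_phi(y)>
    return phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# Example 1: phi(x) = x^2 gives the squared error (x - y)^2.
print(bregman(lambda x: x**2, lambda y: 2.0 * y, 3.0, 1.0))   # 4.0 = (3 - 1)^2

# Example 2: the negative Shannon entropy on the simplex gives KL(p || q).
neg_entropy = lambda p: np.sum(p * np.log(p))
grad_neg_entropy = lambda p: np.log(p) + 1.0
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(bregman(neg_entropy, grad_neg_entropy, p, q))   # matches the direct KL below
print(np.sum(p * np.log(p / q)))
```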
Proof of sufficiency

Let Y be any G-measurable random variable, and Y* := E[X | G]. Then

    E[D_φ(X, Y)] − E[D_φ(X, Y*)]
      = E[ φ(Y*) − φ(Y) − ⟨X − Y, ∇φ(Y)⟩ + ⟨X − Y*, ∇φ(Y*)⟩ ].

Notice that, by the tower property,

    E[⟨X − Y, ∇φ(Y)⟩] = E[ E[⟨X − Y, ∇φ(Y)⟩ | G] ] = E[⟨Y* − Y, ∇φ(Y)⟩],

and, by the same argument with Y replaced by Y*,

    E[⟨X − Y*, ∇φ(Y*)⟩] = 0.

Therefore

    E[D_φ(X, Y)] − E[D_φ(X, Y*)] = E[D_φ(Y*, Y)] ≥ 0.
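The property this proof establishes (the conditional mean E[X | G] attains the smallest expected Bregman divergence among G-measurable predictors) can be sanity-checked by Monte Carlo; the sketch below is our own illustration under an assumed toy model, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
z = rng.uniform(1.0, 2.0, size=n)                      # G = sigma(Z)
x = z * (rng.exponential(1.0, size=n) + 1e-12)          # X | Z = z ~ Exponential with mean z

def kl_bregman(x, y):
    # D_phi(x, y) for phi(t) = t log t:  x log(x/y) - x + y
    return x * np.log(x / y) - x + y

y_star = z                # E[X | Z], the conditional mean
y_alt = z * np.log(2.0)   # the conditional median, a competing G-measurable predictor

print(kl_bregman(x, y_star).mean())   # smaller expected divergence
print(kl_bregman(x, y_alt).mean())    # larger expected divergence
```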
More facts about BDFs

- Introduced and studied in the context of projections (Csiszar (1975)).
- A Pythagorean theorem holds for BDFs (Censor and Lent (1981)).
- There is a bijection between exponential-family distributions and BDFs, via Legendre duality (Merugu, Banerjee, Dhillon, Ghosh (2003)).
- Widely applied in data analysis and machine learning, e.g., K-means clustering.
- Well adopted in convex optimization.
Generator Network [Goodfellow et al., 2014]

- Generates samples according to P_θ.
- The real samples X are inaccessible.
- Aims to generate ever more compelling copies of X.
How to Make the Generator Network Better?

A knowledgeable mentor: the discriminator.
Discriminator Network [Goodfellow et al., 2014]

- Determines whether samples are real or generated.
- Has access to the real samples X.
- Helps optimize the generator network by identifying fake samples.
Graphical Model
Generative modeling

The procedure of generative modeling is to construct a class of suitable parametric probability distributions P_θ:

- Generate a latent variable Z ∈ Z with a fixed probability distribution P_Z.
  - P_Z is known and simple, e.g., a uniform distribution.
- Choose a family of parametric functions g_θ: Z → X.
  - g_θ is complicated but structured.
  - g_θ is the reason why generative modeling is powerful.
- Construct P_θ as the probability distribution of g_θ(Z). More specifically,

      P_θ(dx) = ∫_Z 1{g_θ(z) ∈ dx} P_Z(dz) = E_Z[ 1{g_θ(Z) ∈ dx} ].

A toy numerical sketch of this pushforward construction follows below.
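The sketch below is our own illustration (the two-layer g_θ and all dimensions are assumptions, not the networks used later): sampling from P_θ only requires sampling Z and applying g_θ.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_theta(z, theta):
    # A tiny one-hidden-layer map standing in for a deep generator network.
    w1, b1, w2, b2 = theta
    return np.tanh(z @ w1 + b1) @ w2 + b2

latent_dim, hidden_dim, data_dim = 2, 16, 3
theta = (rng.normal(size=(latent_dim, hidden_dim)), np.zeros(hidden_dim),
         rng.normal(size=(hidden_dim, data_dim)), np.zeros(data_dim))

z = rng.uniform(-1.0, 1.0, size=(1000, latent_dim))   # Z ~ P_Z, known and simple
x_fake = g_theta(z, theta)                             # samples from P_theta, the law of g_theta(Z)
print(x_fake.shape)                                    # (1000, 3)
```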
GANs: different divergence functions

GANs:
- LSGANs [Mao et al., 2016]: least-squares loss.
- DRAGANs [Kodali et al., 2017]: regret minimization.
- CGANs [Mirza and Osindero, 2014]: conditional extension.
- InfoGANs [Chen et al., 2016]: information-theoretic extension.
- ACGANs [Odena et al., 2017]: structured latent space.
- EBGANs [Zhao et al., 2016]: energy-based perspective.
- BEGANs [Berthelot et al., 2017]: auto-encoder extension.

GANs training: [Arjovsky and Bottou, 2017].

Wasserstein GANs:
- WGANs [Arjovsky et al., 2017]: Wasserstein L¹ divergence.
- Improved WGANs [Gulrajani et al., 2017]: gradient penalty.
Several Choices of Divergence

The divergences used to measure the difference between P and Q include:

- Kullback-Leibler (KL) divergence:

      KL(P, Q) = ∫_X P(dx) · log( P(dx) / Q(dx) ).

- Jensen-Shannon (JS) divergence:

      JS(P, Q) = (1/2) [ KL(P, (P + Q)/2) + KL(Q, (P + Q)/2) ].

- Wasserstein divergence/distance of order p:

      W_p(P, Q) = ( inf_{π ∈ Π(P,Q)} ∫_{X×X} m(x, y)^p π(dx, dy) )^{1/p},

  with m a metric, such as m(x, y) = ||x − y||_q for q ≥ 1.
Discussions on these divergences

Example: Given θ ∈ [0, 1], let P and Q be the distributions of (x, y) with

    under P:  x = 0,  y ∼ Uniform(0, 1),
    under Q:  x = θ,  y ∼ Uniform(0, 1).

For θ ≠ 0:

    KL(P, Q) = KL(Q, P) = +∞,   JS(P, Q) = log(2),   W_1(P, Q) = |θ|.

For θ = 0:

    KL(P, Q) = KL(Q, P) = JS(P, Q) = W_1(P, Q) = 0.
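A small numerical check of this example (our own sketch, using only the x-marginals δ_0 and δ_θ, since the y-marginals coincide): JS stays at log(2) for every θ ≠ 0, while W_1 shrinks smoothly with θ.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions (natural log).
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log((a + eps) / (b + eps)), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_on_grid(p, q, grid):
    # One-dimensional Wasserstein-1 distance via the CDF formula.
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))[:-1] * np.diff(grid))

grid = np.linspace(0.0, 1.0, 1001)
for theta in [0.5, 0.1, 0.01, 0.0]:
    p = np.zeros_like(grid); p[0] = 1.0                                  # point mass at x = 0
    q = np.zeros_like(grid); q[np.argmin(np.abs(grid - theta))] = 1.0    # point mass at x = theta
    print(theta, js_divergence(p, q), w1_on_grid(p, q, grid))
# JS stays near log(2) ~ 0.693 whenever theta != 0, while W_1 ~ |theta| decreases to 0.
```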
Remark

- KL is infinite when the two distributions are disjoint;
- JS has a sudden jump and is discontinuous at θ = 0;
- W_1 is continuous and relatively smooth;
- The Wasserstein L¹ divergence outperforms the KL and JS divergences, but lacks flexibility.
Remedy: Relaxed Wasserstein

Definition (G., Hong, Lin, and Yang 2018). The Relaxed Wasserstein (RW) divergence between the probability distributions P and Q is defined as

    W_{D_φ}(P, Q) = inf_{π ∈ Π(P,Q)} ∫_{X×X} D_φ(x, y) π(dx, dy),

where D_φ is the Bregman divergence with a strictly convex and differentiable function φ: R^d → R, i.e., D_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩.

1. W_{D_φ}(P, Q) ≥ 0, with equality iff P = Q almost everywhere.
2. W_{D_φ}(P, Q) is not a metric, as it is asymmetric.
3. W_{D_φ}(P, Q) includes W_KL with φ(x) = x⊤ log(x).
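For intuition, a small sketch (our own illustration, not the paper's code; the helper names are hypothetical) computes the RW divergence between two discrete distributions as an optimal-transport linear program whose ground cost is the KL Bregman divergence D_φ with φ(x) = x⊤ log(x).

```python
import numpy as np
from scipy.optimize import linprog

def kl_bregman_cost(x, y):
    # D_phi(x, y) = sum x_j log(x_j / y_j) - sum x_j + sum y_j   (phi(x) = <x, log x>)
    return float(np.sum(x * np.log(x / y) - x + y))

def relaxed_wasserstein(weights_p, weights_q, atoms_p, atoms_q):
    # RW(P, Q) = min over couplings pi of sum_ij pi_ij * D_phi(x_i, y_j).
    n, m = len(weights_p), len(weights_q)
    cost = np.array([[kl_bregman_cost(x, y) for y in atoms_q] for x in atoms_p])
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0       # row sums = weights_p
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                # column sums = weights_q
    b_eq = np.concatenate([weights_p, weights_q])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Atoms are points of the 2-simplex so that the KL ground cost is well defined.
atoms_p = [np.array([0.7, 0.2, 0.1]), np.array([0.3, 0.3, 0.4])]
atoms_q = [np.array([0.6, 0.3, 0.1]), np.array([0.1, 0.8, 0.1])]
print(relaxed_wasserstein(np.array([0.5, 0.5]), np.array([0.4, 0.6]), atoms_p, atoms_q))
```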
Relaxed Wasserstein as Divergence

Question: Is W_{D_φ} a good divergence?

Point 1: W_{D_φ}(P, Q) should be small when P and Q are close.

Requirement: W_{D_φ}(P, Q) should be dominated by a standard divergence, e.g., the total variation distance

    TV(P, Q) := sup_{A ∈ B} |P(A) − Q(A)|.

Point 2: W_{D_φ}(P_n, P_r) → 0 as n → ∞, where P_r is the true distribution and P_n is the empirical distribution based on X = (X_1, X_2, ..., X_n) i.i.d. ∼ P_r.

Requirement: W_{D_φ}(P_n, P_r) should admit a moment estimate and a concentration inequality, i.e., there exist α, β > 0 such that

    E[ W_{D_φ}(P_n, P_r) ] = O(n^{−α})               (moment estimate),
    Prob( W_{D_φ}(P_n, P_r) ≥ ε ) = O(n^{−β})        (concentration inequality).
Dominated by TV and Standard Wasserstein

Theorem (G., Hong, Lin, and Yang 2018). Assume that φ: X → R is a strictly convex and smooth function with an L-Lipschitz continuous gradient, and that P and Q are two probability distributions supported on a compact set X ⊂ R^d. Then

    W_{D_φ}(P, Q) ≤ (L/2) W_{L_2}(P, Q)²,
    W_{D_φ}(P, Q) ≤ (L/2) [diam(X)]² · TV(P, Q).
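A short reasoning sketch for these bounds (our own note, assuming the smoothness condition means an L-Lipschitz gradient): L-smoothness gives D_φ(x, y) ≤ (L/2) ||x − y||_2² for all x, y; integrating against any coupling π ∈ Π(P, Q) and taking the infimum yields the first inequality. On a compact X, every coupling satisfies ∫ ||x − y||_2² π(dx, dy) ≤ [diam(X)]² · π({x ≠ y}), and the minimal value of π({x ≠ y}) over couplings equals TV(P, Q), which gives the second inequality.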
Table of Contents

1. Bregman Divergence Function
2. Generative Adversarial Networks (GANs)
3. Wasserstein Divergence and GANs
4. Relaxed Wasserstein
   - Moment Estimate, Concentration Inequality, and Duality
   - Continuity, Differentiability
   - Gradient Descent Scheme
5. Empirical Results
   - Experiment Setup
   - MNIST and Fashion-MNIST datasets
   - CIFAR-10 and ImageNet datasets
6. Conclusions
Moment Estimate for RW

Theorem (G., Hong, Lin, and Yang 2018). Assume that

    M_q(P_r) = ∫_X ||x||_2^q P_r(dx) < +∞

for some q > 2. Then there exists a constant C(q, d) > 0 such that, for n ≥ 1,

    E[ W_{D_φ}(P_n, P_r) ] ≤ C(q, d) · L · M_q(P_r)^{2/q} ·
        [ n^{−1/2} + n^{−(q−2)/q},              if 1 ≤ d ≤ 3, q ≠ 4,
          n^{−1/2} log(1 + n) + n^{−(q−2)/q},   if d = 4, q ≠ 4,
          n^{−2/d} + n^{−(q−2)/q},              if d ≥ 5, q ≠ d/(d − 2). ]
Concentration Inequality for RW

Theorem (G., Hong, Lin, and Yang 2018). Define

    E_{α,γ}(P_r) = ∫_X exp(γ ||x||_2^α) P_r(dx),

and assume that one of the three following conditions holds:

    ∃ α > 2, ∃ γ > 0 such that E_{α,γ}(P_r) < ∞;   or
    ∃ α ∈ (0, 2), ∃ γ > 0 such that E_{α,γ}(P_r) < ∞;   or
    ∃ q > 4 such that M_q(P_r) < ∞.

Then for n ≥ 1 and ε > 0, there exist scalars a(n, ε) and b(n, ε) such that

    Prob( W_{D_φ}(P_n, P_r) ≥ ε ) ≤ a(n, ε) · 1{ε ≤ L/2} + b(n, ε).
Duality Representation for RW

Theorem (G., Hong, Lin, and Yang 2018). Assume that two probability distributions P and Q satisfy

    ∫_X ||x||_2² (P + Q)(dx) < +∞.

Then there exists a Lipschitz continuous function f: X → R such that the RW divergence has the duality representation

    W_{D_φ}(P, Q) = ∫_X φ(x) (P − Q)(dx) + ∫_X ⟨∇φ(x), x⟩ Q(dx)
                    + [ ∫_X f*(∇φ(x)) Q(dx) − ∫_X f(x) P(dx) ],

where f* is the conjugate of f, i.e., f*(y) = sup_{x ∈ R^d} { ⟨x, y⟩ − f(x) }.
Key elements for the proof of duality

- The classical duality representation for the standard Wasserstein distance.
- The RW divergence can be decomposed into a distorted squared Wasserstein-L_2 distance of order 2, plus residual terms that are independent of the choice of the coupling π.
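A hedged sketch of that decomposition (our own reconstruction from the definition, assuming ∇φ is invertible so that couplings of (P, Q) correspond to couplings of (P, (∇φ)#Q)):

    D_φ(x, y) = φ(x) − φ(y) + ⟨∇φ(y), y⟩ − ⟨∇φ(y), x⟩
              = (1/2) ||x − ∇φ(y)||_2² + [ φ(x) − (1/2) ||x||_2² ] + [ ⟨∇φ(y), y⟩ − φ(y) − (1/2) ||∇φ(y)||_2² ].

Integrating against π ∈ Π(P, Q), every bracketed term depends only on the marginals P and Q, so

    W_{D_φ}(P, Q) = (1/2) W_{L_2}(P, (∇φ)#Q)² + residual(P, Q),

i.e., the distorted squared Wasserstein-L_2 term plus coupling-independent residuals.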
Relaxed Wasserstein for GANs

Question: Is W_{D_φ} tractable for GANs?

Requirement 1: W_{D_φ}(P_r, P_θ) should be continuous and differentiable w.r.t. θ.

Requirement 2: W_{D_φ}(P_r, P_θ) should have a gradient that is easy to compute or approximate, i.e.,

    ∇_θ W_{D_φ}(P_r, P_θ) = F(g_θ, φ, Z, ...),

where F is an abstract mapping.
Continuity and Differentiability

Theorem (G., Hong, Lin, and Yang 2018).

1. W_{D_φ}(P_r, P_θ) is continuous in θ if g_θ is continuous in θ.
2. W_{D_φ}(P_r, P_θ) is differentiable almost everywhere if g_θ is locally Lipschitz with a constant L̄(θ, z) such that E[L̄(θ, Z)²] < ∞, i.e., for each given (θ_0, z_0) there exists a neighborhood N such that

       ||g_θ(z) − g_{θ_0}(z_0)||_2 ≤ L̄(θ_0, z_0) (||θ − θ_0||_2 + ||z − z_0||_2)

   for any (θ, z) ∈ N.
Gradient Descent Scheme

Corollary (G., Hong, Lin, and Yang 2018). Assume that g_θ is locally Lipschitz with a constant L(θ, z) such that E[L(θ, Z)²] < ∞, and that ∫_X ||x||_2² (P_r + P_θ)(dx) < +∞. Then there exists a Lipschitz continuous solution f: X → R such that the gradient of the RW divergence has the explicit form

    ∇_θ W_{D_φ}(P_r, P_θ) = E_Z[ [∇_θ g_θ(Z)]⊤ ∇²φ(g_θ(Z)) g_θ(Z) ] + E_Z[ ∇_θ f(∇φ(g_θ(Z))) ].
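As a concrete instantiation (our own note, for the KL choice used in the experiments): with φ(x) = x⊤ log(x), we have ∇φ(x) = log(x) + 1 and ∇²φ(x) = diag(1/x_1, ..., 1/x_d), so ∇²φ(g_θ(Z)) g_θ(Z) = (1, ..., 1)⊤ and the formula reduces to

    ∇_θ W_{D_φ}(P_r, P_θ) = E_Z[ ∇_θ ( ∑_j g_θ(Z)_j ) ] + E_Z[ ∇_θ f( log g_θ(Z) + 1 ) ],

both terms of which can be estimated by Monte Carlo over a minibatch of Z and computed by backpropagation through g_θ.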
Experiment Setup

- RW: the KL divergence cost, with φ(x) = x⊤ log(x).
- Approach: RMSProp [Tieleman and Hinton, 2012].
- Experiment I:
  - Baselines: WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs and ACGANs.
  - Datasets:
    - MNIST: 60000 (train) and 10000 (test).
    - Fashion-MNIST: 60000 (train) and 10000 (test).
- Experiment II:
  - Baselines: WGANs and WGANs-GP.
  - Datasets:
    - CIFAR-10 (color): 50000 (train) and 10000 (test).
    - ImageNet (color): 14197122 images.
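A minimal sketch of the optimizer setup named above (our own illustration, not the authors' code): RMSProp for both networks. The layer sizes, the latent dimension, and the learning rate 5e-5 (a common choice for Wasserstein-style GAN training) are assumptions, not values confirmed by the slides.

```python
import torch
import torch.nn as nn

# Toy MNIST-sized networks (28*28 = 784 pixels; latent dimension 100 assumed).
generator = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 784))
critic = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)   # RMSProp [Tieleman and Hinton, 2012]
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
```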