On Casting Importance Weighted Autoencoder to an EM Algorithm to Learn Deep Generative Models

D. Kim¹, J. Hwang², and Y. Kim¹
Speaker: Dongha Kim

¹ Department of Statistics, Seoul National University, South Korea
² SK Telecom, South Korea

XAIENCE 2019, November 07, 2019
Outline
1 Introduction
2 Proposed methods
   IWAE as EM algorithm
   IWEM
   miss-IWEM
3 Empirical analysis
4 Summary
5 References
Introduction: Deep generative model with latent variable
• X: observable variable
• Z: latent variable, Z ∼ p(z) (e.g., N(0, I))
• X | Z = z ∼ p(x | z; θ)
[Figure: graphical model Z → X]
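To make the two-stage sampling concrete, here is a minimal sketch of ancestral sampling from such a model. The linear-Gaussian decoder (W, b) is an illustrative assumption standing in for the deep network used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the generative process: z ~ p(z) = N(0, I),
# x | z ~ p(x | z; theta) = N(W z + b, I).  The linear-Gaussian
# decoder (W, b) is an illustrative stand-in for a deep network.
d_z, d_x = 2, 4
W = rng.standard_normal((d_x, d_z))       # "theta" of the toy decoder
b = rng.standard_normal(d_x)

z = rng.standard_normal(d_z)              # z ~ p(z)
x = W @ z + b + rng.standard_normal(d_x)  # x ~ p(x | z; theta)
print(x)
```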
Introduction: Deep generative model with latent variable
• The log-likelihood of the observable vector x:
  \[ \log p(x; \theta) = \log \int p(x \mid z; \theta)\, p(z)\, dz. \]
• The marginalization is intractable, so the MLE is hard to compute directly.
• An alternative approach: maximize a lower bound that is easy to compute.
  • VAE (Kingma and Welling, 2013; Rezende et al., 2014)
  • IWAE (Burda et al., 2015)
Introduction: Variational autoencoders (VAE)
• Employ a variational posterior distribution q(z | x; φ):
  \[ \mathcal{L}_{\text{VAE}}(x; \theta, \phi) := \mathbb{E}_{z \sim q}\left[ \log \frac{p(x, z; \theta)}{q(z \mid x; \phi)} \right]. \]
• In practice, we use a Monte Carlo estimate (see the sketch below):
  \[ \hat{\mathcal{L}}_{\text{VAE}}(x; \theta, \phi) := \frac{1}{L} \sum_{l=1}^{L} \log \frac{p(x, z_l; \theta)}{q(z_l \mid x; \phi)}, \quad z_1, \dots, z_L \sim q(z \mid x; \phi). \]
• Maximize \(\hat{\mathcal{L}}_{\text{VAE}}\) with respect to (θ, φ).
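A minimal sketch of the Monte Carlo ELBO, using a toy 1-D model assumed only for illustration (p(z) = N(0, 1), p(x|z; θ) = N(θz, 1), q(z|x; φ) = N(φx, 1); these are not the networks from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D model (an assumption for illustration): p(z) = N(0, 1),
# p(x|z; theta) = N(theta * z, 1), q(z|x; phi) = N(phi * x, 1).
def log_normal(v, mean):
    return -0.5 * (np.log(2 * np.pi) + (v - mean) ** 2)

def log_joint(x, z, theta):            # log p(x, z; theta)
    return log_normal(z, 0.0) + log_normal(x, theta * z)

def log_q(z, x, phi):                  # log q(z | x; phi)
    return log_normal(z, phi * x)

def elbo_mc(x, theta, phi, L=64):
    """Monte Carlo ELBO: (1/L) * sum_l log [ p(x, z_l) / q(z_l | x) ]."""
    z = phi * x + rng.standard_normal(L)   # z_l ~ q(z | x; phi)
    return np.mean(log_joint(x, z, theta) - log_q(z, x, phi))

print(elbo_mc(x=1.5, theta=0.8, phi=0.5))
```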
Introduction: Importance weighted autoencoders (IWAE)
• Use multiple samples from q(z | x; φ):
  \[ \mathcal{L}_{\text{IWAE}}(x; \theta, \phi) := \mathbb{E}_{z_1, \dots, z_K \sim q}\left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, z_k; \theta)}{q(z_k \mid x; \phi)} \right]. \]
• A tighter lower bound than the VAE bound.
• Again use a Monte Carlo estimate:
  \[ \hat{\mathcal{L}}_{\text{IWAE}}(x; \theta, \phi) := \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, z_k; \theta)}{q(z_k \mid x; \phi)}, \quad z_1, \dots, z_K \sim q(z \mid x; \phi). \]
• Maximize \(\hat{\mathcal{L}}_{\text{IWAE}}\) with respect to (θ, φ).
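The same toy model gives a minimal sketch of the IWAE estimator; the log of the weight average is computed in log space with logsumexp for numerical stability (a standard implementation detail, assumed here rather than taken from the slides):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Same toy 1-D model as in the ELBO sketch above.
log_normal = lambda v, m: -0.5 * (np.log(2 * np.pi) + (v - m) ** 2)
log_joint = lambda x, z, th: log_normal(z, 0.0) + log_normal(x, th * z)
log_q = lambda z, x, ph: log_normal(z, ph * x)

def iwae_mc(x, theta, phi, K=64):
    """log (1/K) sum_k p(x, z_k) / q(z_k | x), computed stably in log space."""
    z = phi * x + rng.standard_normal(K)           # z_k ~ q(z | x; phi)
    log_w = log_joint(x, z, theta) - log_q(z, x, phi)
    return logsumexp(log_w) - np.log(K)

print(iwae_mc(x=1.5, theta=0.8, phi=0.5))
```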
Introduction: Contents
• Interpret IWAE as an EM algorithm with importance sampling (IS).
• Improve IWAE by
  1 learning the proposal distribution carefully, and
  2 devising an annealing strategy.
  → IWEM (importance weighted EM algorithm)
• Generalize IWEM to missing-data problems.
  → miss-IWEM
IWAE as EM algorithm: EM algorithm
1 E-step
  • θ_c: the current estimate of θ.
  • Calculate the expected complete-data log-likelihood:
    \[ Q(\theta \mid \theta_c; x) := \mathbb{E}_{z \sim p(z \mid x; \theta_c)}[\log p(x, z; \theta)]. \]
2 M-step
  • Update the current estimate by maximizing Q(θ | θ_c; x).
IWAE as EM algorithm: EM algorithm with IS
1 E-step
  • Approximate Q by employing a proposal distribution q(z | x; φ):
    \[ \hat{Q}(\theta \mid \theta_c, \phi; x) := \sum_{k=1}^{K} \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} \log p(x, z_k; \theta), \]
    where z_k ∼ q(z | x; φ) and w_k = p(x, z_k; θ_c) / q(z_k | x; φ) for k = 1, ..., K.
2 M-step
  • Update θ by maximizing \(\hat{Q}(\theta \mid \theta_c, \phi; x)\).
3 P-step (if necessary)
  • Update φ by encouraging q(z | x; φ) to be a good proposal distribution.
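A minimal sketch of the self-normalized IS estimate \(\hat{Q}\) on the same toy model: the weights are computed at the current estimate θ_c, while the complete log-likelihood inside the sum is evaluated at θ.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Same toy 1-D model as in the earlier sketches.
log_normal = lambda v, m: -0.5 * (np.log(2 * np.pi) + (v - m) ** 2)
log_joint = lambda x, z, th: log_normal(z, 0.0) + log_normal(x, th * z)
log_q = lambda z, x, ph: log_normal(z, ph * x)

def q_hat(theta, theta_c, phi, x, K=64):
    """Self-normalized IS estimate of Q(theta | theta_c): weights use
    theta_c; the complete log-likelihood in the sum uses theta."""
    z = phi * x + rng.standard_normal(K)           # z_k ~ q(z | x; phi)
    log_w = log_joint(x, z, theta_c) - log_q(z, x, phi)
    w_tilde = np.exp(log_w - logsumexp(log_w))     # w_k / sum_k' w_k'
    return np.sum(w_tilde * log_joint(x, z, theta))

print(q_hat(theta=0.9, theta_c=0.8, phi=0.5, x=1.5))
```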
IWAE as EM algorithm: IWAE = EM algorithm
Proposition 1. The following equality holds for any θ_c:
\[ \nabla_\theta \hat{\mathcal{L}}_{\text{IWAE}}(x; \theta, \phi) \big|_{\theta = \theta_c} = \nabla_\theta \hat{Q}(\theta \mid \theta_c, \phi; x) \big|_{\theta = \theta_c}. \]
⇒ IWAE = EM algorithm when a gradient-based optimization method is used.
• Updating φ in IWAE can be understood as a P-step:
\[ \max_\phi \hat{\mathcal{L}}_{\text{IWAE}}(x; \theta_c, \phi). \]
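Proposition 1 is easy to verify numerically on the toy model: with one shared set of samples z_1, ..., z_K, the gradient of \(\hat{\mathcal{L}}_{\text{IWAE}}\) at θ = θ_c matches the gradient of \(\hat{Q}\) when the weights are frozen at θ_c. A minimal PyTorch check (the toy Gaussian model is again an assumption):

```python
import math
import torch

torch.manual_seed(0)

# Toy model: p(z)=N(0,1), p(x|z;theta)=N(theta*z,1), q(z|x;phi)=N(phi*x,1).
def log_normal(v, mean):
    return -0.5 * (math.log(2 * math.pi) + (v - mean) ** 2)

x, phi, K = torch.tensor(1.5), 0.5, 8
z = phi * x + torch.randn(K)          # one shared set of samples z_1..z_K
theta_c = torch.tensor(0.8)
log_q = log_normal(z, phi * x)

def log_joint(theta):                 # log p(x, z_k; theta) for all k
    return log_normal(z, 0.0) + log_normal(x, theta * z)

# Gradient of the IWAE objective at theta = theta_c.
theta = theta_c.clone().requires_grad_(True)
l_iwae = torch.logsumexp(log_joint(theta) - log_q, 0) - math.log(K)
(g_iwae,) = torch.autograd.grad(l_iwae, theta)

# Gradient of Q_hat(theta | theta_c) at theta = theta_c (weights frozen).
theta = theta_c.clone().requires_grad_(True)
w = torch.softmax(log_joint(theta_c) - log_q, 0)   # normalized IS weights
(g_q,) = torch.autograd.grad(torch.sum(w * log_joint(theta)), theta)

print(g_iwae.item(), g_q.item())      # agree up to floating-point error
```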
IWAE as EM algorithm: IWAE = EM algorithm (cont.)
IWAE as an EM algorithm:
1 E-step: calculate \(\hat{Q}(\theta \mid \theta_c, \phi; x)\).
2 M-step: update θ by maximizing \(\hat{Q}(\theta \mid \theta_c, \phi; x)\).
3 P-step: update φ by maximizing \(\hat{\mathcal{L}}_{\text{IWAE}}(x; \theta_c, \phi)\).
IWEM: Optimal P-step
• Using \(\hat{Q}\) inevitably introduces variance due to IS.
• Small variance → stable learning procedure.
• The optimal proposal distribution (Owen, 2013):
  \[ q_{\text{opt}}(z) \propto |\log p(x, z; \theta_c)| \cdot p(x, z; \theta_c), \]
  whereas IWAE uses p(x, z; θ_c) itself.
• New P-step: replace p(x, z; θ_c) in the IWAE objective with q_opt(z):
  \[ \hat{\mathcal{L}}_{\text{opt}}(\theta_c, \phi; x) := \log \frac{1}{K} \sum_{k=1}^{K} \frac{q_{\text{opt}}(z_k)}{q(z_k \mid x; \phi)}. \]
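A minimal sketch of this P-step objective on the toy model. One point worth noting (my reading, not stated on the slide): q_opt is known only up to its normalizing constant, but that constant merely shifts the objective, so it does not change the maximizer over φ.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Same toy 1-D model as in the earlier sketches.
log_normal = lambda v, m: -0.5 * (np.log(2 * np.pi) + (v - m) ** 2)
log_joint = lambda x, z, th: log_normal(z, 0.0) + log_normal(x, th * z)
log_q = lambda z, x, ph: log_normal(z, ph * x)

def l_opt(phi, x, theta_c, K=64):
    """P-step objective with the (unnormalized) optimal proposal
    |log p(x, z)| * p(x, z) in place of p(x, z)."""
    z = phi * x + rng.standard_normal(K)          # z_k ~ q(z | x; phi)
    log_p = log_joint(x, z, theta_c)
    log_q_opt = np.log(np.abs(log_p)) + log_p     # log q_opt up to a constant
    return logsumexp(log_q_opt - log_q(z, x, phi)) - np.log(K)

print(l_opt(phi=0.5, x=1.5, theta_c=0.8))
```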
IWEM: Annealing strategy
• In general,
  \[ \mathrm{Var}\big[\hat{\mathcal{L}}_{\text{VAE}}(x; \theta, \phi)\big] \ll \mathrm{Var}\big[\hat{Q}(\theta \mid \theta, \phi; x)\big]. \]
• Using the VAE objective at early steps keeps the variance small.
• New E-step: take a convex combination with the VAE objective:
  \[ \hat{Q}_\alpha(\theta \mid \theta_c, \phi; x) := \alpha \cdot \hat{Q}(\theta \mid \theta_c, \phi; x) + (1 - \alpha) \cdot \hat{\mathcal{L}}_{\text{VAE}}(\theta, \phi; x). \]
• α ∈ [0, 1] is an annealing controller that
  • starts from zero and
  • increases incrementally up to one as the iterations proceed (see the sketch below).
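A minimal sketch of the annealed E-step objective. The linear schedule and its length are assumptions; the slides only say that α increases incrementally from 0 to 1.

```python
def q_alpha(q_hat_val: float, elbo_val: float, step: int,
            anneal_steps: int = 10_000) -> float:
    """Convex combination Q_alpha = alpha * Q_hat + (1 - alpha) * L_VAE.
    alpha follows a linear schedule from 0 to 1 over the first
    `anneal_steps` iterations (the schedule itself is an assumption)."""
    alpha = min(1.0, step / anneal_steps)
    return alpha * q_hat_val + (1 - alpha) * elbo_val

# Early in training the objective is dominated by the low-variance VAE
# bound; late in training it coincides with the IS-based Q_hat.
print(q_alpha(-87.0, -88.2, step=500))
print(q_alpha(-87.0, -88.2, step=20_000))
```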
IWEM: IWAE vs. IWEM

IWAE
1 E-step: calculate \(\hat{Q}(\theta \mid \theta_c, \phi; x)\).
2 M-step: update θ by maximizing \(\hat{Q}\).
3 P-step: update φ by maximizing \(\hat{\mathcal{L}}_{\text{IWAE}}(x; \theta_c, \phi)\).

IWEM
1 E-step: calculate \(\hat{Q}_\alpha(\theta \mid \theta_c, \phi; x)\).
2 M-step: update θ by maximizing \(\hat{Q}_\alpha\).
3 P-step: update φ by maximizing \(\hat{\mathcal{L}}_{\text{opt}}(\theta_c, \phi; x)\).
miss-IWEM: Missing data problem
• x = (x^(o), x^(m)), and we only observe x^(o).
• The log-likelihood is
  \[ \log p(x^{(o)}; \theta) = \log \int\!\!\int p(x^{(o)}, x^{(m)}, z; \theta)\, dz\, dx^{(m)}. \]
• We therefore need a proposal distribution over (x^(m), z).
miss-IWEM: Formulation of the proposal distribution
• We use the following proposal distribution:
  \[ q(x^{(m)}, z \mid x^{(o)}; \theta, \phi) := p(x^{(m)} \mid z; \theta) \cdot q(z \mid \breve{x}; \phi), \]
  where q(z | x; φ) is the same distribution as q in IWEM.
• \(\breve{x} = (x^{(o)}, \breve{x}^{(m)})\), with \(\breve{x}^{(m)}\) an imputed value of x^(m):
  • draw \(\breve{z}\) from q(z | (x^(o), 0); φ), and
  • draw \(\breve{x}^{(m)}\) from p(x^(m) | \(\breve{z}\); θ).
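A minimal sketch of sampling from this proposal. The two sampler stubs are hypothetical stand-ins for the learned networks; their names and forms are assumptions made only so the recipe is runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stubs standing in for the learned networks.
def sample_q(x_full):                   # z ~ q(z | x; phi)
    return x_full.mean() + rng.standard_normal()

def sample_p_x_given_z(z, size):        # x ~ p(x | z; theta)
    return z + rng.standard_normal(size)

def propose(x_obs, miss_idx, d):
    """Draw (x_m, z) from q(x_m, z | x_o) = p(x_m | z; theta) q(z | x_breve; phi)."""
    x0 = np.zeros(d)
    obs_idx = np.setdiff1d(np.arange(d), miss_idx)
    x0[obs_idx] = x_obs                                 # (x_o, 0)
    z_breve = sample_q(x0)                              # z_breve ~ q(z | (x_o, 0))
    x_breve = x0.copy()
    x_breve[miss_idx] = sample_p_x_given_z(z_breve, len(miss_idx))  # impute x_m
    z = sample_q(x_breve)                               # z ~ q(z | x_breve)
    x_m = sample_p_x_given_z(z, len(miss_idx))          # x_m ~ p(x_m | z)
    return x_m, z

print(propose(x_obs=np.array([0.3, -1.2]), miss_idx=np.array([1, 3]), d=4))
```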
miss-IWEM
miss-IWEM simply replaces q(z | x; φ) in IWEM with q(x^(m), z | x^(o); θ, φ).
1 E-step: calculate
  \[ \hat{Q}^m_\alpha(\theta \mid \theta_c, \phi; x^{(o)}) := \alpha \cdot \hat{Q}^m(\theta \mid \theta_c, \phi; x^{(o)}) + (1 - \alpha) \cdot \hat{\mathcal{L}}^m_{\text{VAE}}(\theta, \phi; x^{(o)}). \]
2 M-step: update θ by maximizing \(\hat{Q}^m_\alpha(\theta \mid \theta_c, \phi; x^{(o)})\).
3 P-step: update φ by maximizing \(\hat{\mathcal{L}}^m_{\text{opt}}(\theta_c, \phi; x^{(o)})\).
Empirical analysis: Experimental setup
• Model
  • p(z): N(0_40, I_40)
  • (p(x | z; θ), q(z | x; φ)): (MLP, MLP) or (DeConv, Conv)
• Optimization algorithm: Adam (Kingma and Ba, 2014)
• Performance measure: approximated test log-likelihood
• Datasets: static biMNIST, dynamic biMNIST, Omniglot, Caltech 101 Silhouettes
Empirical analysis: Complete data analysis (performance results)

MLP           VAE        IWAE       IWEM-woa¹   IWEM
sta. MNIST    -88.21     -87.68     -87.00      -87.11
dyn. MNIST    -85.31     -84.30     -84.10      -84.16
Omniglot      -108.46    -106.80    -106.50     -106.38
Caltech 101   -119.67    -118.06    -116.92     -116.54

CNN           VAE        IWAE       IWEM-woa¹   IWEM
sta. MNIST    -84.63     -83.54     -83.32      -83.77
dyn. MNIST    -84.08     -81.56     -81.07      -81.28
Omniglot      -101.63    -100.27    -100.15     -100.39
Caltech 101   -109.24    -106.94    -106.19     -106.05

¹ IWEM without the annealing strategy
Empirical analysis: Incomplete data analysis (generation of missing samples)
1 Divide an image into 9 equal patches (a 3×3 grid).
2 Generate an incomplete image by removing a predefined number of patches at random (see the sketch below).
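A minimal sketch of this corruption procedure for 28×28 images. Since 28 is not divisible by 3, the exact patch boundaries used here are an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_patches(img, n_drop):
    """Split an image into a 3x3 grid of (roughly) equal patches and zero
    out n_drop randomly chosen patches."""
    h, w = img.shape
    rows = np.array_split(np.arange(h), 3)
    cols = np.array_split(np.arange(w), 3)
    out, observed = img.copy(), np.ones_like(img, dtype=bool)
    for k in rng.choice(9, size=n_drop, replace=False):
        r, c = rows[k // 3], cols[k % 3]
        out[np.ix_(r, c)] = 0.0          # remove the patch
        observed[np.ix_(r, c)] = False   # record the missing region
    return out, observed

x_incomplete, mask = drop_patches(rng.random((28, 28)), n_drop=4)
print(mask.mean())                       # fraction of observed pixels
```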
Empirical analysis: Incomplete data analysis (performance results)
• Setting: static biMNIST + (MLP, MLP).
• As the missing rate increases, the margin over missIWAE widens.

# of cropped patches   missIWAE²   miss-IWEM-woa³   miss-IWEM
3                      -90.29      -89.79           -89.71
4                      -92.07      -90.97           -90.76
5                      -95.54      -93.33           -92.23
6                      -102.26     -97.66           -95.18

² Mattei and Frellsen (2018)
³ miss-IWEM without the annealing strategy