Learning Discrete and Continuous Factors of Data via Alternating Disentanglement. Yeonwoo Jeong, Hyun Oh Song. Seoul National University. ICML 2019.
Motivation
◮ Our goal is to disentangle the underlying explanatory factors of data without any supervision.
(Figure: an example image annotated with its factors: Shape? square, Position x? 0.3, Position y? 0.7, Rotation? 40°, Size? 0.5.)
Motivation
(Figure: pairs of images that differ in exactly one factor while all others stay fixed: shape (square vs. ellipse), position x (0.3 vs. 1), rotation (40° vs. 0°), and size (0.5 vs. 1).)
Motivation
◮ Most recent methods focus on learning only the continuous factors of variation.
◮ Learning discrete representations is known to be a challenging problem, and jointly learning continuous and discrete representations is even more challenging.
Outline: Method, Experiments, Conclusion
Overview of our method
(Figure: the encoder q_φ(z|x) maps the input x to the continuous latent code z = (z_1, ..., z_m); a min cost flow solver selects the discrete code d; the decoder p_θ(x|z, d) produces the reconstruction x̂; β_h and β_l weight the KL regularizer on different latent dimensions.)
Overview of our method
◮ We propose an efficient procedure for implicitly penalizing the total correlation by controlling the information flow into each latent variable.
◮ We propose a method for jointly learning discrete and continuous latent variables in an alternating maximization framework.
Limitation of the β-VAE framework
◮ β-VAE sets β > 1 to penalize TC(z) for disentangled representations.
◮ However, it also penalizes the mutual information I(x; z) between the data and the latent variables.
Our method
◮ We aim to penalize TC(z) by sequentially penalizing each individual summand I(z_{1:i-1}; z_i):

    TC(z) = Σ_{i=2}^{m} I(z_{1:i-1}; z_i).

◮ We implicitly minimize each summand I(z_{1:i-1}; z_i) by sequentially maximizing the left-hand side I(x; z_{1:i}) for all i = 2, ..., m:

    I(x; z_{1:i}) = I(x; z_{1:i-1}) + I(x; z_i) − I(z_{1:i-1}; z_i).

  Step 1: maximize I(x; z_{1:i}). Step 2: since I(x; z_{1:i-1}) was already maximized in the previous step and stays fixed, the increase must come from raising I(x; z_i) and lowering I(z_{1:i-1}; z_i).
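Not shown on the slide, but the decomposition of TC(z) into these summands follows from a telescoping sum of entropies; a short derivation:

```latex
\begin{align*}
\sum_{i=2}^{m} I(z_{1:i-1}; z_i)
  &= \sum_{i=2}^{m} \left[ H(z_i) + H(z_{1:i-1}) - H(z_{1:i}) \right] \\
  &= \sum_{i=2}^{m} H(z_i) + H(z_1) - H(z_{1:m}) \\
  &= \sum_{i=1}^{m} H(z_i) - H(z) \;=\; \mathrm{TC}(z).
\end{align*}
```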
Our method
◮ In practice, we maximize I(x; z_{1:i}) by minimizing the reconstruction term while penalizing z_{i+1:m} with a high β (:= β_h) and the remaining dimensions with a small β (:= β_l).
Our method
(Figure: the architecture from the overview slide, with β_h and β_l weighting the KL regularizer on different latent dimensions.)
◮ Every latent dimension is heavily penalized with β_h at first. The penalty on each latent dimension is then relieved, one dimension at a time, to β_l in a cascading fashion; a minimal implementation sketch follows below.
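The cascading schedule can be implemented with a per-dimension weight vector on the KL term. Below is a minimal PyTorch-style sketch under a few assumptions (a diagonal Gaussian posterior, a squared-error reconstruction term, and an illustrative schedule that relieves one dimension every `steps_per_dim` updates); the function names and hyperparameter values are placeholders, not the authors' code.

```python
import torch

def kl_per_dim(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), kept separate per latent dimension.
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)   # shape: (batch, m)

def cascaded_betas(step, m, beta_l=1.0, beta_h=50.0, steps_per_dim=5000):
    # The first k dimensions have had their penalty relieved to beta_l;
    # the remaining dimensions are still heavily penalized with beta_h.
    k = min(m, step // steps_per_dim)
    betas = torch.full((m,), beta_h)
    betas[:k] = beta_l
    return betas

def loss(x, x_hat, mu, logvar, step):
    # Reconstruction term plus the per-dimension weighted KL regularizer.
    recon = torch.nn.functional.mse_loss(x_hat, x, reduction="sum") / x.size(0)
    betas = cascaded_betas(step, mu.size(1)).to(mu.device)
    kl = (betas * kl_per_dim(mu, logvar)).sum(dim=1).mean()
    return recon + kl
```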
Graphical model
Figure: graphical model view. Solid lines denote the generative process and dashed lines denote the inference process. x, z, and d denote the data, the continuous latent code, and the discrete latent code, respectively.
Motivation of our method
◮ Unlike JointVAE, AAE with supervised discrete variables (AAE-S) can learn good continuous representations, because the burden of simultaneously modeling the continuous and discrete factors is relieved through supervision on the discrete factors.
◮ Inspired by this finding, our idea is to alternate between finding the most likely discrete configuration of the variables given the continuous factors, and updating the parameters (φ, θ) given the discrete configurations. A schematic training loop is sketched below.
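A schematic sketch of the alternating scheme; `encoder`, `decoder`, and `solve_discrete_configuration` are placeholders for the components described on the following slides, `loss` refers to the weighted-KL sketch above, and the optimizer and batching details are assumptions rather than the authors' exact training procedure.

```python
import torch

def train_epoch(encoder, decoder, optimizer, loader, solve_discrete_configuration, step):
    for x in loader:
        # Step 1: fix (phi, theta) and find the most likely discrete codes d for the batch.
        with torch.no_grad():
            z, _, _ = encoder(x)                              # assumed to return (sample, mu, logvar)
            d = solve_discrete_configuration(decoder, x, z)   # one-hot codes, shape (batch, S)

        # Step 2: fix the discrete codes d and update the parameters (phi, theta).
        z, mu, logvar = encoder(x)
        x_hat = decoder(z, d)
        objective = loss(x, x_hat, mu, logvar, step)          # reconstruction + weighted KL (see above)
        optimizer.zero_grad()
        objective.backward()
        optimizer.step()
        step += 1
    return step
```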
Construct unary term
(Figure: the input x^(1) is decoded with each one-hot code e_1, ..., e_k, ..., e_S in turn; the reconstruction likelihoods are collected into the unary vector u^(1).)
◮ The discrete latent variables are represented using one-hot encodings, d^(i) ∈ {e_1, ..., e_S}.
◮ u^(i) denotes the vector of log-likelihoods log p_θ(x^(i) | z^(i), e_k) evaluated at each k ∈ [S] (see the sketch below).
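A minimal sketch of how the unary terms could be computed, assuming a Gaussian decoder with fixed variance so that the log-likelihood reduces, up to an additive constant, to the negative squared reconstruction error; all names are illustrative.

```python
import torch

def unary_terms(decoder, x, z, S):
    """Compute u[i, k] = log p_theta(x^(i) | z^(i), e_k) for every sample i and one-hot code e_k.

    Assumes a Gaussian decoder with fixed variance, so the log-likelihood is
    (up to an additive constant) the negative squared reconstruction error.
    """
    n = x.size(0)
    u = torch.empty(n, S, device=x.device)
    for k in range(S):
        e_k = torch.zeros(n, S, device=x.device)
        e_k[:, k] = 1.0                          # the same one-hot code e_k for the whole batch
        x_hat = decoder(z, e_k)
        err = (x_hat - x).flatten(1).pow(2).sum(dim=1)
        u[:, k] = -err                           # higher value = more likely code for that sample
    return u
```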
Alternating minimization scheme
◮ Our goal is to maximize the variational lower bound of the following objective:

    L(θ, φ) = I(x; [z, d]) − β E_{x∼p(x)} D_KL(q_φ(z|x) || p(z)) − λ D_KL(q(d) || p(d)).

◮ After rearranging the terms, we arrive at the following optimization problem (its objective is denoted L_LB(θ, φ)):

    maximize_{θ, φ}  maximize_{d^(1), ..., d^(n)}   Σ_{i=1}^{n} u^(i)⊤ d^(i) − λ' Σ_{i≠j} d^(i)⊤ d^(j) − β Σ_{i=1}^{n} D_KL(q_φ(z|x^(i)) || p(z))
    subject to  ||d^(i)||_1 = 1,  d^(i) ∈ {0, 1}^S,  ∀i.
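A small identity, not on the slide, that clarifies the role of the pairwise term: with one-hot codes it depends only on how many samples share each discrete value,

```latex
\sum_{i \neq j} d^{(i)\top} d^{(j)} \;=\; \sum_{k=1}^{S} n_k (n_k - 1),
\qquad n_k := \bigl|\{\, i : d^{(i)} = e_k \,\}\bigr|,
```

so, since Σ_k n_k is fixed at n, penalizing this term encourages the n samples to be spread evenly over the S discrete values.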
Finding the most likely discrete configuration
(Figure: for every sample x^(1), ..., x^(i), ..., x^(n), the unary terms are built by decoding with each one-hot code e_1, ..., e_k, ..., e_S and measuring the reconstruction.)
◮ With the unary terms, we solve the inner maximization problem L_LB(θ, φ) over the discrete variables [d^(1), ..., d^(n)] with the min cost flow solver¹ (a simplified stand-in is sketched below).

¹ Jeong, Y. and Song, H. O. "Efficient end-to-end learning for quantizable representations", ICML 2018.
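The paper solves this inner problem exactly with the min cost flow solver of Jeong & Song (2018). As an illustration of the objective only, and not the authors' solver, here is a simple greedy coordinate-ascent stand-in that re-assigns one sample at a time; combined with the `unary_terms` sketch above, it could play the role of `solve_discrete_configuration` in the earlier training-loop sketch.

```python
import torch

def assign_discrete(u, lam, n_iters=50):
    """Greedy coordinate-ascent stand-in for the inner maximization over d^(1), ..., d^(n).

    Approximately maximizes  sum_i u[i, d_i] - lam * sum_{i != j} [d_i == d_j]
    by cyclically re-assigning one sample at a time. The paper solves this
    subproblem exactly with a min cost flow solver; this sketch only
    illustrates the structure of the objective.
    """
    n, S = u.shape
    d = u.argmax(dim=1)                        # initialize from the unary terms alone
    counts = torch.bincount(d, minlength=S)    # n_k: how many samples currently use each code
    for _ in range(n_iters):
        changed = False
        for i in range(n):
            counts[d[i]] -= 1                  # remove sample i from its current code
            # Assigning i to code k adds 2 * lam * n_k (ordered pairs) to the penalty.
            score = u[i] - 2.0 * lam * counts
            k = int(score.argmax())
            if k != int(d[i]):
                changed = True
            d[i] = k
            counts[k] += 1
        if not changed:
            break
    # Return one-hot codes d^(i), shape (n, S).
    return torch.nn.functional.one_hot(d, num_classes=S).float()
```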