MLE: Gaussian Distribution
The definition of the multivariate Gaussian distribution:

  q(x | y; µ, Σ) = 1/√((2π)^d |Σ|) · exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ))    (18)

◮ For y = +1, MLE on µ_+ and Σ_+ will only consider the samples x with y = +1 (call this subset S_+)
◮ MLE on µ_+:

  µ_+ = (1/|S_+|) Σ_{x_i ∈ S_+} x_i    (19)

◮ MLE on Σ_+:

  Σ_+ = (1/|S_+|) Σ_{x_i ∈ S_+} (x_i − µ_+)(x_i − µ_+)^T    (20)

◮ Exercise: prove equations 19 and 20 with d = 1 (a code sketch of these estimates follows below)
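As a concrete companion to equations 19 and 20 (not part of the original slides), here is a minimal NumPy sketch; the array name X_pos is an assumed placeholder for the samples x_i with y = +1:

```python
import numpy as np

def gaussian_mle(X_pos):
    """MLE of (mu, Sigma) from the samples of one class, following equations 19-20.

    X_pos: array of shape (n, d) holding the samples x_i with y = +1 (the set S_+).
    """
    n, d = X_pos.shape
    mu = X_pos.mean(axis=0)            # equation 19: average over S_+
    centered = X_pos - mu
    Sigma = centered.T @ centered / n  # equation 20: note the 1/|S_+| factor, not 1/(n-1)
    return mu, Sigma

# Usage with synthetic data (assumed parameter values, chosen for illustration only)
rng = np.random.default_rng(0)
X_pos = rng.multivariate_normal([2.0, 0.0], [[1.0, 0.8], [0.8, 2.0]], size=500)
mu_hat, Sigma_hat = gaussian_mle(X_pos)
```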
Example: Parameter Estimation
Given N = 1000 samples, here are the parameters:

  Parameter | p(·)                      | q(·)
  µ_+       | [2, 0]^T                  | [1.95, −0.11]^T
  Σ_+       | [[1.0, 0.8], [0.8, 2.0]]  | [[0.88, 0.74], [0.74, 1.97]]
  µ_-       | [−2, 0]^T                 | [−2.08, 0.08]^T
  Σ_-       | [[2.0, 0.6], [0.6, 1.0]]  | [[1.88, 0.55], [0.55, 1.07]]
Prediction
◮ For a new data point x′, the prediction is given as

  q(y′ | x′) = q(y′) q(x′ | y′) / q(x′) ∝ q(y′) q(x′ | y′)    (21)

There is no need to compute q(x′).
◮ Prediction rule (a code sketch follows below):

  y′ = +1 if q(y′ = +1 | x′) > q(y′ = −1 | x′),  and  y′ = −1 if q(y′ = +1 | x′) < q(y′ = −1 | x′)    (22)

◮ Although equation 22 looks like the one used in the Bayes optimal predictor, the prediction power is limited by how well

  q(y′ | x′) ≈ p(y | x)    (23)

Again, we do not know p(·).
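A hedged sketch of the prediction rule in equations 21 and 22, assuming the prior q(y = +1) = alpha and the fitted Gaussian parameters are already available (all variable names are placeholders, not from the slides):

```python
from scipy.stats import multivariate_normal

def predict(x_new, alpha, mu_pos, Sigma_pos, mu_neg, Sigma_neg):
    """Return +1 or -1 using q(y'|x') ∝ q(y') q(x'|y'); q(x') is never computed (equations 21-22)."""
    score_pos = alpha * multivariate_normal.pdf(x_new, mean=mu_pos, cov=Sigma_pos)
    score_neg = (1 - alpha) * multivariate_normal.pdf(x_new, mean=mu_neg, cov=Sigma_neg)
    return +1 if score_pos > score_neg else -1
```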
Naive Bayes Classifiers
Number of Parameters
Assume x = (x_{·,1}, . . . , x_{·,d}) ∈ R^d; then the number of parameters in q(x, y) is:
◮ q(y): 1 (α)
◮ q(x | y = +1):
  ◮ µ_+ ∈ R^d: d parameters
  ◮ Σ_+ ∈ R^{d×d}: d² parameters
◮ q(x | y = −1): d² + d parameters
In total, we have 2d² + 2d + 1 parameters.
Challenge of Parameter Estimation
◮ When d = 100, we have 2d² + 2d + 1 = 20201 parameters
◮ A closer look at the covariance matrix Σ in a multivariate Gaussian distribution:

  Σ = [ σ²_{1,1}  ···  σ²_{1,d}
        ⋮          ⋱   ⋮
        σ²_{d,1}  ···  σ²_{d,d} ]    (24)

◮ To reduce the number of parameters, we assume

  σ_{i,j} = 0 if i ≠ j    (25)
Diagonal Covariance Matrix
With the diagonal covariance matrix

  Σ = [ σ²_{1,1}  ···  0
        ⋮          ⋱   ⋮
        0         ···  σ²_{d,d} ]    (26)

the multivariate Gaussian distribution can be rewritten with

  |Σ| = ∏_{j=1}^d σ²_{j,j}    (27)

  (x − µ)^T Σ^{−1} (x − µ) = Σ_{j=1}^d (x_{·,j} − µ_j)² / σ²_{j,j}    (28)
Diagonal Covariance Matrix (II)
In other words,

  q(x | y; µ, Σ) = ∏_{j=1}^d q(x_{·,j} | y; µ_j, σ²_{j,j})    (29)

◮ Conditional independence: equation 29 means that, given y, each component x_{·,j} is independent of the other components
◮ This is a strong and naive assumption about q(x | ·)
◮ Together with q(y), this generative model is called the Naive Bayes classifier
◮ Parameter estimation can be done per dimension (see the sketch below)
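Because of the conditional-independence assumption in equation 29, each class now needs only d means and d variances, which can be estimated independently per dimension. A minimal sketch (the variable names are assumptions, not from the slides):

```python
import numpy as np

def naive_bayes_mle(X_class):
    """Per-dimension MLE for one class under the diagonal-covariance assumption (equation 29).

    X_class: array of shape (n, d), the samples belonging to one class.
    Returns the mean vector (mu_j) and the d diagonal variances (sigma^2_{j,j}).
    """
    mu = X_class.mean(axis=0)                 # mu_j for each dimension j
    var = ((X_class - mu) ** 2).mean(axis=0)  # sigma^2_{j,j}; off-diagonal entries are assumed 0
    return mu, var
```

This drops the per-class parameter count from d² + d to 2d.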
Example: Parameter Estimation
Given N = 1000 samples, here are the parameters:

  Parameter | p(·)                      | q(·)                          | Naive Bayes
  µ_+       | [2, 0]^T                  | [1.95, −0.11]^T               | [1.95, −0.11]^T
  Σ_+       | [[1.0, 0.8], [0.8, 2.0]]  | [[0.88, 0.74], [0.74, 1.97]]  | [[0.88, 0], [0, 1.97]]
  µ_-       | [−2, 0]^T                 | [−2.08, 0.08]^T               | [−2.08, 0.08]^T
  Σ_-       | [[2.0, 0.6], [0.6, 1.0]]  | [[1.88, 0.55], [0.55, 1.07]]  | [[1.88, 0], [0, 1.07]]
Latent Variable Models
Data Generation Model, Revisited
Consider the following model again, without any label information:

  p(x) = α · N(x; µ_1, Σ_1) + (1 − α) · N(x; µ_2, Σ_2)    (30)

where the first term corresponds to component c = 1 and the second to c = 2.
◮ No labeling information
◮ Instead of having two classes, the model now has two components c ∈ {1, 2}
◮ It is a specific case of Gaussian mixture models: a mixture model with two Gaussian components
Data Generation
The data generation process: for each data point,
1. Randomly select a component c based on

  p(c = 1) = α,  p(c = 2) = 1 − α    (31)

2. Sample x from the corresponding component c:

  p(x | c) = N(x; µ_1, Σ_1) if c = 1,  N(x; µ_2, Σ_2) if c = 2    (32)

3. Add x to S and go back to step 1
A sketch of this process follows below.
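The three-step process above translates directly into code; a minimal sketch using assumed parameter values (the true µ_c, Σ_c are of course unknown in practice):

```python
import numpy as np

def sample_gmm(m, alpha, mu1, Sigma1, mu2, Sigma2, seed=0):
    """Generate m samples: pick a component (step 1), sample x from it (step 2), collect (step 3)."""
    rng = np.random.default_rng(seed)
    X = np.empty((m, len(mu1)))
    for i in range(m):
        if rng.random() < alpha:                         # step 1: c = 1 with probability alpha
            X[i] = rng.multivariate_normal(mu1, Sigma1)  # step 2: x ~ N(mu1, Sigma1)
        else:
            X[i] = rng.multivariate_normal(mu2, Sigma2)  # otherwise x ~ N(mu2, Sigma2)
    return X

# Illustrative values only (borrowed from the earlier example table)
S = sample_gmm(1000, alpha=0.5, mu1=[2.0, 0.0], Sigma1=[[1.0, 0.8], [0.8, 2.0]],
               mu2=[-2.0, 0.0], Sigma2=[[2.0, 0.6], [0.6, 1.0]])
```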
Illustration
(Figure: an example data set S with 1,000 samples; no label information available.)
The Learning Problem
Consider using the following distribution to fit the data S:

  q(x) = α · N(x; µ_1, Σ_1) + (1 − α) · N(x; µ_2, Σ_2)    (33)

◮ This is a density estimation problem, one of the unsupervised learning problems
◮ The number of components in q(x) is part of the assumption, based on our understanding of the data
◮ Without knowing the true data distribution, the number of components is treated as a hyper-parameter (predetermined before learning)
Parameter Estimation
◮ Based on the general form of GMMs, the parameters are θ = {α, µ_1, Σ_1, µ_2, Σ_2}
◮ Given a set of training examples S = {x_1, . . . , x_m}, the straightforward method is MLE:

  L(θ) = Σ_{i=1}^m log q(x_i; θ) = Σ_{i=1}^m log [α · N(x_i; µ_1, Σ_1) + (1 − α) · N(x_i; µ_2, Σ_2)]    (34)

◮ Learning: θ ← argmax_{θ′} L(θ′); a sketch of evaluating L(θ) is given below
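As referenced in the last bullet, equation 34 can be evaluated directly for any candidate θ; a small sketch (the log-sum-exp trick is used only for numerical stability and is not part of the slides):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, alpha, mu1, Sigma1, mu2, Sigma2):
    """L(theta) = sum_i log[ alpha * N(x_i; mu1, Sigma1) + (1 - alpha) * N(x_i; mu2, Sigma2) ]."""
    log_p1 = np.log(alpha) + multivariate_normal.logpdf(X, mean=mu1, cov=Sigma1)
    log_p2 = np.log(1 - alpha) + multivariate_normal.logpdf(X, mean=mu2, cov=Sigma2)
    return logsumexp(np.stack([log_p1, log_p2]), axis=0).sum()
```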
Singularity in GMM Parameter Estimation
Singularity happens when one of the mixture components captures only a single data point, which eventually drives the (log-)likelihood to ∞
◮ It is easy to overfit the training set using GMMs, for example when K = m
◮ This issue does not exist when estimating the parameters of a single Gaussian distribution
Gradient-based Learning
Recall the definition of L(θ):

  L(θ) = Σ_{i=1}^m log [α · N(x_i; µ_1, Σ_1) + (1 − α) · N(x_i; µ_2, Σ_2)]    (35)

◮ There is no closed-form solution of ∇L(θ) = 0
◮ E.g., the value of α depends on {µ_c, Σ_c}_{c=1}^2, and vice versa
◮ Gradient-based learning is still feasible: θ^(new) ← θ^(old) + η · ∇L(θ)
Latent Variable Models
To rewrite equation 33 into a full probabilistic form, we introduce a random variable z ∈ {1, 2} with

  q(z = 1) = α,  q(z = 2) = 1 − α    (36)

or, equivalently,

  q(z) = α^{δ(z=1)} (1 − α)^{δ(z=2)}    (37)

◮ z is a random variable that indicates the mixture component for x (a similar role to y in the classification problem)
◮ z is not directly observed in the data; therefore it is a latent (random) variable
GMM with Latent Variable
With the latent variable z, we can rewrite the probabilistic model as a joint distribution over x and z:

  q(x, z) = q(z) q(x | z) = α^{δ(z=1)} · N(x; µ_1, Σ_1)^{δ(z=1)} · (1 − α)^{δ(z=2)} · N(x; µ_2, Σ_2)^{δ(z=2)}    (38)

And the marginal probability q(x) is the same as in equation 33:

  q(x) = q(z = 1) q(x | z = 1) + q(z = 2) q(x | z = 2) = α · N(x; µ_1, Σ_1) + (1 − α) · N(x; µ_2, Σ_2)    (39)
Parameter Estimation: MLE?
For each x_i, we introduce a latent variable z_i as the mixture component indicator; then the log-likelihood is defined as

  ℓ(θ) = Σ_{i=1}^m log q(x_i, z_i)
       = Σ_{i=1}^m log [α^{δ(z_i=1)} · N(x_i; µ_1, Σ_1)^{δ(z_i=1)} · (1 − α)^{δ(z_i=2)} · N(x_i; µ_2, Σ_2)^{δ(z_i=2)}]    (40)
       = Σ_{i=1}^m [δ(z_i = 1) log α + δ(z_i = 1) log N(x_i; µ_1, Σ_1) + δ(z_i = 2) log(1 − α) + δ(z_i = 2) log N(x_i; µ_2, Σ_2)]

Question: we already know that z_i is a random variable, but is E[δ(z_i = 1)] = α?
EM Algorithm
Basic Idea
◮ The key challenge of GMMs with latent variables is that we do not know the distributions of {z_i}
◮ The basic idea of the EM algorithm is to alternately address the challenge between

  {z_i}_{i=1}^m  ⇔  θ = {α, µ_1, Σ_1, µ_2, Σ_2}    (41)

◮ Basic procedure
  1. Fix θ, estimate the distributions of {z_i}_{i=1}^m
  2. Fix the distributions of {z_i}_{i=1}^m, estimate the value of θ
  3. Go back to step 1
How to Estimate z_i?
Fixing θ, we can estimate the distribution of each z_i as (using equations 38 and 39)

  q(z_i | x_i) = q(x_i, z_i) / q(x_i)    (42)

In particular, we have

  q(z_i = 1 | x_i) = α · N(x_i; µ_1, Σ_1) / [α · N(x_i; µ_1, Σ_1) + (1 − α) · N(x_i; µ_2, Σ_2)]    (43)
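Equation 43 in code; a minimal sketch (the name gamma anticipates the γ_i notation introduced on the next slide):

```python
from scipy.stats import multivariate_normal

def responsibilities(X, alpha, mu1, Sigma1, mu2, Sigma2):
    """q(z_i = 1 | x_i) for every row of X, as in equation 43."""
    p1 = alpha * multivariate_normal.pdf(X, mean=mu1, cov=Sigma1)
    p2 = (1 - alpha) * multivariate_normal.pdf(X, mean=mu2, cov=Sigma2)
    return p1 / (p1 + p2)   # shape (m,): one gamma_i per sample
```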
Expectation
Let γ_i be the expectation of the component indicator δ(z_i = 1) under the distribution q(z_i | x_i):

  E[δ(z_i = 1)] = γ_i    (44)

◮ Since δ(z_i = 1) is a Bernoulli random variable, we also have q(z_i = 1 | x_i) = γ_i
◮ Written out, the expectation of δ(z_i = 1) under q(z_i | x_i) is

  E[δ(z_i = 1)] = 1 · q(z_i = 1 | x_i) + 0 · q(z_i = 2 | x_i) = q(z_i = 1 | x_i) = γ_i    (45)
Parameter Estimation (I)
Given

  ℓ(θ) = Σ_{i=1}^m [δ(z_i = 1) log α + δ(z_i = 1) log N(x_i; µ_1, Σ_1) + δ(z_i = 2) log(1 − α) + δ(z_i = 2) log N(x_i; µ_2, Σ_2)]    (46)

maximizing ℓ(θ) with respect to α requires

  Σ_{i=1}^m [δ(z_i = 1)/α − δ(z_i = 2)/(1 − α)] = 0    (47)

and therefore

  α|_z = Σ_{i=1}^m δ(z_i = 1) / Σ_{i=1}^m (δ(z_i = 1) + δ(z_i = 2)) = (1/m) Σ_{i=1}^m δ(z_i = 1)    (48)

which is similar to the classification example, except that z_i is a random variable.
Parameter Estimation (II)
Without going through the details, the estimates of the mean and covariance take similar forms. For example, for the first component we have

  µ_1|_z = Σ_{i=1}^m δ(z_i = 1) x_i / Σ_{i=1}^m δ(z_i = 1)    (49)

  Σ_1|_z = Σ_{i=1}^m δ(z_i = 1)(x_i − µ_1)(x_i − µ_1)^T / Σ_{i=1}^m δ(z_i = 1)    (50)

Question: how do we eliminate the randomness in α, µ_1, Σ_1 (and similarly in µ_2, Σ_2)?
Expectation (II)
With E[δ(z_i = 1)] = γ_i, we have

  α = E[α|_z] = (1/m) Σ_{i=1}^m E[δ(z_i = 1)] = (1/m) Σ_{i=1}^m γ_i    (51)

Similarly, replacing each indicator by its expectation, we have

  µ_1 = Σ_{i=1}^m γ_i x_i / Σ_{i=1}^m γ_i            µ_2 = Σ_{i=1}^m (1 − γ_i) x_i / Σ_{i=1}^m (1 − γ_i)

  Σ_1 = Σ_{i=1}^m γ_i (x_i − µ_1)(x_i − µ_1)^T / Σ_{i=1}^m γ_i            Σ_2 = Σ_{i=1}^m (1 − γ_i)(x_i − µ_2)(x_i − µ_2)^T / Σ_{i=1}^m (1 − γ_i)    (52)
The EM Algorithm, Review
The algorithm iteratively runs the following two steps:
E-step: Given θ, for each x_i estimate the distribution of the corresponding latent variable z_i,

  q(z_i | x_i) = q(x_i, z_i) / q(x_i)    (53)

and its expectation γ_i
M-step: Given {z_i}_{i=1}^m, maximize the log-likelihood function ℓ(θ) and estimate the parameter θ with {γ_i}_{i=1}^m
A code sketch of both steps follows below.
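Putting the two steps together gives the following sketch of EM for the two-component mixture; the initialization and the fixed number of iterations are assumptions for illustration, not prescribed by the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, n_iter=100, seed=0):
    """EM for a two-component Gaussian mixture, following equations 43, 51 and 52."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    # Assumed initialization: two random data points as means, identity covariances, alpha = 0.5
    idx = rng.choice(m, size=2, replace=False)
    mu1, mu2 = X[idx[0]].copy(), X[idx[1]].copy()
    Sigma1, Sigma2 = np.eye(d), np.eye(d)
    alpha = 0.5
    for _ in range(n_iter):
        # E-step: gamma_i = q(z_i = 1 | x_i)                       (equation 43)
        p1 = alpha * multivariate_normal.pdf(X, mean=mu1, cov=Sigma1)
        p2 = (1 - alpha) * multivariate_normal.pdf(X, mean=mu2, cov=Sigma2)
        gamma = p1 / (p1 + p2)
        # M-step: re-estimate theta from the expectations gamma_i  (equations 51-52)
        n1, n2 = gamma.sum(), (1 - gamma).sum()
        alpha = n1 / m
        mu1 = (gamma[:, None] * X).sum(axis=0) / n1
        mu2 = ((1 - gamma)[:, None] * X).sum(axis=0) / n2
        Sigma1 = (gamma[:, None] * (X - mu1)).T @ (X - mu1) / n1
        Sigma2 = ((1 - gamma)[:, None] * (X - mu2)).T @ (X - mu2) / n2
    return alpha, mu1, Sigma1, mu2, Sigma2
```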
Illustration
(Figure: an illustration of the EM algorithm; see [Bishop, 2006, Page 437].)
Variational Inference (Optional)
The Computation of q(z | x)
◮ In the previous example, we were able to compute the analytic solution of q(z | x) as

  q(z | x) = q(x, z) / q(x)    (54)

where q(x) = Σ_z q(x, z)
◮ Challenge: unlike the simple case of GMMs, q(x) is usually difficult to compute,

  q(x) = Σ_z q(x, z)    (discrete)    (55)
  q(x) = ∫_z q(x, z) dz    (continuous)    (56)
Solution
◮ Instead of computing q(x) and then q(z | x), we propose another distribution q′(z | x) to approximate q(z | x),

  q′(z | x) ≈ q(z | x)    (57)

where q′(z | x) should be simple enough to facilitate the computation
◮ The objective for finding a good approximation is the Kullback–Leibler (KL) divergence

  KL(q′ ‖ q) = Σ_z q′(z | x) log [q′(z | x) / q(z | x)]    (discrete)
  KL(q′ ‖ q) = ∫_z q′(z | x) log [q′(z | x) / q(z | x)] dz    (continuous)
KL Divergence
◮ KL(q′ ‖ q) ≥ 0, and the equality holds if and only if q′ = q
◮ Consider the continuous case for visualization purposes:

  KL(q′ ‖ q) = ∫_z q′(z | x) log [q′(z | x) / q(z | x)] dz    (58)

◮ Regardless of what q(z | x) looks like, we are free to define q′(z | x) for simplicity
◮ Because of the q(z | x) term in equation 58, the challenge still exists
ELBo
The learning objective for q′(z | x) is

  KL(q′ ‖ q) = ∫_z q′(z | x) log [q′(z | x) / q(z | x)] dz
             = ∫_z q′(z | x) log [q′(z | x) q(x) / q(z, x)] dz
             = ∫_z q′(z | x) log [q′(z | x) q(x) / (q(x | z) q(z))] dz
             = ∫_z q′(z | x) [−log q(x | z) + log (q′(z | x) / q(z)) + log q(x)] dz
             = −E[log q(x | z)] + KL(q′(z | x) ‖ q(z)) + log q(x)
             = −ELBo + log q(x)

Since log q(x) does not depend on q′, minimizing KL(q′ ‖ q) is equivalent to maximizing the Evidence Lower Bound (ELBo),

  ELBo = E[log q(x | z)] − KL(q′(z | x) ‖ q(z))
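For the discrete two-component case, the identity KL(q′ ‖ q) = −ELBo + log q(x) can be checked numerically; a small sketch with assumed parameter values and an arbitrary q′:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed two-component model q(x, z) and a single observation x (illustrative values only)
prior = np.array([0.5, 0.5])                              # q(z)
means = [np.array([2.0, 0.0]), np.array([-2.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
x = np.array([1.0, 0.5])

lik = np.array([multivariate_normal.pdf(x, mean=means[k], cov=covs[k]) for k in range(2)])
posterior = prior * lik / (prior * lik).sum()             # exact q(z | x)

q_prime = np.array([0.7, 0.3])                            # an arbitrary approximation q'(z | x)
elbo = (q_prime * np.log(lik)).sum() - (q_prime * np.log(q_prime / prior)).sum()
kl_to_posterior = (q_prime * np.log(q_prime / posterior)).sum()
log_qx = np.log((prior * lik).sum())

assert np.isclose(kl_to_posterior, -elbo + log_qx)        # the identity derived above
```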