
CS 6316 Machine Learning: Generative Models
Yangfeng Ji
Department of Computer Science, University of Virginia


MLE: Gaussian Distribution

The multivariate Gaussian distribution is defined as
q(x \mid y; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma|}} \exp\Big( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \Big)   (18)

◮ For y = +1, MLE of µ_+ and Σ_+ only considers the samples x with y = +1 (call this subset S_+)
◮ MLE of µ_+:
  \mu_+ = \frac{1}{|S_+|} \sum_{x_i \in S_+} x_i   (19)
◮ MLE of Σ_+:
  \Sigma_+ = \frac{1}{|S_+|} \sum_{x_i \in S_+} (x_i - \mu_+)(x_i - \mu_+)^\top   (20)
◮ Exercise: prove equations 19 and 20 for the case d = 1
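
A minimal NumPy sketch of these estimators (the helper name, the label encoding in {+1, -1}, and the inclusion of the class prior are illustrative assumptions, not from the slides):

```python
import numpy as np

def fit_class_conditional_gaussians(X, y):
    """MLE of the class prior and the per-class Gaussian parameters (equations 19-20).

    X: (m, d) feature array; y: (m,) labels in {+1, -1}.
    Returns alpha = q(y = +1) and params[label] = (mu, Sigma).
    """
    alpha = np.mean(y == +1)
    params = {}
    for label in (+1, -1):
        X_c = X[y == label]                  # the subset S_+ (or S_-)
        mu = X_c.mean(axis=0)                # equation 19
        diff = X_c - mu
        Sigma = diff.T @ diff / len(X_c)     # equation 20: normalize by |S_+|, not |S_+| - 1
        params[label] = (mu, Sigma)
    return alpha, params
```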

Example: Parameter Estimation

Given N = 1000 samples, here are the parameters:

Parameter   p(·) (true)                  q(·) (estimated)
µ_+         [2, 0]^T                     [1.95, -0.11]^T
Σ_+         [[1.0, 0.8], [0.8, 2.0]]     [[0.88, 0.74], [0.74, 1.97]]
µ_-         [-2, 0]^T                    [-2.08, 0.08]^T
Σ_-         [[2.0, 0.6], [0.6, 1.0]]     [[1.88, 0.55], [0.55, 1.07]]

Prediction

◮ For a new data point x', the prediction is given by
  q(y' \mid x') = \frac{q(y')\, q(x' \mid y')}{q(x')} \propto q(y')\, q(x' \mid y')   (21)
  so there is no need to compute q(x')
◮ Prediction rule (a code sketch follows below):
  y' = \begin{cases} +1 & \text{if } q(y' = +1 \mid x') > q(y' = -1 \mid x') \\ -1 & \text{if } q(y' = +1 \mid x') < q(y' = -1 \mid x') \end{cases}   (22)
◮ Although equation 22 looks like the Bayes optimal predictor, its prediction power is limited by how well
  q(y' \mid x') \approx p(y \mid x)   (23)
  holds; again, we do not know p(·)
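
The rule in equation 22 is straightforward to implement on top of the fitted parameters; a minimal sketch, reusing the hypothetical fit_class_conditional_gaussians output above and SciPy's Gaussian density:

```python
from scipy.stats import multivariate_normal

def predict(x_new, alpha, params):
    """Prediction rule (equations 21-22): compare q(y') q(x'|y') for the two labels."""
    prior = {+1: alpha, -1: 1.0 - alpha}
    score = {label: prior[label] * multivariate_normal.pdf(x_new, mean=mu, cov=Sigma)
             for label, (mu, Sigma) in params.items()}   # q(x') is never needed
    return +1 if score[+1] > score[-1] else -1
```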

Naive Bayes Classifiers

Number of Parameters

Assume x = (x_{·,1}, …, x_{·,d}) ∈ R^d. The number of parameters in q(x, y) is:
◮ q(y): 1 parameter (α)
◮ q(x | y = +1):
  ◮ µ_+ ∈ R^d: d parameters
  ◮ Σ_+ ∈ R^{d×d}: d^2 parameters
◮ q(x | y = -1): d^2 + d parameters
In total, we have 2d^2 + 2d + 1 parameters.

Challenge of Parameter Estimation

◮ When d = 100, we have 2d^2 + 2d + 1 = 20201 parameters
◮ A closer look at the covariance matrix Σ of a multivariate Gaussian distribution:
  \Sigma = \begin{pmatrix} \sigma^2_{1,1} & \cdots & \sigma^2_{1,d} \\ \vdots & \ddots & \vdots \\ \sigma^2_{d,1} & \cdots & \sigma^2_{d,d} \end{pmatrix}   (24)
◮ To reduce the number of parameters, we assume
  \sigma_{i,j} = 0 \quad \text{if } i \neq j   (25)

Diagonal Covariance Matrix

With a diagonal covariance matrix
\Sigma = \begin{pmatrix} \sigma^2_{1,1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^2_{d,d} \end{pmatrix}   (26)
the multivariate Gaussian distribution can be rewritten using
|\Sigma| = \prod_{j=1}^{d} \sigma^2_{j,j}   (27)
(x - \mu)^\top \Sigma^{-1} (x - \mu) = \sum_{j=1}^{d} \frac{(x_{\cdot,j} - \mu_j)^2}{\sigma^2_{j,j}}   (28)
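
As a quick numerical sanity check, the sketch below (with made-up dimensions and parameter values) verifies equations 27 and 28 and the resulting factorization of the density into per-dimension univariate Gaussians:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([2.0, 0.0, -1.0])
sigma2 = np.array([1.0, 2.0, 0.5])       # per-dimension variances sigma^2_{j,j}
Sigma = np.diag(sigma2)                  # diagonal covariance matrix (equation 26)
x = np.array([1.5, -0.3, -0.8])          # an arbitrary query point

# Equation 27: the determinant is the product of the diagonal entries.
assert np.isclose(np.linalg.det(Sigma), np.prod(sigma2))

# Equation 28: the quadratic form decomposes into a sum over dimensions.
assert np.isclose((x - mu) @ np.linalg.inv(Sigma) @ (x - mu),
                  np.sum((x - mu) ** 2 / sigma2))

# Consequence: the joint density equals the product of univariate densities.
assert np.isclose(multivariate_normal.pdf(x, mean=mu, cov=Sigma),
                  np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))))
```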

Diagonal Covariance Matrix (II)

In other words,
q(x \mid y; \mu, \Sigma) = \prod_{j=1}^{d} q(x_{\cdot,j} \mid y; \mu_j, \sigma^2_{j,j})   (29)

◮ Conditional independence: equation 29 means that, given y, each component x_j is independent of the other components
◮ This is a strong and naive assumption about q(x | ·)
◮ Together with q(y), this generative model is called the Naive Bayes classifier
◮ Parameter estimation can be done per dimension (see the sketch below)
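
Because the model factorizes, fitting reduces to d independent univariate estimates per class. A minimal sketch, again assuming labels in {+1, -1} and a hypothetical helper name:

```python
import numpy as np

def fit_gaussian_naive_bayes(X, y):
    """Per-dimension MLE for a Gaussian Naive Bayes model (diagonal covariance).

    Returns alpha = q(y = +1) and, per class, the d means and d variances.
    """
    alpha = np.mean(y == +1)
    params = {}
    for label in (+1, -1):
        X_c = X[y == label]
        params[label] = (X_c.mean(axis=0),   # mu_j for each dimension j (equation 19 per dimension)
                         X_c.var(axis=0))    # sigma^2_{j,j}: only the diagonal of equation 20
    return alpha, params
```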

Example: Parameter Estimation

Given N = 1000 samples, here are the parameters:

Parameter   p(·) (true)                  q(·) (full covariance)          Naive Bayes
µ_+         [2, 0]^T                     [1.95, -0.11]^T                 [1.95, -0.11]^T
Σ_+         [[1.0, 0.8], [0.8, 2.0]]     [[0.88, 0.74], [0.74, 1.97]]    [[0.88, 0], [0, 1.97]]
µ_-         [-2, 0]^T                    [-2.08, 0.08]^T                 [-2.08, 0.08]^T
Σ_-         [[2.0, 0.6], [0.6, 1.0]]     [[1.88, 0.55], [0.55, 1.07]]    [[1.88, 0], [0, 1.07]]

Latent Variable Models

Data Generation Model, Revisited

Consider the following model again, without any label information:
p(x) = \underbrace{\alpha \cdot \mathcal{N}(x; \mu_1, \Sigma_1)}_{c = 1} + \underbrace{(1 - \alpha) \cdot \mathcal{N}(x; \mu_2, \Sigma_2)}_{c = 2}   (30)

◮ No labeling information
◮ Instead of two classes, the model now has two components, c ∈ {1, 2}
◮ It is a specific case of Gaussian mixture models (GMMs): a mixture model with two Gaussian components

Data Generation

The data generation process: for each data point,
1. Randomly select a component c based on
   p(c = 1) = \alpha, \qquad p(c = 2) = 1 - \alpha   (31)
2. Sample x from the corresponding component c:
   p(x \mid c) = \begin{cases} \mathcal{N}(x; \mu_1, \Sigma_1) & c = 1 \\ \mathcal{N}(x; \mu_2, \Sigma_2) & c = 2 \end{cases}   (32)
3. Add x to S and go back to step 1 (a sampling sketch follows below)
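
The process is easy to simulate; a minimal NumPy sketch, with the mixture weight and component parameters chosen to loosely match the running example (the exact values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5
mu = [np.array([2.0, 0.0]), np.array([-2.0, 0.0])]
Sigma = [np.array([[1.0, 0.8], [0.8, 2.0]]),
         np.array([[2.0, 0.6], [0.6, 1.0]])]

def sample_gmm(n):
    """Generate n points from the two-component mixture in equation 30."""
    S = []
    for _ in range(n):
        c = 0 if rng.random() < alpha else 1          # step 1: pick a component (equation 31)
        x = rng.multivariate_normal(mu[c], Sigma[c])  # step 2: sample from that component (equation 32)
        S.append(x)                                   # step 3: add x to S and repeat
    return np.array(S)

S = sample_gmm(1000)   # e.g., the 1,000-sample data set shown on the next slide
```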

Illustration

Here is an example data set S with 1,000 samples; no label information is available.

The Learning Problem

Consider using the following distribution to fit the data S:
q(x) = \alpha \cdot \mathcal{N}(x; \mu_1, \Sigma_1) + (1 - \alpha) \cdot \mathcal{N}(x; \mu_2, \Sigma_2)   (33)

◮ This is a density estimation problem, one of the unsupervised learning problems
◮ The number of components in q(x) is part of our modeling assumption, based on our understanding of the data
◮ Without knowing the true data distribution, the number of components is treated as a hyper-parameter (predetermined before learning)

Parameter Estimation

◮ Based on the general form of GMMs, the parameters are θ = {α, µ_1, Σ_1, µ_2, Σ_2}
◮ Given a set of training examples S = {x_1, …, x_m}, the straightforward method is MLE (a code sketch follows below):
  L(\theta) = \sum_{i=1}^{m} \log q(x_i; \theta) = \sum_{i=1}^{m} \log\big( \alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1) + (1 - \alpha) \cdot \mathcal{N}(x_i; \mu_2, \Sigma_2) \big)   (34)
◮ Learning: \theta \leftarrow \arg\max_{\theta'} L(\theta')
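
For reference, equation 34 translates directly into code; a minimal sketch, assuming SciPy's multivariate normal density and, for example, the synthetic data S generated above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(S, alpha, mu1, Sigma1, mu2, Sigma2):
    """Equation 34: log-likelihood of the two-component GMM on the data set S."""
    p1 = multivariate_normal.pdf(S, mean=mu1, cov=Sigma1)   # N(x_i; mu_1, Sigma_1) for every x_i
    p2 = multivariate_normal.pdf(S, mean=mu2, cov=Sigma2)   # N(x_i; mu_2, Sigma_2) for every x_i
    return np.sum(np.log(alpha * p1 + (1.0 - alpha) * p2))
```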

Singularity in GMM Parameter Estimation

Singularity happens when one of the mixture components captures only a single data point, which eventually drives the (log-)likelihood to ∞.

◮ It is easy to overfit the training set using GMMs, for example when the number of components K equals the number of samples m
◮ This issue does not exist when estimating the parameters of a single Gaussian distribution

Gradient-based Learning

Recall the definition of L(θ):
L(\theta) = \sum_{i=1}^{m} \log\big( \alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1) + (1 - \alpha) \cdot \mathcal{N}(x_i; \mu_2, \Sigma_2) \big)   (35)

◮ There is no closed-form solution of ∇L(θ) = 0; e.g., the optimal value of α depends on {µ_c, Σ_c}_{c=1}^{2}, and vice versa
◮ Gradient-based learning is still feasible, using updates of the form θ^{(new)} ← θ^{(old)} + η · ∇L(θ^{(old)}) (see the sketch below)
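
Even though the stationary-point equations have no closed-form solution, the gradient itself does. The sketch below is a simplification for illustration: it performs gradient ascent on the mixture weight α alone, keeping the Gaussian parameters fixed (a full implementation would update all of θ, e.g. via automatic differentiation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def grad_ascent_alpha(S, mu1, Sigma1, mu2, Sigma2, alpha=0.3, eta=1e-4, steps=200):
    """Gradient ascent on alpha only, with dL/d(alpha) derived from equation 35."""
    p1 = multivariate_normal.pdf(S, mean=mu1, cov=Sigma1)
    p2 = multivariate_normal.pdf(S, mean=mu2, cov=Sigma2)
    for _ in range(steps):
        mix = alpha * p1 + (1.0 - alpha) * p2
        grad = np.sum((p1 - p2) / mix)                                 # dL/d(alpha)
        alpha = float(np.clip(alpha + eta * grad, 1e-6, 1.0 - 1e-6))   # keep alpha in (0, 1)
    return alpha
```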

Latent Variable Models

To rewrite equation 33 in a fully probabilistic form, we introduce a random variable z ∈ {1, 2} with
q(z = 1) = \alpha, \qquad q(z = 2) = 1 - \alpha   (36)
or, equivalently,
q(z) = \alpha^{\delta(z = 1)} (1 - \alpha)^{\delta(z = 2)}   (37)

◮ z is a random variable that indicates the mixture component for x (a role similar to that of y in the classification problem)
◮ z is not directly observed in the data; therefore it is a latent (random) variable

GMM with Latent Variable

With the latent variable z, we can rewrite the probabilistic model as a joint distribution over x and z:
q(x, z) = q(z)\, q(x \mid z) = \alpha^{\delta(z = 1)} \cdot \mathcal{N}(x; \mu_1, \Sigma_1)^{\delta(z = 1)} \cdot (1 - \alpha)^{\delta(z = 2)} \cdot \mathcal{N}(x; \mu_2, \Sigma_2)^{\delta(z = 2)}   (38)

The marginal probability q(x) is then the same as in equation 33:
q(x) = q(z = 1)\, q(x \mid z = 1) + q(z = 2)\, q(x \mid z = 2) = \alpha \cdot \mathcal{N}(x; \mu_1, \Sigma_1) + (1 - \alpha) \cdot \mathcal{N}(x; \mu_2, \Sigma_2)   (39)

Parameter Estimation: MLE?

For each x_i, we introduce a latent variable z_i as a mixture-component indicator; the log-likelihood is then defined as

\ell(\theta) = \sum_{i=1}^{m} \log q(x_i, z_i)
            = \sum_{i=1}^{m} \log\Big( \alpha^{\delta(z_i = 1)} \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1)^{\delta(z_i = 1)} \cdot (1 - \alpha)^{\delta(z_i = 2)} \cdot \mathcal{N}(x_i; \mu_2, \Sigma_2)^{\delta(z_i = 2)} \Big)   (40)
            = \sum_{i=1}^{m} \Big( \delta(z_i = 1) \log \alpha + \delta(z_i = 1) \log \mathcal{N}(x_i; \mu_1, \Sigma_1) + \delta(z_i = 2) \log(1 - \alpha) + \delta(z_i = 2) \log \mathcal{N}(x_i; \mu_2, \Sigma_2) \Big)

Question: we already know that z_i is a random variable, but is E[δ(z_i = 1)] = α?

EM Algorithm

Basic Idea

◮ The key challenge of a GMM with latent variables is that we do not know the distributions of {z_i}
◮ The basic idea of the EM algorithm is to alternately address the challenge between
  \{z_i\}_{i=1}^{m} \;\Leftrightarrow\; \theta = \{\alpha, \mu_1, \Sigma_1, \mu_2, \Sigma_2\}   (41)
◮ Basic procedure:
  1. Fix θ, estimate the distributions of {z_i}_{i=1}^{m}
  2. Fix the distributions of {z_i}_{i=1}^{m}, estimate the value of θ
  3. Go back to step 1

How to Estimate z_i?

Fixing θ, we can estimate the distribution of each z_i (using equations 38 and 39) as
q(z_i \mid x_i) = \frac{q(x_i, z_i)}{q(x_i)}   (42)

In particular, we have
q(z_i = 1 \mid x_i) = \frac{\alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1)}{\alpha \cdot \mathcal{N}(x_i; \mu_1, \Sigma_1) + (1 - \alpha) \cdot \mathcal{N}(x_i; \mu_2, \Sigma_2)}   (43)
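
Equation 43 is exactly the E-step computation; a minimal vectorized sketch (the helper name is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(S, alpha, mu1, Sigma1, mu2, Sigma2):
    """Equation 43: gamma_i = q(z_i = 1 | x_i) for every sample in S."""
    p1 = alpha * multivariate_normal.pdf(S, mean=mu1, cov=Sigma1)
    p2 = (1.0 - alpha) * multivariate_normal.pdf(S, mean=mu2, cov=Sigma2)
    return p1 / (p1 + p2)
```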

Expectation

Let γ_i be the expectation of z_i under the distribution q(z_i | x_i):
E[z_i] = \gamma_i   (44)

◮ Since z_i behaves like a Bernoulli random variable (indicating the first component), we also have q(z_i = 1 | x_i) = γ_i
◮ Furthermore, the expectation of δ(z_i = 1) under the distribution q(z_i | x_i) is
  E[\delta(z_i = 1)] = 1 \cdot q(z_i = 1 \mid x_i) + 0 \cdot q(z_i = 2 \mid x_i) = q(z_i = 1 \mid x_i) = \gamma_i   (45)

Parameter Estimation (I)

Given
\ell(\theta) = \sum_{i=1}^{m} \Big( \delta(z_i = 1) \log \alpha + \delta(z_i = 1) \log \mathcal{N}(x_i; \mu_1, \Sigma_1) + \delta(z_i = 2) \log(1 - \alpha) + \delta(z_i = 2) \log \mathcal{N}(x_i; \mu_2, \Sigma_2) \Big)   (46)

maximizing ℓ(θ) with respect to α gives
\sum_{i=1}^{m} \Big( \frac{\delta(z_i = 1)}{\alpha} - \frac{\delta(z_i = 2)}{1 - \alpha} \Big) = 0   (47)

and
\alpha|_z = \frac{\sum_{i=1}^{m} \delta(z_i = 1)}{\sum_{i=1}^{m} \big( \delta(z_i = 1) + \delta(z_i = 2) \big)} = \frac{\sum_{i=1}^{m} \delta(z_i = 1)}{m}   (48)

which is similar to the classification example, except that z_i is a random variable.

Parameter Estimation (II)

Without going through the details, the estimates of the mean and covariance take similar forms. For example, for the first component,
\mu_1|_z = \frac{\sum_{i=1}^{m} \delta(z_i = 1)\, x_i}{\sum_{i=1}^{m} \delta(z_i = 1)}   (49)
\Sigma_1|_z = \frac{\sum_{i=1}^{m} \delta(z_i = 1)\, (x_i - \mu_1)(x_i - \mu_1)^\top}{\sum_{i=1}^{m} \delta(z_i = 1)}   (50)

Question: how do we eliminate the randomness in α, µ_1, Σ_1 (and similarly in µ_2, Σ_2)?

Expectation (II)

With E[δ(z_i = 1)] = γ_i, we have
\alpha = E[\alpha|_z] = \frac{1}{m} \sum_{i=1}^{m} E[\delta(z_i = 1)] = \frac{1}{m} \sum_{i=1}^{m} \gamma_i   (51)

Similarly, taking expectations in equations 49 and 50 (and in their counterparts for the second component),
\mu_1 = \frac{\sum_{i=1}^{m} \gamma_i\, x_i}{\sum_{i=1}^{m} \gamma_i}, \qquad \mu_2 = \frac{\sum_{i=1}^{m} (1 - \gamma_i)\, x_i}{\sum_{i=1}^{m} (1 - \gamma_i)}
\Sigma_1 = \frac{\sum_{i=1}^{m} \gamma_i\, (x_i - \mu_1)(x_i - \mu_1)^\top}{\sum_{i=1}^{m} \gamma_i}, \qquad \Sigma_2 = \frac{\sum_{i=1}^{m} (1 - \gamma_i)\, (x_i - \mu_2)(x_i - \mu_2)^\top}{\sum_{i=1}^{m} (1 - \gamma_i)}   (52)

These are the M-step updates (see the sketch below).
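
A minimal NumPy sketch of these weighted updates, given the responsibilities γ from the e_step sketch above (function names are illustrative):

```python
import numpy as np

def m_step(S, gamma):
    """M-step updates (equations 51-52) given gamma_i = q(z_i = 1 | x_i)."""
    w1, w2 = gamma, 1.0 - gamma
    alpha = w1.sum() / len(S)                                   # equation 51
    mu1 = (w1[:, None] * S).sum(axis=0) / w1.sum()
    mu2 = (w2[:, None] * S).sum(axis=0) / w2.sum()
    d1, d2 = S - mu1, S - mu2
    Sigma1 = (w1[:, None, None] * np.einsum('ij,ik->ijk', d1, d1)).sum(axis=0) / w1.sum()
    Sigma2 = (w2[:, None, None] * np.einsum('ij,ik->ijk', d2, d2)).sum(axis=0) / w2.sum()
    return alpha, mu1, Sigma1, mu2, Sigma2
```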

The EM Algorithm, Review

The algorithm iteratively runs the following two steps:

E-step: Given θ, for each x_i estimate the distribution of the corresponding latent variable z_i,
  q(z_i \mid x_i) = \frac{q(x_i, z_i)}{q(x_i)}   (53)
  and its expectation γ_i

M-step: Given {z_i}_{i=1}^{m}, maximize the log-likelihood function ℓ(θ) and estimate the parameter θ using {γ_i}_{i=1}^{m}
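
Putting the E-step, M-step, and log-likelihood sketches together gives a complete (if simplistic) EM loop; the initialization scheme and iteration count below are arbitrary choices, not from the slides:

```python
import numpy as np

def fit_gmm_em(S, n_iters=50, seed=0):
    """EM for the two-component GMM, reusing the hypothetical e_step / m_step helpers above."""
    rng = np.random.default_rng(seed)
    # Simple initialization: two random data points as means, shared sample covariance, alpha = 0.5.
    i, j = rng.choice(len(S), size=2, replace=False)
    mu1, mu2, Sigma1, Sigma2, alpha = S[i], S[j], np.cov(S.T), np.cov(S.T), 0.5
    for _ in range(n_iters):
        gamma = e_step(S, alpha, mu1, Sigma1, mu2, Sigma2)   # E-step (equation 53)
        alpha, mu1, Sigma1, mu2, Sigma2 = m_step(S, gamma)   # M-step (equations 51-52)
    return alpha, mu1, Sigma1, mu2, Sigma2

# Usage on the synthetic data set from the sampling sketch:
# alpha, mu1, Sigma1, mu2, Sigma2 = fit_gmm_em(S)
# print(gmm_log_likelihood(S, alpha, mu1, Sigma1, mu2, Sigma2))
```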

Illustration

See [Bishop, 2006, page 437].

Variational Inference (Optional)

The Computation of q(z | x)

◮ In the previous example, we were able to compute the analytic solution of q(z | x) as
  q(z \mid x) = \frac{q(x, z)}{q(x)}   (54)
  where q(x) = \sum_z q(x, z)
◮ Challenge: unlike the simple case of GMMs, q(x) is usually difficult to compute:
  q(x) = \sum_z q(x, z) \quad \text{(discrete)}   (55)
  q(x) = \int_z q(x, z)\, dz \quad \text{(continuous)}   (56)

Solution

◮ Instead of computing q(x) and then q(z | x), we propose another distribution q'(z | x) to approximate q(z | x),
  q'(z \mid x) \approx q(z \mid x)   (57)
  where q'(z | x) should be simple enough to facilitate the computation
◮ The objective for finding a good approximation is the Kullback–Leibler (KL) divergence:
  \mathrm{KL}(q' \,\|\, q) = \sum_z q'(z \mid x) \log \frac{q'(z \mid x)}{q(z \mid x)} \quad \text{(discrete)}
  \mathrm{KL}(q' \,\|\, q) = \int_z q'(z \mid x) \log \frac{q'(z \mid x)}{q(z \mid x)}\, dz \quad \text{(continuous)}

KL Divergence

◮ KL(q' ∥ q) ≥ 0, and equality holds if and only if q' = q (a small numerical check follows below)
◮ Consider the continuous case for visualization purposes:
  \mathrm{KL}(q' \,\|\, q) = \int_z q'(z \mid x) \log \frac{q'(z \mid x)}{q(z \mid x)}\, dz   (58)
◮ Regardless of what q(z | x) looks like, we are free to define q'(z | x) to be as simple as we like
◮ However, because q(z | x) still appears in equation 58, the challenge still exists
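
A minimal sketch of the KL divergence for the discrete case (assuming strictly positive probability vectors, so the 0 · log 0 convention is not needed):

```python
import numpy as np

def kl_divergence(q_prime, q):
    """KL(q' || q) for two discrete distributions given as probability vectors."""
    q_prime, q = np.asarray(q_prime, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q_prime * np.log(q_prime / q)))

print(kl_divergence([0.3, 0.7], [0.3, 0.7]))   # 0.0: equality iff q' = q
print(kl_divergence([0.3, 0.7], [0.5, 0.5]))   # > 0 otherwise
```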

ELBo

The learning objective for q'(z | x) is

\mathrm{KL}(q' \,\|\, q) = \int_z q'(z \mid x) \log \frac{q'(z \mid x)}{q(z \mid x)}\, dz
                        = \int_z q'(z \mid x) \log \frac{q'(z \mid x)\, q(x)}{q(z, x)}\, dz
                        = \int_z q'(z \mid x) \log \frac{q'(z \mid x)\, q(x)}{q(x \mid z)\, q(z)}\, dz
                        = \int_z q'(z \mid x) \Big( -\log q(x \mid z) + \log \frac{q'(z \mid x)}{q(z)} + \log q(x) \Big)\, dz
                        = -\mathbb{E}_{q'(z \mid x)}\big[ \log q(x \mid z) \big] + \mathrm{KL}\big(q'(z \mid x) \,\|\, q(z)\big) + \log q(x)
                        = -\mathrm{ELBo} + \log q(x)

Since log q(x) does not depend on q'(z | x), minimizing KL(q' ∥ q) is equivalent to maximizing the Evidence Lower Bound (ELBo):
\mathrm{ELBo} = \mathbb{E}_{q'(z \mid x)}\big[ \log q(x \mid z) \big] - \mathrm{KL}\big(q'(z \mid x) \,\|\, q(z)\big)
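
To make the bound concrete, the sketch below estimates the ELBo by Monte Carlo for a toy, fully Gaussian model; the model, the variational family, and all parameter values are assumptions chosen for illustration, not part of the slides:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def elbo_estimate(x, m, s, n_samples=10_000):
    """Monte Carlo ELBo for: q(z) = N(0, 1), q(x|z) = N(z, 1), q'(z|x) = N(m, s^2)."""
    z = rng.normal(m, s, size=n_samples)                        # samples from q'(z|x)
    expected_loglik = norm.logpdf(x, loc=z, scale=1.0).mean()   # E_{q'}[log q(x|z)]
    kl = 0.5 * (s**2 + m**2 - 1.0 - np.log(s**2))               # closed-form KL(N(m, s^2) || N(0, 1))
    return expected_loglik - kl

# The exact posterior in this toy model is N(x/2, 1/2), so the ELBo is highest near those values.
x = 1.0
print(elbo_estimate(x, m=0.5, s=np.sqrt(0.5)))   # close to log q(x)
print(elbo_estimate(x, m=-1.0, s=1.0))           # a poorer q', hence a lower ELBo
```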
