Mutual Angular Regularization of Latent Variable Models: Theory, Algorithm and Applications
Pengtao Xie, joint work with Yuntian Deng and Eric Xing
Carnegie Mellon University
Latent Variable Models (LVMs)
(Figure: latent variable models as a family of machine learning methods for discovering patterns in data)
Latent Variable Models
- Topic Models: latent topics over words
- Gaussian Mixture Model: latent groups over feature vectors
- Many others: Hidden Markov Model, Kalman Filtering, Restricted Boltzmann Machine, Deep Belief Network, Factor Analysis, Neural Network, Sparse Coding, Matrix Factorization, Distance Metric Learning, Principal Component Analysis, etc.
Latent Variable Models
Latent factors behind data are captured by components in LVMs.
- Topic Models: topics in documents, e.g., Politics (Obama, Constitution, Government), Economics (GDP, Bank, Marketing), Education (University, Knowledge, Student)
- Gaussian Mixture Model: groups in images, e.g., Tiger, Car, Food
Motivation I: Popularity of latent factors is skewed
Popularity of latent factors follows a power-law distribution.
- Topics in news: dominant topics such as Politics (Obama, Constitution, Government) and Economics (GDP, Bank, Marketing), plus many long-tail topics
- Groups in Flickr photos: dominant groups such as Furniture (Sofa, Closet, Curtain) and Flower (Rose, Tulip, Lily); long-tail groups such as Diamond, Painting, Car, Food
Standard LVMs are insufficient to capture long-tail factors
- Latent Dirichlet Allocation (LDA): "Extremely common words tend to dominate all topics" (Wallach, 2009)
- Tencent Peacock LDA: "When learning ≥ 10^5 topics, around 20%–40% topics have duplicates in practice" (Wang, 2015)
- Restricted Boltzmann Machine, run on the 20-Newsgroups dataset: many duplicate topics (e.g., the three exemplar topics below are all about politics), and common words such as iraq, clinton, united, weapons occur repeatedly across topics.

Topic 1: president, clinton, iraq, united, spkr, house, people, lewinsky, government, white
Topic 2: iraq, united, un, weapons, iraqi, nuclear, india, minister, saddam, military
Topic 3: iraq, un, iraqi, lewinsky, saddam, clinton, baghdad, inspectors, weapons, white
Standard LVMs are insufficient to capture long-tail factors
(Figure: latent factors behind data vs. components learned by a standard LVM)
Long-tail factors are important
- The number of long-tail factors is large.
- Long-tail factors are more important than dominant factors in some applications.
- Example: Tencent applied topic models to advertising and showed that long-tail topics such as "lose weight" and "nursing" improve click-through rate by 40% (Jin, 2015).
Diversification
(Figure: latent factors behind data vs. components in the LVM, after diversification)
Motivation II: Tradeoff induced by the number of components k
Tradeoff between expressiveness and complexity:
- Small k: low expressiveness, low complexity
- Large k: high expressiveness, high complexity
- Can we achieve the best of both worlds? Small k: high expressiveness, low complexity
Reduce model complexity without sacrificing expressiveness
(Figure: data samples and LVM components, without vs. with diversification)
With diversification, a small number of components captures the principal directions of the data point cloud.
Mutual Angular Regularization of LVMs
Goal: encourage the components to diversely spread out, in order to (1) improve the coverage of long-tail latent factors and (2) reduce model complexity without compromising expressiveness.
Approach:
- Define a score based on mutual angles to measure the diversity of components.
- Use the score to regularize latent variable models and control the geometry of the latent space during learning.
Outline
- Mutual Angular Regularizer
- Algorithm
- Applications
- Theory
Mutual Angular Regularizer
Components are parametrized by vectors:
- In Latent Dirichlet Allocation, each topic has a multinomial vector.
- In Sparse Coding, each dictionary item has a real-valued vector.
To build the regularizer:
- Measure the dissimilarity between two vectors.
- Measure the diversity of a vector set.
Dissimilarity between two vectors
Desired property: invariant to scale, translation, rotation and orientation of the two vectors.
- Euclidean distance, L1 distance: the distance d changes when the vectors are rescaled.
- Negative cosine similarity: changes when the orientation of a vector is flipped (e.g., a = 0.6 becomes a = -0.6).
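A quick numeric illustration of these two failure modes (a minimal NumPy sketch; the vectors are arbitrary examples, not from the slide):

```python
import numpy as np

x, y = np.array([1.0, 0.0]), np.array([1.0, 1.0])
neg_cos = lambda u, v: -np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Euclidean distance changes when the vectors are rescaled
print(np.linalg.norm(x - y), np.linalg.norm(2 * x - 2 * y))   # 1.0 vs 2.0

# negative cosine similarity flips sign when one vector's orientation is flipped
print(neg_cos(x, y), neg_cos(-x, y))                          # ~-0.707 vs ~0.707
```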
Dissimilarity between two vectors
Definition: the non-obtuse angle between x and y,

\theta(x, y) = \arccos\frac{|x^{\top}y|}{\|x\|\,\|y\|}

This measure is invariant to scale, translation, rotation and orientation of the two vectors.
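A minimal sketch of this dissimilarity and its invariances (the helper name is ours):

```python
import numpy as np

def nonobtuse_angle(x, y):
    """Non-obtuse angle arccos(|x.y| / (||x|| ||y||)), lying in [0, pi/2]."""
    c = abs(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, 0.0, 1.0))   # clip guards against round-off

x, y = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(nonobtuse_angle(x, y))        # pi/4
print(nonobtuse_angle(3 * x, y))    # unchanged under rescaling
print(nonobtuse_angle(-x, y))       # unchanged under flipping orientation
```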
Measure the diversity of a vector set
Based on the pairwise dissimilarity measure between vectors, the diversity of a set of vectors A = \{a_i\}_{i=1}^{K} is defined as the mutual angular regularizer

\Omega(A) = \underbrace{\frac{1}{K(K-1)}\sum_{i=1}^{K}\sum_{\substack{j=1\\ j\ne i}}^{K}\theta_{ij}}_{\text{mean of angles}} \;-\; \underbrace{\frac{1}{K(K-1)}\sum_{i=1}^{K}\sum_{\substack{j=1\\ j\ne i}}^{K}\Big(\theta_{ij}-\frac{1}{K(K-1)}\sum_{p=1}^{K}\sum_{\substack{q=1\\ q\ne p}}^{K}\theta_{pq}\Big)^{2}}_{\text{variance of angles}},
\qquad \theta_{ij} = \arccos\frac{|a_i^{\top}a_j|}{\|a_i\|\,\|a_j\|}

- Mean: summarizes how these vectors differ from each other on the whole.
- Variance: encourages the vectors to evenly spread out.
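A small sketch of this score for a K x d matrix of component row vectors (the function name and the toy comparison are ours):

```python
import numpy as np

def mutual_angular_regularizer(A):
    """Omega(A): mean of the K(K-1) pairwise non-obtuse angles minus their variance."""
    U = A / np.linalg.norm(A, axis=1, keepdims=True)          # row-normalize
    cos = np.clip(np.abs(U @ U.T), 0.0, 1.0)
    K = A.shape[0]
    angles = np.arccos(cos)[~np.eye(K, dtype=bool)]           # off-diagonal angles
    return angles.mean() - angles.var()

# components bunched around a shared direction score lower than near-orthogonal ones
rng = np.random.default_rng(0)
bunched = rng.normal(size=(5, 50)) + 3.0 * rng.normal(size=50)
print(mutual_angular_regularizer(bunched))
print(mutual_angular_regularizer(rng.normal(size=(5, 50))))
```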
LVM with Mutual Angular Regularization (MAR-LVM)

\max_{A}\; L(D; A) + \lambda\,\Omega(A)

where L(D; A) is the LVM's objective (e.g., log-likelihood) on data D, \lambda \ge 0 is the tradeoff parameter, and \Omega(A) and \theta_{ij} are the mutual angular regularizer and pairwise non-obtuse angles defined on the previous slide.
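As a toy illustration of how the regularizer composes with a model's data term; `mutual_angular_regularizer` is from the sketch above, and the nearest-component data term here is a hypothetical stand-in, not the paper's actual likelihood L(D; A):

```python
import numpy as np
# assumes mutual_angular_regularizer(A) from the previous sketch is in scope

def mar_lvm_objective(A, data, lam=1.0):
    """Toy stand-in for L(D; A) + lambda * Omega(A): each data point is scored by
    its squared distance to the closest component row of A (illustration only)."""
    sq_dists = ((data[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)   # N x K
    data_term = -sq_dists.min(axis=1).sum()
    return data_term + lam * mutual_angular_regularizer(A)
```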
Algorithm
Challenge: the mutual angular regularizer is non-smooth and non-convex w.r.t. the parameter vectors A = \{a_i\}_{i=1}^{K}.
Strategy:
- Derive a smooth lower bound. The lower bound is easier to derive if the parameter vectors lie on a sphere.
- Decompose the parameter vectors into magnitudes and directions.
- Prove that optimizing the lower bound with a gradient ascent method increases the mutual angular regularizer in each iteration.
Optimization
Reparametrize each component as a_i = g_i \tilde{a}_i with \|\tilde{a}_i\| = 1 and g_i \ge 0, i.e., A = \mathrm{diag}(g)\tilde{A}: g holds the magnitudes and \tilde{A} the directions.

\max_{g,\tilde{A}}\; L(D; \mathrm{diag}(g)\tilde{A}) + \lambda\,\Omega(\tilde{A}) \quad \text{s.t. } \forall i,\ \|\tilde{a}_i\| = 1,\ g_i \ge 0

Alternating optimization:
- Fix \tilde{A}, optimize g:  \max_{g}\; L(D; \mathrm{diag}(g)\tilde{A})  s.t. \forall i,\ g_i \ge 0
- Fix g, optimize \tilde{A}:  \max_{\tilde{A}}\; L(D; \mathrm{diag}(g)\tilde{A}) + \lambda\,\Omega(\tilde{A})  s.t. \forall i,\ \|\tilde{a}_i\| = 1
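A rough, model-agnostic sketch of this alternating scheme; the finite-difference gradients and the toy objective at the end are ours, standing in for the model-specific updates:

```python
import numpy as np

def _num_grad(f, x, eps=1e-6):
    """Finite-difference gradient; a stand-in for analytic, model-specific gradients."""
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        xp = x.copy(); xp[idx] += eps
        g[idx] = (f(xp) - f(x)) / eps
    return g

def alternating_ascent(objective, A_init, n_iter=50, lr=0.01):
    """A = diag(g) A_dir with unit-norm rows of A_dir and nonnegative g.
    `objective(g, A_dir)` should return L(D; diag(g) A_dir) + lam * Omega(A_dir)."""
    g = np.linalg.norm(A_init, axis=1)                 # magnitudes
    A_dir = A_init / g[:, None]                        # unit directions
    for _ in range(n_iter):
        # (1) fix directions, gradient-ascend on magnitudes, keep them nonnegative
        g = np.maximum(g + lr * _num_grad(lambda gg: objective(gg, A_dir), g), 0.0)
        # (2) fix magnitudes, gradient-ascend on directions, project rows to unit sphere
        A_dir = A_dir + lr * _num_grad(lambda AA: objective(g, AA), A_dir)
        A_dir /= np.linalg.norm(A_dir, axis=1, keepdims=True)
    return g[:, None] * A_dir                          # recompose A

# toy usage: pull components toward random targets while spreading their directions
rng = np.random.default_rng(0)
T = rng.normal(size=(4, 8))
def toy_objective(g, A_dir):
    A = g[:, None] * A_dir
    ang = np.arccos(np.clip(np.abs(A_dir @ A_dir.T), 0.0, 1.0))
    return -np.sum((A - T) ** 2) + 0.5 * ang[~np.eye(4, dtype=bool)].mean()
A_learned = alternating_ascent(toy_objective, rng.normal(size=(4, 8)))
```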
Optimize \tilde{A}

\max_{\tilde{A}}\; L(D; \mathrm{diag}(g)\tilde{A}) + \lambda\,\Omega(\tilde{A}) \quad \text{s.t. } \forall i,\ \|\tilde{a}_i\| = 1

Lower bound:

\Omega(\tilde{A}) \;\ge\; \Gamma(\tilde{A}) = \arcsin\!\Big(\sqrt{\det(\tilde{A}\tilde{A}^{\top})}\Big) - \Big(\tfrac{\pi}{2} - \arcsin\!\Big(\sqrt{\det(\tilde{A}\tilde{A}^{\top})}\Big)\Big)^{2}

Intuition of the lower bound:
- \sqrt{\det(\tilde{A}\tilde{A}^{\top})} is the volume of the parallelepiped formed by the vectors in \tilde{A}. The larger it is, the more likely (though not surely) that the vectors in \tilde{A} have larger mutual angles.
- \Gamma(\tilde{A}) is an increasing function of \det(\tilde{A}\tilde{A}^{\top}), so a larger \Gamma(\tilde{A}) is likely to yield a larger \Omega(\tilde{A}).

Optimize the lower bound instead, which is smooth and much more amenable to optimization:

\max_{\tilde{A}}\; L(D; \mathrm{diag}(g)\tilde{A}) + \lambda\,\Gamma(\tilde{A}) \quad \text{s.t. } \forall i,\ \|\tilde{a}_i\| = 1
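A small sketch of the surrogate, assuming the reconstructed form of \Gamma above (the `surrogate` name and the toy comparison are ours): directions bunched around a shared axis give a small determinant and hence a small value, while orthonormal directions give the maximum.

```python
import numpy as np

def surrogate(A_dir):
    """Gamma(A_dir) as reconstructed above: an increasing function of det(A A^T)."""
    v = np.sqrt(np.clip(np.linalg.det(A_dir @ A_dir.T), 0.0, 1.0))
    return np.arcsin(v) - (np.pi / 2 - np.arcsin(v)) ** 2

rng = np.random.default_rng(0)
bunched = rng.normal(size=(3, 8)) + 2.0 * rng.normal(size=8)     # share one direction
bunched /= np.linalg.norm(bunched, axis=1, keepdims=True)
spread = np.linalg.qr(rng.normal(size=(8, 3)))[0].T              # 3 orthonormal rows
print(surrogate(bunched), surrogate(spread))   # spread directions give the larger value
```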
Close Alignment between the Regularizer and its Lower Bound
- If the lower bound is optimized with projected gradient ascent (PGA), the mutual angular regularizer can be increased in each iteration of the PGA procedure.
- Optimizing the lower bound with PGA can increase the mean of the angles in each iteration.
- Optimizing the lower bound with PGA can decrease the variance of the angles in each iteration.
(Recall that \Omega(A) is the mean of the pairwise angles minus their variance, as defined earlier.)
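A rough numerical illustration of this alignment, assuming the reconstructed surrogate from the previous slide; with a small enough step size, the mean pairwise angle should rise and the variance should shrink as PGA proceeds:

```python
import numpy as np

def surrogate(A_dir):
    v = np.sqrt(np.clip(np.linalg.det(A_dir @ A_dir.T), 0.0, 1.0))
    return np.arcsin(v) - (np.pi / 2 - np.arcsin(v)) ** 2

def pairwise_angles(A_dir):
    cos = np.clip(np.abs(A_dir @ A_dir.T), 0.0, 1.0)
    return np.arccos(cos)[~np.eye(len(A_dir), dtype=bool)]

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 10)) + rng.normal(size=10)          # correlated starting directions
A /= np.linalg.norm(A, axis=1, keepdims=True)

for step in range(5):                                       # projected gradient ascent
    grad, eps = np.zeros_like(A), 1e-5
    for idx in np.ndindex(*A.shape):                        # finite-difference gradient
        P = A.copy(); P[idx] += eps
        grad[idx] = (surrogate(P) - surrogate(A)) / eps
    A = A + 0.02 * grad
    A /= np.linalg.norm(A, axis=1, keepdims=True)           # project rows to unit sphere
    ang = pairwise_angles(A)
    print(f"step {step}: mean angle {ang.mean():.3f}, variance {ang.var():.4f}")
```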
Geometric Interpretation of the Close Alignment
- The gradient of the lower bound w.r.t. a_i is orthogonal to all the other vectors a_1, ..., a_K (excluding a_i).
- Moving a_i along its gradient direction therefore enlarges its angles with the other vectors.
(Figure: a_1, a_2, a_3 are parameter vectors; \hat{g}_1, the gradient w.r.t. a_1, is orthogonal to a_2 and a_3; with \hat{a}_1 = a_1 + \hat{g}_1, the angle between \hat{a}_1 and a_3 is greater than the angle between a_1 and a_3.)
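A quick numerical check of the orthogonality claim, again assuming the reconstructed surrogate (any smooth function of det(\tilde{A}\tilde{A}^{\top}) has this property): the finite-difference gradient row for a_1 should be (nearly) orthogonal to a_2 and a_3.

```python
import numpy as np

def surrogate(A_dir):
    v = np.sqrt(np.clip(np.linalg.det(A_dir @ A_dir.T), 0.0, 1.0))
    return np.arcsin(v) - (np.pi / 2 - np.arcsin(v)) ** 2

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 6))
A /= np.linalg.norm(A, axis=1, keepdims=True)

# finite-difference gradient of the surrogate w.r.t. the first component vector a_1
eps, g1 = 1e-6, np.zeros(A.shape[1])
for j in range(A.shape[1]):
    P = A.copy(); P[0, j] += eps
    g1[j] = (surrogate(P) - surrogate(A)) / eps

print(g1 @ A[1], g1 @ A[2])   # ~0: the gradient is orthogonal to the other vectors
print(g1 @ A[0])              # generally nonzero: it does move a_1
```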