Latent Tree Analysis
Nevin L. Zhang
The Hong Kong University of Science and Technology
www.cse.ust.hk/~lzhang
MLA 2017
What is Latent Tree Analysis (LTA)?
Repeated event co-occurrences might:
- Be due to common hidden causes or genuine direct correlations, OR
- Be coincidental, especially in big data.
Challenge: identify the co-occurrences that are due to hidden causes or correlations.
Latent tree analysis solves a related and simpler problem: detect co-occurrences that can be statistically explained by a tree of latent variables.
It can be used to solve interesting tasks:
- Multidimensional clustering
- Hierarchical topic detection
- Latent structure discovery
- ...
MLA 2017 Nevin L. Zhang/HKUST
Basic Latent Tree Models (LTM)
- Tree-structured Bayesian network
- All variables are discrete
- Variables at leaf nodes are observed
- Variables at internal nodes are latent
- Parameters: P(Y1), P(Y2|Y1), P(X1|Y2), P(X2|Y2), ...
- Also known as hierarchical latent class (HLC) models (Zhang, JMLR 2004)
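As a concrete illustration, the joint distribution of a basic LTM factorizes along the tree, and the marginal over the observed variables is obtained by summing out the latents. The structure Y1 -> Y2 -> {X1, X2} and all numbers below are made up for illustration, not taken from the talk:

```python
# A tiny basic LTM: Y1 -> Y2 -> {X1, X2}, all binary.
# Y1, Y2 are latent; X1, X2 are observed.  Illustrative CPTs.
from itertools import product

p_y1 = [0.6, 0.4]                                # P(Y1)
p_y2_given_y1 = [[0.9, 0.1], [0.2, 0.8]]         # P(Y2|Y1), rows: Y1 state
p_x1_given_y2 = [[0.8, 0.2], [0.3, 0.7]]         # P(X1|Y2), rows: Y2 state
p_x2_given_y2 = [[0.7, 0.3], [0.1, 0.9]]         # P(X2|Y2)

def p_obs(x1, x2):
    """P(X1=x1, X2=x2): sum the latent variables Y1, Y2 out of the joint."""
    return sum(
        p_y1[y1] * p_y2_given_y1[y1][y2]
        * p_x1_given_y2[y2][x1] * p_x2_given_y2[y2][x2]
        for y1, y2 in product(range(2), range(2))
    )

# The marginal over the observed variables is a proper distribution.
total = sum(p_obs(x1, x2) for x1, x2 in product(range(2), range(2)))
```

This is how an LTM "statistically explains" co-occurrences among the observed variables: their dependence is mediated entirely by the latent tree.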
Pouch Latent Tree Models (PLTM)
An extension of the basic LTM (Poon et al., ICML 2010):
- Rooted tree
- Internal nodes represent discrete latent variables
- Each leaf node consists of one or more continuous observed variables, called a pouch
More General Latent Variable Tree Models
- Internal nodes can be observed (Choi et al., JMLR 2011)
- Internal nodes can be continuous
- Forest structures
Primary focus of this talk: the basic LTM
Identifiability Issues
- A root change leads to an equivalent model, so edge orientations are unidentifiable (Zhang, JMLR 2004).
- Hence, we are really talking about undirected models: an undirected LTM represents an equivalence class of directed LTMs.
- In implementation, models are represented as directed models rather than MRFs, so that the partition function is always 1.
Identifiability Issues
|X|: cardinality of variable X, i.e., the number of states.
Theorem (Zhang, JMLR 2004): The set of all regular models for a given set of observed variables is finite.
Latent variables cannot have too many states.
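For intuition on why latent variables cannot have too many states, a minimal sketch of the regularity cardinality bound from Zhang (JMLR 2004): for a latent variable Z with neighbors N1, ..., Nk, a regular model requires |Z| <= (|N1| * ... * |Nk|) / max_i |Ni|. The numbers below are illustrative:

```python
# Regularity bound on the number of states of a latent variable Z,
# given the cardinalities of its neighbors in the tree.
from math import prod

def max_regular_cardinality(neighbor_cards):
    """Largest |Z| allowed by regularity: prod of neighbor cards / max card."""
    return prod(neighbor_cards) // max(neighbor_cards)

# A latent variable with three binary neighbors can have at most
# 2 * 2 * 2 / 2 = 4 states in a regular model.
bound = max_regular_cardinality([2, 2, 2])
```

Since every latent variable's cardinality is bounded this way, only finitely many regular model structures exist for a fixed set of observed variables.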
Latent Tree Analysis (LTA)
Learning latent tree models: determine
- The number of latent variables
- The number of possible states for each latent variable
- The connections among variables
- The probability distributions
Difficult, but doable.
Three Settings for Algorithm Development
Setting 1: CLRG (Choi et al., 2011; Huang et al., 2015)
- Assume the data are generated from an unknown LTM.
- Investigate properties of LTMs and use them for learning, e.g., recover the model structure from the tree additivity of information distances.
- Theoretical guarantees to recover the generative model under conditions.
Setting 2: EAST, BI (Chen et al., 2012; Liu et al., 2013)
- Do not assume the data are generated from an LTM.
- Fit an LTM to the data using the BIC score, via search or heuristics.
- It does not make sense to talk about theoretical guarantees.
- Obtains better models than Setting 1, because the generative assumption is usually untrue.
Setting 3: HLTA (Liu et al., 2014; Chen et al., 2016)
- Consider usefulness in addition to model fit.
- Hierarchy of latent variables.
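As a concrete illustration of the score used in Setting 2 (all numbers below are hypothetical), the BIC score trades off log-likelihood against model complexity, so a model that fits slightly worse but uses far fewer parameters can still win:

```python
# BIC score for model selection: BIC(m|D) = log P(D|m, theta*) - d/2 * log N,
# where d is the number of free parameters and N the sample size.
from math import log

def bic_score(loglikelihood, num_free_params, num_samples):
    """Higher is better: likelihood term minus a complexity penalty."""
    return loglikelihood - num_free_params / 2.0 * log(num_samples)

# Hypothetical comparison: a small model with lower likelihood beats a
# large model with higher likelihood once the penalty is accounted for.
score_small = bic_score(-1000.0, 10, 500)
score_big = bic_score(-990.0, 50, 500)
better = score_small > score_big
```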
Current Capabilities
It takes a few hours on a single machine to analyze data sets with
- Thousands of variables, and
- Hundreds of thousands of instances.
Significant additional speedup can be achieved via simplification and parallel computing.
What can LTA be used for?
- Multidimensional clustering
- Hierarchical topic detection
- Latent structure discovery
- Other applications
How to Cluster?
Cluster analysis: grouping of objects into clusters such that
- Objects in the same cluster are similar
- Objects from different clusters are dissimilar
How to Cluster These? (three slides of example images: data that admit more than one sensible grouping)
Multidimensional Clustering
Complex data usually have multiple facets and can be meaningfully partitioned in multiple ways.
It is therefore more reasonable to look for multiple ways to partition the data.
How do we get multiple partitions?
How to Get One Partition?
Finite mixture models: one latent variable Z
- Gaussian mixture models: continuous data
- Latent class models (mixtures of multinomial distributions): categorical data
Key point: use a model with one latent variable to obtain one partition.
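To make the key point concrete: a latent class model partitions the data by assigning each observation to the class with the highest posterior probability. The model below (binary Z, two binary observed variables, made-up parameters) is a minimal sketch:

```python
# A tiny latent class model: one latent variable Z, observed X1, X2.
# Illustrative parameters.
p_z = [0.5, 0.5]                       # P(Z)
p_x_given_z = [
    [[0.9, 0.1], [0.1, 0.9]],          # P(X1|Z), rows are Z states
    [[0.8, 0.2], [0.2, 0.8]],          # P(X2|Z)
]

def cluster(x):
    """Assign observation x = (x1, x2) to the class maximizing P(Z|x)."""
    scores = [
        p_z[z] * p_x_given_z[0][z][x[0]] * p_x_given_z[1][z][x[1]]
        for z in range(2)
    ]
    return max(range(2), key=lambda z: scores[z])

# (0, 0) and (1, 1) land in different classes: one partition of the data.
```

One latent variable yields exactly one such partition; getting several partitions at once is what motivates models with multiple latent variables.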
How to Get Multiple Partitions?
Use models with multiple latent variables to obtain multiple partitions: latent tree models
- Probabilistic graphical models with multiple latent variables
- A generalization of latent class models
Multidimensional Clustering of Social Survey Data
// Survey on corruption in Hong Kong and the performance of the anti-corruption agency ICAC (Chen et al., AIJ 2012)
// 31 questions, 1200 samples
C_City: s0 s1 s2 s3            // very common, quite common, uncommon, very uncommon
C_Gov: s0 s1 s2 s3
C_Bus: s0 s1 s2 s3
Tolerance_C_Gov: s0 s1 s2 s3   // totally intolerable, intolerable, tolerable, totally tolerable
Tolerance_C_Bus: s0 s1 s2 s3
WillingReport_C: s0 s1 s2      // yes, no, depends
LeaveContactInfo: s0 s1        // yes, no
I_EncourageReport: s0 s1 s2 s3 s4  // very sufficient, sufficient, average, ...
I_Effectiveness: s0 s1 s2 s3 s4    // very effective, effective, average, ineffective, very ineffective
I_Deterrence: s0 s1 s2 s3 s4       // very sufficient, sufficient, average, ...
...
-1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 1 1 -1 -1 2 0 2 2 1 3 1 1 4 1 0 1.0
-1 -1 -1 0 0 -1 -1 1 1 -1 -1 0 0 -1 1 -1 1 3 2 2 0 0 0 2 1 2 0 0 2 1 0 1.0
-1 -1 -1 0 0 -1 -1 2 1 2 0 0 0 2 -1 -1 1 1 1 0 2 0 1 2 -1 2 0 1 2 1 0 1.0
... (-1 denotes a missing value)
Latent Structure Discovery
Y2: demographic background; Y3: tolerance toward corruption; Y4: ICAC performance; Y5: change in level of corruption; Y6: level of corruption; Y7: ICAC accountability
Multidimensional Clustering
Y2=s0: low-income youngsters; Y2=s1: women with no/little income; Y2=s2: people with good education and good income; Y2=s3: people with poor education and average income.
Multidimensional Clustering
Values of the observed variables: s0 - totally intolerable, ..., s3 - totally tolerable
Interpretation of the values of the latent variable:
- Y3=s0: people who find corruption totally intolerable (57%)
- Y3=s1: people who find corruption intolerable (27%)
- Y3=s2: people who find corruption tolerable (15%)
Interesting finding: people who are tough on corruption are equally tough toward C_Gov and C_Bus, whereas people who are lenient about corruption are more lenient toward C_Bus than C_Gov.
Multidimensional Clustering
Who among the 4 groups are the toughest toward corruption?
Interesting finding:
- Y2=s2 (good education and good income): the toughest on corruption.
- Y2=s3 (poor education and average income): the most lenient on corruption.
- The other two classes are in between.
Multidimensional Clustering: Summary
Latent tree analysis has found several interesting ways to partition the ICAC data, and has revealed interesting relationships between the different partitions. (Chen et al., AIJ 2012)
What Can LTA be used for?
- Multidimensional clustering
- Hierarchical topic detection
- Latent structure discovery
- Other applications
Hierarchical Latent Tree Analysis (HLTA)
- Each word is a binary variable: 0 - absent from the document, 1 - present in the document
- Each document is a binary vector over the vocabulary
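The representation above can be sketched in a few lines; the vocabulary and document below are made up for illustration:

```python
# Binary bag-of-words: a document becomes a 0/1 vector over the vocabulary.
# Word counts are ignored; only presence/absence matters.
vocab = ["video", "card", "driver", "game"]

def to_binary_vector(doc, vocab):
    """1 if the word occurs anywhere in the document, 0 otherwise."""
    words = set(doc.lower().split())
    return [1 if w in words else 0 for w in vocab]

vec = to_binary_vector("The new video card needs a driver update", vocab)
```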
Topics
- Each latent variable partitions the documents into 2 clusters.
- Document clusters are interpreted as topics, e.g., Z14=0: background topic; Z14=1: "video-card-driver" topic.
- Each latent variable gives one topic.
Topic Hierarchy
- Latent variables at high levels capture "long-range" word co-occurrences and give more general topics.
- Latent variables at low levels capture "short-range" word co-occurrences and give more specific topics.
The New York Times Dataset
- From the UCI repository
- 300,000 articles from 1987-2007
- 10,000 words selected using TF-IDF
- HLTA took 7 hours on a desktop machine
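Vocabulary selection by TF-IDF can be sketched as below. This is a simplified toy version with made-up documents; the exact weighting and preprocessing used for the NYT corpus may differ:

```python
# Select vocabulary words by average TF-IDF across a (toy) corpus.
# Words that appear in every document get IDF 0 and are filtered out.
from collections import Counter
from math import log

docs = [
    "stocks fell as the markets reacted",
    "the team won the game",
    "markets rallied after the report",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
# Document frequency: in how many documents each word appears.
df = Counter(w for toks in tokenized for w in set(toks))

def avg_tfidf(word):
    """Average term frequency times inverse document frequency."""
    idf = log(n_docs / df[word])
    tf_total = sum(toks.count(word) for toks in tokenized)
    return tf_total / n_docs * idf

vocab = sorted(df, key=avg_tfidf, reverse=True)[:4]  # keep the top-4 words
```

The stop word "the" occurs in every toy document, so its IDF is zero and it never makes the cut, which is the point of using TF-IDF rather than raw counts.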