A Discriminative Latent Variable Model for Online Clustering Rajhans - PowerPoint PPT Presentation

A Discriminative Latent Variable Model for Online Clustering
Rajhans Samdani, Kai-Wei Chang, Dan Roth
Department of Computer Science, University of Illinois at Urbana-Champaign

Motivating Example: Coreference
- Coreference resolution: cluster denotative noun phrases (mentions) in a document based on the underlying entities.
  [Bill Clinton], recently elected as the [President of the USA], has been invited by the [Russian President], [Vladimir Putin], to visit [Russia]. [President Clinton] said that [he] looks forward to strengthening ties between [USA] and [Russia].
- The task: learn a clustering function from training data.
  - Use expressive features between mention pairs (e.g. string similarity).
  - Learn a similarity metric between mentions, using a joint distribution over clusterings.
  - Cluster mentions based on the metric.
- Mentions arrive in a left-to-right order.

Online Clustering
- Online clustering: items arrive in a given order.
- Motivating property: item i is clustered with no access to future items on its right, only to the previous items on its left.
- This setting is general and arises naturally in many tasks, e.g. clustering posts in a forum or clustering network attacks.
- An online clustering algorithm is likely to be more efficient than a batch algorithm in such a setting.

Greedy Best-Left-Link Clustering
  [Bill Clinton], recently elected as the [President of the USA], has been invited by the [Russian President], [Vladimir Putin], to visit [Russia]. [President Clinton] said that [he] looks forward to strengthening ties between [USA] and [Russia].
- Best-left-link decoding (Bengtson and Roth '08).
- A naive way to learn the model: decouple (i) learning a similarity metric between pairs from (ii) hard clustering of mentions using this metric.

Our Contribution
- A novel discriminative latent variable model, the Latent Left-Linking Model (L³M), for jointly learning the metric and the clustering, which outperforms existing models.
- Training the pairwise similarity metric for clustering using latent variable structured prediction.
- Relaxing the single best link: consider a distribution over links.
- An efficient learning algorithm that decomposes over individual items in the training stream.

Outline
- Motivation, examples and problem description
- Latent Left-Linking Model (L³M)
  - Likelihood computation
  - Inference
  - Role of temperature
  - Alternate latent variable perspective
- Learning
  - Discriminative structured prediction learning view
  - Stochastic gradient based decomposed learning
- Empirical study

Latent Left-Linking Model (L³M): Modeling Axioms
- Each item can link only to some item on its left (creating a left-link).
- The event of i linking to j is independent of the event of i' linking to j'.
- Probability of i linking to j: $\Pr[j \leftarrow i] \propto \exp(w \cdot \phi(i, j) / \gamma)$
  - $\gamma \in [0, 1]$ is a temperature-like, user-tuned parameter.
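
The link probability above is a softmax over the candidate items on the left, scaled by the temperature. A minimal sketch, assuming the raw scores w · φ(i, j) have already been computed and that γ > 0 (the function name `link_probs` is hypothetical, not from the slides):

```python
import math

def link_probs(scores, gamma):
    """Return Pr[j <- i] for each candidate j, given raw scores w . phi(i, j)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive; gamma -> 0 is the argmax limit")
    scaled = [s / gamma for s in scores]
    m = max(scaled)  # subtract the max before exponentiating for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Three candidate links with increasing scores.
probs = link_probs([0.0, 1.0, 2.0], gamma=1.0)
sharp = link_probs([0.0, 1.0, 2.0], gamma=0.05)
```

With a small γ, nearly all probability mass moves to the highest-scoring link, which is the behaviour the temperature slide describes.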

L³M: Likelihood of Clustering
- C is a clustering of data stream d.
  - C(i, j) = 1 if i and j are co-clustered, else 0.
  - A dummy item represents the start of a cluster.
- Probability of C: multiply the probabilities of items connecting as per C.
  $\Pr[C; w] = \prod_i \Pr[i, C; w] = \prod_i \Big( \sum_{j<i} \Pr[j \leftarrow i] \, C(i, j) \Big) \propto \prod_i \Big( \sum_{j<i} \exp(w \cdot \phi(i, j) / \gamma) \, C(i, j) \Big)$
- The partition (normalization) function is efficient to compute:
  $Z_d(w) = \prod_i \Big( \sum_{j<i} \exp(w \cdot \phi(i, j) / \gamma) \Big)$
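
The likelihood above can be evaluated item by item in log space. The sketch below makes two assumptions that the slide does not spell out: the dummy item is a consistent link only for the first item of each cluster, and its score is a fixed value `start_score` (a hypothetical parameter). Items are 0-based and `scores[i][j]` holds w · φ(i, j) for j < i.

```python
import math

def _logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_likelihood(scores, labels, gamma=1.0, start_score=0.0):
    """log Pr[C; w] for a stream whose item i has gold cluster id labels[i]."""
    total = 0.0
    for i in range(len(labels)):
        # All candidate links of item i: the dummy item, then every earlier item.
        cand = [start_score / gamma] + [scores[i][j] / gamma for j in range(i)]
        # Links consistent with C: earlier items in the same cluster,
        # or the dummy item when i is the first item of its cluster.
        same = [j for j in range(i) if labels[j] == labels[i]]
        good = [scores[i][j] / gamma for j in same] or [start_score / gamma]
        # Per-item term: log of (consistent mass / total mass).
        total += _logsumexp(good) - _logsumexp(cand)
    return total
```

Under these assumptions the probabilities of all clusterings of a stream sum to one, because for each item the consistent link sets of the possible choices (join an existing cluster or start a new one) partition its candidate links.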

L³M: Greedy Inference/Clustering
- Items arrive sequentially. The probability of i connecting to a previously formed cluster c is the sum of the probabilities of i connecting to the items in c:
  $\Pr[c \odot i] = \sum_{j \in c} \Pr[j \leftarrow i; w] \propto \sum_{j \in c} \exp(w \cdot \phi(i, j) / \gamma)$
- Greedy clustering:
  - Compute $c^* = \arg\max_c \Pr[c \odot i]$.
  - Connect i to $c^*$ if $\Pr[c^* \odot i] > t$ (a threshold); otherwise i starts a new cluster.
  - May not yield the most likely clustering.
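
The greedy procedure above can be sketched as a single left-to-right pass. This is an illustration under stated assumptions, not the authors' implementation: `score(i, j)` stands for w · φ(i, j), the normalizer includes a dummy item with score `start_score` (hypothetical), and plain `math.exp` is used, so very large scores would need a log-space version.

```python
import math

def greedy_cluster(n, score, gamma=1.0, threshold=0.5, start_score=0.0):
    """Cluster items 0..n-1 in arrival order; returns a list of clusters."""
    clusters = []
    for i in range(n):
        # Un-normalized weight of linking i to each earlier item j.
        weights = [math.exp(score(i, j) / gamma) for j in range(i)]
        z = math.exp(start_score / gamma) + sum(weights)
        best, best_p = None, 0.0
        for c in clusters:
            # Probability of joining cluster c = sum over its members.
            p = sum(weights[j] for j in c) / z
            if p > best_p:
                best, best_p = c, p
        if best is not None and best_p > threshold:
            best.append(i)
        else:
            clusters.append([i])  # item i starts a new cluster
    return clusters

# Toy scores: items 0 and 1 are similar, item 2 is unlike both.
def toy_score(i, j):
    return 3.0 if {i, j} == {0, 1} else -3.0

result = greedy_cluster(3, toy_score)
```

In the toy stream, item 1 joins item 0 with probability about 0.95, while item 2 reaches only about 0.09 for that cluster and therefore starts its own.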

Inference: Role of Temperature γ
- Probability of i connecting to a previous item j: $\Pr[j \leftarrow i] \propto \exp(w \cdot \phi(i, j) / \gamma)$
- γ tunes the importance of high-scoring links.
  - As γ decreases from 1 to 0, high-scoring links become more important.
  - For γ = 0, $\Pr[j \leftarrow i]$ is a Kronecker delta function centered on the argmax link (assuming no ties).
- Probability of i connecting to cluster c: $\Pr[c \odot i] \propto \sum_{j \in c} \exp(w \cdot \phi(i, j) / \gamma)$
- For γ = 0, clustering considers only the "best-left-link" and greedy clustering is exact.
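
A small numeric illustration of how γ changes the preferred cluster. The scores and cluster names are made up for the example: cluster A offers one strong link (score 2.0) and cluster B offers two moderate links (score 1.5 each).

```python
import math

def preferred_cluster(clusters, gamma):
    """Return the name of the cluster with the largest summed link weight."""
    def weight(link_scores):
        return sum(math.exp(s / gamma) for s in link_scores)
    return max(clusters, key=lambda name: weight(clusters[name]))

clusters = {"A": [2.0], "B": [1.5, 1.5]}
at_gamma_one = preferred_cluster(clusters, gamma=1.0)
at_gamma_small = preferred_cluster(clusters, gamma=0.1)
```

At γ = 1 the two moderate links of B together outweigh the single strong link of A, so B is preferred. At γ = 0.1 the strongest link dominates and A is preferred, which matches the best-left-link behaviour in the limit γ → 0.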

Latent Variables: Left-Linking Forests
- Left-linking forest f: specifies the parent of each item, chosen among the items on its left (arrow directions reversed).
- The probability of forest f is based on the sum of edge weights in f:
  $\Pr[f; w] \propto \exp\Big( \sum_{(i, j) \in f} w \cdot \phi(i, j) / \gamma \Big)$
- L³M is the same as expressing the probability of C as the sum of the probabilities of all consistent (latent) left-linking forests:
  $\Pr[C; w] = \sum_{f \in F(C)} \Pr[f; w]$
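
The equivalence between the forest view and the per-item likelihood follows from distributivity. A forest consistent with C picks, independently for each item i, exactly one parent j < i with C(i, j) = 1, so the sum over forests factorizes into a product of per-item sums. A short worked derivation, using only the symbols from the slides:

```latex
\sum_{f \in F(C)} \exp\Big( \sum_{(i,j) \in f} w \cdot \phi(i,j) / \gamma \Big)
  = \sum_{f \in F(C)} \prod_{(i,j) \in f} \exp\big( w \cdot \phi(i,j) / \gamma \big)
  = \prod_i \Big( \sum_{j < i} \exp\big( w \cdot \phi(i,j) / \gamma \big) \, C(i,j) \Big)
```

Dividing both sides by $Z_d(w)$ recovers the expression for $\Pr[C; w]$ given on the likelihood slide.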

Outline
- Motivation, examples and problem description
- Latent Left-Linking Model
  - Inference
  - Role of temperature
  - Likelihood computation
  - Alternate latent variable perspective
- Learning
  - Discriminative structured prediction learning view
  - Stochastic gradient based decomposed learning
- Empirical study

L³M: Likelihood-based Learning
- Learn w from the annotated clustering $C_d$ for each data stream $d \in D$.
- L³M: learn w by minimizing the regularized negative log-likelihood:
  $LL(w) = \lambda \|w\|^2 + \sum_d \log Z_d(w) - \sum_d \sum_i \log \Big( \sum_{j<i} \exp(w \cdot \phi(i, j) / \gamma) \, C_d(i, j) \Big)$
  The first term is the regularization, the second is the log partition function, and the third is the log un-normalized probability.
- Relation to other latent variable models:
  - Learn by marginalizing over the underlying latent left-linking forests.
  - γ = 1: Hidden Variable CRFs (Quattoni et al., 07)
  - γ = 0: Latent Structural SVMs (Yu and Joachims, 09)
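
Because $\log Z_d(w)$ is itself a sum over items, the data-dependent part of the objective splits into per-item terms. The worked gradient below is a sketch for γ > 0; the names $\ell_i$ and $q_i$ are introduced here for illustration and do not appear in the slides.

```latex
\ell_i(w) = \log \sum_{j<i} \exp\big( w \cdot \phi(i,j) / \gamma \big)
          - \log \sum_{j<i} \exp\big( w \cdot \phi(i,j) / \gamma \big) \, C_d(i,j)

\nabla_w \ell_i(w) = \frac{1}{\gamma} \Big(
      \sum_{j<i} \Pr[j \leftarrow i; w] \, \phi(i,j)
    - \sum_{j<i} q_i(j) \, \phi(i,j) \Big),
\qquad
q_i(j) = \frac{\exp\big( w \cdot \phi(i,j) / \gamma \big) \, C_d(i,j)}
              {\sum_{k<i} \exp\big( w \cdot \phi(i,k) / \gamma \big) \, C_d(i,k)}
```

The gradient is the difference between the expected features under all left-links and the expected features under the left-links consistent with the annotated clustering.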

Training Algorithms: Discussion
- The objective function LL(w) is non-convex.
- Can use the Concave-Convex Procedure (CCCP) (Yuille and Rangarajan, 03; Yu and Joachims, 09).
  - Pros: guaranteed to converge to a local minimum (Sriperumbudur et al., 09).
  - Cons: requires the entire data stream to compute a single gradient update.
- Online updates based on stochastic (sub-)gradient descent (SGD).
  - The sub-gradient decomposes on a per-item basis.
  - Cons: no theoretical guarantees for SGD with non-convex functions.
  - Pros: can learn in an online fashion; converges much faster than CCCP.
  - Great empirical performance.
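
A per-item SGD update could look like the sketch below. It assumes γ > 0, a fixed learning rate `lr`, and that the full regularizer gradient 2λw is applied at every item; all three are simplifications chosen for the example, and the function names are hypothetical. `feats[k]` is the feature vector φ(i, j) of the k-th candidate link of the current item and `consistent[k]` says whether that link agrees with the annotated clustering.

```python
import math

def _softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def sgd_step(w, feats, consistent, gamma=1.0, lr=0.1, lam=0.0):
    """One stochastic gradient step on the loss of a single item."""
    scores = [sum(wk * fk for wk, fk in zip(w, f)) / gamma for f in feats]
    p_all = _softmax(scores)  # distribution over all candidate links
    cons_idx = [k for k, ok in enumerate(consistent) if ok]
    p_cons = _softmax([scores[k] for k in cons_idx])  # over consistent links
    grad = [2.0 * lam * wk for wk in w]
    for p, f in zip(p_all, feats):
        for d, fd in enumerate(f):
            grad[d] += p * fd / gamma  # expected features, all links
    for p, k in zip(p_cons, cons_idx):
        for d, fd in enumerate(feats[k]):
            grad[d] -= p * fd / gamma  # expected features, consistent links
    return [wk - lr * g for wk, g in zip(w, grad)]

# One consistent link with features [1, 0], one inconsistent with [0, 1].
w_new = sgd_step([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [True, False])
```

Starting from zero weights, the step raises the weight of the feature that fires on the consistent link and lowers the weight of the feature that fires on the inconsistent one.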

Experiment: Coreference Resolution
- Cluster denotative noun phrases, called mentions.
- Mentions follow a left-to-right order.
- Features: mention distance, substring match, gender match, etc.
- Experiments on ACE 2004 and OntoNotes-5.0.
- Report the average of three popular coreference clustering evaluation metrics: MUC, B³, and CEAF.

Coreference: ACE 2004
- Considering multiple links helps.
- Jointly learning the metric and the clustering helps.
[Bar chart. Y-axis: average of MUC, B³, and CEAF. Systems compared: Corr-Clustering (Finley and Joachims '05), Sum-Link (Haider et al. '07), Binary (Bengtson and Roth '08), L3M-0, L3M-gamma.]

Coreference: OntoNotes-5.0
[Bar chart. Y-axis: average of MUC, B³, and CEAF. Systems compared: Corr-Clustering (Finley and Joachims '05), Sum-Link (Haider et al. '07), Binary (Bengtson and Roth '08), L3M-0, L3M-gamma.]
- By incorporating domain knowledge constraints, L³M achieves state-of-the-art performance on OntoNotes-5.0 (Chang et al., 13).

Experiments: Document Clustering
- Cluster the posts in a forum based on authors or topics.
- Dataset: discussions from www.militaryforum.com
- The posts in the forum arrive in time order, e.g. "Veteran insurance", "Re: Veteran insurance", "North Korean Missiles", "Re: Re: Veteran insurance".
- Features: common words, tf-idf similarity, time between arrivals.
- Evaluate with Variation of Information (Meila, 07).
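
For reference, Variation of Information between two clusterings A and B of the same items is H(A) + H(B) - 2 I(A; B). A minimal sketch, assuming each clustering is given as a list of cluster labels and using natural logarithms (the slides do not state the log base):

```python
import math
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """VI between two clusterings given as per-item cluster labels."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # -p(a,b) * [log p(a,b)/p(a) + log p(a,b)/p(b)]
        vi -= p_ab * (math.log(n_ab / count_a[a]) + math.log(n_ab / count_b[b]))
    return vi

same = variation_of_information([0, 0, 1, 1], [5, 5, 7, 7])
crossed = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Identical clusterings (up to renaming of labels) give 0, and two independent balanced two-way splits of four items give 2 log 2.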

Author Based Clustering
[Bar chart. Y-axis: Variation of Information x 100. Systems compared: Corr-Clustering (Finley and Joachims '05), Sum-Link (Haider et al. '07), Binary (Bengtson and Roth '08), L3M-0, L3M-gamma.]

Topic Based Clustering
[Bar chart. Y-axis: Variation of Information x 100. Systems compared: Corr-Clustering (Finley and Joachims '05), Sum-Link (Haider et al. '07), Binary (Bengtson and Roth '08), L3M-0, L3M-gamma.]

Conclusions
- Latent Left-Linking Model
  - Principled probabilistic modeling for online clustering tasks
  - Marginalizes over the underlying latent link structures
  - Tuning γ helps: considering multiple links helps
  - Efficient greedy inference
- SGD-based learning
  - Decomposes learning into smaller gradient updates over individual items
  - Rapid convergence and high accuracy
- Solid empirical performance on problems with a natural streaming order
