A Discriminative Latent Variable Model for Online Clustering
Rajhans Samdani, Kai-Wei Chang, Dan Roth
Department of Computer Science, University of Illinois at Urbana-Champaign
Motivating Example: Coreference
- Coreference resolution: cluster denotative noun phrases (mentions) in a document based on the underlying entities.
  [Bill Clinton], recently elected as the [President of the USA], has been invited by the [Russian President], [Vladimir Putin], to visit [Russia]. [President Clinton] said that [he] looks forward to strengthening ties between [USA] and [Russia].
- The task: learn a clustering function from training data.
  - Use expressive features between mention pairs (e.g., string similarity).
  - Learn a similarity metric between mentions.
    - Learn this metric using a joint distribution over clusterings.
  - Cluster mentions based on the metric.
- Mentions arrive in left-to-right order.
Online Clustering
- Online clustering: items arrive in a given order.
- Motivating property: cluster item i with no access to future items on its right, only to the previous items on its left.
- This setting is general and arises naturally in many tasks.
  - E.g., clustering posts in a forum, clustering network attacks.
- An online clustering algorithm is likely to be more efficient than a batch algorithm in such a setting.
Greedy Best-Left-Link Clustering
  [Bill Clinton], recently elected as the [President of the USA], has been invited by the [Russian President], [Vladimir Putin], to visit [Russia]. [President Clinton] said that [he] looks forward to strengthening ties between [USA] and [Russia].
- Best-Left-Linking decoding (Bengtson and Roth '08).
- A naïve way to learn the model: decouple (i) learning a similarity metric between pairs from (ii) hard clustering of mentions using this metric.
Our Contribution
- A novel discriminative latent variable model, the Latent Left-Linking Model (L³M), for jointly learning the metric and the clustering, which outperforms existing models.
- Training the pairwise similarity metric for clustering using latent variable structured prediction.
- Relaxing the single best link: consider a distribution over links.
- An efficient learning algorithm that decomposes over individual items in the training stream.
Outline
- Motivation, examples and problem description
- Latent Left-Linking Model (L³M)
  - Likelihood computation
  - Inference
  - Role of temperature
  - Alternate latent variable perspective
- Learning
  - Discriminative structured prediction learning view
  - Stochastic gradient based decomposed learning
- Empirical study
Latent Left-Linking Model (L³M): Modeling Axioms
- Each item can link only to some item on its left (creating a left-link).
- The event of i linking to j is independent of the event of another item i' linking to j'.
- Probability of i linking to j: $\Pr[j \leftarrow i] \propto \exp(w \cdot \phi(i, j) / \gamma)$
  - $\gamma \in [0, 1]$ is a temperature-like, user-tuned parameter.
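The link probability on this slide is a softmax over the candidate items on the left. A minimal sketch, assuming the pairwise scores w · φ(i, j) have already been computed and are passed in as a plain list (the function name `link_probabilities` is illustrative, not from the slides), and assuming γ > 0:

```python
import math

def link_probabilities(scores, gamma=1.0):
    """Return Pr[j <- i] for each candidate j, where scores[j] = w . phi(i, j)."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("this sketch needs 0 < gamma <= 1")
    scaled = [s / gamma for s in scores]
    shift = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [wt / total for wt in weights]

# Hypothetical scores of item i against three candidates on its left.
probs = link_probabilities([0.0, 2.0, 1.0], gamma=1.0)
print(probs)
```

The limit γ = 0 is excluded here because it requires dividing by zero; it corresponds to putting all mass on the highest-scoring link.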
L³M: Likelihood of Clustering
- C is a clustering of data stream d.
  - C(i, j) = 1 if i and j are co-clustered, else 0.
  - A dummy item represents the start of a cluster.
- Probability of C: multiply the probabilities of items connecting as per C.
  $$\Pr[C; w] = \prod_i \Pr[i, C; w] = \prod_i \Big( \sum_{j < i} \Pr[j \leftarrow i] \, C(i, j) \Big) \propto \prod_i \Big( \sum_{j < i} \exp(w \cdot \phi(i, j) / \gamma) \, C(i, j) \Big)$$
- The partition (normalization) function is efficient to compute:
  $$Z_d(w) = \prod_i \Big( \sum_{j < i} \exp(w \cdot \phi(i, j) / \gamma) \Big)$$
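The product form of the likelihood can be sketched directly. This is an illustration under stated assumptions, not the authors' implementation: `scores[i][j]` holds w · φ(i, j) for 0 ≤ j < i, index 0 is the dummy item, and an item is taken to link to the dummy only when it is the first item of its cluster.

```python
import math

def log_likelihood(scores, cluster_ids, gamma=1.0):
    """log Pr[C; w] as a sum of per-item log-probabilities.

    scores[i][j] = w . phi(i, j) for 0 <= j < i (index 0 is the dummy item).
    cluster_ids[i] is the cluster label of item i (cluster_ids[0] is unused).
    """
    total = 0.0
    for i in range(1, len(scores)):
        weights = [math.exp(scores[i][j] / gamma) for j in range(i)]
        earlier = [j for j in range(1, i) if cluster_ids[j] == cluster_ids[i]]
        # Assumption: the dummy link is used only by the first item of a cluster.
        allowed = earlier if earlier else [0]
        total += math.log(sum(weights[j] for j in allowed)) - math.log(sum(weights))
    return total

# Hypothetical scores for a stream of three items.
scores = [[], [0.0], [0.0, 1.0], [0.0, -1.0, 2.0]]
ll = log_likelihood(scores, [None, "a", "a", "b"])
print(ll)
```

Because each item contributes one normalized factor, the probabilities of all possible clusterings of the stream sum to one, which is why the partition function is cheap to compute.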
L³M: Greedy Inference/Clustering
- Items arrive sequentially.
- The probability of i connecting to a previously formed cluster c is the sum of the probabilities of i connecting to the items in c:
  $$\Pr[c \odot i] = \sum_{j \in c} \Pr[j \leftarrow i; w] \propto \sum_{j \in c} \exp(w \cdot \phi(i, j) / \gamma)$$
- Greedy clustering:
  - Compute $c^* = \arg\max_c \Pr[c \odot i]$.
  - Connect i to $c^*$ if $\Pr[c^* \odot i] > t$ (a threshold); otherwise i starts a new cluster.
  - May not yield the most likely clustering.
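The greedy procedure above might look as follows in code. This is a sketch under assumptions: scores are precomputed as `scores[i][j]` = w · φ(i, j) with index 0 as the dummy item, the dummy's weight is included in the normalizer, and the default threshold of 0.5 is an arbitrary illustrative value.

```python
import math

def greedy_cluster(scores, gamma=1.0, threshold=0.5):
    """Cluster items 1..n in arrival order; returns a list of clusters (lists of indices)."""
    clusters = []
    for i in range(1, len(scores)):
        weights = [math.exp(scores[i][j] / gamma) for j in range(i)]
        z = sum(weights)  # normalizer over all candidates, dummy included
        best, best_prob = None, 0.0
        for c in clusters:
            prob = sum(weights[j] for j in c) / z  # Pr[c (.) i]
            if prob > best_prob:
                best, best_prob = c, prob
        if best is not None and best_prob > threshold:
            best.append(i)
        else:
            clusters.append([i])  # item i starts a new cluster
    return clusters

# Hypothetical scores for a stream of four items.
scores = [[], [0.0], [0.0, 2.0], [0.0, -1.0, -1.0], [0.0, 0.5, 0.5, 3.0]]
print(greedy_cluster(scores))
```

Each item is processed once using only the items to its left, so the loop matches the online setting: no decision is revisited when later items arrive.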
Inference: Role of Temperature $\gamma$
- Probability of i connecting to a previous item j: $\Pr[j \leftarrow i] \propto \exp(w \cdot \phi(i, j) / \gamma)$
- $\gamma$ tunes the importance of high-scoring links.
  - As $\gamma$ decreases from 1 to 0, high-scoring links become more important.
  - For $\gamma = 0$, $\Pr[j \leftarrow i]$ is a Kronecker delta function centered on the argmax link (assuming no ties).
- Probability of i connecting to a cluster c: $\Pr[c \odot i] \propto \sum_{j \in c} \exp(w \cdot \phi(i, j) / \gamma)$
- For $\gamma = 0$, clustering considers only the "best-left-link" and greedy clustering is exact.
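The effect of the temperature can be seen numerically. The sketch below uses made-up scores and treats γ = 0 as the limiting case (all mass on the argmax link) rather than dividing by zero; the function name is illustrative.

```python
import math

def link_probabilities(scores, gamma):
    """Pr[j <- i] for scores[j] = w . phi(i, j); gamma = 0 is handled as the limit."""
    if gamma == 0.0:
        best = max(range(len(scores)), key=lambda j: scores[j])
        return [1.0 if j == best else 0.0 for j in range(len(scores))]
    scaled = [s / gamma for s in scores]
    shift = max(scaled)  # stabilizes the exponentials
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [wt / total for wt in weights]

# Hypothetical scores; the last candidate is the best left link.
scores = [0.0, 1.0, 1.5]
for gamma in (1.0, 0.5, 0.1, 0.0):
    print(gamma, [round(p, 3) for p in link_probabilities(scores, gamma)])
```

As γ shrinks, the distribution concentrates on the highest-scoring link, so the cluster-level sum is dominated by a single term.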
Latent Variables: Left-Linking Forests
- Left-linking forest f: gives the parent (arrow directions reversed) of each item, chosen among the items on its left.
- The probability of forest f is based on the sum of the edge weights in f:
  $$\Pr[f; w] \propto \exp\Big( \sum_{(i, j) \in f} w \cdot \phi(i, j) / \gamma \Big)$$
- L³M is the same as expressing the probability of C as the sum of the probabilities of all consistent (latent) left-linking forests:
  $$\Pr[C; w] = \sum_{f \in F(C)} \Pr[f; w]$$
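The equivalence between the forest view and the per-item product can be checked by brute force on a tiny stream. The sketch assumes precomputed scores `scores[i][j]` = w · φ(i, j) with index 0 as the dummy item, and assumes a forest is consistent with C when every item links to an earlier item of its own cluster, or to the dummy if it is the first item of its cluster. Enumeration is exponential and is only meant as a sanity check.

```python
import itertools
import math

def allowed_parents(i, cluster_ids):
    earlier = [j for j in range(1, i) if cluster_ids[j] == cluster_ids[i]]
    return earlier if earlier else [0]  # first item of a cluster links to the dummy

def forest_weight(parents, scores, gamma):
    # parents[k] is the parent chosen by item k + 1
    total = sum(scores[i][j] for i, j in enumerate(parents, start=1))
    return math.exp(total / gamma)

def prob_by_forests(scores, cluster_ids, gamma=1.0):
    n = len(scores) - 1
    consistent = itertools.product(*(allowed_parents(i, cluster_ids) for i in range(1, n + 1)))
    every = itertools.product(*(range(i) for i in range(1, n + 1)))
    numerator = sum(forest_weight(f, scores, gamma) for f in consistent)
    denominator = sum(forest_weight(f, scores, gamma) for f in every)
    return numerator / denominator

def prob_by_items(scores, cluster_ids, gamma=1.0):
    prob = 1.0
    for i in range(1, len(scores)):
        weights = [math.exp(scores[i][j] / gamma) for j in range(i)]
        prob *= sum(weights[j] for j in allowed_parents(i, cluster_ids)) / sum(weights)
    return prob

# Hypothetical scores for four items, two clusters.
scores = [[], [0.0], [0.0, 1.0], [0.0, -1.0, 2.0], [0.0, 0.3, -0.4, 1.2]]
ids = [None, "a", "b", "a", "b"]
print(prob_by_forests(scores, ids), prob_by_items(scores, ids))
```

The two numbers agree because each item picks its parent independently, so the sum over forests factorizes into a product of per-item sums.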
L³M: Likelihood-based Learning
- Learn w from the annotated clustering $C_d$ for each data stream $d \in D$.
- L³M: learn w by minimizing the regularized negative log-likelihood:
  $$LL(w) = \beta \|w\|^2 + \sum_d \log Z_d(w) - \sum_d \sum_i \log \Big( \sum_{j < i} \exp(w \cdot \phi(i, j) / \gamma) \, C_d(i, j) \Big)$$
  The three terms are, in order, the regularization, the log partition function, and the log un-normalized probability.
- Relation to other latent variable models:
  - Learns by marginalizing the underlying latent left-linking forests.
  - $\gamma = 1$: Hidden Variable CRFs (Quattoni et al., 07)
  - $\gamma = 0$: Latent Structural SVMs (Yu and Joachims, 09)
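The objective can be evaluated term by term. A minimal sketch, assuming each stream is given as a pair `(phi, cluster_ids)` where `phi[i][j]` is the feature vector of the pair (i, j) for 0 ≤ j < i, index 0 is the dummy item, an item links to the dummy only when it starts a cluster, and `reg` stands in for the regularization coefficient (all names are illustrative):

```python
import math

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def objective(w, docs, gamma=1.0, reg=0.1):
    """Regularized negative log-likelihood, summed over streams and items."""
    value = reg * dot(w, w)  # regularization term
    for phi, cluster_ids in docs:
        for i in range(1, len(phi)):
            scaled = [dot(w, phi[i][j]) / gamma for j in range(i)]
            earlier = [j for j in range(1, i) if cluster_ids[j] == cluster_ids[i]]
            allowed = earlier if earlier else [0]  # assumption: dummy starts a cluster
            # item i's share of log Z_d(w) minus its log un-normalized probability
            value += log_sum_exp(scaled) - log_sum_exp([scaled[j] for j in allowed])
    return value

# Hypothetical two-dimensional pair features for one stream of three items.
phi = [
    [],
    [[0.0, 0.0]],
    [[0.0, 0.0], [1.0, 0.5]],
    [[0.0, 0.0], [0.2, -1.0], [0.9, 0.3]],
]
doc = (phi, [None, "a", "a", "b"])
print(objective([0.0, 0.0], [doc]))
```

Both the partition term and the gold term are sums over items, which is what makes a per-item decomposition of the gradient possible.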
Training Algorithms: Discussion
- The objective function LL(w) is non-convex.
- Can use the Concave-Convex Procedure (CCCP) (Yuille and Rangarajan, 03; Yu and Joachims, 09).
  - Pros: guaranteed to converge to a local minimum (Sriperumbudur et al., 09).
  - Cons: requires the entire data stream to compute a single gradient update.
- Online updates based on stochastic (sub-)gradient descent (SGD).
  - The sub-gradient decomposes on a per-item basis.
  - Cons: no theoretical guarantees for SGD with non-convex functions.
  - Pros: can learn in an online fashion; converges much faster than CCCP.
  - Great empirical performance.
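The per-item decomposition can be illustrated with a single SGD step. This is a sketch for γ > 0 only, not the authors' training code: `phi_i[j]` is the feature vector of item i paired with candidate j (index 0 is the dummy), `allowed` lists the candidates consistent with the gold clustering, and applying the full regularization gradient at every step is an assumption of this sketch.

```python
import math

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def softmax(values):
    lse = log_sum_exp(values)
    return [math.exp(v - lse) for v in values]

def item_loss(w, phi_i, allowed, gamma=1.0):
    scaled = [dot(w, x) / gamma for x in phi_i]
    return log_sum_exp(scaled) - log_sum_exp([scaled[j] for j in allowed])

def item_gradient(w, phi_i, allowed, gamma=1.0):
    """Gradient of item_loss: expected features under all links minus under gold links."""
    scaled = [dot(w, x) / gamma for x in phi_i]
    p_all = softmax(scaled)
    p_gold = softmax([scaled[j] for j in allowed])
    grad = [0.0] * len(w)
    for p, x in zip(p_all, phi_i):
        for k in range(len(w)):
            grad[k] += p * x[k] / gamma
    for q, j in zip(p_gold, allowed):
        for k in range(len(w)):
            grad[k] -= q * phi_i[j][k] / gamma
    return grad

def sgd_step(w, phi_i, allowed, gamma=1.0, reg=0.0, lr=0.1):
    grad = item_gradient(w, phi_i, allowed, gamma)
    return [wk - lr * (gk + 2.0 * reg * wk) for wk, gk in zip(w, grad)]

# Hypothetical item with candidates 0 (dummy), 1, 2; the gold link is to item 2.
phi_3 = [[0.0, 0.0], [1.0, 0.5], [0.9, 0.3]]
w = [0.0, 0.0]
w_next = sgd_step(w, phi_3, allowed=[2])
print(w_next, item_loss(w_next, phi_3, [2]))
```

Only the features of one item and its left neighbours are touched per update, in contrast to a CCCP iteration, which needs the whole stream.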
Experiment: Coreference Resolution
- Cluster denotative noun phrases, called mentions.
- Mentions follow a left-to-right order.
- Features: mention distance, substring match, gender match, etc.
- Experiments on ACE 2004 and OntoNotes-5.0.
- Report the average of three popular coreference clustering evaluation metrics: MUC, B³, and CEAF.
Coreference: ACE 2004
[Bar chart. Y-axis: average of MUC, B³, and CEAF (higher is better). Systems compared: Corr-Clustering (Finley and Joachims '05), Sum-Link (Haider et al. '07), Binary (Bengtson and Roth '08), L3M-0, L3M-gamma.]
- Considering multiple links helps.
- Jointly learning the metric and the clustering helps.
Coreference: OntoNotes-5.0
[Bar chart. Y-axis: average of MUC, B³, and CEAF (higher is better). Systems compared: Corr-Clustering (Finley and Joachims '05), Sum-Link (Haider et al. '07), Binary (Bengtson and Roth '08), L3M-0, L3M-gamma.]
- With domain-knowledge constraints incorporated, L³M achieves state-of-the-art performance on OntoNotes-5.0 (Chang et al., 13).
Experiments: Document Clustering
- Cluster the posts in a forum based on authors or topics.
- Dataset: discussions from www.militaryforum.com
- The posts in the forum arrive in time order, e.g., "Veteran insurance", "Re: Veteran insurance", "North Korean Missiles", "Re: Re: Veteran insurance".
- Features: common words, tf-idf similarity, time between arrivals.
- Evaluate with Variation of Information (Meila, 07).
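For readers unfamiliar with the evaluation measure, Variation of Information between two clusterings is the sum of the two conditional entropies, H(A|B) + H(B|A); it is zero for identical clusterings and lower is better. A small sketch follows; the use of the natural logarithm is an assumption, since the slides do not state the base.

```python
import math
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """VI = H(A|B) + H(B|A) for two labelings of the same items (natural log)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        vi -= p_ab * (math.log(p_ab / (count_a[a] / n)) + math.log(p_ab / (count_b[b] / n)))
    return vi

# Hypothetical gold and predicted cluster labels for four posts.
gold = ["x", "x", "y", "y"]
pred = ["p", "q", "p", "q"]
print(variation_of_information(gold, pred))
```

The measure depends only on how items are grouped, not on the label names, so relabeling the clusters leaves it unchanged.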
Author Based Clustering
[Bar chart. Y-axis: Variation of Information x 100 (lower is better). Systems compared: Corr-Clustering (Finley and Joachims '05), Sum-Link (Haider et al. '07), Binary (Bengtson and Roth '08), L3M-0, L3M-gamma.]
Topic Based Clustering
[Bar chart. Y-axis: Variation of Information x 100 (lower is better). Systems compared: Corr-Clustering (Finley and Joachims '05), Sum-Link (Haider et al. '07), Binary (Bengtson and Roth '08), L3M-0, L3M-gamma.]
Conclusions
- Latent Left-Linking Model
  - Principled probabilistic modeling for online clustering tasks
  - Marginalizes underlying latent link structures
  - Tuning $\gamma$ helps: considering multiple links helps
  - Efficient greedy inference
- SGD-based learning
  - Decomposes learning into smaller gradient updates over individual items
  - Rapid convergence and high accuracy
- Solid empirical performance on problems with a natural streaming order