Clustering the Tagged Web

Clustering the Tagged Web. Daniel Ramage, Paul Heymann, Christopher D. Manning, Hector Garcia-Molina. Stanford University. WSDM 2009. Images from del.icio.us, lbaumann.com, www.hometrainingtools.com.


  1. Clustering the Tagged Web
     Daniel Ramage, Paul Heymann, Christopher D. Manning, Hector Garcia-Molina
     Stanford University
     WSDM 2009
     Images from del.icio.us, lbaumann.com, www.hometrainingtools.com

  2. Web document text

  3. Web document text
     Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals

  4. Web document text
     Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals
     Anchor Text: home science tools hometrainingtools.com links click follow supplies training experiments other pages

  5. Web document text
     Words: information about catalog pricing changes in 2008 welcome looking hands-on science ideas try kitchen projects dissolve eggshell grow crystals
     Anchor Text: home science tools hometrainingtools.com links click follow supplies training experiments other pages
     Tags: science homeschool education shopping curriculum homeschooling experiments tools chemistry supplies

  6. Why tags? – del.icio.us

  7. Why tags? – del.icio.us
     ≈120,000 posts / day
     12–75 million (≈10^7–10^8) unique URLs (versus 10^9–10^11 total URLs)
     Disproportionately the web’s most useful URLs (and those URLs have many tags)

  8. Using tags to understand the web
     • The web is large and growing: anything that helps us understand high-level structure is useful
     • Tags encode semantically meaningful labels
     • Tags cover much of the web’s best content
     • How can we use tags to provide high-level insight?

  9. Web page clustering task
     • Given a collection of web pages

  10. Web page clustering task
      • Given a collection of web pages
      • Assign each page to a cluster, maximizing similarity within clusters
      [Figure: eight pages grouped into clusters A, B, and C]

  11. Web page clustering task
      • Given a collection of web pages
      • Assign each page to a cluster, maximizing similarity within clusters
      • Applications: improved user interfaces, collection clustering, search result diversity, language-model-based retrieval
      [Figure: eight pages grouped into clusters A, B, and C]

  12. Structure of this talk
      Features: Words, Tags, Anchors
      Models: Vector Space Model (K-means), Generative Model (MM-LDA)

  13. Models: K-means and MM-LDA
      Features: Words, Tags, Anchors
      Models: Vector Space Model (K-means), Generative Model (MM-LDA)

  14. Model 1: K-means clustering
      • K-means assumes the standard Vector Space Model: documents are Euclidean-normalized real-valued vectors
      • Algorithm: iteratively
        - Re-assign documents to closest cluster centroid
        - Update cluster centroids from document assignments
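
The two alternating steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the random initialization from existing documents, the fixed iteration count, and the dense list representation are all assumptions made for the example.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(docs, k, iterations=20, seed=0):
    """Cluster dense document vectors; returns (assignments, centroids)."""
    rng = random.Random(seed)
    docs = [l2_normalize(d) for d in docs]
    # Assumption: initialize centroids from k randomly chosen documents.
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignments = [0] * len(docs)
    for _ in range(iterations):
        # Step 1: re-assign each document to its closest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(d, centroids[c]))
            for d in docs
        ]
        # Step 2: update each centroid as the mean of its assigned documents.
        for c in range(k):
            members = [d for d, a in zip(docs, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy usage: two documents point mostly along each axis.
assignments, centroids = kmeans(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], k=2
)
```

On this toy input the first two documents end up in one cluster and the last two in the other; which cluster index each group receives depends on the random initialization.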

  15. Model 2: Latent Dirichlet Allocation
      • LDA assumes each document’s words are generated by some topic’s word distribution
      Document 22, Words: information about catalog pricing changes 2008 welcome looking hands-on science ideas try kitchen
      Topic 5: science experiment learning ideas practice information
      …
      Topic 12: catalog shopping buy Internet checkout cart

  16. Model 2: Latent Dirichlet Allocation
      • LDA assumes each document’s words are generated by some topic’s word distribution
      • Paired with an inference mechanism (Gibbs sampling), learns per-document distributions over topics, per-topic distributions over words
      Document 22, Words: information about catalog pricing changes 2008 welcome looking hands-on science ideas try kitchen
      Topic 5: science experiment learning ideas practice information
      …
      Topic 12: catalog shopping buy Internet checkout cart
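
To make the inference step concrete, here is a rough sketch of a collapsed Gibbs sampler for plain LDA (words only, not the MM-LDA extension discussed later in the talk). The slide names Gibbs sampling but gives no details, so the collapsed variant, the symmetric hyperparameters `alpha` and `beta`, the sweep count, and the integer word-id encoding are assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, sweeps=50, seed=0):
    """docs: list of documents, each a list of integer word ids in [0, vocab_size).

    Returns (doc_topic, topic_word) count tables. Normalizing their rows (after
    adding alpha / beta) gives per-document topic distributions and per-topic
    word distributions.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    def update(d, w, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            update(d, w, z, +1)
        assignments.append(zs)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts ...
                update(d, w, assignments[d][i], -1)
                # ... then resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                update(d, w, z, +1)
    return doc_topic, topic_word


# Toy usage with a hypothetical 6-word vocabulary.
toy_docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 2, 1]]
doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2, vocab_size=6, sweeps=10)
```

Each row of `doc_topic` sums to that document's length, since every token carries exactly one topic assignment at all times.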

  17. Features: words, anchors, and tags
      Features: Words, Tags, Anchors
      Models: Vector Space Model (K-means), Generative Model (MM-LDA)

  18. Combining features
      Feature Combination    Feature Space Size
      Words                  Words
      Anchors                Anchors
      Tags                   Tags

  19. Combining features
      Feature Combination    Feature Space Size
      Words                  Words
      Anchors                Anchors
      Tags                   Tags
      Tags as Words          Tags as Words
      Other feature spaces shown: Anchors as Words; Tags & Anchors as Words

  20. Combining features
      Feature Combination    Feature Space Size
      Words                  Words
      Anchors                Anchors
      Tags                   Tags
      Tags as Words          Tags as Words
      Tags as New Words      Words | Tags
      Other feature spaces shown: Words | Anchors; Words | Tags | Anchors

  21. Combining features
      Feature Combination    Feature Space Size
      Words                  Words
      Anchors                Anchors
      Tags                   Tags
      Tags as Words          Tags as Words
      Tags as New Words      Words | Tags
      Other feature spaces shown: Words | Anchors; Words | Tags | Anchors
      Tags as Words and Tags as New Words: simple feature space modifications for existing models

  22. Combining features
      Feature Combination    Feature Space Size
      Words                  Words
      Anchors                Anchors
      Tags                   Tags
      Tags as Words          Tags as Words
      Tags as New Words      Words | Tags
      Words + Tags           Words, Tags
      Other feature spaces shown: Words | Anchors; Words | Tags | Anchors

  23. Combining features
      Feature Combination    Feature Space Size
      Words                  Words
      Anchors                Anchors
      Tags                   Tags
      Tags as Words          Tags as Words
      Tags as New Words      Words | Tags
      Words + Tags           Words, Tags
      Other feature spaces shown: Words | Anchors; Words | Tags | Anchors
      Words + Tags in K-means: normalize feature input vectors independently
      Words + Tags in LDA: multiple parallel sets of observations via MM-LDA
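
The three ways of combining words and tags for K-means can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the function names, the use of raw term counts, the `("word", ...)` / `("tag", ...)` key scheme, and the `tag_weight` knob are assumptions.

```python
import math
from collections import Counter


def l2_normalize(counts):
    """Scale a sparse count vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else {}


def tags_as_words(words, tags, tag_weight=1):
    """One shared vocabulary: a tag counts as extra occurrences of the same word."""
    counts = Counter(words)
    for t in tags:
        counts[t] += tag_weight  # hypothetical weighting knob
    return l2_normalize(counts)


def tags_as_new_words(words, tags):
    """Disjoint vocabularies concatenated into one vector, normalized jointly."""
    counts = Counter(("word", w) for w in words)
    counts.update(("tag", t) for t in tags)
    return l2_normalize(counts)


def words_plus_tags(words, tags):
    """Disjoint vocabularies, each block normalized independently, then concatenated."""
    word_vec = l2_normalize(Counter(("word", w) for w in words))
    tag_vec = l2_normalize(Counter(("tag", t) for t in tags))
    return {**word_vec, **tag_vec}


# Toy usage with a few terms from the example page earlier in the talk.
words = ["science", "science", "kitchen"]
tags = ["science", "homeschool"]
combined = words_plus_tags(words, tags)
```

With independent normalization, the word block and the tag block each have unit length, so neither feature type can dominate the distance computation simply because a page has more text than tags.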

  24. Experiments
      Features: Words, Tags, Anchors
      Models: Vector Space Model (K-means), Generative Model (MM-LDA)
      1. Combining words and tags in the VSM

  25. Experiments
      Features: Words, Tags, Anchors
      Models: Vector Space Model (K-means), Generative Model (MM-LDA)
      2. Comparing models, at multiple levels of specificity

  26. Experiments
      Features: Words, Tags, Anchors
      Models: Vector Space Model (K-means), Generative Model (MM-LDA)
      3. Do words and tags complement or substitute for anchor text?

  27. Experimental Setup
      • Construct surrogate “gold standard” clustering using Open Directory Project
      • Reflects a (problematic) consensus clustering, with known number of clusters

      ODP Category   # Documents   Top Tags
      Computers      5361          web css tools software programming
      Health         434           parenting medicine healthcare medical
      Reference      1325          education reference time research dictionary

  28. Experimental Setup
      • Score predicted clusterings with ODP, but not trying to predict ODP
      • Useful for relative system performance

      ODP Category   # Documents   Top Tags
      Computers      5361          web css tools software programming
      Health         434           parenting medicine healthcare medical
      Reference      1325          education reference time research dictionary

  29. Evaluation: Cluster F1
      Intuition: balance pairwise precision (place only similar documents together) with pairwise recall (keep all similar documents together)
      [Figure: pages in clusters A, B, and C, with ODP labels Reference and Health]

  30. Evaluation: Cluster F1
                          Same Label   Different Label
      Same Cluster
      Different Cluster

  31. Evaluation: Cluster F1
                          Same Label   Different Label
      Same Cluster        5
      Different Cluster

  32. Evaluation: Cluster F1
                          Same Label   Different Label
      Same Cluster        5            3
      Different Cluster
      Cluster Precision: 5/8

  33. Evaluation: Cluster F1
                          Same Label   Different Label
      Same Cluster        5            3
      Different Cluster   8
      Cluster Precision: 5/8
      Cluster Recall: 5/13

  34. Evaluation: Cluster F1
                          Same Label   Different Label
      Same Cluster        5            3
      Different Cluster   8
      Cluster Precision: 5/8
      Cluster Recall: 5/13
      Cluster F1: .476
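
The pairwise counts on this slide can be reproduced with a short script. The function below counts document pairs by whether they share a predicted cluster and a gold label. The specific arrangement of labels within clusters is an assumption: it is one assignment (four pages in cluster A, two each in B and C, with five Reference pages and three Health pages) that yields the slide's counts of 5, 3, and 8, and the function name is hypothetical.

```python
from itertools import combinations


def cluster_f1(clusters, labels):
    """Pairwise precision, recall, and F1 of a clustering against gold labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_label = labels[i] == labels[j]
        if same_cluster and same_label:
            tp += 1  # correctly placed together
        elif same_cluster:
            fp += 1  # placed together, but labels differ
        elif same_label:
            fn += 1  # same label, but split across clusters
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return tp, fp, fn, precision, recall, f1


# Assumed arrangement consistent with the slide's counts.
clusters = ["A", "A", "A", "A", "B", "B", "C", "C"]
labels = ["Reference", "Reference", "Reference", "Health",
          "Reference", "Reference", "Health", "Health"]

tp, fp, fn, precision, recall, f1 = cluster_f1(clusters, labels)
print(tp, fp, fn, round(f1, 3))  # 5 3 8 0.476
```

The F1 value follows from the harmonic mean: 2 · (5/8) · (5/13) / (5/8 + 5/13) = 50/105 ≈ .476.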

  35. Experiments
      Features: Words, Tags, Anchors
      Models: Vector Space Model (K-means), Generative Model (MM-LDA)
      1. Combining words and tags in the VSM

  36. Result: normalize words and tags independently in the Vector Space Model

      Features      K-means
      Words         .139
      Tags          .219
      Words+Tags    .225

      Possible utility for other applications of the VSM

  37. Result: normalize words and tags independently in the Vector Space Model

      Features              K-means
      Words                 .139
      Tags                  .219
      Words+Tags            .225
      Tags as Words (×1)    .158
      Tags as Words (×2)    .176
      Tags as New Words     .154

      Possible utility for other applications of the VSM

  38. Experiments
      Features: Words, Tags, Anchors
      Models: Vector Space Model (K-means), Generative Model (MM-LDA)
      2. Comparing models, at multiple levels of specificity

  39. Result: MM-LDA outperforms K-means on top-level ODP categories

      Features      K-means   (MM-)LDA
      Words         .139      .260
      Tags          .219      .270
      Words+Tags    .225      .307

  40. Tagging at multiple basic levels
      People use tags to help find the same page later, often at a “natural” level of specificity
      Programming/Languages (1094 documents): Java, PHP, Python, C++, JavaScript, Perl, Lisp, Ruby, C
      Society/Social Sciences (1590 documents): Issues, Religion & Spirituality, People, Politics, History, Law, Philosophy

  41. Tagging at multiple basic levels
      People use tags to help find the same page later, often at a “natural” level of specificity
      Programming/Languages (1094 documents): Java, PHP, Python, C++, JavaScript, Perl, Lisp, Ruby, C
      Society/Social Sciences (1590 documents): Issues, Religion & Spirituality, People, Politics, History, Law, Philosophy
      java applies to 73% of Programming/Java pages but software applies to only 21% of Top/Computer pages
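
A coverage figure like the ones quoted on this slide is the fraction of pages in a category that carry a given tag. A small sketch of that computation follows; the function name and the toy page data are hypothetical and are not drawn from the talk's dataset.

```python
def tag_coverage(pages, tag):
    """Fraction of pages (each given as a set of tags) that carry `tag`."""
    if not pages:
        return 0.0
    return sum(1 for tags in pages if tag in tags) / len(pages)


# Hypothetical toy category of four pages.
toy_category = [
    {"java", "programming"},
    {"java"},
    {"software"},
    {"java", "tools"},
]
print(tag_coverage(toy_category, "java"))      # 0.75
print(tag_coverage(toy_category, "software"))  # 0.25
```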
