Conceptual Clustering Using Lingo Algorithm: Evaluation on Open Directory Project Data


  1. Conceptual Clustering Using Lingo Algorithm: Evaluation on Open Directory Project Data. Stanisław Osiński, Dawid Weiss, Institute of Computing Science, Poznań University of Technology. May 20th, 2004.

  2. Some background: how to evaluate an SRC algorithm? About various goals of evaluation:
     - Reconstruction of a predefined structure
       - Test data: merge-then-cluster, manual labeling
       - Measures: precision-recall, entropy measures, ...
     - Label "quality", descriptiveness
       - User surveys, click-distance methods
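A minimal sketch (in Python, with illustrative names and data only) of the merge-then-cluster idea from the slide above: documents from known ODP categories are mixed, clustered, and each resulting cluster is scored against its best-matching source category with precision, recall and label entropy. This is not the authors' evaluation code, just an example of the kind of measures listed.

```python
# Sketch of merge-then-cluster scoring: documents from known ODP categories are
# mixed and clustered; each cluster is then scored against its best-matching
# category. Names and data below are illustrative only.
import math
from collections import Counter

def cluster_scores(cluster_docs, category_of):
    """Precision/recall of one cluster w.r.t. its best-matching category,
    plus the entropy of the category labels inside the cluster."""
    labels = [category_of[d] for d in cluster_docs]
    counts = Counter(labels)
    best_cat, best_hits = counts.most_common(1)[0]

    category_size = sum(1 for c in category_of.values() if c == best_cat)
    precision = best_hits / len(cluster_docs)
    recall = best_hits / category_size

    # Entropy: 0 for a pure cluster, larger when categories are mixed.
    n = len(labels)
    entropy = -sum((k / n) * math.log(k / n, 2) for k in counts.values())
    return best_cat, precision, recall, entropy

# Two MySQL snippets and one Lord of the Rings snippet ending up in one cluster:
category_of = {"d1": "MySQL", "d2": "MySQL", "d3": "LRings", "d4": "LRings"}
print(cluster_scores(["d1", "d2", "d3"], category_of))
# -> ('MySQL', ~0.67, 1.0, ~0.92)
```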

  3. Some background: how to evaluate an SRC algorithm? What types of "errors" can an algorithm make structure-wise?
     - Misassignment errors (document → cluster)
     - Missing documents in a cluster
     - Incorrect clusters (unexplainable)
     - Missing clusters (undetected)
     - Granularity level confusion (subcluster domination problems)

  4. Evaluation of Lingo's performance. We tried to answer the following questions:
     - Cluster structure:
       1. Is Lingo able to cluster similar documents?
       2. Is Lingo able to highlight outliers and "minorities"?
       3. Is Lingo able to capture generalizations of closely related subjects?
       4. How does Lingo compare to Suffix Tree Clustering?
     - Quality of cluster labels: are clusters labelled appropriately? Are they informative?

  5. Data set for the experiment: a subset of the Open Directory Project. Rationale:
     - Human-created and maintained structure
     - Human-created and maintained labels
     - Descriptions resemble search results (snippets)
     - Free availability

  6. ODP categories chosen for the experiment (the original slide shows a category diagram):
     - Movies: LRings (Lord of the Rings), BRunner (Blade Runner)
     - Health care: Ortho
     - Photography: Infra (infrared)
     - Computer science, databases: MySQL, Postgr, XMLDB, DWare
     - Computer science, misc.: JavaTut, Vi

  7. Test sets for the experiment. Test sets were combinations of categories designed to help in answering the set of questions (a sketch of assembling them follows below):
     - G1 (LRings, MySQL): separation of two unrelated categories.
     - G2 (LRings, MySQL, Ortho): separation of three unrelated categories.
     - G3 (LRings, MySQL, Ortho, Infra): separation of four unrelated categories, highlighting small topics (Infra).
     - G4 (MySQL, XMLDB, DWare, Postgr): separation of four conceptually close categories, all connected to databases.
     - G5 (MySQL, XMLDB, DWare, Postgr, JavaTut, Vi): four conceptually very close categories (databases) plus two distinct ones, still within the same abstract topic (computer science).
     - G6 (MySQL, XMLDB, DWare, Postgr, Ortho): outlier-highlight test: four dominating, conceptually close categories (databases) and one outlier (Ortho).
     - G7 (all categories): all categories mixed together; cross-topic cluster detection test (movies, databases).
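The hedged sketch below shows one way such merged test sets could be assembled; load_category() is a hypothetical helper that returns the exported snippets (title plus description) of one ODP category.

```python
# Sketch of assembling the merged test sets G1-G7. load_category() is a
# hypothetical helper returning the exported snippets of one ODP category.
ALL_CATEGORIES = ["LRings", "BRunner", "Ortho", "Infra",
                  "MySQL", "Postgr", "XMLDB", "DWare", "JavaTut", "Vi"]

TEST_SETS = {
    "G1": ["LRings", "MySQL"],
    "G2": ["LRings", "MySQL", "Ortho"],
    "G3": ["LRings", "MySQL", "Ortho", "Infra"],
    "G4": ["MySQL", "XMLDB", "DWare", "Postgr"],
    "G5": ["MySQL", "XMLDB", "DWare", "Postgr", "JavaTut", "Vi"],
    "G6": ["MySQL", "XMLDB", "DWare", "Postgr", "Ortho"],
    "G7": ALL_CATEGORIES,
}

def build_test_set(name, load_category):
    """Merge the snippets of all categories in one test set, remembering the
    source category of every snippet so clustering results can be scored."""
    docs, truth = [], []
    for category in TEST_SETS[name]:
        for snippet in load_category(category):
            docs.append(snippet)
            truth.append(category)
    return docs, truth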

  8. The experiment. Lingo's implementation → Carrot2 framework. The algorithm's thresholds were fixed at "good guess" values (the same as those used in the on-line demo). Stemming and stop-word detection were applied to the input data.
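The experiment ran Lingo inside the Carrot2 framework; the sketch below does not reproduce Carrot2. It only approximates the input preprocessing mentioned on the slide (stemming and stop-word removal), here using NLTK's Porter stemmer and English stop-word list instead of Carrot2's own components.

```python
# Not the Carrot2 pipeline itself: an approximation of the input preprocessing
# mentioned on the slide (stemming + stop-word removal) using NLTK.
import re
from nltk.corpus import stopwords      # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
stem = PorterStemmer().stem

def preprocess(snippet):
    """Lower-case a snippet, drop stop words, stem the remaining terms."""
    tokens = re.findall(r"[a-z0-9]+", snippet.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Information on infrared photography and infrared images"))
```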

  9. The results. Method of analysis: manual investigation of document-to-cluster assignment charts.
     - Helps understand the internal structure of results
     - Prevents compensations inherent in aggregative measures
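The "categories-in-clusters" view analysed on the following slides amounts to a simple contingency table: for every cluster produced by the algorithm, count how many of its documents originate from each source ODP category. The sketch below uses hypothetical cluster labels and ground-truth data for illustration.

```python
# Sketch of the categories-in-clusters view: for each cluster returned by the
# algorithm, count how many of its documents come from each source category.
# The example cluster labels and ground truth are illustrative only.
from collections import Counter, defaultdict

def categories_in_clusters(clusters, truth):
    """clusters: {cluster label: [document indices]};
    truth: source category of each document (by index)."""
    view = defaultdict(Counter)
    for label, doc_ids in clusters.items():
        for i in doc_ids:
            view[label][truth[i]] += 1
    return view

clusters = {"Mysql Server": [0, 1, 4], "Lord of the Rings Movie": [2, 3]}
truth = ["MySQL", "MySQL", "LRings", "LRings", "XMLDB"]
for label, counts in categories_in_clusters(clusters, truth).items():
    print(label, dict(counts))
# Mysql Server {'MySQL': 2, 'XMLDB': 1}
# Lord of the Rings Movie {'LRings': 2}
```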

  10. → Is Lingo able to cluster similar documents? [Chart: categories-in-clusters view, input test G3; cluster assignment threshold 0.250, candidate cluster threshold 0.775, top 16 clusters.] G1–G3: clear separation of topics, but with some extra clusters. G1: granularity problem.

  11. → Is Lingo able to cluster similar documents? [Chart: categories-in-clusters view, input test G5; same thresholds, top 16 clusters.] G5: misassignment problem.

  12. → Is Lingo able to highlight outliers and "minorities"? [Chart: categories-in-clusters view, input test G6; same thresholds, top 16 clusters.] Ortho category (outlier); XMLDB consumed by MySQL!

  13. → Is Lingo able to highlight outliers and "minorities"? [Chart: categories-in-clusters view, input test G7; same thresholds, top 16 clusters.] Infra category (outlier).

  14. → Is Lingo able to highlight outliers and "minorities"? [Chart: categories-in-clusters view, input test G5; same thresholds, top 16 clusters.] XMLDB category (outlier).

  15. → Is Lingo able to capture generalizations? [Chart: categories-in-clusters view, input test G7; same thresholds, top 16 clusters.] The "movie review" cluster is a generalization, but...

  16. → Is Lingo able to capture generalizations? [Chart: categories-in-clusters view, input test G7; same thresholds, top 16 clusters.] Clusters are usually orthogonal with SVD, so no good results should be expected in this area.
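The orthogonality remark can be illustrated directly: Lingo derives its abstract concepts from the left singular vectors of the term-document matrix, and SVD makes those vectors pairwise orthogonal, which works against a "movie review" concept that overlaps both the Blade Runner and the Lord of the Rings concepts. The toy matrix below is purely illustrative.

```python
# Toy illustration of the orthogonality remark: Lingo's abstract concepts are
# left singular vectors of the term-document matrix, and SVD makes them
# pairwise orthogonal, so a concept spanning both movie topics is unlikely.
# The tiny matrix below (terms x documents) is illustrative only.
import numpy as np

terms = ["blade", "runner", "lotr", "ring", "review"]
A = np.array([  # columns: two Blade Runner reviews, two Lord of the Rings reviews
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(U.T @ U, 3))  # identity matrix: the concept vectors are orthogonal
```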

  17. → How does Lingo compare to Suffix Tree Clustering? [Chart: Lingo's categories-in-clusters view, input test G7; same thresholds, top 16 clusters; shown for comparison with STC on the next slide.]

  18. → How does Lingo compare to Suffix Tree Clustering? [Chart: categories-in-clusters view, input test G7, STC algorithm, top 16 clusters. Example STC labels as rendered on the chart include "xml, native, native xml database", "blade runner, blade, runner", "article by ralph kimball" and "dm review article by douglas hackne [...]".]

  19. Key differences between Lingo and STC:
     - Size-dominated clusters in STC
     - Cluster labels much less informative
     - Common-term clusters in STC

  20. Cluster label quality: assessed manually. Problems:
     - Single-term labels are usually ambiguous or too broad ("news", "free")
     - Level of granularity is usually unclear (a need for hierarchical methods?)

  21. A word about analytical comparison methods. . . Can these conclusions be derived using formulas? We think so: cluster contamination measures might help.
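As a hedged illustration of what such a cluster contamination measure might look like (an assumption, not necessarily the formulation the authors had in mind): the fraction of document pairs within a cluster whose members come from different source categories is 0 for a pure cluster and approaches 1 for a thorough mix.

```python
# One plausible (assumed) contamination measure: the fraction of document pairs
# inside a cluster whose members come from different source categories.
from collections import Counter

def contamination(cluster_docs, category_of):
    n = len(cluster_docs)
    if n < 2:
        return 0.0
    counts = Counter(category_of[d] for d in cluster_docs)
    same_category_pairs = sum(k * (k - 1) // 2 for k in counts.values())
    return 1.0 - same_category_pairs / (n * (n - 1) // 2)

category_of = {"d1": "MySQL", "d2": "MySQL", "d3": "Ortho"}
print(contamination(["d1", "d2"], category_of))        # 0.0   (pure cluster)
print(contamination(["d1", "d2", "d3"], category_of))  # ~0.67 (mixed cluster)
```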

  22. Online demo. A nice, though scientifically doubtful, form of evaluation is the online demo's popularity and the feedback we get from users.

  23. http://carrot.cs.put.poznan.pl Thank you. Questions?
