Clustering in Swedish
The Impact of Some Properties of the Swedish Language on Document Clustering and an Evaluation Method
Magnus Rosell
2006-03-15

Content
- Introduction
- Clustering
- Evaluation
- Clustering in Swedish

The Swedish Twin Registry
- KI (The Swedish Medical University)
- Largest twin registry in the world, about 140 000 twins.
- Smoking is not harmful.
- The impact of heredity and environment.
- A lot of questionnaires (open and closed questions).

- One free text question about occupation.
- About 42 000 twins have answered.
- Two hierarchical classification systems (number of categories per level):

System   L1   L2   L3   L4   L5
AMSYK    11   28  114  361  969
YK80     12   59  288

- Manual categorization of the 42 000 texts according to both systems took one summer.

Clustering
- Clustering: to partition a set of objects into clusters (groups or parts) so that the objects within a cluster are more similar to each other than to objects from other clusters.
- We are interested in similarity with respect to the contents of the documents (our objects).
- Clustering vs. Categorization

Motivation
- Postprocessing of search results (http://vivisimo.com, http://www.iboogie.com)
- Tool for exploration.
- Questionnaires.
- Cheaper and faster than manual clustering/categorization.
- Easy to obtain several different clusterings of the same set.

Clustering
Many clustering algorithms use
- a representation of the objects (often a vector of values)
- a similarity measure

Document Clustering
For documents (Information Retrieval):
- The content of a text is represented by the words in it.
- No regard to word order.
- Very common words that do not have any “meaning” are removed (stoplist).
- Texts are considered similar if they share many words.

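The bag-of-words idea on this slide can be sketched in a few lines. This is only an illustration: the stoplist below is a tiny hypothetical one, the tokenization is a simple regular expression, and the example sentences are invented.

```python
import re
from collections import Counter

# Hypothetical, tiny stoplist for illustration; a real stoplist is much longer.
STOPLIST = {"och", "i", "att", "en", "det", "som", "på", "är"}

def bag_of_words(text, stoplist=STOPLIST):
    """Return word counts for a text, ignoring word order and stoplist words."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in stoplist)

def shared_words(a, b):
    """Words that occur in both bags; more shared words suggests more similar texts."""
    return set(a) & set(b)

# Invented example sentences.
doc1 = bag_of_words("Börsen steg och räntan föll i dag")
doc2 = bag_of_words("Räntan föll på börsen")
doc3 = bag_of_words("Laget vann matchen i dag")

print(shared_words(doc1, doc2))
print(shared_words(doc1, doc3))
```

Under these assumptions the two texts about the stock market share more words with each other than either shares with the sports text.
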
Document Clustering (cont.)
The term-by-document matrix:

\[
\begin{array}{c|cccc}
       & d_1     & \cdots & d_j     & \cdots \\ \hline
t_1    & w_{1,1} & \cdots & w_{1,j} & \cdots \\
\vdots & \vdots  & \ddots & \vdots  &        \\
t_i    & w_{i,1} & \cdots & w_{i,j} & \cdots \\
\vdots & \vdots  &        & \vdots  & \ddots
\end{array}
\]

- w_{i,j}: based on the frequency of the word in the document and its frequency in the entire document collection (tf*idf-weighting for instance).
- Similarity between documents: cosine measure for instance.

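As a rough sketch of the weighting and similarity named on this slide: the code below uses one common tf*idf variant (raw term count times log(n / document frequency)), which is an assumption and not necessarily the weighting used in the talk. The token lists are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed tf*idf variant: raw count times log(n / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented documents, already tokenized and stemmed.
docs = [
    ["börs", "ränt", "procent"],
    ["börs", "ränt", "akti"],
    ["match", "spel", "klubb"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents sharing two terms
sim_02 = cosine(vecs[0], vecs[2])  # documents sharing no terms
```

Each dict corresponds to one column of the term-by-document matrix, stored sparsely. Documents with no terms in common get similarity zero.
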
Algorithms
Two Types of Clustering Algorithms
- Partitioning algorithms: flat partition
- Hierarchical algorithms: hierarchy of clusters

A Partitioning Algorithm: K-Means
1. Initial partition, for example: pick k documents at random as first cluster centroids.
2. Put each document in the most similar cluster.
3. Calculate new cluster centroids.
4. Repeat 2 and 3 until some condition is fulfilled.

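The four steps can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: it assumes dense vectors, cosine similarity, and "no document changes cluster" as the stopping condition. The parameter names `max_iter` and `seed` are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def k_means(vectors, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k documents at random as the first centroids.
    centroids = rng.sample(vectors, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: put each document in the most similar cluster.
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        # Step 4: stop when no document changes cluster (one possible condition).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute centroids; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment

# Invented toy vectors: two documents near each axis.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = k_means(vectors, k=2)
```

The returned list holds one cluster index per document. Which index a group receives depends on the random initial partition.
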
A Clustering Example
[Figure: plot of a clustering example; both axes range from −6 to 6.]

Hierarchical Clustering
Agglomerative algorithm:
1. Make one cluster for each document.
2. Join the most similar pair into one cluster.
3. Repeat 2 until some condition is fulfilled.
Examples: single link, complete link, group average link, Ward’s method.

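A small sketch of the agglomerative steps, assuming cosine similarity, the average pairwise similarity between two clusters as the linkage, and a requested number of clusters as the stopping condition. All of these are choices made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def average_link(a, b, vectors):
    """Average pairwise similarity between the documents of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in a for j in b]
    return sum(sims) / len(sims)

def agglomerative(vectors, n_clusters):
    # Step 1: one cluster per document (clusters hold document indices).
    clusters = [[i] for i in range(len(vectors))]
    # Step 3: repeat until the requested number of clusters remains.
    while len(clusters) > n_clusters:
        # Step 2: find the most similar pair of clusters and join them.
        pairs = [
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        x, y = max(
            pairs,
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Invented toy vectors.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
clusters = agglomerative(vectors, n_clusters=2)
```

Because no random choice is involved, the same input gives the same hierarchy every time. Comparing all cluster pairs in every round makes this sketch slow for large collections.
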
Hierarchical Clustering (cont.)
Divisive algorithm:
1. Put all documents in one cluster.
2. Split the worst cluster.
3. Repeat 2 until some condition is fulfilled.
Example: Bisecting K-Means splits the biggest cluster into two using K-Means.

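Bisecting K-Means might be sketched like this. The sketch assumes squared Euclidean distance instead of a document similarity measure, a requested number of clusters as the stopping condition, and a simple fallback when a split degenerates. The names `two_means`, `max_iter`, and `seed` are hypothetical.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def two_means(points, rng, max_iter=50):
    """Split a list of points into two groups with K-Means (k = 2)."""
    centroids = rng.sample(points, 2)
    groups = None
    for _ in range(max_iter):
        new_groups = [[], []]
        for p in points:
            nearest = 0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            new_groups[nearest].append(p)
        if new_groups == groups:
            break
        groups = new_groups
        # An empty group keeps its old centroid.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return groups

def bisecting_k_means(points, n_clusters, seed=0):
    rng = random.Random(seed)
    # Step 1: put all documents in one cluster.
    clusters = [list(points)]
    # Step 3: repeat until enough clusters exist (one possible condition).
    while len(clusters) < n_clusters:
        # Step 2: split the biggest cluster into two.
        biggest = max(clusters, key=len)
        if len(biggest) < 2:
            break  # nothing left to split
        clusters.remove(biggest)
        left, right = two_means(biggest, rng)
        if not left or not right:
            # Degenerate split (e.g. duplicate points): fall back to halving.
            half = len(biggest) // 2
            left, right = biggest[:half], biggest[half:]
        clusters.extend([left, right])
    return clusters

# Invented toy points forming two well separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
clusters = bisecting_k_means(points, n_clusters=2)
```

Recording each split, rather than only the final clusters, would give the hierarchy that a divisive algorithm produces.
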
Comparison of Algorithms
K-Means
- decide the number of clusters in advance
- different results depending on initial partition
- “global”
Agglomerative
- may stop at “optimal” number of clusters
- same result every time
- “local”

A Document Clustering Example
Clustering result: number of articles per category, and the most descriptive words (Swedish stems) for each cluster.

Cluster    Economy  Entertainment  Sport  Sweden  World  Total  Words
Cluster 1      167              4      1      37     23    232  procent, index, börs, ök, ränt
Cluster 2       18            421     22     176     40    677  film, aftonbl, skriv, tv, sver
Cluster 3        0             19    452      10     14    495  spel, match, svensk, vann, klubb
Cluster 4      312              8      6      36     10    372  reut, pressmeddel, bolag, stockholm, akti
Cluster 5        3             48     19     241    413    724  polis, död, skad, person, tt
Total          500            500    500     500    500   2500  tt, svensk, procent, skriv, stockholm, reut, spel, tid, dag, akti

Evaluation
What is a good clustering?
- Internal quality measures depend on the representation.
- External quality measures are based on a known categorization.

Internal Quality Measures
- Cluster self similarity (the average similarity of the documents in a cluster).
- Not good when evaluating the representation.
- Also relies on the assumption that our representation is valid.

External Quality Measures
Compare the clustering to another partition (a categorization for instance).
- Precision, Recall
- Entropy, Mutual Information

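To make some of these measures concrete, here is a sketch that computes precision, recall, and entropy from a cluster-by-category count table. The counts are those of the news clustering example earlier in the talk (rows are clusters 1 to 5, columns are the five categories). The exact definitions used in the talk are not given, so the formulas below are the usual textbook ones and should be read as assumptions.

```python
import math

def cluster_entropy(counts):
    """Entropy (in bits) of the category distribution inside one cluster."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_entropy(table):
    """Average cluster entropy, weighted by cluster size. Lower is better."""
    n = sum(sum(row) for row in table)
    return sum(sum(row) / n * cluster_entropy(row) for row in table)

def precision(table, cluster, category):
    """Share of the cluster's documents that belong to the category."""
    return table[cluster][category] / sum(table[cluster])

def recall(table, cluster, category):
    """Share of the category's documents that ended up in the cluster."""
    return table[cluster][category] / sum(row[category] for row in table)

# Rows: clusters 1-5. Columns: Ekonomi, Nöje, Sport, Sverige, Världen.
table = [
    [167, 4, 1, 37, 23],
    [18, 421, 22, 176, 40],
    [0, 19, 452, 10, 14],
    [312, 8, 6, 36, 10],
    [3, 48, 19, 241, 413],
]

sport_precision = precision(table, cluster=2, category=2)  # 452 / 495
sport_recall = recall(table, cluster=2, category=2)        # 452 / 500
total_entropy = weighted_entropy(table)
```

A clustering that reproduces the categories exactly has entropy 0. With five equally likely categories in every cluster the entropy would be log2(5) bits.
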
External Quality Measures (cont.)
External measures are comparisons:
- Clustering → categorization
- Max and min, but what is good?

Comparing Comparisons
[Diagram: Clustering, AMSYK, and YK80 compared with each other.]

YK80 → AMSYK: 3.00    Clustering → AMSYK: 2.29 (77 %)
AMSYK → YK80: 3.02    Clustering → YK80: 2.17 (72 %)

Average values over all levels.

Text Sets
KTH News Corpus
- DN and Aftonbladet
- Five categories each
Medical papers from Läkartidningen
- MeSH-terms
- Four automatically extracted categorizations.

Stemming
- Morphology: cykel, cyklarnas, cykling
- Stemming: remove suffixes.
- Stemming improved results on newspaper articles by 4 %.
- Reduces the size of the representation.

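Suffix removal can be illustrated with a deliberately crude stripper. The suffix list is hypothetical and is not the stemmer used in the talk. It is only meant to show how different word forms collapse into one term.

```python
# Hypothetical suffix list for illustration only; not the stemmer used in the talk.
SUFFIXES = ["arnas", "arna", "ing", "ar", "en", "s"]

def crude_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["cyklarnas", "cykling", "cykel"]
stems = [crude_stem(w) for w in words]
vocabulary_before = len(set(words))
vocabulary_after = len(set(stems))
```

In this sketch "cyklarnas" and "cykling" share a stem, so the vocabulary shrinks. "cykel" is left unchanged, which shows why a real stemmer needs more than a plain suffix list.
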
Compound Splitting
- Swedish compounds: textklustring (clustering of texts)
- The spell checking program Stava.
- Stop some “compounds” from being split: godkänd, Svante, Lindström, etc.
- Improved results by 10 %. Combined with stemming: 13 %. (Newspaper articles)
- Keep only the components!

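The idea of splitting compounds while protecting certain words can be sketched with a toy dictionary lookup. The lexicon and the protected list below are hypothetical, and the sketch does not use or imitate the interface of Stava. It only splits into two components.

```python
# Hypothetical lexicon and protected list for illustration; the talk used the
# spell checking program Stava, whose interface is not shown here.
LEXICON = {"text", "klustring", "god", "känd"}
PROTECTED = {"godkänd", "svante", "lindström"}

def split_compound(word, lexicon=LEXICON, protected=PROTECTED):
    """Return two lexicon components if the word splits cleanly, else [word]."""
    w = word.lower()
    if w in protected:
        return [w]  # never split words on the protected list
    for i in range(1, len(w)):
        head, tail = w[:i], w[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]  # keep only the components
    return [w]

split_ok = split_compound("textklustring")
kept_whole = split_compound("godkänd")
unknown = split_compound("polis")
```

Without the protected list, "godkänd" would be split into "god" and "känd", which are both words but do not reflect the meaning of the whole.
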