Search Results Clustering in Polish: Evaluation of Carrot
Dawid Weiss, Jerzy Stefanowski
Institute of Computing Science, Poznań University of Technology
Introduction
• search engines – tools of everyday use
• poor knowledge about search techniques
• presentation of search results
• "Baudelaire?"
Limitations of ranked list presentation
What is Search Results Clustering?
Search Results Clustering (SRC) is the efficient identification of meaningful thematic groups of documents in a search result, and their concise presentation.
• benefits gained from SRC
  • faster identification of relevant groups of documents
  • identification of the range of topics covered by the search result
• SRC does not cure all problems of ranked lists
• SRC is not a query answering system
Our research
• general influence of data pre-processing on the quality of clustering
  • ignoring stop words
  • stemming
  • clustering inflectionally rich languages (Polish)
• Suffix Tree Clustering algorithm's thresholds and quality of results
• new search results clustering algorithms
Suffix Tree Clustering algorithm
• snippet similarity based on recurring phrases
• utilizes suffix trees for clustering (theoretically linear complexity)
• one of the first approaches dedicated to search results clustering

"All the real knowledge which we possess depends on methods by which we distinguish the similar from the dissimilar." – Genera Plantarum, Linnaeus
Example
(1) "cat ate cheese"
(2) "mouse ate cheese too"
(3) "cat ate mouse too"
Base clusters:
[a] (1,3) cat ate
[b] (1,2,3) ate
[f] (1,2) ate cheese
[c] (2,3) too
…
• some base clusters will be removed because they contain stop words, e.g. [c]
• for each cluster we calculate a base cluster score
Example (contd)
• base clusters merging
• binary similarity measure
• all connected subgraphs become clusters
• many limitations of the merging method
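The two steps above can be sketched in Python. This is a simplified illustration, not the original implementation: word n-grams stand in for the generalized suffix tree, the function names are mine, and stop-word filtering and base cluster scoring are omitted.

```python
from itertools import combinations

def base_clusters(snippets, max_len=3):
    """Collect phrases (word n-grams up to max_len words) shared by at
    least two snippets, mapping each phrase to the set of documents that
    contain it.  A real STC implementation derives these base clusters
    from a generalized suffix tree in (theoretically) linear time."""
    phrase_docs = {}
    for doc_id, text in enumerate(snippets):
        words = text.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = " ".join(words[i:i + n])
                phrase_docs.setdefault(phrase, set()).add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) > 1}

def merge_clusters(bases, threshold=0.5):
    """Binary similarity: base clusters A and B are connected when their
    document overlap exceeds `threshold` of both |A| and |B|; connected
    components of the resulting graph become the final clusters."""
    items = list(bases.items())
    graph = {i: set() for i in range(len(items))}
    for (i, (_, di)), (j, (_, dj)) in combinations(enumerate(items), 2):
        common = len(di & dj)
        if common / len(di) > threshold and common / len(dj) > threshold:
            graph[i].add(j)
            graph[j].add(i)
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, phrases, docs = [start], [], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            phrases.append(items[node][0])
            docs |= items[node][1]
            stack.extend(graph[node])
        clusters.append((sorted(phrases), docs))
    return clusters
```

Run on the three snippets from the example, everything chains into a single cluster through the generic phrase "ate" – one of the limitations of the merging method mentioned above (in practice, stop-word removal and a minimum base cluster score prune such phrases first).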
Data pre-processing (in STC and not only)
• ignoring frequently occurring terms (stop words)
• stemming
How did we address the above for Polish?
• stop words – public sources and a private word-frequency list (Rzeczpospolita corpus)
• SAM
• custom stemming and lemmatization methods: quasi-stemmer and Lametyzator
Quasi-stemmer
• very simple
• the head-word (lexeme) is not explicit
• two terms are considered to share a stem when:
  • they share an identical prefix (k characters)
  • after removing the prefix, the remainders of both terms exist in a lookup table of allowed suffixes
• suffix table derived from the Rzeczpospolita corpus
• weaknesses of the method
  • does not handle alternations
  • the 'same stem' relation is not transitive
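The rule above can be expressed in a few lines. A minimal sketch under my own assumptions: the longest common prefix is required to be at least k characters, and the suffix table below is a toy stand-in for the one derived from the Rzeczpospolita corpus.

```python
def same_stem(a, b, suffixes, k=3):
    """Quasi-stemmer relation sketch: terms a and b are treated as sharing
    a stem when their longest common prefix is at least k characters and
    both remainders occur in a table of allowed inflectional suffixes.
    As noted above, the relation is not transitive and alternations
    (e.g. pies/psa) are not handled."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    if i < k:
        return False
    return a[i:] in suffixes and b[i:] in suffixes

# Toy suffix table (hypothetical entries, not the real corpus-derived one)
SUFFIXES = {"", "a", "em", "y", "owi", "ami"}
```

For example, "komputera" and "komputerem" share the prefix "komputer" and leave the remainders "a" and "em", both in the table, so they are considered inflected forms of one stem.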
[Lame]tyzator
• inflected and base forms generated using the ispell-pl dictionary
• compressed to a finite state automaton
• advantages
  • very fast
  • large word coverage (1.4 million? src: ispell-pl)
  • open source (dictionary: GPL, Java code: free)
• weaknesses
  • only words in the dictionary can be analyzed
  • contains erroneous entries (betoniarka → [beton])
  • no tags (stemming only)
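Functionally, the tool is a form-to-lemma lookup. A sketch of the idea, with a plain dict standing in for the compressed finite state automaton and hypothetical toy entries in place of the ispell-pl data:

```python
def make_dictionary_stemmer(form_to_lemma):
    """Dictionary-based stemming in the spirit of Lametyzator: every known
    inflected form maps to its base form.  The real tool stores this
    mapping in a finite state automaton built from the ispell-pl
    dictionary; out-of-dictionary words cannot be analyzed, so we fall
    back to returning the word unchanged."""
    def stem(word):
        return form_to_lemma.get(word, word)
    return stem

# Hypothetical toy entries for illustration only
stem = make_dictionary_stemmer({
    "komputery": "komputer",
    "komputerem": "komputer",
    "psa": "pies",  # alternation handled, unlike the quasi-stemmer
})
```

Because the dictionary lists forms explicitly, alternations pose no problem, but coverage is bounded by the dictionary: unknown words pass through unanalyzed.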
The experiment: measuring clustering quality
• existing approaches
  • precision/recall – lack of test data
  • user surveys – subjective, hard to involve a large number of participants
  • user interface efficiency measures (Zamir)
The experiment: measuring clustering quality
• Byron E. Dom's measure of clustering quality
  • entropy-based
  • measures the difference between the 'ideal' clustering C and a given clustering K
  • Q2 = 1 ⇔ C and K are identical
  • Q2 = 0 ⇔ groups in K do not carry any information about groups in C
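To make the entropy-based idea concrete, here is an illustrative proxy, not Dom's exact Q2 (which additionally includes a model-cost term penalizing over-splitting): 1 − H(C|K)/H(C), which exhibits the same endpoints described above.

```python
from collections import Counter
from math import log2

def entropy_quality(ideal, produced):
    """Illustrative entropy-based quality proxy in the spirit of Dom's
    measure (NOT the exact Q2 formula): 1 - H(C|K) / H(C), where C is the
    ideal clustering and K the produced one.  Reaches 1 when every
    produced cluster is pure with respect to C, and 0 when K carries no
    information about C.  Inputs are cluster labels, one per document."""
    n = len(ideal)
    joint = Counter(zip(produced, ideal))
    k_marg = Counter(produced)
    c_marg = Counter(ideal)
    h_c = -sum(m / n * log2(m / n) for m in c_marg.values())
    if h_c == 0.0:  # a single ideal group: nothing to recover
        return 1.0
    h_c_given_k = -sum(m / n * log2(m / k_marg[k])
                       for (k, c), m in joint.items())
    return 1.0 - h_c_given_k / h_c
```

Note the proxy is invariant to cluster label renaming, as any such measure must be, since cluster identifiers are arbitrary.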
The experiment: assumptions
• clustering of 1:1 type (partitioning)
• binary document-to-cluster membership
• flat structure of clusters (no hierarchy)
The experiment: input data and ground truth
• a set of 100 results for each of two queries (inteligencja and odkrywanie wiedzy) was downloaded
• manual clustering of this set was performed by 5 individuals (experts)
• the ground-truth set was obtained by unifying the results from all experts
• the large number of inconsistencies in manual clustering shows the problem is indeed difficult (only about 50% of assignments were fully consistent among all experts)
• the experiment was later extended to cover more queries (2 in Polish and 4 in English)
The experiment: configurations
• pre-processing configurations
  • for Polish:
    • no stemming, all words
    • quasi-stemmer, all words
    • quasi-stemmer, stop words ignored
    • Lametyzator, all words
    • Lametyzator, stop words ignored
  • for English: as above, with the Porter algorithm used for stemming
• a wide spectrum of values for the control thresholds (minimum base cluster score and merge threshold)
Results
[Figure: distribution of Q0 for a constant merge threshold (0.6), query: inteligencja. Series: no stemming/no stopwords, quasi-stemming/no stopwords, quasi-stemming/stopwords, dictionary stemming/no stopwords, dictionary stemming/stopwords; Q0 (≈0.40–0.54) plotted against the minimum base cluster score (0.20–9.80).]
Results (contd)
[Figure: distribution of Q0 for a constant merge threshold (0.6), query: odkrywanie wiedzy. Series: no stemming/no stopwords, quasi-stemming/no stopwords, quasi-stemming/stopwords, dictionary stemming/no stopwords, dictionary stemming/stopwords; Q0 (≈0.54–0.64) plotted against the minimum base cluster score (0.20–9.80).]
Results (contd)
[Figure: distribution of Q0 for a constant merge threshold (0.6), query: salsa. Series: no stemming/no stopwords, no stemming/stopwords, stemming/no stopwords, stemming/stopwords; Q0 (≈0.27–0.45) plotted against the minimum base cluster score (0.20–9.80).]
Results – thresholds and quality
[Figure: surface plot of q2 (0.00–0.40) over the minimum base cluster score (0.20–9.80) and the merge threshold (0.30–0.98), query: logika rozmyta.]
Results – thresholds and clusters number
[Figure: surface plot of the number of final clusters (0–30) over the minimum base cluster score (0.20–9.80) and the merge threshold (0.38–0.94), query: logika rozmyta.]
Conclusions (general)
• STC seems to be sensitive to languages with rich inflection
• stemming and ignoring stop words improved the quality of results (within our assumptions and quality measure)
• even simple pre-processing methods yielded significant improvement (quasi-stemmer)
Conclusions (STC-specific)
• a low base cluster score and merge threshold decrease the stability of the quality measure
• the base cluster score strongly affects the number of final clusters
• a high base cluster score leads to highly distinctive, but potentially obvious, clusters
Current work
• other algorithms (not phrase-based)
  • derived from Latent Semantic Indexing
  • hierarchical methods
• search results clustering framework – Carrot²
Carrot²
• in the beginning…
  • reference STC implementation
• now
  • many algorithms
  • distributed architecture
  • data-driven components (XML)
  • ease of debugging and component integration
  • active open source project
Become part of the project http://www.cs.put.poznan.pl/dweiss/carrot