TALPGeoIR
TALP at GeoCLEF 2007: Using Terrier with Geographical Knowledge Filtering
Daniel Ferrés and Horacio Rodríguez
TALP Research Center, Universitat Politècnica de Catalunya
CLEF 2007, 21 September, Budapest, Hungary
Outline
1. Introduction
2. System Overview
3. Document Retrieval
4. Experiments
5. Conclusions
TALPGeoIR
- A Geographical Information Retrieval (GIR) system that combines thematic and geographical searches.
- An improved version of TALPGeoIR 2006 [ferres-2006].
- Motivation at GeoCLEF 2007:
  - Using a state-of-the-art IR system: Terrier [Ounis-2006].
  - Using geographical knowledge to improve standard IR results.
System Overview
1. Introduction
2. System Overview
  - Geographical Resources
  - Geographical Thesaurus
  - Collection Pre-processing
  - Shape Files Toolbox
3. Document Retrieval
4. Experiments
5. Conclusions
Geographical Knowledge Base
- Gazetteers:
  - GEOnet Names Server (GNS): 5.3 million entries.
  - Geographic Names Information System (GNIS): 39,906 entries (US Concise subset).
  - GeoWorldMap (Geobytes Inc.): 40,594 entries.
  - World Gazetteer: 29,924 cities.
Geographical Thesaurus
- Information for each geographical entry: feature name, feature type base, geo-ontology parent, coordinates, (population).
- Alexandria Digital Library (ADL) Feature Type Thesaurus: 575 features [hill-2000].
- Disambiguation Hierarchy: continent, sub-continent, capital, country, region (state), sea, summit, river, county (province), other.
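The thesaurus entry and the disambiguation hierarchy above can be pictured with a minimal sketch. The names `GeoEntry` and `disambiguate` are hypothetical, the sample entries and their coordinates are illustrative only, and reading the hierarchy as a priority order (earlier types win) is an assumption rather than something the slide states.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Feature types in the order listed on the slide.
# Assumption: earlier types take priority when a name is ambiguous.
DISAMBIGUATION_HIERARCHY = [
    "continent", "sub-continent", "capital", "country", "region",
    "sea", "summit", "river", "county", "other",
]


@dataclass(frozen=True)
class GeoEntry:
    """One thesaurus entry with the fields named on the slide (hypothetical layout)."""
    name: str
    feature_type: str
    parent: Optional[str]              # geo-ontology parent
    coordinates: Tuple[float, float]   # (latitude, longitude)
    population: Optional[int] = None   # optional, as on the slide


def disambiguate(candidates: List[GeoEntry]) -> GeoEntry:
    """Return the candidate whose feature type ranks highest in the hierarchy."""
    def rank(entry: GeoEntry) -> int:
        if entry.feature_type in DISAMBIGUATION_HIERARCHY:
            return DISAMBIGUATION_HIERARCHY.index(entry.feature_type)
        return len(DISAMBIGUATION_HIERARCHY)  # unknown types rank last
    return min(candidates, key=rank)


# Illustrative entries for an ambiguous name (coordinates are rough).
georgia_state = GeoEntry("Georgia", "region", "United States", (32.7, -83.4))
georgia_country = GeoEntry("Georgia", "country", "Asia", (42.0, 43.5))
chosen = disambiguate([georgia_state, georgia_country])
```

Under that assumption, `chosen` is the country entry, because "country" precedes "region" in the hierarchy.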
Collection Pre-processing
- Linguistic Pre-processing:
  - Part-of-speech (POS) tags: TnT [brants-2000].
  - Lemmas: WordNet Lemmatizer [fellbaum-1998].
  - Named Entities: Maximum Entropy-based NERC (trained on the CoNLL 2003 English dataset).
- Geographical Pre-processing with GeoKB.
- Indexing:
  - Geographical Index: feature type, geo-ontology path information, and coordinates.
  - Textual Index: lemmatized content of the documents, without added geographical information.
Shape Files Toolbox
- [pouliquen-2004] propose the use of a publicly available database of 'shape files' for countries.
- 'Shape files' encode polygons that represent the 'border' of an area.
- Our main features with shape files:
  - 9-grid zone division (North, East, North-East, ...).
  - Close/near points around a point P.
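The two toolbox features can be sketched as follows. This is a minimal illustration under stated assumptions, not the toolbox's actual code: the 3x3 grid is laid over the bounding box of the country polygon, "near" means within a great-circle radius, and the names `grid_zone`, `haversine_km`, and `near_points` are hypothetical. Bounding boxes that cross the antimeridian are not handled.

```python
import math

# Zone labels indexed as ZONES[row][col], with row 0 the southernmost third
# and col 0 the westernmost third of the bounding box.
ZONES = [
    ["South-West", "South", "South-East"],
    ["West", "Center", "East"],
    ["North-West", "North", "North-East"],
]


def grid_zone(lat, lon, bbox):
    """Return the 9-grid zone of (lat, lon) inside bbox, or None if outside.

    bbox is (min_lat, min_lon, max_lat, max_lon), assumed to be the
    bounding box of the polygon read from a shape file.
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
        return None

    def cell(value, low, high):
        if high == low:
            return 1  # degenerate box: treat everything as the middle cell
        return min(int(3 * (value - low) / (high - low)), 2)

    return ZONES[cell(lat, min_lat, max_lat)][cell(lon, min_lon, max_lon)]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def near_points(p, points, radius_km):
    """Return the points that lie within radius_km of point p (lat, lon)."""
    return [q for q in points if haversine_km(p[0], p[1], q[0], q[1]) <= radius_km]
```

For example, with `bbox = (0, 0, 30, 30)`, the point `(25, 15)` falls in the upper-middle cell, so a query such as "North of the country" would match places classified that way.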
Document Retrieval
1. Introduction
2. System Overview
  - Geographical Resources
  - Geographical Thesaurus
  - Collection Pre-processing
  - Shape Files Toolbox
3. Document Retrieval
  - Thematic IR
  - Geographical IR
  - Document Filtering
4. Experiments
  - Results
5. Conclusions
  - Future Work
Terrier Configuration
- Thematic document retrieval over Terrier.
- All keywords are used for search (only stopword removal).
- Lemma searching.
- Selection of schemas based on experiments over the GeoCLEF 2006 data set:
  - TF-IDF vs DFR vs BM25
  - Porter Stemmer vs No stemmer
  - Blind Relevance Feedback (docs=10; terms=40) vs No Relevance Feedback
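The blind relevance feedback setting (docs=10; terms=40) can be illustrated with a simplified sketch. It is not Terrier's implementation: Terrier weights expansion terms with its own models, whereas this sketch ranks them by raw frequency in the top-ranked documents. The function name `expand_query` and the choice to append up to `n_terms` new terms to the original query are assumptions made for illustration.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, n_docs=10, n_terms=40):
    """Blind relevance feedback, simplified.

    ranked_docs is a list of documents in retrieval order, each given as a
    list of (lemmatized) terms. The top n_docs are assumed relevant, and the
    n_terms most frequent terms not already in the query are appended.
    """
    counts = Counter()
    for doc_terms in ranked_docs[:n_docs]:
        counts.update(doc_terms)

    original = list(dict.fromkeys(query_terms))  # de-duplicate, keep order
    # most_common() sorts by frequency; ties keep first-seen order.
    extra = [term for term, _ in counts.most_common() if term not in original]
    return original + extra[:n_terms]


# Toy example with small limits so the effect is visible.
docs = [["flood", "river", "europe"], ["flood", "damage"], ["river", "flood"]]
expanded = expand_query(["flood"], docs, n_docs=2, n_terms=2)
```

With the toy data, only the first two documents feed the expansion, so `expanded` gains two new terms drawn from them.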
Geographical IR using GKBs
- Obtains the set of documents that are geographically relevant.
- Uses the geographical places and geographical feature types detected in the topics to perform the search.
- The feature types can be expanded with a list of synonyms extracted from GNS.
- Relaxed geographical search policy (e.g. a query that contains U.S. retrieves documents that contain New York).
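One way to picture the relaxed search policy is as a prefix test on geo-ontology paths: a document place matches when the query place is one of its ancestors (or the place itself). The sketch below assumes that reading. The path layout, the index structure, and the names `is_within` and `geo_relevant_docs` are hypothetical.

```python
def is_within(query_path, place_path):
    """True if place_path lies at or below query_path in the geo-ontology."""
    return tuple(place_path[:len(query_path)]) == tuple(query_path)


def geo_relevant_docs(query_path, geo_index):
    """Return the ids of documents mentioning a place inside the query place.

    geo_index maps a document id to the list of geo-ontology paths of the
    places detected in that document (illustrative structure).
    """
    return {
        doc_id
        for doc_id, paths in geo_index.items()
        if any(is_within(query_path, path) for path in paths)
    }


# Illustrative index: paths run from the most general to the most specific place.
geo_index = {
    "doc1": [("America", "United States", "New York")],
    "doc2": [("Europe", "Hungary", "Budapest")],
    "doc3": [("America", "United States")],
    "doc4": [("America",)],
}
us_docs = geo_relevant_docs(("America", "United States"), geo_index)
```

Here a query about the U.S. retrieves `doc1` (New York) and `doc3`, but not `doc4`, which only mentions the enclosing continent.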
Document Filtering
- Documents retrieved by Terrier that were also retrieved by the GKBs have priority over the other documents retrieved by Terrier.
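The filtering rule amounts to a stable re-ranking of the Terrier list. A minimal sketch, assuming that Terrier's original order is preserved within each of the two groups (the slide does not say how ties inside a group are ordered) and using the hypothetical name `filter_rerank`:

```python
def filter_rerank(terrier_ranking, geo_docs):
    """Move documents also found by the geographical search to the front.

    terrier_ranking: document ids in Terrier's ranked order.
    geo_docs: set of document ids retrieved through the GKBs.
    Terrier's relative order is kept inside each group (assumption).
    """
    prioritized = [doc for doc in terrier_ranking if doc in geo_docs]
    remaining = [doc for doc in terrier_ranking if doc not in geo_docs]
    return prioritized + remaining


final_ranking = filter_rerank(["d1", "d2", "d3", "d4"], {"d4", "d2", "d9"})
```

Note that the geographical set never adds documents to the list (`d9` is ignored); it only reorders what Terrier returned.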
GeoCLEF 2007 Experiments
Table 1. Description of the TALPGeoIR experiments at GeoCLEF 2007.
Run     IR System          Relevance Feedback    Border Filtering
TD1     Terrier            yes                   -
TD2     Terrier & GeoKB    yes                   -
TDN1    Terrier            yes                   -
TDN2    Terrier & GeoKB    yes                   -
TDN3    Terrier & GeoKB    -                     yes
Global Results
Table 2. TALPGeoIR results at GeoCLEF 2007.
Run     IR System          AvgP.     R-Prec.   Recall (%)
TD1     Terrier            0.2711    0.2847    91.23
TD2     Terrier & GeoKB    0.2850    0.3170    90.30
TDN1    Terrier            0.2625    0.2526    93.23
TDN2    Terrier & GeoKB    0.2754    0.2895    90.46
TDN3    Terrier & GeoKB    0.2787    0.2890    92.61
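The size of the differences in Table 2 is easier to judge as relative change. The short sketch below computes it from the AvgP. column only; the helper name `relative_gain` is hypothetical, and no statistical significance is implied.

```python
# Average precision values copied from Table 2.
avg_precision = {
    "TD1": 0.2711,
    "TD2": 0.2850,
    "TDN1": 0.2625,
    "TDN2": 0.2754,
    "TDN3": 0.2787,
}


def relative_gain(baseline, system):
    """Relative change of system over baseline, in percent."""
    return 100.0 * (system - baseline) / baseline


# Each pair compares a run with the run it extends or modifies.
comparisons = [("TD1", "TD2"), ("TDN1", "TDN2"), ("TDN2", "TDN3")]
gains = {
    (base, run): relative_gain(avg_precision[base], avg_precision[run])
    for base, run in comparisons
}

for (base, run), gain in gains.items():
    print(f"{run} vs {base}: {gain:+.2f}%")
```

Adding the GeoKB to Terrier corresponds to a relative gain in average precision of roughly 5% in both the TD and TDN settings, while TDN3 differs from TDN2 by a little over 1%.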
Conclusions
- Geographical knowledge improved standard IR.
- The approach with Terrier and the GeoKB was slightly better in terms of MAP than the one with Terrier alone.
- The Border Filtering approach, applied without Relevance Feedback, slightly improved the results in MAP and Recall.
- Good results at GeoCLEF 2007.
Future Work
- A precision-oriented toponym resolution (disambiguation) algorithm.
- Experiments with the Divergence From Randomness schema.
- Improvement of the Shape Files toolbox and the Border Filtering algorithm.
Thanks!
Thanks for your attention!
Questions?