Indexing of textual databases based on lexical resources: A case study for Serbian


  1. Indexing of textual databases based on lexical resources: A case study for Serbian. Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović, University of Belgrade, Serbia. 1st International KEYSTONE Conference (IKC 2015), Coimbra, Portugal, 8-9 September 2015

  2. Presentation outline • Motivation • Current solution • Improved solution ▫ Resources used ▫ Architecture of the new system • Evaluation • Conclusion and future work

  3. Motivation • The Geological Information System of Serbia was launched in 2004 ▫ general geology, exploration of mineral deposits, hydrogeology, engineering geology ▫ users (professionals or ordinary citizens) ▫ geo-portal, cartographic content, multimedia, dictionaries and textual databases • FoDiB - geological project documentation with structured descriptions of over 4,900 national geological projects from 1956 to the present day • Metadata: ▫ title, year, location, company, authors, abstract, keywords ▫ prospects, application of the mineral resource and possibilities for its use ▫ field works, geomechanics, mining, geodesic works, prospective exploration • The database contains project summaries with about 30% of the text of each project, which represent the textual content of the complete reports well • Digitization and full-text archiving are foreseen, so this approach will be extended to and implemented on the future full-text database

  4. Present solution • Example query: (ugalj OR “ugljeni basen”) AND kostolac, i.e. (coal OR "coal basin") AND Kostolac • Faceted search: ▫ Mineral resource ▫ Location ▫ Author ▫ Year ▫ General

  5. Current solution • Search by scanning the text ▫ the appropriate fields are scanned for the given keywords ▫ word boundaries are not taken into consideration • Search results are ranked on the basis of weight factors assigned to individual fields • Each search criterion is matched against several different attributes in the database, and the weight factors determine each attribute's relevance for the result set • Example of a search criterion: location ▫ Weights are: Municipality 8, County 7, Title 4, Keywords 3, Abstract 2, Appendices 1. ▫ For the location criterion with the keyword Bor: Bor in the Municipality field is ranked higher than Bor in the Abstract field.
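The field-weighted ranking described on this slide can be illustrated with a minimal Python sketch. The weights are the ones listed for the location criterion; everything else is an assumption made for illustration: the record layout, the function names, and in particular the choice to sum the weights of all matching fields (the slide does not say how weights are combined).

```python
# Field weights for the "location" criterion, as listed on the slide.
LOCATION_WEIGHTS = {
    "municipality": 8, "county": 7, "title": 4,
    "keywords": 3, "abstract": 2, "appendices": 1,
}

def location_score(record, keyword, weights=LOCATION_WEIGHTS):
    """Sum the weights of every field whose text contains the keyword.

    Matching is a plain case-insensitive substring test, so word
    boundaries are ignored, as in the current system.
    Summing the weights is an assumption, not the documented rule.
    """
    needle = keyword.lower()
    return sum(weight for field, weight in weights.items()
               if needle in record.get(field, "").lower())

def rank(records, keyword):
    """Return matching records, highest score first."""
    scored = [(location_score(r, keyword), r) for r in records]
    hits = [(s, r) for s, r in scored if s > 0]
    return [r for s, r in sorted(hits, key=lambda pair: pair[0], reverse=True)]

# Hypothetical records, for illustration only.
records = [
    {"id": 1, "abstract": "Exploration near Bor"},
    {"id": 2, "municipality": "Bor"},
    {"id": 3, "title": "Kostolac"},
]
ranked_ids = [r["id"] for r in rank(records, "Bor")]
```

Because the match ignores word boundaries, a keyword such as Bor also matches inside unrelated words (for example "laboratory"), which is one motivation for the token-based indexing introduced later.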

  6. Improved solution • One of the problems of full-text search in Serbian is its rich morphology • Keywords are in the nominative singular, while in the texts they take different inflectional forms • Normalization of morphological forms for document indexing and query processing ▫ stemming: several stemmers are available, one of them open source ▫ statistical lemmatization (TreeTagger, trained on a corpus of contemporary Serbian, not appropriate for technical texts) ▫ lemmatization based on morphological electronic dictionaries and finite-state transducers for Serbian
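Why normalization matters can be shown with a small Python sketch of dictionary-based lemmatization. It is only a toy: the lookup table below is a hand-written, hypothetical stand-in for the morphological e-dictionaries, and the function names are invented for this example.

```python
# Toy form-to-lemma table (hypothetical stand-in for a morphological
# e-dictionary): inflected Serbian noun forms mapped to their lemma.
FORM_TO_LEMMA = {
    "ugalj": "ugalj", "uglja": "ugalj", "uglju": "ugalj", "ugljem": "ugalj",
    "zlato": "zlato", "zlata": "zlato", "zlatu": "zlato", "zlatom": "zlato",
}

def lemmatize(tokens):
    """Map each token to its lemma; unknown tokens are kept lower-cased."""
    return [FORM_TO_LEMMA.get(t.lower(), t.lower()) for t in tokens]

def matches(query_keyword, text):
    """True if the keyword's lemma occurs among the lemmas of the text."""
    query_lemma = lemmatize([query_keyword])[0]
    return query_lemma in lemmatize(text.split())

# The keyword "ugalj" (coal) does not occur literally in the text,
# but its genitive form "uglja" does.
found = matches("ugalj", "rezerve uglja")
```

Unknown tokens simply fall through unchanged, which mirrors the problem noted in the evaluation: technical terms missing from the dictionaries cannot be normalized.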

  7. Resources used in improved solution • NLP for Serbian based on lexical resources ▫ electronic dictionaries: 135,000 simple-word lemmas + 13,000 multi-word units (MWUs) ▫ local grammars using finite-state transducers (FSTs): 1,000 inflectional transducers ▫ 3,500,000 inflected forms generated automatically • Named entity recognition (NER): names of persons, locations and organizations, time, date, money and percentage • The Serbian NER system is a handcrafted rule-based system based on e-dictionaries and local grammars in the form of FSTs • For more information about lexical resources and tools see: http://jerteh.rs

  8. Resources used in improved solution • The whole collection consists of 4,902 documents, 2,880,229 tokens (900,403 simple word forms). • Almost all documents contained at least one named entity (NE) • On the average, 4 NEs of all types were recognized per document, with as many as 47 NEs for one of the documents
     Table 1. Distribution of three top-level NEs: persons, locations and organizations
     NE type      | Frequency | Average per doc | % of the text
     person       | 11,991    | 2.45            | 1.33
     location     | 49,414    | 10.08           | 5.49
     organization | 2,882     | 0.59            | 0.32
     total        | 64,287    | 13.11           | 7.14

  9. Architecture of the new system [Diagram. Query side: user query, transliteration (Latin->Cyrillic), tokenization and lemmatization, query representation. Document side: geological documents, transliteration, tokenization, POS tagging and lemmatization, NE extraction, phrase chunking, bag-of-words document representation, indexed documents. Shared components: lexical resources (e-dictionaries, local grammars) and NLP tools (Leximir, Unitex). Matching: scoring and ranking, producing the retrieved documents, with feedback.]
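The transliteration step (Latin->Cyrillic) in the architecture can be sketched in Python as follows. This is an illustrative sketch, not the system's actual component: it uses the standard Serbian Latin-to-Cyrillic letter correspondence, assumes precomposed characters (such as "č" as a single code point), and always treats "lj", "nj" and "dž" as digraphs, which is wrong for the rare words where those letters belong to separate morphemes.

```python
# Digraphs must be checked before single letters.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}
SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}

def to_cyrillic(text):
    """Transliterate Serbian Latin text to Cyrillic, preserving case."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2].lower()
        if pair in DIGRAPHS:
            cyr, step = DIGRAPHS[pair], 2
        else:
            # Characters outside the alphabet (digits, spaces) pass through.
            cyr, step = SINGLE.get(text[i].lower(), text[i]), 1
        out.append(cyr.upper() if text[i].isupper() else cyr)
        i += step
    return "".join(out)

example = to_cyrillic("ugalj Kostolac")
```

Applying the same transliteration to both queries and documents means that a query typed in either script can be matched against text stored in the other.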

  10. Architecture of the new system • BOW (bag of words) - representation of the document by a set of content (non-grammatical) words (nouns, adjectives, adverbs and acronyms), each followed by its frequency • The text is lemmatized, lemmas (simple and multi-word) are extracted and their frequencies are calculated • In this approach 12,204 simple lemmas (with 450,418 occurrences) and 271 MWUs (with 6,525 occurrences) were extracted
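A bag-of-words representation of this kind can be sketched in a few lines of Python. The sketch assumes the text has already been lemmatized and POS-tagged upstream, with multi-word units arriving as single lemmas; the tag names ("N", "A", "ADV", "ACR") and the sample input are hypothetical, not the tagset of the actual tools.

```python
from collections import Counter

# POS tags treated as content words (hypothetical tag names).
CONTENT_TAGS = {"N", "A", "ADV", "ACR"}

def bag_of_words(tagged_lemmas):
    """Count lemmas of content words; function words are dropped.

    `tagged_lemmas` is a list of (lemma, pos_tag) pairs produced by an
    upstream lemmatizer and tagger. A multi-word unit is expected as one
    lemma string, e.g. "ugljeni basen".
    """
    return Counter(lemma for lemma, tag in tagged_lemmas
                   if tag in CONTENT_TAGS)

# Hypothetical tagger output for a short passage.
tagged = [
    ("istraživanje", "N"), ("zlato", "N"), ("u", "PREP"),
    ("ugljeni basen", "N"), ("i", "CONJ"), ("zlato", "N"),
]
bow = bag_of_words(tagged)
```

The resulting counts are the raw term frequencies that a term-weighting scheme can then turn into index weights.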

  11. One document dealing with gold

  12. Term weights • First implementation: tf.idf • Further development included: ▫ tfc.tfc - a modification of tf.idf with cosine normalization ▫ tfc.nfc - a term weighting algorithm with a normalized tf factor for the query term weights ▫ lnc.ltc - a measure where ‘l’ stands for weights with a logarithmic tf component ▫ lnu.ltu - where normalization is based on the number of unique words in the text ▫ the measure used in the Inquery system. Source: Hiemstra, D.: Using language models for information retrieval. Taaluitgeverij Neslia Paniculata (2001)
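As an illustration of the first variant, here is a minimal Python sketch of tf.idf weighting with cosine normalization, applied to both the document and the query (the tfc.tfc pattern). It is a textbook-style sketch under stated assumptions, not the system's implementation: idf is taken as log(N/df) with the natural logarithm, documents are lists of lemmas, and all function names are invented here.

```python
import math
from collections import Counter

def idf_table(docs):
    """idf(t) = log(N / df(t)), where df counts documents containing t."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def tfc_vector(bag, idf):
    """tf * idf weights, scaled to unit length (cosine normalization)."""
    raw = {t: tf * idf.get(t, 0.0) for t, tf in Counter(bag).items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}

def score(query_vec, doc_vec):
    """Dot product of two unit vectors, i.e. their cosine similarity."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

# Hypothetical lemmatized documents.
docs = [["zlato", "bor", "zlato"], ["ugalj", "kostolac"], ["bakar", "bor"]]
idf = idf_table(docs)
doc_vecs = [tfc_vector(d, idf) for d in docs]
query_vec = tfc_vector(["zlato"], idf)
scores = [score(query_vec, v) for v in doc_vecs]
```

Because both vectors have unit length, long documents do not get an advantage merely from containing more terms; the other listed variants change how the tf component is computed or what the normalization is based on.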

  13. Evaluation • First evaluation: the entire collection of documents and a set of 10 information needs • For query selection, the log of the existing system was used, as well as suggestions of geologists on the most common information needs. • It turned out that the most frequent requests are for ▫ a mineral resource type (copper, gold, coal) ▫ a location ▫ a geological event (landslide, earthquake) ▫ a research company

  14. Evaluation • Precision P = tp/(tp + fp), recall R = tp/(tp + fn), and F-measure F = 2*P*R/(P + R) • Precision-recall curve for the query (zlato OR Au) AND (Bor OR Borski okrug), where zlato is Serbian for gold and Borski okrug is the Bor District [figure: two curves, retrieval without index and retrieval with index] • Precision of the old system is significantly better among the first-ranked documents • Recall is better with the new system: 39 of the first 80 documents retrieved by the new system were relevant, compared to 25 for the old one
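The three measures can be computed directly from the set of retrieved documents and the set of relevant documents, as in the Python sketch below. The function name and the sample document ids are hypothetical. For the figures quoted on the slide, the share of relevant documents among the first 80 is 39/80 = 0.4875 for the new system and 25/80 = 0.3125 for the old one.

```python
def precision_recall_f(retrieved, relevant):
    """Return (P, R, F) for one query.

    tp: retrieved and relevant; fp: retrieved but not relevant;
    fn: relevant but not retrieved.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    fp = len(retrieved - relevant)
    fn = len(relevant - retrieved)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical document ids, for illustration only.
p, r, f = precision_recall_f(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```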

  15. Evaluation • Comparative graph of the relationship between precision and recall ▫ interpolated average precision for 11 levels of recall: 0.0, 0.1, 0.2, ..., 0.9, 1.0
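The 11-point interpolated precision for a single query can be sketched in Python as below, using the usual definition: the interpolated precision at recall level r is the highest precision observed at any recall greater than or equal to r. The function name and input format (a ranked list of relevance flags plus the total number of relevant documents, assumed to be greater than zero) are choices made for this sketch. Averaging these 11 values over all queries gives the curve shown in such comparative graphs.

```python
def eleven_point_interpolated(relevance, total_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one ranking.

    `relevance` lists, in rank order, whether each retrieved document is
    relevant; `total_relevant` is the number of relevant documents in the
    collection for this query (must be > 0).
    """
    points = []  # (recall, precision) after each rank
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

# Hypothetical ranking: relevant documents at ranks 1 and 3, two in total.
curve = eleven_point_interpolated([True, False, True, False], 2)
```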

  16. Average Precision per query and Mean Average Precision (MAP) for the old and the new system [table]. A space in a query stands for an OR operator and a semicolon for an AND operator (relevant for the old system).
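Average Precision and MAP can be computed as in the following Python sketch, which uses the standard definitions: AP averages the precision at each rank where a relevant document appears, divided by the total number of relevant documents, and MAP is the mean of AP over all queries. Function names and the sample rankings are hypothetical.

```python
def average_precision(relevance, total_relevant):
    """AP for one ranked result list.

    `relevance` lists, in rank order, whether each retrieved document is
    relevant; relevant documents never retrieved contribute zero.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0

def mean_average_precision(runs):
    """MAP over a list of (relevance, total_relevant) pairs, one per query."""
    aps = [average_precision(rel, n) for rel, n in runs]
    return sum(aps) / len(aps) if aps else 0.0

# Two hypothetical queries.
ap_first = average_precision([True, False, True], 2)
map_value = mean_average_precision([([True, False, True], 2),
                                    ([False, True], 1)])
```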

  17. Evaluation • The biggest problems of the new system are: ▫ specific technical terms that are not found in the electronic dictionaries ▫ quite a number of typographical errors in the document collection • These shortcomings can be rectified by: ▫ correcting errors (based on the list of words not recognized by the vocabulary) ▫ continuously enhancing the vocabulary by adding new words • Evaluation was time-consuming due to: ▫ the lack of documents previously marked as relevant for the queries ▫ no software support for evaluation; everything was done manually in Excel

  18. Conclusion and future work • Advantages of the current solution: ▫ simple to apply ▫ performs well for certain types of queries • The new solution based on pre-indexing outperforms the present one, but it can be further improved by: ▫ enriching the morphological e-dictionaries with terms from the geological domain ▫ adapting NER to the new domain and text type (technical texts rather than newspapers) ▫ experimenting with different term weight measures ▫ experimenting with different ways of comparing document and information need representations • Further research will be done by: ▫ applying the new solution to other textual collections ▫ developing a geodatabase for visualizing the locations of recognized named entities • An analysis of queries in full-sentence form is planned • Query expansion will be integrated by adding synonyms from available resources, such as the geologic dictionary for terminological query terms and WordNet for more general terms.

