
Information Retrieval Based on Extraction of Domain Specific - PowerPoint PPT Presentation



  1. “Information Retrieval Based on Extraction of Domain Specific Significant Keywords and Other Relevant Phrases from a Conceptual Semantic Network Structure”
      Mohammad Moinul Hoque, Prakash Poudyal, Teresa Goncalves and Paulo Quaresma
      University of Evora, Team Evora, Evora, Portugal

  2. Introduction
      - This paper presents a functional approach to the problem of building an information retrieval system driven by a narrative search text.
      - The presented system retrieves documents from the background collection by extracting:
          - domain-specific significant keywords
          - other relevant phrases from a given narrative search text
      - The narrative search text can be a description or a scenario.

  3. Proposed Approach
      - A domain-specific Conceptual Semantic Network (CSN) is built.
      - Significant keywords are extracted from the narrative search text with the help of the CSN to form an initial search query.
      - Alternative sets of search queries are also formulated by:
          - expanding the initial query built from the CSN
          - adding synonymous terms of the retrieved keywords/phrases using WordNet synonym sets

  4. Domain-specific Conceptual Semantic Network
      - The corpus we are dealing with is domain specific (legal documents).
      - The search space is also domain dependent.
      - We manually build a potential model of a concept-based semantic network structure (CSN).
      - The CSN contains various conceptual terms/phrases related to the respective domains and the connections among them.
      - These concept terms/phrases are extracted from Wikipedia using a crawler application developed for the purpose.

  5. Domain-specific Conceptual Semantic Network
      [Figure: network diagram with the nodes Marriage, Divorce, Dowry, Torture, Child Custody, Maintenance, Case File and Re-Marriage]
      A partial view of the domain-specific (Hindu Marriage and Divorce Law domain) Conceptual Semantic Network

  6. Domain-specific Conceptual Semantic Network
      [Figure: network diagram with the nodes Consumer Protection, Manufacturing defect, Warranty claim, Warranty refusal, Return and Replacement]
      A partial view of the domain-specific (Consumer Protection Law domain) Conceptual Semantic Network
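
One way to picture the CSN described on slides 4 to 6 is as an undirected graph of concept terms. The sketch below is only an illustration: the class name `ConceptNetwork` and its methods are hypothetical, and the edges are invented for the example, because the slides list the nodes but the connections in the figures did not survive extraction.

```python
from collections import defaultdict


class ConceptNetwork:
    """Undirected graph of concept terms/phrases (hypothetical helper, not the authors' code)."""

    def __init__(self):
        self._neighbors = defaultdict(set)

    def connect(self, a, b):
        """Record a connection between two concept terms (case-insensitive)."""
        a, b = a.lower(), b.lower()
        self._neighbors[a].add(b)
        self._neighbors[b].add(a)

    def __contains__(self, term):
        return term.lower() in self._neighbors

    def related(self, term):
        """Return the directly connected concept terms in sorted order."""
        return sorted(self._neighbors.get(term.lower(), ()))


# Node names come from slide 5; the edges are assumptions made for illustration.
csn = ConceptNetwork()
csn.connect("Marriage", "Divorce")
csn.connect("Marriage", "Dowry")
csn.connect("Marriage", "Re-Marriage")
csn.connect("Dowry", "Torture")
csn.connect("Divorce", "Child Custody")
csn.connect("Divorce", "Maintenance")
csn.connect("Divorce", "Case File")

related_to_dowry = csn.related("dowry")
```

A second network for the Consumer Protection Law domain could be built the same way from the nodes on slide 6.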

  7. Indexing the Document Corpus
      - Background data files are preprocessed first by stripping off a few data structure tags.
      - Stemming is performed on the data using the Porter Stemming Algorithm for the English language.
      - Finally, the data is indexed using an inverted index structure.
      [Figure: Background data -> Preprocess document -> Stemming -> Inverted index]
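
The three indexing steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system's implementation: `toy_stem` is a crude suffix stripper standing in for the Porter stemmer, the tag pattern assumes simple angle-bracket markup, and the two documents are made up.

```python
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")      # assumed shape of the data structure tags
TOKEN_RE = re.compile(r"[a-z]+")


def toy_stem(word):
    """Crude suffix stripper; a stand-in for the Porter stemmer, not an implementation of it."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, raw in docs.items():
        text = TAG_RE.sub(" ", raw).lower()          # 1. strip tags
        for token in TOKEN_RE.findall(text):
            index[toy_stem(token)].add(doc_id)       # 2. stem, 3. index
    return index


# Hypothetical background documents.
docs = {
    "d1": "<DOC>Dowry harassment claims</DOC>",
    "d2": "<DOC>Claim for child custody</DOC>",
}
index = build_inverted_index(docs)
```

Because both "claims" and "Claim" reduce to the same stem, a query for either form reaches both documents.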

  8. Search text analysis and processing
      - Preprocessing
          - English stop words are eliminated from the text since they carry little or no significance.
          - The text is freed from noisy symbols or characters.
          - The search text is converted into a set of sentences using the heuristic method employed by the OpenNLP API.
      - POS tagging: words in the search text are tagged with POS tags using the Stanford POS tagger.
      - Named entities such as organizations and locations are identified in the search text.
      - The domain-specific Conceptual Semantic Network is consulted to select the significant keywords from the search text.
          - Those non-stop terms/phrases are initially picked up as a possible keyword set to build an initial search query.
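
The preprocessing and keyword selection steps above can be sketched as follows. Everything here is a simplified stand-in: a regular expression replaces the OpenNLP sentence detector, the stop word list and the `CSN_TERMS` set are tiny hypothetical samples, and POS tagging and named-entity recognition are left out.

```python
import re

# Tiny illustrative lists; a real system would use a full stop word list and the full CSN.
STOP_WORDS = {"i", "am", "a", "and", "have", "my", "of", "the", "to", "want", "for", "in"}
CSN_TERMS = {"marriage", "dowry", "divorce", "child custody", "maintenance"}


def split_sentences(text):
    """Split after ., ! or ? followed by whitespace (stand-in for a sentence detector)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean(sentence):
    """Lowercase, drop noisy symbols, and remove stop words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def initial_keywords(text, csn_terms=CSN_TERMS):
    """Collect single words and two-word phrases that occur in the CSN."""
    found = set()
    for sentence in split_sentences(text):
        words = clean(sentence)
        phrases = [" ".join(pair) for pair in zip(words, words[1:])]
        found.update(term for term in words + phrases if term in csn_terms)
    return found


text = "I want the child custody and monthly maintenance. My husband filed for divorce!"
keywords = initial_keywords(text)
```

Looking up two-word phrases as well as single words lets multi-word concepts such as "child custody" match as a unit.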

  9. Search query expansion
      - From the Conceptual Semantic Network
          - Possible connections with other conceptual terms related to the initial set of keywords are analyzed and added to form an alternate set of queries.
      - Based on the part-of-speech tags of the marked keywords
          - Possible sets of synonyms are extracted using WordNet synsets.
          - These synonyms are also added to create an alternate set of queries, with the aim of improving retrieval performance.
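
Both expansion routes can be sketched with plain dictionaries. This is an illustration under assumptions: `csn_links` and `synonyms` are small hypothetical tables standing in for the CSN and for a WordNet lookup keyed by part of speech, and the function name `expand_query` is not from the slides.

```python
def expand_query(tagged_keywords, csn_links, synonyms):
    """Return (initial, csn_expanded, synonym_expanded) keyword lists.

    tagged_keywords: list of (term, pos_tag) pairs from the search text.
    csn_links: term -> related CSN terms (hypothetical table).
    synonyms: (term, pos_tag) -> synonyms (hypothetical stand-in for WordNet).
    """
    initial = [term for term, _ in tagged_keywords]

    csn_expanded = list(initial)
    for term in initial:
        for related in csn_links.get(term, []):
            if related not in csn_expanded:
                csn_expanded.append(related)

    synonym_expanded = list(initial)
    for term, pos in tagged_keywords:
        for syn in synonyms.get((term, pos), []):
            if syn not in synonym_expanded:
                synonym_expanded.append(syn)

    return initial, csn_expanded, synonym_expanded


# Illustrative data only; the tags follow the Penn Treebank style ("NN" = noun).
tagged = [("marriage", "NN"), ("dowry", "NN")]
csn_links = {"marriage": ["dowry", "divorce"], "dowry": ["harassment"]}
synonyms = {("dowry", "NN"): ["endowment"]}

initial, csn_expanded, synonym_expanded = expand_query(tagged, csn_links, synonyms)
```

Keying the synonym table by the part-of-speech tag mirrors the slide's point that the synonyms chosen depend on how the keyword was tagged.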

  10. Search text analysis and processing
      [Figure: flow diagram with the components Narrative search text, Domain-specific Conceptual Semantic Network, Preprocessing, Named Entity Recognition, POS Tagging, Initial search query, Query expansion, WordNet, Set of search queries, Query stemming and Indexed data]

  11. Generating Search Queries (Example)
      S1: “I am a Hindu girl married for over 5 years and have a 4 yr child out of my wedlock. My married life had been full of problems from the first week of marriage - most of which can be summarized as dowry related harassment, physical and mental torture and cruelty. Now my husband and family have filed for divorce mostly on the grounds of cruelty and infidelity using false allegations to malign my character and false allegation to prove that I have been a bad daughter in law. …... I want to file a FIR and complaint in Women Cell regarding my jewellery and dowry related harassment. …. I want the child custody, monthly maintenance and share in husband's or in-law's property…”

  12. Generating Search Queries (Example)
      - Our system analyzes the text and discovers the domain-dependent terms from the CSN.
          - For example, the system extracts the keyword ‘marriage’ and associates the phrase ‘full of problems’ with it.
          - The system continues to discover similar associations.
          - It consults the WordNet synonym set to add a few more synonyms of those keywords, depending on their part-of-speech tags.
      - The system creates a collection of sets containing 1…m sets, each of which is again a set of n keywords.
          - The cardinality of the sets inside the superset ranges from 2 to n.
          - These sets contain keywords from the search text and the keywords from the conceptual network connections that are associated with them.
          - Phrases appearing inside quotation marks are directly included in the collection set.

  13. Search Query generation
      Collection Set [
          {marriage, problem},
          {marriage, dowry},
          {marriage, ‘physical torture’},
          {Marriage, dowry, harassment},
          {Separation, child custody},
          {divorce, maintenance},
          {Divorce, ‘false allegation’},
          {marriage, endowment, harassment},
          {Marriage, Dowry, mental torture},
          {marriage, dowry, emotional abuse},
          {marriage, dowry, ‘verbal abuse’, ‘physical abuse’, harassment},
          {marriage, dowry, ‘verbal abuse’, ‘physical abuse’, harassment, divorce},
          {marriage, ‘physical abuse’, ‘abusive marriage’, cruelty, ‘mental torture’, separation}
      ]
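
A collection set of this shape could be produced by combining each extracted keyword with the terms associated with it. The slides do not say how the system chooses which combinations to keep, so the sketch below simply enumerates all of them up to a size limit; the function name `build_collection_set`, the `max_size` parameter, and the association table are assumptions.

```python
from itertools import combinations


def build_collection_set(associations, max_size):
    """Build keyword sets of cardinality 2..max_size.

    associations: keyword -> terms/phrases associated with it (hypothetical table).
    Each query set contains the keyword plus 1..max_size-1 associated terms.
    """
    collection = []
    for keyword, related in associations.items():
        for extra in range(1, max_size):
            for combo in combinations(related, extra):
                query = frozenset((keyword,) + combo)
                if len(query) >= 2 and query not in collection:
                    collection.append(query)
    return collection


# Terms taken from the collection set on this slide; the grouping is illustrative.
associations = {"marriage": ["problem", "dowry", "physical torture"]}
collection = build_collection_set(associations, max_size=3)
```

Multi-word phrases such as "physical torture" are kept as single elements, matching the way quoted phrases enter the collection set directly.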

  14. Ranking of the retrieved documents and selection of final set of documents
      - We adopted Lucene-based searching techniques, which use a combination of the Vector Space Model (VSM) and the Boolean Model (BM).
      - Documents are ranked and scored by the VSM, but only those retrieved documents that are approved by the BM.
      - When passing the queries, each query set inside the collection set is sent separately, and the returned sets of documents are stored with their corresponding scores.
      - The VSM score of a document d for the query q is calculated using the cosine similarity of the weighted query and document vectors V(q) and V(d).

  15. Ranking of the retrieved documents and selection of final set of documents
      $$\mathrm{CosineSimilarity}(q, d) = \frac{V(q) \cdot V(d)}{|V(q)|\,|V(d)|}$$
      where V(q) · V(d) is the dot product of the weighted vectors, and |V(q)| and |V(d)| are their Euclidean norms.
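
The cosine similarity on this slide can be computed directly over sparse term-weight vectors. The sketch below assumes each vector is a dictionary from term to weight and uses raw term counts as weights for the example; Lucene's actual weighting and normalization differ in detail, so this shows the formula only.

```python
import math


def cosine_similarity(vq, vd):
    """Cosine of the angle between two sparse term-weight vectors (dicts: term -> weight)."""
    dot = sum(weight * vd.get(term, 0.0) for term, weight in vq.items())
    norm_q = math.sqrt(sum(w * w for w in vq.values()))
    norm_d = math.sqrt(sum(w * w for w in vd.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector has no direction; treat as no similarity
    return dot / (norm_q * norm_d)


# Illustrative vectors with raw term counts as weights (an assumption).
query_vec = {"marriage": 1.0, "dowry": 1.0}
doc_vec = {"marriage": 1.0, "divorce": 1.0}
score = cosine_similarity(query_vec, doc_vec)
```

Here the two vectors share one of their two terms, so the dot product is 1 and each norm is the square root of 2, giving a score of 0.5.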

  16. Ranking Point priorities
      - Search queries with larger cardinality, in terms of the keywords they contain, are given higher priority.
      - Returned documents containing more significant keywords have the best chance of being more relevant.
      - From the result set of each query, the top 1000 highest-ranked documents (based on VSM points) are selected and expected to be relevant to the search text.
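
One possible reading of these priorities is sketched below: keep the top-scoring documents of each query, then merge the per-query lists so that results from larger queries come first. The slides do not specify how the lists are combined, so the merge order, the de-duplication, and the name `merge_results` are all assumptions.

```python
import heapq


def merge_results(results, top_k=1000):
    """Merge per-query results, giving priority to queries with more keywords.

    results: list of (query_terms, {doc_id: vsm_score}) pairs.
    Returns a list of (doc_id, query_cardinality, vsm_score) in final rank order.
    """
    ranked = []
    seen = set()
    # Larger keyword sets first; sorted() is stable, so ties keep their input order.
    for query, scores in sorted(results, key=lambda r: len(r[0]), reverse=True):
        top = heapq.nlargest(top_k, scores.items(), key=lambda kv: kv[1])
        for doc_id, score in top:
            if doc_id not in seen:        # keep only a document's best-priority hit
                seen.add(doc_id)
                ranked.append((doc_id, len(query), score))
    return ranked


# Hypothetical scores for two queries from the collection set.
results = [
    (frozenset({"marriage", "dowry"}), {"d1": 0.9, "d2": 0.4}),
    (frozenset({"marriage", "dowry", "harassment"}), {"d2": 0.7, "d3": 0.2}),
]
merged = merge_results(results)
merged_top1 = merge_results(results, top_k=1)
```

In this example document d2 is ranked ahead of d1 even though d1 has the higher raw score, because d2 was returned by the query with more keywords.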
