Automatic Ontology-Based Document Annotation for Arabic Information Retrieval
Ashraf I. Kaloub, Alaqsa-Community College, Khan Younis, Gaza Strip
Rebhi S. Baraka, Faculty of Information Technology, Islamic University of Gaza
The 3rd Palestinian Symposium on Computational Linguistics and Arabic Content (iArabic' 2014), April 12, 2014
OUTLINE
• Introduction
• Methodology and model
• System Realization, Experimental Results and Evaluation
• Conclusion and Future Work
INTRODUCTION
The need for semantically enriched Information Retrieval (IR) and search is among the most important issues of the semantic web. Semantic IR tries to overcome the limitations of the traditional IR model, which misunderstands the query and its context and relies on keywords that cannot represent the semantic information of resources, and which therefore obtains lower recall and precision.
INTRODUCTION
Using an ontology in the field of IR improves retrieval accuracy and reduces irrelevant results. An ontology is a formal explicit description of concepts in a domain of discourse (classes, sometimes called concepts), of the properties of each concept that describe its various features and attributes (slots), and of restrictions on slots (facets). An ontology together with a set of individual instances of classes constitutes a knowledge base.
INTRODUCTION
We develop an automatic ontology-based document annotation and retrieval model for Arabic documents. The model is used to improve the accuracy of retrieved Arabic documents, based on the Arabic ontology domain "فقه الصلاة" (Prayer jurisprudence). All documents in this domain are written in Arabic and stored in a corpus.
METHODOLOGY AND MODEL
To build the model, the following steps are performed:
1. Preparing the corpus
2. Building the Arabic ontology domain "فقه الصلاة" (Prayer jurisprudence)
3. Documents annotation
4. Processing annotated documents
5. Indexing and searching
METHODOLOGY AND MODEL
Preparing the corpus
The corpus is a collection of documents in the domain "فقه الصلاة" (Prayer Jurisprudence). We collect these documents from the IslamWeb website, from Fatwa questions on Islamic issues. The collected documents are converted to XML when we load them into Gate in order to facilitate document annotation and retrieval.
METHODOLOGY AND MODEL
Building the Arabic ontology domain "فقه الصلاة" (Prayer jurisprudence)
The development of the ontology consists of the following stages:
• Define concepts, i.e., classes, based on studying and analyzing the domain.
• Define instances, i.e., real elements in our domain.
• Define relations among classes as a requirement to come up with the ontology.
• Enrich the ontology with synonyms and stemmed words.
• Evaluate the ontology.
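The stages above (classes, instances, relations, synonym enrichment) can be pictured with a small in-memory structure. This is only a sketch in Python, not the Protégé ontology used in the work: the `Ontology` and `OntologyClass` names and all method names are hypothetical, the subclass relation under "Prayer Components" follows the class descriptions in the ontology table, and attaching the instance "الجنازة" (Funeral) to FardhuKifayah with the synonym "صلاة الجنازة" is an assumption made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class OntologyClass:
    """One concept of the domain (field names are illustrative)."""
    name_en: str
    name_ar: str
    parent: Optional[str] = None  # English name of the superclass, if any
    instances: Dict[str, Set[str]] = field(default_factory=dict)  # instance -> synonyms


class Ontology:
    def __init__(self) -> None:
        self.classes: Dict[str, OntologyClass] = {}

    def add_class(self, name_en: str, name_ar: str, parent: Optional[str] = None) -> None:
        # Stage 1 and 3: define a class and its is-a relation to a parent class.
        if parent is not None and parent not in self.classes:
            raise KeyError(f"unknown parent class: {parent}")
        self.classes[name_en] = OntologyClass(name_en, name_ar, parent)

    def add_instance(self, class_name: str, instance: str, synonyms: Iterable[str] = ()) -> None:
        # Stage 2 and 4: define an instance and enrich it with synonyms.
        self.classes[class_name].instances[instance] = set(synonyms)

    def subclasses(self, name_en: str) -> List[str]:
        return [c.name_en for c in self.classes.values() if c.parent == name_en]

    def terms(self) -> Dict[str, str]:
        """Map every surface term (class name, instance, synonym) to its class."""
        out: Dict[str, str] = {}
        for c in self.classes.values():
            out[c.name_ar] = c.name_en
            for inst, syns in c.instances.items():
                out[inst] = c.name_en
                for s in syns:
                    out[s] = c.name_en
        return out


onto = Ontology()
onto.add_class("Prayer Components", "مكونات الصلاة")
onto.add_class("Staff", "أركان", parent="Prayer Components")
onto.add_class("Disliked", "مكروهات", parent="Prayer Components")
onto.add_class("Things which Invalidate", "مبطلات", parent="Prayer Components")
onto.add_class("Musthbat", "مستحبات", parent="Prayer Components")
onto.add_class("FardhuKifayah", "فرض كفاية")
# Assumption: class placement and synonym are illustrative, not taken from the slides.
onto.add_instance("FardhuKifayah", "الجنازة", synonyms={"صلاة الجنازة"})

print(onto.subclasses("Prayer Components"))
```

The `terms()` mapping is the kind of term list that a gazetteer-style annotator could consume, which is why synonyms are stored next to the instance they enrich.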
Ontology Classes

No. | Class (English) | Class (Arabic) | Description
1 | Prayer Time | وقت الصلاة | The time of the FardhuAin prayer.
2 | Aladan | الأذان | Aladan is the call to prayer itself; the person who calls it is called the muadhan.
3 | Omission | السهو | Forgetting one of the prayer steps.
4 | Increase Omission | سهو زيادة | An increase in either acts or statements while the person performs the prayer.
5 | Omission Doubt | سهو شك | Doubt between two things, whichever occurs during the prayer.
6 | Decrease Omission | سهو نقصان | A decrease in either acts or statements while the person performs the prayer.
7 | Prayer Conditions | شروط الصلاة | Matters that are not part of the prayer but must be satisfied before starting the prayer.
8 | Validity Conditions | شروط صحة | Conditions on which the validity of prayer depends, such that if one of them is broken, the prayer is not valid.
9 | Obligation Conditions | شروط وجوب | Conditions that must be met by the person who wants to pray for the prayer to be correct.
10 | Voluntary Prayer | صلاة التطوع | The optional prayer that can be performed besides the obligatory prayer.
11 | AlRoateb Sunan | رواتب سنن | Beyond the five daily required prayers, Muslims often engage in optional prayers before or after the regular prayers (FardhuAin). These are known as "AlRoateb Sunan".
12 | Post-Roateb | رواتب بعدية | Performed after the FardhuAin prayer.
13 | Pre-Roateb | رواتب قبلية | Performed before the FardhuAin prayer.
14 | Eid Prayer | صلاة العيد | Performed on the morning of Eid ul-Fitr and Eid ul-Adha.
15 | Prayer of Exempted People | صلاة أهل الأعذار | Persons who have a problem that prevents them from performing the prayer in the usual way.
16 | Obligatory Prayer | صلاة فرض | The prayer that must be performed by every person.
17 | FardhuAin | فرض عين | The five main prayers performed by a person who wants to pray.
18 | FardhuKifayah | فرض كفاية | A prayer that, once carried out by some, is no longer required of the others.
19 | Prayer Components | مكونات الصلاة | The main components of prayer, which must be found in it; they include Staff, Disliked, Things which Invalidate and Musthbat.
20 | Staff | أركان | One of the important components of prayer, related to the practical side.
21 | Disliked | مكروهات | Things that are disliked in prayer.
22 | Things which Invalidate | مبطلات | Things that make the prayer invalid.
23 | Musthbat | مستحبات | Things that are preferred in the prayer.
Part of Ontology Concepts and Instances
Synonym Words for the Instance "الجنازة" (Funeral)
Using Onto Root Gazetteer
The Annotation Process Result
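To make the idea of gazetteer-based annotation concrete, the following Python sketch looks up ontology terms in a text and records each hit with its offsets and its ontology class. It is a deliberately simplified stand-in, not Gate's Onto Root Gazetteer: it uses exact substring matching with longest-term-first priority and does no stemming or morphological analysis. The `Annotation` type, the `annotate` function, the term list and the sample sentence are all hypothetical.

```python
import re
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    start: int  # character offset where the matched term begins
    end: int    # character offset just past the matched term
    text: str   # the matched surface form
    type: str   # ontology class used as the annotation type


def annotate(text: str, term_to_class: Dict[str, str]) -> List[Annotation]:
    """Return one annotation per ontology term found in the text."""
    if not term_to_class:
        return []
    # Longest terms first, so a multi-word term wins over a shorter term it contains.
    terms = sorted(term_to_class, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    return [
        Annotation(m.start(), m.end(), m.group(), term_to_class[m.group()])
        for m in pattern.finditer(text)
    ]


# Hypothetical term list; the class names follow the ontology classes table.
TERMS = {
    "الأذان": "Aladan",
    "وقت الصلاة": "Prayer Time",
    "صلاة العيد": "Eid Prayer",
}

# Made-up sample sentence ("Prayer time begins after the call to prayer").
sample = "يبدأ وقت الصلاة بعد الأذان"
annotations = annotate(sample, TERMS)
for a in annotations:
    print(a.type, a.start, a.end, a.text)
```

Each resulting annotation type corresponds to an ontology class, which is what later allows documents to be searched by class rather than by literal word.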
THE MODEL STRUCTURE
[Diagram with two parts: "User Interface Part" and "Document Annotation and Retrieval Part". Labeled components: Input Query, Corpus, List of Documents, Ontology, Synonyms and Stemming for ontology elements, Annotator, Apply Jape rules, List of Annotated Indexed Documents, Information Retrieval Process, Results.]
SYSTEM REALIZATION, EXPERIMENTAL RESULTS AND EVALUATION
Tools and Programs
o For indexing and keyword searching we use the Lucene Datastore search engine.
o Protégé for ontology building.
o Gate as the environment to execute all our work.
SYSTEM REALIZATION, EXPERIMENTAL RESULTS AND EVALUATION
System Interface
• Applications: in this part we run our application, which we named "تطبيق الصلاة" (Prayer Application), after adding the plugins and Jape rules to its pipeline.
• Language Resources (LRs): represent entities such as lexicons, corpora or ontologies.
• Processing Resources (PRs): represent entities that are primarily algorithmic, such as parsers.
• Data stores: specialized folders on a hard drive used to store the annotated corpus and improve processing times for large collections of documents.
• Text area: shows the document before and after the annotation.
System Gate Interface
EXPERIMENTS
• We performed a series of experiments to demonstrate the ability of our system to retrieve the related documents.
• All our experiments depend on the annotation types (ontology classes) that are created by processing the annotated documents with Jape rules.
• We give some examples to demonstrate and test the prototype, searching with the annotation types produced by the document annotation process.
EXPERIMENTS
• The first three examples show the results of a search using three annotation types: "صلاة أهل الأعذار" (Prayer of Exempted People), "رواتب" (Roateb) and "الأذان" (Aladan).
• The last example uses the word "الأذان" (Aladan) as a keyword (the traditional way) in the search.
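The difference between searching by annotation type and searching by keyword can be sketched with two small inverted indexes. This Python sketch is a stand-in for illustration only and does not use Lucene or the Gate datastore: the `AnnotationIndex` class is hypothetical, the three documents are made up, and treating "النداء" as a synonym annotated with the Aladan class is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class AnnotationIndex:
    """Two inverted indexes: annotation type -> documents, and word -> documents."""

    def __init__(self) -> None:
        self.by_type: Dict[str, Set[str]] = defaultdict(set)
        self.by_word: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str, annotation_types: Iterable[str]) -> None:
        for word in text.split():  # naive whitespace tokenization
            self.by_word[word].add(doc_id)
        for annotation_type in annotation_types:
            self.by_type[annotation_type].add(doc_id)

    def search_type(self, annotation_type: str) -> List[str]:
        return sorted(self.by_type.get(annotation_type, set()))

    def search_keyword(self, word: str) -> List[str]:
        return sorted(self.by_word.get(word, set()))


index = AnnotationIndex()
# Made-up documents; the annotation types would come from the annotation step.
index.add("d1", "حكم الأذان قبل دخول الوقت", ["Aladan"])
index.add("d2", "النداء للصلاة يوم الجمعة", ["Aladan"])  # synonym only, no literal keyword
index.add("d3", "صلاة العيد ركعتان", ["Eid Prayer"])

by_type = index.search_type("Aladan")
by_keyword = index.search_keyword("الأذان")
print(by_type, by_keyword)
```

In this toy corpus the annotation-type search returns both documents about the call to prayer, while the keyword search returns only the one that contains the literal word, which is the kind of gap the annotation-based search is meant to close.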
Example 1. Searching using annotation type "صلاة أهل الأعذار" (Prayer of Exempted People).
Example 2. Searching using annotation type "رواتب" (Roateb).
Example 3. Searching using annotation type "الأذان" (Aladan).
Example 4. Searching using the word "الأذان" (Aladan) as keyword.
SYSTEM EVALUATION
• System evaluation depends on finding all documents related to the ontology components. We use 100 documents in our Arabic ontology domain "فقه الصلاة" (Prayer jurisprudence), and we use the Gate tool to annotate these documents automatically, based on the Onto Root Gazetteer annotator.
• We depend on two important measures that are commonly used to evaluate such a system: precision and recall.
SYSTEM EVALUATION
Recall: the number of relevant documents retrieved by a search divided by the total number of existing relevant documents (which should have been retrieved).
Precision: the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.
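The two definitions above translate directly into a few lines of Python. This is a minimal sketch: the function name `precision_recall` and the document identifiers are hypothetical, and returning 0.0 when a denominator is empty is a convention chosen here, not something stated in the slides.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Compute (precision, recall) from retrieved and relevant document ids."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant documents that were retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Made-up example: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

In this example two of the four retrieved documents are relevant, so precision is 2/4, and two of the three relevant documents were found, so recall is 2/3.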