
A Survey on Cross-language IR (CLIR) Naveen Yamparala (RS09174) - PowerPoint PPT Presentation



  1. A Survey on Cross-language IR (CLIR) Naveen Yamparala (RS09174)

  2. Types of IR (Language based) [1]
     There are broadly three types of information retrieval:
     1. Monolingual information retrieval: The query and the documents are in the same language. This is the traditional information retrieval technique.
     2. Bilingual information retrieval: Two languages are involved, one for the query and one for the documents. For example, if the documents are written in Hindi and the query is made in English, the retrieval is bilingual.
     3. Cross-lingual (cross-language) information retrieval: The user issues a single query against a collection that contains documents in several other languages.

  3. CLIR approaches
     1. Query Translation
        a. Dictionary Based Translation Approach
        b. Corpora Based Translation Approach
        c. Machine Translation Based Approach
     2. Document Translation
     3. Dual Translation (Both Query and Document Approach)

  4. 1. Query Translation Approach (QT)
     The user's query is translated into the language of the documents.
     Advantages: Easy, as it does not require translating large amounts of text.
     Disadvantages:
     ● Query terms usually do not provide enough information about the intended meaning of the query.
     ● Translation errors usually have a large effect on retrieval performance.
     ● The query has to be translated into every language used in the document collection.

  5. 1.1 Dictionary Based Translation Approach
     In this type of query translation, only the keywords of the query are translated using Machine Readable Dictionaries (MRDs), and the query is processed linguistically. MRDs are electronic versions of printed dictionaries, covering either a general domain or a specific domain. There are a few problems associated with dictionary-based translation:
     ● Untranslatable words: Some words cannot be found in MRDs, such as new compound words, proper names, spelling variants, and special terms.
     ● Inflected words: Translating inflected words is usually hard because inflected word forms (e.g. a change in the tense of a word) are usually not found in dictionaries.
     ● Lexical ambiguity in source and target languages: e.g. bank (river bank) and bank (financial institution). Because of ambiguity in the search keys, matching against relevant documents may not be successful.
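
The three problems above can be seen in a minimal sketch of dictionary-based query translation. Everything here is an assumption made for illustration: the toy English-to-Spanish dictionary `TOY_MRD`, the stopword list, and the crude `naive_stem` helper are hypothetical and stand in for a real MRD and a real morphological analyser.

```python
# Toy English -> Spanish machine-readable dictionary (illustrative entries only).
TOY_MRD = {
    "bank": ["banco", "orilla"],  # ambiguous: financial institution vs. river bank
    "river": ["río"],
    "loan": ["préstamo"],
}

STOPWORDS = {"the", "a", "of", "on", "in", "for"}


def naive_stem(word):
    """Crude suffix stripping so some inflected forms match dictionary headwords."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def translate_query(query, mrd=TOY_MRD):
    """Translate query keywords; return (translations, untranslatable_terms)."""
    translations = {}
    untranslatable = []
    for token in query.lower().split():
        if token in STOPWORDS:
            continue
        # Try the surface form first, then the stemmed form.
        candidates = mrd.get(token) or mrd.get(naive_stem(token))
        if candidates:
            translations[token] = candidates  # all senses kept: ambiguity is unresolved
        else:
            untranslatable.append(token)  # e.g. proper names, new compounds
    return translations, untranslatable


translations, untranslatable = translate_query("loans for the Thames river bank")
```

In this sketch "loans" is only found after stemming, "Thames" ends up in the untranslatable list, and "bank" keeps both candidate translations, so the ambiguity is passed on to the retrieval step.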

  6. 1.2 Corpora Based Translation Approach
     In this approach, the query is translated using terms extracted from parallel or comparable corpora. A corpus is a collection of naturally occurring language material, such as texts, paragraphs and sentences, from one or many languages.
     A parallel corpus is a collection of texts, each of which is translated into one or more languages other than the original language. Parallel corpora are also used to determine relationships, such as co-occurrences, between terms of different languages.
     Comparable corpora contain texts on the same topic or area, but written in different languages. The texts can be considered roughly equivalent to each other across languages, although they are not translations of one another.
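
As a rough sketch of how co-occurrence in a parallel corpus can suggest translations, the example below scores word pairs with the Dice coefficient over sentence-aligned pairs. The four-sentence corpus `PARALLEL` and the function names are invented for illustration, and the Dice coefficient is only one possible association measure; the slides do not name a specific one.

```python
from collections import Counter
from itertools import product

# Tiny sentence-aligned English/Spanish corpus (invented for illustration).
PARALLEL = [
    ("the river is wide", "el río es ancho"),
    ("the river bank is green", "la orilla del río es verde"),
    ("the house is green", "la casa es verde"),
    ("the house is wide", "la casa es ancha"),
]


def dice_table(pairs):
    """Score every (source word, target word) pair by sentence-level co-occurrence."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update(product(s_words, t_words))
    # Dice coefficient: 2 * |both| / (|source occurrences| + |target occurrences|)
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


def best_translation(word, table):
    """Return the highest-scoring target word for `word`, or None if unseen."""
    candidates = {t: score for (s, t), score in table.items() if s == word}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


TABLE = dice_table(PARALLEL)
river_translation = best_translation("river", TABLE)
```

With so little data, many pairs tie or score misleadingly high, which is why real systems rely on much larger corpora.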

  7. 1.3 Machine Translation Based Approach
     Machine translation is regarded as one of the best approaches for query translation. [4] Machine translation can be divided into four types: (a) word-for-word approach, (b) syntactic transfer approach, (c) semantic transfer approach and (d) interlingual approach. The main goal of all these approaches is to understand the context and translate the query using that context. Machine translation based approaches have an advantage over the others because they translate the sentence as a whole, so translation ambiguity is resolved during analysis of the source sentence.

  8. Comparison of different Query based Translation techniques

  9. 2. Document Translation Approach (DT)
     The Document Translation (DT) approach is mostly used in cases where the user wants to search documents in a different language and receive the results in their own language. There are two ways to do document translation:
     (i) Post-translation, or "on-the-fly translation"
     (ii) Pre-translation, or "all together before query"

  10. 2. Document Translation Approach (Contd)
      Document translation has the following benefits over query translation:
      ● Translation errors should not harm retrieval too much, as they are weighted against a whole document.
      ● The translation can be done at indexing time, which gives faster retrieval at run time (pre-translation).
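
The pre-translation variant can be sketched as follows: documents are translated once when the index is built, so no translation is needed at query time. The word-for-word `toy_translate` function and the `ES_TO_EN` mapping are hypothetical stand-ins for a real translation system, and the scoring is a simple count of matching terms chosen only to keep the example short.

```python
from collections import defaultdict

# Hypothetical word-level mapping standing in for a real translation system.
ES_TO_EN = {
    "el": "the", "la": "the", "es": "is",
    "río": "river", "ancho": "wide", "casa": "house", "verde": "green",
}


def toy_translate(text):
    """Word-for-word stand-in for document translation; unknown words pass through."""
    return " ".join(ES_TO_EN.get(w, w) for w in text.lower().split())


def build_index(docs):
    """Pre-translation: translate every document once, at indexing time."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in toy_translate(text).split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Run-time retrieval with no translation step; rank by number of matching terms."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


DOCS = {"d1": "el río es ancho", "d2": "la casa es verde"}
INDEX = build_index(DOCS)
hits = search(INDEX, "green house")
```

The cost is paid up front: every document must be translated and stored in the index language before the first query arrives.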

  11. 3. Dual Translation (Both Query and Document Translation Approach)
      ● When both the queries and the documents are translated into a common representation, it is called Dual Translation. This approach provides scalability but takes extra storage space for the translated documents from different languages.
      ● [3] A pivot language is used to perform Dual Translation when limited translation resources make direct translation between two languages impossible. The pivot language is a third, intermediate language through which the translation is routed.
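
A minimal sketch of translation through a pivot language, assuming no direct Spanish-to-French dictionary is available and English serves as the pivot. Both toy dictionaries and the `pivot_translate` helper are invented for illustration.

```python
# Toy dictionaries (invented entries): Spanish -> English (pivot), English -> French.
ES_TO_EN = {
    "banco": ["bank", "bench"],
    "río": ["river"],
}
EN_TO_FR = {
    "bank": ["banque", "rive"],
    "bench": ["banc"],
    "river": ["rivière", "fleuve"],
}


def pivot_translate(word, src_to_pivot, pivot_to_tgt):
    """Translate `word` by chaining source->pivot and pivot->target lookups."""
    results = []
    for pivot_word in src_to_pivot.get(word, []):
        for tgt_word in pivot_to_tgt.get(pivot_word, []):
            if tgt_word not in results:  # keep first-seen order, drop duplicates
                results.append(tgt_word)
    return results


candidates = pivot_translate("banco", ES_TO_EN, EN_TO_FR)
```

Chaining two lookups can multiply ambiguity: here "rive" (river bank) enters the candidate list only because the English pivot word "bank" has that extra sense.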

  12. Comparison of QT vs DT vs Dual Translation
      ● Query Translation (QT) is flexible and allows faster retrieval of results, since only the query needs to be translated. Unlike document translation, it does not require extra storage space. QT also allows more interaction with the user. However, query translation often suffers from poor performance due to translation ambiguity.
      ● Document translation gives the IR engine more room to recover from errors: even if some keywords or phrases are wrongly translated, their effect is mitigated because they are weighted against the whole document. However, it requires a lot of storage space and is often impractical.
      ● Dual or Hybrid Translation has the advantage of matching the queries unambiguously. Its disadvantages include the space required to store the translated documents in the intermediate (pivot) language and the need to convert both documents and queries into that intermediate language.

  13. Query vs Document vs Hybrid Translations

  14. References
      1. A comprehensive survey on cross-language information retrieval system (2019). https://pdfs.semanticscholar.org/1eb0/7903d8e09131997d069991eb48e3cae06274.pdf
      2. Dwivedi, Sanjay & Chandra, Ganesh. (2016). A Survey on Cross Language Information Retrieval. International Journal on Cybernetics & Informatics. 5. 127-142. 10.5121/ijci.2016.5113. https://www.researchgate.net/publication/297752831_A_Survey_on_Cross_Language_Information_Retrieval
      3. Abusalah, Mustafa & Tait, John & Oakes, Michael. Literature Review of Cross Language Information Retrieval. 175-177. https://www.researchgate.net/publication/221017614_Literature_Review_of_Cross_Language_Information_Retrieval
      4. Manning, C.D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
