LIGA and Syllabification Approach for Language Identification and Back Transliteration: Shared Task Report by DA-IICT

Shraddha Patel, Vaibhavi Desai

Problem Statement
Subtask 1: Query Word Labeling
Suppose that q: w1 w2 w3 … wn is a query written in Roman script. The words w1, w2, etc. could be standard English words or words transliterated from another language L (Hindi / Gujarati). The task is to label each word as E or L depending on whether it is an English word or a transliterated L-language word, and then, for each transliterated word, to provide the correct transliteration in the native script (i.e., the script used for writing L).

Input:  palak paneer recipe
Output: palak\H=पालक paneer\H=पनीर recipe\E

Input:  Maro phone bagadi gayo
Output: Maro\G=મારો phone\E bagadi\G=બગડી gayo\G=ગયો

Language Identification Graph Approach (LIGA) [Gujarati and Hindi]: Constructing the Graph
• Construct the bigrams and trigrams of the words in the training data.
• For each word in the training set, construct a simple graph and compute path matching scores for both languages using LIGA.
Example: LIGA approach for training data. Node and edge scores (trigram) are calculated for a set of three words: “apple”, “apply” and “applied”.
[Figure 1: trigram LIGA graph for the three training words. Node weights: app 3, ppl 3, ple 1, ply 1, pli 1, lie 1, ied 1. Edge weights: app→ppl 3, ppl→ple 1, ppl→ply 1, ppl→pli 1, pli→lie 1, lie→ied 1.]
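The graph construction for the three example words can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `ngrams` and `build_liga_graph` are hypothetical, and only trigrams are counted.

```python
from collections import Counter

def ngrams(word, n=3):
    """Return the consecutive character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_liga_graph(words, n=3):
    """Count node weights (n-grams) and edge weights (adjacent n-gram pairs)."""
    nodes, edges = Counter(), Counter()
    for word in words:
        grams = ngrams(word.lower(), n)
        nodes.update(grams)                  # one count per n-gram occurrence
        edges.update(zip(grams, grams[1:]))  # one count per transition
    return nodes, edges

nodes, edges = build_liga_graph(["apple", "apply", "applied"])
print(dict(nodes))
print(dict(edges))
```

With these three words the node weights sum to 11 and the edge weights sum to 8, which are the totals used in the path matching example.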

Language Identification Graph Approach (LIGA) [Gujarati and Hindi]: Path Matching Scores
If the test word is “applies”, a LIGA graph can be constructed that produces the following simple path:
app -> ppl -> pli -> lie -> ies
The path matching (PM) score for a language whose training LIGA graph is shown in Figure 1 (training words “apple”, “apply” and “applied”) is calculated as follows. Each node on the path contributes its training weight divided by the total node weight, and each edge contributes its training weight divided by the total edge weight. A node or edge that does not occur in the training graph contributes 0.
Total node weight (adding all weights) = 11
Total edge weight (adding all weights) = 8
PM = 3/11 + 3/11 + 1/11 + 1/11 + 0 + 3/8 + 1/8 + 1/8 + 0
PM = 1.352
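The arithmetic above can be checked with a short sketch. It assumes a trigram-only graph built from “apple”, “apply” and “applied”; the names `build_graph` and `path_matching_score` are illustrative and do not come from the slides.

```python
from collections import Counter

def trigrams(word):
    """Consecutive character trigrams of a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

def build_graph(words):
    """Node and edge weight counters for the training words."""
    nodes, edges = Counter(), Counter()
    for word in words:
        grams = trigrams(word)
        nodes.update(grams)
        edges.update(zip(grams, grams[1:]))
    return nodes, edges

def path_matching_score(word, nodes, edges):
    """Sum of normalized node weights plus normalized edge weights on the path."""
    grams = trigrams(word)
    node_total = sum(nodes.values())
    edge_total = sum(edges.values())
    # Counter returns 0 for unseen nodes/edges, so they add nothing.
    node_part = sum(nodes[g] for g in grams) / node_total
    edge_part = sum(edges[e] for e in zip(grams, grams[1:])) / edge_total
    return node_part + edge_part

nodes, edges = build_graph(["apple", "apply", "applied"])
pm = path_matching_score("applies", nodes, edges)
print(round(pm, 3))
```

The node part is 8/11 and the edge part is 5/8, which together give the 1.352 reported on the slide.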

Language Identification Graph Approach (LIGA) [Gujarati and Hindi]: Predicting the Language
• For a language L, we calculate the path matching (PM) score for each word by constructing its bigrams and trigrams.
• For each word of the query set, the same method is applied to calculate the PM scores.
• A word in the query is labeled with the language (Hindi, Gujarati or English) that gives the maximum path matching score.
Example:
• Word in the query: “applies”
• PM score (English) = 1.352
• PM score (Hindi) = 0.112
• Hence, the word belongs to English and is labeled “\E”.
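The prediction step amounts to taking the language with the highest PM score. The sketch below shows this under loud assumptions: the training lists in `TOY_TRAINING` are tiny hypothetical samples (so the scores will not reproduce the 0.112 quoted above), only trigrams are used, and the slides do not say how bigram and trigram scores are combined.

```python
from collections import Counter

def grams(word, n=3):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def train(words, n=3):
    """Build one (nodes, edges) graph for a language."""
    nodes, edges = Counter(), Counter()
    for w in words:
        g = grams(w.lower(), n)
        nodes.update(g)
        edges.update(zip(g, g[1:]))
    return nodes, edges

def pm_score(word, graph, n=3):
    """Path matching score of a word against one language graph."""
    nodes, edges = graph
    g = grams(word.lower(), n)
    node_part = sum(nodes[x] for x in g) / max(sum(nodes.values()), 1)
    edge_part = sum(edges[e] for e in zip(g, g[1:])) / max(sum(edges.values()), 1)
    return node_part + edge_part

def label_word(word, graphs):
    """Return the best-scoring language label and all scores."""
    scores = {lang: pm_score(word, graph) for lang, graph in graphs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy data; a real system would train on full word lists.
TOY_TRAINING = {
    "E": ["apple", "apply", "applied", "recipe"],
    "H": ["palak", "paneer"],
}
graphs = {lang: train(words) for lang, words in TOY_TRAINING.items()}

for query_word in ["applies", "palak"]:
    label, _ = label_word(query_word, graphs)
    print(f"{query_word}\\{label}")
```

Words shorter than three letters have no trigrams in this sketch, so every language scores 0 and the label is arbitrary.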

LIGA: Results (Labelling Accuracy)

                    English   Hindi    |  English   Gujarati
Precision           77.3      71.0     |  97.6      16.7
Recall              78.2      79.5     |  98.6      25.0
F-score             78.8      75.0     |  98.1      20.0
Labeling accuracy        77.1          |       96.3

LIGA: Error Analysis and Drawbacks
• Results depend heavily on the size and credibility of the data sets.
• Single-letter words are not classified correctly, e.g. “a”, “o”.
• Proper nouns are hard to classify, e.g. “Satyam”, “Delhi”.
• Words of different languages that share the same transliteration are hard to classify, e.g. Deep (both English and Hindi: दीप).

Back-transliteration: Rule-Based Syllabification
Make syllables from a word by splitting at the nearest consonants so that each syllable contains at least one vowel. The last set of consonants can be taken as it is.
• E.g. Sudarshan = Su+da+rsha+n
• E.g. Vijay = Vi+ja+y
• If the word ends in a vowel, the last syllable extends to the last vowel.
• E.g. Gada = Ga+da (instead of taking the last “a” as a separate syllable, append it to “d”, so the last syllable becomes “da”).
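One way to express this rule is a regular expression that takes a run of consonants followed by a run of vowels, and keeps any trailing consonants as a final piece. This is a sketch inferred from the examples on the slide, not the authors' code; treating only a, e, i, o, u as vowels (so “y” counts as a consonant) is an assumption that matches the Vijay example.

```python
import re

VOWELS = "aeiouAEIOU"  # assumption: "y" is treated as a consonant

# A syllable is: optional consonants + one or more vowels,
# or a final run of consonants at the end of the word.
_SYLLABLE = re.compile(rf"[^{VOWELS}]*[{VOWELS}]+|[^{VOWELS}]+$")

def syllabify(word):
    """Split a Roman-script word into syllable-like chunks."""
    return _SYLLABLE.findall(word)

for w in ["Sudarshan", "Vijay", "Gada", "khoobsoorat"]:
    print(w, "=", "+".join(syllabify(w)))
```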

Back-transliteration: Syllable Mapping
• Language of transliteration: P
• Language (real): L
• Each syllable is fed into a mapper, where it is mapped to a syllable in the language L. Some letters are mapped directly, while others are mapped in combinations.
• For instance, consider the Hindi word khoobsoorat (खूबसूरत).
• Syllables: khoo, bsoo, ra, t
• Mapping of ‘khoo’: ‘kh’ is mapped as a combination to the letter ‘ख’, instead of ‘k’ and ‘h’ being individually mapped to their corresponding letters. ‘oo’ is then mapped to ‘ऊ’.
• For mapping, a hash table is made in which each letter or combination of letters in P is mapped to one or more letters in L.
• The back-transliterated syllables are then appended to form a complete word in language L.
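The hash-table lookup can be sketched with a dictionary and a longest-match-first scan. The slides do not state how combinations are prioritized, so longest match first is an assumption here. Only the ‘kh’ and ‘oo’ entries come from the slide; the ‘k’ and ‘h’ entries and the names `TOY_TABLE` and `map_syllable` are added for illustration.

```python
TOY_TABLE = {
    "kh": "ख",  # from the slide
    "oo": "ऊ",  # from the slide
    "k": "क",   # assumed entry, for contrast with "kh"
    "h": "ह",   # assumed entry, for contrast with "kh"
}

def map_syllable(syllable, table):
    """Map a Roman syllable to native script, preferring longer keys."""
    longest = max(len(key) for key in table)
    s = syllable.lower()
    out, i = [], 0
    while i < len(s):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(min(longest, len(s) - i), 0, -1):
            chunk = s[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            raise KeyError(f"no mapping for {s[i]!r}")
    return "".join(out)

print(map_syllable("khoo", TOY_TABLE))
```

The result is a naive letter sequence. Turning independent vowels into the dependent vowel signs used after consonants in Devanagari is outside this sketch.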

Back-transliteration: Mapping to Dictionary
• S: naive word formed after syllable mapping.
• After the naive word is constructed, it is looked up in the dictionary of language L.
• If S maps directly to a word in the dictionary, that word is taken as the output of the process.
• Else, mapping is done on a letter-by-letter basis between S and each dictionary word.
• P: word in the dictionary.
• Rules for mapping: for each letter in S, the corresponding letter in P is taken. If the letters match, the check continues. If they do not match, the alternate letter set of the letter in P is checked; if the letter in S matches any letter in that set, the check continues.
• Alternate letter set: some letters may have the same phonetic or transliterated representation. For instance, the Hindi letters ऊ and उ may both be written as ‘u’ in English.
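The letter-by-letter check can be sketched as a filter over dictionary words. Several things here are assumptions rather than details from the slides: “letters” are compared as Unicode code points, words must have equal length to be compared, and the ण/न group is added for illustration (only ऊ/उ is named on the slide).

```python
# Groups of letters treated as interchangeable.
ALTERNATE_GROUPS = [
    {"ऊ", "उ"},  # from the slide
    {"न", "ण"},  # assumed group, for illustration
]

def alternates_of(letter, groups=ALTERNATE_GROUPS):
    """Return the letters that may stand in for the given letter."""
    out = set()
    for group in groups:
        if letter in group:
            out |= group - {letter}
    return out

def is_candidate(naive, dict_word, groups=ALTERNATE_GROUPS):
    """True if every letter matches exactly or via an alternate letter set."""
    if len(naive) != len(dict_word):  # assumption: equal length required
        return False
    return all(
        s == p or s in alternates_of(p, groups)
        for s, p in zip(naive, dict_word)
    )

dictionary = ["मानव", "माणव", "मकान"]
print([w for w in dictionary if is_candidate("मानव", w)])
```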

Back-transliteration: Mapping to Dictionary and Score Calculation
• With this process, the search is narrowed to the dictionary words for which the check succeeds.
• For example, the word ‘manav’ maps to माणव and मानव.
• Score calculation: a letter-by-letter comparison with the naive word is done. Every exact letter match adds a full increment to the score; every match through the alternate letter set adds 3/4 of an increment. The word with the highest score is the output.
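The scoring rule can be illustrated with the ‘manav’ example. This sketch assumes a full increment of 1, that the naive word for ‘manav’ is मानव, and that ण and न form an alternate letter set; none of these is stated explicitly on the slides, and the names `score` and `best_match` are hypothetical.

```python
ALTERNATE_GROUPS = [{"ऊ", "उ"}, {"न", "ण"}]  # second group is assumed

def score(naive, dict_word, groups=ALTERNATE_GROUPS):
    """1 point per exact letter match, 0.75 per alternate-set match."""
    total = 0.0
    for s, p in zip(naive, dict_word):
        if s == p:
            total += 1.0
        elif any(s in g and p in g for g in groups):
            total += 0.75
    return total

def best_match(naive, candidates, groups=ALTERNATE_GROUPS):
    """Return the candidate with the highest score."""
    return max(candidates, key=lambda w: score(naive, w, groups))

naive_word = "मानव"              # assumed naive output for 'manav'
candidates = ["माणव", "मानव"]    # candidates named on the slide
for c in candidates:
    print(c, score(naive_word, c))
print("output:", best_match(naive_word, candidates))
```

Under these assumptions मानव scores 4.0 and माणव scores 3.75, so मानव is chosen.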

Back-transliteration: Results

             Hindi   Gujarati
Precision    9.6     46.4
Recall       52.3    46.2
F-score      16.3    46.3

Back-transliteration: Error Analysis and Drawbacks
• Erroneous transliterations: the system does not give proper output for highly erroneous transliterations.
• Words that have different phonetic representations but the same transliterated representation may not be back-transliterated correctly, e.g. लाई and लायी.

Acknowledgements
• Prasenjit Majumdar, DAIICT
• Abhishek Shah, DAIICT
• Monojit Chaudhry, Microsoft Research Lab, India
• Gokul Chittranjan, Microsoft Research Lab, India

Thank You
