Encoding transliteration variation through dimensionality reduction




  1. Encoding transliteration variation through dimensionality reduction. Parth Gupta¹, Paolo Rosso¹ and Rafael E. Banchs². pgupta@dsic.upv.es. ¹Natural Language Engineering Lab, Technical University of Valencia (UPV), Spain. ²HLT, Institute for Infocomm Research (I²R), Singapore


  3. Transliterated Search (the example query means: My Dream Girl)

  4. Transliterated Search (a special case of Lyrics Retrieval)


  8. What are the query and the document?
  • Query, e.g. "Mere Sapno Ki Rani"
    ◦ The most-repeated lines of the song, e.g. "Ooh la la ooh la la"
    ◦ The first line of the song, e.g. "Tadap tadap ke"
    ◦ The "catchiest" part of the song, e.g. "Billo Rani"
    ◦ A fairly unique line, e.g. "Mujhko saja di pyar ki"
  • Document
    ◦ A web page/document containing that song's lyrics in Roman or Devanagari script

  9. Some challenges
  • Extensive spelling variation, e.g. "ayega", "aaega", "ayegaa"
  • Matching across scripts, e.g. आएगा ↔ "ayega"
  • Unlike normal documents, some words/lines are repeated many times (statistical drift?)

  10. Looking at the Problem...
  • The problem is essentially two-fold:
  1. Handling spelling variation within the same script
    • Edit distance? Edit distance is integer-valued, so many candidates tie at the same distance, e.g. Sapney → Sapne, Apney, Samney (all at the same distance)
    • A smarter edit distance? Editex uses Phonix and Soundex information when computing the distance, but it needs mature Soundex and Phonix standards for the language
  2. Performing a transliteration generation/mining operation to operate in the other script
    • Essentially motivated by the need to operate across scripts
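The tie problem above is easy to see concretely. A minimal sketch, using plain Levenshtein distance on the slide's own examples (all three candidates score identically, so edit distance alone cannot rank them):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

# All three candidates tie at distance 1 from "sapney":
for cand in ("sapne", "apney", "samney"):
    print(cand, levenshtein("sapney", cand))
```

Since the distance is an integer, all three candidates are indistinguishable to a retrieval system ranking by edit distance alone.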


  16. Our Model
  • We observe the associations among inter/intra-script terms at the character uni/bi-gram level:
  1. Intra-script, e.g. s → sh, f → ph, j → z, मु (mu) → मू (moo)
  2. Inter-script, e.g. k → क, kh → ख
  • Ideally the algorithm should derive such mappings automatically, BUT the end goal is to find equivalents using this information.
  • We model inter/intra-script equivalents jointly.


  20. Distribution of Units - Character n-grams in Terms
  • The character n-grams in terms follow the same kind of frequency distribution as terms in documents, with some variation.
  [Figure: frequency distributions of character 1-grams, 2-grams, and 3-grams, plotted as frequency against n-gram ID]
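The distribution claim can be checked on a toy scale by counting character n-grams over a small term list (the terms below are just the examples appearing elsewhere in the slides):

```python
from collections import Counter

def char_ngrams(term: str, n: int):
    """All contiguous character n-grams of a term."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

terms = ["sapne", "sapney", "apney", "ayega", "aaega"]
for n in (1, 2):
    freq = Counter(g for t in terms for g in char_ngrams(t, n))
    print(n, freq.most_common(3))
```

Sorting the resulting counts by rank reproduces the skewed, Zipf-like shape the slide's histograms show.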

  21. Modeling the terms
  1. We create a unique character uni/bi-gram joint space C^n of both scripts from the training terms, where n is the dimensionality, e.g. [a, b, c, ..., ch, ks, ..., क, ख, ...]
  2. The training term pairs (v_r, v_d) are transformed into feature vectors in C^n, e.g. v_r = "pyar" and v_d = प्यार
  3. The dimensionality of these pairs is reduced to h_r, h_d ∈ R^m, with m << n, such that dist(h_r, h_d) is minimal
  4. [Important] As there is no distinction between features across the scripts, the model can learn principal components within (intra) and across (inter) the scripts jointly
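Steps 1-2 above can be sketched directly. This is one plausible reading (count vectors over the joint uni/bi-gram vocabulary); the term pair and the Devanagari string are illustrative:

```python
import numpy as np

def ngrams(term, n):
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def features(term):
    # character uni- and bi-grams of a term
    return ngrams(term, 1) + ngrams(term, 2)

# Toy training pairs; the Devanagari side is illustrative
pairs = [("pyar", "प्यार"), ("sapne", "सपने")]

# Joint uni/bi-gram space C^n over BOTH scripts, built from the training terms
vocab = sorted({g for r, d in pairs for g in features(r) + features(d)})
index = {g: i for i, g in enumerate(vocab)}

def vectorize(term):
    v = np.zeros(len(vocab))
    for g in features(term):
        if g in index:
            v[index[g]] += 1
    return v

v_r, v_d = vectorize("pyar"), vectorize("प्यार")
print(v_r.shape, v_d.shape)   # both live in the same joint space
```

Because both scripts share one vocabulary, a Roman term and its Devanagari counterpart are vectors in the same space, which is what lets step 4 learn intra- and inter-script components jointly.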

  22. Training Method
  • A deep autoencoder is trained where the visible layer models the character n-grams through multinomial sampling [Salakhutdinov and Hinton, 2009].
  [Figure: autoencoder architecture, with an RSM input layer over the original word (v_d) and its transliteration (v_r), a linear code layer of 20 units, and an output layer; trained by pre-training followed by fine-tuning]
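The paper's model is a deep autoencoder with an RSM visible layer; what follows is not that training procedure, only a minimal, deterministic stand-in for the dimensionality-reduction step it performs. A truncated SVD acts like a linear autoencoder: encode with the top-m right singular vectors, decode with their transpose (the data here is random stand-in counts):

```python
import numpy as np

# Stand-in data: 50 terms, 200 joint n-gram features (random counts)
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(50, 200)).astype(float)

# Truncated SVD as a linear "autoencoder"
U, S, Vt = np.linalg.svd(X, full_matrices=False)
m = 10
H = X @ Vt[:m].T          # low-dimensional codes: R^200 -> R^10
X_hat = H @ Vt[:m]        # linear reconstruction back into R^200
print(H.shape)
```

The deep, non-linear autoencoder in the slide plays the same role as the encode step here, but learns its projection by pre-training and fine-tuning rather than by SVD.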

  23. Finding equivalents
  • A priori, the complete lexicon of the reference/source collection is projected into the abstract space using the autoencoder.
  • Given a query term q_t, its feature vector v_qt is likewise projected into the abstract space as h_qt.
  • All terms whose cosine similarity to h_qt is greater than θ are considered equivalents.
  [Figure: the query term v_q, paired with a zero vector on the other input, passed through the RSM layer and the 20-unit linear layer to produce the code h_q]
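The thresholding step can be sketched as follows. The three-dimensional codes and the value of θ are made up for illustration; the slide does not give concrete numbers:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical low-dimensional codes for a tiny pre-projected lexicon
lexicon = {
    "sapne":  np.array([0.90, 0.10, 0.00]),
    "sapney": np.array([0.88, 0.12, 0.05]),
    "billo":  np.array([0.00, 0.20, 0.95]),
}
h_q = np.array([0.90, 0.10, 0.02])   # code of the query term
theta = 0.95                          # illustrative threshold

# Every lexicon term above the similarity threshold counts as an equivalent
equivalents = [t for t, h in lexicon.items() if cosine(h_q, h) > theta]
print(equivalents)   # → ['sapne', 'sapney']
```

Projecting the whole lexicon once, up front, means answering a query only costs one projection plus a similarity scan.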

  24. Subtask-2: Ad hoc Retrieval
  • Query formulation
    ◦ Original query: "ik din ayega"
    ◦ Variants of "ik": "ik", "ikk", "ig", plus Devanagari forms
    ◦ Variants of "din": "din", "didn", "diin", plus Devanagari forms
    ◦ Variants of "ayega": "ayega", "aeyega", "ayegaa", plus Devanagari forms
    ◦ Formulated query (word 2-grams joined by $): ik$din, ik$didn, ik$diin, diin$ayega, diin$aeyega, diin$ayegaa, ..., and the corresponding Devanagari pairs
  • Ranking model (word 2-gram variant)
    ◦ TF-IDF
    ◦ unsupervised DFR (parameter-free)
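The query-formulation step above can be sketched with a cross-product over adjacent terms' variant sets; the Roman variant lists and the "$" joining convention are taken from the slide, and the Devanagari variants are omitted here:

```python
from itertools import product

variants = {
    "ik":    ["ik", "ikk", "ig"],
    "din":   ["din", "didn", "diin"],
    "ayega": ["ayega", "aeyega", "ayegaa"],
}
query = ["ik", "din", "ayega"]

# Word 2-grams over all variant combinations of adjacent query terms,
# joined with '$' as on the slide
formulated = [f"{a}${b}"
              for w1, w2 in zip(query, query[1:])
              for a, b in product(variants[w1], variants[w2])]
print(formulated[:4])
```

Each word bigram of the original query expands into every pairing of its terms' variants, which is what makes the formulated query robust to spelling variation on both sides of the bigram.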

  25. Demo: Transliteration Encoding Demo
