NUKTI: English-Inuktitut Word Alignment System Description
Philippe Langlais, Fabrizio Gotti and Guihong Cao
RALI, Département d'informatique et de recherche opérationnelle, Université de Montréal
WPT, June 2005

Context
- We found the task intriguing enough to spend 2 weeks testing 2 approaches.
- We used no (textual) resources other than the ones provided.
- About 4,000 lines of C++ code (bugs included).
- We corrected a few bugs after the deadline.

Word Alignment as a Sentence Alignment Task
Observation on the DEV corpus: monotonicity

  in regards to   <->  pijjutigillugu  (3-1)
  elders          <->  innatuqait      (1-1)
  and             <->  amma            (1-1)
  youth           <->  makkuttu        (1-1)

Monotonicity ≡ a perfect setting for sentence alignment.

Our in-house Sentence Alignment Program: JAPA
Developed for the Arcade evaluation campaign (Langlais et al., 1998):
- Step 1: (roughly) word-align in order to delimit the search space.
- Step 2: sentence-align by a mix of (Gale & Church, 1993) and (Simard et al., 1992).
Available at rali.iro.umontreal.ca/Japa

Word Alignment as a Sentence Alignment Task
- Documents ≡ sentences; sentences ≡ words.
- JAPA handles n-m patterns of arbitrary size (default n, m ∈ [0, 2]).

Exp. 1: seeding JAPA with the empirical pattern distribution (24 patterns observed on the DEV corpus)

  1-1  0.406    4-1  0.092    4-2  0.015
  2-1  0.172    5-1  0.038    5-2  0.011
  3-1  0.123    7-1  0.027    3-2  0.011
  ...

We generated the cartesian product for each pattern where n, m > 1.

  Run                     Prec.   Rec.    F-meas.  AER
  Exp. 1 (official run)   26.17   74.49   38.73    71.27

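The cartesian-product step can be sketched as follows. This is a minimal illustration, not the authors' C++ code: the bead representation (a pair of position lists) and the function name `expand_beads` are assumptions made for the example.

```python
from itertools import product


def expand_beads(beads):
    """Turn n-m beads into word-level links.

    Each bead is (english_positions, inuktitut_positions). Every English
    position in a bead is linked to every Inuktitut position in the same
    bead, i.e. the cartesian product of the two sides.
    """
    links = []
    for english_positions, inuktitut_positions in beads:
        links.extend(product(english_positions, inuktitut_positions))
    return links


# Hypothetical aligner output for the example sentence pair
# "in regards to elders and youth" / "pijjutigillugu innatuqait amma makkuttu".
beads = [([1, 2, 3], [1]), ([4], [2]), ([5], [3]), ([6], [4])]
links = expand_beads(beads)
print(links)
```

A bead with an empty side (a 1-0 or 0-1 pattern) contributes no links under this reading.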
Word Alignment as a Sentence Alignment Task
- Document ≡ sentence; sentence ≡ word.
- JAPA handles n-m patterns of arbitrary size (default n, m ∈ [0, 2]).

Exp. 2: JAPA in its default mode

  1-1  0.89     1-2  0.089    2-1  0.089
  0-1  0.009    1-0  0.009    2-2  0.011

  Run                       Prec.   Rec.    F-meas.  AER
  Exp. 1 (official run)     26.17   74.49   38.73    71.27
  Exp. 2 (unofficial run)   53.04   37.12   43.68    45.13

Word Alignment as a Sentence Alignment Task
- Document ≡ sentence; sentence ≡ word.
- JAPA handles n-m patterns of arbitrary size (default n, m ∈ [0, 2]).

Exp. 3: seeding JAPA with this pattern distribution

  1-1  0.406    4-1  0.092    7-1  0.027    7-2  0.011
  2-1  0.172    5-1  0.04     4-2  0.015    3-2  0.011
  3-1  0.123    6-1  0.04     5-2  0.011    2-2  0.000

  Run                       Prec.   Rec.    F-meas.  AER
  Exp. 1 (official run)     26.17   74.49   38.73    71.27
  Exp. 2 (unofficial run)   53.04   37.12   43.68    45.13
  Exp. 3 (unofficial run)   55.41   60.55   57.86    42.48

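As a reminder of what the four columns measure, here is a small sketch using the commonly cited definitions with sure (S) and possible (P) reference links. The exact scorer used for the shared task is not described in these slides, so treating it as identical to this code is an assumption; the function name and the toy link sets are illustrative only.

```python
def alignment_scores(hypothesis, sure, possible=()):
    """Precision, recall, F-measure and AER for a set of word links.

    Assumes non-empty hypothesis and sure sets. Possible links always
    include the sure ones.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s
    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    aer = 1 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, f_measure, aer


# Toy data: (English position, Inuktitut position) links.
hypothesis = [(1, 1), (2, 1), (4, 2), (5, 4)]
sure = [(1, 1), (2, 1), (3, 1), (4, 2)]
precision, recall, f_measure, aer = alignment_scores(hypothesis, sure)
print(precision, recall, f_measure, aer)
```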
NUKTI: Principle
Finding a monotonic split of the English sentence:

  in regards to |c1 elders |c2 and |c3 youth
  pijjutigillugu   innatuqait   amma   makkuttu

Let $I_1^K$ be an Inuktitut sentence of K words and $E_1^N$ an English sentence of N words. We seek the split $\{c_k \mid k \in [1, K-1],\ c_k \in [1, N-1],\ c_k > c_{k-1}\}$ which maximizes:

$$A = \operatorname*{argmax}_{c_1^K} \prod_{k=1}^{K} \Big[ \lambda\, \underbrace{p(I_k \mid E_{c_{k-1}+1}^{c_k})}_{\text{word-sequence score}} + (1-\lambda)\, \underbrace{p(d_k \equiv c_k - c_{k-1})}_{\text{fertility}} \Big]$$

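To make the objective concrete, the following sketch scores one candidate split. It assumes that the per-word terms are combined by a product (computed here as a sum of logs) and that c_0 = 0 and c_K = N; the callables `p_seg` and `p_fert`, the toy probability tables and the value of `lam` are all hypothetical stand-ins, not values from the system.

```python
import math


def split_score(cuts, inuktitut, english, p_seg, p_fert, lam=0.9):
    """Log-score of a monotonic split.

    cuts holds c_1 .. c_{K-1}; c_0 = 0 and c_K = len(english) are implied.
    p_seg(word, segment) plays the role of the word-sequence score and
    p_fert(d) the role of the fertility term.
    """
    bounds = [0, *cuts, len(english)]
    if len(bounds) != len(inuktitut) + 1:
        raise ValueError("need exactly K - 1 cut points")
    total = 0.0
    for k, word in enumerate(inuktitut):
        start, end = bounds[k], bounds[k + 1]
        if end <= start:
            raise ValueError("cut points must be strictly increasing")
        term = lam * p_seg(word, english[start:end]) + (1 - lam) * p_fert(end - start)
        total += math.log(term) if term > 0 else float("-inf")
    return total


# Toy models (illustrative numbers only).
table = {("pijjutigillugu", "regards"): 0.5, ("innatuqait", "elders"): 0.6,
         ("amma", "and"): 0.7, ("makkuttu", "youth"): 0.6}


def p_seg(word, segment):
    return max(table.get((word, e), 1e-4) for e in segment)


def p_fert(d):
    return {1: 0.4, 2: 0.17, 3: 0.12}.get(d, 0.01)


english = "in regards to elders and youth".split()
inuktitut = "pijjutigillugu innatuqait amma makkuttu".split()
good = split_score([3, 4, 5], inuktitut, english, p_seg, p_fert)
bad = split_score([1, 2, 3], inuktitut, english, p_seg, p_fert)
print(good, bad)
```

With these toy tables the split shown on the slide, cuts (3, 4, 5), scores higher than the arbitrary split (1, 2, 3).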
NUKTI: dirty hands
Word-Word distribution:

$$p(I_k \mid E_{c_{k-1}+1}^{c_k}) \simeq \max_{j = c_{k-1}+1}^{c_k} p(I_k \mid E_j) \quad \text{or} \quad \sum_{j = c_{k-1}+1}^{c_k} p(I_k \mid E_j) \ \Longleftarrow$$

Word-Substring distribution:

$$p(I \mid E) \simeq \sum_{i \in I} \lambda\, p_{llr}(i \mid E) + (1-\lambda)\, p_{ibm2}(i \mid E)$$

Fertility distribution $p(d_k)$: found useless in practice.

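One way to read these two approximations in code is given below. Several points are assumptions rather than facts from the slides: that "i ∈ I" ranges over all character substrings of 3 to 10 characters, that the contributions are summed, and that both tables are plain dictionaries keyed by (substring, English word). All names are illustrative.

```python
def substrings(word, min_len=3, max_len=10):
    """All distinct character substrings of word with min_len..max_len chars."""
    return {word[i:i + n]
            for n in range(min_len, max_len + 1)
            for i in range(len(word) - n + 1)}


def p_word_given_english(word, english, p_llr, p_ibm2, lam=0.5):
    """Interpolate the two substring models and sum over the substrings."""
    return sum(lam * p_llr.get((s, english), 0.0)
               + (1 - lam) * p_ibm2.get((s, english), 0.0)
               for s in substrings(word))


def p_word_given_segment(word, segment, p_llr, p_ibm2, lam=0.5, combine=max):
    """Score an Inuktitut word against an English segment.

    combine=max or combine=sum gives the two variants of the
    word-word approximation.
    """
    return combine(p_word_given_english(word, e, p_llr, p_ibm2, lam)
                   for e in segment)


# Toy tables (illustrative numbers only).
p_llr = {("amma", "and"): 0.4}
p_ibm2 = {("amma", "and"): 0.2}
score_max = p_word_given_segment("amma", ["elders", "and"], p_llr, p_ibm2)
score_sum = p_word_given_segment("amma", ["elders", "and"], p_llr, p_ibm2,
                                 combine=sum)
print(score_max, score_sum)
```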
NUKTI: Log-likelihood ratio score $p_{llr}(i \mid E)$ (Martin et al., 2003)
We computed a likelihood ratio score (Dunning, 1993) for all pairs of English tokens (E) and Inuktitut substrings (i) of length ranging from 3 to 10 characters.
- At most 25,000 associations (the top-ranked ones) were kept for each English word (probably too many).
- Cooccurrence ≡ presence in the same pair of sentences (suboptimal).
- Scores were normalized so that $\forall E,\ \sum_i p_{llr}(i \mid E) = 1$.
- We used a suffix tree structure (1 hour for 100 English words).

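The recipe above can be sketched on a toy bitext as follows. This is a brute-force illustration rather than the suffix-tree implementation mentioned on the slide; keeping only positively associated pairs is an added assumption, and the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) of a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    g2 = 0.0
    for k, r, c in ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1)):
        if k > 0:
            g2 += k * math.log(k * n / (rows[r] * cols[c]))
    return 2 * g2


def substrings(word, min_len=3, max_len=10):
    return {word[i:i + n]
            for n in range(min_len, max_len + 1)
            for i in range(len(word) - n + 1)}


def llr_table(bitext, top_k=25000):
    """Build p_llr(i | E) from sentence-level cooccurrence counts."""
    n = len(bitext)
    en_count, sub_count, pair_count = Counter(), Counter(), Counter()
    for english, inuktitut in bitext:
        en_types = set(english)
        sub_types = set().union(*(substrings(w) for w in inuktitut))
        en_count.update(en_types)
        sub_count.update(sub_types)
        pair_count.update((e, s) for e in en_types for s in sub_types)
    scored = {}
    for (e, s), k11 in pair_count.items():
        # Assumption: keep a pair only if it cooccurs more than chance predicts.
        if k11 * n > en_count[e] * sub_count[s]:
            k12 = en_count[e] - k11
            k21 = sub_count[s] - k11
            k22 = n - k11 - k12 - k21
            scored.setdefault(e, []).append((llr(k11, k12, k21, k22), s))
    table = {}
    for e, pairs in scored.items():
        best = sorted(pairs, reverse=True)[:top_k]
        total = sum(score for score, _ in best)
        table[e] = {s: score / total for score, s in best}
    return table


bitext = [
    (["elders", "and", "youth"], ["innatuqait", "amma", "makkuttu"]),
    (["elders"], ["innatuqait"]),
    (["youth"], ["makkuttu"]),
    (["and"], ["amma"]),
]
table = llr_table(bitext)
print(sorted(table["and"]))
```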
NUKTI: IBM model $p_{ibm2}(i \mid E)$ (Brown et al., 1993)
- We segmented the Inuktitut material by a recursive process
- and trained an IBM model 2 (we used only the transfer table).

NUKTI: Greedy Search Strategy
Step 1: Seed NUKTI with a given split.

Figure: alignment grid with the English words E1 ... E6 on one axis and the Inuktitut words I1 ... I4 on the other.

  in |c1 regards to |c2 elders |c3 and youth
  pijjutigillugu   innatuqait   amma   makkuttu

We tried 2 seed splits: diagonal and JAPA.

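The slides do not define the diagonal seed precisely. A minimal sketch, assuming it places the k-th cut proportionally at floor(k * N / K) and then nudges cuts so that every Inuktitut word keeps at least one English word, could look like this (the function name is hypothetical):

```python
def diagonal_split(n_english, k_inuktitut):
    """Return cut points c_1 .. c_{K-1} lying close to the diagonal.

    Each cut is clamped so that cuts are strictly increasing and every
    remaining Inuktitut word can still receive one English word.
    """
    if k_inuktitut < 1 or n_english < k_inuktitut:
        raise ValueError("need at least one English word per Inuktitut word")
    cuts = []
    for k in range(1, k_inuktitut):
        ideal = (k * n_english) // k_inuktitut
        lowest = cuts[-1] + 1 if cuts else 1
        highest = n_english - (k_inuktitut - k)
        cuts.append(min(max(ideal, lowest), highest))
    return cuts


seed = diagonal_split(6, 4)
print(seed)
```

For the 6-word English sentence and 4-word Inuktitut sentence of the example, this gives the cuts 1, 3, 4, that is "in | regards to | elders | and youth".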
NUKTI: Greedy Search Strategy
Step 2: Perturbation of the seed split, from left to right (≻ marks the cut point being moved):

  in ≻c1 regards to |c2 elders |c3 and youth
  in regards to |c1 ≻c2 elders |c3 and youth
  in regards to |c1 elders |c2 ≻c3 and youth
  in regards to |c1 elders |c2 and |c3 youth

  pijjutigillugu   innatuqait   amma   makkuttu

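A hill-climbing loop in the spirit of this perturbation step is sketched below. It is one plausible reading, not the authors' procedure: each cut is tried at every position strictly between its neighbours, scanning cuts from left to right, and passes are repeated until nothing improves. The slides suggest that moving a cut may also push the following ones, which this sketch does not model. The scoring function and its weights are toy stand-ins.

```python
def greedy_search(cuts, n_english, score):
    """Move one cut at a time, left to right, while the score improves."""
    cuts = list(cuts)
    best = score(cuts)
    improved = True
    while improved:
        improved = False
        for k in range(len(cuts)):
            low = cuts[k - 1] + 1 if k > 0 else 1
            high = cuts[k + 1] - 1 if k + 1 < len(cuts) else n_english - 1
            for position in range(low, high + 1):
                candidate = cuts[:k] + [position] + cuts[k + 1:]
                value = score(candidate)
                if value > best:
                    cuts, best, improved = candidate, value, True
    return cuts, best


english = "in regards to elders and youth".split()
inuktitut = "pijjutigillugu innatuqait amma makkuttu".split()
# Toy association weights (illustrative numbers only).
weights = {("pijjutigillugu", "in"): 0.2, ("pijjutigillugu", "regards"): 0.5,
           ("pijjutigillugu", "to"): 0.2, ("innatuqait", "elders"): 0.6,
           ("amma", "and"): 0.7, ("makkuttu", "youth"): 0.6}


def toy_score(cuts):
    """Sum the weights of the English words falling in each word's segment."""
    bounds = [0, *cuts, len(english)]
    return sum(weights.get((word, e), 0.0)
               for k, word in enumerate(inuktitut)
               for e in english[bounds[k]:bounds[k + 1]])


final_cuts, final_score = greedy_search([1, 3, 4], len(english), toy_score)
print(final_cuts, final_score)
```

Starting from the seed cuts 1, 3, 4, the toy run ends at cuts 3, 4, 5, the split shown on the last line of the slide. Because the search only accepts strict improvements, it can stop in a local optimum for other scoring functions.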
NUKTI: results

  Configuration                    Prec.   Rec.    F-m.    AER
  seed diagonal                    51.7    53.66   52.66   49.54
    + greedy                       65.4    68.31   66.83   32.10
  seed JAPA                        55.4    60.55   57.86   42.48
    + greedy                       65.47   68.36   66.88   31.93
  Best submitted: NUKTI (diago)    63.09   65.87   64.45   34.06

Conclusion & Future Work
- Word alignment as a sentence alignment task: AER ∼ 42
  - a dictionary (transfer parameters) could be used to ease JAPA
  - transliteration for improving cognateness
- JAPA + NUKTI: AER ∼ 32
  - no 1-0 cept allowed
  - log-likelihood ratio distributions too noisy
- If we were to do it again: http://www.inuktitutcomputing.ca/Uqailaut/
- See the next talk! (Schafer and Drábek, 2005)

Thank you