An approach to unsupervised historical text normalisation

Petar Mitankin (Sofia University, FMI), Stefan Gerdjikov (Sofia University, FMI), Stoyan Mihov (Bulgarian Academy of Sciences, IICT). DATeCH 2014, Maye [May] 19 - 20, Madrid, Spain


  1. An approach to unsupervised historical text normalisation
     Petar Mitankin (Sofia University, FMI), Stefan Gerdjikov (Sofia University, FMI), Stoyan Mihov (Bulgarian Academy of Sciences, IICT)
     DATeCH 2014, Maye [May] 19 - 20, Madrid, Spain

  3. Contents
     ● Supervised Text Normalisation
       – CULTURA
       – REBELS Translation Model
       – Functional Automata
     ● Unsupervised Text Normalisation
       – Unsupervised REBELS
       – Experimental Results
       – Future Improvements

  4. Co-funded under the 7th Framework Programme of the European Commission
     ● Maye - 34 occurrences in the 1641 Depositions, 8022 documents, 17th century Early Modern English
     ● CULTURA: CULTivating Understanding and Research through Adaptivity
     ● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI

  6. Supervised Text Normalisation
     ● Manually created ground truth
       – 500 documents from the 1641 Depositions
       – All words: 205 291
       – Normalised words: 51 133
     ● Statistical Machine Translation from historical language to modern language combines:
       – Translation model
       – Language model

  8. REgularities Based Embedding of Language Structures (REBELS): Automatic Extraction of Historical Spelling Variations
     [Diagram: the historical word "shee" enters the REBELS Translation Model, which outputs the scored candidates he / -1.89, se / -1.69, she / -9.75, shea / -10.04]

  9. Training of The REBELS Translation Model
     ● Training pairs from the ground truth: (shee, she), (maye, may), (she, she), (tyme, time), (saith, says), (have, have), (tho:, thomas), ...

  10. Training of The REBELS Translation Model
     ● Deterministic structure of all historical/modern subwords
     ● Each word has several hierarchical decompositions in the DAWG (directed acyclic word graph)
     [Diagram labels: "Hierarchical decomposition of each historical word" and "Hierarchical decomposition of each modern word"]

  11. Training of The REBELS Translation Model
     ● For each training pair (knowth, knows) we find a mapping between the decompositions (illustrated by a diagram on the slide)
     ● We collect statistics about historical subword -> modern subword

  12. REgularities Based Embedding of Language Structures (REBELS): REBELS generates normalisation candidates for unseen historical words
     [Diagram: the historical word "shee" enters the REBELS Translation Model, which outputs the scored candidates he / -1.89, se / -1.69, she / -9.75, shea / -10.04]

  13. [Diagram: each word of the historical input "shee knowth me" is passed through REBELS]

  14. Combination of REBELS with Statistical Bigram Language Model
     relevance score (he knuth my) = REBELS TM (he knuth my) * C_tm + Statistical Language Model (he knuth my) * C_lm
     ● Bigram Statistical Model
       – Smoothing: Absolute Discounting, Backing-off
       – Gutenberg English language corpus
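The weighted combination on slide 14 can be sketched in a few lines. This is only an illustration under assumptions: the function names, the toy score tables and the weight values are hypothetical and not from the slides, and the real system represents and optimises this score with functional automata rather than plain lookups.

```python
from typing import Callable, Dict, Iterable, Tuple

Sentence = Tuple[str, ...]
Scorer = Callable[[Sentence], float]


def relevance_score(sentence: Sentence, tm_score: Scorer, lm_score: Scorer,
                    c_tm: float, c_lm: float) -> float:
    """relevance = TM(sentence) * C_tm + LM(sentence) * C_lm, as on slide 14."""
    return tm_score(sentence) * c_tm + lm_score(sentence) * c_lm


def best_candidate(candidates: Iterable[Sentence], tm_score: Scorer,
                   lm_score: Scorer, c_tm: float, c_lm: float) -> Sentence:
    """Return the candidate normalisation with the highest relevance score."""
    return max(candidates,
               key=lambda s: relevance_score(s, tm_score, lm_score, c_tm, c_lm))


# Hypothetical scores for two candidate normalisations of "shee knowth me".
# The numbers are invented for illustration only.
TOY_TM: Dict[Sentence, float] = {
    ("he", "knuth", "my"): -3.0,
    ("she", "knows", "me"): -4.0,
}
TOY_LM: Dict[Sentence, float] = {
    ("he", "knuth", "my"): -12.0,
    ("she", "knows", "me"): -5.0,
}

CANDIDATES = list(TOY_TM)

# With both models weighted equally the language model outvotes the
# translation model; with the language model switched off it cannot.
with_lm = best_candidate(CANDIDATES, TOY_TM.__getitem__, TOY_LM.__getitem__, 1.0, 1.0)
without_lm = best_candidate(CANDIDATES, TOY_TM.__getitem__, TOY_LM.__getitem__, 1.0, 0.0)
```

In this toy setting the weights C_tm and C_lm decide which model dominates, which is why the slides treat them as parameters to be optimised.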

  15. Functional Automata
     L(C_tm, C_lm) is represented with Functional Automata

  16. Automatic Construction of Functional Automaton For The Partial Derivative w.r.t. x
     L(C_tm, C_lm) is optimised with the Conjugate Gradient method

  17. Supervised Text Normalisation
     [Diagram. Boxes: Historical text, Search Module Based on Functional Automata, Normalised text, REBELS Translation Model, Ground Truth, Training Module Based on Functional Automata]

  18. Unsupervised Text Normalisation
     [Diagram. Boxes: Historical text, Search Module Based on Functional Automata, Normalised text, REBELS Translation Model, Unsupervised Generation of Training Pairs (knoweth, knows)]

  19-23. Unsupervised Generation of the Training Pairs
     ● We use similarity search to generate training pairs:
       – For each historical word H:
         ● If H is a modern word, then generate (H,H), else
         ● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then
         ● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then
         ● Find each modern word M that is at distance 3 from H and generate (H,M).
         ● If too many modern words were generated for H, then do not use the corresponding pairs for training. Slides 19 to 22 state the threshold as "more than 6"; slide 23 states it as "> 5".
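The pair-generation procedure above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function and parameter names are hypothetical, and the brute-force scan over the lexicon stands in for the similarity search mentioned on the slides. The slides give the cut-off as both "more than 6" and "> 5", so it is exposed as a parameter whose default of 6 is an arbitrary choice.

```python
from typing import Iterable, List, Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def generate_training_pairs(historical_words: Iterable[str],
                            modern_lexicon: Set[str],
                            max_distance: int = 3,
                            max_candidates: int = 6) -> List[Tuple[str, str]]:
    """Generate (historical, modern) pairs without any ground truth.

    max_candidates is an assumption: pairs for a word are dropped when more
    than this many modern words were found at the first non-empty distance.
    """
    pairs: List[Tuple[str, str]] = []
    for h in historical_words:
        if h in modern_lexicon:
            pairs.append((h, h))  # already a modern word
            continue
        # Try distance 1, then 2, then 3, stopping at the first level
        # that yields at least one modern word.
        for d in range(1, max_distance + 1):
            found = sorted(m for m in modern_lexicon if levenshtein(h, m) == d)
            if found:
                if len(found) <= max_candidates:
                    pairs.extend((h, m) for m in found)
                break  # too ambiguous or accepted: either way, stop here
    return pairs


# Toy lexicon and word list, invented for illustration.
lexicon = {"she", "may", "time", "knows", "have"}
pairs = generate_training_pairs(["shee", "maye", "tyme", "have", "xqzwvk"], lexicon)
```

A word such as "xqzwvk" that has no modern neighbour within distance 3 simply contributes no pairs. Scanning the whole lexicon per word is quadratic in practice, so a real system would need an indexed similarity search.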

  24. Normalisation of the 1641 Depositions. Experimental results

     Method  Generation of Training Pairs  REBELS Spelling Probabilities  Language Model      Accuracy  BLEU
     1       ----                          ----                           ----                75.59     50.31
     2       Unsupervised                  NO                             YES                 67.84     45.52
     3       Unsupervised                  YES                            NO                  79.18     56.55
     4       Unsupervised                  YES                            YES                 81.79     61.88
     5       Unsupervised                  Supervised Trained             Supervised Trained  84.82     68.78
     6       Supervised                    Supervised Trained             Supervised Trained  93.96     87.30

  25. Future Improvement
     [Diagram. Boxes: Historical text, Search Module Based on Functional Automata, Normalised text, REBELS Translation Model, MAP Training Module, Unsupervised Generation of Training Pairs (knoweth, knows) with probabilities]

  26. Thank You! Comments / Questions?
     ACKNOWLEDGEMENTS
     The reported research work is supported by the project CULTURA, grant 269973, funded by the FP7 Programme and the project AComIn, grant 316087, funded by the FP7 Programme.
