An approach to unsupervised historical text normalisation
Petar Mitankin (Sofia University, FMI), Stefan Gerdjikov (Sofia University, FMI), Stoyan Mihov (Bulgarian Academy of Sciences, IICT)
DATeCH 2014, May 19-20, Madrid, Spain
Contents
● Supervised Text Normalisation
  – CULTURA
  – REBELS Translation Model
  – Functional Automata
● Unsupervised Text Normalisation
  – Unsupervised REBELS
  – Experimental Results
  – Future Improvements
Co-funded under the 7th Framework Programme of the European Commission
● Maye - 34 occurrences in the 1641 Depositions, 8022 documents, 17th-century Early Modern English
● CULTURA: CULTivating Understanding and Research through Adaptivity
● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI
Supervised Text Normalisation
● Manually created ground truth
  – 500 documents from the 1641 Depositions
  – All words: 205 291
  – Normalised words: 51 133
● Statistical Machine Translation from historical language to modern language combines:
  – Translation model
  – Language model
REBELS: REgularities Based Embedding of Language Structures
[Figure: the REBELS Translation Model maps the historical word "shee" to scored modern candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]
Automatic Extraction of Historical Spelling Variations
Training of the REBELS Translation Model
● Training pairs from the ground truth: (shee, she), (maye, may), (she, she), (tyme, time), (saith, says), (have, have), (tho:, thomas), ...
Training of the REBELS Translation Model
● Deterministic structure of all historical/modern subwords
● Each word has several hierarchical decompositions in the DAWG:
[Figure: hierarchical decomposition of each historical word; hierarchical decomposition of each modern word]
Training of the REBELS Translation Model
● For each training pair (knowth, knows) we find a mapping between the decompositions
● We collect statistics about historical subword -> modern subword correspondences
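The statistics-collection step can be sketched as follows. This is a simplified assumption of the procedure: it takes already-aligned (historical subword, modern subword) pairs as input and estimates conditional log-probabilities by relative frequency; the `aligned_pairs` format and the estimator are illustrative, not the paper's exact formulation.

```python
from collections import Counter, defaultdict
from math import log

def subword_statistics(aligned_pairs):
    """Turn aligned (historical subword, modern subword) pairs into
    conditional log-probabilities log P(modern | historical)."""
    counts = Counter(aligned_pairs)
    totals = defaultdict(int)
    for (h, _), c in counts.items():
        totals[h] += c
    return {(h, m): log(c / totals[h]) for (h, m), c in counts.items()}

# Toy alignments, e.g. extracted from the training pair (knowth, knows):
aligned = [("know", "know"), ("th", "s"), ("th", "s"), ("th", "th")]
stats = subword_statistics(aligned)
```

With these toy counts, "th" rewrites to "s" with probability 2/3, so its log score is log(2/3).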
REBELS: REgularities Based Embedding of Language Structures
[Figure: the REBELS Translation Model maps "shee" to scored candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]
REBELS generates normalisation candidates for unseen historical words
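A toy stand-in for candidate generation: the real REBELS composes full hierarchical decompositions of the word, whereas this sketch applies a single learned subword rewrite at each position where it matches. The `rules` dictionary and its log-probabilities are hypothetical.

```python
from math import log

# Hypothetical learned rewrites with log-probabilities (illustrative values):
rules = {("ee", "e"): log(0.9), ("th", "s"): log(0.8)}

def candidates(word, rules):
    """Generate modern candidates for an unseen historical word by applying
    one subword rewrite wherever it matches.  (The real REBELS composes
    full hierarchical decompositions instead.)"""
    out = {}
    for (h, m), logp in rules.items():
        start = word.find(h)
        while start != -1:
            cand = word[:start] + m + word[start + len(h):]
            out[cand] = max(out.get(cand, float("-inf")), logp)
            start = word.find(h, start + 1)
    return sorted(out.items(), key=lambda kv: -kv[1])  # best candidate first
```

For example, `candidates("shee", rules)` proposes "she", and `candidates("knowth", rules)` proposes "knows".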
[Figure: each word of the input "shee knowth me" is passed to REBELS separately]
Combination of REBELS with Statistical Bigram Language Model
relevance_score(he knuth my) = REBELS_TM(he knuth my) * C_tm + Statistical_Language_Model(he knuth my) * C_lm
● Bigram Statistical Model
  – Smoothing: absolute discounting, backing off
  – Gutenberg English language corpus
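A minimal sketch of such a bigram model, assuming absolute discounting with back-off to unigram relative frequencies. The discount d = 0.75, the exact back-off weighting, and the six-word corpus are illustrative choices, not taken from the slides; the combined relevance score would then weight this LM score and the translation-model score by C_lm and C_tm.

```python
from collections import Counter, defaultdict

def train_bigram(tokens, d=0.75):
    """Bigram probabilities with absolute discounting, backing off to
    unigram relative frequencies."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    followers = defaultdict(set)          # distinct continuations of w1
    for (w1, w2) in bi:
        followers[w1].add(w2)
    total = len(tokens)

    def prob(w1, w2):
        p_uni = uni[w2] / total           # back-off distribution
        if uni[w1] == 0:
            return p_uni                  # unseen history: pure back-off
        discounted = max(bi[(w1, w2)] - d, 0) / uni[w1]
        backoff_mass = d * len(followers[w1]) / uni[w1]
        return discounted + backoff_mass * p_uni
    return prob

lm = train_bigram("the cat sat on the mat".split())
```

Here "the" occurs twice with two distinct continuations, so P(cat | the) = (1 - 0.75)/2 + (0.75 * 2/2) * (1/6) = 0.25.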
Functional Automata
L(C_tm, C_lm) is represented with functional automata
Automatic Construction of a Functional Automaton for the Partial Derivative w.r.t. x
L(C_tm, C_lm) is optimised with the Conjugate Gradient method
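The conjugate gradient step can be illustrated on a two-parameter quadratic in (C_tm, C_lm). In the talk, the objective L and its partial derivatives are computed via functional automata; here the quadratic 0.5 x^T A x - b^T x with made-up A and b stands in for L, so this is a sketch of the optimiser only.

```python
def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def matvec(A, v):
    return (A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1])

def conjugate_gradient(A, b, x=(0.0, 0.0), iters=10):
    """Minimise 0.5 x^T A x - b^T x over x = (C_tm, C_lm) by linear
    conjugate gradient (A symmetric positive definite)."""
    Ax = matvec(A, x)
    r = (b[0] - Ax[0], b[1] - Ax[1])      # residual = negative gradient
    p = r
    for _ in range(iters):
        Ap = matvec(A, p)
        alpha = dot(r, r) / dot(p, Ap)    # exact line search
        x = (x[0] + alpha * p[0], x[1] + alpha * p[1])
        r_new = (r[0] - alpha * Ap[0], r[1] - alpha * Ap[1])
        if dot(r_new, r_new) < 1e-12:
            break
        beta = dot(r_new, r_new) / dot(r, r)
        p = (r_new[0] + beta * p[0], r_new[1] + beta * p[1])
        r = r_new
    return x

# Illustrative quadratic objective; A and b are made-up numbers:
A = ((4.0, 1.0), (1.0, 3.0))
b = (1.0, 2.0)
weights = conjugate_gradient(A, b)        # optimum is (1/11, 7/11)
```

In two dimensions conjugate gradient reaches the exact minimum within two iterations.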
Supervised Text Normalisation
[Figure: pipeline — historical text enters a Search Module (based on functional automata) that uses the REBELS Translation Model and outputs normalised text; a Training Module (based on functional automata) learns the model from the ground truth]
Unsupervised Text Normalisation
[Figure: the same pipeline — historical text → Search Module (based on functional automata) with the REBELS Translation Model → normalised text — but the training pairs, e.g. (knoweth, knows), come from an Unsupervised Generation of Training Pairs component instead of a ground truth]
Unsupervised Generation of the Training Pairs
● We use similarity search to generate training pairs. For each historical word H:
  – If H is a modern word, then generate (H, H); else
  – find each modern word M at Levenshtein distance 1 from H and generate (H, M); if no modern words are found, then
  – find each modern word M at distance 2 from H and generate (H, M); if no modern words are found, then
  – find each modern word M at distance 3 from H and generate (H, M).
● If more than 6 modern words were generated for H, then do not use the corresponding pairs for training.
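The generation procedure above can be sketched as follows. This brute-force version scans the whole lexicon with textbook Levenshtein distance; the authors use fast similarity search instead, and the toy `lexicon` is an assumption for illustration.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def generate_pairs(historical_words, modern_lexicon, max_dist=3, max_cands=6):
    """Exact match first, then nearest modern words at distance 1, 2, 3;
    historical words with more than max_cands candidates are dropped."""
    pairs = []
    for h in historical_words:
        if h in modern_lexicon:
            pairs.append((h, h))
            continue
        for d in range(1, max_dist + 1):
            cands = [m for m in modern_lexicon if levenshtein(h, m) == d]
            if cands:                     # nearest non-empty distance wins
                if len(cands) <= max_cands:
                    pairs.extend((h, m) for m in cands)
                break
    return pairs

lexicon = {"she", "he", "may", "time", "says"}   # toy modern lexicon
pairs = generate_pairs(["shee", "maye", "she"], lexicon)
```

With this toy lexicon, "shee" pairs with "she" (distance 1), "maye" with "may" (distance 1), and "she" with itself.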
Normalisation of the 1641 Depositions: experimental results

Method | Generation of Training Pairs | REBELS Spelling Probabilities | Language Model      | Accuracy | BLEU
1      | ----                         | ----                          | ----                | 75.59    | 50.31
2      | Unsupervised                 | NO                            | YES                 | 67.84    | 45.52
3      | Unsupervised                 | YES                           | NO                  | 79.18    | 56.55
4      | Unsupervised                 | YES                           | YES                 | 81.79    | 61.88
5      | Unsupervised                 | Supervised Trained            | Supervised Trained  | 84.82    | 68.78
6      | Supervised                   | Supervised Trained            | Supervised Trained  | 93.96    | 87.30
Future Improvement
[Figure: the unsupervised pipeline — historical text → Search Module (based on functional automata) with the REBELS Translation Model → normalised text — extended with a MAP Training Module; the Unsupervised Generation of Training Pairs component supplies pairs such as (knoweth, knows) with probabilities]
Thank You! Comments / Questions?
ACKNOWLEDGEMENTS
The reported research work is supported by the project CULTURA, grant 269973, funded by the FP7 Programme, and the project AComIn, grant 316087, funded by the FP7 Programme.