Unsupervised Code-Switching for Multilingual Historical Document Transcription
Dan Garrette (UT-Austin, Computer Science)
Hannah Alpert-Abrams (UT-Austin, Comparative Literature)
Taylor Berg-Kirkpatrick (UC Berkeley, Computer Science)
Dan Klein (UC Berkeley, Computer Science) 1
Historical Document Transcription: We work with scholars in the humanities who want to study texts from the 1500s. Standard OCR systems do not work well on early printing-press books. 2
State-of-the-Art: Ocular • Berg-Kirkpatrick, Durrett, and Klein 2013 [Berg-Kirkpatrick et al. 2013] Slide courtesy of Taylor Berg-Kirkpatrick 3
Multilingual Texts • But many historical documents are written in, and switch readily between, multiple languages. 4
(Example page, with regions labeled Spanish, Latin, and Nahuatl) 6
Starting Point: Ocular Generative Model in 3 parts: 1. Language model 2. Typesetting model 3. Rendering model [Berg-Kirkpatrick et al. 2013] 10
Ocular’s Generative Model: Language Model — text E (e.g. “prison”), P(E); Typesetting Model — T, P(T | E); Rendering Model — image X (glyphs a, b, c, …), P(X | E, T). [Berg-Kirkpatrick et al. 2013] Slide courtesy of Taylor Berg-Kirkpatrick 11
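The three parts compose into one joint model over the text E, the typesetting T, and the observed image X:

```latex
P(E, T, X) \;=\; \underbrace{P(E)}_{\text{language model}} \;\cdot\; \underbrace{P(T \mid E)}_{\text{typesetting model}} \;\cdot\; \underbrace{P(X \mid E, T)}_{\text{rendering model}}
```

Transcription then amounts to searching for the text E (and typesetting T) that best explains the observed page image X.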
Our Focus: the Language Model P(E), within the full pipeline of Language Model P(E), Typesetting Model P(T | E), and Rendering Model P(X | E, T). [Berg-Kirkpatrick et al. 2013] Slide courtesy of Taylor Berg-Kirkpatrick 12
Starting Point: Ocular • The language model helps Ocular work well, but creates additional challenges for many documents. • Our work helps to overcome those challenges. 13
Our Focus 1. Multilingual code-switching 2. Inconsistent/outdated orthography 14
Ocular’s Language Model: P(e_i | e_{i−5} … e_{i−1}) — a Kneser-Ney smoothed character 6-gram model. [Berg-Kirkpatrick et al. 2013] Slide courtesy of Taylor Berg-Kirkpatrick 15
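A character n-gram model assigns each character a probability given the previous n−1 characters. Ocular uses Kneser-Ney smoothing; as a rough sketch, with simple add-k smoothing standing in for Kneser-Ney, such a model might look like:

```python
from collections import defaultdict


class CharNGramLM:
    """Character n-gram language model with add-k smoothing.
    (Ocular itself uses Kneser-Ney smoothing; this is a simplified sketch.)"""

    def __init__(self, n=6, k=0.01):
        self.n = n
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(float))  # context -> char -> count
        self.vocab = set()

    def train(self, text):
        # Pad on the left so the first characters still have a full-width context.
        padded = "\x00" * (self.n - 1) + text
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            self.counts[context][padded[i]] += 1
            self.vocab.add(padded[i])

    def prob(self, context, char):
        # P(e_i | e_{i-n+1} ... e_{i-1}) with add-k smoothing
        context = context[-(self.n - 1):]
        c = self.counts[context]
        total = sum(c.values())
        return (c[char] + self.k) / (total + self.k * len(self.vocab))
```

In Ocular this distribution drives the language-model factor P(E); in practice the counts come from large plain-text corpora.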
Ocular’s Language Model: n-gram counts are estimated from a corpus of plain-text files (file1.txt … file6.txt; the slide shows newswire excerpts as the example corpus). [Berg-Kirkpatrick et al. 2013] 16
Baseline Multilingual Model: concatenate the Spanish, Latin, and Nahuatl corpora (spanish1.txt … spanish6.txt, latin1.txt … latin6.txt, nahuatl1.txt … nahuatl6.txt) and count n-grams over the combined text to train a single character language model for E. 17
Baseline Multilingual Model • Poor results • “Multilingual blur” 18
Code-Switching Language Model: train a separate character n-gram model per language — e_{i,s} from the Spanish corpus, e_{i,l} from Latin, e_{i,n} from Nahuatl — and let the model for E switch among them. 19
Code-Switching Language Model: each character in the chain e_{i−1}, e_i, e_{i+1} is generated jointly with a hidden per-character language state. 20
Code-Switching Language Model: the language-switching distribution P(language_i | language_{i−1}) is learned unsupervised via EM, with a hyperparameter biasing the model toward not switching (i.e., toward long single-language spans). 21
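The switching mechanism can be sketched as a transition distribution over languages that interpolates EM-learned probabilities with a strong bias toward staying in the current language. Note that this interpolation scheme and the value 0.999 are illustrative assumptions, not the paper’s exact parameterization:

```python
def transition_probs(prev_lang, langs, learned, stay_bias=0.999):
    """Distribution over the language of the next character.

    `learned` maps (prev_lang, lang) -> a probability estimated by EM;
    `stay_bias` pushes mass toward staying in the same language, which
    encourages long single-language spans. (Hypothetical parameterization
    for illustration only.)
    """
    unnorm = {}
    for lang in langs:
        base = learned[(prev_lang, lang)]
        if lang == prev_lang:
            unnorm[lang] = stay_bias + (1.0 - stay_bias) * base
        else:
            unnorm[lang] = (1.0 - stay_bias) * base
    total = sum(unnorm.values())
    return {lang: p / total for lang, p in unnorm.items()}
```

With three languages and uniform learned probabilities, almost all the mass stays on the previous language, so decoded language spans tend to be long.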
Character sets per language:
Spanish: AÁBCDÉFGHIÍJKLMÑOÓPQRSTUÚVWXYZ aábcdéfghiíjklmñoópqrstuúvwxyz 0123456789 .,/\()?!”’:;-
Latin: ABCDFGHIJKLMOPQRSTUVWXYZ abcdfghijklmopqrstuvwxyz 0123456789 .,/\()?!”’:;-
Nahuatl: ABCDFGHIJKLMOPQRSTUVWXYZ abcdfghijklmopqrstuvwxyz 0123456789 .,/\()?!”’:;-
24
(Figure: grids of repeated glyph samples for characters a–f across the per-language character sets/fonts) 26
Code-Switching Language Model • Improves transcription quality, and • Implicitly identifies language spans in text (metadata of the transcription) 27
Orthographic Variability 28
Orthographic Variability • We train our language models from available text (e.g. Project Gutenberg) • Modern transcribers use modern spellings, which often do not match the printed documents 29
Orthographic Variability — transcription vs. modern form:
dize (transcription) ↔ dice (modern)
numero (transcription) ↔ número (modern)
Dõde (transcription) ↔ Donde (modern)
30
Orthographic Variability Simple solution: Modify the modern corpora to use old conventions. 31
Orthographic Variability — Replacement Rules (Modern Spanish → Old Spanish), applied to spanish1.txt … spanish6.txt to produce spanish1b.txt … spanish6b.txt:
u → v
c → z
ú → u
on → õ
que → q̃
…
32
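Applying such rules to a modern corpus can be sketched as simple string rewriting. The sketch below uses only the unconditional rules from the slide; the u → v and c → z rules are context-dependent (e.g. dice → dize only before e/i) and are deliberately omitted from this blind string-replace version:

```python
# Modern-to-old replacement rules (illustrative subset from the slide).
RULES = [
    ("ú", "u"),    # número -> numero
    ("on", "õ"),   # Donde  -> Dõde
    ("que", "q̃"),  # abbreviated 'que'
]


def apply_rules(text, rules=RULES):
    """Rewrite a modern-spelling corpus toward old printing conventions."""
    for modern, old in rules:
        text = text.replace(modern, old)
    return text
```

The rewritten files then serve as training data for a language model whose spelling conventions match the printed documents.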
Experiments 33
Experiments • Evaluated on five different books from the Primeros Libros collection • Years 1553 to 1600 • Differing fonts, language proportions, and clarity 34
Unknown Fonts Gante (1553) Anunciación (1565) Sahagún (1583) Rincón (1595) Bautista (1600) 35
Experimental Results — Character Error Rate (lower is better; ~90% of characters are correct):
Ocular: 12.3
+code-switch: 11.3
+orth.var.: 10.5
36
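Character Error Rate is the edit distance between the model output and the gold transcription, normalized by the length of the gold text. A standard sketch:

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)
```

A CER of 10.5, as in the best system above, means roughly one character in ten is wrong.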
A thing we do well — Without handling orth. variation: merita. With handling orth. variation: mẽtira. Modern form: mentira. 37
A thing we do wrong — Model output: “sí tĩi” ← Spanish. Gold transcription: “li tli” ← Nahuatl. The model avoids switching languages, but this passage is actually from a description of Nahuatl grammar. 38