What Kind of Language Is Hard to Language-Model?
ACL 2019
Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, Jason Eisner
Johns Hopkins University // City University of New York Graduate Center // Google
sjmielke@jhu.edu · Twitter: @sjmielke – paper and thread pinned!
Questions and answers

0. Do current language models do equally well on all languages? No.
1. Which one do they struggle more with: German or English? German.
2. What about non-Indo-European languages, say Chinese? It depends.
3. What makes a language harder to model? Actually, rather technical factors.
4. Is Translationese easier? It’s different, but not actually easier!
Outline

• “Difficulty”
• Models and languages
• What correlates with difficulty?
• And... is Translationese really easier?
How to measure “difficulty”?

Language models measure surprisal / information content (NLL; −log p(·)):

  en  p(·) = 0.03    ⇒   5 bits  I love Florence!
  de  p(·) = 0.008   ⇒   7 bits  Ich grüße meine Oma und die Familie daheim. (“I greet my grandma and the family back home.”)
  nl  p(·) = 0.0004  ⇒  11 bits  Alle mensen worden vrij en gelijk in waardigheid en rechten geboren. (“All human beings are born free and equal in dignity and rights.”)

Issue 1: Different topics / styles / content – these sentences say different things, so their surprisals aren’t comparable.
Solution: train and test on translations!

  en  p(·) = 0.013  ⇒  6.5 bits  Resumption of the session.
  de  p(·) = 0.011  ⇒  6.3 bits  Wiederaufnahme der Sitzung.
  nl  p(·) = 0.012  ⇒  6.4 bits  Hervatting van de sessie.

  • Europarl: 21 languages share ~40M chars
  • Bibles: 62 languages (13 language families) share ~4M chars – picking this subset takes a big ILP (solved with Gurobi), which is really fun

Issue 2: Comparing scores
Solution: use the total bits of an open-vocabulary model. Why?
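To make the numbers above concrete, here is a minimal sketch of the surprisal computation: the NLL of a sentence in bits is just −log₂ of the probability the model assigns to it. The probabilities are the (rounded) ones from the first table above.

```python
import math

def surprisal_bits(p: float) -> float:
    """Surprisal / information content in bits: -log2 p(sentence)."""
    return -math.log2(p)

# Rounded sentence probabilities from the first table above.
for lang, p in [("en", 0.03), ("de", 0.008), ("nl", 0.0004)]:
    print(f"{lang}: {surprisal_bits(p):4.1f} bits")
# en:  5.1 bits / de:  7.0 bits / nl: 11.3 bits -- matching the ~5 / 7 / 11 above
```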
How to compare your language models across languages

1. We need to be open-vocabulary – no UNKs.
   Every UNK is “cheating” – morphologically rich languages have more UNKs, unfairly advantaging them.
2. We can’t normalize per word or even per character in each language individually.
   Example: if puč (Czech) and Putsch (German) are equally likely, they should be equally “difficult” – even though they differ in character count.

⇒ just use the overall bits (i.e., surprisal / NLL) of an aligned sentence

[ note: the total is easily obtained from BPC or perplexity by multiplying by the total number of chars / words ]
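As a sanity check on that note, here is a minimal sketch of recovering total bits from the two common reporting conventions; the corpus sizes and scores below are invented purely for illustration.

```python
import math

def total_bits_from_bpc(bpc: float, n_chars: int) -> float:
    # bits-per-character * number of characters = total bits
    return bpc * n_chars

def total_bits_from_perplexity(ppl: float, n_words: int) -> float:
    # per-word perplexity is 2**(bits per word), so invert the exponent
    return n_words * math.log2(ppl)

print(total_bits_from_bpc(1.2, 1_000_000))        # 1.2M bits
print(total_bits_from_perplexity(60.0, 180_000))  # ~1.06M bits
```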
How to aggregate multiple intents’ surprisals into “difficulties”?

For fully parallel corpora: an aligned multi-text, e.g. (each row is one intent; cells truncated on the slide)

  1  en: Resumption of the session ...      de: Wiederaufnahme der ...     bg: Възобновяване на се- ...
  2  en: The peace that ...                 de: Der gestern verein- ...    bg: Мирът, който беше ...
  3  en: Although we were not al- ...       de: Obwohl wir nicht ...       bg: Макар че не бяхме ...
  4  en: Now we can finally ...             de: Jetzt ist die Zeit ...     bg: Накрая всички можем ...

[ Image CC-BY Mike Grauer Jr / flickr ]
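One naive way to turn such a multi-text into per-language difficulties – not necessarily the aggregation the talk ultimately settles on, and with invented numbers – is to average each language’s total bits over the aligned intents:

```python
from collections import defaultdict

# surprisals[(intent, lang)] = total bits the model spent on that aligned
# sentence; intents are the shared rows of the multi-text. Numbers invented.
surprisals = {
    (1, "en"): 120.4, (1, "de"): 131.9, (1, "bg"): 127.2,
    (2, "en"):  95.1, (2, "de"): 104.7, (2, "bg"):  99.0,
}

per_lang = defaultdict(list)
for (intent, lang), bits in surprisals.items():
    per_lang[lang].append(bits)

# Naive difficulty: mean total bits per aligned intent.
difficulty = {lang: sum(v) / len(v) for lang, v in per_lang.items()}
print(difficulty)  # {'en': 107.75, 'de': 118.3, 'bg': 113.1}
```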