Incorporating Dialectal Variability for Socially Equitable Language Identification
David Jurgens, Yulia Tsvetkov, and Dan Jurafsky
McNamee, P., “Language identification: a solved problem suitable for undergraduate instruction” Journal of Computing Sciences in Colleges 20(3) 2005.
“This paper describes […] how even the most simple of these methods using data obtained from the World Wide Web achieve accuracy approaching 100% on a test suite comprised of ten European languages”
Whose language are we identifying?
Global platforms attract global diversity in a language
[Figure: English, annotated with five speaker populations (60M, 125M, 251M, 90M, 79M); French, Spanish, and Arabic are also named.]
Current language detection methods perform significantly worse in less-developed countries
[Figure: estimated LID accuracy for English tweets (y-axis) against the Human Development Index of the text’s origin country (x-axis; the index combines education, life expectancy, and income). The x-axis runs from “More Dialect” to “Less Dialect” (Labov, 1964; Ash, 2002). A brace on the plot is labeled 23%.]
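Practical Motivation: Epidemic Detection
Pipeline: Language Detection -> Keyword Filter (“flu”, “sick”) -> NLP (Which symptoms?)
Text labeled non-English is filtered out at the Language Detection step. But is it really non-English?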
Failing to recognize a language silences its speakers’ voices
Current language detection methods perform significantly worse in less-developed countries
This is a universal NLP issue!
Our goal is to make language ID performance equal for all languages across all dialects
[Figure: estimated accuracy for English tweets against the Human Development Index of the text’s origin country, with the x-axis running from “More Dialect” to “Less Dialect” (Labov, 1964; Ash, 2002).]
Key Problems: Current methods struggle in the global setting because
Data: no corpora capture global variation in lexicon and dialect
Model: existing models make simplistic assumptions about how multilinguals communicate
Our approach
Better social representation through network-based sampling
NLP methodologies capable of handling linguistic variation
Our Data Solution: Improve linguistic representation through network-based sampling
Bootstrap dialectal corpora using existing classifiers to find monolingual individuals
[Figure: a series of classifier labels, six eng and one fra.]
Sample from the geolocated Twitter social network to include text from people at all locations
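The bootstrapping step can be sketched as follows: given messages already labeled by an existing classifier, keep the users whose messages are overwhelmingly in one language. This is a minimal sketch, not the authors' implementation. The function name, the input format, and both thresholds (`min_share`, `min_messages`) are assumptions made for illustration; the slides do not state them.

```python
from collections import Counter


def find_monolingual_users(labeled_messages, min_share=0.95, min_messages=5):
    """Return {user_id: language} for users who look monolingual.

    labeled_messages: iterable of (user_id, language_code) pairs, where the
    language code is the output of some existing classifier (hypothetical input).
    min_share and min_messages are illustrative thresholds, not values from the talk.
    """
    per_user = {}
    for user_id, lang in labeled_messages:
        per_user.setdefault(user_id, Counter())[lang] += 1

    monolingual = {}
    for user_id, counts in per_user.items():
        total = sum(counts.values())
        lang, top = counts.most_common(1)[0]
        # Keep users with enough messages and one clearly dominant language.
        if total >= min_messages and top / total >= min_share:
            monolingual[user_id] = lang
    return monolingual


# Toy data: user "a" has six eng labels and one fra; user "b" is evenly split.
messages = [("a", "eng")] * 6 + [("a", "fra")] + [("b", "eng")] * 3 + [("b", "fra")] * 3
users = find_monolingual_users(messages, min_share=0.8, min_messages=5)
```

All text from a selected user could then be added to that language's corpus, including messages the existing classifier labeled differently; whether the original work does exactly this is not stated on the slide.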
Build strategically diverse corpora and synthesize code-switched examples
Geographic, Topical, Social, Multilingual
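One simple way to synthesize a code-switched example is to join a sentence from one language with a sentence from another and keep a language label for every word. The slides do not say how the synthesis was actually done, so the sketch below is an assumption: the function name, the whitespace tokenization, and the random ordering are all illustrative.

```python
import random


def synthesize_code_switched(pool_a, pool_b, lang_a, lang_b, rng):
    """Join one sentence from each pool; return (text, per-word language labels).

    pool_a, pool_b: lists of monolingual sentences (hypothetical inputs).
    rng: a random.Random instance, passed in so results are reproducible.
    """
    pairs = [(rng.choice(pool_a), lang_a), (rng.choice(pool_b), lang_b)]
    rng.shuffle(pairs)  # either language may come first
    text = " ".join(sentence for sentence, _ in pairs)
    # One label per whitespace-separated word, in reading order.
    labels = [lang for sentence, lang in pairs for _ in sentence.split()]
    return text, labels


text, labels = synthesize_code_switched(
    ["Je vais commander à emporter."],
    ["I’m too lazy to cook."],
    "fra",
    "eng",
    random.Random(0),
)
```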
Our model solution: treat language identification as a character-based sequence-to-sequence task (Jaech et al. 2016; Samih et al. 2016).
Input: Je vais commander à emporter. I’m too lazy to cook.
Encoder: encodes the whole sentence using its characters (J e _ … o k .)
Decoder: decodes each word’s language from the sentence encoding
Output: Fra Fra Fra Fra Fra . Eng Eng Eng Eng Eng .
Diagram legend: each block represents a multi-layer recurrent neural network
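To make the input and output of this formulation concrete, the sketch below builds one training pair from the slide's example: a character sequence for the encoder (with `_` standing in for spaces, as on the slide) and one label per token for the decoder. It covers data preparation only, not the network. The function name and the tokenizing regular expression are assumptions for illustration.

```python
import re

# Words (optionally with an internal apostrophe) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:[’']\w+)*|[^\w\s]")


def make_seq2seq_pair(segments):
    """segments: list of (text, language_label) pairs in reading order.

    Returns (source_chars, target_labels): the character-level encoder input
    and the per-token decoder output.
    """
    source_chars = []
    target_labels = []
    for text, lang in segments:
        if source_chars:
            source_chars.append("_")  # space between consecutive segments
        source_chars.extend("_" if ch == " " else ch for ch in text)
        for token in TOKEN_RE.findall(text):
            # Words get the language label; punctuation is copied through.
            target_labels.append(lang if token[0].isalnum() else token)
    return source_chars, target_labels


src, tgt = make_seq2seq_pair([
    ("Je vais commander à emporter.", "Fra"),
    ("I’m too lazy to cook.", "Eng"),
])
```

Here `src` starts with `J`, `e`, `_` and ends with `o`, `k`, `.`, and `tgt` matches the label row shown on the slide.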
Equilid vs off-the-shelf (Lui et al. 2013, 2014)
[Figure: bar chart of Macro F1 (0 to 100) on 70 Languages on Twitter, comparing Our Method, CLD2, and langid.py.]
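Macro F1, the metric on this slide, averages the per-language F1 scores, so every language counts equally no matter how many test items it has. A minimal sketch in plain Python, assuming one gold label and one predicted label per item; the function name and the toy labels are illustrative and not taken from the talk.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["eng", "eng", "fra", "fra"]
pred = ["eng", "fra", "fra", "fra"]
score = macro_f1(gold, pred)  # eng F1 = 2/3, fra F1 = 4/5
```

Because the mean is unweighted, a system that does well only on high-resource languages is penalized for every low-resource language it misses.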