Incorporating Dialectal Variability for Socially Equitable Language Identification

  1. Incorporating Dialectal Variability for Socially Equitable Language Identification
     David Jurgens, Yulia Tsvetkov, and Dan Jurafsky

  3. McNamee, P., “Language identification: a solved problem suitable for undergraduate instruction,” Journal of Computing Sciences in Colleges 20(3), 2005:
     “This paper describes […] how even the most simple of these methods using data obtained from the World Wide Web achieve accuracy approaching 100% on a test suite comprised of ten European languages”

  4. Whose language are we identifying?

  10. Global platforms attract global diversity in a language
      [Figure: English, with speaker counts labeled 60M, 125M, 251M, 90M, and 79M speakers; French, Spanish, and Arabic are also listed.]

  15. Current language detection methods perform significantly worse in less-developed countries
      [Figure: estimated LID accuracy for English tweets (y-axis) against the Human Development Index of the text’s origin country (x-axis; education, life expectancy, income). Annotations: “More Dialect” and “Less Dialect” (Labov, 1964; Ash, 2002); a brace marks 23%.]

  21. Practical Motivation: Epidemic Detection
      [Figure: pipeline with the stages Language Detection, Keyword Filter (“flu”, “sick”), and NLP (Which symptoms?), with the annotation “non-English?”]

  22. Failing to recognize a language silences its speakers’ voices

  25. Current language detection methods perform significantly worse in less-developed countries
      This is a universal NLP issue!
      Our goal is to make language ID performance equal for all languages across all dialects.
      [Figure: estimated accuracy for English tweets against the Human Development Index of the text’s origin country, annotated “More Dialect” and “Less Dialect” (Labov, 1964; Ash, 2002).]

  28. Key Problems: Current methods struggle in the global setting because
      Data: No corpora capture global variation in lexicon and dialect
      Model: Makes simplistic assumptions about how multilinguals communicate

  29. Our approach
      Better social representation through network-based sampling
      NLP methodologies capable of handling linguistic variation

  37. Our Data Solution: Improve linguistic representation through network-based sampling
      Bootstrap dialectal corpora using existing classifiers to find monolingual individuals
      [Figure: example classifier labels eng, eng, eng, eng, eng, fra, eng]
      Sample from the geolocated Twitter social network to include text from people at all locations
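
The bootstrapping step on this slide could be sketched roughly as follows. This is only an illustration: the function name, the input format (`labels_by_user`), and both thresholds (`min_posts`, `min_share`) are assumptions made for the example and are not taken from the slides.

```python
from collections import Counter


def find_monolingual_users(labels_by_user, min_posts=5, min_share=0.95):
    """Return {user: language} for users whose posts are almost all one language.

    labels_by_user maps a user id to the language codes that an existing
    classifier assigned to that user's posts (hypothetical input format).
    min_posts and min_share are illustrative thresholds, not published values.
    """
    monolingual = {}
    for user, labels in labels_by_user.items():
        if len(labels) < min_posts:
            continue  # too little evidence to call this user monolingual
        language, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_share:
            monolingual[user] = language
    return monolingual


labels_by_user = {
    "user_a": ["eng"] * 20,                 # consistently one language
    "user_b": ["eng"] * 6 + ["fra"],        # mixed labels, below the threshold
    "user_c": ["fra", "fra"],               # too few posts
}
print(find_monolingual_users(labels_by_user))  # {'user_a': 'eng'}
```

Under these assumptions, every post by a user kept by the filter could be added to the corpus under that user's language, including posts the classifier itself labeled differently.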

  42. Build strategically diverse corpora and synthesize code-switched examples
      Dimensions shown: Geographic, Topical, Social, Multilingual
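
The slide does not say how code-switched examples are synthesized. One simple possibility, shown here purely as an assumption, is to join two monolingual sentences and keep a language label for every word. The function name and label codes are hypothetical.

```python
import random


def synthesize_code_switched(sentence_a, lang_a, sentence_b, lang_b, rng=random):
    """Join two monolingual sentences into one example with per-word labels.

    Assumed strategy for illustration: plain concatenation in random order.
    """
    parts = [(sentence_a, lang_a), (sentence_b, lang_b)]
    rng.shuffle(parts)  # either language may come first
    words, labels = [], []
    for sentence, lang in parts:
        for word in sentence.split():
            words.append(word)
            labels.append(lang)
    return " ".join(words), labels


text, labels = synthesize_code_switched(
    "Je vais commander à emporter.", "fra",
    "I’m too lazy to cook.", "eng",
    rng=random.Random(0),  # seeded generator so the example is repeatable
)
print(text)
print(labels)
```

Each synthetic example carries exactly one label per whitespace-separated word, which is the kind of supervision a word-level language identifier would need.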

  48. Our model solution: treat language identification as a character-based sequence-to-sequence task (Jaech et al. 2016; Samih et al. 2016)
      Input: Je vais commander à emporter. I’m too lazy to cook.
      Encoder: Encodes the whole sentence using its characters (J e _ … o k .)
      Decoder: Decode each word’s language from the sentence encoding
      Output: Fra Fra Fra Fra Fra . Eng Eng Eng Eng Eng .
      Block annotation: Represents a multi-layer recurrent neural network
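
To make the input and output formats on this slide concrete, here is a small data-preparation sketch. It does not implement the neural network. The tokenization regex, the function name, and the choice to copy punctuation into the target unchanged are assumptions inferred from the slide's example.

```python
import re

# Words (optionally with an internal apostrophe) or single punctuation marks.
TOKEN = re.compile(r"\w+(?:[’']\w+)*|[^\w\s]")


def to_seq2seq_example(segments):
    """Build (encoder_input, decoder_target) from (text, language) segments.

    encoder_input: the characters of the full text, with spaces shown as "_".
    decoder_target: one language label per word; punctuation is copied as is.
    """
    text = " ".join(segment for segment, _ in segments)
    encoder_input = ["_" if ch == " " else ch for ch in text]
    decoder_target = []
    for segment, lang in segments:
        for token in TOKEN.findall(segment):
            is_word = token[0].isalnum()
            decoder_target.append(lang if is_word else token)
    return encoder_input, decoder_target


chars, target = to_seq2seq_example([
    ("Je vais commander à emporter.", "Fra"),
    ("I’m too lazy to cook.", "Eng"),
])
print(" ".join(chars[:3]), "…", " ".join(chars[-3:]))  # J e _ … o k .
print(" ".join(target))  # Fra Fra Fra Fra Fra . Eng Eng Eng Eng Eng .
```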

  50. Equilid vs off-the-shelf
      [Figure: Macro F1 (axis 0 to 100) on 70 Languages on Twitter for Our Method, CLD2, and langid.py (Lui et al. 2013, 2014)]
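
For readers unfamiliar with the metric on this slide, macro F1 averages the per-language F1 scores with equal weight, so a low-resource language counts as much as English. The sketch below uses made-up labels and is not the evaluation code behind the figure.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over labels seen in gold or predicted."""
    labels = set(gold) | set(predicted)
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: one French item is misclassified as English.
gold = ["eng", "eng", "eng", "eng", "fra", "fra"]
predicted = ["eng", "eng", "eng", "eng", "eng", "fra"]
print(round(macro_f1(gold, predicted), 4))  # 0.7778
```

In this toy case plain accuracy is 5/6 (about 0.83), while macro F1 is lower because the single French error weighs heavily on the French score.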
