NLP for low-resourced languages Teresa Lynn, PhD Research Fellow ADAPT Centre Dublin City University The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
AI Challenges for Low-resourced Languages
• Overview of The Irish Language
• NLP with few resources
• Addressing the Lack of Irish Data
• The Future?
Irish language - status
• First Official Language
• National Language
• Census (2016): Pop. 4,761,865
• Ability to speak: 1,761,420
• Daily usage: 73,803
EU Language status
• Official EU Language
• Minority Language (low-resourced)
• Derogation on official translations (until 2021)
Morphology/Inflection
VOWEL HARMONY
• Caithim (caith + im) ‘I spend’
• Casaim (cas + aim) ‘I turn’
• Rithfinn (rith + finn) ‘I would run’
• D’íosfainn (d’íos + fainn) ‘I would eat’
LENITION
• sa cheantar ‘in the area’
• airgead a thuillfeadh sé ‘money he would earn’
• a dheartháir ‘his brother’
ECLIPSIS
• Tír na nÓg ‘Land of the Youth’
• i mBéarla ‘in English’
• ar an mbord ‘on the table’
Inflected Prepositions
• le – with: liom ‘with me’, leat ‘with you’
• ó – from: uaim ‘from me’, uait ‘from you’
• ag – at: agam ‘at me’, agat ‘at you’
• do – to: dom ‘to me’, duit ‘to you’
• faoi – about/under: fúm ‘about/under me’, fút ‘about/under you’
• ar – on: orm ‘on me’, ort ‘on you’
Word Order: VSO
English: ‘I saw the boy’
Irish: Chonaic mé an buachaill
Gloss: Saw I the boy
Irish language technology
META-NET white paper series (Judge et al., 2012)
• EU-led survey
• 31 EU languages
• Language resources and technologies
Machine Translation
• excellent: (none)
• good: English
• moderate: French, Spanish
• fragmentary: Catalan, Dutch, German, Hungarian, Italian, Polish, Romanian
• weak or no support: Basque, Bulgarian, Croatian, Czech, Danish, Estonian, Finnish, Galician, Greek, Icelandic, Irish, Latvian, Lithuanian, Maltese, Norwegian, Portuguese, Serbian, Slovak, Slovene, Swedish, Welsh
Text Analysis
• excellent: (none)
• good: English
• moderate: Dutch, French, German, Italian, Spanish
• fragmentary: Basque, Bulgarian, Catalan, Czech, Danish, Finnish, Galician, Greek, Hungarian, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish
• weak or no support: Croatian, Estonian, Icelandic, Irish, Latvian, Lithuanian, Maltese, Serbian, Welsh
Speech
• excellent: (none)
• good: English
• moderate: Czech, Dutch, Finnish, French, German, Italian, Portuguese, Spanish
• fragmentary: Basque, Bulgarian, Catalan, Danish, Estonian, Galician, Greek, Hungarian, Irish, Norwegian, Polish, Serbian, Slovak, Slovene, Swedish
• weak or no support: Croatian, Icelandic, Latvian, Lithuanian, Maltese, Romanian, Welsh
Resources
• excellent: (none)
• good: English
• moderate: Czech, Dutch, French, German, Hungarian, Italian, Polish, Spanish, Swedish
• fragmentary: Basque, Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Galician, Greek, Norwegian, Portuguese, Romanian, Serbian, Slovak, Slovene
• weak or no support: Icelandic, Irish, Latvian, Lithuanian, Maltese, Welsh
Risk of Digital Extinction “Printing Press resulted in the extinction of many minority and regional languages” Will technology have the same impact on Irish?
Risk of Digital Extinction
Need to ensure continuing language usage through technology:
• Edutainment packages/CALL
• Multi-platform word processing tools
• Automated translation
• Search engines
• Games
• Social media/online data mining
• Text generation (e.g. weather reports)
• Automatic subtitling
• …
Why do we need NLP?
• Text mining
• Text summarisation
• Machine translation
• Sentiment analysis
• Question-answering systems
• Information retrieval
• Language learning apps
• Grammar checking
• Recommender systems
• Video summarisation
Why is NLP a hard task?
• One word/sentence may have many meanings
• Many ways of saying the same thing
• Meaning depends on context
• Literal and figurative language (metaphor)
• Language and culture (different ways of conceptualising the same thing)
Ambiguous Headlines (semantic ambiguity and syntactic ambiguity)
• PANDA MATING FAILS; VETERINARIAN TAKES OVER
• EYE DROPS OFF SHELF
• SAFETY EXPERTS SAY SCHOOL BUS PASSENGERS SHOULD BE BELTED
• SQUAD HELPS DOG BITE VICTIM
• ENRAGED COW INJURES FARMER WITH AXE
• POLICE BEGIN CAMPAIGN TO RUN DOWN JAYWALKERS
• STOLEN PAINTING FOUND BY TREE
Source: http://www.alta.asn.au/events/altss_w2003_proc/altss/courses/somers/headlines.htm
What does a machine know about language?
Sentence = a string/sequence of characters: “The man saw the boy with the telescope”
SYNTACTIC PARSING 101
• Who is doing what?
• Who has the telescope?
Part of Speech Tagging
“The man saw the boy with the telescope”
The/DET man/NOUN saw/VERB the/DET boy/NOUN with/PREP the/DET telescope/NOUN
“Traditional” Parsing
S ➔ NP VP
S ➔ NP VP PP
NP ➔ Noun | Pronoun
VP ➔ Verb NP | Verb PP
PP ➔ Preposition Noun
Noun ➔ ‘ice-cream’ | ‘summer’
Pronoun ➔ ‘I’
Verb ➔ ‘like’
Preposition ➔ ‘in’
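The toy grammar on this slide is small enough to run directly. The sketch below is one possible way to do that in Python, using a plain top-down search over the rules; the function names (`parse`, `expand`, `parses`) are illustrative and not from the talk, and the approach only works because this grammar has no left recursion.

```python
# Rules copied from the slide; the parser itself is an illustrative assumption.
GRAMMAR = {
    "S": [["NP", "VP"], ["NP", "VP", "PP"]],
    "NP": [["Noun"], ["Pronoun"]],
    "VP": [["Verb", "NP"], ["Verb", "PP"]],
    "PP": [["Preposition", "Noun"]],
}
LEXICON = {
    "Noun": {"ice-cream", "summer"},
    "Pronoun": {"I"},
    "Verb": {"like"},
    "Preposition": {"in"},
}


def parse(symbol, tokens, start):
    """Yield (tree, end) for every way `symbol` can cover tokens[start:end]."""
    if symbol in LEXICON:
        # Pre-terminal: consume exactly one token if the lexicon allows it.
        if start < len(tokens) and tokens[start] in LEXICON[symbol]:
            yield (symbol, tokens[start]), start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tokens, start, [])


def expand(symbol, rhs, tokens, pos, children):
    """Match the remaining right-hand-side symbols left to right."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, nxt in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, nxt, children + [child])


def parses(sentence):
    """Return all complete parse trees for a whitespace-tokenised sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


trees = parses("I like ice-cream in summer")
```

Here `trees` holds the single analysis licensed by S ➔ NP VP PP. Anything outside the hand-written rules and lexicon gets no parse at all, which is the brittleness that motivates data-driven parsing.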
STATISTICAL PARSING
Machine Learning in NLP (data-driven approaches)
• Structured data
• Reliable data
• Labelled data
Machine Learning – data sparsity
Data Envy
Irish Data Sparsity
• Number of speakers
• Morphology
• Skill shortage
• Funding
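One way to see why morphology contributes to sparsity: initial mutations such as lenition and eclipsis turn one lexical item into several distinct surface strings, so each form is seen less often in training data. The sketch below counts word types before and after a deliberately naive normalisation. The `demutate` helper is a hypothetical simplification written for this illustration (consonant mutations only, no vowel-initial forms, no exception handling) and is not a real Irish lemmatiser.

```python
# Eclipsis prefixes and the base consonant each one covers (consonants only).
ECLIPSIS = {"bhf": "f", "mb": "b", "gc": "c", "nd": "d", "ng": "g", "bp": "p", "dt": "t"}
# Consonants that take an "h" when lenited.
LENITABLE = set("bcdfgmpst")


def demutate(word):
    """Naively strip an initial eclipsis or lenition marker (illustrative only)."""
    lower = word.lower()
    # Eclipsis is checked before lenition so "bhf..." is not misread as lenited "b".
    for prefix, base in ECLIPSIS.items():
        if lower.startswith(prefix):
            return base + lower[len(prefix):]
    if len(lower) > 2 and lower[0] in LENITABLE and lower[1] == "h":
        return lower[0] + lower[2:]
    return lower


def type_count(tokens, normalise=False):
    """Number of distinct word types, optionally after naive de-mutation."""
    return len({demutate(t) if normalise else t.lower() for t in tokens})


tokens = ["bord", "mbord", "bhord", "ceantar", "cheantar"]
raw_types = type_count(tokens)                          # 5 surface forms
normalised_types = type_count(tokens, normalise=True)   # 2 underlying words
```

A real system would need a proper morphological analyser; the point is only that five surface forms here correspond to two underlying words.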
Addressing the lack of data
• Train more experts
• Cross-lingual transfer
• Bootstrapping
• Synthetic data
CROSS-LINGUAL TRANSFER
• Using data from one language to help build a system for another
• Universal Dependencies
• Multi-word Expressions
BOOTSTRAPPING
• Using limited data to train a sub-standard system, then using its output to speed up further annotation (humans correct the system's output rather than annotating from scratch)
• Passive learning
• Active learning
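The difference between the two learning modes comes down to how the next batch for human correction is chosen. The sketch below assumes the bootstrapped system can attach a confidence score to each automatically annotated sentence; the function names, the sentence IDs, and the scores are hypothetical. Least-confidence selection is only one common active-learning criterion, and the slide does not say which criterion was used.

```python
import random


def passive_batch(pool, k, seed=0):
    """Passive learning: pick the next k sentences at random."""
    rng = random.Random(seed)
    return rng.sample(sorted(pool), k)


def active_batch(confidence, k):
    """Active learning: pick the k sentences the current model is least sure about."""
    return sorted(confidence, key=confidence.get)[:k]


# Hypothetical model confidence for four automatically parsed sentences.
confidence = {"s1": 0.95, "s2": 0.40, "s3": 0.72, "s4": 0.15}

to_correct = active_batch(confidence, 2)    # ["s4", "s2"]
random_pick = passive_batch(confidence, 2)  # any two sentence IDs
```

After the annotator corrects the selected sentences, they are added to the training data, the system is retrained, and the loop repeats.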
SYNTHETIC DATA e.g. Back Translation for Machine Translation
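As a rough sketch of back translation: monolingual text in the target language (here Irish) is translated back into the source language by a reverse system, and each machine-made source sentence is paired with its authentic target sentence as extra training data. In the code below the word-for-word dictionary stands in for a real Irish-to-English MT system; it and all function names are assumptions made for illustration.

```python
# Toy stand-in for a reverse (Irish -> English) MT system; purely illustrative.
TOY_GA_EN = {"chonaic": "saw", "mé": "I", "an": "the", "buachaill": "boy"}


def toy_reverse_translate(irish_sentence):
    """Word-for-word lookup; unknown words are passed through unchanged."""
    return " ".join(TOY_GA_EN.get(w.lower(), w) for w in irish_sentence.split())


def back_translate(monolingual_target, reverse_translate):
    """Pair each authentic target sentence with a synthetic source sentence."""
    return [(reverse_translate(t), t) for t in monolingual_target]


synthetic_pairs = back_translate(["Chonaic mé an buachaill"], toy_reverse_translate)
# Each pair is (synthetic English source, authentic Irish target).
```

The synthetic source side can be noisy (here it keeps Irish word order), but the target side is genuine Irish, which is the side the English-to-Irish system learns to produce.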
On that MT note…
• Tapadóir SMT system (BLEU 46)
• SMT vs NMT (NMT BLEU 40)
• Domain-tuning, linguistic features (hybrid)
• Increasing data collection (European Language Resource Coordination)
Digital Strategy for the Irish Language 2019
• Linguistic Resources
• Knowledge Bases
• Corpora
• NLP Tools
• NLG Tools
• Speech Models
• Speech Synthesis
• Speech Recognition
• Machine Translation
• Spoken Dialogue Systems
• Synergies (Industry and Public)
• Information Retrieval
• State and Public Use
• Disability and Access
• CALL
TRAINING MORE EXPERTS
• Machine Translation
• Irish Twitter Analysis
• Processing Irish Multiword Expressions
• Irish Syntactic Parsing
Go Raibh Maith Agaibh #GRMA teresa.lynn@adaptcentre.ie @cigilt