This lecture is not 22 May 2014 About Arabic language technologies Description of the state-of-the-art Highly technical Duplicate to other presentations (I hope) Arabic Language Challenges Boring (promise) Walid Magdy This lecture is about This sentence is written in Arabic language Why Arabic Language is Important Arabic orthographic nature Arabic morphological nature Arabic phonetic nature Challenges stem from this nature
Language Technology Arabic Language Technology Related to the Language People Speak Arabic is the largest living member of the Semitic language family Information retrieval (Google) It is classified as a macro-language Translation (Google-translate) with 27 sub-languages Question Answering It is spoken by over 280 million Sentiment Analysis people in 28 countries (middle-east) Automatic Speech Recognition (ASR, e.g. Siri) The language of Quran (over 1.6 billion Muslims) Optical Character Recognition (OCR) Arabic Language (Internet) Arabic Language (Types) Current written Arabic is the modern standard Arabic English English Unified across all Arabic countries (news, political speeches) Chinese Chinese Easy to understand by all Arabs Spanish Spanish Not spoken by people! Japanese Japanese Portuguese Portuguese Spoken Arabic (dialectic Arabic) German German Different across Arabic countries (regions) Arabic Arabic Semi-understandable by different Arabic dialectic French French For informal use (on social media) Russian Russian Korean Korean Classic Arabic (Language of Quran) Rest of the Languages Rest of the Languages Contains ancient Arabic words 0E+00 1E+08 2E+08 3E+08 4E+08 5E+08 6E+08 0% 500% 1000% 1500% 2000% 2500% 3000% Mostly understandable by Arabic people Internet users by language (2010) Growth in Internet (2000-2010) Previously used different version of Arabic scripts
Arabic Language Nature Orthographical Nature Orthographical nature: Written from right to left (letters only) The way to write Arabic letters 15 of the 28 letters contain dots OCR Characters are connected or semi-connected Morphological nature: The way to construct Arabic sentences Character shape depends on position NLP, IR, MT, QA Printed text may include ligatures and kashida Phonetic nature: Optional diacritics may be present The way to pronounce Arabic letters and words ASR, T2S, S2S 15 of the 28 letters contain dots Character shape depends on position middle begin end isolated middle begin end isolated
Presence of kashida and ligatures Optional diacritics may be present It was very ambiguous What about Arabic OCR? Word Error Rates (WER) are considerably high Good Arabic OCR: 30-40% WER on average Trained on similar font: <10% WER Old fonts: >70% WER Average WER for English: <5%
Morphological Nature Short vowels are not written Language is built of 10k roots In the Arabic text we do not write its short vowels and the pronouns are attached to the words Short vowels are not written (diacritics) In th Arbc txt w do nt writ its short vwls and th pronuns ar Words contain prefix, infix, and suffix (pronouns, others) attachd to th words (the, and, his, her, their, it, him, them, will …) are attached to the main word In thArbc txt w do nt writ itsshort vwls andthpronuns ar attachd to thwords Word spelling can change according to grammatical بتك (kataba) position write بتك (kotub) books No rule for plural words بتك (kattaba) let someone write 60 billion possible surface forms بتك (kuttiba) forced to write Words contain prefix, infix, and suffix No rule for plural ءلبؤه ءانبأ رتيب They are Peter ’ s children Singular Plural ءانبلؤا اوفرصتاديج The children behaved well لجر man جرال اهءانبأ فاطل men Her children are cute يئانبأ ءافرظ My children are funny بتاك writer تكاب Writers انيلعنأيمحن انءانبأ We have to save our children ءابلأا ءانبلؤاو ءادعس بتكم office كمابت Patents and children are happy offices وهبحي هءانبأ He loves his children ةبتكم library بتكمتا libraries هؤانبأ هنوبحي His children loves him فتاه telephone وهافت telephones ــيـسوبتـكاـهـنو بتك (kataba) write يلصم prayer يلصمن prayers wasaya+ktub+unahaa كابت (kateb) writer and will + write + they it مامإ leader ةمئأ leaders تكاب (ketab) book = and they will write it
What about Arabic IR? Phonetic Nature Some phonemes are in Arabic doesn ’ t exist in other Some characters are normalized language ( ‘ ein, ghain, ha, kha, Dad, Sad, Ta, Hamza) Diacritics (short vowels) are removed (if existed) Examples: Later approaches for search Mohamed (ha) - Search with words ( ‘ ein, Ta) - Apply light stemming for words Attia - Apply morphological stemming for words Khalid (kha) - Simple character n-grams representation Ghada (ghain) Baraa (Hamza) New Methods are being developed for Social Arabic كنولشيا ،؟يازا ،؟جص ،؟ كلوقبمز ،ةدكلول Diaa (Dad, Hamza) we lessa ba2a el3arabi elli maktoob bel7rof el inglizi :D State-of-the-art / Areas of research What about Arabic ASR? Needs special training and decoding Language Technology MSA Dialect Arabic Stemming (Segmentation) Good Needs work Requires huge amount of training POS Good Good for some Requires diacritisation as a pre-processing step NER Good Can be improved State-of-the-art is not bad (for MSA) Search (IR) Good Good ASR Good Needs work Again for dialect, it is too bad Sentiment analysis Needs work Not working! Sarcasm detection NA HELP!! Syntactic tree parsing kind of What is it?
Conclusion Language technology requires deep algorithms to overcome language challenges Thank you Arabic language is full of challenges Huge amount of work already done اركش Huge amount of work is still needed (shokran) Some languages are just harder to deal with in NLP than others!
Recommend
More recommend