
Corpus linguistics resources and tools for Arabic lexicography


  1. Workshop on Arabic Corpus Linguistics, 11-12 April 2011, Lancaster University
     Corpus linguistics resources and tools for Arabic lexicography
     Majdi Sawalha and Eric Atwell
     School of Computing, University of Leeds, Leeds, LS2 9JT, UK
     http://www.comp.leeds.ac.uk/sawalha
     http://www.comp.leeds.ac.uk/eric

  2. Outline
     • Introduction
        – Oxford University Dictionaries
           • Monolingual Dictionaries
           • Bilingual Dictionaries
        – Arabic & Arabic NLP
           • Difficulties
           • Morphological analysis
        – Traditional Arabic lexicography
        – Constructing the Arabic broad-coverage lexical resource
        – Key notes
        – Conclusions
        – References
     12/4/2011

  3. Oxford Dictionaries
     • Searching for a word: the input "Universities" is lemmatised to "University".
     • Dictionary entries are the lemmas of the words.
     • Lemmatising of input words is needed to direct the search for the input word to the correct dictionary entry.
     • What happens if the user enters a misspelled (unknown) word, such as "Univercity"?
     • The dictionary gives suggestions of similar words: university, inveracity, intercity.
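
The lookup behaviour described on this slide can be sketched in a few lines of Python. This is only an illustration, not how the Oxford site works: the lemma table and headword list are hypothetical toy data, and the suggestions come from the standard library's `difflib.get_close_matches`, which ranks candidates by a generic string-similarity ratio.

```python
from difflib import get_close_matches

# Hypothetical toy data: inflected form -> lemma, plus a small headword list.
LEMMAS = {"universities": "university", "university": "university"}
HEADWORDS = ["university", "inveracity", "intercity", "universe", "dictionary"]


def look_up(word):
    """Return ("entry", lemma) for a known word, else ("suggestions", [...])."""
    key = word.lower()
    if key in LEMMAS:
        # Known form: direct the search to the lemma's dictionary entry.
        return "entry", LEMMAS[key]
    # Unknown (possibly misspelled) form: propose the most similar headwords.
    return "suggestions", get_close_matches(key, HEADWORDS, n=3, cutoff=0.6)


print(look_up("Universities"))
print(look_up("Univercity"))
```

A real dictionary front end would use a full lemmatiser and a spelling model tuned to the language; the similarity ratio here is just a convenient stand-in.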

  4. Dictionary Entry
     • Dictionary entry: information provided
     [Annotated screenshot of a dictionary entry. Labels: dictionary entries, pronunciation, POS + plural form, meaning, examples, phrases, other dictionaries, origin]

  5. What information can be provided by the Arabic dictionaries?
     • Dictionary entry
     • Meaning
     • Pronunciation
     • Related words list: a list of the words derived from the same root, or words that have the same lemma
     • POS
     • Lemma
     • Root
     • Pattern
     • Plural form
     • Position in dictionary
     • Examples
     • Using a suitable font that shows the letters and the diacritics
     • Using colours to illustrate clitics attached to the word
     • Phrases, collocations, idioms (meaning and examples)
     • Origin (relation to traditional Arabic dictionaries)
     • Morphological features: detailed description of the morphological features of the word's morphemes
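
One way to picture the fields listed on this slide is as a record type. The sketch below is an assumption made for illustration, not the authors' data model: the class name `ArabicEntry`, the field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ArabicEntry:
    """Hypothetical record holding the information the slide lists."""
    lemma: str
    root: str
    pattern: str
    pos: str
    pronunciation: str
    meaning: str
    plural: str = ""
    related_words: list = field(default_factory=list)   # same root or lemma
    examples: list = field(default_factory=list)
    phrases: list = field(default_factory=list)         # collocations, idioms
    origin: str = ""                                    # link to traditional lexicons
    morphological_features: dict = field(default_factory=dict)


# Illustrative values for the word "university".
entry = ArabicEntry(
    lemma="جامعة",
    root="جمع",
    pattern="فاعلة",
    pos="noun",
    pronunciation="ǧāmi‘a",
    meaning="university",
    plural="جامعات",
    related_words=["جمع", "مجموعة", "اجتماع"],
)
print(entry.lemma, entry.root, entry.plural)
```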

  6. ‘university’ in OED online

  7. Traditional Arabic Lexicography
     • Arabic lexicography is one of the original and deep-rooted arts of Arabic literature.
     • The first lexicon constructed was kitāb al-‘ayn ("al-‘ayn lexicon") by al-farāhīdī (died in 791).
     • Over the past 1200 years, many different kinds of Arabic lexicons have been constructed; they differ in ordering, size and purpose.
     • Many Arabic linguists and lexicographers have studied the construction and development of these lexicons and the different methodologies used to build them.

  8. Ordering lexical entries in the Arabic lexicons
     • al-ḫalīl methodology [5 lexicons]
        – Lists the lexical entries according to where their letters are pronounced, from the farthest point in the mouth to the nearest.
     • abū ‘ubayd methodology [3 lexicons]
        – Lists the lexical entries based on similarity in meaning.
     • al-ǧawharī methodology [4 lexicons]
        – Lists the lexical entries based on the last letter of the word.
     • al-barmakī methodology [11 lexicons]
        – Lists the lexical entries alphabetically.
     • Modern dictionaries use a combination of root and word as lexical entries, arranged alphabetically.
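
Two of the orderings above, alphabetical and by last letter, can be illustrated as different sort keys over the same list. This is a simplified sketch: the four roots are sample data, and it assumes that Unicode code-point order is close enough to Arabic alphabetical order for these particular letters. It does not reproduce the full ordering rules of any of the lexicons.

```python
# Sample roots: k-t-b (write), ǧ-m-‘ (gather), d-r-s (study), ‘-l-m (know).
ROOTS = ["كتب", "جمع", "درس", "علم"]


def alphabetical_key(root):
    # Assumption: code-point order approximates alphabetical order here.
    return root


def last_letter_key(root):
    # Order by the final letter first, then by the whole root.
    return (root[-1], root)


alphabetical = sorted(ROOTS, key=alphabetical_key)
by_last_letter = sorted(ROOTS, key=last_letter_key)

print(alphabetical)
print(by_last_letter)
```

The same entries end up in a different sequence under each key, which is why a reader needs to know a lexicon's methodology before looking a word up.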

  9. http://www.comp.leeds.ac.uk/cgi-bin/scmss/arabic_roots.py

  10. The use of corpora in building dictionaries – Example 1
     • Lexicographers select examples from concordance lines.
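
A concordance shows every occurrence of a keyword with a little context on each side, which is what lets a lexicographer pick good examples. The following is a minimal keyword-in-context sketch over a toy English token list; the function name `concordance` and the sample sentence are illustrative assumptions, not part of the tools shown in the talk.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


text = ("the university opened a new library and "
        "the university library holds rare books")
tokens = text.split()

for left, key, right in concordance(tokens, "university", width=2):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>12} [{key}] {right}")
```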

  11. Lemmatized Arabic Internet Corpus http://corpus.leeds.ac.uk/query-ar.html

  12. The use of corpora in building dictionaries – Example 2

  13. The use of corpora in building dictionaries – Example 2, cont.
     • Frequency lists and measurements can be used to compare two words and to find how these words are related to each other.
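
As a rough illustration of such a comparison, the sketch below counts two words in a toy token list and normalises the counts to occurrences per million tokens, a common way to make frequencies comparable across corpora of different sizes. The data and the function name `per_million` are assumptions made for the example.

```python
from collections import Counter


def per_million(tokens, words):
    """Relative frequency of each word, in occurrences per million tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: counts[word] * 1_000_000 / total for word in words}


# Toy corpus of 10 tokens.
tokens = ["university"] * 3 + ["college"] * 1 + ["the"] * 6
freqs = per_million(tokens, ["university", "college"])

print(freqs)
print("ratio:", freqs["university"] / freqs["college"])
```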

  14. Collocations example from the Arabic Internet Corpus

  15. Oxford bilingual dictionaries
     [Annotated screenshot. Labels: English, POS, French, other languages, examples from parallel corpora]

  16. Google translation

  17. Oxford bilingual dictionaries
     [Annotated screenshot. Labels: dictionary entry, meaning in English, word list, language, POS, pronunciation]

  18. Oxford bilingual dictionaries
     • Do users need to know the meaning of a word in many different languages?
     • Users need to translate from one language to another.
     • Connecting terms through a central language (English) can connect two languages together.
     • Specific linguistic information for each word can be provided from its monolingual dictionary, in addition to the information in the central language (English).
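
The idea of connecting two languages through a central language can be sketched as two lookups chained together. The word lists and the function name `translate_via_pivot` below are hypothetical; a real system would also have to deal with words that have several senses, which a single pivot term cannot distinguish.

```python
# Hypothetical toy word lists, both keyed on the central language (English).
AR_TO_EN = {"جامعة": "university", "كتاب": "book"}
EN_TO_FR = {"university": "université", "book": "livre"}


def translate_via_pivot(word, source_to_pivot, pivot_to_target):
    """Translate by going source -> pivot -> target; None if either step fails."""
    pivot = source_to_pivot.get(word)
    if pivot is None:
        return None
    return pivot_to_target.get(pivot)


print(translate_via_pivot("جامعة", AR_TO_EN, EN_TO_FR))
print(translate_via_pivot("قلم", AR_TO_EN, EN_TO_FR))
```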

  19. Arabic & Arabic NLP
     • 200 million people speak Arabic as a first language.
     • More than 1 billion Muslims need Arabic to recite the Quran (the holy book of Muslims).
     • Arabic is one of the official languages of the UN.
     • Interest in learning Arabic has grown recently.
     • Many commercial software companies invest in Arabic NLP.

  20. Why is Arabic NLP difficult?
     • Complex morphology
        – Words consist of multiple morphemes of 5 kinds: proclitics, prefixes, stem/root, suffixes, enclitics.
        – Example: وسيكتبونها [wasayaktubūnhā] (And they will write it)
          conjunction (و) + particle of futurity (س) + progressive letter (ي) + root/stem (كتب) + relative pronoun, plural/subject (ون) + relative pronoun, object (ها)
     • Vowels & diacritic marks
        – 3 long vowels (ا alif, و wāw, ي yā’)
        – 3 short vowels (fatḥah ـَ, ḍammah ـُ, kasrah ـِ)
        – Other diacritics: sukūn ـْ, šaddah ـّ, tanwīn (ـً, ـٌ, ـٍ)
        – taṭwīl character (ـ)
        – hamzah (ء), tā’ marbūṭah (ة) and hā’ (ه), yā’ (ي) and alif maqṣūrah (ى), and maddah (آ)
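
Because the same word can appear fully vowelized, partly vowelized, unvowelized, or stretched with taṭwīl, corpus tools often normalise text before matching. The sketch below shows one common approach, assuming Unicode input: it drops every character that the standard library's `unicodedata.combining` reports as a combining mark, plus the taṭwīl character. The function name `strip_diacritics` is an assumption for the example, and this is not presented as the normalisation used by the authors.

```python
import unicodedata

TATWEEL = "\u0640"  # the elongation character


def strip_diacritics(text):
    """Remove combining marks (short vowels, sukūn, šaddah, tanwīn) and taṭwīl."""
    return "".join(
        ch for ch in text
        if unicodedata.combining(ch) == 0 and ch != TATWEEL
    )


vocalized = "جَامِعَةٌ"   # "university" with short vowels and tanwīn
stretched = "جـــامعة"    # the same word elongated with taṭwīl

print(strip_diacritics(vocalized))
print(strip_diacritics(stretched))
```

Precomposed letters such as alif with maddah are single code points and pass through unchanged. Pairs such as yā’ and alif maqṣūrah are left alone here, since merging them loses information.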

  21. Morphological analysis of Arabic text
     • Morphological analysis is essential for processing Arabic text corpora and building dictionaries.
     • Existing Arabic morphological analyzers failed to achieve accuracy rates of more than 75% (Sawalha & Atwell, 2008).
     • We cannot rely on such analyzers for further analysis such as part-of-speech tagging and parsing.
     Morphological analyzer for Arabic text:
        Step 1: Tokenization (different text types, spell-checking)
        Step 2: Function words
        Step 3: Clitics, affixes & stems
        Step 4: Root/lemma extraction
        Step 5: Pattern generation
        Step 6: Vowelization
        Step 7: Assigning detailed morphological feature tags for each of the word's morphemes
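
To give a feel for Step 3 (clitics, affixes and stems), here is a deliberately naive sketch that peels known affixes off both ends of a word. Everything in it is an assumption made for illustration: the affix lists are tiny and hypothetical, the minimum stem length of three letters is an arbitrary guard, and the greedy strategy is not the authors' analyzer. Real analyzers need far larger lists and a way to handle ambiguity.

```python
# Hypothetical, deliberately tiny affix lists.
PROCLITICS = ("و", "ف")                 # conjunctions
FUTURE_PARTICLES = ("س",)               # particle of futurity
IMPERFECT_PREFIXES = ("ي", "ت", "ن", "أ")
SUFFIXES = ("ون", "ين", "ان")
ENCLITICS = ("ها", "هم", "ه")           # object pronouns
MIN_STEM = 3                            # never cut below three letters


def strip_end(word, options):
    """Remove the first matching ending, if enough letters remain."""
    for opt in options:
        if word.endswith(opt) and len(word) - len(opt) >= MIN_STEM:
            return word[:-len(opt)], opt
    return word, ""


def strip_start(word, options):
    """Remove the first matching beginning, if enough letters remain."""
    for opt in options:
        if word.startswith(opt) and len(word) - len(opt) >= MIN_STEM:
            return word[len(opt):], opt
    return word, ""


def segment(word):
    """Split a word into proclitic, particle, prefix, stem, suffix, enclitic."""
    rest, enclitic = strip_end(word, ENCLITICS)
    rest, suffix = strip_end(rest, SUFFIXES)
    rest, proclitic = strip_start(rest, PROCLITICS)
    rest, future = strip_start(rest, FUTURE_PARTICLES)
    rest, prefix = strip_start(rest, IMPERFECT_PREFIXES)
    return [m for m in (proclitic, future, prefix, rest, suffix, enclitic) if m]


print(segment("وسيكتبونها"))  # "and they will write it"
```

The length guard stops a short word such as ولد from losing its first letter, but a greedy stripper like this will still over-segment many words, which is one reason the later steps (root extraction, pattern generation) are needed.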

  22. Example of analyzed word
     • Word: والجامعات wa al-ǧāmi‘āt (And the universities)
     • Segmentation: conjunction + definite article + stem + feminine plural
     • Information shown for the word:
        – Word's list of similar root letters
        – Pronunciation
        – Lemma <link>
        – Root <link>
        – POS
        – Pattern
        – Meaning
        – Examples
        – Plural form
        – Phrases, collocations, idioms
        – Origin (links to traditional Arabic lexicons)
     • Word's morphemes:
        p--c-------------------   Particle, conjunction (clitic)
        r---d------------------   definite article (clitic)
        np----flp-vndd---ncat-s   collective noun, feminine (M/S), varied, non-human …
        r---l------------------   plural feminine letters
     http://www.comp.leeds.ac.uk/sawalha/tagset.html
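
The tags on this slide are fixed-width strings in which each position carries one feature and a dash means the feature is not set. The sketch below only reports which positions are filled; it does not interpret them, since the meaning of each position is defined by the tagset page linked above. The function name `set_positions` is an assumption for the example.

```python
def set_positions(tag):
    """Map 1-based position -> value for every position that is not a dash."""
    return {i: ch for i, ch in enumerate(tag, start=1) if ch != "-"}


# Tags copied from the slide.
tags = [
    "p--c-------------------",
    "r---d------------------",
    "np----flp-vndd---ncat-s",
    "r---l------------------",
]

for tag in tags:
    print(tag, set_positions(tag))
```

Looking up each filled position in the tagset documentation would turn these raw values into feature names such as the ones listed beside the tags.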
