Natural Language Processing for historical language varieties
Cristina Sánchez Marco
Gjøvik University College
MTL lectures, April 3 2013
April 3 2013, C. Sánchez Marco, GUC, NLP for historical language varieties

NLP and its applications
Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
Applications: question answering, sentiment analysis, machine translation, information extraction, ...
An example: information extraction
→ Email:
    Subject: curriculum meeting
    Date: January 15, 2012
    To: Dan
    Hi Dan, we've now scheduled the curriculum meeting. It will be in Gates 159 tomorrow from 10:00-11:30. -Chris
→ Create new Calendar entry:
    Event: Curriculum meeting
    Date: Jan-16-2012
    Start: 10:00am
    End: 11:30am
    Where: Gates 159

Morphological analysis and tagging
Morphological analysis and tagging is a basic NLP task that is useful in many applications. It has two steps:
Part-of-speech (POS) tagging: the process of marking up each word in a text with its part of speech. 9 parts of speech: noun, verb, adjective, adverb, preposition, determiner, conjunction, pronoun, interjection. For example, the word "the" is a determiner.
Lemmatisation: the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, the word "better" has "good" as its lemma.
→ Words often have more than one possible POS ("ambiguity"). The POS tagging problem is to determine the POS tag for a particular instance of a word in a sentence.
→ Uses: text-to-speech (how do we pronounce "lead"?), spelling checkers, input to (or a speed-up for) a full parser, OCR scanning, ...

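The ambiguity problem can be made concrete with a toy lookup. This is only a minimal sketch: the lexicon entries and the function names are illustrative assumptions, not taken from any real dictionary or tool.

```python
# Toy lexicon: each word form maps to its candidate (lemma, POS) analyses.
# The entries are illustrative assumptions, not real dictionary data.
LEXICON = {
    "the": [("the", "determiner")],
    "better": [("good", "adjective"), ("well", "adverb"), ("better", "verb")],
    "lead": [("lead", "noun"), ("lead", "verb")],
}


def analyses(word):
    """Return all candidate (lemma, POS) pairs for a word form."""
    return LEXICON.get(word.lower(), [])


def is_ambiguous(word):
    """A word is POS-ambiguous when its analyses carry more than one POS."""
    return len({pos for _, pos in analyses(word)}) > 1
```

A dictionary lookup of this kind only lists the candidates; choosing the right one for a given sentence is the job of the tagger.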
Morphological analysis and tagging

An example from the 12th century

An example
→ Electronic editions prepared by the Hispanic Seminary of Medieval Studies
[fol. 32v] {CB1.
Ca delo q<ue> mas amaua yal viene el mandado
Dozi[en]tos cauall<er>os mando exir p<r><<i>>uado
Q<ue> Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bie<n> sabe q<ue> albarfanez t<r><<a>>he todo Recabdo

An example: Paleographic symbols
[fol. 32v] {CB1.
Ca delo q<ue> mas amaua yal viene el mandado
Dozi[en]tos cauall<er>os mando exir p<r><i>uado
Q<ue> Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bie<n> sabe q<ue> albarfanez t<r><a>he todo Recabdo

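Before a tokenizer can see plain word forms, markup of this kind has to be resolved. The following is a minimal sketch that keeps the letters inside the angle and square brackets and drops the delimiters. The regular expressions, the function name, and the decision to discard the folio and column markers are assumptions made for illustration, not the procedure used in the project.

```python
import re


def strip_markup(line):
    """Remove editorial delimiters but keep the letters they enclose."""
    # Drop layout markers such as "[fol. 32v]" and "{CB1."
    # (assumed here to carry no textual content).
    line = re.sub(r"\[fol\.[^\]]*\]", "", line)
    line = re.sub(r"\{CB\d+\.", "", line)
    # Keep the content of <...> and <<...>>, drop the angle brackets.
    line = re.sub(r"<+([^<>]*)>+", r"\1", line)
    # Keep the content of [...], drop the square brackets.
    line = re.sub(r"\[([^\]]*)\]", r"\1", line)
    # Normalise whitespace.
    return " ".join(line.split())


plain = strip_markup("Dozi[en]tos cauall<er>os mando exir p<r><<i>>uado")
```

With this input, `plain` is the unmarked reading "Dozientos caualleros mando exir priuado". A real pipeline might prefer to record the removed markup rather than discard it.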
An example: Spelling
[fol. 32v] {CB1.
Ca delo (de lo) que mas amaua (amaba) yal (ya le) viene el mandado
Dozientos (Doscientos) caualleros (caballeros) mando (mandó) exir priuado (privado)
Que Reçiban (reciban) a myanaya & alas (a las) duenas (dueñas) fijas dalgo (hidalgas)
El (Él) sedie (sería) en valençia (valencia) curiando (curando) & guardando
Ca bien sabe que albarfanez trahe (trae) todo Recabdo (Recaudo)

An example: Capital letters and punctuation
[fol. 32v] {CB1.
Ca delo que mas amaua yal viene el mandado (,)
Dozientos caualleros mando exir priuado (,)
Que Reçiban a myanaya (Myanaya) & alas duenas fijas dalgo (,)
El sedie en valençia (Valencia) curiando & guardando (,)
Ca bien sabe que albarfanez (Albarfanez) trahe todo Recabdo (recaudo)(.)

An example: Word order
[fol. 32v] {CB1.
Ca delo que mas amaua yal viene el mandado (Ca yal viene el mandado delo que mas amaua)
Dozientos caualleros mando exir priuado
Que Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bien sabe que albarfanez trahe todo Recabdo

Challenge and solution
The challenge is to enrich the text with lemma and POS tag.
1 Manually
2 Build a tagger from scratch
3 Use an existing tool for a modern language variety
    Advantages: resource saving
    Disadvantages: unacceptable accuracy, manual correction
4 Adapt an existing tool for a modern language variety
    Advantages: reusable, sustainable, relatively easy to adapt, resource saving, extensible to other language varieties
    Disadvantages: Is it easy and resource-saving to do this? Is it easy to adapt to other language varieties?

Our specific case study and proposal
Solution 4: adapt the tool
Tool: FreeLing http://nlp.lsi.upc.edu/freeling
Language: (standard) Modern Spanish
Specific advantages of adapting FreeLing:
    open-source
    well documented and actively maintained
    modular, relatively easy to adapt

FreeLing processing pipeline
raw text → tokenizer → ANALYZER: morphological analysis (dictionary, affixation) → TAGGER (probabilities) → tagged corpus

Method
Using the existing standard Modern Spanish tool as a basis to create an Old Spanish analyzer:
    Expansion of the dictionary
    Retraining of the tagger
    Modification of other modules: tokenization, affixation

Data
1 Old Spanish Corpus
2 Gold Standard Corpus
3 (Standard) Modern Spanish Corpus

1. Old Spanish Corpus
Electronic editions by the Hispanic Seminary of Medieval Studies
Critical editions of the original manuscripts
12th to 16th century Spanish
Representative corpus:
    more than 20 million tokens, 470 thousand types
    variety of genres (fiction and non-fiction)
→ To expand the dictionary

2. Gold Standard Corpus
60,000 tokens from the Old Spanish Corpus (50%) and a Modern Spanish tagged corpus (50%)
It mirrors the Old Spanish Corpus in size and text-type distribution
→ To retrain the tagger and carry out the evaluation and error analysis

3. Standard Modern Spanish Corpus
Corpus LexEsp (Sebastián et al. 2000)
    from 1975 to 1995
    more than 5 million words
    variety of genres
→ Baseline performance for the tagger

Dictionary expansion: Data
556,210 Standard Spanish words (669,121 lemma-tag pairs) + 58,435 Old Spanish words ≈ 614,000 word forms (744,160 lemma-tag pairs)
Distribution of words added to the dictionary:
    Verbs          83.4%     Pronouns       1.3%
    Nouns          26.8%     Determiners    1%
    Adjectives     9.4%      Adverbs        0.7%
    Prepositions   2.1%      Conjunctions   0.5%
    Numbers        1.7%      Interjections  0.3%
    Proper names   1.4%      Punctuation    0.01%

Dictionary expansion: Method
Mapping rules:
Substring rules (54 sequences of characters): 42% of the words added
    Old    Modern   Example
    euo    evo      nueuo → nuevo ‘new’
    uio    vio      uio → vio ‘saw’
    -f     -ube     nuf → nube ‘cloud’
    sp-    esp-     spera → espera ‘wait’
Word rules: 39% of the words added
    consul → cónsul ‘consul’
    catholica → católica ‘catholic’
VARD 2 (69 spelling rules): 19% of the words added
    Old    Modern
    j      í
    nn     ñ
    rr     r

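One way such substring rules might be used is to rewrite an old form and, if the result is a known modern form, let the old form inherit its analyses. The sketch below assumes this reading; the slides do not describe the actual procedure. `MODERN_DICT`, its tags, and all function names are hypothetical stand-ins, and only three of the rules shown above are included.

```python
# Substring rules (old sequence -> modern sequence) and one prefix rule.
SUBSTRING_RULES = [("euo", "evo"), ("uio", "vio")]
PREFIX_RULES = [("sp", "esp")]

# Hypothetical stand-in for the Modern Spanish dictionary:
# word form -> list of (lemma, tag) pairs. The tags are illustrative only.
MODERN_DICT = {
    "nuevo": [("nuevo", "AQ0MS0")],
    "vio": [("ver", "VMIS3S0")],
    "espera": [("esperar", "VMIP3S0"), ("espera", "NCFS000")],
}


def modern_candidates(old_form):
    """Yield modern spellings obtained by applying one rule to old_form."""
    for old, new in SUBSTRING_RULES:
        if old in old_form:
            yield old_form.replace(old, new)
    for old, new in PREFIX_RULES:
        if old_form.startswith(old):
            yield new + old_form[len(old):]


def expand_dictionary(old_forms):
    """Give each old form the analyses of the first modern form a rule reaches."""
    new_entries = {}
    for form in old_forms:
        for candidate in modern_candidates(form):
            if candidate in MODERN_DICT:
                new_entries[form] = MODERN_DICT[candidate]
                break
    return new_entries


entries = expand_dictionary(["nueuo", "uio", "spera", "xyz"])
```

Here `entries` gains "nueuo", "uio" and "spera", while "xyz" is left out because no rule maps it to a known modern form. Checking candidates against the modern dictionary keeps a rule from producing non-words.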
Retraining of the tagger
Use of the Gold Standard Corpus
Two taggers:
    Hybrid (relax), integrating statistical and hand-coded grammatical rules
    Hidden Markov Model (hmm), trigram Markovian tagger based on TnT (Brants, 2000)

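To illustrate how an HMM tagger picks one tag sequence, here is a minimal Viterbi sketch. It is deliberately simplified: it uses tag bigrams rather than the trigrams mentioned above, has no smoothing, and all probabilities are made-up toy values rather than estimates from the Gold Standard Corpus. It is not FreeLing's or TnT's implementation.

```python
def viterbi(words, tags, start, trans, emit):
    """Return the most probable tag sequence under a bigram HMM.

    Assumes words is non-empty. Unseen word/tag pairs get probability 0.
    """
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               previous tag on that path)
    best = [{t: (start[t] * emit[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] * trans[p][t]) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score * emit[t].get(word, 0.0), prev)
        best.append(column)
    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model (all numbers are invented for illustration).
TAGS = ["D", "N", "V"]
START = {"D": 0.6, "N": 0.2, "V": 0.2}
TRANS = {
    "D": {"D": 0.05, "N": 0.9, "V": 0.05},
    "N": {"D": 0.1, "N": 0.2, "V": 0.7},
    "V": {"D": 0.5, "N": 0.3, "V": 0.2},
}
EMIT = {
    "D": {"el": 1.0},
    "N": {"mandado": 1.0},
    "V": {"mandado": 0.5, "viene": 0.5},
}

tagged = viterbi(["el", "mandado", "viene"], TAGS, START, TRANS, EMIT)
```

In this toy model "mandado" could be a noun or a verb form, and the transition from determiner to noun makes the noun reading win. A practical tagger would work with log probabilities and smoothed estimates.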
Accuracy
C0: original tools for standard Modern Spanish (baseline)
C1-hmm: expanded dict. + modules + hmm trained tagger (60,000-token gold standard corpus)
C1-relax: expanded dict. + modules + relax trained tagger (60,000-token gold standard corpus)

                Lemma   PoS-1   PoS-2
    C0          72.4    70.9    77.4
    C1-hmm      95.8    90.1    95.3
    C1-relax    95.8    92.6    95.7
    SS          99.1    94      97.6

→ PoS-1: whole label. E.g. viene VMIP3S0
→ PoS-2: word class only, the first character of the label. E.g. viene V (from VMIP3S0)

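The difference between the two PoS measures can be sketched as follows. The function name and the tag sequences are illustrative assumptions; the figures in the table above come from the gold standard evaluation, not from this code.

```python
def accuracy(gold, predicted, key=lambda tag: tag):
    """Percentage of positions where key(gold tag) == key(predicted tag)."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    hits = sum(key(g) == key(p) for g, p in zip(gold, predicted))
    return 100.0 * hits / len(gold)


# Invented example: four tokens, two of them tagged wrongly.
gold = ["VMIP3S0", "DA0MS0", "NCMS000", "SPS00"]
pred = ["VMIP3S0", "DA0MS0", "NCFS000", "CS"]

pos1 = accuracy(gold, pred)                      # whole label must match
pos2 = accuracy(gold, pred, key=lambda t: t[0])  # only the word class must match
```

In this example `pos1` is 50.0 and `pos2` is 75.0: the third token has the wrong gender but the right word class, so it counts as correct only under PoS-2.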