Natural Language Processing for historical language varieties
Cristina Sánchez Marco
Gjøvik University College
MTL lectures, April 3 2013

NLP and its applications
Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
Applications: question answering, sentiment analysis, machine translation, information extraction, ...
An example: information extraction
→ Email:
  Subject: curriculum meeting
  Date: January 15, 2012
  To: Dan
  Hi Dan, we've now scheduled the curriculum meeting. It will be in Gates 159 tomorrow from 10:00-11:30. -Chris
→ Create new Calendar entry:
  Event: Curriculum meeting
  Date: Jan-16-2012
  Start: 10:00am
  End: 11:30am
  Where: Gates 159
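
The email-to-calendar example can be sketched with a few regular expressions. This is only a minimal illustration for this one message, not a general information extraction system; the function name `extract_event` and the patterns are assumptions made for the sketch. Times are returned as written in the email, without the am/pm suffix.

```python
import re
from datetime import datetime, timedelta

EMAIL = """Subject: curriculum meeting
Date: January 15, 2012
To: Dan

Hi Dan, we've now scheduled the curriculum meeting.
It will be in Gates 159 tomorrow from 10:00-11:30.
-Chris"""


def extract_event(email: str) -> dict:
    """Pull a calendar entry out of one email (hypothetical helper)."""
    subject = re.search(r"^Subject:\s*(.+)$", email, re.M).group(1).strip()
    sent_text = re.search(r"^Date:\s*(.+)$", email, re.M).group(1).strip()
    sent = datetime.strptime(sent_text, "%B %d, %Y")

    # Everything after the first blank line is the message body.
    body = email.split("\n\n", 1)[1]

    # Resolve the relative date "tomorrow" against the sending date.
    day = sent + timedelta(days=1) if re.search(r"\btomorrow\b", body) else sent

    start, end = re.search(r"from (\d{1,2}:\d{2})-(\d{1,2}:\d{2})", body).groups()
    # Assumed location pattern: "in" + capitalised name + room number.
    where = re.search(r"\bin ([A-Z]\w+ \d+)", body).group(1)

    return {
        "event": subject.capitalize(),
        "date": day.strftime("%b-%d-%Y"),
        "start": start,
        "end": end,
        "where": where,
    }


print(extract_event(EMAIL))
```

The interesting step is the relative date: "tomorrow" only becomes Jan-16-2012 once it is combined with the Date header.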

Morphological analysis and tagging
Morphological analysis and tagging is a basic NLP task useful in many applications. It involves two steps:
Part-of-speech (POS) tagging: the process of marking up a word in a text as corresponding to a particular part of speech. 9 parts of speech: noun, verb, adjective, adverb, preposition, determiner, conjunction, pronoun, interjection. For example, the word "the" is a determiner.
Lemmatisation: the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, the word "better" has "good" as its lemma.
→ Words often have more than one POS ("ambiguity"). The POS tagging problem is to determine the POS tag for a particular instance of a word in a sentence.
→ Uses: text-to-speech (how do we pronounce "lead"?), spelling checkers, as input to or to speed up a full parser, OCR scanning, ...
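
A toy lexicon makes the two steps and the ambiguity problem concrete. The entries below are hand-written for illustration only (the names `LEXICON`, `analyse` and `is_ambiguous` are assumptions of this sketch); a real analyser draws on a dictionary with hundreds of thousands of forms.

```python
# Each word form maps to its possible (lemma, part of speech) analyses.
LEXICON = {
    "the": [("the", "determiner")],
    "better": [("good", "adjective"), ("well", "adverb")],
    "lead": [("lead", "noun"), ("lead", "verb")],
}


def analyse(word: str) -> list:
    """Return every (lemma, POS) analysis listed for a word form."""
    return LEXICON.get(word.lower(), [])


def is_ambiguous(word: str) -> bool:
    """A form is POS-ambiguous if its analyses disagree on the POS."""
    return len({pos for _, pos in analyse(word)}) > 1


for form in ("the", "better", "lead"):
    print(form, analyse(form), "ambiguous" if is_ambiguous(form) else "unambiguous")
```

The lexicon lookup only lists the candidates; choosing the right one for a given sentence is the tagger's job.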

An example from the 12th century

An example
→ Electronic editions prepared by the Hispanic Seminary of Medieval Studies
[fol. 32v] {CB1.
Ca delo q<ue> mas amaua yal viene el mandado
Dozi[en]tos cauall<er>os mando exir p<r><<i>>uado
Q<ue> Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bie<n> sabe q<ue> albarfanez t<r><<a>>he todo Recabdo

An example: Paleographic symbols
[fol. 32v] {CB1.
Ca delo q<ue> mas amaua yal viene el mandado
Dozi[en]tos cauall<er>os mando exir p<r><i>uado
Q<ue> Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bie<n> sabe q<ue> albarfanez t<r><a>he todo Recabdo
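
Before tagging, the paleographic symbols have to be resolved into plain word forms. A minimal sketch follows, assuming that angle brackets and square brackets simply enclose letters to be kept (so q<ue> becomes que), which is how the slides resolve them in the later versions of this passage. The function name `expand_markup` is hypothetical, and folio and column markers such as [fol. 32v] would need separate handling.

```python
import re


def expand_markup(line: str) -> str:
    """Keep the letters inside <...>, <<...>> and [...] and drop the brackets."""
    line = re.sub(r"<+([^<>]*)>+", r"\1", line)     # q<ue> -> que, <<i>> -> i
    line = re.sub(r"\[([^\[\]]*)\]", r"\1", line)   # Dozi[en]tos -> Dozientos
    return line


for verse in (
    "Ca delo q<ue> mas amaua yal viene el mandado",
    "Dozi[en]tos cauall<er>os mando exir p<r><<i>>uado",
):
    print(expand_markup(verse))
```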

An example: Spelling
[fol. 32v] {CB1.
Ca delo (de lo) que mas amaua (amaba) yal (ya le) viene el mandado
Dozientos (Doscientos) caualleros (caballeros) mando (mandó) exir priuado (privado)
Que Reçiban (reciban) a myanaya & alas (a las) duenas (dueñas) fijas dalgo (hidalgas)
El (Él) sedie (sería) en valençia (valencia) curiando (curando) & guardando
Ca bien sabe que albarfanez trahe (trae) todo Recabdo (Recaudo)

An example: Capital letters and punctuation
[fol. 32v] {CB1.
Ca delo que mas amaua yal viene el mandado (,)
Dozientos caualleros mando exir priuado (,)
Que Reçiban a myanaya (Myanaya) & alas duenas fijas dalgo (,)
El sedie en valençia (Valencia) curiando & guardando (,)
Ca bien sabe que albarfanez (Albarfanez) trahe todo Recabdo (recaudo)(.)

An example: Word order
[fol. 32v] {CB1.
Ca delo que mas amaua yal viene el mandado (Ca yal viene el mandado delo que mas amaua)
Dozientos caualleros mando exir priuado
Que Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bien sabe que albarfanez trahe todo Recabdo

Challenge and solution
The challenge is to enrich the text with lemma and POS tag. Four possible solutions:
1. Manually
2. Build a tagger from scratch
3. Use an existing tool for a modern language variety
   Advantages: resource saving
   Disadvantages: unacceptable accuracy, manual correction
4. Adapt an existing tool for a modern language variety
   Advantages: reusable, sustainable, relatively easy to adapt, resource saving, extensible to other language varieties
   Disadvantages: Is it easy and resource-saving to do this? Is it easy to adapt to other language varieties?

Our specific case study and proposal
Solution 4: Adapt the tool
Tool: FreeLing http://nlp.lsi.upc.edu/freeling
Language: (standard) Modern Spanish
Specific advantages of adapting FreeLing:
open-source
well documented and actively maintained
modular, relatively easy to adapt

FreeLing processing pipeline
raw text → tokenizer → ANALYZER (dictionary, morphological analysis, affixation) → TAGGER (probabilities) → tagged corpus

Method
Using the existing standard Modern Spanish tool as a basis to create an Old Spanish analyzer:
Expansion of the dictionary
Retraining of the tagger
Modification of other modules: tokenization, affixation

Data
1. Old Spanish Corpus
2. Gold Standard Corpus
3. (Standard) Modern Spanish Corpus

1. Old Spanish Corpus
Electronic editions by the Hispanic Seminary of Medieval Studies
Critical editions of the original manuscripts
12th to 16th century Spanish
Representative corpus: more than 20 million tokens, 470 thousand types, variety of genres (fiction and non-fiction)
→ To expand the dictionary

2. Gold Standard Corpus
60,000 tokens from the Old Spanish Corpus (50%) and a Modern Spanish tagged corpus (50%)
It mirrors the Old Spanish Corpus in size and text-type distribution
→ To retrain the tagger and carry out the evaluation and error analysis

3. Standard Modern Spanish Corpus
Corpus LexEsp (Sebastián et al. 2000)
from 1975 to 1995
more than 5 million words
variety of genres
→ Baseline performance for the tagger

Dictionary expansion: Data
556,210 Standard Spanish words (669,121 lemma-tag pairs) + 58,435 Old Spanish words = 614,000 word forms (744,160 lemma-tag pairs)
Distribution of words added to the dictionary:
  Verbs          83.4%
  Nouns          26.8%
  Adjectives      9.4%
  Prepositions    2.1%
  Numbers         1.7%
  Proper names    1.4%
  Pronouns        1.3%
  Determiners     1%
  Adverbs         0.7%
  Conjunctions    0.5%
  Interjections   0.3%
  Punctuation     0.01%

Dictionary expansion: Method
Mapping rules:
Substring rules (54 sequences of characters): 42% of the words added
  Old    Modern   Example
  euo    evo      nueuo → nuevo 'new'
  uio    vio      uio → vio 'saw'
  -f     -ube     nuf → nube 'cloud'
  sp-    esp-     spera → espera 'wait'
Word rules: 39% of the words added
  consul → cónsul 'consul'
  catholica → católica 'catholic'
VARD 2 (69 spelling rules): 19% of the words added
  Old    Modern
  j      í
  nn     ñ
  rr     r
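
One plausible reading of the substring rules is: rewrite the old spelling, and if the result is a known modern form, add the old form to the dictionary with the modern form's analyses. The sketch below follows that reading with three of the rules from the slide. The toy modern dictionary, its coarse tags and the function names are assumptions of the sketch, not FreeLing's dictionary format or the actual procedure used.

```python
# (old substring, modern substring), applied anywhere in the word.
SUBSTRING_RULES = [("euo", "evo"), ("uio", "vio")]
# (old prefix, modern prefix), applied only at the start of the word.
PREFIX_RULES = [("sp", "esp")]

# Toy modern dictionary: form -> list of (lemma, word class). Illustrative only.
MODERN_DICT = {
    "nuevo": [("nuevo", "A")],
    "vio": [("ver", "V")],
    "espera": [("esperar", "V"), ("espera", "N")],
}


def candidates(old_form: str) -> list:
    """Modern spellings reachable from an old form by one rule."""
    forms = set()
    for old, new in SUBSTRING_RULES:
        if old in old_form:
            forms.add(old_form.replace(old, new))
    for old, new in PREFIX_RULES:
        if old_form.startswith(old):
            forms.add(new + old_form[len(old):])
    return sorted(forms)


def expand_dictionary(old_forms, modern_dict) -> dict:
    """Give each old form the analyses of the first modern form it maps to."""
    added = {}
    for form in old_forms:
        for candidate in candidates(form):
            if candidate in modern_dict:
                added[form] = modern_dict[candidate]
                break
    return added


print(expand_dictionary(["nueuo", "uio", "spera", "exir"], MODERN_DICT))
```

Forms that no rule maps to a known modern word (here exir) stay out of the dictionary and would fall to the word rules or to manual treatment.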

Retraining of the tagger
Use of the Gold Standard Corpus
Two taggers:
Hybrid (relax), integrating statistical and hand-coded grammatical rules
Hidden Markov Model (hmm), trigram Markovian tagger based on TnT (Brants, 2000)
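
To show what an HMM tagger computes, here is a deliberately simplified sketch: a bigram model decoded with the Viterbi algorithm, not the trigram model of TnT or FreeLing's hmm tagger. All probabilities are made up for the example; in practice they would be estimated from the Gold Standard Corpus.

```python
import math

TAGS = ["D", "N", "V"]
START_P = {"D": 0.6, "N": 0.2, "V": 0.2}                 # P(first tag)
TRANS_P = {("D", "N"): 0.9, ("D", "V"): 0.1,             # P(tag | previous tag)
           ("N", "V"): 0.7, ("N", "N"): 0.2, ("N", "D"): 0.1,
           ("V", "D"): 0.5, ("V", "N"): 0.3, ("V", "V"): 0.2}
EMIT_P = {("D", "el"): 1.0,                              # P(word | tag)
          ("N", "mandado"): 0.5, ("V", "mandado"): 0.5,
          ("V", "viene"): 0.5}


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Most probable tag sequence under a bigram HMM, computed in log space."""
    if not words:
        return []

    def lp(table, key):
        # Unseen events get a small floor probability instead of zero.
        return math.log(table.get(key, floor))

    # best[i][t] = (score of the best path ending in tag t at word i, previous tag)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, words[0])), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + lp(trans_p, (p, t)), p) for p in tags)
            column[t] = (score + lp(emit_p, (t, word)), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["el", "mandado", "viene"], TAGS, START_P, TRANS_P, EMIT_P))
```

The form mandado is equally likely as noun or verb in the toy emission table; the context (a determiner before it) is what resolves the ambiguity.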

Accuracy
C0: original tools for standard Modern Spanish (baseline)
C1-hmm: expanded dictionary + modules + hmm trained tagger (60,000-token gold standard corpus)
C1-relax: expanded dictionary + modules + relax trained tagger (60,000-token gold standard corpus)
            Lemma   PoS-1   PoS-2
  C0        72.4    70.9    77.4
  C1-hmm    95.8    90.1    95.3
  C1-relax  95.8    92.6    95.7
  SS        99.1    94      97.6
→ PoS-1: whole label. E.g. viene VMIP3S0
→ PoS-2: word class. E.g. viene V (first character of VMIP3S0)
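
The difference between the two PoS measures can be sketched as follows, assuming the word class is the first character of the label, as the viene example suggests. The gold and predicted tag sequences are invented for illustration, and the names `accuracy` and `word_class` are hypothetical.

```python
def word_class(tag: str) -> str:
    """PoS-2 view of a label: only its first character."""
    return tag[:1]


def accuracy(gold, predicted, key=lambda tag: tag) -> float:
    """Percentage of positions where gold and predicted agree under `key`."""
    hits = sum(key(g) == key(p) for g, p in zip(gold, predicted))
    return 100 * hits / len(gold)


# Invented four-token example; only VMIP3S0 comes from the slide.
gold = ["VMIP3S0", "DA0MS0", "NCMS000", "SPS00"]
predicted = ["VMIP3P0", "DA0MS0", "NCMS000", "CS"]

print("PoS-1:", accuracy(gold, predicted))                   # whole label
print("PoS-2:", accuracy(gold, predicted, key=word_class))   # word class only
```

A tagger that picks the right word class but the wrong morphological features (the first token here) counts as correct under PoS-2 and as an error under PoS-1, which is why the PoS-2 column is always at least as high.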
