
Shallow NLP. Three Early Stages: Pre-processing, Tokenization & Morphological Analysis

  1. CoLi USb: Resources for computational linguists
     Seminar: Shallow NLP
     Three Early Stages: Pre-processing, Tokenization & Morphological Analysis
     by Achmad Yani, CoLi, Saarland University
     Contents: 1 Stages in NLP Systems, 2 Pre-processing, 3 Tokenization, 4 Morphological Analysis, 5 Conclusions

  2. Stages in a Comprehensive NLP System
     [Diagram: stack of stages: Text Generation, KB Reasoning, P&D Analysis, Semantic Analysis, Syntactic Analysis, Morphological Analysis, Tokenization, Preprocessing, grouped under the labels "Linguistic Analysis Stages" and "Pre-Linguistic Analysis Stages".]
     Stage 1: Preprocessing
     Main task of the stage:
     • Filter unnecessary characters out of the text, such as:
       • extra whitespace
       • text subdivisions
       • special characters
       • SGML-type codes
     How?
     • Using lex or flex on Unix-based workstations

  3. Flex program for filtering out SGML markings
         /* Call this file StripSGML.lx, and then run:
              flex -8 -CF StripSGML.lx; gcc -o StripSGML lex.yy.c -lfl -s
            To pass this simple filter over a text file called toto, run:
              StripSGML < toto */
         %%
         "<"[^\n<>]+">"    ;
         .                 ECHO;
         [\n]              ECHO;
         %%
     Deletes SGML markings from an input file and echoes everything else.
     Flex program for dehyphenating a text
         /* Call this file dehyphen.lx, and then run:
              flex -8 -CF dehyphen.lx; gcc -o dehyphen lex.yy.c -lfl -s
            To pass this simple filter over a text file called toto, run:
              dehyphen < toto */
         %%
         [a-z]-[ \t]*\n[ \t]*    { printf("%c", yytext[0]); }
         %%
     Matches a lower-case letter, followed by a hyphen, then any number of tabs or spaces, followed by a newline character and more tabs or spaces. Only the letter is printed, so the two halves of the word are joined.
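The two flex filters above can be approximated in Python for readers without a lex/flex toolchain. This is a minimal sketch, not the author's program: the function names `strip_sgml` and `dehyphenate` are made up for illustration, and the character classes are assumed to include both spaces and tabs, as the slide's description states.

```python
import re

# Mirrors the first filter: anything of the form <...> with no newline or
# nested angle bracket inside is deleted; all other text is kept.
SGML_TAG = re.compile(r"<[^\n<>]+>")

# Mirrors the second filter: a lower-case letter, a hyphen, optional
# spaces/tabs, a newline, then optional spaces/tabs on the next line.
LINE_END_HYPHEN = re.compile(r"([a-z])-[ \t]*\n[ \t]*")


def strip_sgml(text):
    """Remove SGML-style markings from the text."""
    return SGML_TAG.sub("", text)


def dehyphenate(text):
    """Join words that were hyphenated across a line break."""
    return LINE_END_HYPHEN.sub(r"\1", text)


if __name__ == "__main__":
    print(strip_sgml("<p>Hello <b>world</b></p>"))  # Hello world
    print(dehyphenate("tokeni-\n  zation"))         # tokenization
```

Like the flex version, the dehyphenation rule only fires at a line break, so an ordinary hyphenated word such as "well-known" inside a line is left alone.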

  4. Stage 2: Tokenization
     Main task of the stage:
     • Isolation of word-like units from a text / recognition of sentence boundaries
     • Each element of the text is recognized by:
       • a certain syntactic class, for example: dog → SINGULAR-NOUN
       • sentence boundaries
     Non-Trivial Tokenization Cases
     • Recognizing tokens that contain ambiguous punctuation:
       • Numbers, alphanumeric references, e.g. T-1-AB.1.2
       • Dates, e.g. 05/07/07
       • Acronyms, e.g. AT&T
       • Punctuation: !, ?, .
       • Abbreviations, e.g. m.p.h.

  5. Sentence Boundary Identification Approaches
     Tokenization approaches:
     • Manual approach: the primitive way, hand-writing a regular-expression grammar
     • Maximum entropy approach: the system learns to classify each occurrence of punctuation as a sentence boundary or not
     MANUAL APPROACH - REs for Ambiguous Separators in Numbers
     • English numbers, e.g. 123,456.78
         ([0-9]+[,])*[0-9]+([.][0-9]+)?
     • French numbers, e.g. 123 456,78
         ([0-9]+[ ])*[0-9]+([,][0-9]+)?
     • Fractions, dates
         [0-9]+(\/[0-9]+)+
     • Percent
         ([+\-])?[0-9]+(\.)?[0-9]*%
     • Decimal numbers, e.g. 1,234.56
         ([0-9]+,?)+(\.[0-9]+|[0-9]+)*
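The number patterns above can be tried out directly with Python's `re` module. The sketch below is illustrative only: the dictionary keys and the function `classify_number` are hypothetical names, whole-token matching via `re.fullmatch` is a choice of this sketch, and the final digit group of the English and French patterns is written as `[0-9]+` (with an optional decimal part in both) so that tokens such as 123,456.78 match in full.

```python
import re

# Patterns adapted from the slide; see the assumptions stated in the lead-in.
NUMBER_PATTERNS = {
    "english_number": r"([0-9]+[,])*[0-9]+([.][0-9]+)?",
    "french_number": r"([0-9]+[ ])*[0-9]+([,][0-9]+)?",
    "fraction_or_date": r"[0-9]+(/[0-9]+)+",
    "percent": r"([+\-])?[0-9]+(\.)?[0-9]*%",
}


def classify_number(token):
    """Return the names of all patterns that match the whole token."""
    return [
        name
        for name, pattern in NUMBER_PATTERNS.items()
        if re.fullmatch(pattern, token)
    ]


if __name__ == "__main__":
    for token in ["123,456.78", "123 456,78", "05/07/07", "-12.5%"]:
        print(token, classify_number(token))
```

A plain integer such as 1234 matches both the English and the French pattern, which is one small illustration of why hand-written rules for separators need care.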

  6. MANUAL APPROACH - REs for Abbreviations
     Three classes of abbreviations:
     1. A single capital followed by a period, e.g. A., B., C.
          [A-Za-z]\.
     2. A sequence of letter-period-letter-period's, e.g. U.S., i.e., m.p.h.
          [A-Za-z]\.([A-Za-z0-9]\.)+
     3. A capital letter followed by a sequence of consonants followed by a period, e.g. Mr., St., Assn.
          [A-Z][bcdfghj-np-tvxz]+\.
     MANUAL APPROACH - System Performance
     • Using the Brown Corpus
         Regular expression              Correct   Errors   Full stop
         [A-Za-z]\.                         1327       52          14
         [A-Za-z]\.([A-Za-z0-9]\.)+          570        0          66
         [A-Z][bcdfghj-np-tvxz]+\.          1938       44          26
         Totals                             3835       96         106
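The three abbreviation classes can be checked against sample tokens with a few lines of Python. This is a sketch under stated assumptions: the class labels and the function `abbreviation_class` are invented for illustration, the patterns are tried in the order listed on the slide, and a token must match a pattern in full.

```python
import re

# The three regular expressions from the slide, tried in order.
ABBREVIATION_CLASSES = [
    ("single letter", r"[A-Za-z]\."),
    ("letter-period sequence", r"[A-Za-z]\.([A-Za-z0-9]\.)+"),
    ("capital plus consonants", r"[A-Z][bcdfghj-np-tvxz]+\."),
]


def abbreviation_class(token):
    """Return the label of the first class whose pattern matches, else None."""
    for label, pattern in ABBREVIATION_CLASSES:
        if re.fullmatch(pattern, token):
            return label
    return None


if __name__ == "__main__":
    for token in ["A.", "U.S.", "m.p.h.", "Mr.", "Assn.", "Corp."]:
        print(token, abbreviation_class(token))
```

A token such as Corp. matches none of the three classes, because its second letter is a vowel; cases like this are what exception lists have to cover.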

  7. MANUAL APPROACH - Problems
     • Exception lists will never be exhaustive; they always need to be updated.
     • Multiple rules may interact badly, since punctuation marks do not always follow the logic of the formula, e.g.:
       • As written: The president lives in Washington D.C.
       • Logically, it should be: The president lives in Washington D.C..
     Maximum Entropy (ME) Approach
     The idea:
     • Scan the text for sequences of characters separated by whitespace (tokens) that contain one of the symbols ., ?, or !
     • These tokens are potential sentence boundaries
     • Contextual information about each of them is used to decide whether it is an actual boundary

  8. ME APPROACH - Terminology
     • Candidate: the token containing the symbol which marks a putative sentence boundary
     • Prefix: the portion of the Candidate preceding the potential sentence boundary
     • Suffix: the portion of the Candidate following the potential sentence boundary
     ME APPROACH - Contextual Templates
     • The Prefix
     • The Suffix
     • Whether the Prefix or Suffix is on the list of induced abbreviations (from training data)
     • The word to the left of the Candidate
     • The word to the right of the Candidate
     • Whether the word to the left or right of the Candidate is on the list of induced abbreviations

  9. ME APPROACH - Example 1
     • Sentence: ANLP Corp. chairman Dr. Smith resigned.
     • The exact information for the potential sentence boundary marked by . in Corp. would be:
       PreviousWord=ANLP, FollowingWord=chairman, Prefix=Corp, Suffix=NULL, PrefixFeature=InducedAbbreviation
     ME APPROACH - Joint Probability
     • For each potential boundary token (., ?, !), estimate the joint probability p of the boundary decision b and the surrounding context c:
       $p(b, c) = \prod_{j=1}^{k} \alpha_j^{f_j(b, c)}$, where $b \in \{\text{no}, \text{yes}\}$
     • The $\alpha_j$ are the unknown parameters of the model; each $\alpha_j$ corresponds to a feature $f_j$.
     • The probability of seeing an actual sentence boundary in the context c is given by $p(\text{yes}, c)$.
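The contextual information listed in Example 1 can be reproduced with a short Python sketch. Several things here are assumptions rather than part of the slides: tokens are obtained by splitting on whitespace, the set of induced abbreviations is written by hand instead of being learned from training data, `None` stands in for NULL, and the function names `find_candidates` and `context_of` are hypothetical.

```python
BOUNDARY_MARKS = ".?!"


def find_candidates(tokens):
    """Yield (token_index, char_index) for every potential boundary mark."""
    for i, token in enumerate(tokens):
        for j, ch in enumerate(token):
            if ch in BOUNDARY_MARKS:
                yield i, j


def context_of(tokens, i, j, induced_abbreviations):
    """Build the contextual information for the mark at tokens[i][j]."""
    token = tokens[i]
    prefix, suffix = token[:j], token[j + 1:]
    is_abbreviation = prefix in induced_abbreviations
    return {
        "PreviousWord": tokens[i - 1] if i > 0 else None,
        "FollowingWord": tokens[i + 1] if i + 1 < len(tokens) else None,
        "Prefix": prefix or None,
        "Suffix": suffix or None,  # None plays the role of NULL
        "PrefixFeature": "InducedAbbreviation" if is_abbreviation else None,
    }


if __name__ == "__main__":
    tokens = "ANLP Corp. chairman Dr. Smith resigned.".split()
    abbreviations = {"Corp", "Dr"}  # hand-written stand-in for the induced list
    for i, j in find_candidates(tokens):
        print(tokens[i], context_of(tokens, i, j, abbreviations))
```

The sentence yields three candidates (Corp., Dr. and resigned.), and the first one gives the values shown in the example.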

  10. ME APPROACH - Example of a Useful Feature
      $f_j(b, c) = \begin{cases} 1 & \text{if } \mathrm{Prefix}(c) = \text{Mr and } b = \text{no} \\ 0 & \text{otherwise} \end{cases}$
      This feature allows the model to discover that the period at the end of the word Mr. seldom occurs as a sentence boundary.
      ME APPROACH - Decision Rule
      • A potential sentence boundary is an actual sentence boundary if and only if $p(\text{yes} \mid c) > 0.5$, where:
        $p(\text{yes} \mid c) = \frac{p(\text{yes}, c)}{p(\text{yes}, c) + p(\text{no}, c)}$
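The product form of the joint probability and the decision rule can be put together in a small Python sketch. It is a toy illustration only: the parameter values in `ALPHAS` are invented rather than estimated from data, the second feature `decision_is_yes` is a hypothetical addition so that the example has more than one factor, and a context is represented as a plain dictionary.

```python
import math


def prefix_is_mr_and_no(b, context):
    """The feature from the slide: Prefix(c) = Mr and b = no."""
    return 1 if context.get("Prefix") == "Mr" and b == "no" else 0


def decision_is_yes(b, context):
    """Hypothetical second feature: fires whenever b = yes."""
    return 1 if b == "yes" else 0


FEATURES = [prefix_is_mr_and_no, decision_is_yes]
ALPHAS = [4.0, 1.5]  # made-up parameter values, one per feature


def joint_probability(b, context, features=FEATURES, alphas=ALPHAS):
    """p(b, c) as the product over j of alpha_j ** f_j(b, c)."""
    return math.prod(a ** f(b, context) for f, a in zip(features, alphas))


def p_yes_given_context(context):
    """p(yes | c) = p(yes, c) / (p(yes, c) + p(no, c))."""
    p_yes = joint_probability("yes", context)
    p_no = joint_probability("no", context)
    return p_yes / (p_yes + p_no)


def is_sentence_boundary(context):
    """Decision rule: an actual boundary if and only if p(yes | c) > 0.5."""
    return p_yes_given_context(context) > 0.5


if __name__ == "__main__":
    print(is_sentence_boundary({"Prefix": "Mr"}))        # False
    print(is_sentence_boundary({"Prefix": "resigned"}))  # True
```

With these toy values the Mr feature multiplies p(no, c) by 4, which pushes p(yes | c) below 0.5 for candidates whose prefix is Mr.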

  11. ME APPROACH - System Performance
          Corpus                           WSJ     Brown
          Sentences                      20478     51672
          Candidate punctuation marks    32173     61282
          Accuracy                       98.8%     97.9%
          False positives                  201       750
          False negatives                  171       506
      Stage 3: Morphological Analysis
      Main task of the stage:
      • Analysing the meaningful components of words
      • Non-trivial case:
        • Word division
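The accuracy row can be related to the error counts with a little arithmetic. The sketch below assumes, and this is an assumption not stated on the slide, that accuracy is the share of candidate punctuation marks that are neither false positives nor false negatives; the names `RESULTS` and `accuracy` are illustrative.

```python
# Counts copied from the table above.
RESULTS = {
    "WSJ": {"candidates": 32173, "false_positives": 201, "false_negatives": 171},
    "Brown": {"candidates": 61282, "false_positives": 750, "false_negatives": 506},
}


def accuracy(row):
    """Share of candidate marks classified correctly (assumed definition)."""
    errors = row["false_positives"] + row["false_negatives"]
    return 1 - errors / row["candidates"]


if __name__ == "__main__":
    for corpus, row in RESULTS.items():
        print(f"{corpus}: {100 * accuracy(row):.2f}%")
```

Under that assumption the counts give roughly 98.84% for WSJ and 97.95% for Brown, which is in line with the reported 98.8% and 97.9%.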

  12. Word Division
      • English: It's, he's, that's, there's, who's, she's
      • French: L'addition, m'appelle, donne-le, va-t-il, etc.
      • Bahasa (Indonesian): Pertanggungjawaban, kemerdekaan
      Morphology
      • Hebrew (transliterated): ukshepagashtihu
      • English translation: and when I met him (masculine)

  13. Morphological Analysis Tools: PC-Kimmo
      • Two-level processor for morphological analysis
      • The program is designed to generate and parse words using a two-level model of word structure, represented as a correspondence between:
        1. its lexical level form and
        2. its surface level form.
      PC-KIMMO FILES (provided by the user)
      1. A rules file: specifies the alphabet and the phonological (spelling) rules
      2. A lexicon file: lists lexical items (words and morphemes) and their glosses, and encodes morphotactic constraints

  14. Main Components of PC-Kimmo
      [Diagram: the Rules and the Lexicon feed two functions. Recognizer: Surface Form → Lexical Form. Generator: Lexical Form → Surface Form.]
      Example:
      • Word form: dying
      • Lexical Representation: d i e + i n g
      • Surface Representation: d y 0 0 i n g
      + indicates a morpheme boundary; 0 indicates a null element.
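The alignment in the dying example can be written out mechanically. The following Python sketch pairs each lexical character with its surface character and recovers the word form by dropping the null elements; the function names are hypothetical and the representations are passed as plain strings without spaces.

```python
NULL = "0"  # the null element used in the example


def correspondence_pairs(lexical, surface):
    """Pair lexical and surface characters one to one, as 'lexical:surface'."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface forms must have the same length")
    return [f"{lex}:{srf}" for lex, srf in zip(lexical, surface)]


def word_form(surface):
    """Drop null elements from a surface representation."""
    return "".join(ch for ch in surface if ch != NULL)


if __name__ == "__main__":
    print(correspondence_pairs("die+ing", "dy00ing"))
    print(word_form("dy00ing"))  # dying
```

The resulting pairs are d:d, i:y, e:0, +:0, i:i, n:n and g:g.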

  15. Example (cont.)
      • Rules must be written to account for the correspondences: d:d, i:y, e:0, +:0, i:i, n:n and g:g
      • The phonological rule looks roughly like this:
          i:y => ___ e:0 +:0 i
      • It is translated into a finite state table like this:
                i  e  +  i  @
                y  0  0  i  @
            1:  2  1  1  1  1
            2:  0  3  0  0  0
            3:  0  0  4  0  0
            4:  0  0  0  1  0
      Two-Level Rules Notation
      • A rule is made up of three parts:
        • the correspondence
        • the rule operator
        • the environment or context
      • Example:
        • Lexical Representation (LR): t a t i
        • Surface Representation (SR): t a c i
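The finite state table for the i:y rule can be stepped through by hand or with a few lines of Python. This sketch rests on two assumptions about how the table is read: state 0 means the sequence is rejected, and the @:@ column stands for any pair that has no column of its own (here d:d, n:n and g:g). The names `COLUMNS`, `TABLE` and `run_table` are illustrative and do not come from PC-Kimmo itself.

```python
# Column headers of the table, as lexical:surface pairs.
COLUMNS = ["i:y", "e:0", "+:0", "i:i", "@:@"]

# One row per state; each entry is the next state for the matching column.
TABLE = {
    1: [2, 1, 1, 1, 1],
    2: [0, 3, 0, 0, 0],
    3: [0, 0, 4, 0, 0],
    4: [0, 0, 0, 1, 0],
}


def run_table(pairs, start=1):
    """Follow the table over a sequence of pairs; return the last state.

    A return value of 0 means the sequence was rejected.
    """
    state = start
    for pair in pairs:
        column = COLUMNS.index(pair) if pair in COLUMNS else COLUMNS.index("@:@")
        state = TABLE[state][column]
        if state == 0:
            return 0
    return state


if __name__ == "__main__":
    dying = ["d:d", "i:y", "e:0", "+:0", "i:i", "n:n", "g:g"]
    print(run_table(dying))                  # 1
    print(run_table(["d:d", "i:y", "n:n"]))  # 0
```

For dying the states visited are 1, 2, 3, 4 and back to 1, whereas an i:y that is not followed by e:0 drops to state 0 straight away.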

  16. Part 1: Correspondence
      • Correspondence pair: lexical-character : surface-character
      • There must be an exact one-to-one correspondence between LR and SR
      • From the example, LR: t a t i and SR: t a c i, we have:
        • t:t, a:a, i:i (default correspondences)
        • t:c (special correspondence)
      Part 2: Rule Operator
      • Four types of operator:
        =>    the correspondence only (but not always) occurs in the environment
        <=    the correspondence always (but not only) occurs in the environment
        <=>   the correspondence always and only occurs in the environment
        /<=   the correspondence never occurs in the environment
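The meaning of the four operators can be made concrete by checking an aligned LR/SR pair against a rule. The Python sketch below is a simplified illustration and not PC-Kimmo's implementation: the environment "before a lexical i" is a hypothetical choice for the t:c correspondence (the slides do not state the environment), and all function names are invented.

```python
def aligned_pairs(lexical, surface):
    """Pair lexical and surface characters one to one."""
    if len(lexical) != len(surface):
        raise ValueError("LR and SR must have the same length")
    return list(zip(lexical, surface))


def before_lexical_i(pairs, k):
    """Hypothetical environment '___ i': the next lexical character is i."""
    return k + 1 < len(pairs) and pairs[k + 1][0] == "i"


def satisfies(operator, pairs, target, in_environment):
    """Check one aligned word against the rule 'target OPERATOR environment'."""
    positions = range(len(pairs))
    # only: every occurrence of the target pair sits in the environment
    only = all(in_environment(pairs, k) for k in positions if pairs[k] == target)
    # always: every target lexical character in the environment is realized
    # as the target pair
    always = all(
        pairs[k] == target
        for k in positions
        if pairs[k][0] == target[0] and in_environment(pairs, k)
    )
    # never: the target pair does not occur in the environment at all
    never = not any(
        pairs[k] == target and in_environment(pairs, k) for k in positions
    )
    return {"=>": only, "<=": always, "<=>": only and always, "/<=": never}[operator]


if __name__ == "__main__":
    target = ("t", "c")
    for surface in ["taci", "tati"]:
        pairs = aligned_pairs("tati", surface)
        checks = {
            op: satisfies(op, pairs, target, before_lexical_i)
            for op in ["=>", "<=", "<=>", "/<="]
        }
        print(surface, checks)
```

With this environment, the surface form taci satisfies =>, <= and <=> but violates /<=, while tati satisfies => (vacuously) and /<= but violates <= and <=>.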
