Maximum Entropy Tagging


  1. Maximum Entropy Tagging (for the Maximum Entropy method itself, refer to NPFL067 added slides 2018/9)

  2. The Task, Again
  • Recall:
    – tagging ~ morphological disambiguation
    – tagset V_T ⊂ (C_1, C_2, ..., C_n)
      • C_i - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
    – mapping w → {t ∈ V_T} exists
      • restriction of Morphological Analysis: A⁺ → 2^(L, C_1, C_2, ..., C_n), where A is the language alphabet, L is the set of lemmas (a toy example follows below)
    – extension to punctuation, sentence boundaries (treated as words)
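To make the notation concrete, here is a toy Python illustration (all values are invented for the sketch, not the lecture's data) of tags as tuples of morphological categories and of morphological analysis as a mapping from word forms to candidate readings:

# A toy illustration of the setup above: a tag is a tuple of morphological
# categories (C_1, ..., C_n), and morphological analysis maps a word form to
# its set of candidate (lemma, tag) readings. Values are invented.
morph_analysis = {
    "books": {("book", ("NOUN", "PL")), ("book", ("VERB", "SG"))},
    ".":     {(".", ("PUNCT", "-"))},   # punctuation treated as a word
}
# Tagging = choosing one reading per token; context resolves the ambiguity.
print(morph_analysis["books"])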

  3. Maximum Entropy Tagging Model
  • General
    p(y,x) = (1/Z) e^(Σ_i=1..N λ_i f_i(y,x))
    Task: find λ_i satisfying the model and constraints
    • E_p(f_i(y,x)) = d_i, where
    • d_i = E'(f_i(y,x)) (empirical expectation, i.e. feature frequency)
  • Tagging
    p(t,x) = (1/Z) e^(Σ_i=1..N λ_i f_i(t,x)) (λ_0 might be extra: cf. μ in AR) (a worked sketch follows below)
    • t ∈ Tagset,
    • x ~ context (words and tags alike; say, up to three positions R/L)
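The exponential form above fits in a few lines of Python. This is a minimal sketch with invented feature functions and weights (not trained values), normalizing over the tagset to obtain the distribution used to choose a tag for a given context:

import math

# p(t, x) ∝ exp(sum_i lambda_i * f_i(t, x)); Z sums the scores over the tagset.
def maxent_prob(t, x, features, lambdas, tagset):
    def score(tag):
        return sum(lam * f(tag, x) for lam, f in zip(lambdas, features))
    Z = sum(math.exp(score(tag)) for tag in tagset)   # normalizer Z
    return math.exp(score(t)) / Z

# Two toy binary features and weights, purely illustrative:
features = [
    lambda t, x: 1.0 if x["w_i"] == "an" and t == "DT" else 0.0,
    lambda t, x: 1.0 if x["t_i-1"] == "DT" and t == "NN" else 0.0,
]
lambdas = [1.5, 0.7]
tagset = ["DT", "NN", "VB"]
print(maxent_prob("DT", {"w_i": "an", "t_i-1": "IN"}, features, lambdas, tagset))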

  4. Features for Tagging
  • Context definition
    – two words back and ahead, two tags back, current word:
      • x_i = (w_i-2, t_i-2, w_i-1, t_i-1, w_i, w_i+1, w_i+2)
    – features may ask for any information from this window
      • e.g.:
        – previous tag is DT
        – previous two tags are PRP$ and MD, and the following word is “be”
        – current word is “an”
        – suffix of current word is “ing”
      • do not forget: a feature also contains t_i, the current tag:
        – feature #45: suffix of current word is “ing” & the tag is VBG ⇒ f_45 = 1 (see the sketch below)
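A binary feature such as #45 is just a function of the proposed tag t_i and the context window; a sketch, with an assumed dictionary representation of x_i:

# Feature #45 from the slide: current word ends in "ing" and the proposed tag is VBG.
def f_45(t_i, x):
    return 1 if x["w_i"].endswith("ing") and t_i == "VBG" else 0

x = {"w_i-2": "they", "t_i-2": "PRP", "w_i-1": "are", "t_i-1": "VBP",
     "w_i": "selling", "w_i+1": "cars", "w_i+2": "."}
print(f_45("VBG", x), f_45("NN", x))   # -> 1 0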

  5. Feature Selection
  • The PC¹ way (see also yesterday’s class):
    – (try to) test all possible feature combinations
      • features may overlap, or be redundant; also, general or specific - impossible to select manually
    – greedy selection (a schematic loop is sketched below):
      • add one feature at a time, test if (good) improvement:
        – keep if yes, return to the pool of features if not
      • even this is costly, unless some shortcuts are made
        – see Berger & DPs for details
  • The other way:
    – use some heuristic to limit the number of features
  ¹ Politically (or, Probabilistically-stochastically) Correct
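A schematic sketch of the greedy loop; the gain function here is a placeholder for the costly improvement test (e.g. the approximate likelihood gain of Berger et al.), assumed rather than implemented:

# Repeatedly add the single candidate feature with the best gain; stop when
# no candidate yields a (good) improvement. Non-selected features stay in the pool.
def greedy_select(candidates, gain, max_features=100, min_gain=1e-4):
    selected, pool = [], list(candidates)
    while pool and len(selected) < max_features:
        best = max(pool, key=lambda f: gain(selected, f))
        if gain(selected, best) < min_gain:   # no (good) improvement
            break
        selected.append(best)                 # keep it
        pool.remove(best)
    return selected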

  6. Limiting the Number of Features
  • Always do (regardless of whether you’re PC or not):
    – use contexts which appear in the training data (lossless selection)
  • More or less PC, but entails huge savings (in the number of features to estimate λ_i weights for):
    – use only features appearing at least L times in the data (L ~ 10) (see the pruning sketch below)
    – use w_i-derived features which appear with rare words only
    – do not use all combinations of context (this is even “LC¹”)
    – but then, use all of them, and compute the λ_i only once using the Generalized Iterative Scaling algorithm
  ¹ Linguistically Correct
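The frequency cutoff can be sketched as follows; the training events and the template-instantiation function are assumed inputs, not part of the lecture material:

from collections import Counter

# Keep only features observed at least L times in the training data.
def prune_by_frequency(events, instantiate, L=10):
    counts = Counter()
    for tag, context in events:
        for feat in instantiate(tag, context):   # concrete features firing on this event
            counts[feat] += 1
    return {feat for feat, c in counts.items() if c >= L}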

  7. Feature Examples (Context)
  • From A. Ratnaparkhi (EMNLP, 1996, UPenn) (encoded in the sketch below):
    – t_i = T, w_i = X (frequency c > 4):
      • t_i = VBG, w_i = selling
    – t_i = T, w_i contains an uppercase character (rare):
      • t_i = NNP, tolower(w_i) ≠ w_i
    – t_i = T, t_i-1 = Y, t_i-2 = X:
      • t_i = VBP, t_i-2 = PRP, t_i-1 = RB
  • Other examples of possible features:
    – t_i = T, t_j is X, where j is the closest left position where Y
      • t_i = VBZ, t_j = NN, Y ⇔ t_j ∈ {NNP, NNS, NN}
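The first three templates might be encoded like this; the tuple representation of features is an assumption of the sketch, not Ratnaparkhi's implementation:

# Instantiate the three context-feature templates above for a proposed tag t_i.
def context_features(t_i, x):
    feats = [("tag+word", t_i, x["w_i"])]                             # t_i = T, w_i = X
    if x["w_i"].lower() != x["w_i"]:                                  # contains an uppercase char
        feats.append(("tag+has_upper", t_i))
    feats.append(("tag+prev_two_tags", t_i, x["t_i-2"], x["t_i-1"]))  # t_i = T, t_i-1 = Y, t_i-2 = X
    return feats

print(context_features("VBP", {"w_i": "Sell", "t_i-1": "RB", "t_i-2": "PRP"}))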

  8. Feature Examples (Lexical/Unknown)
  • From AR (see the sketch below):
    – t_i = T, suffix(w_i) = X (length of X < 5):
      • t_i = JJ, suffix(w_i) = eled (traveled, leveled, ...)
    – t_i = T, prefix(w_i) = X (length of X < 5):
      • t_i = JJ, prefix(w_i) = well- (well-done, well-received, ...)
    – t_i = T, w_i contains a hyphen:
      • t_i = JJ, ‘-’ in w_i (open-minded, short-sighted, ...)
  • Other possibility, for example:
    – t_i = T, w_i contains X:
      • t_i = NounPl, w_i contains an umlaut (ä, ö, ü) (Wörter, Länge, ...)
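A sketch of these lexical templates for unknown words (suffixes and prefixes of length < 5 plus the hyphen test); the tuple encoding is assumed, as before:

# Lexical features over the current word form w_i for a proposed tag t_i.
def lexical_features(t_i, w_i):
    feats = []
    for k in range(1, min(4, len(w_i)) + 1):
        feats.append(("tag+suffix", t_i, w_i[-k:]))   # t_i = T, suffix(w_i) = X
        feats.append(("tag+prefix", t_i, w_i[:k]))    # t_i = T, prefix(w_i) = X
    if "-" in w_i:
        feats.append(("tag+hyphen", t_i))             # t_i = T, w_i contains a hyphen
    return feats

print(lexical_features("JJ", "well-done"))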

  9. “Specialized” Word-based Features
  • List of words with most errors (WSJ, Penn Treebank):
    – about, that, more, up, ...
  • Add “specialized”, detailed features (one is sketched below):
    – t_i = T, w_i = X, t_i-1 = Y, t_i-2 = Z:
      • t_i = IN, w_i = about, t_i-1 = NNS, t_i-2 = DT
    – possible only for relatively high-frequency words
  • Slightly better results (also, problems with inconsistent [test] data)
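One such specialized feature, mirroring the example above, might look like this (the context keys are the same assumed representation used in the earlier sketches):

# Fires when w_i = "about" follows the tags DT NNS and the proposed tag is IN.
def f_about_IN(t_i, x):
    return 1 if (x["w_i"] == "about" and x["t_i-1"] == "NNS"
                 and x["t_i-2"] == "DT" and t_i == "IN") else 0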

  10. Maximum Entropy Tagging: Results
  • Base experiment (133k words, < 3% unknown):
    – 96.31% word accuracy
  • Specialized features added:
    – 96.49% word accuracy
  • Consistent subset (training + test):
    – 97.04% word accuracy (97.13% w/ specialized features)
  • Best in 2000; for details, see the AR paper
  • Now: perceptron, ~97.4%
    – Collins 2002, Raab 2009
