CSE 490U: Natural Language Processing, Spring 2016
Feature Rich Models
Yejin Choi - University of Washington
[Many slides from Dan Klein, Luke Zettlemoyer]

Structure in the output variable(s)? (columns)   What is the input representation? (rows)

                                        No Structure             Structured Inference
  Generative models                     Naïve Bayes              HMMs
  (classical probabilistic models)                               PCFGs
                                                                 IBM Models

  Log-linear models                     Perceptron               MEMM
  (discriminatively trained             Maximum Entropy          CRF
  feature-rich models)                  Logistic Regression

  Neural network models                 Feedforward NN           RNN, LSTM
  (representation learning)             CNN                      GRU, ...

Feature Rich Models
§ Throw anything (features) you want into the stew (the model)
§ Log-linear models
§ Often lead to great performance (sometimes even a best paper award):
  "11,001 New Features for Statistical Machine Translation", D. Chiang, K. Knight, and W. Wang, NAACL 2009.

Why want richer features?
§ POS tagging: more information about the context?
  § Is the previous word "the"?
  § Is the previous word "the" and the next word "of"?
  § Is the previous word capitalized and the next word numeric?
  § Is there a word "program" within a [-5, +5] window?
  § Is the current word part of a known idiom?
  § Conjunctions of any of the above?
§ Desiderata:
  § Lots and lots of features like the above: > 200K
  § No independence assumptions among features
§ Classical probability models, however:
  § Permit only a very small number of features
  § Make strong independence assumptions among features

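As a rough illustration (not from the lecture), here is what a few of these context features might look like in code; the feature names, window handling, and example sentence are invented for exposition.

```python
# A minimal sketch of binary context features for POS tagging.
# Feature names and the helper below are made up for illustration.

def context_features(words, i):
    """Return the set of binary context features that fire at position i."""
    feats = set()
    prev_word = words[i - 1] if i > 0 else "<START>"
    next_word = words[i + 1] if i + 1 < len(words) else "<STOP>"

    if prev_word.lower() == "the":
        feats.add("prev=the")
    if prev_word.lower() == "the" and next_word.lower() == "of":
        feats.add("prev=the_next=of")
    if prev_word[:1].isupper() and next_word.isdigit():
        feats.add("prev_capitalized_next_numeric")
    if any(w.lower() == "program" for w in words[max(0, i - 5):i + 6]):
        feats.add("program_in_[-5,+5]_window")
    return feats

print(context_features("the director of the program arrived".split(), 2))
```
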
HMMs: P(tag sequence | sentence)
§ We want a model of sequences y and observations x.

  [Figure: chain y_0 → y_1 → y_2 → ... → y_n → y_{n+1}, with each state y_i emitting word x_i]

    p(x_1 ... x_n, y_1 ... y_{n+1}) = q(STOP | y_n) ∏_{i=1}^{n} q(y_i | y_{i-1}) e(x_i | y_i)

  where y_0 = START, and we call q(y' | y) the transition distribution and e(x | y) the
  emission (or observation) distribution.
§ Assumptions:
  § The tag/state sequence is generated by a Markov model.
  § Words are chosen independently, conditioned only on the tag/state.
  § These are totally broken assumptions: why?

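For concreteness, a toy sketch of the joint probability defined above; the transition and emission tables are invented for illustration, and a real tagger would estimate them from data.

```python
import math

# Toy sketch of p(x_1..x_n, y_1..y_{n+1}) = q(STOP | y_n) * prod_i q(y_i | y_{i-1}) e(x_i | y_i).
# The tables q and e below are invented placeholders.

q = {("START", "DT"): 0.8, ("DT", "NN"): 0.9, ("NN", "STOP"): 0.5}  # transitions q(y' | y)
e = {("DT", "the"): 0.6, ("NN", "dog"): 0.1}                         # emissions e(x | y)

def hmm_log_joint(words, tags):
    """Log joint probability of a (word, tag) sequence under the HMM."""
    logp = 0.0
    prev = "START"
    for w, t in zip(words, tags):
        logp += math.log(q[(prev, t)]) + math.log(e[(t, w)])
        prev = t
    return logp + math.log(q[(prev, "STOP")])  # final q(STOP | y_n)

print(math.exp(hmm_log_joint(["the", "dog"], ["DT", "NN"])))  # 0.8 * 0.6 * 0.9 * 0.1 * 0.5
```
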
PCFGs: P(parse tree | sentence)
PCFG Example

Grammar rules and probabilities q(α → β):
  S  → NP VP   1.0        Vi → sleeps     1.0
  VP → Vi      0.4        Vt → saw        1.0
  VP → Vt NP   0.4        NN → man        0.7
  VP → VP PP   0.2        NN → woman      0.2
  NP → DT NN   0.3        NN → telescope  0.1
  NP → NP PP   0.7        DT → the        1.0
  PP → IN NP   1.0        IN → with       0.5
                          IN → in         0.5

Example tree t_2 for "The man saw the woman with the telescope" (PP attached to the VP):

  p(t_2) = 1.0 × 0.3 × 1.0 × 0.7 × 0.2 × 0.4 × 1.0 × 0.3 × 1.0 × 0.2 × 1.0 × 0.5 × 0.3 × 1.0 × 0.1

• The probability of a tree t with rules α_1 → β_1, α_2 → β_2, ..., α_n → β_n is

    p(t) = ∏_{i=1}^{n} q(α_i → β_i)

  where q(α → β) is the probability for rule α → β.

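A quick sketch that multiplies out the rule probabilities for the tree above, assuming the PP attaches to the VP; the rule list is transcribed from the toy grammar on this slide.

```python
# p(t) is the product of the probabilities of the rules used in the tree.
tree_rules = [
    ("S -> NP VP",   1.0),
    ("NP -> DT NN",  0.3), ("DT -> the",  1.0), ("NN -> man",       0.7),
    ("VP -> VP PP",  0.2),
    ("VP -> Vt NP",  0.4), ("Vt -> saw",  1.0),
    ("NP -> DT NN",  0.3), ("DT -> the",  1.0), ("NN -> woman",     0.2),
    ("PP -> IN NP",  1.0), ("IN -> with", 0.5),
    ("NP -> DT NN",  0.3), ("DT -> the",  1.0), ("NN -> telescope", 0.1),
]

p = 1.0
for rule, prob in tree_rules:   # p(t) = product of q(alpha_i -> beta_i)
    p *= prob
print(p)
```
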
Rich features for long range dependencies
§ What's different between the basic PCFG scores here?
§ What (lexical) correlations need to be scored?

LMs: P(text)

    p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1})    where  ∑_{x_i ∈ V*} q(x_i | x_{i-1}) = 1,
    x_0 = START  and  V* := V ∪ {STOP}

§ Generative process: (1) generate the very first word conditioning on the special symbol
  START; then (2) pick the next word conditioning on the previous word; repeat (2) until the
  special word STOP gets picked.
§ Graphical model: START → x_1 → x_2 → ... → x_{n-1} → STOP
§ Subtleties:
  § If we are introducing the special START symbol to the model, then we are making the
    assumption that the sentence always starts with the special start word START; thus when
    we talk about p(x_1 ... x_n), it is in fact p(x_1 ... x_n | x_0 = START).
  § While we add the special STOP symbol to the vocabulary V*, we do not add the special
    START symbol to the vocabulary. Why?

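A minimal sketch of this bigram model; the probability table q is invented, and the final transition to STOP is included as in the formula above.

```python
import math

# Toy bigram LM: p(x_1..x_n) = prod_i q(x_i | x_{i-1}), with x_0 = START and the
# last generated symbol being STOP. The table q is an invented placeholder.

q = {
    ("START", "the"): 0.5, ("the", "dog"): 0.2,
    ("dog", "barks"): 0.3, ("barks", "STOP"): 0.6,
}

def sentence_prob(words):
    """Probability of a sentence, including the final transition to STOP."""
    logp = 0.0
    prev = "START"
    for w in words + ["STOP"]:
        logp += math.log(q[(prev, w)])
        prev = w
    return math.exp(logp)

print(sentence_prob(["the", "dog", "barks"]))  # 0.5 * 0.2 * 0.3 * 0.6
```
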
Internals of probabilistic models: nothing but adding log-prob
§ LM: … + log p(w7 | w5, w6) + log p(w8 | w6, w7) + …
§ PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + …
§ HMM tagging: … + log p(t7 | t5, t6) + log p(w7 | t7) + …
§ Noisy channel: [ log p(source) ] + [ log p(data | source) ]
§ Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + …

Arbitrary scores instead of log probs?
Change log p(this | that) to Φ(this ; that)
§ LM: … + Φ(w7 ; w5, w6) + Φ(w8 ; w6, w7) + …
§ PCFG: Φ(NP VP ; S) + Φ(Papa ; NP) + Φ(VP PP ; VP) + …
§ HMM tagging: … + Φ(t7 ; t5, t6) + Φ(w7 ; t7) + …                         →  MEMM or CRF
§ Noisy channel: [ Φ(source) ] + [ Φ(data ; source) ]
§ Naïve Bayes: Φ(Class) + Φ(feature1 ; Class) + Φ(feature2 ; Class) + …    →  logistic regression / max-ent

Running example: POS tagging
§ Roadmap of (known / unknown word) accuracies:
  § Strawman baseline:
    § Most frequent tag: ~90% / ~50%
  § Generative models:
    § Trigram HMM: ~95% / ~55%
    § TnT (HMM++): 96.2% / 86.0% (with smart UNK'ing)
  § Feature-rich models?
  § Upper bound: ~98%

Structure in the output variable(s)? (columns)   What is the input representation? (rows)

                                        No Structure             Structured Inference
  Generative models                     Naïve Bayes              HMMs
  (classical probabilistic models)                               PCFGs
                                                                 IBM Models

  Log-linear models                     Perceptron               MEMM
  (discriminatively trained             Maximum Entropy          CRF
  feature-rich models)                  Logistic Regression

  Neural network models                 Feedforward NN           RNN, LSTM
  (representation learning)             CNN                      GRU, ...

Rich features for rich contextual information
§ Throw in various features about the context:
  § Is the previous word "the" and the next word "of"?
  § Is the previous word capitalized and the next word numeric?
  § Frequencies of "the" within a [-15, +15] window?
  § Is the current word part of a known idiom?
§ You can also define features that look at the output y:
  § Is the previous word "the" and the next tag "IN"?
  § Is the previous word "the" and the next tag "NN"?
  § Is the previous word "the" and the next tag "VB"?
§ You can also take any conjunctions of the above.
§ Create a very long feature vector, with dimensions often > 200K:

    f(x, y) = [0, 0, 0, 1, 0, 0, 0, 0, 3, 0.2, 0, 0, ....]

§ Overlapping features are fine: no independence assumptions among features.

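A minimal sketch of what a sparse f(x, y) might look like for tagging; the feature names and the conjunction feature are illustrative only.

```python
# Sparse feature map f(x, y): x is the sentence plus a position, y a candidate tag.
# In practice the vector has hundreds of thousands of overlapping dimensions.

def feature_vector(words, i, y):
    """Sparse f(x, y) as a dict mapping feature name -> value."""
    prev_word = words[i - 1] if i > 0 else "<START>"
    feats = {
        "word=" + words[i] + "_tag=" + y: 1.0,
        "prev=" + prev_word.lower() + "_tag=" + y: 1.0,
    }
    if words[i][0].isupper():
        feats["capitalized_tag=" + y] = 1.0
    if prev_word.lower() == "the" and words[i][0].isupper():
        feats["prev=the_capitalized_tag=" + y] = 1.0   # a conjunction feature
    return feats

print(feature_vector("the Program ran".split(), 1, "NN"))
```
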
Maximum Entropy (MaxEnt) Models
§ Output: y, one POS tag for one word (at a time)
§ Input: x (any words in the context), represented as a feature vector f(x, y)
§ Model parameters: w
§ Make a probability using the SoftMax function:

    p(y | x) = exp(w · f(x, y)) / ∑_{y'} exp(w · f(x, y'))

  exp(·) makes the score positive; the sum over y' in the denominator normalizes.
§ Also known as "log-linear" models (linear if you take the log)

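A small sketch of this distribution with sparse feature vectors stored as dicts; the weight vector, tag set, and feature function are placeholders supplied by the caller.

```python
import math

# Sketch of p(y | x) = exp(w . f(x, y)) / sum_{y'} exp(w . f(x, y')).
# w and f(x, y) are sparse dicts; feature_fn(x, y) is assumed to return f(x, y).

def dot(w, feats):
    """w · f(x, y) for a sparse feature dict."""
    return sum(w.get(name, 0.0) * value for name, value in feats.items())

def p_y_given_x(w, feature_fn, x, tags):
    """Return {y: p(y | x)} over the candidate tag set."""
    scores = {y: math.exp(dot(w, feature_fn(x, y))) for y in tags}  # make positive
    z = sum(scores.values())                                         # normalize
    return {y: s / z for y, s in scores.items()}
```
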
Training MaxEnt Models
§ Make a probability using the SoftMax function:

    p(y | x) = exp(w · f(x, y)) / ∑_{y'} exp(w · f(x, y'))

§ Training: given {(x_i, y_i)}_{i=1}^{n}, maximize the log likelihood of the training data:

    L(w) = log ∏_i p(y_i | x_i) = ∑_i log [ exp(w · f(x_i, y_i)) / ∑_{y'} exp(w · f(x_i, y')) ]

  which also incidentally maximizes the entropy (hence "maximum entropy").

Training MaxEnt Models
§ Make a probability using the SoftMax function:

    p(y | x) = exp(w · f(x, y)) / ∑_{y'} exp(w · f(x, y'))

§ Training: maximize the log likelihood:

    L(w) = ∑_i log [ exp(w · f(x_i, y_i)) / ∑_{y'} exp(w · f(x_i, y')) ]
         = ∑_i ( w · f(x_i, y_i) − log ∑_{y'} exp(w · f(x_i, y')) )

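A sketch of the simplified log likelihood in the last line, reusing dot() and a sparse feature function in the style of the earlier sketch; data is assumed to be a list of (x_i, y_i) pairs.

```python
import math

# L(w) = sum_i [ w . f(x_i, y_i) - log sum_{y'} exp(w . f(x_i, y')) ]
# dot() is the sparse dot product from the earlier sketch.

def log_likelihood(w, data, feature_fn, tags):
    total = 0.0
    for x, y in data:
        log_z = math.log(sum(math.exp(dot(w, feature_fn(x, yp))) for yp in tags))
        total += dot(w, feature_fn(x, y)) - log_z
    return total
```
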
Training MaxEnt Models

    L(w) = ∑_i ( w · f(x_i, y_i) − log ∑_{y'} exp(w · f(x_i, y')) )

§ Take the partial derivative with respect to each w_k in the weight vector w:

    ∂L(w)/∂w_k = ∑_i ( f_k(x_i, y_i) − ∑_{y'} p(y' | x_i) f_k(x_i, y') )

  The first term is the total count of feature k with respect to the correct predictions;
  the second term is the expected count of feature k with respect to the predicted output.

Convex Optimization for Training
§ The likelihood function is convex (we can get the global optimum).
§ Many optimization algorithms / software packages are available:
  § Gradient ascent (descent), Conjugate Gradient, L-BFGS, etc.
§ All we need are:
  (1) evaluate the function at the current w
  (2) evaluate its derivative at the current w

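As a minimal sketch (plain gradient ascent rather than L-BFGS), the training loop only needs the gradient from the previous slide; it reuses p_y_given_x() and the sparse-dict conventions from the earlier sketches, and the learning rate and iteration count are arbitrary.

```python
from collections import defaultdict

# dL/dw_k = sum_i [ f_k(x_i, y_i) - sum_{y'} p(y' | x_i) f_k(x_i, y') ]

def gradient(w, data, feature_fn, tags):
    grad = defaultdict(float)
    for x, y in data:
        for k, v in feature_fn(x, y).items():        # observed counts of each feature
            grad[k] += v
        probs = p_y_given_x(w, feature_fn, x, tags)
        for yp in tags:                               # minus expected counts under the model
            for k, v in feature_fn(x, yp).items():
                grad[k] -= probs[yp] * v
    return grad

def train(data, feature_fn, tags, lr=0.1, iters=50):
    w = {}
    for _ in range(iters):
        for k, g in gradient(w, data, feature_fn, tags).items():
            w[k] = w.get(k, 0.0) + lr * g             # ascend the log likelihood
    return w
```
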
Graphical Representation of MaxEnt

    p(y | x) = exp(w · f(x, y)) / ∑_{y'} exp(w · f(x, y'))

  [Figure: a single output node Y connected to input nodes x_1, x_2, …, x_n]

Graphical Representation of Naïve Bayes

    p(x | y) = ∏_j p(x_j | y)

  [Figure: a single output node Y connected to input nodes x_1, x_2, …, x_n]
