Log-Linear Models for Tagging (Maximum-entropy Markov Models (MEMMs)) Michael Collins, Columbia University
Part-of-Speech Tagging INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results. OUTPUT: Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./. N = Noun V = Verb P = Preposition ADV = Adverb ADJ = Adjective . . .
Named Entity Recognition INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results. OUTPUT: Profits soared at [ Company Boeing Co. ] , easily topping forecasts on [ Location Wall Street ] , as their CEO [ Person Alan Mulally ] announced first quarter results.
Named Entity Extraction as Tagging INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results. OUTPUT: Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA NA = No entity SC = Start Company CC = Continue Company SL = Start Location CL = Continue Location . . .
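As a quick illustration of this encoding (not part of the original slides), here is a minimal sketch that converts labeled entity spans into start/continue tags; the function name and the span format are assumptions made for the example:

def spans_to_tags(words, spans):
    """Convert labeled entity spans into start/continue tags.

    spans: list of (start, end, label) tuples, end exclusive,
           e.g. (3, 5, "C") marks words 3-4 as a Company.
    """
    tags = ["NA"] * len(words)              # NA = no entity
    for start, end, label in spans:
        tags[start] = "S" + label           # e.g. SC = Start Company
        for i in range(start + 1, end):
            tags[i] = "C" + label           # e.g. CC = Continue Company
    return tags

words = ["Profits", "soared", "at", "Boeing", "Co.", ","]
print(spans_to_tags(words, [(3, 5, "C")]))
# ['NA', 'NA', 'NA', 'SC', 'CC', 'NA']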
Our Goal Training set: 1 Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./. 2 Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/, the/DT Dutch/NNP publishing/VBG group/NN ./. 3 Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC chairman/NN of/IN Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP ,/, was/VBD named/VBN a/DT nonexecutive/JJ director/NN of/IN this/DT British/JJ industrial/JJ conglomerate/NN ./. . . . 38,219 It/PRP is/VBZ also/RB pulling/VBG 20/CD people/NNS out/IN of/IN Puerto/NNP Rico/NNP ,/, who/WP were/VBD helping/VBG Huricane/NNP Hugo/NNP victims/NNS ,/, and/CC sending/VBG them/PRP to/TO San/NNP Francisco/NNP instead/RB ./. ◮ From the training set, induce a function/algorithm that maps new sentences to their tag sequences.
Overview ◮ Recap: The Tagging Problem ◮ Log-linear taggers
Log-Linear Models for Tagging
◮ We have an input sentence w_[1:n] = w_1, w_2, ..., w_n (w_i is the i'th word in the sentence)
◮ We have a tag sequence t_[1:n] = t_1, t_2, ..., t_n (t_i is the i'th tag in the sentence)
◮ We'll use a log-linear model to define p(t_1, t_2, ..., t_n | w_1, w_2, ..., w_n) for any sentence w_[1:n] and tag sequence t_[1:n] of the same length. (Note: contrast with an HMM, which defines the joint probability p(t_1 ... t_n, w_1 ... w_n).)
◮ Then the most likely tag sequence for w_[1:n] is t*_[1:n] = argmax_{t_[1:n]} p(t_[1:n] | w_[1:n])
How to model p(t_[1:n] | w_[1:n])? A Trigram Log-Linear Tagger:
p(t_[1:n] | w_[1:n]) = ∏_{j=1}^{n} p(t_j | w_1 ... w_n, t_1 ... t_{j-1})   (chain rule)
                     = ∏_{j=1}^{n} p(t_j | w_1, ..., w_n, t_{j-2}, t_{j-1})   (independence assumptions)
◮ We take t_0 = t_{-1} = *
◮ Independence assumption: each tag depends only on the previous two tags: p(t_j | w_1, ..., w_n, t_1, ..., t_{j-1}) = p(t_j | w_1, ..., w_n, t_{j-2}, t_{j-1})
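As a sketch (not from the slides), this decomposition can be computed directly as a product of local terms; here q is assumed to be any function returning p(t_j | w_[1:n], t_{j-2}, t_{j-1}), for instance the log-linear model defined below:

def sequence_probability(words, tags, q):
    """p(t_[1:n] | w_[1:n]) under the trigram independence assumption.

    q(tag, prev2, prev1, words, j) should return the local conditional
    p(t_j = tag | w_[1:n], t_{j-2} = prev2, t_{j-1} = prev1); j is 1-based.
    """
    prob = 1.0
    prev2, prev1 = "*", "*"                 # t_0 = t_{-1} = *
    for j, tag in enumerate(tags, start=1):
        prob *= q(tag, prev2, prev1, words, j)
        prev2, prev1 = prev1, tag
    return prob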
An Example Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere . • There are many possible tags in the position ?? Y = { NN, NNS, Vt, Vi, IN, DT, . . . }
Representation: Histories
◮ A history is a 4-tuple ⟨t_{-2}, t_{-1}, w_[1:n], i⟩
◮ t_{-2}, t_{-1} are the previous two tags.
◮ w_[1:n] are the n words in the input sentence.
◮ i is the index of the word being tagged
◮ X is the set of all possible histories
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .
◮ t_{-2}, t_{-1} = DT, JJ
◮ w_[1:n] = ⟨Hispaniola, quickly, became, ..., Hemisphere, .⟩
◮ i = 6
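A minimal sketch of the history representation (the tuple and field names are illustrative, not from the lecture):

from collections import namedtuple

# A history <t_{-2}, t_{-1}, w_[1:n], i>
History = namedtuple("History", ["t_minus_2", "t_minus_1", "words", "i"])

words = ("Hispaniola", "quickly", "became", "an", "important", "base",
         "from", "which", "Spain", "expanded", "its", "empire")   # sentence truncated
h = History("DT", "JJ", words, 6)    # tagging the 6th word, "base"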
Recap: Feature Vector Representations in Log-Linear Models
◮ We have some input domain X, and a finite label set Y. Aim is to provide a conditional probability p(y | x) for any x ∈ X and y ∈ Y.
◮ A feature is a function f : X × Y → R (often binary features or indicator functions f : X × Y → {0, 1}).
◮ Say we have m features f_k for k = 1 ... m ⇒ a feature vector f(x, y) ∈ R^m for any x ∈ X and y ∈ Y.
An Example (continued)
◮ X is the set of all possible histories of the form ⟨t_{-2}, t_{-1}, w_[1:n], i⟩
◮ Y = {NN, NNS, Vt, Vi, IN, DT, ...}
◮ We have m features f_k : X × Y → R for k = 1 ... m
For example:
f_1(h, t) = 1 if current word w_i is base and t = Vt, 0 otherwise
f_2(h, t) = 1 if current word w_i ends in ing and t = VBG, 0 otherwise
. . .
f_1(⟨DT, JJ, ⟨Hispaniola, ...⟩, 6⟩, Vt) = 1
f_2(⟨DT, JJ, ⟨Hispaniola, ...⟩, 6⟩, Vt) = 0
. . .
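A sketch of the two indicator features above, reusing the History tuple and the example history h from the previous sketch (1-based word indexing, as in the slides):

def f1(h, t):
    # 1 if the current word w_i is "base" and the proposed tag is Vt
    return 1 if h.words[h.i - 1] == "base" and t == "Vt" else 0

def f2(h, t):
    # 1 if the current word w_i ends in "ing" and the proposed tag is VBG
    return 1 if h.words[h.i - 1].endswith("ing") and t == "VBG" else 0

print(f1(h, "Vt"), f2(h, "Vt"))    # prints: 1 0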
The Full Set of Features in (Ratnaparkhi, 96)
◮ Word/tag features for all word/tag pairs, e.g.,
f_100(h, t) = 1 if current word w_i is base and t = Vt, 0 otherwise
◮ Spelling features for all prefixes/suffixes of length ≤ 4, e.g.,
f_101(h, t) = 1 if current word w_i ends in ing and t = VBG, 0 otherwise
f_102(h, t) = 1 if current word w_i starts with pre and t = NN, 0 otherwise
The Full Set of Features in (Ratnaparkhi, 96)
◮ Contextual features, e.g.,
f_103(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, Vt⟩, 0 otherwise
f_104(h, t) = 1 if ⟨t_{-1}, t⟩ = ⟨JJ, Vt⟩, 0 otherwise
f_105(h, t) = 1 if ⟨t⟩ = ⟨Vt⟩, 0 otherwise
f_106(h, t) = 1 if previous word w_{i-1} = the and t = Vt, 0 otherwise
f_107(h, t) = 1 if next word w_{i+1} = the and t = Vt, 0 otherwise
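A sketch of a feature extractor in the spirit of this feature set, again over the History tuple from above; the feature-string naming scheme is an assumption made for the example, not Ratnaparkhi's:

def extract_features(h, t):
    """Return the list of indicator features that fire for history h and tag t."""
    w = h.words[h.i - 1]                              # current word w_i
    feats = ["WORD=%s;TAG=%s" % (w, t)]               # word/tag feature
    for k in range(1, min(4, len(w)) + 1):            # prefixes/suffixes of length <= 4
        feats.append("PREFIX=%s;TAG=%s" % (w[:k], t))
        feats.append("SUFFIX=%s;TAG=%s" % (w[-k:], t))
    feats.append("TRIGRAM=%s,%s,%s" % (h.t_minus_2, h.t_minus_1, t))   # contextual features
    feats.append("BIGRAM=%s,%s" % (h.t_minus_1, t))
    feats.append("UNIGRAM=%s" % t)
    prev_word = h.words[h.i - 2] if h.i > 1 else "*"
    next_word = h.words[h.i] if h.i < len(h.words) else "STOP"
    feats.append("PREVWORD=%s;TAG=%s" % (prev_word, t))
    feats.append("NEXTWORD=%s;TAG=%s" % (next_word, t))
    return feats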
Log-Linear Models
◮ We have some input domain X, and a finite label set Y. Aim is to provide a conditional probability p(y | x) for any x ∈ X and y ∈ Y.
◮ A feature is a function f : X × Y → R (often binary features or indicator functions f : X × Y → {0, 1}).
◮ Say we have m features f_k for k = 1 ... m ⇒ a feature vector f(x, y) ∈ R^m for any x ∈ X and y ∈ Y.
◮ We also have a parameter vector v ∈ R^m
◮ We define
p(y | x; v) = exp(v · f(x, y)) / Σ_{y' ∈ Y} exp(v · f(x, y'))
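A sketch of this conditional distribution, assuming features are represented as lists of strings (as produced by the extractor above) and the parameter vector v as a dictionary mapping feature strings to weights:

import math

def log_linear_prob(x, y, labels, v, features):
    """p(y | x; v) = exp(v . f(x, y)) / sum over y' in Y of exp(v . f(x, y'))."""
    def score(label):
        return sum(v.get(feat, 0.0) for feat in features(x, label))
    scores = {label: score(label) for label in labels}
    m = max(scores.values())                           # subtract max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())  # partition function
    return math.exp(scores[y] - m) / z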
Training the Log-Linear Model
◮ To train a log-linear model, we need a training set (x_i, y_i) for i = 1 ... n. Then search for
v* = argmax_v [ Σ_i log p(y_i | x_i; v)  −  (λ/2) Σ_k v_k^2 ]
where the first term is the log-likelihood and the second term is the regularizer (see the last lecture on log-linear models).
◮ The training set is simply all history/tag pairs seen in the training data
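A sketch of the objective being maximized, assuming the log_linear_prob and extract_features sketches above (including the math import); in practice one would also compute the gradient and fit v with gradient ascent or L-BFGS, as covered in the log-linear models lecture:

def objective(train_pairs, labels, v, lam):
    """Regularized log-likelihood: sum_i log p(y_i | x_i; v) - (lam/2) sum_k v_k^2.

    train_pairs: all (history, tag) pairs extracted from the tagged training sentences.
    """
    log_likelihood = sum(
        math.log(log_linear_prob(h, t, labels, v, extract_features))
        for h, t in train_pairs)
    regularizer = 0.5 * lam * sum(weight ** 2 for weight in v.values())
    return log_likelihood - regularizer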
The Viterbi Algorithm
Problem: for an input w_1 ... w_n, find
argmax_{t_1 ... t_n} p(t_1 ... t_n | w_1 ... w_n)
We assume that p takes the form
p(t_1 ... t_n | w_1 ... w_n) = ∏_{i=1}^{n} q(t_i | t_{i-2}, t_{i-1}, w_[1:n], i)
(In our case q(t_i | t_{i-2}, t_{i-1}, w_[1:n], i) is the estimate from a log-linear model.)
The Viterbi Algorithm
◮ Define n to be the length of the sentence
◮ Define
r(t_1 ... t_k) = ∏_{i=1}^{k} q(t_i | t_{i-2}, t_{i-1}, w_[1:n], i)
◮ Define a dynamic programming table
π(k, u, v) = maximum probability of a tag sequence ending in tags u, v at position k
that is,
π(k, u, v) = max_{⟨t_1, ..., t_{k-2}⟩} r(t_1 ... t_{k-2}, u, v)
A Recursive Definition
Base case: π(0, *, *) = 1
Recursive definition: for any k ∈ {1 ... n}, for any u ∈ S_{k-1} and v ∈ S_k:
π(k, u, v) = max_{t ∈ S_{k-2}} ( π(k-1, t, u) × q(v | t, u, w_[1:n], k) )
where S_k is the set of possible tags at position k
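A sketch of the full dynamic program with backpointers; here q(v, t, u, words, k) is assumed to return q(v | t, u, w_[1:n], k), and S_k is taken to be the full tag set for k ≥ 1 and {*} for k ≤ 0:

def viterbi(words, tag_set, q):
    """Return the highest-probability tag sequence for the sentence `words`."""
    n = len(words)
    if n == 0:
        return []

    def S(k):                       # S_k: possible tags at position k
        return {"*"} if k <= 0 else tag_set

    pi = {(0, "*", "*"): 1.0}       # base case
    bp = {}                         # backpointers
    for k in range(1, n + 1):
        for u in S(k - 1):
            for v in S(k):
                best_score, best_t = max(
                    ((pi[(k - 1, t, u)] * q(v, t, u, words, k), t) for t in S(k - 2)),
                    key=lambda pair: pair[0])
                pi[(k, u, v)] = best_score
                bp[(k, u, v)] = best_t

    # Pick the best final tag pair, then follow the backpointers.
    _, (u, v) = max(((pi[(n, u, v)], (u, v)) for u in S(n - 1) for v in S(n)),
                    key=lambda pair: pair[0])
    tags = [u, v]
    for k in range(n, 2, -1):
        tags.insert(0, bp[(k, tags[0], tags[1])])
    return tags[-n:]                # drops the leading '*' when n == 1

Each position considers all tag trigrams, so the running time is O(n |S|^3) local-score evaluations for tag set S, rather than the exponential cost of enumerating all tag sequences.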