SLIDE 1 Hidden–Variable Models for Discriminative Reranking
Terry Koo and Michael Collins
{maestro|mcollins}@csail.mit.edu
SLIDE 2
Overview of reranking
The reranking approach
Use a baseline model to get the N-best candidates
“Rerank” the candidates using a more complex model
Parse reranking
Collins (2000): 88.2% ⇒ 89.8%
Charniak and Johnson (2005): 89.7% ⇒ 91.0%
Talk by Brooke Cowan in 7B: 83.6% ⇒ 85.1%
Also applied to
MT (Och and Ney, 2002; Shen et al., 2004)
NL Generation (Walker et al., 2001)
SLIDE 3
Representing NLP structures
Proper representation is critical to success
Hand–crafted feature vector representations
Φ(tree) = {0, 1, 2, 0, 0, 3, 0, 1}
Features defined through kernels
K(tree1, tree2) = Φ(tree1)·Φ(tree2)
This talk: A new approach using hidden variables
SLIDE 4
Two facets of lexical items
Different lexical items can have similar meanings, e.g. president and chairman
Clustering: president, chairman ∈ NounCluster4
A single lexical item can have different meanings, e.g. [river] bank vs [financial] bank
Refinement: bank1, bank2 ∈ bank
Model clusterings and refinements as hidden variables that support the reranking task
SLIDE 5
Highlights of the approach
Conditional log–linear model with hidden variables
Dynamic programming is used for training and decoding
Clustering and refinement done automatically using a discriminative criterion
SLIDE 6
Overview of talk
Motivation
Design
  General form of the model
  Training and decoding efficiently
  Creating specific instantiations
Results
Discussion
Conclusion
SLIDE 7
The parse reranking framework
Sentences si for 1 ≤ i ≤ n
s1: Pierre Vinken , 61 years old , will join ...
s2: Mr. Vinken is chairman of Elsevier N.V. ...
s3: Big Board Chairman John Phelan said yesterday ...
Each si has candidate parses ti,j for 1 ≤ j ≤ ni
ti,1 is the best candidate parse for si
SLIDE 8 The parse reranking framework
ti,j has phrase structure and dependency tree
[Figure: phrase-structure parse tree for “Mr. Vinken is chairman of Elsevier N.V.”]
SLIDE 9 The parse reranking framework
ti,j has phrase structure and dependency tree
[Figure: the same parse shown with its dependency tree]
SLIDE 10 Adding hidden variables
Hidden–value domains Hw(ti,j) for 1 ≤ w ≤ len(si)
[Figure: the parse tree with a hidden-value domain attached to each word, e.g. {NN1, NN2, NN3} for chairman, {VB1, VB2, VB3} for is, {IN1, IN2, IN3} for of, and {NNP1, NNP2, NNP3} for the proper nouns]
SLIDE 11 Adding hidden variables
Assignment h ∈ H1(ti,j) × ... × Hlen(si)(ti,j)
[Figure: the same tree with a single hidden value selected at each word, i.e. one assignment h]
SLIDE 12 Marginalized probability model
Φ(ti,j, h) produces a descriptive feature vector
Φ2(ti,j, h) = Count(chairman has hidden value NN1)
Φ13(ti,j, h) = Count(NNP2 is a direct object of VB1)
Φ19(ti,j, h) = Count(NN1 coordinates with NN2)
SLIDE 13 Marginalized probability model
Log–linear distribution over (ti,j, h) with parameters Θ:
p(ti,j, h | si, Θ) = exp(Φ(ti,j, h)·Θ) / Σ_{j′,h′} exp(Φ(ti,j′, h′)·Θ)
Marginalize over assignments h:
p(ti,j | si, Θ) = Σ_h p(ti,j, h | si, Θ)
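As a concrete illustration of the two formulas above, here is a brute-force sketch in Python (illustrative only: candidate_probs, the feature vectors, and the numbers are made up, and a real implementation replaces the enumeration over h with the dynamic programming of slide 25).

import numpy as np

def candidate_probs(phi, theta):
    """phi[j][k] is the feature vector Phi(t_ij, h_k) of candidate j under
    hidden assignment k; returns p(t_ij | s_i, theta) for every candidate."""
    # Unnormalized log-score of each (candidate, assignment) pair.
    scores = [np.array([v @ theta for v in assigns]) for assigns in phi]
    # Marginalize over hidden assignments with log-sum-exp ...
    log_cand = np.array([np.logaddexp.reduce(s) for s in scores])
    # ... then normalize over the candidate list.
    return np.exp(log_cand - np.logaddexp.reduce(log_cand))

# Toy usage: two candidates, two hidden assignments each, three features.
phi = [[np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 1.0])],
       [np.array([2.0, 1.0, 0.0]), np.array([1.0, 1.0, 1.0])]]
theta = np.array([0.5, -0.2, 0.1])
print(candidate_probs(phi, theta))  # the two probabilities sum to 1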
SLIDE 14 Optimizing the parameters
Define loss as negative log-likelihood
L(Θ) = - Σ_{i=1..n} log p(ti,1 | si, Θ)
Minimize L(Θ) through gradient descent
∂L/∂Θ = - Σ_i Σ_h p(h | ti,1, si, Θ) Φ(ti,1, h) + Σ_{i,j} p(ti,j | si, Θ) Σ_h p(h | ti,j, si, Θ) Φ(ti,j, h)
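To make the two expectation terms concrete, here is a brute-force sketch of the gradient contribution of one sentence (a hypothetical illustration, not the authors' code; the exponential enumeration over h is exactly what belief propagation removes later in the talk).

import numpy as np

def nll_gradient(phi, theta):
    """Gradient of -log p(t_{i,1} | s_i, theta) for a single sentence.
    phi[j][k] is Phi(t_ij, h_k); candidate j = 0 plays the role of the
    best candidate parse t_{i,1}."""
    scores = [np.array([v @ theta for v in assigns]) for assigns in phi]
    log_z = np.logaddexp.reduce(np.concatenate(scores))
    joint = [np.exp(s - log_z) for s in scores]          # p(t_ij, h | s_i)
    # Second term: expected features under the full joint p(t, h | s).
    exp_all = sum(p * v for ps, vs in zip(joint, phi) for p, v in zip(ps, vs))
    # First term: expected features under p(h | t_{i,1}, s), i.e. the joint
    # renormalized within candidate 0.
    cond_best = joint[0] / joint[0].sum()
    exp_best = sum(p * v for p, v in zip(cond_best, phi[0]))
    return exp_all - exp_best                            # this sentence's dL/dTheta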
SLIDE 15
Overview of talk
Motivation
Design
  General form of the model
  Training and decoding efficiently
  Creating specific instantiations
Results
Discussion
Conclusion
SLIDE 16 Problems with efficiency
|H1(ti,j) × ... × Hlen(si)(ti,j)| grows exponentially, so training the model is intractable:
∂L/∂Θ = - Σ_i Σ_h p(h | ti,1, si, Θ) Φ(ti,1, h) + Σ_{i,j} p(ti,j | si, Θ) Σ_h p(h | ti,j, si, Θ) Φ(ti,j, h)
Decoding the model is also intractable:
p(ti,j | si, Θ) = Σ_h p(ti,j, h | si, Θ)
SLIDE 18
Locality constraint on features
Features have pairwise local scope on hidden variables
Features still have global scope on non-hidden information
Φ can be factored into local feature vectors, allowing dynamic programming
SLIDE 19
Local feature vectors
Define two kinds of local feature vector φ:
Single-variable vectors φ(ti,j, w, hw) look at a single hidden variable
Pairwise vectors φ(ti,j, u, v, hu, hv) look at two hidden variables in a dependency relationship
SLIDE 20 Local feature vectors
Φ(ti,j, h) looks at every hidden variable
[Figure: the tree with a hidden value at every word; Φ(ti,j, h) has scope over all of them]
SLIDE 21 Local feature vectors
φ(ti,j, chairman, NN3) only sees NN3
[Figure: the tree with only chairman's hidden value NN3 in scope]
SLIDE 22 Local feature vectors
φ(ti,j, chairman, of, NN3, IN2) sees NN3 and IN2
[Figure: the tree with the hidden values NN3 (chairman) and IN2 (of) in scope]
SLIDE 23 Local feature vectors
Rewrite global Φ as a sum over local φ
Φ(ti,j, h) = Σ_w φ(ti,j, w, hw) + Σ_{(u,v)} φ(ti,j, u, v, hu, hv), where (u, v) ranges over the dependencies in ti,j
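A minimal sketch of this factorization, assuming hypothetical callables phi_single and phi_pair for the local feature functions:

import numpy as np

def global_features(tree_deps, h, phi_single, phi_pair, dim):
    """Assemble Phi(t, h) from local vectors: one single-variable term per
    word and one pairwise term per dependency (u, v) in the tree."""
    total = np.zeros(dim)
    for w, hw in h.items():                   # sum over words: phi(t, w, h_w)
        total += phi_single(w, hw)
    for u, v in tree_deps:                    # sum over dependencies
        total += phi_pair(u, v, h[u], h[v])
    return total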
SLIDE 25 Applying belief propagation
New restrictions enable dynamic–programming approaches, e.g. belief propagation
BP generalizes the forward–backward algorithm from a chain to a tree
Runtime O(len(si) · H^2), H = max |Hw(ti,j)|
BP efficiently computes
Σ_h p(ti,j, h | si, Θ)
Σ_h p(h | ti,j, si, Θ) Φ(ti,j, h)
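A compact sketch of sum-product BP on the dependency tree, under the assumption that node and edge potentials are the exponentiated local scores exp(φ·Θ); all names are illustrative. Each edge costs O(H^2), giving the runtime above; the node marginals (together with edge marginals, computed analogously) are what the expected feature vectors in the gradient are assembled from.

import numpy as np

def tree_sum_product(root, children, node_pot, edge_pot):
    """Sum-product BP on a tree of hidden variables.
    node_pot[w]      : vector over Hw, exp(phi(t, w, a) . Theta)
    edge_pot[(w, c)] : matrix over Hw x Hc, exp(phi(t, w, c, a, b) . Theta)
    Returns (Z, marginals): Z sums the product of potentials over all joint
    assignments h, and marginals[w][a] = p(h_w = a | t, s, Theta)."""
    up, down, marginals = {}, {}, {}

    def upward(w):                          # leaves-to-root pass ("backward")
        msg = np.array(node_pot[w], dtype=float)
        for c in children.get(w, []):
            upward(c)
            msg *= edge_pot[(w, c)] @ up[c]     # sum out the child's value: O(H^2)
        up[w] = msg

    def downward(w):                        # root-to-leaves pass ("forward")
        incoming = {c: edge_pot[(w, c)] @ up[c] for c in children.get(w, [])}
        belief = node_pot[w] * down[w]
        for inc in incoming.values():
            belief = belief * inc
        marginals[w] = belief / belief.sum()
        for c in children.get(w, []):
            # Everything at w except c's own upward message (potentials are
            # strictly positive, so the division is safe).
            down[c] = edge_pot[(w, c)].T @ (belief / incoming[c])
            downward(c)

    upward(root)
    down[root] = np.ones(len(node_pot[root]))
    downward(root)
    return up[root].sum(), marginals

# Toy tree: word 0 governs words 1 and 2, each word has 2 hidden values.
children = {0: [1, 2]}
node_pot = {w: np.array([1.0, 2.0]) for w in range(3)}
edge_pot = {(0, 1): np.array([[1.0, 0.5], [0.5, 1.0]]),
            (0, 2): np.array([[2.0, 1.0], [1.0, 2.0]])}
Z, marg = tree_sum_product(0, children, node_pot, edge_pot)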
SLIDE 26
Overview of talk
Motivation
Design
  General form of the model
  Training and decoding efficiently
  Creating specific instantiations
Results
Discussion
Conclusion
SLIDE 27
Two areas for choice in the model
Definition of the hidden–value domains Hw(ti,j)
Definition of the feature vectors φ
SLIDE 28 Hidden–value domains
Lexical domains allow word refinement
[Figure: each word receives a domain of refined copies of itself, e.g. {chairman1, chairman2, chairman3} for chairman, {Mr.1, Mr.2, Mr.3} for Mr., and so on]
SLIDE 29 Hidden–value domains
Lexical domains allow word refinement
[Figure: the same lexical domains, with particular refined values highlighted]
SLIDE 30 Hidden–value domains
Part-of-speech domains allow word clustering
[Figure: each word's domain contains subdivided part-of-speech tags, e.g. {NN1, ..., NN5} for chairman, {VB1, ..., VB5} for is, and {NNP1, ..., NNP5} for the proper nouns]
SLIDE 31 Hidden–value domains
Part-of-speech domains allow word clustering
[Figure: the same part-of-speech domains, with particular subdivided tags highlighted]
SLIDE 32 Hidden–value domains
Part-of-speech domains allow word clustering
[Figure: the same part-of-speech domains; words assigned the same subdivided tag are effectively clustered together]
SLIDE 33 Hidden–value domains
Supersense domains model WordNet ontology
(Ciaramita and Johnson, 2003; Miller et al., 1993)
[Figure: content words receive supersense domains from WordNet, e.g. noun.person, and verb supersenses such as verb.stative, verb.social, and verb.possession for is]
SLIDE 34 Hidden–value domains
Supersense domains model WordNet ontology
(Ciaramita and Johnson, 2003; Miller et al., 1993)
[Figure: the same supersense domains, with particular values highlighted]
SLIDE 35 Hidden–value domains
Supersense domains model WordNet ontology
(Ciaramita and Johnson, 2003; Miller et al., 1993)
[Figure: the same supersense domains, another highlighting step]
SLIDE 36 Hidden–value domains
Supersense domains model WordNet ontology
(Ciaramita and Johnson, 2003; Miller et al., 1993)
[Figure: the same supersense domains, final highlighting step]
SLIDE 37
Hidden–value domains
Hidden–value domains that didn’t work well
Word clustering without part-of-speech subdivisions
WordNet hyper/hyponym ontology
Domains containing mixed values
SLIDE 38 Examples of features
The highest nonterminal headed by the word
[Figure: the parse tree with chairman's hidden value NN3 and its highest nonterminal NP(chairman) highlighted]
(NN3, Word=chairman, Highest Nonterminal=NP) ∈ φ(ti,j, chairman, NN3)
SLIDE 39 Examples of features
The governing rule
[Figure: the tree with the hidden values VB1 (is) and NN3 (chairman) and the governing rule VP → VB NP highlighted]
(VB1, NN3, Rule=VP → VB NP) ∈ φ(ti,j, is, chairman, VB1, NN3)
SLIDE 40
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
SLIDE 41
Experimental Setup
N-best lists generated by Collins parser, ni ≈ 30
Training set: WSJ sections 2–21
Development set: WSJ section 0
Test set: WSJ sections 22–24
SLIDE 42
Final test models
Two baseline models
The Collins (1999) base parser
The Collins (2000) reranker
Two mixed models
MIX combines clustering, refinement, and WordNet supersenses
MIX+ augments MIX with the features of the Collins (2000) reranker
SLIDE 43
Results on Sections 22–24
Model              LR     LP     F1
Collins parser     88.19  88.60  88.39
MIX                89.41  89.87  89.64
Collins reranker   89.46  90.07  89.76
MIX+               89.78  90.29  90.03
All comparisons except MIX vs. Collins reranker are significant at p ≤ 0.01 using the sign test
SLIDE 46
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
SLIDE 47
Previous work
Parsing approaches that use hidden variables
Riezler et al. (2002)
Clark and Curran (2004)
Matsuzaki et al. (2005)
Differences with our approach
Use of reranking
Definition of hidden variables
Use of belief propagation
SLIDE 48 Using packed representations
Candidates ti,j represented as a packed forest
Compact representation of many parse trees
Packed representation forces local scope
Features would become locally scoped on non-hidden information
Decoding becomes NP-hard, must approximate with Viterbi (cf. Matsuzaki et al., 2005):
argmax_{ti,j} p(ti,j | si, Θ) ≈ argmax_{ti,j, h} p(ti,j, h | si, Θ)
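A toy illustration (numbers made up) of why this is only an approximation: summing over hidden assignments and maximizing over them can prefer different candidates.

import numpy as np

# Hypothetical log-linear scores Phi(t, h) . Theta for two candidates,
# each with three hidden assignments.
scores = {"t1": np.array([2.0, 2.0, 2.0]),
          "t2": np.array([3.0, 0.0, 0.0])}

exact   = max(scores, key=lambda t: np.logaddexp.reduce(scores[t]))  # marginalize over h
viterbi = max(scores, key=lambda t: scores[t].max())                 # max over (t, h)
print(exact, viterbi)   # t1 vs t2: the approximation picks a different parse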
SLIDE 49
Empirical analysis of hidden values
The model makes hidden–value assignments on the basis of the reranking criterion
i.e. maximize log p(ti,1 | si, Θ)
The empirical distribution of assignments shows linguistically reasonable trends
SLIDE 50
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
SLIDE 51
Concluding remarks
The hidden–variable model defines a new representation for NLP structures
Conditional log–linear model with hidden variables
BP enables efficient and exact training and decoding
Significant improvement over Collins (2000)
SLIDE 52
Example
SLIDE 53
Example
I [will [give/VB2 an example]]
I expected [to [give/VB1 an example]]
I expected [to/TO4 [give an example]]
You expected [me [to/TO1,5 [give an example]]]