  1. Structured Prediction Models via the Matrix-Tree Theorem. Terry Koo (maestro@csail.mit.edu), Amir Globerson (gamir@csail.mit.edu), Xavier Carreras (carreras@csail.mit.edu), Michael Collins (mcollins@csail.mit.edu). MIT Computer Science and Artificial Intelligence Laboratory.

  2. Dependency parsing. Example: * John saw Mary. Syntactic structure is represented by head-modifier dependencies.

  3. Projective vs. non-projective structures. Example: * John saw a movie today that he liked. Non-projective structures allow crossing dependencies and are frequent in languages like Czech, Dutch, etc. Non-projective parsing can be cast as maximum-spanning-tree search (McDonald et al., 2005).

  4. Contributions of this work. Fundamental inference algorithms that sum over possible structures:

     Model type                  Inference algorithm
     HMM                         Forward-Backward
     Graphical model             Belief Propagation
     PCFG                        Inside-Outside
     Projective dep. trees       Inside-Outside
     Non-projective dep. trees   ??

     This talk: inside-outside-style algorithms for non-projective dependency structures. An application: training log-linear and max-margin parsers. Independently-developed work: Smith and Smith (2007), McDonald and Satta (2007).

  5. Overview: Background; Matrix-Tree-based inference; Experiments.

  6. Edge-factored structured prediction. Positions 0 1 2 3 correspond to * John saw Mary. A dependency tree y is a set of head-modifier dependencies (McDonald et al., 2005; Eisner, 1996). (h, m) is a dependency with feature vector f(x, h, m). Y(x) is the set of all possible trees for sentence x. Parsing selects the highest-scoring tree:

     y* = argmax_{y ∈ Y(x)} ∑_{(h,m) ∈ y} w · f(x, h, m)
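To make the edge-factored decomposition concrete, here is a minimal sketch (not the authors' code; the function name, feature vectors, and weights are made-up toy values): the score of a tree is simply the sum of the scores of its dependencies.

```python
# Illustrative sketch: an edge-factored score decomposes over the
# dependencies (h, m) in a tree y.  Features and weights are toy values.
import numpy as np

def score_tree(tree, feats, w):
    """tree: list of (head, modifier) pairs; feats[(h, m)]: feature vector f(x, h, m)."""
    return sum(np.dot(w, feats[(h, m)]) for (h, m) in tree)

# Toy example for "* John saw Mary" (word 0 is the root symbol *).
w = np.array([1.0, -0.5, 2.0])
feats = {(2, 1): np.array([1.0, 0.0, 1.0]),   # saw -> John
         (0, 2): np.array([0.0, 1.0, 1.0]),   # *   -> saw
         (2, 3): np.array([1.0, 1.0, 0.0])}   # saw -> Mary
y = [(0, 2), (2, 1), (2, 3)]
print(score_tree(y, feats, w))                # 5.0
```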

  7. Training log-linear dependency parsers. Given a training set {(x_i, y_i)}_{i=1}^{N}, minimize

     L(w) = (C/2) ||w||² − ∑_{i=1}^{N} log P(y_i | x_i; w)

  8. Training log-linear dependency parsers. Given a training set {(x_i, y_i)}_{i=1}^{N}, minimize

     L(w) = (C/2) ||w||² − ∑_{i=1}^{N} log P(y_i | x_i; w)

     The log-linear distribution over trees is

     P(y | x; w) = (1 / Z(x; w)) exp{ ∑_{(h,m) ∈ y} w · f(x, h, m) }

     Z(x; w) = ∑_{y ∈ Y(x)} exp{ ∑_{(h,m) ∈ y} w · f(x, h, m) }
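To make Z(x; w) concrete, here is a brute-force sketch (my own code, not the authors'; θ[h, m] stands in for w · f(x, h, m)): enumerate every head assignment for a tiny sentence, keep the ones that form a tree rooted at the wall symbol 0 (multiple attachments to 0 are allowed here, i.e. the multi-root setting discussed later), and sum their exponentiated scores.

```python
# Brute-force partition function for a tiny sentence, for illustration only.
import itertools
import numpy as np

def is_tree(heads):
    """heads[m] is the head of word m (words are 1..n, 0 is the root symbol)."""
    for m in heads:                      # every word must reach 0 without a cycle
        seen, cur = set(), m
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

def all_trees(n):
    for choice in itertools.product(*[[h for h in range(n + 1) if h != m]
                                      for m in range(1, n + 1)]):
        heads = dict(zip(range(1, n + 1), choice))
        if is_tree(heads):
            yield [(h, m) for m, h in heads.items()]

def partition_function(theta, n):
    """theta[h, m] plays the role of w . f(x, h, m); returns Z by explicit enumeration."""
    return sum(np.exp(sum(theta[h, m] for (h, m) in tree)) for tree in all_trees(n))

n = 3                                    # "* John saw Mary"
rng = np.random.default_rng(0)
theta = rng.normal(size=(n + 1, n + 1))  # random toy scores
print(partition_function(theta, n))
```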

  9. Training log-linear dependency parsers. Gradient-based optimizers evaluate L(w) and ∂L/∂w:

     L(w) = (C/2) ||w||² − ∑_{i=1}^{N} ∑_{(h,m) ∈ y_i} w · f(x_i, h, m) + ∑_{i=1}^{N} log Z(x_i; w)

     Main difficulty: computation of the partition functions.

  10. Training log-linear dependency parsers. Gradient-based optimizers evaluate L(w) and ∂L/∂w:

     ∂L/∂w = C w − ∑_{i=1}^{N} ∑_{(h,m) ∈ y_i} f(x_i, h, m) + ∑_{i=1}^{N} ∑_{h',m'} P(h' → m' | x_i; w) f(x_i, h', m')

     The marginals are edge-appearance probabilities:

     P(h → m | x; w) = ∑_{y ∈ Y(x): (h,m) ∈ y} P(y | x; w)
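As an illustration of how the marginals enter the gradient, here is a minimal sketch (not the authors' code; the function name, data layout, and toy inputs are mine), assuming the edge marginals P(h' → m' | x_i; w) have already been computed, e.g. by the Matrix-Tree inference described later in the talk.

```python
# Gradient of the regularized negative log-likelihood, given precomputed marginals.
import numpy as np

def loss_gradient(w, C, gold_trees, edge_feats, marginals):
    """gold_trees[i]: list of (h, m) pairs for sentence i;
    edge_feats[i][(h, m)]: feature vector f(x_i, h, m);
    marginals[i][(h, m)]: P(h -> m | x_i; w)."""
    grad = C * w
    for tree, feats, marg in zip(gold_trees, edge_feats, marginals):
        for (h, m) in tree:                     # subtract observed (gold) features
            grad = grad - feats[(h, m)]
        for (h, m), p in marg.items():          # add expected features under the model
            grad = grad + p * feats[(h, m)]
    return grad

# Toy call: a one-word "sentence" has a single possible tree {(0, 1)}, so its
# marginal is 1 and the data terms cancel, leaving only the regularizer C * w.
w = np.array([0.5])
print(loss_gradient(w, C=1.0,
                    gold_trees=[[(0, 1)]],
                    edge_feats=[{(0, 1): np.array([1.0])}],
                    marginals=[{(0, 1): 1.0}]))              # -> [0.5]
```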

  11. Generalized log-linear inference. Define a vector θ with a parameter θ_{h,m} for each dependency:

     P(y | x; θ) = (1 / Z(x; θ)) exp{ ∑_{(h,m) ∈ y} θ_{h,m} }

     Z(x; θ) = ∑_{y ∈ Y(x)} exp{ ∑_{(h,m) ∈ y} θ_{h,m} }

     P(h → m | x; θ) = (1 / Z(x; θ)) ∑_{y ∈ Y(x): (h,m) ∈ y} exp{ ∑_{(h',m') ∈ y} θ_{h',m'} }

     E.g., θ_{h,m} = w · f(x, h, m)

  12. Applications of log-linear inference. A generalized inference engine takes θ as input; different definitions of θ can be used for log-linear or max-margin training:

     w*_LL = argmin_w (C/2) ||w||² − ∑_{i=1}^{N} log P(y_i | x_i; w)

     w*_MM = argmin_w (C/2) ||w||² + ∑_{i=1}^{N} max_y (E_{i,y} − m_{i,y}(w))

     Exponentiated-gradient updates for max-margin models: Bartlett, Collins, Taskar and McAllester (2004); Globerson, Koo, Carreras and Collins (2007).

  13. Overview: Background; Matrix-Tree-based inference; Experiments.

  14. Single-root vs. multi-root structures (two analyses of * John saw Mary). Multi-root structures allow multiple edges from *; single-root structures have exactly one edge from *. Independent adaptations of the Matrix-Tree Theorem: Smith and Smith (2007), McDonald and Satta (2007).

  15. Matrix-Tree Theorem (Tutte, 1984). Given: (1) a directed graph G, (2) edge weights θ, (3) a node r in G, a matrix L^(r) can be constructed whose determinant is the sum of the weighted spanning trees of G rooted at r. [Figure: a small directed graph with numbered nodes and weighted edges.]

  16. Matrix-Tree Theorem (Tutte, 1984), example. For the graph in the figure, det(L^(1)) = exp{2 + 4} + exp{1 + 3}; each term is the weight of one spanning tree rooted at node 1.
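To make the theorem concrete, here is a small numerical check (my own sketch; the exact edge weights of the slide's figure are not recoverable from this transcript, so a random complete directed graph is used instead): the determinant of the weighted Laplacian with row and column r removed equals the total weight of all spanning trees rooted at r, computed here by explicit enumeration.

```python
# Numerical check of the directed Matrix-Tree Theorem on a random graph.
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, r = 4, 0                                   # nodes 0..3, root r = 0
W = np.exp(rng.normal(size=(N, N)))           # W[h, m] = weight of edge h -> m
np.fill_diagonal(W, 0.0)                      # no self-loops

# L[m, m] = total weight into m; L[h, m] = -W[h, m].  Deleting row r and
# column r and taking the determinant gives the sum of weighted spanning
# trees rooted at r.
L = -W.copy()
np.fill_diagonal(L, W.sum(axis=0))
minor = np.delete(np.delete(L, r, axis=0), r, axis=1)
det_value = np.linalg.det(minor)

# Brute force: every non-root node picks one incoming edge; keep the choices
# in which every node reaches r without a cycle.
def reaches_root(heads):
    for m in heads:
        seen, cur = set(), m
        while cur != r:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

non_root = [m for m in range(N) if m != r]
brute_force = 0.0
for choice in itertools.product(*[[h for h in range(N) if h != m] for m in non_root]):
    heads = dict(zip(non_root, choice))
    if reaches_root(heads):
        brute_force += np.prod([W[h, m] for m, h in heads.items()])

print(det_value, brute_force)                 # the two values agree
```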

  19. Multi-root partition function. For * John saw Mary (positions 0 1 2 3), with edge weights θ and root r = 0, det(L^(0)) is the non-projective multi-root partition function.

  20. Construction of L^(0). L^(0) has a simple construction:

     off-diagonal: L^(0)_{h,m} = −exp{θ_{h,m}}

     on-diagonal:  L^(0)_{m,m} = ∑_{h'=0, h'≠m}^{n} exp{θ_{h',m}}

     E.g., L^(0)_{3,3} sums the exponentiated weights of all edges pointing into word 3 (Mary) of * John saw Mary. The determinant of L^(0) can be evaluated in O(n³) time.
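A minimal sketch of this construction (function names and the toy θ are mine, not from the slides): build L^(0) entry by entry and take a determinant; with the same random θ as in the brute-force sketch above, the two values agree.

```python
# Multi-root partition function via det(L^(0)).
import numpy as np

def multi_root_laplacian(expw):
    """expw[h, m] = exp(theta[h, m]); words are 1..n, 0 is the root symbol *."""
    n = expw.shape[0] - 1
    L0 = np.zeros((n, n))
    for m in range(1, n + 1):
        for h in range(1, n + 1):
            if h != m:
                L0[h - 1, m - 1] = -expw[h, m]                        # off-diagonal
        L0[m - 1, m - 1] = sum(expw[h, m]
                               for h in range(n + 1) if h != m)       # on-diagonal
    return L0

def multi_root_log_partition(theta):
    # slogdet is used for numerical stability on longer sentences (my choice).
    return np.linalg.slogdet(multi_root_laplacian(np.exp(theta)))[1]

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 4))                  # same toy theta as the brute-force sketch
print(np.exp(multi_root_log_partition(theta)))   # agrees with the enumerated Z above
```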

  21. Single-root vs. multi-root structures (recap). Multi-root structures allow multiple edges from *; single-root structures have exactly one edge from *. Independent adaptations of the Matrix-Tree Theorem: Smith and Smith (2007), McDonald and Satta (2007).

  22. Single-root partition function. A naïve method: for * John saw Mary, take each word m in turn, exclude all root edges except (0, m) (first (0, 1), then (0, 2), then (0, 3)), and compute the determinant of the resulting matrix. Computing n determinants in this way requires O(n⁴) time; a sketch of this naïve computation follows below.
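A sketch of the naïve method (my own illustration, not the authors' code; the Laplacian construction is repeated from the earlier sketch so this snippet runs on its own): zero out every root edge except (0, m), take the multi-root determinant, and sum over m.

```python
# Naive O(n^4) single-root partition function: n multi-root determinants.
import numpy as np

def multi_root_laplacian(expw):
    """expw[h, m] = exp(theta[h, m]); words are 1..n, 0 is the root symbol *."""
    n = expw.shape[0] - 1
    L0 = np.zeros((n, n))
    for m in range(1, n + 1):
        for h in range(1, n + 1):
            if h != m:
                L0[h - 1, m - 1] = -expw[h, m]                        # off-diagonal
        L0[m - 1, m - 1] = sum(expw[h, m]
                               for h in range(n + 1) if h != m)       # on-diagonal
    return L0

def naive_single_root_partition(theta):
    n = theta.shape[0] - 1
    expw = np.exp(theta)
    total = 0.0
    for m in range(1, n + 1):                  # exclude all root edges except (0, m)
        restricted = expw.copy()
        restricted[0, :] = 0.0
        restricted[0, m] = expw[0, m]
        total += np.linalg.det(multi_root_laplacian(restricted))
    return total

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 4))                # same toy theta as the earlier sketches
print(naive_single_root_partition(theta))
```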

  26. Single-root partition function. An alternative matrix L̂ can be constructed such that det(L̂) is the single-root partition function:

     first row:                 L̂_{1,m} = exp{θ_{0,m}}

     other rows, on-diagonal:   L̂_{m,m} = ∑_{h'=1, h'≠m}^{n} exp{θ_{h',m}}

     other rows, off-diagonal:  L̂_{h,m} = −exp{θ_{h,m}}

     The single-root partition function therefore requires only O(n³) time.
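A minimal sketch of the L̂ construction (function names are mine); with the same toy θ as in the earlier sketches, det(L̂) reproduces the value obtained by the naïve n-determinant method, but with a single O(n³) determinant.

```python
# Single-root partition function via det(L-hat).
import numpy as np

def single_root_matrix(theta):
    n = theta.shape[0] - 1
    expw = np.exp(theta)
    Lhat = np.zeros((n, n))
    Lhat[0, :] = expw[0, 1:]                                 # first row: root edges
    for h in range(2, n + 1):                                # other rows, off-diagonal
        for m in range(1, n + 1):
            if h != m:
                Lhat[h - 1, m - 1] = -expw[h, m]
    for m in range(2, n + 1):                                # other rows, on-diagonal
        Lhat[m - 1, m - 1] = sum(expw[h, m]
                                 for h in range(1, n + 1) if h != m)
    return Lhat

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 4))                              # same toy theta as above
print(np.linalg.det(single_root_matrix(theta)))              # single-root Z(x; theta)
```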

  27. Non-projective marginals. The log-partition function generates the marginals:

     P(h → m | x; θ) = ∂ log Z(x; θ) / ∂θ_{h,m} = ∂ log det(L̂) / ∂θ_{h,m}

     ∂ log det(L̂) / ∂θ_{h,m} = ∑_{h',m'} [∂ log det(L̂) / ∂L̂_{h',m'}] · [∂L̂_{h',m'} / ∂θ_{h,m}]

     Derivative of the log-determinant: ∂ log det(L̂) / ∂L̂ = (L̂⁻¹)ᵀ

     Complexity is dominated by the matrix inverse: O(n³).
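A sketch of the marginal computation (my own illustration; the L̂ construction is repeated from the previous sketch so this snippet runs on its own): apply ∂ log det(L̂)/∂L̂ = (L̂⁻¹)ᵀ to the entries of L̂ that depend on each θ_{h,m}, and check the result against a hand-computed two-word example.

```python
# Edge marginals for the single-root model via one matrix inverse.
import numpy as np

def single_root_matrix(theta):
    n = theta.shape[0] - 1
    expw = np.exp(theta)
    Lhat = np.zeros((n, n))
    Lhat[0, :] = expw[0, 1:]                                 # first row: root edges
    for h in range(2, n + 1):                                # other rows, off-diagonal
        for m in range(1, n + 1):
            if h != m:
                Lhat[h - 1, m - 1] = -expw[h, m]
    for m in range(2, n + 1):                                # other rows, on-diagonal
        Lhat[m - 1, m - 1] = sum(expw[h, m]
                                 for h in range(1, n + 1) if h != m)
    return Lhat

def edge_marginals(theta):
    """P[h, m] = P(h -> m | x; theta) under the single-root model."""
    n = theta.shape[0] - 1
    expw = np.exp(theta)
    inv = np.linalg.inv(single_root_matrix(theta))           # O(n^3)
    P = np.zeros((n + 1, n + 1))
    for m in range(1, n + 1):
        P[0, m] = expw[0, m] * inv[m - 1, 0]                 # root edges live in row 1
        for h in range(1, n + 1):
            if h == m:
                continue
            p = expw[h, m] * (inv[m - 1, m - 1] if m >= 2 else 0.0)   # diagonal entry
            if h >= 2:
                p -= expw[h, m] * inv[m - 1, h - 1]                   # off-diagonal entry
            P[h, m] = p
    return P

# Hand check on "* John saw": the single-root trees are {0->1, 1->2} with weight
# 2*5 and {0->2, 2->1} with weight 3*7, so Z = 31.  Unused theta entries are 0.
theta = np.log(np.array([[1.0, 2.0, 3.0],
                         [1.0, 1.0, 5.0],
                         [1.0, 7.0, 1.0]]))
print(edge_marginals(theta)[0, 1], 2 * 5 / 31)               # these agree
print(edge_marginals(theta)[2, 1], 3 * 7 / 31)               # and so do these
```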

  28. Summary of non-projective inference. Partition function: matrix determinant, O(n³). Marginals: matrix inverse, O(n³). Single-root inference uses L̂; multi-root inference uses L^(0).

  29. Overview: Background; Matrix-Tree-based inference; Experiments.

  30. Log-linear and max-margin training.

     Log-linear training:  w*_LL = argmin_w (C/2) ||w||² − ∑_{i=1}^{N} log P(y_i | x_i; w)

     Max-margin training:  w*_MM = argmin_w (C/2) ||w||² + ∑_{i=1}^{N} max_y (E_{i,y} − m_{i,y}(w))

  31. Multilingual parsing experiments. Six languages from the CoNLL 2006 shared task. Training algorithms: averaged perceptron, log-linear models, max-margin models. Comparisons: projective vs. non-projective models; single-root vs. multi-root models.

  32. Multilingual parsing experiments: Dutch (4.93% crossing dependencies).

                   Projective training   Non-projective training
     Perceptron    77.17                 78.83
     Log-Linear    76.23                 79.55
     Max-Margin    76.53                 79.69

     Non-projective training helps on non-projective languages.

  33. Multilingual parsing experiments: Spanish (0.06% crossing dependencies).

                   Projective training   Non-projective training
     Perceptron    81.19                 80.02
     Log-Linear    81.75                 81.57
     Max-Margin    81.71                 81.93

     Non-projective training doesn't hurt on projective languages.

  34. Multilingual parsing experiments. Results across all 6 languages (Arabic, Dutch, Japanese, Slovene, Spanish, Turkish):

     Perceptron    79.05
     Log-Linear    79.71
     Max-Margin    79.82

     Log-linear and max-margin parsers improve over perceptron-trained parsers; the improvements are statistically significant (sign test).

  35. Summary. Inside-outside-style inference algorithms for non-projective structures, via an application of the Matrix-Tree Theorem; inference for both multi-root and single-root structures. Empirical results: non-projective training is good for non-projective languages; log-linear and max-margin parsers outperform perceptron parsers.

  36. Thanks! Thanks for listening!

  37. Thanks!
