  1. Structured Prediction Models via the Matrix-Tree Theorem. Terry Koo (maestro@csail.mit.edu), Amir Globerson (gamir@csail.mit.edu), Xavier Carreras (carreras@csail.mit.edu), Michael Collins (mcollins@csail.mit.edu). MIT Computer Science and Artificial Intelligence Laboratory.

  2. Dependency parsing. Example: * John saw Mary. Syntactic structure is represented by head-modifier dependencies.

  3. Projective vs. non-projective structures. Example: * John saw a movie today that he liked. Non-projective structures allow crossing dependencies and are frequent in languages like Czech, Dutch, etc. Non-projective parsing can be cast as maximum-spanning-tree search (McDonald et al., 2005).

  4. Contributions of this work. Fundamental inference algorithms that sum over possible structures:

     Model type                  Inference algorithm
     HMM                         Forward-Backward
     Graphical model             Belief Propagation
     PCFG                        Inside-Outside
     Projective dep. trees       Inside-Outside
     Non-projective dep. trees   ??

     This talk: inside-outside-style algorithms for non-projective dependency structures. An application: training log-linear and max-margin parsers. Independently-developed work: Smith and Smith (2007), McDonald and Satta (2007).

  5. Overview: Background; Matrix-Tree-based inference; Experiments.

  6. Edge-factored structured prediction. Positions 0 1 2 3 correspond to * John saw Mary. A dependency tree y is a set of head-modifier dependencies (McDonald et al., 2005; Eisner, 1996). (h, m) is a dependency with feature vector f(x, h, m). Y(x) is the set of all possible trees for sentence x. Parsing selects the highest-scoring tree:

     y* = argmax_{y ∈ Y(x)} ∑_{(h,m) ∈ y} w · f(x, h, m)
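To make the edge-factored decomposition concrete, here is a minimal sketch (not the authors' code; the function name, feature vectors, and weights are made-up toy values): the score of a tree is simply the sum of the scores of its dependencies.

```python
# Illustrative sketch: an edge-factored score decomposes over the
# dependencies (h, m) in a tree y.  Features and weights are toy values.
import numpy as np

def score_tree(tree, feats, w):
    """tree: list of (head, modifier) pairs; feats[(h, m)]: feature vector f(x, h, m)."""
    return sum(np.dot(w, feats[(h, m)]) for (h, m) in tree)

# Toy example for "* John saw Mary" (word 0 is the root symbol *).
w = np.array([1.0, -0.5, 2.0])
feats = {(2, 1): np.array([1.0, 0.0, 1.0]),   # saw -> John
         (0, 2): np.array([0.0, 1.0, 1.0]),   # *   -> saw
         (2, 3): np.array([1.0, 1.0, 0.0])}   # saw -> Mary
y = [(0, 2), (2, 1), (2, 3)]
print(score_tree(y, feats, w))                # 5.0
```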

  7. Training log-linear dependency parsers. Given a training set {(x_i, y_i)}_{i=1}^{N}, minimize

     L(w) = (C/2) ||w||² − ∑_{i=1}^{N} log P(y_i | x_i; w)

  8. Training log-linear dependency parsers. Given a training set {(x_i, y_i)}_{i=1}^{N}, minimize

     L(w) = (C/2) ||w||² − ∑_{i=1}^{N} log P(y_i | x_i; w)

     The log-linear distribution over trees is

     P(y | x; w) = (1 / Z(x; w)) exp{ ∑_{(h,m) ∈ y} w · f(x, h, m) }

     Z(x; w) = ∑_{y ∈ Y(x)} exp{ ∑_{(h,m) ∈ y} w · f(x, h, m) }
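To make Z(x; w) concrete, here is a brute-force sketch (my own code, not the authors'; θ[h, m] stands in for w · f(x, h, m)): enumerate every head assignment for a tiny sentence, keep the ones that form a tree rooted at the wall symbol 0 (multiple attachments to 0 are allowed here, i.e. the multi-root setting discussed later), and sum their exponentiated scores.

```python
# Brute-force partition function for a tiny sentence, for illustration only.
import itertools
import numpy as np

def is_tree(heads):
    """heads[m] is the head of word m (words are 1..n, 0 is the root symbol)."""
    for m in heads:                      # every word must reach 0 without a cycle
        seen, cur = set(), m
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

def all_trees(n):
    for choice in itertools.product(*[[h for h in range(n + 1) if h != m]
                                      for m in range(1, n + 1)]):
        heads = dict(zip(range(1, n + 1), choice))
        if is_tree(heads):
            yield [(h, m) for m, h in heads.items()]

def partition_function(theta, n):
    """theta[h, m] plays the role of w . f(x, h, m); returns Z by explicit enumeration."""
    return sum(np.exp(sum(theta[h, m] for (h, m) in tree)) for tree in all_trees(n))

n = 3                                    # "* John saw Mary"
rng = np.random.default_rng(0)
theta = rng.normal(size=(n + 1, n + 1))  # random toy scores
print(partition_function(theta, n))
```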

  9. Training log-linear dependency parsers. Gradient-based optimizers evaluate L(w) and ∂L/∂w:

     L(w) = (C/2) ||w||² − ∑_{i=1}^{N} ∑_{(h,m) ∈ y_i} w · f(x_i, h, m) + ∑_{i=1}^{N} log Z(x_i; w)

     Main difficulty: computation of the partition functions.

  10. Training log-linear dependency parsers. Gradient-based optimizers evaluate L(w) and ∂L/∂w:

     ∂L/∂w = C w − ∑_{i=1}^{N} ∑_{(h,m) ∈ y_i} f(x_i, h, m) + ∑_{i=1}^{N} ∑_{h',m'} P(h' → m' | x_i; w) f(x_i, h', m')

     The marginals are edge-appearance probabilities:

     P(h → m | x; w) = ∑_{y ∈ Y(x): (h,m) ∈ y} P(y | x; w)
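As an illustration of how the marginals enter the gradient, here is a minimal sketch (not the authors' code; the function name, data layout, and toy inputs are mine), assuming the edge marginals P(h' → m' | x_i; w) have already been computed, e.g. by the Matrix-Tree inference described later in the talk.

```python
# Gradient of the regularized negative log-likelihood, given precomputed marginals.
import numpy as np

def loss_gradient(w, C, gold_trees, edge_feats, marginals):
    """gold_trees[i]: list of (h, m) pairs for sentence i;
    edge_feats[i][(h, m)]: feature vector f(x_i, h, m);
    marginals[i][(h, m)]: P(h -> m | x_i; w)."""
    grad = C * w
    for tree, feats, marg in zip(gold_trees, edge_feats, marginals):
        for (h, m) in tree:                     # subtract observed (gold) features
            grad = grad - feats[(h, m)]
        for (h, m), p in marg.items():          # add expected features under the model
            grad = grad + p * feats[(h, m)]
    return grad

# Toy call: a one-word "sentence" has a single possible tree {(0, 1)}, so its
# marginal is 1 and the data terms cancel, leaving only the regularizer C * w.
w = np.array([0.5])
print(loss_gradient(w, C=1.0,
                    gold_trees=[[(0, 1)]],
                    edge_feats=[{(0, 1): np.array([1.0])}],
                    marginals=[{(0, 1): 1.0}]))              # -> [0.5]
```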

  11. Generalized log-linear inference. Define a vector θ with a parameter θ_{h,m} for each dependency:

     P(y | x; θ) = (1 / Z(x; θ)) exp{ ∑_{(h,m) ∈ y} θ_{h,m} }

     Z(x; θ) = ∑_{y ∈ Y(x)} exp{ ∑_{(h,m) ∈ y} θ_{h,m} }

     P(h → m | x; θ) = (1 / Z(x; θ)) ∑_{y ∈ Y(x): (h,m) ∈ y} exp{ ∑_{(h',m') ∈ y} θ_{h',m'} }

     E.g., θ_{h,m} = w · f(x, h, m)

  12. Applications of log-linear inference. A generalized inference engine takes θ as input; different definitions of θ can be used for log-linear or max-margin training:

     w*_LL = argmin_w (C/2) ||w||² − ∑_{i=1}^{N} log P(y_i | x_i; w)

     w*_MM = argmin_w (C/2) ||w||² + ∑_{i=1}^{N} max_y (E_{i,y} − m_{i,y}(w))

     Exponentiated-gradient updates for max-margin models: Bartlett, Collins, Taskar and McAllester (2004); Globerson, Koo, Carreras and Collins (2007).

  13. Overview: Background; Matrix-Tree-based inference; Experiments.

  14. Single-root vs. multi-root structures (two analyses of * John saw Mary). Multi-root structures allow multiple edges from *; single-root structures have exactly one edge from *. Independent adaptations of the Matrix-Tree Theorem: Smith and Smith (2007), McDonald and Satta (2007).

  15. Matrix-Tree Theorem (Tutte, 1984). Given: (1) a directed graph G, (2) edge weights θ, (3) a node r in G, a matrix L^(r) can be constructed whose determinant is the sum of the weighted spanning trees of G rooted at r. [Figure: a small directed graph with numbered nodes and weighted edges.]

  16. Matrix-Tree Theorem (Tutte, 1984), example. For the graph in the figure, det(L^(1)) = exp{2 + 4} + exp{1 + 3}; each term is the weight of one spanning tree rooted at node 1.
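To make the theorem concrete, here is a small numerical check (my own sketch; the exact edge weights of the slide's figure are not recoverable from this transcript, so a random complete directed graph is used instead): the determinant of the weighted Laplacian with row and column r removed equals the total weight of all spanning trees rooted at r, computed here by explicit enumeration.

```python
# Numerical check of the directed Matrix-Tree Theorem on a random graph.
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, r = 4, 0                                   # nodes 0..3, root r = 0
W = np.exp(rng.normal(size=(N, N)))           # W[h, m] = weight of edge h -> m
np.fill_diagonal(W, 0.0)                      # no self-loops

# L[m, m] = total weight into m; L[h, m] = -W[h, m].  Deleting row r and
# column r and taking the determinant gives the sum of weighted spanning
# trees rooted at r.
L = -W.copy()
np.fill_diagonal(L, W.sum(axis=0))
minor = np.delete(np.delete(L, r, axis=0), r, axis=1)
det_value = np.linalg.det(minor)

# Brute force: every non-root node picks one incoming edge; keep the choices
# in which every node reaches r without a cycle.
def reaches_root(heads):
    for m in heads:
        seen, cur = set(), m
        while cur != r:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

non_root = [m for m in range(N) if m != r]
brute_force = 0.0
for choice in itertools.product(*[[h for h in range(N) if h != m] for m in non_root]):
    heads = dict(zip(non_root, choice))
    if reaches_root(heads):
        brute_force += np.prod([W[h, m] for m, h in heads.items()])

print(det_value, brute_force)                 # the two values agree
```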

  19. Multi-root partition function. For * John saw Mary (positions 0 1 2 3), with edge weights θ and root r = 0, det(L^(0)) is the non-projective multi-root partition function.

  20. Construction of L^(0). L^(0) has a simple construction:

     off-diagonal: L^(0)_{h,m} = −exp{θ_{h,m}}

     on-diagonal:  L^(0)_{m,m} = ∑_{h'=0, h'≠m}^{n} exp{θ_{h',m}}

     E.g., L^(0)_{3,3} sums the exponentiated weights of all edges pointing into word 3 (Mary) of * John saw Mary. The determinant of L^(0) can be evaluated in O(n³) time.
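A minimal sketch of this construction (function names and the toy θ are mine, not from the slides): build L^(0) entry by entry and take a determinant; with the same random θ as in the brute-force sketch above, the two values agree.

```python
# Multi-root partition function via det(L^(0)).
import numpy as np

def multi_root_laplacian(expw):
    """expw[h, m] = exp(theta[h, m]); words are 1..n, 0 is the root symbol *."""
    n = expw.shape[0] - 1
    L0 = np.zeros((n, n))
    for m in range(1, n + 1):
        for h in range(1, n + 1):
            if h != m:
                L0[h - 1, m - 1] = -expw[h, m]                        # off-diagonal
        L0[m - 1, m - 1] = sum(expw[h, m]
                               for h in range(n + 1) if h != m)       # on-diagonal
    return L0

def multi_root_log_partition(theta):
    # slogdet is used for numerical stability on longer sentences (my choice).
    return np.linalg.slogdet(multi_root_laplacian(np.exp(theta)))[1]

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 4))                  # same toy theta as the brute-force sketch
print(np.exp(multi_root_log_partition(theta)))   # agrees with the enumerated Z above
```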

  21. Single-root vs. multi-root structures (recap). Multi-root structures allow multiple edges from *; single-root structures have exactly one edge from *. Independent adaptations of the Matrix-Tree Theorem: Smith and Smith (2007), McDonald and Satta (2007).

  22. Single-root partition function. A naïve method: for * John saw Mary, take each word m in turn, exclude all root edges except (0, m) (first (0, 1), then (0, 2), then (0, 3)), and compute the determinant of the resulting matrix. Computing n determinants in this way requires O(n⁴) time; a sketch of this naïve computation follows below.
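A sketch of the naïve method (my own illustration, not the authors' code; the Laplacian construction is repeated from the earlier sketch so this snippet runs on its own): zero out every root edge except (0, m), take the multi-root determinant, and sum over m.

```python
# Naive O(n^4) single-root partition function: n multi-root determinants.
import numpy as np

def multi_root_laplacian(expw):
    """expw[h, m] = exp(theta[h, m]); words are 1..n, 0 is the root symbol *."""
    n = expw.shape[0] - 1
    L0 = np.zeros((n, n))
    for m in range(1, n + 1):
        for h in range(1, n + 1):
            if h != m:
                L0[h - 1, m - 1] = -expw[h, m]                        # off-diagonal
        L0[m - 1, m - 1] = sum(expw[h, m]
                               for h in range(n + 1) if h != m)       # on-diagonal
    return L0

def naive_single_root_partition(theta):
    n = theta.shape[0] - 1
    expw = np.exp(theta)
    total = 0.0
    for m in range(1, n + 1):                  # exclude all root edges except (0, m)
        restricted = expw.copy()
        restricted[0, :] = 0.0
        restricted[0, m] = expw[0, m]
        total += np.linalg.det(multi_root_laplacian(restricted))
    return total

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 4))                # same toy theta as the earlier sketches
print(naive_single_root_partition(theta))
```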

  26. Single-root partition function. An alternative matrix L̂ can be constructed such that det(L̂) is the single-root partition function:

     first row:                 L̂_{1,m} = exp{θ_{0,m}}

     other rows, on-diagonal:   L̂_{m,m} = ∑_{h'=1, h'≠m}^{n} exp{θ_{h',m}}

     other rows, off-diagonal:  L̂_{h,m} = −exp{θ_{h,m}}

     The single-root partition function therefore requires only O(n³) time.
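A minimal sketch of the L̂ construction (function names are mine); with the same toy θ as in the earlier sketches, det(L̂) reproduces the value obtained by the naïve n-determinant method, but with a single O(n³) determinant.

```python
# Single-root partition function via det(L-hat).
import numpy as np

def single_root_matrix(theta):
    n = theta.shape[0] - 1
    expw = np.exp(theta)
    Lhat = np.zeros((n, n))
    Lhat[0, :] = expw[0, 1:]                                 # first row: root edges
    for h in range(2, n + 1):                                # other rows, off-diagonal
        for m in range(1, n + 1):
            if h != m:
                Lhat[h - 1, m - 1] = -expw[h, m]
    for m in range(2, n + 1):                                # other rows, on-diagonal
        Lhat[m - 1, m - 1] = sum(expw[h, m]
                                 for h in range(1, n + 1) if h != m)
    return Lhat

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 4))                              # same toy theta as above
print(np.linalg.det(single_root_matrix(theta)))              # single-root Z(x; theta)
```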

  27. Non-projective marginals. The log-partition function generates the marginals:

     P(h → m | x; θ) = ∂ log Z(x; θ) / ∂θ_{h,m} = ∂ log det(L̂) / ∂θ_{h,m}

     ∂ log det(L̂) / ∂θ_{h,m} = ∑_{h',m'} [∂ log det(L̂) / ∂L̂_{h',m'}] · [∂L̂_{h',m'} / ∂θ_{h,m}]

     Derivative of the log-determinant: ∂ log det(L̂) / ∂L̂ = (L̂⁻¹)ᵀ

     Complexity is dominated by the matrix inverse: O(n³).
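A sketch of the marginal computation (my own illustration; the L̂ construction is repeated from the previous sketch so this snippet runs on its own): apply ∂ log det(L̂)/∂L̂ = (L̂⁻¹)ᵀ to the entries of L̂ that depend on each θ_{h,m}, and check the result against a hand-computed two-word example.

```python
# Edge marginals for the single-root model via one matrix inverse.
import numpy as np

def single_root_matrix(theta):
    n = theta.shape[0] - 1
    expw = np.exp(theta)
    Lhat = np.zeros((n, n))
    Lhat[0, :] = expw[0, 1:]                                 # first row: root edges
    for h in range(2, n + 1):                                # other rows, off-diagonal
        for m in range(1, n + 1):
            if h != m:
                Lhat[h - 1, m - 1] = -expw[h, m]
    for m in range(2, n + 1):                                # other rows, on-diagonal
        Lhat[m - 1, m - 1] = sum(expw[h, m]
                                 for h in range(1, n + 1) if h != m)
    return Lhat

def edge_marginals(theta):
    """P[h, m] = P(h -> m | x; theta) under the single-root model."""
    n = theta.shape[0] - 1
    expw = np.exp(theta)
    inv = np.linalg.inv(single_root_matrix(theta))           # O(n^3)
    P = np.zeros((n + 1, n + 1))
    for m in range(1, n + 1):
        P[0, m] = expw[0, m] * inv[m - 1, 0]                 # root edges live in row 1
        for h in range(1, n + 1):
            if h == m:
                continue
            p = expw[h, m] * (inv[m - 1, m - 1] if m >= 2 else 0.0)   # diagonal entry
            if h >= 2:
                p -= expw[h, m] * inv[m - 1, h - 1]                   # off-diagonal entry
            P[h, m] = p
    return P

# Hand check on "* John saw": the single-root trees are {0->1, 1->2} with weight
# 2*5 and {0->2, 2->1} with weight 3*7, so Z = 31.  Unused theta entries are 0.
theta = np.log(np.array([[1.0, 2.0, 3.0],
                         [1.0, 1.0, 5.0],
                         [1.0, 7.0, 1.0]]))
print(edge_marginals(theta)[0, 1], 2 * 5 / 31)               # these agree
print(edge_marginals(theta)[2, 1], 3 * 7 / 31)               # and so do these
```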

  28. Summary of non-projective inference. Partition function: matrix determinant, O(n³). Marginals: matrix inverse, O(n³). Single-root inference uses L̂; multi-root inference uses L^(0).

  29. Overview: Background; Matrix-Tree-based inference; Experiments.

  30. Log-linear and max-margin training.

     Log-linear training:  w*_LL = argmin_w (C/2) ||w||² − ∑_{i=1}^{N} log P(y_i | x_i; w)

     Max-margin training:  w*_MM = argmin_w (C/2) ||w||² + ∑_{i=1}^{N} max_y (E_{i,y} − m_{i,y}(w))

  31. Multilingual parsing experiments. Six languages from the CoNLL 2006 shared task. Training algorithms: averaged perceptron, log-linear models, max-margin models. Comparisons: projective vs. non-projective models; single-root vs. multi-root models.

  32. Multilingual parsing experiments: Dutch (4.93% crossing dependencies).

                   Projective training   Non-projective training
     Perceptron    77.17                 78.83
     Log-Linear    76.23                 79.55
     Max-Margin    76.53                 79.69

     Non-projective training helps on non-projective languages.

  33. Multilingual parsing experiments: Spanish (0.06% crossing dependencies).

                   Projective training   Non-projective training
     Perceptron    81.19                 80.02
     Log-Linear    81.75                 81.57
     Max-Margin    81.71                 81.93

     Non-projective training doesn't hurt on projective languages.

  34. Multilingual parsing experiments. Results across all 6 languages (Arabic, Dutch, Japanese, Slovene, Spanish, Turkish):

     Perceptron    79.05
     Log-Linear    79.71
     Max-Margin    79.82

     Log-linear and max-margin parsers improve over perceptron-trained parsers; the improvements are statistically significant (sign test).

  35. Summary. Inside-outside-style inference algorithms for non-projective structures, via an application of the Matrix-Tree Theorem; inference for both multi-root and single-root structures. Empirical results: non-projective training is good for non-projective languages; log-linear and max-margin parsers outperform perceptron parsers.

  36. Thanks! Thanks for listening!

  37. Thanks!
