Hidden-Variable Models for Discriminative Reranking


  1. Hidden-Variable Models for Discriminative Reranking. Terry Koo and Michael Collins, {maestro|mcollins}@csail.mit.edu

  2. Overview of reranking. The reranking approach: use a baseline model to get the N-best candidates, then "rerank" the candidates using a more complex model. Parse reranking: Collins (2000): 88.2% ⇒ 89.8%; Charniak and Johnson (2005): 89.7% ⇒ 91.0%; talk by Brooke Cowan in 7B: 83.6% ⇒ 85.1%. Also applied to MT (Och and Ney, 2002; Shen et al., 2004) and NL generation (Walker et al., 2001).
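
A minimal sketch of this pipeline in Python; the names `baseline_nbest` and `rerank_score` are hypothetical stand-ins for the baseline parser and the reranking model, not names from the talk:

```python
# Minimal reranking sketch: score every baseline candidate with the
# (more complex) reranking model and return the best one.

def rerank(sentence, baseline_nbest, rerank_score, n=50):
    """Use a baseline model to get the n-best candidates,
    then rerank them with a more complex model."""
    candidates = baseline_nbest(sentence, n)   # list of candidate parses
    return max(candidates, key=lambda t: rerank_score(sentence, t))
```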

  3. Representing NLP structures. Proper representation is critical to success. Hand-crafted feature vector representations: Φ(t) = {0, 1, 2, 0, 0, 3, 0, 1}. Features defined through kernels: K(t₁, t₂) = Φ(t₁) · Φ(t₂). This talk: a new approach using hidden variables.

  4. Two facets of lexical items. Different lexical items can have similar meanings, e.g. president and chairman; clustering: president, chairman ∈ NounCluster₄. A single lexical item can have different meanings, e.g. [river] bank vs. [financial] bank; refinement: bank₁, bank₂ ∈ bank. Model clusterings and refinements as hidden variables that support the reranking task.

  5. Highlights of the approach. Conditional log-linear model with hidden variables. Dynamic programming is used for training and decoding. Clustering and refinement are done automatically using a discriminative criterion.

  6. Overview of talk: Motivation; Design (general form of the model, training and decoding efficiently, creating specific instantiations); Results; Discussion; Conclusion.

  7. The parse reranking framework. Sentences s_i for 1 ≤ i ≤ n: s_1: Pierre Vinken, 61 years old, will join ...; s_2: Mr. Vinken is chairman of Elsevier N.V. ...; s_3: Big Board Chairman John Phelan said yesterday ... Each s_i has candidate parses t_{i,j} for 1 ≤ j ≤ n_i, where t_{i,1} is the best candidate parse for s_i.

  8. The parse reranking framework. Each candidate t_{i,j} has a phrase-structure tree and a dependency tree. [Figure: parse tree of "Mr. Vinken is chairman of Elsevier N.V."]

  10. Adding hidden variables. Hidden-value domains H_w(t_{i,j}) for 1 ≤ w ≤ len(s_i). [Figure: the same parse tree, with a domain of candidate hidden values attached to each word, e.g. {NN₁, NN₂, NN₃} for chairman.]

  11. Adding hidden variables. An assignment h ∈ H_1(t_{i,j}) × ... × H_{len(s_i)}(t_{i,j}) selects one hidden value for each word. [Figure: the same tree with one value chosen from each word's domain.]

  12. Marginalized probability model. Φ(t_{i,j}, h) produces a descriptive vector of feature occurrence counts, e.g. Φ₂(t_{i,j}, h) = Count(chairman has hidden value NN₁); Φ₁₃(t_{i,j}, h) = Count(NNP₂ is a direct object of VB₁); Φ₁₉(t_{i,j}, h) = Count(NN₁ coordinates with NN₂).
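
As a small illustration, Φ can be realized as a counter over feature templates; the tree interface (`words`, `dependencies`) and the templates below are assumptions chosen to mirror the slide's examples:

```python
from collections import Counter

def global_features(tree, h):
    """Sketch of Phi(t, h): a vector of feature occurrence counts,
    keyed by (template, value...) tuples. The tree interface and
    templates are illustrative assumptions, not the paper's."""
    phi = Counter()
    for w in tree.words:
        # e.g. ('word-value', 'chairman', 'NN1')
        phi[("word-value", w.form, h[w])] += 1
    for (u, v) in tree.dependencies:
        # e.g. ('head-dep-values', 'VB1', 'NNP2')
        phi[("head-dep-values", h[u], h[v])] += 1
    return phi
```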

  13. Marginalized probability model. Log-linear distribution over (t_{i,j}, h) with parameters Θ:

      p(t_{i,j}, h | s_i, Θ) = exp(Φ(t_{i,j}, h) · Θ) / Σ_{j′,h′} exp(Φ(t_{i,j′}, h′) · Θ)

  Marginalize over assignments h:

      p(t_{i,j} | s_i, Θ) = Σ_h p(t_{i,j}, h | s_i, Θ)
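
A brute-force sketch of both equations, enumerating every (candidate, assignment) pair explicitly; slides 16-25 show why this enumeration is intractable and how dynamic programming replaces it:

```python
import numpy as np

def candidate_probs(feature_vecs, theta):
    """feature_vecs[j][k]: numpy array Phi(t_{i,j}, h_k) for candidate j
    and assignment h_k. Returns p(t_{i,j} | s_i, theta) for every j."""
    # unnormalized scores exp(Phi . theta) per (candidate, assignment)
    scores = [np.exp(np.array([phi @ theta for phi in vecs]))
              for vecs in feature_vecs]
    Z = sum(s.sum() for s in scores)          # sum over j' and h'
    # marginalize over the assignments h of each candidate
    return np.array([s.sum() / Z for s in scores])
```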

  14. Optimizing the parameters. Define the loss as the negative log-likelihood:

      L(Θ) = −Σ_{i=1}^{n} log p(t_{i,1} | s_i, Θ)

  Minimize L(Θ) through gradient descent:

      ∂L/∂Θ = −Σ_i Σ_h p(h | t_{i,1}, s_i, Θ) Φ(t_{i,1}, h) + Σ_{i,j} p(t_{i,j} | s_i, Θ) Σ_h p(h | t_{i,j}, s_i, Θ) Φ(t_{i,j}, h)
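
In the same brute-force style, the gradient is the model's expected feature counts minus the expected counts under the gold parse, exactly as in the formula above; a sketch under the same explicit-enumeration assumption:

```python
import numpy as np

def neg_log_lik_grad(feature_vecs, theta):
    """feature_vecs[j]: list of numpy arrays Phi(t_{i,j}, h); index
    j = 0 is the gold parse t_{i,1}. Returns dL/dTheta for one sentence."""
    scores = [np.array([np.exp(phi @ theta) for phi in vecs])
              for vecs in feature_vecs]
    Z = sum(s.sum() for s in scores)
    grad = np.zeros_like(theta)
    # first term: expected features under p(h | t_{i,1}, s_i, theta)
    p_h_gold = scores[0] / scores[0].sum()
    for p, phi in zip(p_h_gold, feature_vecs[0]):
        grad -= p * phi
    # second term: expected features under p(t_{i,j}, h | s_i, theta)
    for s, vecs in zip(scores, feature_vecs):
        for sc, phi in zip(s, vecs):
            grad += (sc / Z) * phi
    return grad
```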

  15. Overview of talk: Motivation; Design (general form of the model, training and decoding efficiently, creating specific instantiations); Results; Discussion; Conclusion.

  16. Problems with efficiency. |H_1(t_{i,j}) × ... × H_{len(s_i)}(t_{i,j})| grows exponentially, so training the model is intractable:

      ∂L/∂Θ = −Σ_i Σ_h p(h | t_{i,1}, s_i, Θ) Φ(t_{i,1}, h) + Σ_{i,j} p(t_{i,j} | s_i, Θ) Σ_h p(h | t_{i,j}, s_i, Θ) Φ(t_{i,j}, h)

  Decoding the model is also intractable:

      p(t_{i,j} | s_i, Θ) = Σ_h p(t_{i,j}, h | s_i, Θ)

  18. Locality constraint on features. Features have pairwise local scope on hidden variables, but retain global scope on non-hidden information. Φ can be factored into local feature vectors, allowing dynamic programming.

  19. Local feature vectors. Define two kinds of local feature vector φ: single-variable vectors φ(t_{i,j}, w, h_w), which look at a single hidden variable, and pairwise vectors φ(t_{i,j}, u, v, h_u, h_v), which look at two hidden variables in a dependency relationship.

  20. Local feature vectors. Φ(t_{i,j}, h) looks at every hidden variable. [Figure: the parse tree with all words' hidden-value domains highlighted.]

  21. Local feature vectors. φ(t_{i,j}, chairman, NN₃) only sees NN₃. [Figure: the tree with only chairman's hidden value highlighted.]

  22. Local feature vectors. φ(t_{i,j}, chairman, of, NN₃, IN₂) sees NN₃ and IN₂. [Figure: the tree with the chairman → of dependency and its two hidden values highlighted.]

  23. Local feature vectors. Rewrite the global Φ as a sum over local φ:

      Φ(t_{i,j}, h) = Σ_{w ∈ t_{i,j}} φ(t_{i,j}, w, h_w) + Σ_{(u,v) ∈ D(t_{i,j})} φ(t_{i,j}, u, v, h_u, h_v)

  where D(t_{i,j}) is the set of head-dependent pairs in the tree.
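
The payoff of this factorization is that the score Φ(t, h) · Θ also decomposes into local terms, which is what dynamic programming exploits. A sketch, with `phi_single`, `phi_pair`, and the tree interface as hypothetical stand-ins for the paper's feature functions:

```python
def factored_score(tree, h, theta, phi_single, phi_pair):
    """Compute Phi(t, h) . theta as a sum of local scores; this
    additive structure enables the belief propagation that follows."""
    score = 0.0
    for w in tree.words:                          # single-variable terms
        score += phi_single(tree, w, h[w]) @ theta
    for (u, v) in tree.dependencies:              # pairwise terms
        score += phi_pair(tree, u, v, h[u], h[v]) @ theta
    return score
```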

  25. Applying belief propagation. The new restrictions enable dynamic-programming approaches, e.g. belief propagation (BP). BP generalizes the forward-backward algorithm from a chain to a tree. Runtime is O(len(s_i) · H²), where H = max_w |H_w(t_{i,j})|. BP efficiently computes Σ_h p(t_{i,j}, h | s_i, Θ) and Σ_h p(h | t_{i,j}, s_i, Θ) Φ(t_{i,j}, h).
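
A sketch of sum-product BP on the dependency tree, computing the partition function and per-word marginals. The potential tables are assumed precomputed from the local feature vectors, psi1[w][a] = exp(φ(t, w, a) · Θ) and psi2[(u, v)][a, b] = exp(φ(t, u, v, a, b) · Θ); the interface is illustrative, not the paper's implementation:

```python
import numpy as np

def bp_tree(children, root, psi1, psi2):
    """children[w]: dependents of word w; root: head of the sentence.
    psi1[w]: float array over H_w; psi2[(u, v)]: |H_u| x |H_v| array.
    Returns (Z, marginals), marginals[w][a] = p(h_w = a | t, s, theta).
    Runtime O(len(s) * H^2), as on the slide."""
    inner, up = {}, {}

    def inward(w):                        # leaves-to-root pass
        b = psi1[w].astype(float)
        for c in children[w]:
            inward(c)
            up[c] = psi2[(w, c)] @ inner[c]   # message c -> w, over h_w
            b = b * up[c]
        inner[w] = b

    inward(root)
    Z = inner[root].sum()                 # partition function

    down = {root: np.ones_like(inner[root])}
    stack = [root]
    while stack:                          # root-to-leaves pass
        w = stack.pop()
        for c in children[w]:
            # all factors at w except the message that came up from c
            excl = psi1[w] * down[w]
            for c2 in children[w]:
                if c2 is not c:
                    excl = excl * up[c2]
            down[c] = psi2[(w, c)].T @ excl   # message w -> c, over h_c
            stack.append(c)

    marginals = {w: inner[w] * down[w] / Z for w in inner}
    return Z, marginals
```

Single-variable feature expectations follow directly from these marginals; pairwise expectations need the analogous edge marginals, which are omitted here for brevity.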

  26. Overview of talk: Motivation; Design (general form of the model, training and decoding efficiently, creating specific instantiations); Results; Discussion; Conclusion.

  27. Two areas for choice in the model: the definition of the hidden-value domains H_w(t_{i,j}), and the definition of the feature vectors φ.

  28. Hidden-value domains. Lexical domains allow word refinement. [Figure: each word's domain contains refined copies of the word itself, e.g. chairman₁, chairman₂, chairman₃ for chairman.]

  30. Hidden-value domains. Part-of-speech domains allow word clustering. [Figure: each word's domain contains refined copies of its POS tag, e.g. NN₁ ... NN₅ for chairman, so words with the same tag share hidden values.]

  33. Hidden-value domains. Supersense domains model the WordNet ontology (Ciaramita and Johnson, 2003; Miller et al., 1993). [Figure: hidden values are refined WordNet supersenses, e.g. verb.stative₁..₃, verb.social₁..₃, and verb.possession₁..₃ for is, and noun.person₁..₃ for chairman; words like Mr. and of keep POS-based values.]
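
A sketch of how the three domain types could be constructed per word; the k values, the attribute names, and the supersense lookup are hypothetical placeholders, not the paper's exact construction:

```python
def lexical_domain(word, k=3):
    # word refinement: e.g. bank -> bank_1, bank_2, bank_3
    return [f"{word.form}_{i}" for i in range(1, k + 1)]

def pos_domain(word, k=5):
    # word clustering: all NN words share NN_1 ... NN_5
    return [f"{word.pos}_{i}" for i in range(1, k + 1)]

def supersense_domain(word, supersenses):
    # WordNet supersenses, e.g. verb.stative or noun.person; fall back
    # to the POS domain for words without supersense entries
    return supersenses.get(word.form) or pos_domain(word)
```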
