Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization


  1. Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization
     Shay Cohen, Noah Smith
     Carnegie Mellon University
     July 14, 2010

  2. Outline
     Hardness results for unsupervised learning of PCFGs:
       - Background and problem definition
       - Main hardness result
       - Extensions
       - Open problems
       - Conclusion

  3. Viterbi EM
     Let $p(x, z \mid \theta)$ be some parametrized statistical model.
     Viterbi EM identifies $\theta$ and $z$ given $x$.

  4. Viterbi EM
     Let $p(x, z \mid \theta)$ be some parametrized statistical model.
     Viterbi EM identifies $\theta$ and $z$ given $x$.
     Let $x_1, \ldots, x_n$ be the observed data.
     Algorithm (Viterbi EM):
       1. Start with some $\theta$.
       2. Set $z_i \leftarrow \operatorname{argmax}_{z_i} p(x_i, z_i \mid \theta)$ for each $i$ (“E-step”).
       3. Set $\theta \leftarrow \operatorname{argmax}_{\theta} \prod_{i=1}^{n} p(x_i, z_i \mid \theta)$ (“M-step”; the product is the likelihood).
       4. Go to step 2 unless converged.
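
A minimal Python sketch of the loop above, under simplifying assumptions that are not in the slides: each sentence comes with a small, explicitly enumerated list of candidate derivations (a real PCFG implementation would run a Viterbi/CKY parser in the E-step), a derivation is a list of rules `(lhs, rhs)`, and the M-step is the relative-frequency estimate over the chosen derivations. The names `viterbi_em` and `candidates` are hypothetical.

```python
from collections import Counter
from math import prod


def viterbi_em(candidates, theta, max_iters=50):
    """Hard-EM sketch. candidates[i] lists the candidate derivations of
    sentence i; theta maps each rule (lhs, rhs) to its probability."""
    theta = dict(theta)
    previous = None
    chosen = []
    for _ in range(max_iters):
        # E-step: for each sentence, pick the index of its most probable derivation.
        chosen = [
            max(range(len(derivs)), key=lambda j: prod(theta[r] for r in derivs[j]))
            for derivs in candidates
        ]
        if chosen == previous:  # the choices stopped changing: converged
            break
        previous = chosen
        # M-step: relative frequencies of rules in the chosen derivations,
        # normalized per left-hand side.
        counts = Counter(r for derivs, j in zip(candidates, chosen) for r in derivs[j])
        lhs_totals = Counter()
        for (lhs, _), n in counts.items():
            lhs_totals[lhs] += n
        # Rules whose left-hand side was never used keep their old value.
        theta = {
            r: counts[r] / lhs_totals[r[0]] if lhs_totals[r[0]] else theta[r]
            for r in theta
        }
    return theta, chosen
```

With enumerated candidates, each step can only keep or increase the product of the chosen derivations' probabilities, which is the coordinate-ascent behaviour the slides describe; nothing in the sketch guarantees a global maximum.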

  5. Viterbi EM
     Simple and useful algorithm. Recent examples include:
       - Machine translation (Brown et al., 2003)
       - Language acquisition (Goldwater and Johnson, 2005)
       - Coreference resolution (Choi and Cardie, 2007)
       - Question answering (Wang et al., 2007)
       - Grammar induction (Spitkovsky et al., 2010)
     We focus on Viterbi EM for PCFGs: $z_i$ is a parse tree, $x_i$ is a sentence, and $\theta$ are the rule probabilities.

  6. Viterbi training
     Viterbi EM is coordinate ascent, and it greedily tries to find:
     $(\theta, z_1, \ldots, z_n) = \operatorname{argmax}_{\theta, z_1, \ldots, z_n} \prod_{i=1}^{n} p(x_i, z_i \mid \theta)$
     We call this maximization problem “Viterbi training”.
     Viterbi EM finds a local maximum for Viterbi training.

  7. Viterbi training
     Viterbi EM is coordinate ascent, and it greedily tries to find:
     $(\theta, z_1, \ldots, z_n) = \operatorname{argmax}_{\theta, z_1, \ldots, z_n} \prod_{i=1}^{n} p(x_i, z_i \mid \theta)$
     We call this maximization problem “Viterbi training”.
     Viterbi EM finds a local maximum for Viterbi training.
     Main question: can we hope to optimize this objective function and find the global maximum?
     ... computational complexity answers this kind of question.

  8. Hardness of a problem
     We usually show that a problem A is hard by showing that another hard problem B could be solved if we could solve A.
     This is usually done for “decision problems” (the answer is 0 or 1).
     “Hardness” in this paper means NP-hardness: being able to solve the problem would let us solve every problem in the class NP.
     We convert every input $x$ of B to an input $x'$ of A such that $A(x') = 1 \iff B(x) = 1$.

  9. Optimization problem → decision problem
     Viterbi training optimizes an objective function. To convert it to a decision problem we define:
     Problem (Viterbi Train)
       Input: a context-free grammar $G$, sentences $x_1, \ldots, x_n$, and $\alpha \in [0, 1]$.
       Output: 1 if there are $\theta$ and derivation trees $z_1, \ldots, z_n$ such that $\prod_{i=1}^{n} p(x_i, z_i \mid \theta) \ge \alpha$, and 0 otherwise.
     Note that knowing how to optimize the likelihood means we can solve this decision problem.
     Viterbi Train is in NP (witness: parse trees and parameters).

  10. 3-SAT
     We show that Viterbi Train is NP hard by showing that there is a reduction from 3-SAT (an NP hard problem) to Viterbi Train.
     Problem (3-SAT)
       Input: a formula $\phi = \bigwedge_{i=1}^{m} (a_i \lor b_i \lor c_i)$ in conjunctive normal form, such that each clause has 3 literals.
       Output: 1 if there is a satisfying assignment for $\phi$, and 0 otherwise.
     For example, if we have the formula $\phi = (a \lor b \lor c) \land (\neg a \lor b \lor c)$, then a satisfying assignment is $a = 0$, $b = 0$, $c = 1$.
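
As an illustration of the decision problem, here is a brute-force checker in Python. It is only a sketch for tiny formulas (it tries all $2^k$ assignments for $k$ variables), and the encoding of a literal as a pair `(name, is_positive)` as well as the function name are assumptions made for this example.

```python
from itertools import product


def satisfying_assignment(clauses):
    """Return some satisfying assignment as a dict, or None if there is none.
    A clause is a list of literals; a literal is (name, is_positive)."""
    variables = sorted({name for clause in clauses for name, _ in clause})
    for values in product([0, 1], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # A positive literal is true when its variable is 1,
        # a negated literal when its variable is 0.
        if all(
            any(assignment[name] == int(positive) for name, positive in clause)
            for clause in clauses
        ):
            return assignment
    return None


# The example formula from the slide: (a or b or c) and (not a or b or c).
phi = [
    [("a", True), ("b", True), ("c", True)],
    [("a", False), ("b", True), ("c", True)],
]
print(satisfying_assignment(phi))  # {'a': 0, 'b': 0, 'c': 1}
```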

  11. 3-SAT and reductions
     We map every instance of 3-SAT (a formula $\phi$) to a grammar $G$ and a string $x$ such that
     $\max_{z, \theta} p(x, z \mid \theta) = 1$
     if and only if there is a satisfying assignment for the formula.
     The maximizing $z$ and $\theta$ will contain a description of the assignment.
     Since 3-SAT is NP hard, Viterbi Train is NP hard.

  12. The reduction (an example)
     Let $\phi = \underbrace{(a \lor \neg b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b \lor c)}_{C_2} \land \underbrace{(d \lor \neg c \lor a)}_{C_3}$
     We create the following context-free grammar:
     Terminal symbols: $\Sigma = \{0, 1\}$
     For the variables $a, b, c, d$ we create the assignment rules:
       $V_a \to 0$, $V_a \to 1$, $V_{\neg a} \to 0$, $V_{\neg a} \to 1$
       $V_b \to 0$, $V_b \to 1$, $V_{\neg b} \to 0$, $V_{\neg b} \to 1$
       $V_c \to 0$, $V_c \to 1$, $V_{\neg c} \to 0$, $V_{\neg c} \to 1$
       $V_d \to 0$, $V_d \to 1$, $V_{\neg d} \to 0$, $V_{\neg d} \to 1$

  13. The reduction (an example)
     $\phi = \underbrace{(a \lor \neg b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b \lor c)}_{C_2} \land \underbrace{(d \lor \neg c \lor a)}_{C_3}$
     We have so far: $V_\bullet \to 0 \mid 1$ and $V_{\neg\bullet} \to 0 \mid 1$ (assignment rules).
     For the variables $a, b, c, d$ we create the consistency rules:
       $U_{a,1} \to V_a\, V_{\neg a}$, $U_{a,0} \to V_{\neg a}\, V_a$
       $U_{b,1} \to V_b\, V_{\neg b}$, $U_{b,0} \to V_{\neg b}\, V_b$
       $U_{c,1} \to V_c\, V_{\neg c}$, $U_{c,0} \to V_{\neg c}\, V_c$
       $U_{d,1} \to V_d\, V_{\neg d}$, $U_{d,0} \to V_{\neg d}\, V_d$

  14. The reduction (an example)
     $\phi = \underbrace{(a \lor \neg b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b \lor c)}_{C_2} \land \underbrace{(d \lor \neg c \lor a)}_{C_3}$
     We have so far: assignment rules, and $U_{\bullet,1} \to V_\bullet\, V_{\neg\bullet}$ and $U_{\bullet,0} \to V_{\neg\bullet}\, V_\bullet$ (consistency rules).
     For the clauses $C_1$, $C_2$ and $C_3$ we create the clause rules:
       $S_1 \to C_1$
       $S_2 \to S_1\, C_2$
       $S_3 \to S_2\, C_3$
       $S \to S_3$
     $S$ is the start symbol of the grammar.

  15. The reduction (an example)
     $\phi = \underbrace{(a \lor \neg b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b \lor c)}_{C_2} \land \underbrace{(d \lor \neg c \lor a)}_{C_3}$
     We have so far: assignment rules, consistency rules and clause rules.
     For the clause $C_1$, for example, we create the satisfaction rules:
       $C_1 \to U_{a,1}\, U_{b,1}\, U_{c,1}$
       $C_1 \to U_{a,0}\, U_{b,1}\, U_{c,1}$
       $C_1 \to U_{a,1}\, U_{b,0}\, U_{c,1}$
       $C_1 \to U_{a,1}\, U_{b,1}\, U_{c,0}$
       $C_1 \to U_{a,0}\, U_{b,0}\, U_{c,1}$
       $C_1 \to U_{a,1}\, U_{b,0}\, U_{c,0}$
       $C_1 \to U_{a,0}\, U_{b,0}\, U_{c,0}$

  16. The reduction (an example)
     $\phi = \underbrace{(a \lor \neg b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b \lor c)}_{C_2} \land \underbrace{(d \lor \neg c \lor a)}_{C_3}$
     We have so far: assignment rules, consistency rules, clause rules and satisfaction rules. That’s the complete grammar!
     We need to decide on the string to parse, $x$.
     Set $x = \underbrace{101010}_{C_1}\ \underbrace{101010}_{C_2}\ \underbrace{101010}_{C_3}$
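
The construction on slides 12 to 16 can be sketched in Python as follows. This is an illustration rather than the authors' code: the symbol spellings (`V_a`, `V_not_a`, `U_a,1`, `C1`, `S1`), the `(name, is_positive)` literal encoding and the function name `build_reduction` are assumptions, and each clause is assumed to mention three distinct variables.

```python
from itertools import product


def build_reduction(clauses):
    """Return (rules, x): rules are (lhs, rhs_tuple) pairs, x is the string to parse."""
    rules = []
    variables = sorted({name for clause in clauses for name, _ in clause})
    for v in variables:
        # Assignment rules: V_v -> 0 | 1 and V_not_v -> 0 | 1.
        for nt in (f"V_{v}", f"V_not_{v}"):
            rules += [(nt, ("0",)), (nt, ("1",))]
        # Consistency rules: both are meant to derive the substring "10".
        rules.append((f"U_{v},1", (f"V_{v}", f"V_not_{v}")))
        rules.append((f"U_{v},0", (f"V_not_{v}", f"V_{v}")))
    for k, clause in enumerate(clauses, start=1):
        # Clause rules: S_1 -> C_1, then S_k -> S_{k-1} C_k.
        prefix = () if k == 1 else (f"S{k - 1}",)
        rules.append((f"S{k}", prefix + (f"C{k}",)))
        # Satisfaction rules: one per value triple that makes the clause true.
        for values in product([1, 0], repeat=3):
            if any(val == int(pos) for (_, pos), val in zip(clause, values)):
                rhs = tuple(f"U_{name},{val}" for (name, _), val in zip(clause, values))
                rules.append((f"C{k}", rhs))
    rules.append(("S", (f"S{len(clauses)}",)))
    return rules, "101010" * len(clauses)
```

For the three-clause example formula this yields 16 assignment rules, 8 consistency rules, 4 clause rules and 7 satisfaction rules per clause, so the grammar size is linear in the size of the formula.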

  17. The reduction (an example)
     $\phi = \underbrace{(a \lor \neg b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b \lor c)}_{C_2} \land \underbrace{(d \lor \neg c \lor a)}_{C_3}$
     $x = \underbrace{101010}_{C_1}\ \underbrace{101010}_{C_2}\ \underbrace{101010}_{C_3}$
     We can use a parse for $x$ to extract an assignment for the variables.

  18. Extracting an assignment
     $\phi = \underbrace{(a \lor \neg b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b \lor c)}_{C_2} \land \underbrace{(d \lor \neg c \lor a)}_{C_3}$
     Parse tree (top of the tree, with the $C_3$ subtree expanded; the leaves spell the $C_3$ block $101010$ of $x$):
       $S_3$
         (rest of tree)
         $C_3$
           $U_{d,0}$: $V_{\neg d} \to 1$, $V_d \to 0$
           $U_{c,0}$: $V_{\neg c} \to 1$, $V_c \to 0$
           $U_{a,1}$: $V_a \to 1$, $V_{\neg a} \to 0$
     If we use the rule $V_a \to 0$, set the variable $a$ to 0.
     If we use the rule $V_a \to 1$, set the variable $a$ to 1.
     Same for the other variables.
     Note that we use $V_a \to \bullet$ and $V_{\neg a} \to \bullet$ together.
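
The extraction rule can be sketched in Python as below. The input format (a flat list of the assignment rules used in a parse, written as `(nonterminal, terminal)` pairs with the hypothetical spellings `V_a` and `V_not_a`) is an assumption made for this example; the explicit error on conflicting rules is also an addition, anticipating the consistency question raised on the next slides.

```python
def extract_assignment(used_rules):
    """Read variable values off the V_x -> 0/1 rules used in a parse.
    Raises ValueError if both V_x -> 0 and V_x -> 1 were used."""
    assignment = {}
    for lhs, terminal in used_rules:
        if lhs.startswith("V_not_"):
            continue  # values are read from V_x; V_not_x accompanies it
        var = lhs[len("V_"):]
        value = int(terminal)
        # setdefault stores the first value seen and returns the stored one.
        if assignment.setdefault(var, value) != value:
            raise ValueError(f"inconsistent assignment for variable {var!r}")
    return assignment


# Assignment rules in the C_3 subtree of the example parse.
c3_rules = [
    ("V_not_d", "1"), ("V_d", "0"),
    ("V_not_c", "1"), ("V_c", "0"),
    ("V_a", "1"), ("V_not_a", "0"),
]
print(extract_assignment(c3_rules))  # {'d': 0, 'c': 0, 'a': 1}
```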

  19. Consistent assignments
     $\phi = \underbrace{(a \lor \neg b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b \lor c)}_{C_2} \land \underbrace{(d \lor \neg c \lor a)}_{C_3}$
     But! What if we use both $V_a \to 0$ and $V_a \to 1$?

  20. Consistent assignments
     $\phi = \underbrace{(a \lor \neg b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b \lor c)}_{C_2} \land \underbrace{(d \lor \neg c \lor a)}_{C_3}$
     But! What if we use both $V_a \to 0$ and $V_a \to 1$?
     Lemma: Let $\theta$ be weights for the grammar we constructed. If the (multiplicative) weight of the Viterbi parse of $\underbrace{101010}_{C_1}\ \underbrace{101010}_{C_2}\ \underbrace{101010}_{C_3}$ is 1, then the assignment extracted from the parse tree is consistent.

  21. Finding a satisfying assignment
     $\phi = \underbrace{(a \lor \neg b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b \lor c)}_{C_2} \land \underbrace{(d \lor \neg c \lor a)}_{C_3}$
     Lemma: There exists $\theta$ such that the weight of the Viterbi parse of $\underbrace{101010}_{C_1}\ \underbrace{101010}_{C_2}\ \underbrace{101010}_{C_3}$ is 1 if and only if $\phi$ is satisfiable. The satisfying assignment is the one extracted from the parse tree with weight 1.

  22. NP hardness result
     Problem (Viterbi Train)
       Input: a context-free grammar $G$, sentences $x_1, \ldots, x_n$, and $\alpha \in [0, 1]$.
       Output: 1 if there are $\theta$ and derivation trees $z_1, \ldots, z_n$ such that $\prod_{i=1}^{n} p(x_i, z_i \mid \theta) \ge \alpha$, and 0 otherwise.
     Corollary: Viterbi Train is NP hard.
     In fact, we have NP completeness (Viterbi Train is in NP).

  23. Approximate solutions
     Reminder: Viterbi Train tries to maximize
     $\max_{\theta, z_1, \ldots, z_n} \prod_{i=1}^{n} p(x_i, z_i \mid \theta)$
     We know it is hard to find the exact maximum. Can we hope to approximate the maximal solution?

  24. Approximate solutions
     The question we ask is: “Is there a $\rho \in (0, 1]$ such that there is an efficient algorithm which returns $z'_1, \ldots, z'_n$ and $\theta'$ such that
     $\prod_{i=1}^{n} p(x_i, z'_i \mid \theta') \ge \rho \max_{\theta, z_1, \ldots, z_n} \prod_{i=1}^{n} p(x_i, z_i \mid \theta)$
     for any input sentences $x_1, \ldots, x_n$ and a grammar $G$?”
