
Dynamic Feature Selection for Dependency Parsing, He He, Hal Daumé III, and Jason Eisner - PowerPoint PPT Presentation

Dynamic Feature Selection for Dependency Parsing. He He, Hal Daumé III, and Jason Eisner. EMNLP 2013, Seattle.

Structured Prediction in NLP: part-of-speech tagging (e.g. the tag sequence N N V Det N), parsing.


  1. + next feature group: 31 gray edges with unknown fate... 74 features per gray edge. [Each of slides 1-12 shows the example sentence "This time , the firms were ready ." with the root $, and a legend: undetermined (gray) edge, current 1-best tree, winner edge (permanently in 1-best tree), loser edge.]

  2. Non-projective decoding to find the new 1-best tree: 31 gray edges with unknown fate... 74 features per gray edge.

  3. Classifier picks winners among the blue edges: 28 gray edges with unknown fate... 74 features per gray edge.

  4. Remove losers in conflict with the winners: 8 gray edges with unknown fate... 74 features per gray edge.

  5. Remove losers in conflict with the winners (continued): 8 gray edges with unknown fate... 74 features per gray edge.

  6. + next feature group: 8 gray edges with unknown fate... 107 features per gray edge.

  7. Non-projective decoding to find the new 1-best tree: 8 gray edges with unknown fate... 107 features per gray edge.

  8. Classifier picks winners among the blue edges: 7 gray edges with unknown fate... 107 features per gray edge.

  9. Remove losers in conflict with the winners: 3 gray edges with unknown fate... 107 features per gray edge.

  10. Remove losers in conflict with the winners (continued): 3 gray edges with unknown fate... 107 features per gray edge.

  11. + last feature group: 3 gray edges with unknown fate... 268 features per gray edge.

  12. Projective decoding to find the final 1-best tree: 0 gray edges with unknown fate.
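
Slides 1-12 walk through the staged loop: add the next feature group to the surviving gray edges, decode, let a classifier lock in winners, and discard conflicting losers. The Python sketch below is only an illustration of that loop, not the authors' implementation (their system is built on MSTParser); `score_edge`, `decode_tree`, `classifier`, and `conflicts` are hypothetical callbacks standing in for edge scoring, (non-)projective decoding, the winner classifier, and the loser tests summarized on slide 18.

```python
def dynfs_parse(sentence, feature_groups, score_edge, decode_tree, classifier, conflicts):
    """Sketch of the staged pruning loop; all callable arguments are stand-ins."""
    n = len(sentence)
    # Every possible (head, modifier) edge starts out gray (undetermined);
    # position 0 is the artificial root $.
    gray = {(h, m) for h in range(n + 1) for m in range(1, n + 1) if h != m}
    winners, scores = set(), {}

    for k, group in enumerate(feature_groups):
        # "+ next feature group": score only the surviving gray edges.
        for edge in gray:
            scores[edge] = scores.get(edge, 0.0) + score_edge(edge, sentence, group)

        # Decode over winners + gray edges to get the current 1-best tree
        # (non-projective at intermediate stages, projective at the last one).
        # decode_tree is assumed to return the tree as a set of (head, mod) edges.
        last = (k == len(feature_groups) - 1)
        tree = decode_tree(winners | gray, scores, projective=last)
        if last:
            return tree

        # Classifier picks winners among the gray edges in the tree ("blue" edges).
        new_winners = {e for e in (tree & gray) if classifier(e, scores, k)}
        winners |= new_winners
        gray -= new_winners

        # Gray edges that conflict with a winner become losers and are dropped.
        gray = {e for e in gray if not conflicts(e, winners)}

    return decode_tree(winners | gray, scores, projective=True)
```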

  13. What Happens During the Average Parse? [Plot: fate of edges over the stages of an average parse.]

  14. What Happens During the Average Parse? Most edges win or lose early.

  15. What Happens During the Average Parse? Most edges win or lose early; some edges win late.

  16. What Happens During the Average Parse? Most edges win or lose early; some edges win late; later features are helpful.

  17. What Happens During the Average Parse? Most edges win or lose early; some edges win late; later features are helpful; the increase in runtime is linear.

  18. Summary: How Early Decisions Are Made
  ● Winners – will definitely appear in the 1-best tree
  ● Losers – have the same child as a winning edge; form a cycle with winning edges; cross a winning edge (optional); share the root ($) with a winning edge (optional)
  ● Undetermined – add the next feature group to the remaining gray edges
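
The four loser tests are easy to state in code. The sketch below is my own illustration of those conditions, not the authors' code; edges are (head, modifier) position pairs with 0 for the root $, and the last two tests are the optional ones from the slide.

```python
def is_loser(edge, winners):
    """True if a gray edge conflicts with the set of winning edges."""
    h, m = edge
    # Same child as a winning edge: that modifier already has its head.
    if any(wm == m for _, wm in winners):
        return True
    # Would form a cycle with winning edges: m already reaches h through winners.
    if dominates(m, h, winners):
        return True
    # Optional: crosses a winning edge.
    if any(crosses(edge, w) for w in winners):
        return True
    # Optional: shares the root $ with a winning edge (single-root constraint).
    if h == 0 and any(wh == 0 for wh, _ in winners):
        return True
    return False


def dominates(ancestor, node, winners):
    """True if `ancestor` reaches `node` by following winning edges downward."""
    children = {}
    for wh, wm in winners:
        children.setdefault(wh, []).append(wm)
    stack, seen = [ancestor], set()
    while stack:
        cur = stack.pop()
        if cur == node:
            return True
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(children.get(cur, []))
    return False


def crosses(e1, e2):
    """True if the two edges cross when drawn as arcs above the sentence."""
    (a, b), (c, d) = sorted(e1), sorted(e2)
    return a < c < b < d or c < a < d < b
```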

  19. Feature Template Ranking ● Forward selection

  20. Feature Template Ranking ● Forward selection: score each template alone. A 0.60, B 0.49, C 0.55.

  21. Feature Template Ranking ● Forward selection: A 0.60, B 0.49, C 0.55 → rank 1: A.

  22. Feature Template Ranking ● Forward selection: with A fixed, A&B 0.80, A&C 0.85.

  23. Feature Template Ranking ● Forward selection: A&B 0.80, A&C 0.85 → rank 2: C.

  24. Feature Template Ranking ● Forward selection: A&C&B 0.90 → rank 3: B. Final ranking: 1. A, 2. C, 3. B.

  25. Feature Template Ranking ● Grouping: ranked templates with cumulative accuracy.
      head cPOS + mod cPOS + in-between punct #   0.49
      in-between cPOS                             0.59
      ⋮
      head POS + mod POS + in-between conj #      0.71
      head POS + mod POS + in-between POS + dist  0.72
      ⋮
      head token + mod cPOS + dist                0.80
      ⋮

  26. Feature Template Ranking ● Grouping: a group boundary is drawn roughly every +0.1 of cumulative accuracy (e.g. after 0.59 and after 0.72 in the list above).
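
For reference, here is a small Python sketch of the forward-selection-and-grouping procedure above. It is an illustration that assumes a dev-set evaluation callback is available; the toy numbers at the end are the A/B/C accuracies from the slides.

```python
def forward_select(templates, evaluate):
    """Greedy forward selection: at each round, add the template whose addition
    gives the highest dev accuracy. `evaluate(list_of_templates) -> accuracy`
    is a hypothetical callback."""
    ranked, accuracies = [], []
    remaining = list(templates)
    while remaining:
        scored = {t: evaluate(ranked + [t]) for t in remaining}
        best = max(scored, key=scored.get)
        ranked.append(best)
        remaining.remove(best)
        accuracies.append(scored[best])
    return ranked, accuracies


def group_templates(ranked, accuracies, step=0.1):
    """Cut the ranked list into groups whenever cumulative accuracy has grown
    by roughly `step` (about 0.1 on the slides) since the previous cut."""
    groups, current = [], []
    last_cut = accuracies[0] if accuracies else 0.0
    for template, acc in zip(ranked, accuracies):
        current.append(template)
        if acc - last_cut >= step:
            groups.append(current)
            current, last_cut = [], acc
    if current:
        groups.append(current)
    return groups


# Toy accuracies from the slides; this reproduces the ranking A, C, B.
toy = {frozenset("A"): 0.60, frozenset("B"): 0.49, frozenset("C"): 0.55,
       frozenset("AB"): 0.80, frozenset("AC"): 0.85, frozenset("ABC"): 0.90}
ranked, accs = forward_select("ABC", lambda ts: toy[frozenset(ts)])
print(ranked, accs)   # ['A', 'C', 'B'] [0.6, 0.85, 0.9]
```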

  27. Partition Feature List Into Groups

  28. How to pick the winners?

  29. How to pick the winners? ● Learn a classifier

  30. How to pick the winners? ● Learn a classifier ● Features – currently added parsing features; meta-features (confidence of a prediction)

  31. How to pick the winners?
  ● Learn a classifier
  ● Features – currently added parsing features; meta-features (confidence of a prediction)
  ● Training examples – input: each blue edge in the current 1-best tree; output: is the edge in the gold tree? If so, we want it to win!
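
One plausible way to assemble those training examples, sketched in Python (my framing rather than the released code); `extract_features` is a hypothetical stand-in for the parsing and meta-features described on the next slides.

```python
def collect_examples(tree_edges, gray, gold_edges, extract_features, stage):
    """One training example per undetermined edge in the current 1-best tree:
    the label says whether that edge appears in the gold tree (i.e. should win)."""
    examples = []
    for edge in tree_edges:
        if edge not in gray:                 # already locked in, nothing to decide
            continue
        x = extract_features(edge, stage)    # parsing + meta features
        y = 1 if edge in gold_edges else 0   # gold edge -> we want it to win
        examples.append((x, y))
    return examples
```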

  32. Classifier Features
  ● Currently added parsing features
  ● Meta-features
  – scores of the edge (the–firms) at earlier stages, e.g. …, 0.5, 0.8, 0.85 (scores are normalized by the sigmoid function)
  – margins to the highest-scoring competing edge (illustrated with candidate scores 0.72, 0.65, 0.30, 0.23, 0.12)
  – index of the next feature group

  33. Classifier Features (same content as the previous slide)

  34. Classifier Features
  ● Currently added parsing features
  ● Meta-features ("dynamic features")
  – scores of the edge at earlier stages, normalized by the sigmoid function
  – margins to the highest-scoring competing edge
  – index of the next feature group
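
A minimal sketch of how these meta-features could be computed for one edge. This is my illustration; in particular, "competing edge" is taken here to mean another candidate head for the same modifier, which is an assumption.

```python
import math

def _sigmoid(x):
    # Clamp to avoid overflow for extreme raw scores.
    x = max(-50.0, min(50.0, x))
    return 1.0 / (1.0 + math.exp(-x))

def meta_features(edge, score_history, current_scores, next_group_index):
    """Meta-feature vector for one candidate edge.
    `score_history` holds the edge's raw scores after each stage so far;
    `current_scores` maps every surviving edge to its current score."""
    _head, mod = edge
    # Scores of the edge so far, squashed by the sigmoid (e.g. ..., 0.5, 0.8, 0.85).
    normalized = [_sigmoid(s) for s in score_history]
    # Margin to the highest-scoring competing edge (other heads for the same modifier).
    rivals = [s for (h, m), s in current_scores.items() if m == mod and (h, m) != edge]
    margin = current_scores[edge] - max(rivals) if rivals else 0.0
    # Index of the next feature group.
    return normalized + [margin, float(next_group_index)]
```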

  35. How To Train With Dynamic Features ● Training examples are not fixed in advance! ● Winners/losers from stages < k affect: – the set of edges to classify at stage k – the dynamic features of those edges at stage k ● Bad decisions can cause future errors

  36. How To Train With Dynamic Features ● Training examples are not fixed in advance! ● Winners/losers from stages < k affect: – the set of edges to classify at stage k – the dynamic features of those edges at stage k ● Bad decisions can cause future errors → Reinforcement / imitation learning ● Dataset Aggregation (DAgger) (Ross et al., 2011) – iterates between training and running a model – learns to recover from past mistakes
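
A schematic DAgger-style training loop, shown only to make the idea concrete (not the paper's exact recipe); `run_and_collect`, `oracle_action`, and `train_classifier` are hypothetical callbacks.

```python
def dagger_train(sentences, run_and_collect, oracle_action, train_classifier, n_iter=5):
    """Run the current policy, label every state it visits with the oracle's
    action, aggregate, and retrain (Ross et al., 2011)."""
    dataset = []
    policy = None                     # iteration 0: just follow the oracle
    for _ in range(n_iter):
        for sentence in sentences:
            # States (edge + context) actually reached under the current policy.
            for state in run_and_collect(policy, sentence):
                dataset.append((state, oracle_action(state)))
        # Retrain on everything collected so far -- the "aggregation" step.
        policy = train_classifier(dataset)
    return policy
```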

  37. Upper Bound of Our Performance ● "Labels" – gold edges always win – 96.47% UAS with 2.9% of the first-order features. [Figure: gold tree over the example sentence "This time , the firms were ready ." with root $.]

  38. How To Train Our Parser
  1. Train parsers (non-projective, projective) using all features
  2. Rank and group feature templates
  3. Iteratively train a classifier to decide winners/losers
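
Put together, the recipe might look like the rough sketch below. Every argument is a hypothetical stand-in, and `forward_select` / `group_templates` refer to the ranking sketches given earlier; this is not the authors' training script.

```python
def build_dynfs_parser(treebank, templates,
                       train_full_parser, evaluate_templates, train_classifier):
    """Rough end-to-end recipe for the three steps above."""
    # 1. Train parsers on all features (the same edge-factored model can be
    #    decoded projectively or non-projectively).
    full_model = train_full_parser(treebank, templates)

    # 2. Rank feature templates by forward selection and cut them into groups.
    ranked, accuracies = forward_select(templates, evaluate_templates)
    groups = group_templates(ranked, accuracies)

    # 3. Iteratively (DAgger-style) train the winner/loser classifier on the
    #    states the staged parser actually visits.
    classifier = train_classifier(treebank, full_model, groups)

    return full_model, groups, classifier
```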

  39. Experiment
  ● Data – Penn Treebank: English; CoNLL-X: Bulgarian, Chinese, German, Japanese, Portuguese, Swedish
  ● Parser – MSTParser (McDonald et al., 2006)
  ● Dynamically trained classifier – LibLinear (Fan et al., 2008)

  40. Dynamic Feature Selection Beats Static Forward Selection

  41. Dynamic Feature Selection Beats Static Forward Selection. [Plot comparing the static baseline ("always add the next feature group to all edges") with our method ("add features as needed").]

  42. Experiment: 1st-order. 2x to 6x speedup. [Bar chart: speedup of DynFS vs. Baseline on Chinese, German, Portuguese, Bulgarian, English, Japanese, Swedish.]

  43. Experiment: 1st-order. ~0.2% loss in accuracy. [Bar chart: relative accuracy of DynFS vs. Baseline on the seven languages.] relative accuracy = accuracy of the pruning parser / accuracy of the full parser

  44. Second-order Dependency Parsing
  ● Features depend on the siblings as well
  ● First-order: O(n²) substructures to score
  ● Second-order: O(n³) substructures to score; ~380 feature templates; ~96M features
  ● Decoding: still O(n³)
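
To make the O(n²) vs. O(n³) contrast concrete, here is a small counting sketch (an illustration only, not part of the parser). The second-order enumeration deliberately ignores the adjacency and same-side constraints of the actual sibling factorization, so it is an upper bound on the parts, but it shows the cubic growth.

```python
def first_order_parts(n):
    """All (head, modifier) arcs for an n-word sentence plus the root at 0: O(n^2)."""
    return [(h, m) for h in range(n + 1) for m in range(1, n + 1) if h != m]

def second_order_parts(n):
    """All (head, modifier, sibling) triples of distinct positions: O(n^3)."""
    return [(h, m, s)
            for h in range(n + 1)
            for m in range(1, n + 1)
            for s in range(1, n + 1)
            if len({h, m, s}) == 3]

# For n = 10: 100 first-order parts vs. 810 second-order parts.
print(len(first_order_parts(10)), len(second_order_parts(10)))
```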

  45. Experiment: 2nd-order. 2x to 8x speedup. [Bar chart: speedup of DynFS vs. Baseline on the seven languages.]

  46. Experiment: 2nd-order. ~0.3% loss in accuracy. [Bar chart: relative accuracy of DynFS vs. Baseline on the seven languages.]

  47. Ours vs. Vine Pruning (Rush and Petrov, 2012)
  ● Vine pruning: a very fast parser that speeds up using orthogonal techniques – start with short edges (fully scored), add in long edges if needed
  ● Ours – start with all edges (partially scored), quickly remove unneeded edges
  ● The two could be combined for further speedup!

  48. vs. Vine Pruning: 1st-order. Comparable performance. [Bar chart: speedup of DynFS, VineP, and Baseline on the seven languages.]

  49. vs. Vine Pruning: 1st-order. [Bar chart: relative accuracy of DynFS, VineP, and Baseline on the seven languages.]

  50. vs. Vine Pruning: 2nd-order. [Bar chart: speedup (up to ~16x) of DynFS, VineP, and Baseline on the seven languages.]

  51. vs. Vine Pruning: 2nd-order. [Bar chart: relative accuracy of DynFS, VineP, and Baseline on the seven languages.]

  52. Conclusion
  ● Feature computation is expensive in structured prediction
  ● Commitment should be made dynamically
  ● Early commitment to edges reduces both search and scoring time
  ● The approach can be used in other feature-rich models for structured prediction

  53. Backup Slides

  54. Static dictionary pruning (Rush and Petrov, 2012). Example dictionary entries for POS pairs and arc directions: VB CD: 18 (→), VB CD: 3 (←), NN VBG: 22 (→), NN VBG: 11 (←), ... [Figure: pruned edges over the example sentence.]

  55. Reinforcement Learning 101
  ● Markov Decision Process (MDP) – state: all the information helping us to make decisions; action: things we choose to do; reward: criteria for evaluating actions; policy: the "brain" that makes the decision
  ● Goal – maximize the expected future reward

  56. Policy Learning
  ● Markov Decision Process (MDP) – π(edge + context) = add / lock; reward = accuracy + λ∙speed
  ● Reinforcement learning – delayed reward; long time to converge
  ● Imitation learning – mimic the oracle; reduces to a supervised classification problem
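
The reward trade-off can be written down directly. The sketch below is an illustration only: how "speed" is measured and the weight `lam` are my assumptions, not values from the paper.

```python
def reward(predicted_edges, gold_edges, templates_used, templates_total, lam=0.5):
    """Accuracy plus a weighted speed term (accuracy + lam * speed)."""
    accuracy = len(predicted_edges & gold_edges) / len(gold_edges)
    # Treat the fraction of feature templates *not* computed as a speed proxy.
    speed = 1.0 - templates_used / templates_total
    return accuracy + lam * speed
```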

  57. Imitation Learning
  ● Oracle – (near-)optimal performance; generates the target action in any given state, e.g. π((the, firms) + context) = lock, π((time, ,) + context) = add, ... → a binary classifier
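
The oracle itself is a one-liner once the gold tree is known, which is what makes the reduction to binary classification work; a sketch of that idea:

```python
def oracle_policy(edge, gold_edges):
    """Lock an edge (commit to it as a winner) exactly when it belongs to the
    gold tree, otherwise keep adding features to it."""
    return "lock" if edge in gold_edges else "add"

# Usage: oracle_policy(some_gold_edge, gold) == "lock"; any other edge -> "add".
```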

  58. Dataset Aggregation (DAgger)
  ● Collecting data from the oracle only – different distribution at training and test time
  ● Iterative policy training
  ● Corrects the learner's mistakes
  ● Obtains a policy that performs well under its own induced state distribution

  59. Experiment (1st-order). [Bar chart: feature cost of DynFS per language (y-axis up to 45%).] cost = # feature templates used / total # feature templates on the statically pruned graph

  60. Experiment (2nd-order). [Bar chart: feature cost of DynFS per language (y-axis up to 80%).]
