  1. Concavity and Initialization for Unsupervised Dependency Parsing. Kevin Gimpel and Noah A. Smith.

  2. Unsupervised learning in NLP: (typically) non-convex optimization.

  3. Dependency Model with Valence (Klein & Manning, 2004): EM with 50 random initializers. [Scatter plot: Attachment Accuracy (%) vs. Log-Likelihood (per sentence).]

  4. Dependency Model with Valence (Klein & Manning, 2004). [Same scatter plot; Pearson's r = 0.63 (strong correlation) between log-likelihood and attachment accuracy.]

  5. Dependency Model with Valence (Klein & Manning, 2004). [Same scatter plot; the range of attachment accuracy across initializers is 20%!]

  6. Dependency Model with Valence (Klein & Manning, 2004). [Same scatter plot, with the initializer from K&M04 highlighted.]

  7. How has this been addressed? Scaffolding / staged training (Brown et al., 1993; Elman, 1993; Spitkovsky et al., 2010); curriculum learning (Bengio et al., 2009); deterministic annealing (Smith & Eisner, 2004) and structural annealing (Smith & Eisner, 2006); continuation methods (Allgower & Georg, 1990).

  8. Example: word alignment. [Diagram: IBM Model 1 → HMM model → IBM Model 4 (Brown et al., 1993).]

  9. Example: word alignment. [Same diagram; IBM Model 1 is labeled CONCAVE.]

  10. Unsupervised learning in NLP: (typically) non-convex optimization.

  11. Unsupervised learning in NLP: (typically) non-convex optimization. Except IBM Model 1 for word alignment (which has a concave log-likelihood function).

  12. IBM Model 1 (Brown et al., 1993).

  13. IBM Model 1 (Brown et al., 1993). [Equation, with the alignment probability and translation probability factors labeled.]

  14. IBM Model 1 (Brown et al., 1993) and IBM Model 2. [Equations, with the alignment probability and translation probability factors labeled.]

  15. IBM Model 1 (Brown et al., 1993): CONCAVE. IBM Model 2: NOT CONCAVE.

  16. IBM Model 1 (Brown et al., 1993): CONCAVE. IBM Model 2: NOT CONCAVE, because of a product of parameters within the log-sum.

  17. IBM Model 1 (Brown et al., 1993): CONCAVE. IBM Model 2: NOT CONCAVE (product of parameters within the log-sum). For concavity: one parameter is permitted for each atomic piece of latent structure, and no atomic piece of latent structure can affect any other piece.
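
A sketch of the argument, in the standard notation for the IBM models (an assumption; not necessarily the notation used on the slides). For a target sentence f = f_1 ... f_m and source sentence e = e_0 ... e_l (with e_0 a null word), IBM Model 1's log-likelihood is

    \log p(f | e) = \mathrm{const} + \sum_{j=1}^{m} \log \sum_{i=0}^{l} t(f_j | e_i)

Each inner sum is linear in the translation parameters t(· | ·), and the log of a non-negative linear function is concave, so the whole objective is concave. IBM Model 2 scores the same alignment choice with two parameters:

    \log p(f | e) = \mathrm{const} + \sum_{j=1}^{m} \log \sum_{i=0}^{l} a(i | j, l, m) \, t(f_j | e_i)

The product a(· | ·) t(· | ·) inside the log-sum breaks concavity: one atomic piece of latent structure (an alignment link) is now governed by more than one parameter.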

  18. Unsupervised learning in NLP: (typically) non-convex optimization. Except IBM Model 1 for word alignment (which has a concave log-likelihood function). What models can we build without sacrificing concavity?

  19. For concavity: one parameter is permitted for each atomic piece of latent structure, and no atomic piece of latent structure can affect any other piece.

  20. For concavity: one parameter per atomic piece of latent structure, and no piece can affect any other. For dependency parsing, the atomic piece of latent structure is a single dependency arc.

  21. For concavity: one parameter per atomic piece of latent structure (a single dependency arc), and no piece can affect any other. Every dependency arc must be independent, so we can't use a tree constraint.

  22. For concavity: no dependency arc can affect any other (so no tree constraint), and only one parameter is allowed per dependency arc.

  23. For concavity: one parameter per atomic piece of latent structure (a single dependency arc), and no piece can affect any other. Our model: like IBM Model 1, but we generate the same sentence again, aligning words to the original sentence (cf. Brody, 2010).
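
The same recipe gives a sketch of the resulting objective (the arc-parameter symbol c(· | ·) is an assumption here, chosen by analogy with Model 1's t(· | ·)). For a sequence x_1 ... x_n with root symbol x_0 = $, the log-likelihood of the generated copy is

    \log p(x | x) = \mathrm{const} + \sum_{j=1}^{n} \log \sum_{i=0,\, i \neq j}^{n} c(x_j | x_i)

Each position j in the copy independently picks one parent position i in the original sentence (possibly the root $), and that choice is scored by a single parameter, so the objective stays concave exactly as in Model 1; the chosen alignments are read off as dependency arcs.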

  24. [Example: the sentence "$ Vikings came in longboats from Scandinavia in 1000 AD", shown twice: the original and the copy to be generated and aligned.]

  25.–28. [Figures: the alignment between the two copies of the example sentence is built up step by step.]

  29. [Same example.] Cycles, multiple roots, and non-projectivity are all permitted by this model.

  30.–31. [Same example.] Only one parameter per dependency arc.

  32. [Same example.] Only one parameter per dependency arc: we cannot look at other dependency arcs, but we can condition on (properties of) the sentence.

  33. [Same example.] We condition on direction ("Concave Model A").

  34. [Same example with gold POS tags: "$ NNPS VBD IN NNS IN NNP IN CD NN", shown aligned to itself.] Note: we've been using words in our examples, but in our model we follow standard practice and use gold POS tags. We condition on direction ("Concave Model A").
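
As a concrete illustration of how such a model can be trained, here is a minimal EM sketch for a Model-1-style self-alignment model over POS-tag sequences, with the single arc parameter conditioned on direction as in Concave Model A. The function and variable names, the uniform initialization, and the exact parameterization are assumptions for illustration, not the authors' implementation.

    from collections import defaultdict

    def train_concave_model_a(corpus, iterations=20):
        """EM for a Model-1-style self-alignment dependency model (sketch of "Concave Model A").

        corpus: list of gold POS-tag sequences, e.g. [["NNPS", "VBD", "IN", "NNS"], ...].
        Each position j aligns to one parent position i != j (position 0 is the root "$"),
        scored by a single parameter c[(child_tag, parent_tag, direction)].
        Because the log-likelihood is concave, EM reaches a global optimum
        from the uniform start used here.
        """
        c = None  # None = uniform scores before the first M-step

        def score(child, parent, direction):
            # Every triple queried here received positive mass in the first
            # (uniform) E-step, so the .get default is only a safeguard.
            return 1.0 if c is None else c.get((child, parent, direction), 0.0)

        for _ in range(iterations):
            counts = defaultdict(float)   # expected arc counts
            totals = defaultdict(float)   # normalizers per (parent, direction)
            for tags in corpus:
                sent = ["$"] + list(tags)
                for j in range(1, len(sent)):              # child position
                    cand = {}
                    for i in range(len(sent)):             # candidate parent position
                        if i != j:
                            d = "R" if i < j else "L"      # arc direction
                            cand[(i, d)] = score(sent[j], sent[i], d)
                    z = sum(cand.values())
                    for (i, d), s in cand.items():         # E-step: posterior over parents
                        counts[(sent[j], sent[i], d)] += s / z
                        totals[(sent[i], d)] += s / z
            # M-step: re-estimate p(child | parent, direction) from expected counts.
            c = {key: v / totals[(key[1], key[2])] for key, v in counts.items()}
        return c

The hard constraint of Concave Model B (introduced a few slides below) would amount to skipping the root candidate i == 0 in the inner loop whenever the child tag is not a verb tag.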

  35. We condition on direction ("Concave Model A"). Results:
      Model           | Initializer | Accuracy*
      Attach Right    | N/A         | 31.7
      DMV             | Uniform     | 17.6
      DMV             | K&M         | 32.9
      Concave Model A | Uniform     | 25.6
      *Penn Treebank test set, sentences of all lengths; WSJ10 used for training.

  36. Note: IBM Model 1 is not strictly concave (Toutanova & Galley, 2011). [Same example and results table as slide 35.]

  37. We can also use hard constraints while preserving concavity: the only tags that can align to $ are verbs (Mareček & Žabokrtský, 2011; Naseem et al., 2010) ("Concave Model B").

  38. Results:
      Model           | Initializer | Accuracy*
      Attach Right    | N/A         | 31.7
      DMV             | Uniform     | 17.6
      DMV             | K&M         | 32.9
      Concave Model A | Uniform     | 25.6
      Concave Model B | Uniform     | 28.6
      *Penn Treebank test set, sentences of all lengths; WSJ10 used for training.

  39. Unsupervised learning in NLP: (typically) non-convex optimization. Except IBM Model 1 for word alignment (which has a concave log-likelihood function). What models can we build without sacrificing concavity? Can these concave models be useful?

  40. As IBM Model 1 is used to initialize other word alignment models, we can use our concave models to initialize the DMV.

  41. As IBM Model 1 is used to initialize other word alignment models, we can use our concave models to initialize the DMV:
      Model | Initializer     | Accuracy*
      Attach Right | N/A      | 31.7
      DMV   | Uniform         | 17.6
      DMV   | K&M             | 32.9
      DMV   | Concave Model A | 34.4
      DMV   | Concave Model B | 43.0
      *Penn Treebank test set, sentences of all lengths; WSJ10 used for training.

  42. As IBM Model 1 is used to initialize other word alignment models, we can use our concave models to initialize the DMV. Comparison with prior work:
      Model                                                                               | Initializer     | Accuracy*
      DMV, trained on sentences of length ≤ 20                                            | Concave Model B | 53.1
      Shared Logistic Normal (Cohen & Smith, 2009)                                        | K&M             | 41.4
      Posterior Regularization (Gillenwater et al., 2010)                                 | K&M             | 53.3
      LexTSG-DMV (Blunsom & Cohn, 2010)                                                   | K&M             | 55.7
      Punctuation/UnsupTags (Spitkovsky et al., 2011), trained on sentences of length ≤ 45 | K&M'           | 59.1
      *Penn Treebank test set, sentences of all lengths.
