  1. Concavity and Initialization for Unsupervised Dependency Parsing. Kevin Gimpel and Noah A. Smith.

  2. Unsupervised learning in NLP: (typically) non-convex optimization.

  3. Dependency Model with Valence (Klein & Manning, 2004): EM with 50 random initializers. [Scatter plot: Attachment Accuracy (%) vs. Log-Likelihood (per sentence).]

  4. Dependency Model with Valence (Klein & Manning, 2004). [Same scatter plot; Pearson's r = 0.63 (strong correlation) between log-likelihood and attachment accuracy.]

  5. Dependency Model with Valence (Klein & Manning, 2004). [Same scatter plot; the range of attachment accuracy across initializers is 20%!]

  6. Dependency Model with Valence (Klein & Manning, 2004). [Same scatter plot, with the initializer from K&M04 highlighted.]

  7. How has this been addressed? Scaffolding / staged training (Brown et al., 1993; Elman, 1993; Spitkovsky et al., 2010); curriculum learning (Bengio et al., 2009); deterministic annealing (Smith & Eisner, 2004) and structural annealing (Smith & Eisner, 2006); continuation methods (Allgower & Georg, 1990).

  8. Example: word alignment. [Diagram: IBM Model 1 → HMM model → IBM Model 4 (Brown et al., 1993).]

  9. Example: word alignment. [Same diagram; IBM Model 1 is labeled CONCAVE.]

  10. Unsupervised learning in NLP: (typically) non-convex optimization.

  11. Unsupervised learning in NLP: (typically) non-convex optimization. Except IBM Model 1 for word alignment (which has a concave log-likelihood function).

  12. IBM Model 1 (Brown et al., 1993).

  13. IBM Model 1 (Brown et al., 1993). [Equation, with the alignment probability and translation probability factors labeled.]

  14. IBM Model 1 (Brown et al., 1993) and IBM Model 2. [Equations, with the alignment probability and translation probability factors labeled.]

  15. IBM Model 1 (Brown et al., 1993): CONCAVE. IBM Model 2: NOT CONCAVE.

  16. IBM Model 1 (Brown et al., 1993): CONCAVE. IBM Model 2: NOT CONCAVE, because of a product of parameters within the log-sum.

  17. IBM Model 1 (Brown et al., 1993): CONCAVE. IBM Model 2: NOT CONCAVE (product of parameters within the log-sum). For concavity: one parameter is permitted for each atomic piece of latent structure, and no atomic piece of latent structure can affect any other piece.
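
A sketch of the argument, in the standard notation for the IBM models (an assumption; not necessarily the notation used on the slides). For a target sentence f = f_1 ... f_m and source sentence e = e_0 ... e_l (with e_0 a null word), IBM Model 1's log-likelihood is

    \log p(f | e) = \mathrm{const} + \sum_{j=1}^{m} \log \sum_{i=0}^{l} t(f_j | e_i)

Each inner sum is linear in the translation parameters t(· | ·), and the log of a non-negative linear function is concave, so the whole objective is concave. IBM Model 2 scores the same alignment choice with two parameters:

    \log p(f | e) = \mathrm{const} + \sum_{j=1}^{m} \log \sum_{i=0}^{l} a(i | j, l, m) \, t(f_j | e_i)

The product a(· | ·) t(· | ·) inside the log-sum breaks concavity: one atomic piece of latent structure (an alignment link) is now governed by more than one parameter.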

  18. Unsupervised learning in NLP: (typically) non-convex optimization. Except IBM Model 1 for word alignment (which has a concave log-likelihood function). What models can we build without sacrificing concavity?

  19. For concavity: one parameter is permitted for each atomic piece of latent structure, and no atomic piece of latent structure can affect any other piece.

  20. For concavity: one parameter per atomic piece of latent structure, and no piece can affect any other. For dependency parsing, the atomic piece of latent structure is a single dependency arc.

  21. For concavity: one parameter per atomic piece of latent structure (a single dependency arc), and no piece can affect any other. Every dependency arc must be independent, so we can't use a tree constraint.

  22. For concavity: no dependency arc can affect any other (so no tree constraint), and only one parameter is allowed per dependency arc.

  23. For concavity: one parameter per atomic piece of latent structure (a single dependency arc), and no piece can affect any other. Our model: like IBM Model 1, but we generate the same sentence again, aligning words to the original sentence (cf. Brody, 2010).
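
The same recipe gives a sketch of the resulting objective (the arc-parameter symbol c(· | ·) is an assumption here, chosen by analogy with Model 1's t(· | ·)). For a sequence x_1 ... x_n with root symbol x_0 = $, the log-likelihood of the generated copy is

    \log p(x | x) = \mathrm{const} + \sum_{j=1}^{n} \log \sum_{i=0,\, i \neq j}^{n} c(x_j | x_i)

Each position j in the copy independently picks one parent position i in the original sentence (possibly the root $), and that choice is scored by a single parameter, so the objective stays concave exactly as in Model 1; the chosen alignments are read off as dependency arcs.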

  24. [Example: the sentence "$ Vikings came in longboats from Scandinavia in 1000 AD", shown twice: the original and the copy to be generated and aligned.]

  25.–28. [Figures: the alignment between the two copies of the example sentence is built up step by step.]

  29. [Same example.] Cycles, multiple roots, and non-projectivity are all permitted by this model.

  30.–31. [Same example.] Only one parameter per dependency arc.

  32. [Same example.] Only one parameter per dependency arc: we cannot look at other dependency arcs, but we can condition on (properties of) the sentence.

  33. [Same example.] We condition on direction ("Concave Model A").

  34. [Same example with gold POS tags: "$ NNPS VBD IN NNS IN NNP IN CD NN", shown aligned to itself.] Note: we've been using words in our examples, but in our model we follow standard practice and use gold POS tags. We condition on direction ("Concave Model A").
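
As a concrete illustration of how such a model can be trained, here is a minimal EM sketch for a Model-1-style self-alignment model over POS-tag sequences, with the single arc parameter conditioned on direction as in Concave Model A. The function and variable names, the uniform initialization, and the exact parameterization are assumptions for illustration, not the authors' implementation.

    from collections import defaultdict

    def train_concave_model_a(corpus, iterations=20):
        """EM for a Model-1-style self-alignment dependency model (sketch of "Concave Model A").

        corpus: list of gold POS-tag sequences, e.g. [["NNPS", "VBD", "IN", "NNS"], ...].
        Each position j aligns to one parent position i != j (position 0 is the root "$"),
        scored by a single parameter c[(child_tag, parent_tag, direction)].
        Because the log-likelihood is concave, EM reaches a global optimum
        from the uniform start used here.
        """
        c = None  # None = uniform scores before the first M-step

        def score(child, parent, direction):
            # Every triple queried here received positive mass in the first
            # (uniform) E-step, so the .get default is only a safeguard.
            return 1.0 if c is None else c.get((child, parent, direction), 0.0)

        for _ in range(iterations):
            counts = defaultdict(float)   # expected arc counts
            totals = defaultdict(float)   # normalizers per (parent, direction)
            for tags in corpus:
                sent = ["$"] + list(tags)
                for j in range(1, len(sent)):              # child position
                    cand = {}
                    for i in range(len(sent)):             # candidate parent position
                        if i != j:
                            d = "R" if i < j else "L"      # arc direction
                            cand[(i, d)] = score(sent[j], sent[i], d)
                    z = sum(cand.values())
                    for (i, d), s in cand.items():         # E-step: posterior over parents
                        counts[(sent[j], sent[i], d)] += s / z
                        totals[(sent[i], d)] += s / z
            # M-step: re-estimate p(child | parent, direction) from expected counts.
            c = {key: v / totals[(key[1], key[2])] for key, v in counts.items()}
        return c

The hard constraint of Concave Model B (introduced a few slides below) would amount to skipping the root candidate i == 0 in the inner loop whenever the child tag is not a verb tag.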

  35. We condition on direction ("Concave Model A"). Results:
      Model           | Initializer | Accuracy*
      Attach Right    | N/A         | 31.7
      DMV             | Uniform     | 17.6
      DMV             | K&M         | 32.9
      Concave Model A | Uniform     | 25.6
      *Penn Treebank test set, sentences of all lengths; WSJ10 used for training.

  36. Note: IBM Model 1 is not strictly concave (Toutanova & Galley, 2011). [Same example and results table as slide 35.]

  37. We can also use hard constraints while preserving concavity: the only tags that can align to $ are verbs (Mareček & Žabokrtský, 2011; Naseem et al., 2010) ("Concave Model B").

  38. Results:
      Model           | Initializer | Accuracy*
      Attach Right    | N/A         | 31.7
      DMV             | Uniform     | 17.6
      DMV             | K&M         | 32.9
      Concave Model A | Uniform     | 25.6
      Concave Model B | Uniform     | 28.6
      *Penn Treebank test set, sentences of all lengths; WSJ10 used for training.

  39. Unsupervised learning in NLP: (typically) non-convex optimization. Except IBM Model 1 for word alignment (which has a concave log-likelihood function). What models can we build without sacrificing concavity? Can these concave models be useful?

  40. As IBM Model 1 is used to initialize other word alignment models, we can use our concave models to initialize the DMV.

  41. As IBM Model 1 is used to initialize other word alignment models, we can use our concave models to initialize the DMV:
      Model | Initializer     | Accuracy*
      Attach Right | N/A      | 31.7
      DMV   | Uniform         | 17.6
      DMV   | K&M             | 32.9
      DMV   | Concave Model A | 34.4
      DMV   | Concave Model B | 43.0
      *Penn Treebank test set, sentences of all lengths; WSJ10 used for training.

  42. As IBM Model 1 is used to initialize other word alignment models, we can use our concave models to initialize the DMV. Comparison with prior work:
      Model                                                                               | Initializer     | Accuracy*
      DMV, trained on sentences of length ≤ 20                                            | Concave Model B | 53.1
      Shared Logistic Normal (Cohen & Smith, 2009)                                        | K&M             | 41.4
      Posterior Regularization (Gillenwater et al., 2010)                                 | K&M             | 53.3
      LexTSG-DMV (Blunsom & Cohn, 2010)                                                   | K&M             | 55.7
      Punctuation/UnsupTags (Spitkovsky et al., 2011), trained on sentences of length ≤ 45 | K&M'           | 59.1
      *Penn Treebank test set, sentences of all lengths.
