Exploiting Syntactic Structure for Language Modeling
Ciprian Chelba, Frederick Jelinek
The Johns Hopkins University

☞ Basic Language Modeling
☞ A Structured Language Model:
  – Language Model Requirements
  – Word and Structure Generation
  – Research Issues:
    – Model Component Parameterization
    – Pruning Strategy
    – Word Level Probability Assignment
    – Model Statistics Reestimation
    – Model Performance
– give people an outline so that they know what's going on
(1 min)
Basic Language Modeling

P(W), W = w_1 w_2 ... w_n

Estimate the source probability P(w_i | w_1 ... w_{i-1}), w_i ∈ V, from a training corpus — millions of words of text chosen for its similarity to the sentences expected at run-time.

☞ Parametric conditional models: P_M(w_i | w_1 ... w_{i-1})

☞ Perplexity (PPL):

  PPL(M) = exp( −(1/N) Σ_{i=1}^{N} ln P_M(w_i | w_1 ... w_{i-1}) )

✔ different than maximum likelihood estimation: the test data is not seen during the model estimation process;
✔ good models are smooth: P_M(w_i | w_1 ... w_{i-1}) > ε
– source modeling: a classical problem in information theory
– give interpretation for perplexity as the expected average length of a list of equi-probable words; Shannon code-length
(3 min)
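A minimal Python sketch of the perplexity computation above; `model_prob` is a hypothetical stand-in for any conditional model P_M(w_i | w_1 ... w_{i-1}), not one of the models discussed in the talk.

```python
import math

def perplexity(model_prob, sentences):
    """PPL(M) = exp(-(1/N) * sum_i ln P_M(w_i | w_1 ... w_{i-1}))."""
    log_sum, n_words = 0.0, 0
    for sentence in sentences:
        for i, word in enumerate(sentence):
            p = model_prob(sentence[:i], word)  # conditional probability of w_i given its prefix
            log_sum += math.log(p)
            n_words += 1
    return math.exp(-log_sum / n_words)

# Toy check: a uniform model over a 10-word vocabulary gives PPL = 10.
uniform = lambda history, word: 0.1
print(perplexity(uniform, [["the", "contract", "ended"]]))  # -> 10.0
```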
Exploiting Syntactic Structure for Language Modeling

☞ Generalize trigram modeling (local) by taking advantage of sentence structure (influence by more distant past)
☞ Use exposed heads h (words w and their corresponding non-terminal tags l) for prediction:

  P(w_i | T_i) = P(w_i | h_{-2}(T_i), h_{-1}(T_i))

  where T_i is the partial hidden structure, with head assignment, provided to W_i

(Figure: partial parse of "the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS after ...", with headwords percolated up through the constituents contract_NP, loss_NP, cents_NP, of_PP, loss_NP, with_PP and ended_VP')
– point out originality of approach
– explain clearly what headwords are
– difference between trigram/SLM: surface vs. deep modeling of the source; give example with removed constituent again; show that headwords make intuitively better predictors for the following word
– hidden nature of the parses: cannot decide on a single best parse for a word prefix, not even at the end of the sentence; need to weight them according to how likely they are - probabilistic model
(6 min)
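A small Python sketch contrasting the two prediction contexts on the running example. The flat list of exposed constituents and the assumption that the parser has already adjoined "with a loss of 7 cents" under the VP' headed by "ended" are simplifications for illustration, not the model's actual internal state.

```python
def trigram_context(words, i):
    """Surface context: the two words immediately preceding w_i."""
    return words[i - 2], words[i - 1]

def slm_context(exposed_heads):
    """Deep context: the two right-most exposed heads h_{-2}, h_{-1} of the
    partial parse, each a (headword, non-terminal tag) pair."""
    return exposed_heads[-2], exposed_heads[-1]

# Running example from the slide, just before predicting the word after "cents".
words = ["the", "contract", "ended", "with", "a", "loss", "of", "7", "cents"]
# Hypothetical stack of exposed constituents at that point:
heads = [("contract", "NP"), ("ended", "VP'")]

print(trigram_context(words, len(words)))  # ('7', 'cents')  -- surface words only
print(slm_context(heads))                  # (('contract', 'NP'), ('ended', "VP'")) -- better predictors
```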
Language Model Requirements

☞ Model must operate left-to-right: P(w_i | w_1 ... w_{i-1})
☞ In hypothesizing hidden structure, the model can use only the word prefix W_i = w_0, ..., w_i, i.e., not the complete sentence w_0, ..., w_{n+1} as all conventional parsers do!
☞ Model will assign joint probability to sequences of words and hidden parse structure: P(T_i, W_i)
☞ Model complexity must be limited; even the trigram model faces critical data sparseness problems
(8 min)
(Figure, built up over several animation steps: the partial parse of "the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS" with constituents contract_NP, loss_NP, cents_NP, of_PP, loss_NP, with_PP, ended_VP', generated by the action sequence
  ...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...
The final frame labels the three model components: PREDICTOR (predict word), TAGGER (tag word), PARSER (adjoin_{left,right} / null).)
– just one of the possible continuations for one of the possible parses of the word prefix
– prepare next slide using FSM; explain that it is merely an encoding of the word prefix and the tree structure
(11 min)
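To make that encoding concrete, here is a minimal Python rendering of the elementary actions that generate "cents" and its surrounding structure, mirroring the sequence in the figure; the tuple format is a hypothetical encoding introduced only for illustration.

```python
# Elementary actions encoding one word of the running example:
actions_for_cents = [
    ("null", None),               # parser hands control back to the predictor
    ("predict", "cents"),         # predictor: generate the next word
    ("POStag", "NNS"),            # tagger: attach its part-of-speech tag
    ("adjoin-right", "NP"),       # parser: 7_CD + cents_NNS -> cents_NP (head from the right)
    ("adjoin-left", "PP"),        # parser: of_IN + cents_NP -> of_PP (head from the left)
    # ... further adjoin moves building up toward ended_VP', as in the figure ...
    ("adjoin-left", "VP'"),
    ("null", None),               # no more parser moves; predict the next word
]
```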
Word and Structure Generation

  P(T, W) = ∏_{i=1}^{n+1} P(w_i | h_{-2}, h_{-1}) · P(g_i | w_i, h_{-1}.tag, h_{-2}.tag) · P(T_i | w_i, g_i, T_{i-1})
                           [predictor]              [tagger]                               [parser]

☞ The predictor generates the next word w_i with probability P(w_i = v | h_{-2}, h_{-1})
☞ The tagger attaches tag g_i to the most recently generated word w_i with probability P(g_i | w_i, h_{-1}.tag, h_{-2}.tag)
☞ The parser builds the partial parse T_i from T_{i-1}, w_i, and g_i in a series of moves ending with null, where a parser move a is made with probability P(a | h_{-2}, h_{-1});
  a ∈ {(adjoin-left, NTtag), (adjoin-right, NTtag), null}
– we have described an encoding of a word sequence with a parse tree
– to get a probabilistic model, assign a probability to each elementary action in the encoding
(13 min)
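A minimal sketch of how P(T, W) decomposes into the three component models on the slide; the callables and the per-move head bookkeeping are an assumed interface, introduced only to mirror the factorization, not the actual implementation.

```python
import math

def log_joint_prob(moves, p_predictor, p_tagger, p_parser):
    """ln P(T, W) as a sum of elementary-move log probabilities.

    `moves` is assumed to be a list of per-word records
      (word, tag, heads, parser_moves)
    where `heads` = (h_-2, h_-1) exposed when the word is predicted (each a
    (headword, tag) pair) and `parser_moves` is a list of (action, heads)
    pairs ending with the "null" action.
    """
    total = 0.0
    for word, tag, (h2, h1), parser_moves in moves:
        total += math.log(p_predictor(word, h2, h1))          # P(w_i | h_-2, h_-1)
        total += math.log(p_tagger(tag, word, h1[1], h2[1]))  # P(g_i | w_i, h_-1.tag, h_-2.tag)
        for action, (m2, m1) in parser_moves:                 # P(T_i | w_i, g_i, T_{i-1}) is the
            total += math.log(p_parser(action, m2, m1))       #   product of the move probabilities
    return total
```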
Research Issues

☞ Model component parameterization — equivalence classifications for model components:
  P(w_i = v | h_{-2}, h_{-1}), P(g_i | w_i, h_{-1}.tag, h_{-2}.tag), P(a | h_{-2}, h_{-1})
☞ Huge number of hidden parses — need to prune it by discarding the unlikely ones
☞ Word level probability assignment — calculate P(w_i | w_1 ... w_{i-1})
☞ Model statistics estimation — unsupervised algorithm for maximizing P(W) (minimizing perplexity)
– everything's on the slide
(14 min)
Pruning Strategy

☞ Number of parses {T_k} for a given word prefix W_k: |{T_k}| ~ O(2^k)
☞ Prune most parses without discarding the most likely ones for a given sentence

Synchronous Multi-Stack Pruning Algorithm
– the hypotheses are ranked according to ln P(W_k, T_k)
– each stack contains partial parses constructed by the same number of parser operations

The width of the pruning is controlled by:
– maximum number of stack entries
– log-probability threshold
(15 min)
Pruning Strategy

(Figure: synchronous multi-stack organization at word positions k, k', k+1. Each column holds one stack per number of parser operations (0, ..., p, p+1, ..., P_k, P_k+1) for a given number of predictor moves (k or k+1 predicts). Arrows mark word predictor and tagger transitions, null parser transitions, and parser adjoin/unary transitions between stacks.)
– we want to find the most probable set of parses that are extensions of the ones currently in the stacks
– there is an upper bound on the number of stacks at a given input position
– hypotheses in stack 0 differ according to their POS sequences
(17 min)
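A minimal sketch of the two pruning controls, applied to one synchronous set of stacks; the (log_prob, partial_parse) entry format is an assumption for illustration.

```python
def prune_stacks(stacks, max_entries, log_prob_threshold):
    """Prune each stack (one stack per number of parser operations used so far).

    Entries are assumed to be (log_prob, partial_parse) pairs with
    log_prob = ln P(W_k, T_k); hypotheses are ranked by that value.
    """
    pruned = []
    for stack in stacks:
        stack = sorted(stack, key=lambda e: e[0], reverse=True)  # rank by ln P(W_k, T_k)
        if stack:
            best = stack[0][0]
            stack = [e for e in stack[:max_entries]              # maximum number of stack entries
                     if e[0] >= best - log_prob_threshold]       # log-probability threshold
        pruned.append(stack)
    return pruned
```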
Word Level Probability Assignment

The probability assignment for the word at position k+1 in the input sentence must be made using:

  P(w_{k+1} | W_k) = Σ_{T_k ∈ S_k} P(w_{k+1} | W_k, T_k) · ρ(W_k, T_k)

where S_k is the set of all parses present in the stacks at the current stage.

☞ the interpolation weights ρ(W_k, T_k) must satisfy
  Σ_{T_k ∈ S_k} ρ(W_k, T_k) = 1
  in order to ensure a proper probability over strings; we use
  ρ(W_k, T_k) = P(W_k, T_k) / Σ_{T'_k ∈ S_k} P(W_k, T'_k)
– point out consistency of the estimate: when summing over all parses we get the actual probability value according to our model
(19 min)
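A minimal sketch of the interpolated word-level assignment above; `p_word_given_parse` stands in for the predictor probability P(w_{k+1} | W_k, T_k), and the (log_prob, parse) entries are assumed to be the contents of S_k.

```python
import math

def next_word_probability(stack_entries, p_word_given_parse, word):
    """P(w_{k+1} | W_k) = sum_{T_k in S_k} P(w_{k+1} | W_k, T_k) * rho(W_k, T_k),
    with rho(W_k, T_k) = P(W_k, T_k) / sum_{T'_k in S_k} P(W_k, T'_k)."""
    # Turn the joint log probabilities ln P(W_k, T_k) into normalized weights rho.
    max_lp = max(lp for lp, _ in stack_entries)             # subtract max for numerical stability
    raw = [math.exp(lp - max_lp) for lp, _ in stack_entries]
    z = sum(raw)
    # Interpolate the per-parse predictions; the weights sum to 1 by construction.
    return sum((w / z) * p_word_given_parse(word, parse)
               for w, (_, parse) in zip(raw, stack_entries))
```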
Model Parameter Reestimation

Need to re-estimate the model component probabilities

  P(w_i = v | h_{-2}, h_{-1}), P(g_i | w_i, h_{-1}.tag, h_{-2}.tag), P(a | h_{-2}, h_{-1})

such that we decrease the model perplexity.

Modified Expectation-Maximization (EM) algorithm:
☞ We retain the N "best" parses {T^1, ..., T^N} for the complete sentence W
☞ The hidden events in the EM algorithm are restricted to those occurring in the N "best" parses
☞ We seed the re-estimation process with statistics gathered from manually parsed sentences
– point out goal of re-estimation
– warn about need to know the E-M algorithm
– explain what a treebank is and why/how we can initialize from a treebank
(21 min)
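A minimal sketch of the N-best EM loop described above; the `model` interface (n_best_parses, joint_prob, accumulate, update) is hypothetical, and in the talk the initial statistics come from manually parsed (treebank) data rather than from a cold start.

```python
def reestimate(sentences, model, n_best=10, iterations=5):
    """Modified EM: hidden events restricted to the N "best" parses per sentence."""
    for _ in range(iterations):
        counts = model.new_counts()
        for w in sentences:
            parses = model.n_best_parses(w, n_best)         # restricted E-step
            total = sum(model.joint_prob(t, w) for t in parses)
            for t in parses:
                weight = model.joint_prob(t, w) / total     # posterior over the N-best set
                model.accumulate(counts, t, w, weight)      # fractional counts for all components
        model.update(counts)                                # M-step: renormalize component models
    return model
```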