
Discriminative approaches to Statistical Parsing (Mark Johnson, Brown University)



  1. Discriminative approaches to Statistical Parsing
     Mark Johnson, Brown University
     University of Tokyo, 2004
     Joint work with Eugene Charniak (Brown) and Michael Collins (MIT)
     Supported by NSF grants LIS 9720368 and IIS0095940

  2. Talk outline
     • A typology of approaches to parsing
     • Applications of parsers
     • Representations and features of statistical parsers
     • Estimation (training) of statistical parsers
       – maximum likelihood (generative) estimation
       – maximum conditional likelihood (discriminative) estimation
     • Experiments with a discriminatively trained reranking parser
     • Advantages and disadvantages of generative and discriminative training
     • Conclusions and future work

  3. Grammars and parsing
     • A (formal) language is a set of strings
       – For most practical purposes, human languages are infinite sets of strings
       – In general we are interested in the mapping from surface form to meaning
     • A grammar is a finite description of a language
       – Usually assigns each string in a language a description (e.g., parse tree, semantic representation)
     • Parsing is the process of characterizing (recovering) the descriptions of a string
     • Most grammars of human languages are either manually constructed or extracted automatically from an annotated corpus
       – Linguistic expertise is necessary for both!

  4. Manually constructed grammars
     Examples: Lexical-functional grammar (LFG), Head-driven phrase-structure grammar (HPSG), Tree-adjoining grammar (TAG)
     • Linguistically inspired
       – Deals with linguistically interesting phenomena
       – Ignores boring (or difficult!) but frequent constructions
       – Often explicitly models the form-meaning mapping
     • Each theory usually has its own kind of representation
       ⇒ Difficult to compare different approaches
     • Constructing broad-coverage grammars is hard and unrewarding
     • Probability distributions can be defined over their representations
     • Often involve long-distance constraints
       ⇒ Computationally expensive and difficult

  5. [Example LFG analysis (c-structure and f-structure) of "Let us take Tuesday, the fifteenth." The f-structure records attributes such as PRED, STMT-TYPE IMPERATIVE, TNS-ASP MOOD, SUBJ, OBJ, and XCOMP; the c-structure is the corresponding phrase-structure tree.]

  6. Corpus-derived grammars
     • Grammar is extracted automatically from a large linguistically annotated corpus
       – Focuses on frequently occurring constructions
       – Only models phenomena that can be (easily) annotated
       – Typically ignores semantics and most of the rich details of linguistic theories
     • Different models extracted from the same corpus can usually be compared
     • Constructing corpora is hard, unrewarding work
     • Generative models usually only involve local constraints
       – Dynamic programming is possible, but usually involves heuristic search

  7. Sample Penn treebank tree
     [Parse tree for "BELL INDUSTRIES Inc. increased its quarterly to 10 cents from seven cents a share.", with Penn Treebank labels such as S, NP-SBJ, VP, PP-DIR, NP-ADV and part-of-speech tags such as NNP, VBD, PRP$, CD, NNS, DT.]

  8. Applications of (statistical) parsers
     1. Applications that use syntactic parse trees
        • information extraction
        • (short answer) question answering
        • summarization
        • machine translation
     2. Applications that use the probability distribution over strings or trees (parser-based language models)
        • speech recognition and related applications
        • machine translation

  9. PCFG representations and features
     [Example tree: (S (NP (NNP George)) (VP (VB eats) (NP (NN pizza)) (ADVP (RB quickly)))), with rule probability 0.14 for VP → VB NP ADVP.]
     • Probabilistic context-free grammars (PCFGs) associate a rule probability p(r) with each rule r
       ⇒ features are local trees
     • Probability of a tree y is P(y) = \prod_{r \in y} p(r) = \prod_r p(r)^{f_r(y)}, where f_r(y) is the number of times r appears in y
     • Probability of a string x is P(x) = \sum_{y \in \mathcal{Y}(x)} P(y), where \mathcal{Y}(x) is the set of parses of x
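To make the product formula concrete, here is a minimal Python sketch (an illustration added to this transcript, not code from the talk); the tuple encoding of trees and the `rule_prob` dictionary are assumptions made for the example.

```python
import math
from collections import Counter

# A tree is (label, child, child, ...); a preterminal's single child is the word.
tree = ("S",
        ("NP", ("NNP", "George")),
        ("VP", ("VB", "eats"), ("NP", ("NN", "pizza")), ("ADVP", ("RB", "quickly"))))

def rules(t):
    """Yield each local tree of t as a rule (parent, (labels of its children))."""
    label, children = t[0], t[1:]
    yield (label, tuple(c[0] if isinstance(c, tuple) else c for c in children))
    for c in children:
        if isinstance(c, tuple):
            yield from rules(c)

def tree_log_prob(t, rule_prob):
    """log P(y) = sum_r f_r(y) * log p(r), the log of prod_r p(r)^{f_r(y)}."""
    f = Counter(rules(t))                                   # f_r(y): rule counts
    return sum(n * math.log(rule_prob[r]) for r, n in f.items())

# rule_prob is assumed to map every rule used in the tree to its probability,
# e.g. rule_prob[("VP", ("VB", "NP", "ADVP"))] = 0.14 as on the slide.
```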

  10. Lexicalized PCFGs
      [Example lexicalized tree for "the torpedo sank the boat": every nonterminal is annotated with its lexical head, e.g. S_sank → NP_torpedo VP_sank, NP_torpedo → DT_the NN_torpedo, VP_sank → VB_sank NP_boat; 0.02: VP_sank → VB_sank NP_boat.]
      • Head annotation captures subcategorization and head-to-head dependencies
      • Sparse data is a serious problem: smoothing is essential!

  11. Modern (generative) statistical parsers
      [Example tree for "the torpedo sank the boat" in which each nonterminal is annotated with its head word and the head's part of speech, e.g. S_{VB:sank}, NP_{NN:torpedo}, VP_{VB:sank}, NP_{NN:boat}.]
      • Generates a tree via a very large number of small steps (generates the NP, then the NN, then the word boat)
      • Each step in this branching process conditions on a large number of (already generated) variables
      • Sparse data is the major problem: smoothing is essential!

  12. Estimating PCFGs from visible data
      Training data (three trees): (S (NP rice) (VP grows)), (S (NP rice) (VP grows)), (S (NP corn) (VP grows))

      Rule            Count   Rel Freq
      S → NP VP       3       1
      NP → rice       2       2/3
      NP → corn       1       1/3
      VP → grows      3       1

      Resulting tree probabilities: P(S (NP rice) (VP grows)) = 2/3, P(S (NP corn) (VP grows)) = 1/3
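A small sketch of the relative-frequency estimator on the slide's three-tree corpus (again an added illustration; the tree encoding is assumed, and lexical rewrites such as NP → rice are treated like any other rule):

```python
from collections import Counter
from fractions import Fraction

# The three training trees from the slide, as (label, child, ...) tuples.
trees = [("S", ("NP", "rice"), ("VP", "grows")),
         ("S", ("NP", "rice"), ("VP", "grows")),
         ("S", ("NP", "corn"), ("VP", "grows"))]

def rules(t):
    """Yield the rules (parent, (children,)) used in tree t."""
    label, children = t[0], t[1:]
    yield (label, tuple(c[0] if isinstance(c, tuple) else c for c in children))
    for c in children:
        if isinstance(c, tuple):
            yield from rules(c)

def relative_frequency_estimate(trees):
    """PCFG MLE: p(A -> beta) = count(A -> beta) / count(A)."""
    rule_count = Counter(r for t in trees for r in rules(t))
    parent_count = Counter()
    for (parent, _), n in rule_count.items():
        parent_count[parent] += n
    return {r: Fraction(n, parent_count[r[0]]) for r, n in rule_count.items()}

estimate = relative_frequency_estimate(trees)
# As on the slide: S -> NP VP gets 1, NP -> rice 2/3, NP -> corn 1/3, VP -> grows 1.
```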

  13. Why is the PCFG MLE so easy to compute?
      • Visible training data D = (y_1, \ldots, y_n), where each y_i is a parse tree
      • The MLE is \hat{p} = \arg\max_p \prod_{i=1}^n P_p(y_i)
      • It is easy to compute because PCFGs are always normalized, i.e.,
        Z = \sum_{y \in \mathcal{Y}} \prod_r p(r)^{f_r(y)} = 1,
        where \mathcal{Y} is the set of all trees generated by the grammar
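Because the toy grammar above generates only two trees, the claim Z = 1 can be checked by brute-force enumeration. The sketch below is an added illustration; it assumes a finite (non-recursive) grammar and would not terminate otherwise.

```python
from fractions import Fraction
from itertools import product
from math import prod

# The relative-frequency PCFG estimated on the previous slide.
pcfg = {"S":  {("NP", "VP"): Fraction(1)},
        "NP": {("rice",): Fraction(2, 3), ("corn",): Fraction(1, 3)},
        "VP": {("grows",): Fraction(1)}}

def tree_probs(symbol):
    """Probabilities of every complete tree rooted in `symbol`."""
    if symbol not in pcfg:                        # terminal symbol
        return [Fraction(1)]
    out = []
    for rhs, p in pcfg[symbol].items():
        for child_ps in product(*(tree_probs(c) for c in rhs)):
            out.append(p * prod(child_ps))
    return out

assert sum(tree_probs("S")) == 1                  # Z = 1: the PCFG MLE is already normalized
```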

  14. Non-local constraints and the PCFG MLE
      Training data (three trees): (S (NP rice) (VP grows)), (S (NP rice) (VP grows)), (S (NP bananas) (VP grow))

      Rule            Count   Rel Freq
      S → NP VP       3       1
      NP → rice       2       2/3
      NP → bananas    1       1/3
      VP → grows      2       2/3
      VP → grow       1       1/3

      P(S (NP rice) (VP grows)) = 4/9, P(S (NP bananas) (VP grow)) = 1/9
      Summing over only the trees that satisfy the (non-local) number-agreement constraint: Z = 4/9 + 1/9 = 5/9; the remaining mass goes to non-agreeing trees

  15. Renormalization
      Same rules and relative frequencies as the previous slide; dividing by Z renormalizes the distribution over the agreeing trees:

      P(S (NP rice) (VP grows)) = (4/9) / (5/9) = 4/5
      P(S (NP bananas) (VP grow)) = (1/9) / (5/9) = 1/5
      Z = 5/9

  16. Other values do better!
      Keep the same rules but give VP → grows and VP → grow probability 1/2 each:

      Rule            Count   Probability
      S → NP VP       3       1
      NP → rice       2       2/3
      NP → bananas    1       1/3
      VP → grows      2       1/2
      VP → grow       1       1/2

      P(S (NP rice) (VP grows)) = 2/6, renormalized (2/6) / (3/6) = 2/3
      P(S (NP bananas) (VP grow)) = 1/6, renormalized (1/6) / (3/6) = 1/3
      Z = 3/6

      The renormalized likelihood of the training data is (2/3)^2 · (1/3) = 4/27, which beats the relative-frequency estimate's (4/5)^2 · (1/5) = 16/125. (Abney 1997)
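A quick numerical check of the claim, under the reading that the distribution is renormalized over the two number-agreeing trees (an assumption of this added sketch, not code from the talk):

```python
from fractions import Fraction as F

def renormalized_likelihood(p_grows, p_grow):
    """Likelihood of the data (rice grows x2, bananas grow x1) under the PCFG
    renormalized over the two agreeing trees; NP probabilities stay at 2/3 and 1/3."""
    p_rice_grows = F(2, 3) * p_grows
    p_bananas_grow = F(1, 3) * p_grow
    z = p_rice_grows + p_bananas_grow
    return (p_rice_grows / z) ** 2 * (p_bananas_grow / z)

print(renormalized_likelihood(F(2, 3), F(1, 3)))  # 16/125: the relative-frequency estimate
print(renormalized_likelihood(F(1, 2), F(1, 2)))  # 4/27 > 16/125: the other values do better
```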

  17. Make dependencies local – GPSG-style
      Annotating NP and VP with number makes the agreement constraint local:

      Rule                                  Count   Rel Freq
      S → NP[+singular] VP[+singular]       2       2/3
      S → NP[+plural] VP[+plural]           1       1/3
      NP[+singular] → rice                  2       1
      NP[+plural] → bananas                 1       1
      VP[+singular] → grows                 2       1
      VP[+plural] → grow                    1       1

      P(S (NP[+singular] rice) (VP[+singular] grows)) = 2/3
      P(S (NP[+plural] bananas) (VP[+plural] grow)) = 1/3
      Now Z = 1, so the relative-frequency estimate is again the MLE

  18. Maximum entropy or log-linear models
      • \mathcal{Y} = set of syntactic structures (not necessarily trees)
      • f_j(y) = number of occurrences of the j-th feature in y \in \mathcal{Y} (these features need not be conventional linguistic features)
      • w_j are "feature weight" parameters

        S_w(y) = \sum_{j=1}^m w_j f_j(y)
        V_w(y) = \exp S_w(y)
        Z_w = \sum_{y \in \mathcal{Y}} V_w(y)
        P_w(y) = V_w(y) / Z_w = (1 / Z_w) \exp \sum_{j=1}^m w_j f_j(y)
        \log P_w(y) = \sum_{j=1}^m w_j f_j(y) - \log Z_w
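For a small, enumerable candidate set (say, the parses of a single sentence), the distribution can be computed directly. The sketch below is an added illustration with hypothetical `candidates`, `features`, and `weights` arguments, not an implementation from the talk.

```python
import math

def log_linear_probs(candidates, features, weights):
    """P_w(y) = exp(sum_j w_j f_j(y)) / Z_w, with Z_w summed over `candidates`.
    `features(y)` returns a dict of feature counts f_j(y); `weights` maps features to w_j."""
    def score(y):                                 # S_w(y) = sum_j w_j f_j(y)
        return sum(weights.get(j, 0.0) * v for j, v in features(y).items())
    scores = {y: score(y) for y in candidates}
    m = max(scores.values())                      # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return {y: math.exp(s - m) / z for y, s in scores.items()}
```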

  19. PCFGs are log-linear models
      • \mathcal{Y} = set of all trees generated by grammar G
      • f_j(y) = number of times the j-th rule is used in y \in \mathcal{Y}
      • p(r_j) = probability of the j-th rule in G
      • Choose w_j = \log p(r_j), so p(r_j) = \exp w_j

      Example: for the tree (S (NP rice) (VP grows)), with the rules ordered (S → NP VP, NP → rice, NP → bananas, VP → grows, VP → grow), the feature vector is f = [1, 1, 0, 1, 0]

        P_w(y) = \prod_{j=1}^m p(r_j)^{f_j(y)} = \prod_{j=1}^m (\exp w_j)^{f_j(y)} = \exp\Big(\sum_{j=1}^m w_j f_j(y)\Big)

      So a PCFG is just a log-linear model with Z = 1.

  20. Maximum likelihood estimation of log-linear models
      • Visible training data D = (y_1, \ldots, y_n), where each y_i \in \mathcal{Y} is a tree

        \hat{w} = \arg\max_w L_D(w), where
        \log L_D(w) = \sum_{i=1}^n \log P_w(y_i) = \sum_{i=1}^n (S_w(y_i) - \log Z_w)

      • In general there is no closed-form solution ⇒ optimize \log L_D(w) numerically
      • Calculating Z_w involves summing over all parses of all strings ⇒ computationally intractable (Abney suggests Monte Carlo)
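When \mathcal{Y} is small enough to enumerate, the log-likelihood and its gradient (observed minus expected feature counts, a standard identity for log-linear models) can be computed exactly and handed to any numerical optimizer. The sketch below is an added illustration with hypothetical arguments; in realistic parsing models the sum defining Z_w is exactly what becomes intractable, which is the point of the slide.

```python
import math

def neg_log_likelihood_and_grad(weights, data, support, features):
    """-log L_D(w) and its gradient, for a log-linear model whose support
    (the set Y) is small enough to enumerate explicitly."""
    def score(y):                                 # S_w(y)
        return sum(weights.get(j, 0.0) * v for j, v in features(y).items())
    scores = {y: score(y) for y in support}
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    nll = -sum(score(y) - log_z for y in data)
    grad = {}                                     # d(-log L)/dw_j = n*E_w[f_j] - sum_i f_j(y_i)
    for y, s in scores.items():
        p = math.exp(s - log_z)
        for j, v in features(y).items():
            grad[j] = grad.get(j, 0.0) + len(data) * p * v
    for y in data:
        for j, v in features(y).items():
            grad[j] = grad.get(j, 0.0) - v
    return nll, grad
```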
