Features of Statistical Parsers
Confessions of a bottom-feeder: Dredging in the Statistical Muck
Mark Johnson
Brown Laboratory for Linguistic Information Processing
CoNLL 2005
With much help from Eugene Charniak, Michael Collins and Matt Lease
Outline
• Goal: find features for identifying good parses
• Why is this difficult with generative statistical models?
• Reranking framework
• Conditional versus joint estimation
• Features for parse ranking
• Estimation procedures
• Experimental set-up
• Feature selection and evaluation
Features for accurate parsing
• Accurate parsing requires good features
  ⇒ need a flexible method for evaluating a wide range of features
• The parse-ranking framework is currently the best method for doing this
  + works with virtually any kind of representation
  + features can encode virtually any kind of information (syntactic, lexical semantics, prosody, etc.)
  + can exploit the best currently available parsers
  − efficient algorithms are hard(er) to design and implement
  − it is a fishing expedition
Why not a generative statistical parser?
• Statistical parsers (Charniak, Collins) generate parses node by node
• Each step is conditioned on the structure already generated
  Example tree: (S (NP (PRP He)) (VP (VBD raised) (NP (DT the) (NN price))) (. .))
• Encoding dependencies is as difficult as designing a feature-passing grammar (GPSG)
• Smoothing interacts in mysterious ways with these encodings
• Conditional estimation should produce better parsers with our current lousy models
Linear ranking framework
• Generate n candidate parses T_c(s) for each sentence s (using an n-best parser)
• Map each parse t ∈ T_c(s) to a real-valued feature vector f(t) = (f_1(t), ..., f_m(t))
• Each feature f_j is associated with a weight w_j
• The highest scoring parse t̂ = argmax_{t ∈ T_c(s)} w · f(t) is predicted correct
[pipeline diagram: sentence s → n-best parser → parses T_c(s) = t_1 ... t_n → apply feature functions → feature vectors f(t_1) ... f(t_n) → linear combination → parse scores w · f(t_1) ... w · f(t_n) → argmax → "best" parse for s]
Linear ranking example
Weights: w = (−1, 2, 1)

Candidate parse tree t    features f(t)    parse score w · f(t)
t_1                       (1, 3, 2)        7
t_2                       (2, 2, 1)        3
...                       ...              ...

• Parser designer specifies feature functions f = (f_1, ..., f_m)
• Feature weights w = (w_1, ..., w_m) specify each feature's "importance"
• n-best parser produces trees T_c(s) for each sentence s
• Feature functions f apply to each tree t ∈ T_c(s), producing feature values f(t) = (f_1(t), ..., f_m(t))
• Return the highest scoring tree t̂(s) = argmax_t w · f(t) = argmax_t Σ_{j=1}^{m} w_j f_j(t)
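A minimal sketch reproducing the worked example above in code (the weights and feature vectors are the ones in the table):

```python
# Linear ranking: score each candidate with w . f(t) and return the argmax.
import numpy as np

w = np.array([-1.0, 2.0, 1.0])            # feature weights from the slide
candidates = {
    "t1": np.array([1.0, 3.0, 2.0]),      # f(t1)
    "t2": np.array([2.0, 2.0, 1.0]),      # f(t2)
}

scores = {name: float(w @ f) for name, f in candidates.items()}
best = max(scores, key=scores.get)

print(scores)   # {'t1': 7.0, 't2': 3.0}
print(best)     # 't1'
```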
Linear ranking, statistics and machine learning
• Many models define the best candidate t̂ in terms of a linear combination of feature values w · f(t)
  – Exponential, log-linear, Gibbs, MaxEnt models:
      P(t) = (1/Z) exp(w · f(t)),   Z = Σ_{t ∈ T} exp(w · f(t))   (partition function)
      log P(t) = w · f(t) − log Z
  – Perceptron algorithm (including averaged version)
  – Support Vector Machines
  – Boosted decision stumps
PCFGs are exponential models
• f_j(t) = number of times the jth rule is used in t
• w_j = log p_j, where p_j is the probability of the jth rule
Example: for the tree (S (NP rice) (VP grows)), with one feature per rule
(S → NP VP, NP → rice, VP → grows, VP → grow, NP → bananas),
f = (1, 1, 1, 0, 0)

P_PCFG(t) = Π_j p_j^{f_j(t)} = Π_j exp(w_j)^{f_j(t)} = exp(Σ_j w_j f_j(t)) = exp(w · f(t))

So a PCFG is just a special kind of exponential model with Z = 1.
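The identity above can be checked numerically; a minimal sketch, using made-up rule probabilities (the values below are illustrative, not from the talk):

```python
# With one feature per rule, f_j(t) = count of rule j in t and w_j = log p_j,
# the exponential-model score exp(w . f(t)) equals the PCFG product of rule
# probabilities. Rule probabilities here are hypothetical.
import math

rule_probs = {"S -> NP VP": 1.0, "NP -> rice": 0.5, "NP -> bananas": 0.5,
              "VP -> grows": 0.7, "VP -> grow": 0.3}
tree_counts = {"S -> NP VP": 1, "NP -> rice": 1, "VP -> grows": 1}  # f(t) for "rice grows"

p_pcfg = math.prod(rule_probs[r] ** c for r, c in tree_counts.items())
w_dot_f = sum(math.log(rule_probs[r]) * c for r, c in tree_counts.items())
assert abs(p_pcfg - math.exp(w_dot_f)) < 1e-12
```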
Features in linear ranking models
• Features can be any real-valued function of parse t and sentence s
  – counts of the number of times a particular structure appears in t
  – log probabilities from other models
    ∗ log P_c(t) is our most useful feature!
    ∗ generalizes the reference distributions of MaxEnt models
• Subtracting a constant c(s) from a feature's value doesn't affect the difference between parse scores in a linear model:
    w · (f(t_1) − c(s)) − w · (f(t_2) − c(s)) = w · f(t_1) − w · f(t_2)
  – features that don't vary over T_c(s) are useless
  – subtracting the most frequently occurring value c_j(s) of each feature f_j in sentence s ⇒ sparser feature vectors
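A minimal sketch checking this invariance numerically, with made-up weights and feature values: shifting every candidate's feature vector by a per-sentence constant changes each score by the same amount, so score differences and the argmax are unchanged.

```python
# Subtracting a per-sentence constant c(s) from every candidate's features
# leaves score differences (and hence the argmax) unchanged, while making
# the vectors sparser when c_j(s) is each feature's most common value.
import numpy as np

w = np.array([0.5, -1.0, 2.0])             # illustrative weights
F = np.array([[3.0, 1.0, 4.0],             # f(t1)
              [3.0, 2.0, 1.0],             # f(t2)
              [3.0, 1.0, 2.0]])            # f(t3)
c = np.array([3.0, 1.0, 2.0])              # most frequent value of each feature

raw     = F @ w
shifted = (F - c) @ w
assert np.argmax(raw) == np.argmax(shifted)
assert np.allclose(raw - raw[0], shifted - shifted[0])   # identical differences
```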
Getting the feature weights

sentence      f(t⋆(s))     {f(t) : t ∈ T_c(s), t ≠ t⋆(s)}
sentence 1    (1, 3, 2)    (2, 2, 3), (3, 1, 5), (2, 6, 3)
sentence 2    (7, 2, 1)    (2, 5, 5)
sentence 3    (2, 4, 2)    (1, 1, 7), (7, 2, 1)
...           ...          ...

• n-best parser produces trees T_c(s) for each sentence s
• Treebank gives the correct tree t⋆(s) ∈ T_c(s) for sentence s
• Feature functions f apply to each tree t ∈ T_c(s), producing feature values f(t) = (f_1(t), ..., f_m(t))
• Machine learning algorithm selects feature weights w to prefer t⋆(s) (e.g., so that w · f(t⋆(s)) is greater than w · f(t′) for the other t′ ∈ T_c(s)); see the sketch below
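One concrete instance of such a learning algorithm is the (unaveraged) structured perceptron listed a few slides earlier; a minimal sketch, assuming each sentence is given as a candidate feature matrix plus the index of the correct parse:

```python
# One epoch of a simple structured-perceptron learner over
# (candidate_feats, correct_index) pairs, one pair per sentence.
import numpy as np

def perceptron_epoch(w, data, lr=1.0):
    """data: list of (candidate_feats, correct_index); returns updated w."""
    for feats, i_star in data:
        i_hat = int(np.argmax(feats @ w))        # current best candidate
        if i_hat != i_star:                      # mistake-driven update
            w = w + lr * (feats[i_star] - feats[i_hat])
    return w
```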
Conditional ML estimation of w
• Conditional ML estimation selects w to make t⋆(s) as likely as possible compared to the other trees in T_c(s)
• Same as conditional MaxEnt estimation:
    P_w(t | s) = (1/Z_w(s)) exp(w · f(t))   (exponential model)
    Z_w(s) = Σ_{t′ ∈ T_c(s)} exp(w · f(t′))
    D = ((s_1, t⋆_1), ..., (s_n, t⋆_n))   (treebank training data)
    L_D(w) = Π_{i=1}^{n} P_w(t⋆_i | s_i)   (conditional likelihood of D)
    ŵ = argmax_w L_D(w)
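A minimal sketch of the per-sentence conditional log-likelihood log P_w(t⋆ | s) defined above; the candidate feature matrix and the index of the correct parse are assumed inputs:

```python
# log P_w(t* | s): the correct parse's score normalized over the n-best
# candidate set T_c(s), so the partition function Z_w(s) is a finite sum.
import numpy as np

def cond_log_likelihood(w, candidate_feats, correct_index):
    """candidate_feats: (n_candidates, n_features) array of f(t) for t in T_c(s)."""
    scores = candidate_feats @ w                 # w . f(t) for each candidate
    log_Z = np.logaddexp.reduce(scores)          # log Z_w(s)
    return scores[correct_index] - log_Z         # log P_w(t* | s)
```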
(Joint) MLE for exponential models is hard
    D = (t⋆_1, ..., t⋆_n)
    L_D(w) = Π_{i=1}^{n} P_w(t⋆_i)
    ŵ = argmax_w L_D(w)
    P_w(t) = (1/Z_w) exp(w · f(t)),   Z_w = Σ_{t′ ∈ T} exp(w · f(t′))
• Joint MLE selects w to make the t⋆_i as likely as possible
• T is the set of all possible parses for all possible strings
• T is infinite ⇒ cannot be enumerated ⇒ Z_w cannot be calculated
• For a PCFG, Z_w and hence ŵ are easy to calculate, but . . .
• in general ∂L_D/∂w_j and Z_w are intractable analytically and numerically
• Abney (1997) suggests a Monte Carlo calculation method
Conditional MLE is easier
• The conditional likelihood of w is the conditional probability of the hidden part of the data (the syntactic structure t⋆) given its visible part (the yield or terminal string s)
• The conditional likelihood can be numerically optimized because T_c(s) can be enumerated (by a parser)
    D = ((t⋆_1, s_1), ..., (t⋆_n, s_n))
    L_D(w) = Π_{i=1}^{n} P_w(t⋆_i | s_i)
    ŵ = argmax_w L_D(w)
    P_w(t | s) = (1/Z_w(s)) exp(w · f(t)),   Z_w(s) = Σ_{t′ ∈ T_c(s)} exp(w · f(t′))
Conditional vs joint estimation
• Joint MLE maximizes the probability of the training trees and strings, P(t, s) = P(t | s) P(s)
  – Generative statistical parsers usually use joint MLE
  – Joint MLE is simple to compute (relative frequency)
• Conditional MLE maximizes the probability of trees given strings, P(t | s)
  – Conditional estimation uses less information from the data
  – learns nothing from the distribution of strings P(s)
  – ignores unambiguous sentences (!)
• Joint MLE should be better (lower variance) if your model correctly predicts the distribution of parses and strings
  – Any good probabilistic models of semantics and discourse?
Conditional vs joint MLE for PCFGs
[figure: a toy treebank with 100 copies of a tree using VP → V ("run"), plus the two PP-attachment analyses of "see people with telescopes" (high attachment via VP → VP PP; low attachment via VP → V NP with NP → NP PP); each tree's probability is the product of its rule probabilities]

Rule           count   rel freq   better vals
VP → V         100     100/105    4/7
VP → V NP      3       3/105      1/7
VP → VP PP     2       2/105      2/7
NP → N         6       6/7        6/7
NP → NP PP     1       1/7        1/7

• The relative-frequency (joint MLE) estimates do not maximize the conditional likelihood of the treebank parses; the "better vals" column shows rule probabilities that give the correct parses higher conditional probability
Regularization
• Overlearning ⇒ add a regularizer R that penalizes "complex" models
• Useful with a wide range of objective functions:
    ŵ = argmin_w Q(w) + R(w)
    Q(w) = − log L_D(w)   (objective function)
    R(w) = c Σ_j |w_j|^p   (regularizer)
    L_D(w) = Π_i P_w(t⋆_i | s_i)
• p = 2 is known as the Gaussian prior
• p = 1 is known as the Laplacian or exponential prior
  – sparse solutions
  – requires special care in optimization (Kazama and Tsujii, 2003)
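A minimal sketch of the regularized objective Q(w) + R(w) above; candidate feature matrices and correct-parse indices are assumed inputs, and c, p are the regularizer constants from the slide:

```python
# Regularized objective: negative conditional log-likelihood plus
# c * sum_j |w_j|^p, with p = 2 (Gaussian prior) or p = 1 (Laplacian prior,
# which drives many weights exactly to zero).
import numpy as np

def regularized_objective(w, data, c=1.0, p=2):
    """data: list of (candidate_feats, correct_index) pairs, one per sentence."""
    neg_ll = 0.0
    for feats, i_star in data:
        scores = feats @ w
        neg_ll -= scores[i_star] - np.logaddexp.reduce(scores)
    return neg_ll + c * np.sum(np.abs(w) ** p)
```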
If candidate parses don’t include correct parse • If T c ( s ) doesn’t include t ⋆ ( s ) , choose parse t + ( s ) in T c ( s ) closest to t ⋆ ( s ) • Maximize conditional likelihood of ( t + 1 , . . . , t + n ) • Closest parse t + t + i = argmax t ∈T ( s i ) F t ⋆ i ( t ) i T c ( s i ) t ⋆ i – F t ⋆ ( t ) is f-score of t relative to t ⋆ • w chosen to maximize the regularized log conditional likelihood of t + T i � P w ( t + L D ( w ) = i | s i ) i 19