Features of Statistical Parsers
Confessions of a bottom-feeder: Dredging in the Statistical Muck
Mark Johnson
Brown Laboratory for Linguistic Information Processing
CoNLL 2005
With much help from Eugene Charniak, Michael Collins and Matt Lease
Outline
• Goal: find features for identifying good parses
• Why is this difficult with generative statistical models?
• Reranking framework
• Conditional versus joint estimation
• Features for parse ranking
• Estimation procedures
• Experimental set-up
• Feature selection and evaluation
Features for accurate parsing
• Accurate parsing requires good features
  ⇒ need a flexible method for evaluating a wide range of features
• The parse-ranking framework is currently the best method for doing this
  + works with virtually any kind of representation
  + features can encode virtually any kind of information (syntactic, lexical semantics, prosody, etc.)
  + can exploit the best currently available parsers
  − efficient algorithms are hard(er) to design and implement
  − it is a fishing expedition
Why not a generative statistical parser?
• Statistical parsers (Charniak, Collins) generate parses node by node
• Each step is conditioned on the structure already generated
  Example tree: (S (NP (PRP He)) (VP (VBD raised) (NP (DT the) (NN price))) (. .))
• Encoding dependencies is as difficult as designing a feature-passing grammar (GPSG)
• Smoothing interacts in mysterious ways with these encodings
• Conditional estimation should produce better parsers with our current lousy models
Linear ranking framework
• Generate n candidate parses T_c(s) for each sentence s (using an n-best parser)
• Map each parse t ∈ T_c(s) to a real-valued feature vector f(t) = (f_1(t), ..., f_m(t))
• Each feature f_j is associated with a weight w_j
• The highest scoring parse t̂ = argmax_{t ∈ T_c(s)} w · f(t) is predicted correct
[pipeline diagram: sentence s → n-best parser → parses T_c(s) = t_1 ... t_n → apply feature functions → feature vectors f(t_1) ... f(t_n) → linear combination → parse scores w · f(t_1) ... w · f(t_n) → argmax → "best" parse for s]
Linear ranking example
Weights: w = (−1, 2, 1)

Candidate parse tree t    features f(t)    parse score w · f(t)
t_1                       (1, 3, 2)        7
t_2                       (2, 2, 1)        3
...                       ...              ...

• Parser designer specifies feature functions f = (f_1, ..., f_m)
• Feature weights w = (w_1, ..., w_m) specify each feature's "importance"
• n-best parser produces trees T_c(s) for each sentence s
• Feature functions f apply to each tree t ∈ T_c(s), producing feature values f(t) = (f_1(t), ..., f_m(t))
• Return the highest scoring tree t̂(s) = argmax_t w · f(t) = argmax_t Σ_{j=1}^{m} w_j f_j(t)
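A minimal sketch reproducing the worked example above in code (the weights and feature vectors are the ones in the table):

```python
# Linear ranking: score each candidate with w . f(t) and return the argmax.
import numpy as np

w = np.array([-1.0, 2.0, 1.0])            # feature weights from the slide
candidates = {
    "t1": np.array([1.0, 3.0, 2.0]),      # f(t1)
    "t2": np.array([2.0, 2.0, 1.0]),      # f(t2)
}

scores = {name: float(w @ f) for name, f in candidates.items()}
best = max(scores, key=scores.get)

print(scores)   # {'t1': 7.0, 't2': 3.0}
print(best)     # 't1'
```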
Linear ranking, statistics and machine learning
• Many models define the best candidate t̂ in terms of a linear combination of feature values w · f(t)
  – Exponential, log-linear, Gibbs, MaxEnt models:
      P(t) = (1/Z) exp(w · f(t)),   Z = Σ_{t ∈ T} exp(w · f(t))   (partition function)
      log P(t) = w · f(t) − log Z
  – Perceptron algorithm (including averaged version)
  – Support Vector Machines
  – Boosted decision stumps
PCFGs are exponential models
• f_j(t) = number of times the jth rule is used in t
• w_j = log p_j, where p_j is the probability of the jth rule
Example: for the tree (S (NP rice) (VP grows)), with one feature per rule
(S → NP VP, NP → rice, VP → grows, VP → grow, NP → bananas),
f = (1, 1, 1, 0, 0)

P_PCFG(t) = Π_j p_j^{f_j(t)} = Π_j exp(w_j)^{f_j(t)} = exp(Σ_j w_j f_j(t)) = exp(w · f(t))

So a PCFG is just a special kind of exponential model with Z = 1.
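The identity above can be checked numerically; a minimal sketch, using made-up rule probabilities (the values below are illustrative, not from the talk):

```python
# With one feature per rule, f_j(t) = count of rule j in t and w_j = log p_j,
# the exponential-model score exp(w . f(t)) equals the PCFG product of rule
# probabilities. Rule probabilities here are hypothetical.
import math

rule_probs = {"S -> NP VP": 1.0, "NP -> rice": 0.5, "NP -> bananas": 0.5,
              "VP -> grows": 0.7, "VP -> grow": 0.3}
tree_counts = {"S -> NP VP": 1, "NP -> rice": 1, "VP -> grows": 1}  # f(t) for "rice grows"

p_pcfg = math.prod(rule_probs[r] ** c for r, c in tree_counts.items())
w_dot_f = sum(math.log(rule_probs[r]) * c for r, c in tree_counts.items())
assert abs(p_pcfg - math.exp(w_dot_f)) < 1e-12
```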
Features in linear ranking models
• Features can be any real-valued function of parse t and sentence s
  – counts of the number of times a particular structure appears in t
  – log probabilities from other models
    ∗ log P_c(t) is our most useful feature!
    ∗ generalizes the reference distributions of MaxEnt models
• Subtracting a constant c(s) from a feature's value doesn't affect the difference between parse scores in a linear model:
    w · (f(t_1) − c(s)) − w · (f(t_2) − c(s)) = w · f(t_1) − w · f(t_2)
  – features that don't vary over T_c(s) are useless
  – subtracting the most frequently occurring value c_j(s) of each feature f_j in sentence s ⇒ sparser feature vectors
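A minimal sketch checking this invariance numerically, with made-up weights and feature values: shifting every candidate's feature vector by a per-sentence constant changes each score by the same amount, so score differences and the argmax are unchanged.

```python
# Subtracting a per-sentence constant c(s) from every candidate's features
# leaves score differences (and hence the argmax) unchanged, while making
# the vectors sparser when c_j(s) is each feature's most common value.
import numpy as np

w = np.array([0.5, -1.0, 2.0])             # illustrative weights
F = np.array([[3.0, 1.0, 4.0],             # f(t1)
              [3.0, 2.0, 1.0],             # f(t2)
              [3.0, 1.0, 2.0]])            # f(t3)
c = np.array([3.0, 1.0, 2.0])              # most frequent value of each feature

raw     = F @ w
shifted = (F - c) @ w
assert np.argmax(raw) == np.argmax(shifted)
assert np.allclose(raw - raw[0], shifted - shifted[0])   # identical differences
```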
Getting the feature weights

sentence      f(t⋆(s))     {f(t) : t ∈ T_c(s), t ≠ t⋆(s)}
sentence 1    (1, 3, 2)    (2, 2, 3), (3, 1, 5), (2, 6, 3)
sentence 2    (7, 2, 1)    (2, 5, 5)
sentence 3    (2, 4, 2)    (1, 1, 7), (7, 2, 1)
...           ...          ...

• n-best parser produces trees T_c(s) for each sentence s
• Treebank gives the correct tree t⋆(s) ∈ T_c(s) for sentence s
• Feature functions f apply to each tree t ∈ T_c(s), producing feature values f(t) = (f_1(t), ..., f_m(t))
• Machine learning algorithm selects feature weights w to prefer t⋆(s) (e.g., so that w · f(t⋆(s)) is greater than w · f(t′) for the other t′ ∈ T_c(s)); see the sketch below
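One concrete instance of such a learning algorithm is the (unaveraged) structured perceptron listed a few slides earlier; a minimal sketch, assuming each sentence is given as a candidate feature matrix plus the index of the correct parse:

```python
# One epoch of a simple structured-perceptron learner over
# (candidate_feats, correct_index) pairs, one pair per sentence.
import numpy as np

def perceptron_epoch(w, data, lr=1.0):
    """data: list of (candidate_feats, correct_index); returns updated w."""
    for feats, i_star in data:
        i_hat = int(np.argmax(feats @ w))        # current best candidate
        if i_hat != i_star:                      # mistake-driven update
            w = w + lr * (feats[i_star] - feats[i_hat])
    return w
```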
Conditional ML estimation of w
• Conditional ML estimation selects w to make t⋆(s) as likely as possible compared to the other trees in T_c(s)
• Same as conditional MaxEnt estimation:
    P_w(t | s) = (1/Z_w(s)) exp(w · f(t))   (exponential model)
    Z_w(s) = Σ_{t′ ∈ T_c(s)} exp(w · f(t′))
    D = ((s_1, t⋆_1), ..., (s_n, t⋆_n))   (treebank training data)
    L_D(w) = Π_{i=1}^{n} P_w(t⋆_i | s_i)   (conditional likelihood of D)
    ŵ = argmax_w L_D(w)
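A minimal sketch of the per-sentence conditional log-likelihood log P_w(t⋆ | s) defined above; the candidate feature matrix and the index of the correct parse are assumed inputs:

```python
# log P_w(t* | s): the correct parse's score normalized over the n-best
# candidate set T_c(s), so the partition function Z_w(s) is a finite sum.
import numpy as np

def cond_log_likelihood(w, candidate_feats, correct_index):
    """candidate_feats: (n_candidates, n_features) array of f(t) for t in T_c(s)."""
    scores = candidate_feats @ w                 # w . f(t) for each candidate
    log_Z = np.logaddexp.reduce(scores)          # log Z_w(s)
    return scores[correct_index] - log_Z         # log P_w(t* | s)
```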
(Joint) MLE for exponential models is hard
    D = (t⋆_1, ..., t⋆_n)
    L_D(w) = Π_{i=1}^{n} P_w(t⋆_i)
    ŵ = argmax_w L_D(w)
    P_w(t) = (1/Z_w) exp(w · f(t)),   Z_w = Σ_{t′ ∈ T} exp(w · f(t′))
• Joint MLE selects w to make the t⋆_i as likely as possible
• T is the set of all possible parses for all possible strings
• T is infinite ⇒ cannot be enumerated ⇒ Z_w cannot be calculated
• For a PCFG, Z_w and hence ŵ are easy to calculate, but . . .
• in general ∂L_D/∂w_j and Z_w are intractable analytically and numerically
• Abney (1997) suggests a Monte Carlo calculation method
Conditional MLE is easier
• The conditional likelihood of w is the conditional probability of the hidden part of the data (the syntactic structure t⋆) given its visible part (the yield or terminal string s)
• The conditional likelihood can be numerically optimized because T_c(s) can be enumerated (by a parser)
    D = ((t⋆_1, s_1), ..., (t⋆_n, s_n))
    L_D(w) = Π_{i=1}^{n} P_w(t⋆_i | s_i)
    ŵ = argmax_w L_D(w)
    P_w(t | s) = (1/Z_w(s)) exp(w · f(t)),   Z_w(s) = Σ_{t′ ∈ T_c(s)} exp(w · f(t′))
Conditional vs joint estimation
• Joint MLE maximizes the probability of the training trees and strings, P(t, s) = P(t | s) P(s)
  – Generative statistical parsers usually use joint MLE
  – Joint MLE is simple to compute (relative frequency)
• Conditional MLE maximizes the probability of trees given strings, P(t | s)
  – Conditional estimation uses less information from the data
  – learns nothing from the distribution of strings P(s)
  – ignores unambiguous sentences (!)
• Joint MLE should be better (lower variance) if your model correctly predicts the distribution of parses and strings
  – Any good probabilistic models of semantics and discourse?
Conditional vs joint MLE for PCFGs
[figure: a toy treebank with 100 copies of a tree using VP → V ("run"), plus the two PP-attachment analyses of "see people with telescopes" (high attachment via VP → VP PP; low attachment via VP → V NP with NP → NP PP); each tree's probability is the product of its rule probabilities]

Rule           count   rel freq   better vals
VP → V         100     100/105    4/7
VP → V NP      3       3/105      1/7
VP → VP PP     2       2/105      2/7
NP → N         6       6/7        6/7
NP → NP PP     1       1/7        1/7

• The relative-frequency (joint MLE) estimates do not maximize the conditional likelihood of the treebank parses; the "better vals" column shows rule probabilities that give the correct parses higher conditional probability
Regularization
• Overlearning ⇒ add a regularizer R that penalizes "complex" models
• Useful with a wide range of objective functions:
    ŵ = argmin_w Q(w) + R(w)
    Q(w) = − log L_D(w)   (objective function)
    R(w) = c Σ_j |w_j|^p   (regularizer)
    L_D(w) = Π_i P_w(t⋆_i | s_i)
• p = 2 is known as the Gaussian prior
• p = 1 is known as the Laplacian or exponential prior
  – sparse solutions
  – requires special care in optimization (Kazama and Tsujii, 2003)
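A minimal sketch of the regularized objective Q(w) + R(w) above; candidate feature matrices and correct-parse indices are assumed inputs, and c, p are the regularizer constants from the slide:

```python
# Regularized objective: negative conditional log-likelihood plus
# c * sum_j |w_j|^p, with p = 2 (Gaussian prior) or p = 1 (Laplacian prior,
# which drives many weights exactly to zero).
import numpy as np

def regularized_objective(w, data, c=1.0, p=2):
    """data: list of (candidate_feats, correct_index) pairs, one per sentence."""
    neg_ll = 0.0
    for feats, i_star in data:
        scores = feats @ w
        neg_ll -= scores[i_star] - np.logaddexp.reduce(scores)
    return neg_ll + c * np.sum(np.abs(w) ** p)
```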
If candidate parses don’t include correct parse • If T c ( s ) doesn’t include t ⋆ ( s ) , choose parse t + ( s ) in T c ( s ) closest to t ⋆ ( s ) • Maximize conditional likelihood of ( t + 1 , . . . , t + n ) • Closest parse t + t + i = argmax t ∈T ( s i ) F t ⋆ i ( t ) i T c ( s i ) t ⋆ i – F t ⋆ ( t ) is f-score of t relative to t ⋆ • w chosen to maximize the regularized log conditional likelihood of t + T i � P w ( t + L D ( w ) = i | s i ) i 19