Transition-based Dependency Parsing with Selectional Branching

Presented at the 4th Workshop on Statistical Parsing of Morphologically Rich Languages
October 18th, 2013

Jinho D. Choi
University of Massachusetts Amherst
Greedy vs. Non-greedy Parsing

• Greedy parsing
  - Considers only one head for each token.
  - Generates one parse tree per sentence.
  - e.g., transition-based parsing (2 ms / sentence).
• Non-greedy parsing
  - Considers multiple heads for each token.
  - Generates multiple parse trees per sentence.
  - e.g., transition-based parsing with beam search, graph-based parsing, linear programming, dual decomposition (≥ 93% accuracy).
Motivation

• How often do we need non-greedy parsing?
  - Our greedy parser performs as accurately as our non-greedy parser about 64% of the time.
  - The gap is even smaller when they are evaluated on non-benchmark data (e.g., tweets, chats, blogs).
• Many applications are time sensitive.
  - Some applications need at least one complete parse tree ready within a limited time period (e.g., search, dialog, Q/A).
• Hard sentences are hard for any parser!
  - Considering more heads does not always guarantee more accurate parse results.
Transition-based Parsing

• Transition-based dependency parsing (greedy)
  - Considers one transition for each parsing state.

[Figure: a single chain of parsing states; at each state S, one transition t′ is chosen from the candidates t_1 … t_L and applied, until a terminal state T is reached.]

What if t′ is not the correct transition?
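To make the control flow concrete, here is a minimal Python sketch of the greedy loop. The `ts` (transition system) and `classifier` interfaces are assumptions made for the example, not ClearNLP's actual API.

```python
# A minimal sketch of greedy transition-based parsing.
# `ts` and `classifier` are assumed interfaces, invented for illustration.

def parse_greedy(sentence, classifier, ts):
    state = ts.initial_state(sentence)
    while not ts.is_terminal(state):
        # Score every transition that is valid in the current state and
        # commit to the single highest-scoring one; nothing is revisited.
        best = max(ts.valid_transitions(state),
                   key=lambda t: classifier.score(state, t))
        state = ts.apply(state, best)
    return state.tree()  # exactly one parse tree per sentence
```

An early wrong transition can never be undone here, which is exactly the failure mode the question above points at.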
Transition-based Parsing

• Transition-based dependency parsing with beam search
  - Considers b transitions for each parsing state.

[Figure: b parallel chains of parsing states S_1 … S_b; each chain applies its own transition t′_1 … t′_b chosen from its candidates and reaches its own terminal state T_1 … T_b.]
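A sketch of the beam-search variant, under the same assumed interface as the greedy snippet above; again, none of the names reflect ClearNLP's actual API.

```python
import heapq

# A sketch of transition-based parsing with beam search. The beam keeps
# the b highest-scoring partial sequences at every step, so it builds
# b trees regardless of how easy or hard the sentence is.

def parse_beam(sentence, classifier, ts, b=80):
    beam = [(0.0, ts.initial_state(sentence))]
    while not all(ts.is_terminal(s) for _, s in beam):
        candidates = []
        for score, state in beam:
            if ts.is_terminal(state):
                candidates.append((score, state))  # finished sequences persist
                continue
            for t in ts.valid_transitions(state):
                candidates.append((score + classifier.score(state, t),
                                   ts.apply(state, t)))
        beam = heapq.nlargest(b, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1].tree()  # best of the b trees
```

With b = 1 this collapses to the greedy parser; the question raised next is whether the full beam is needed on every sentence.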
Selectional Branching

• Issues with beam search
  - Generates a fixed number of parse trees no matter how easy/hard the input sentence is.
  - Is it possible to dynamically adjust the beam size for each individual sentence?
• Selectional branching
  - The one-best transition sequence is found by a greedy parser.
  - Collect the k-best state-transition pairs for each low-confidence transition used to generate the one-best sequence.
  - Generate transition sequences from the b−1 highest-scoring state-transition pairs in the collection.
Selectional Branching

[Figure: the one-best sequence S_1 → S_2 → … → S_n produced by the greedy parser; whenever a transition is low confident, the state S_i is saved with its next-best transitions t′_i2 … t′_ik into the list λ.]

• Pick the b−1 pairs with the highest scores.
• For our experiments, k = 2 is used.
Selectional Branching

[Figure: from λ = {(S_1, t′_12), (S_2, t′_22), (S_3, t′_32), …}, each saved pair branches off a new sequence that applies the alternative transition to its saved state and parses on to a terminal state T.]

• Carries on parsing states from the one-best sequence.
• Guaranteed to generate fewer trees than beam search when |λ| ≤ b.
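A sketch of the whole procedure, under the same assumed interface as the earlier snippets. Here `kbest_within_margin` plays the role of the margin-based classifier C_k defined on the next slide, and the `total_score` attribute on states is likewise an assumption of the sketch.

```python
# An illustrative sketch of selectional branching: a greedy pass records
# alternative (state, transition) pairs at low-confidence decisions, then
# at most b-1 of the best alternatives are branched and finished greedily.

def parse_selectional_branching(sentence, classifier, ts, b=80, k=2):
    lam = []  # λ: (score, state, alternative transition) triples
    state = ts.initial_state(sentence)
    while not ts.is_terminal(state):
        kbest = classifier.kbest_within_margin(state, k)  # [(score, transition), ...]
        if len(kbest) > 1:  # low confidence: remember the runner-up transitions
            lam.extend((sc, state, t) for sc, t in kbest[1:])
        state = ts.apply(state, kbest[0][1])  # follow the one-best transition
    finished = [state]  # terminal state of the one-best sequence

    # Branch from at most b-1 highest-scoring saved pairs; each branch reuses
    # the one-best sequence's states up to its branch point.
    lam.sort(key=lambda x: x[0], reverse=True)
    for _, branch_state, t in lam[:b - 1]:
        s = ts.apply(branch_state, t)
        while not ts.is_terminal(s):  # finish the branch greedily
            _, best = classifier.kbest_within_margin(s, 1)[0]
            s = ts.apply(s, best)
        finished.append(s)
    # If no decision was low confident, only the one greedy tree is built.
    return max(finished, key=lambda st: st.total_score).tree()
```

The beam size thus adapts per sentence: easy sentences with no low-confidence decisions cost exactly one greedy pass.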
Low Confidence Transition

• Let C_1 be a classifier that finds the highest-scoring transition given the parsing state x:

  C_1(x) = \arg\max_{y \in Y} f(x, y), \quad f(x, y) = \frac{\exp(w \cdot \Phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \Phi(x, y'))}

• Let C_k be a classifier that finds the k highest-scoring transitions given the parsing state x and the margin m:

  C_k(x, m) = k\text{-}\arg\max_{y \in Y} f(x, y) \quad \text{s.t.} \quad f(x, C_1(x)) - f(x, y) \leq m

• The highest-scoring transition C_1(x) is low confident if |C_k(x, m)| > 1.
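These definitions translate directly into code. A runnable sketch for a multinomial logistic regression model, where `weights` (one row per transition) and `phi_x` (the feature vector Φ of state x) are assumptions made for the example, as is the margin value:

```python
import numpy as np

def f(weights, phi_x):
    scores = weights @ phi_x
    e = np.exp(scores - scores.max())  # numerically stabilized softmax
    return e / e.sum()                 # f(x, y) for every transition y

def c_k(weights, phi_x, k, m):
    probs = f(weights, phi_x)
    order = np.argsort(-probs)         # transitions by descending f(x, y)
    best = probs[order[0]]             # f(x, C_1(x))
    # Keep up to k transitions whose score is within margin m of the best.
    return [y for y in order[:k] if best - probs[y] <= m]

def is_low_confidence(weights, phi_x, k=2, m=0.1):  # m = 0.1 is illustrative
    # C_1(x) is low confident iff more than one transition survives the margin.
    return len(c_k(weights, phi_x, k, m)) > 1
```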
Experiments

• Parsing algorithm (Choi & McCallum, 2013)
  - Hybrid between Nivre's arc-eager and list-based algorithms.
  - Projective parsing: O(n).
  - Non-projective parsing: expected linear time.
• Features
  - Rich non-local features from Zhang & Nivre, 2011.
  - For languages with coarse-grained POS tags, feature templates using fine-grained POS tags are replicated.
  - For languages with morphological features, the morphologies of σ[0] and β[0] are used as unigram features.
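A hypothetical sketch of the two feature rules above; the template names and token accessors (`coarse_pos`, `fine_pos`, `form`, `morph`) are invented for illustration and do not reflect ClearNLP's actual feature templates.

```python
def extract_features(state):
    s0, b0 = state.stack_top(), state.buffer_front()  # σ[0] and β[0]
    feats = []
    # Replicate each coarse-POS template with the fine-grained POS tag.
    for name, pos in (("cpos", lambda t: t.coarse_pos),
                      ("fpos", lambda t: t.fine_pos)):
        feats.append(f"s0.{name}={pos(s0)}")
        feats.append(f"s0.{name}|b0.form={pos(s0)}|{b0.form}")
    # Morphological features of σ[0] and β[0] as unigram features.
    feats += [f"s0.morph={m}" for m in s0.morph]
    feats += [f"b0.morph={m}" for m in b0.morph]
    return feats
```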
Number of Transitions

• Number of transitions performed with respect to beam sizes.

[Figure: total number of transitions (y-axis, 0 to 1,200,000) plotted against beam size (x-axis, 0 to 80), for beam sizes 1, 2, 4, 8, 16, 32, 64, and 80.]
Projective Parsing

• The benchmark setup using WSJ. (b_t: beam size used in training; b_d: beam size used in decoding; Time in seconds per sentence.)

Approach          UAS    LAS    Time
b_t=80, b_d=80    92.96  91.93  0.009
b_t=80, b_d=64    92.96  91.93  0.009
b_t=80, b_d=32    92.96  91.94  0.009
b_t=80, b_d=16    92.96  91.94  0.008
b_t=80, b_d=8     92.89  91.87  0.006
b_t=80, b_d=4     92.76  91.76  0.004
b_t=80, b_d=2     92.56  91.54  0.003
b_t=80, b_d=1     92.26  91.25  0.002
b_t=1,  b_d=1     92.06  91.05  0.002
Projective Parsing

• The benchmark setup using WSJ.

Approach                   UAS    LAS    Time
b_t=80, b_d=80             92.96  91.93  0.009
Zhang & Clark, 2008        92.1   -      -
Huang & Sagae, 2010        92.1   -      0.04
Zhang & Nivre, 2011        92.9   91.8   0.03
Bohnet & Nivre, 2012       93.38  92.44  0.4
McDonald et al., 2005      90.9   -      -
McDonald & Pereira, 2006   91.5   -      -
Sagae & Lavie, 2006        92.7   -      -
Koo & Collins, 2010        93.04  -      -
Zhang & McDonald, 2012     93.06  91.86  -
Martins et al., 2010       93.26  -      -
Rush et al., 2010          93.8   -      -
Non-projective Parsing

• CoNLL-X shared task data.

                            Danish        Dutch         Slovene       Swedish
Approach                    LAS    UAS    LAS    UAS    LAS    UAS    LAS    UAS
b_t=80, b_d=80              87.27  91.36  82.45  85.33  77.46  84.65  86.8   91.36
b_t=80, b_d=1               86.75  91.04  80.75  83.59  75.66  83.29  86.32  91.12
Nivre et al., 2006          84.77  89.8   78.59  81.35  70.3   78.72  84.58  89.5
McDonald et al., 2006       84.79  90.58  79.19  83.57  73.44  83.17  82.55  88.93
Nivre, 2009                 84.2   -      -      -      75.2   -      -      -
F.-Gonz. & G.-Rodr., 2012   85.17  90.1   -      -      -      -      83.55  89.3
Nivre & McDonald, 2008      86.67  -      81.63  -      75.94  -      84.66  -
Martins et al., 2010        -      91.5   -      84.91  -      85.53  -      89.8
SPMRL 2013 Shared Task

• Baseline results provided by ClearNLP.

                 5K                     Full
Language    LAS    UAS    LS       LAS    UAS    LS
Arabic      81.72  84.46  93.41    84.19  86.48  94.43
Basque      78.01  84.62  82.71    79.16  85.32  83.63
French      73.39  85.3   81.42    74.51  86.41  82
German      82.58  85.36  90.49    86.73  88.8   92.95
Hebrew      75.09  81.74  82.84    -      -      -
Hungarian   81.98  86.09  88.26    82.68  86.56  88.8
Korean      76.28  80.39  87.32    83.55  86.82  92.39
Polish      80.64  88.49  86.47    81.12  89.24  86.59
Swedish     80.96  86.48  85.1     -      -      -
Conclusion

• Selectional branching
  - Uses confidence estimates to decide when to employ a beam.
  - Shows accuracy comparable to traditional beam search.
  - Runs faster than other non-greedy parsing approaches.
• ClearNLP
  - Provides several NLP tools, including a morphological analyzer, a dependency parser, a semantic role labeler, etc.
  - Webpage: clearnlp.com