Higher-order Coreference Resolution with Coarse-to-fine Inference. Kenton Lee*, Luheng He, Luke Zettlemoyer. University of Washington. (*Now at Google) 1
Coreference Resolution It’s because of what both of you are doing to have things change. I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Absolutely. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well. Example from Wiseman et al. (2016) 2
Recent Trends in Coreference Resolution: End-to-end models have achieved large improvements. Advantages: conceptually simple; minimal feature engineering. Disadvantages: computationally expensive; very little “reasoning” involved. 5
Contributions • Address a modeling challenge: • Enable higher-order (multi-hop) coreference • Address a computational challenge: • Coarse-to-fine inference with a factored model 6
Existing Approach: Span-ranking Model (Lee et al., 2017, EMNLP). Consider all possible spans i in the document (1 ≤ i ≤ n). Compute neural span representations h(i). Estimate a probability distribution P(y_i | h) over possible antecedents, including a dummy antecedent ε. 8
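For illustration, a minimal numpy sketch of the antecedent distribution described on this slide. The span representations, the pairwise scorer, and all names here are assumptions for exposition, not the released implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def antecedent_distribution(h, score_fn, i):
    """P(y_i | h): distribution of span i over antecedents j < i plus a dummy ε.

    h        -- span representations, shape (num_spans, dim)
    score_fn -- any pairwise scoring function s(h_i, h_j); assumed given
    i        -- index of the span whose antecedents we score
    """
    # One score per earlier span; the dummy antecedent ε is fixed to score 0,
    # following the span-ranking formulation of Lee et al. (2017).
    scores = np.array([score_fn(h[i], h[j]) for j in range(i)] + [0.0])
    return softmax(scores)  # last entry is P(y_i = ε | h)
```

The dummy antecedent ε lets the model predict that a span starts a new cluster (or is not a mention at all) rather than forcing a link to an earlier span.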
Limitations of a First-Order Model: local information is not sufficient. I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Absolutely. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well. Example from Wiseman et al. (2016) 9
Limitations of a First-Order Model: global structure reveals the inconsistency. I think that’s what’s… Go ahead Linda. Thanks goes to you and to the media to help us. Absolutely. Obviously we couldn’t seem loud enough to bring the attention, so our hat is off to all of you as well. Example from Wiseman et al. (2016) 10
Higher-order Model: Let span representations softly condition on previous decisions. 11
Higher-order Model: Let span representations softly condition on previous decisions. For each iteration: estimate the antecedent distribution, attend over possible antecedents, and merge every span representation with its expected antecedent. 12
Higher-order Model: For the span “all of you” in “…so our hat is off to all of you as well.”, the model estimates P(y_{all of you} | h) over the candidate antecedents I, Linda, you, and ε. 13
Higher-order Model: For the span “you” in “Thanks goes to you and to the media to help us.”, it estimates P(y_{you} | h) over the candidates I, Linda, and ε. 14
Higher-order Model: Attending over these antecedents learns a representation of “you” w.r.t. “I”. 15
Higher-order Model: The distribution P(y_{all of you} | h) is then revisited with this refined information. 16
Higher-order Model: With the updated span representation h′, the model produces a new distribution P(y_{all of you} | h′) over the same candidates I, Linda, you, and ε. 17
Higher-order Model: Let span representations softly condition on previous decisions. Iterative inference is used to compute h_n(i). 18
Higher-order Model: Iterative inference to compute h_n(i). • Base case: h_0(i) = h(i) (from the baseline) • Recursive case: a_n(i) = Σ_{y_i} P(y_i | h_{n−1}) · h_{n−1}(y_i) (attention mechanism); f_n(i) = σ(W [a_n(i), h_{n−1}(i)]) (forget gates); h_n(i) = f_n(i) ∘ a_n(i) + (1 − f_n(i)) ∘ h_{n−1}(i) • Final result: P(y_i | h_n). The final coreference decision conditions on clusters of size n + 2. 24
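A numpy sketch of one such refinement iteration, under the same illustrative assumptions as the earlier sketch. The gate parameters W and the handling of the dummy antecedent ε here are stand-ins, not the exact released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine_spans(h, antecedent_probs, W):
    """One higher-order iteration: h_{n-1} -> h_n for every span.

    h                -- span representations h_{n-1}, shape (num_spans, dim)
    antecedent_probs -- antecedent_probs[i] = P(y_i | h_{n-1}) over the i earlier
                        spans, with the dummy ε in the last slot
    W                -- learned gate parameters, shape (dim, 2 * dim)
    """
    h_new = h.copy()
    for i in range(len(h)):
        probs = antecedent_probs[i]
        # Expected antecedent a_n(i): attention over earlier spans. The ε slot
        # contributes h[i] itself here, so spans without an antecedent keep
        # their own representation (an illustrative choice for this sketch).
        candidates = np.vstack([h[:i], h[i:i + 1]])
        a = probs @ candidates                       # a_n(i)
        f = sigmoid(W @ np.concatenate([a, h[i]]))   # f_n(i), forget gate
        h_new[i] = f * a + (1.0 - f) * h[i]          # gated interpolation
    return h_new
```

The antecedent_probs could come from the antecedent_distribution sketch above, recomputed from the refined representations at each iteration, so that later decisions softly condition on earlier ones.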
Recent Trends in Coreference Resolution: End-to-end models have achieved large improvements. Advantages: conceptually simple; minimal feature engineering. Disadvantages: computationally expensive; very little “reasoning” involved. A 2nd-order model already runs out of memory. 26
Contributions • Address a modeling challenge: • Enable higher-order (multi-hop) coreference • Address a computational challenge: • Coarse-to-fine inference with a factored model 27
Computational Challenge: “It’s because of what both of you are doing to have things change.” (mention candidates shown just for exposition) • O(n²) spans to consider in practice • O(n⁴) coreference links to consider 30
Coarse-to-fine Inference: P(y_i | h) = softmax(s(i, y_i, h)). Existing scoring function: s(i, j, h) = FFNN(h(i)) + FFNN(h(j)) (mention scores) + FFNN([h(i), h(j), h(i) ∘ h(j)]) (antecedent scores) 32
Coarse-to-fine Inference: P(y_i | h) = softmax(s(i, y_i, h)). Coarse-to-fine scoring function: s(i, j, h) = FFNN(h(i)) + FFNN(h(j)) (mention scores) + h(i)ᵀ W_c h(j) (cheap/inaccurate antecedent scores) + FFNN([h(i), h(j), h(i) ∘ h(j)]) (antecedent scores) 33
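A numpy sketch of how the cheap bilinear term can prune antecedents before the expensive FFNN term is applied. The beam size K, the weight matrix W_c, and the scorers are illustrative stand-ins for the learned model, and the exact pruning schedule may differ from the released system.

```python
import numpy as np

def coarse_to_fine_antecedents(h, mention_scores, W_c, fine_score_fn, K=50):
    """For each span i, keep only the top-K antecedents under the cheap score,
    then apply the expensive antecedent scorer to the survivors.

    h              -- span representations, shape (num_spans, dim)
    mention_scores -- s_m(i) for every span, shape (num_spans,)
    W_c            -- bilinear weights for the coarse score, shape (dim, dim)
    fine_score_fn  -- expensive scorer s_a(h_i, h_j), e.g. an FFNN; assumed given
    K              -- number of candidate antecedents kept per span
    """
    num_spans = len(h)
    coarse = h @ W_c @ h.T                 # coarse[i, j] = h(i)^T W_c h(j)
    results = {}
    for i in range(num_spans):
        # Cheap part of the score for all earlier spans.
        cheap = mention_scores[i] + mention_scores[:i] + coarse[i, :i]
        keep = np.argsort(-cheap)[:K]      # top-K candidate antecedents
        # Full score only for the surviving candidates.
        results[i] = {int(j): cheap[j] + fine_score_fn(h[i], h[j]) for j in keep}
    return results
```

Only the surviving K antecedents per span ever reach the expensive FFNN term, which keeps the number of pairwise evaluations (and memory) manageable enough to make higher-order inference practical.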