A Neural Attention Model for Abstractive Sentence Summarization

Alexander Rush, Sumit Chopra, Jason Weston
Facebook AI Research / Harvard SEAS
Sentence Summarization

Source: Russian Defense Minister Ivanov called Sunday for the creation of a joint front for combating global terrorism.

Target: Russia calls for joint front against terrorism.

Summarization phenomena:
- Generalization
- Deletion
- Paraphrase
Types of Sentence Summary [Not Standardized]

Compressive (deletion only): "Russian Defense Minister Ivanov called Sunday for the creation of a joint front for combating global terrorism."

Extractive (deletion and reordering)

Abstractive (arbitrary transformation): "Russia calls for joint front against terrorism."
Elements of Human Summary [Jing 2002]

Phenomenon                            Abstract   Compress   Extract
(1) Sentence Reduction                   ✓          ✓          ✓
(2) Sentence Combination                 ✓          ✓          ✓
(3) Syntactic Transformation             ✓          ✓
(4) Lexical Paraphrasing                 ✓
(5) Generalization or Specification      ✓
(6) Reordering                           ✓                     ✓
Related Work: Extractive/Abstractive Sentence Summary

- Syntax-based [Dorr, Zajic, and Schwartz 2003; Cohn and Lapata 2008; Woodsend, Feng, and Lapata 2010]
- Topic-based [Zajic, Dorr, and Schwartz 2004]
- Machine translation-based [Banko, Mittal, and Witbrock 2000]
- Semantics-based [Liu et al. 2015]
Related Work: Attention-Based Neural MT [Bahdanau, Cho, and Bengio 2014]

- Uses attention ("soft alignment") over the source to determine the next word.
- Robust to longer sentences compared to encoder-decoder style models.
- No explicit alignment step; trained end-to-end.
A Neural Attention Model for Summarization

Question: Can a data-driven model capture the abstractive phenomena necessary for summarization without explicit representations?

Properties:
- Uses a simple attention-based neural conditional language model.
- No syntax or other pipelining steps; strictly data-driven.
- Generation is fully abstractive.
Attention-Based Summarization (ABS)
Model
Summarization Model

Notation:
- x: source sentence of length M
- y: summarized sentence of length N (we assume N is given), with M >> N

Past work: noisy-channel summarization [Knight and Marcu 2002]

\arg\max_y \log p(y \mid x) = \arg\max_y \log p(y)\, p(x \mid y)

Neural machine translation: direct neural-network parameterization

p(y_{i+1} \mid y_c, x; \theta) \propto \exp(\mathrm{NN}(x, y_c; \theta))

where y_{i+1} is the current word and y_c is the context. Most neural MT is non-Markovian, i.e., y_c is the full history (RNN, LSTM) [Kalchbrenner and Blunsom 2013; Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014].
Feed-Forward Neural Language Model [Bengio et al. 2003]

\tilde{y}_c = [E y_{i-C+1}, \ldots, E y_i]
h = \tanh(U \tilde{y}_c)
p(y_{i+1} \mid y_c, x; \theta) \propto \exp(V h)

Conditioning on the source through an encoder module src adds one term:

p(y_{i+1} \mid y_c, x; \theta) \propto \exp(V h + W\, \mathrm{src}(x, y_c))
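As a concrete illustration, here is a minimal numpy sketch of one decoding step of this model. It is not the authors' code; the shapes and the generic src encoder argument are assumptions read off the equations above.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())      # shift for numerical stability
        return e / e.sum()

    def nnlm_step(y_ctx, x, E, U, V, W, src):
        """One step: p(y_{i+1} | y_c, x; theta).
        y_ctx: C context word ids; E: target embeddings (vocab x d);
        U, V, W: parameters; src: encoder function src(x, y_ctx)."""
        y_tilde = np.concatenate([E[w] for w in y_ctx])  # [E y_{i-C+1}, ..., E y_i]
        h = np.tanh(U @ y_tilde)                         # hidden layer
        return softmax(V @ h + W @ src(x, y_ctx))        # distribution over the vocab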
Source Model 1: Bag-of-Words

\tilde{x} = [F x_1, \ldots, F x_M]
p = [1/M, \ldots, 1/M]   [uniform distribution]
\mathrm{src}_1(x, y_c) = p^\top \tilde{x}
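A sketch of src_1 under the same assumed shapes (F is the source-side embedding matrix):

    import numpy as np

    def src_bow(x, y_ctx, F):
        """Bag-of-words encoder: uniform average of source word embeddings;
        the context y_ctx is ignored."""
        x_tilde = np.stack([F[w] for w in x])    # M x d
        p = np.full(len(x), 1.0 / len(x))        # uniform attention p = [1/M, ..., 1/M]
        return p @ x_tilde                       # p^T x_tilde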
Source Model 2: Convolutional Model
Source Model 3: Attention-Based Model

\tilde{x} = [F x_1, \ldots, F x_M]
\tilde{y}'_c = [G y_{i-C+1}, \ldots, G y_i]
p \propto \exp(\tilde{x} P \tilde{y}'_c)   [attention distribution]
\bar{x}_i = \sum_{q=i-(Q-1)/2}^{i+(Q-1)/2} \tilde{x}_q / Q \quad \forall i   [local smoothing]
\mathrm{src}_3(x, y_c) = p^\top \bar{x}
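A sketch of src_3, again with assumed shapes; the boundary handling of the smoothing window and the default Q are implementation choices not specified on the slide.

    import numpy as np

    def src_attention(x, y_ctx, F, G, P, Q=5):
        """Attention encoder: context-dependent weighted average of locally
        smoothed source embeddings. Q is an (assumed odd) smoothing width."""
        x_tilde = np.stack([F[w] for w in x])            # M x d
        y_prime = np.concatenate([G[w] for w in y_ctx])  # C * d'
        scores = x_tilde @ P @ y_prime                   # x_tilde P y'_c, one score per position
        p = np.exp(scores - scores.max())
        p /= p.sum()                                     # attention distribution
        half = (Q - 1) // 2                              # local smoothing window
        x_bar = np.stack([x_tilde[max(0, i - half): i + half + 1].mean(axis=0)
                          for i in range(len(x))])
        return p @ x_bar                                 # p^T x_bar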
ABS Example

At each step, the model attends over the source x and conditions on the context window y_c to generate the next word y_{i+1}:

y_c = [<s> Russia calls]                   →  for
y_c = [<s> Russia calls for]               →  joint
y_c = [<s> Russia calls for joint]         →  front
y_c = [Russia calls for joint front]       →  against
y_c = [calls for joint front against]      →  terrorism
y_c = [for joint front against terrorism]  →  .
Headline Generation Training Set [Graff et al. 2003; Napoles, Gormley, and Van Durme 2012]

Uses the Gigaword dataset:

Total Sentences                       3.8 M
Newswire Services                     7
Source Word Tokens                    119 M
Source Word Types                     110 K
Average Source Length                 31.3 tokens
Summary Word Tokens                   31 M
Summary Word Types                    69 K
Average Summary Length                8.3 tokens
Average Overlap                       4.6 tokens
Average Overlap in first 75 chars     2.6 tokens

For comparison, [Filippova and Altun 2013] use 250K compressive pairs (although Filippova et al. 2015 use 2 million).

Training is done with mini-batch stochastic gradient descent.
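Training minimizes the negative log-likelihood of the gold summary words. A schematic of the mini-batch objective, reusing a step function like nnlm_step above; the start-symbol padding is an assumption.

    import numpy as np

    def batch_nll(batch, step_fn, C, start_id=0):
        """Mini-batch NLL: -sum_i log p(y_{i+1} | y_c, x) over gold pairs.
        batch: list of (source ids, summary ids);
        step_fn(y_ctx, x) -> probability vector over the vocabulary."""
        loss = 0.0
        for x, y in batch:
            padded = [start_id] * C + list(y)       # pad the context with start symbols
            for i in range(len(y)):
                p = step_fn(padded[i:i + C], x)     # context window of size C
                loss -= np.log(p[y[i]])             # gold next word
        return loss
    # theta is then updated with mini-batch stochastic gradient descent on this loss.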
Generation: Beam Search

Example beam hypotheses:
  russia calls for joint
  defense minister calls joint
  joint front calls terrorism
  russia calls for terrorism
  ...

The Markov assumption (fixed context window) allows for hypothesis recombination.
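A minimal beam-search sketch; the names and pruning details are illustrative, but the recombination step relies on exactly the Markov property mentioned above (only the last C words of a hypothesis affect its future scores).

    import numpy as np

    def beam_search(x, step_fn, N, K=5, C=3, start_id=0):
        """step_fn(y_ctx, x) -> probability vector over the vocabulary.
        Returns the best length-N summary under the beam approximation."""
        beam = [([start_id] * C, 0.0)]                   # (padded hypothesis, log-prob)
        for _ in range(N):
            cands = []
            for hyp, score in beam:
                p = step_fn(hyp[-C:], x)                 # Markov: last C words suffice
                for w in np.argsort(p)[-K:]:             # top-K extensions
                    cands.append((hyp + [int(w)], score + float(np.log(p[w]))))
            best = {}
            for hyp, score in cands:                     # recombination: hypotheses with the
                key = tuple(hyp[-C:])                    # same last C words share all future
                if key not in best or score > best[key][1]:
                    best[key] = (hyp, score)             # scores, so keep only the best one
            beam = sorted(best.values(), key=lambda t: t[1])[-K:]
        return max(beam, key=lambda t: t[1])[0][C:]      # strip the start padding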
Extension: Extractive Tuning

Low-dimensional word embeddings are unaware of exact matches.

Log-linear parameterization:

p(y \mid x; \theta, \alpha) \propto \exp\left( \alpha^\top \sum_{i=0}^{N-1} f(y_{i+1}, x, y_c) \right)

Features f:
1. Model score (neural model)
2. Unigram overlap
3. Bigram overlap
4. Trigram overlap
5. Word out-of-order

Similar to the rare-word issue in neural MT [Luong et al. 2015]. α is estimated with MERT as a post-processing step (not end-to-end).
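A sketch of these features on word-id sequences; the exact definitions (in particular the out-of-order feature) are assumptions, and the weight vector alpha over them would be fit with MERT.

    import numpy as np

    def ngram_overlap(y, x, n):
        """Number of n-grams of the candidate y that also occur in the source x."""
        src = {tuple(x[i:i + n]) for i in range(len(x) - n + 1)}
        return sum(tuple(y[i:i + n]) in src for i in range(len(y) - n + 1))

    def out_of_order(y, x):
        """Rough proxy: adjacent candidate words that appear in x in reversed order."""
        pos = {w: i for i, w in enumerate(x)}
        return sum(1 for a, b in zip(y, y[1:])
                   if a in pos and b in pos and pos[a] > pos[b])

    def features(y, x, model_logprob):
        return np.array([model_logprob,            # 1. neural model score
                         ngram_overlap(y, x, 1),   # 2. unigram overlap
                         ngram_overlap(y, x, 2),   # 3. bigram overlap
                         ngram_overlap(y, x, 3),   # 4. trigram overlap
                         out_of_order(y, x)])      # 5. word out-of-order
    # A candidate is then scored as alpha @ features(y, x, model_logprob).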
Results