User-focused Multi-document Summarization with Paragraph Clustering and Sentence-type Filtering
Yohei Seki †, Koji Eguchi †,††, and Noriko Kando †,††
† The Graduate University for Advanced Studies
†† National Institute of Informatics
NTCIR Workshop 4 Meeting, June 2, 2004
Talk Outline
1. Objective: User-focused Summarization
2. Analysis: Compare Paragraph Clustering-based Summarization Strategies
3. Proposal: Responsiveness Improvement with Sentence-type Filtering for each Cluster
4. Conclusions
Objective: User-focused Summarization
• Two goals
1. User-focused interactive summarization for topical requirements
   • Approach: Paragraph Clustering-based Summarization
2. To produce knowledge-focused summaries (evaluated with question-answering responsiveness)
   • Approach: Sentence-type Filtering
( = Topic + Summary Viewpoint Type )-Specified Summarization
[Figure: Sentences are extracted from the document sets according to topics + summary types. User A (Opinion!) and User B (Knowledge!) receive different summaries by different information needs; extraction marked × does not match information needs.]
Multi-Document Summarization with Document Clustering
• "Document clustering techniques" partition a set of objects into clusters
• Closely associated documents tend to be relevant to the same request [cluster hypothesis]
• Extract one or two representative elements (sentences) from each cluster to produce summaries
• Topical Requirements: Select sentences from clusters in an order similar to queries
Comparison: Paragraph Clustering-based Summarization Strategies
• Six clustering options
1. Cluster units
2. Features and Cluster Similarities
3. Clustering algorithm
4. Cluster size
5. Sentence extraction clues
6. Queries
1. Cluster Units: Paragraph
Related Work: Clustering for Summarization
• Stein et al. (1999): Cluster source documents by single document summaries
• M. Moens (2000): Cluster source documents by paragraph units
• Boros et al. (2001): Cluster source documents by sentence units
Our approach (interactive summarization)
• Sentence features were too sparse to make feature vectors
• Document sizes were too small compared to summary sizes
⇒ Cluster source documents by paragraph units
2. Feature and Cluster Distance
Vector-length normalization does not work well for short documents (paragraphs in this research).
1. Feature vector
   • Normalized term frequency vs. unnormalized (raw) term frequency
2. Cluster distance measure
   • Euclidean vs. cosine

  Distance measure   Euclidean   1 - cos θ   Euclidean
  Feature vector     TF                      Normalized TF
  Coverage           0.358       0.307       0.317
  Precision          0.522       0.398       0.429

Unnormalized TF with Euclidean distance performed significantly better.
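The two choices compared on this slide can be illustrated with a minimal sketch. It assumes that "normalized" means scaling each term-frequency vector to unit Euclidean length; the slide does not state the exact normalization, and the helper names (`tf_vector`, `euclidean`, `cosine_distance`) and the toy vocabulary are hypothetical.

```python
import math
from collections import Counter


def tf_vector(tokens, vocabulary, normalize=False):
    """Build a term-frequency vector over `vocabulary`.

    With normalize=True the vector is scaled to unit Euclidean length
    (an assumed normalization, not confirmed by the slides).
    """
    counts = Counter(tokens)
    vec = [float(counts[term]) for term in vocabulary]
    if normalize:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            vec = [x / norm for x in vec]
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cos(theta); defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Two toy paragraphs with the same term proportions but different lengths.
vocab = ["summary", "cluster", "query"]
short_par = ["summary", "cluster"]
long_par = ["summary", "summary", "cluster", "cluster"]

raw_short = tf_vector(short_par, vocab)
raw_long = tf_vector(long_par, vocab)
norm_short = tf_vector(short_par, vocab, normalize=True)
norm_long = tf_vector(long_par, vocab, normalize=True)

d_raw_euclid = euclidean(raw_short, raw_long)
d_raw_cosine = cosine_distance(raw_short, raw_long)
d_norm_euclid = euclidean(norm_short, norm_long)
```

In this toy case, raw TF with Euclidean distance keeps the two paragraphs apart because their lengths differ, while both the cosine distance and the length-normalized vectors treat them as identical. This only illustrates what information each setting discards; it does not by itself explain the reported scores.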
3. Cluster Algorithm: Ward's Method
Compare three agglomerative clustering methods: complete-link, group-average, and Ward's method

               Complete Link   Group Average   Ward's method
  Coverage     0.358           0.314           0.364
  Precision    0.522           0.499           0.518

Summaries produced with Ward's method performed significantly better than those produced with the group-average method.
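For readers unfamiliar with Ward's method, the following is a minimal pure-Python sketch of agglomerative clustering with the standard Ward merge criterion (merge the pair of clusters whose union least increases the within-cluster sum of squares). It is a naive illustration, not the authors' implementation; the function names and the toy paragraph vectors are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    squared_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_dist


def ward_clustering(vectors, n_clusters):
    """Merge clusters greedily until `n_clusters` remain.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None  # (cost, i, j) of the cheapest merge found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost([vectors[k] for k in clusters[i]],
                                 [vectors[k] for k in clusters[j]])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters


# Toy paragraph vectors: three lean toward term 0, two toward term 1.
paragraph_vectors = [[2.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 3.0], [2.0, 1.0]]
clusters = ward_clustering(paragraph_vectors, 2)
```

Complete-link and group-average differ only in the merge criterion: they use the maximum and the mean pairwise distance between members, respectively, in place of `ward_cost`.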
4. Cluster Size
Change cluster size according to number of sentences extracted

  Cluster # for Long Summaries    × 1     × 1.5   × 2
  Cluster # for Short Summaries   × 1.5   × 2     × 2.5
  Coverage                        0.364   0.357   0.353
  Precision                       0.518   0.543   0.565

A small cluster size performed better, but the improvement was not significant.
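One reading of the table above is that the number of clusters is the number of sentences to extract times a multiplier. The sketch below assumes that reading; the function name and the round-up rule for fractional results are assumptions, since the slides do not say how fractions were handled.

```python
import math


def cluster_count(num_sentences, multiplier):
    """Number of clusters for a summary of `num_sentences` sentences.

    Rounding up and the lower bound of 1 are assumptions made for
    this illustration.
    """
    return max(1, math.ceil(num_sentences * multiplier))


# Hypothetical example: a long summary of 10 sentences, a short one of 7.
long_summary_clusters = [cluster_count(10, m) for m in (1.0, 1.5, 2.0)]
short_summary_clusters = [cluster_count(7, m) for m in (1.5, 2.0, 2.5)]
```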
5. Sentence Extraction Clues
Compare summarization with three sentence extraction clues:

  Title            Yes     Yes     No      Yes
  Term Frequency   Yes     Yes     Yes     No
  Position         No      Yes     No      No
  Coverage         0.339   0.322   0.338   0.315
  Precision        0.614   0.606   0.613   0.623

Position weighting did not work well. Title weighting effect was not clear. Term Frequency performed well.
6. Queries
Compare cluster ordering using Queries and cluster ordering using Total Frequencies

  Cluster Ordering   Similarity to Queries   Similarity to Total Frequencies
  Coverage           0.364                   0.337
  Precision          0.518                   0.45

With queries, coverage improved by 0.02 to 0.03.
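The two cluster orderings compared above can be sketched as one function applied to two different reference vectors. This is a minimal sketch under stated assumptions: cosine similarity is assumed as the similarity measure, "total frequencies" is taken to be the sum of all cluster term-frequency vectors, and all names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def order_clusters(cluster_vectors, reference):
    """Return cluster indices, most similar to `reference` first."""
    return sorted(range(len(cluster_vectors)),
                  key=lambda i: cosine(cluster_vectors[i], reference),
                  reverse=True)


def total_frequencies(cluster_vectors):
    """Sum the cluster vectors term by term."""
    return [sum(column) for column in zip(*cluster_vectors)]


# Toy term-frequency vectors for three clusters over a 3-term vocabulary.
cluster_vectors = [[4, 1, 0], [0, 1, 3], [1, 5, 1]]
query_vector = [0, 0, 1]  # the query mentions only the third term

order_by_query = order_clusters(cluster_vectors, query_vector)
order_by_total = order_clusters(cluster_vectors,
                                total_frequencies(cluster_vectors))
```

With the query as reference, the cluster dominated by the query term is visited first; with total frequencies, the cluster closest to the overall term distribution comes first, regardless of what the user asked for.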
Five Sentence-types to Improve User's Requirements
We annotate five sentence-types automatically.
Two Topical Types
• Main Description
• Elaboration
Three Functional Types
• Background
• Opinion
• Prospective
Sentence-type Filtering with Paragraph Clustering-based Summarization
1. The most heavily weighted sentence in each cluster was extracted.
2. For the second and third most heavily weighted sentences in each cluster, the sentence-type information was checked.
   A) The sentence type was checked for redundancy against the most heavily weighted sentence in the same cluster.
   B) If the sentence type was not redundant, the sentence was extracted to produce summaries.
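The filtering steps above can be sketched as follows. This is a minimal sketch, assuming each sentence carries exactly one sentence-type label and that "redundant" means "same type as the top-weighted sentence in the cluster"; the tuple layout, the function name, and the sample cluster are hypothetical.

```python
def filter_cluster(sentences, max_extra=2):
    """Select sentences from one cluster using sentence-type filtering.

    `sentences` is a list of (text, weight, sentence_type) tuples.
    The top-weighted sentence is always kept; the next `max_extra`
    sentences (second and third by default) are kept only if their
    type differs from the top sentence's type.
    """
    ranked = sorted(sentences, key=lambda s: s[1], reverse=True)
    if not ranked:
        return []
    top = ranked[0]
    selected = [top]
    for candidate in ranked[1:1 + max_extra]:
        if candidate[2] != top[2]:  # type is not redundant with the top sentence
            selected.append(candidate)
    return [text for text, _, _ in selected]


# Hypothetical cluster with made-up weights and type labels.
cluster = [
    ("Sentence A", 0.9, "main description"),
    ("Sentence B", 0.7, "main description"),
    ("Sentence C", 0.6, "opinion"),
    ("Sentence D", 0.5, "background"),
]
summary_sentences = filter_cluster(cluster)
```

Here Sentence B is dropped because it repeats the type of Sentence A, Sentence C is kept because it adds an opinion, and Sentence D is never considered because it ranks fourth.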