Text Summarization Using A Trainable Summarizer and Latent Semantic Analysis
Jen-Yuan Yeh¹, Hao-Ren Ke², and Wei-Pang Yang¹
¹ Department of Computer & Information Science, National Chiao-Tung University, Taiwan, R.O.C.
² Digital Library & Information Section of Library, National Chiao-Tung University, Taiwan, R.O.C.
Outline
• Introduction and related work
• Modified Corpus-based Approach (MCBA)
• LSA-based Text Relationship Map approach (LSA + T.R.M.)
• Evaluation
• Conclusion
Text Summarization
• The process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks) [Mani & Bloedorn, 1999].
[Figure: the summarization process — Documents → Analysis → Transformation → Synthesis → Summaries, controlled by a compression ratio]
Corpus-based Approach: A Trainable Document Summarizer [Kupiec et al., 1995]
[Figure: training phase — a Labeler and a Feature Extractor turn the training corpus into labeled feature vectors, from which a Learning Algorithm derives rules; test phase — Rule Application to the test corpus yields a machine-generated summary]
• The probability that a sentence s belongs to the summary S, given its feature values f₁, f₂, ..., f_k:
$$P(s \in S \mid f_1, f_2, \ldots, f_k) = \frac{P(s \in S)\,\prod_{j=1}^{k} P(f_j \mid s \in S)}{\prod_{j=1}^{k} P(f_j)}$$
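To make the classification rule concrete, here is a minimal sketch of the Bayes formula above. The prior and the two probability tables are hypothetical stand-ins for statistics that Kupiec et al. estimate from a labeled training corpus; the feature names are illustrative.

```python
import math

def p_in_summary(features, prior, p_feat_given_summary, p_feat):
    """Score from Kupiec et al.'s formula:
    P(s in S | f1..fk) = P(s in S) * prod_j P(fj | s in S) / prod_j P(fj).
    Computed in log space for numerical stability and used as a
    ranking score over candidate sentences."""
    log_p = math.log(prior)
    for name, value in features.items():
        log_p += math.log(p_feat_given_summary[name][value])
        log_p -= math.log(p_feat[name][value])
    return math.exp(log_p)

# Hypothetical estimates for two binary features.
prior = 0.2  # fraction of training sentences labeled as summary-worthy
p_feat_given_summary = {"in_first_para": {True: 0.6, False: 0.4},
                        "has_cue_phrase": {True: 0.5, False: 0.5}}
p_feat = {"in_first_para": {True: 0.3, False: 0.7},
          "has_cue_phrase": {True: 0.3, False: 0.7}}

print(p_in_summary({"in_first_para": True, "has_cue_phrase": True},
                   prior, p_feat_given_summary, p_feat))
```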
Text Relationship Map (T.R.M.) Approach: Automatic Text Structuring and Summarization [Salton et al., 1997]
[Figure: a text relationship map — nodes P₁ ... P₁₁ annotated with their bushiness (number of links), e.g. P₁:6, P₈:9]
• Each node is represented as a term vector P_i = (k₁, k₂, ..., k_n).
• P_i and P_j are judged to be connected when their similarity is greater than the threshold:
$$Sim(P_i, P_j) = \frac{P_i \cdot P_j}{\lVert P_i \rVert\,\lVert P_j \rVert}$$
• Three heuristic methods: global bushy path, depth-first path, segmented bushy path.
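A minimal sketch of how such a map could be built and the global bushy path extracted. The raw bag-of-words vectors, the threshold value, and the greedy top-k selection are illustrative assumptions, not details prescribed by Salton et al.

```python
import math
from collections import Counter

def cosine(p, q):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(p[w] * q[w] for w in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def global_bushy_path(paragraphs, threshold=0.1, k=3):
    """Link nodes whose similarity exceeds `threshold`, then take the
    k bushiest nodes (most links) and return them in text order."""
    vecs = [Counter(p.lower().split()) for p in paragraphs]
    n = len(vecs)
    bushiness = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vecs[i], vecs[j]) > threshold:
                bushiness[i] += 1
                bushiness[j] += 1
    top = sorted(range(n), key=lambda i: bushiness[i], reverse=True)[:k]
    return [paragraphs[i] for i in sorted(top)]  # keep original order
```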
Modified Corpus-based Approach
• We use a score function to measure the significance of a sentence:
$$Score_{Overall}(s) = w_1 \cdot Score_{f_1}(s) + w_2 \cdot Score_{f_2}(s) - w_3 \cdot Score_{f_3}(s) + w_4 \cdot Score_{f_4}(s) + w_5 \cdot Score_{f_5}(s)$$
where f₁ represents "Position", f₂ "Positive Keyword", f₃ "Negative Keyword", f₄ "Centrality", f₅ "Resemblance to the Title", and w_i indicates the importance of each feature.
• By contrast, Kupiec et al. (1995) compute the probability that a sentence will be included in the summary:
$$P(s \in S \mid f_1, f_2, \ldots, f_k) = \frac{P(s \in S)\,\prod_{j=1}^{k} P(f_j \mid s \in S)}{\prod_{j=1}^{k} P(f_j)}$$
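A minimal sketch of the weighted score function, assuming the five per-feature scorers (defined on the following slides) are available as callables. Note that the negative-keyword score is subtracted.

```python
def overall_score(s, weights, scorers):
    """Score_Overall(s) = w1*f1 + w2*f2 - w3*f3 + w4*f4 + w5*f5.
    `scorers` holds the five feature functions in order (position,
    positive keyword, negative keyword, centrality, resemblance to
    the title); `weights` are the learned w_i."""
    w1, w2, w3, w4, w5 = weights
    f1, f2, f3, f4, f5 = (f(s) for f in scorers)
    return w1 * f1 + w2 * f2 - w3 * f3 + w4 * f4 + w5 * f5
```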
f₁: Position
• For a sentence s that comes from position P_iS_j (e.g., P₁S₁ indicates the first sentence of the first paragraph), the position score is defined as
$$Score_{f_1}(s) = P(s \in S \mid P_iS_j) \times \frac{\text{Average rank}(P_iS_j)}{R}$$
where R is a rank which implies the significance of each sentence position.
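One way this feature could be realized in code; the lookup tables mapping each position P_iS_j to its estimated in-summary probability and average rank are hypothetical corpus statistics, as is treating R as the number of ranks.

```python
def position_score(pos, p_in_summary, rank_table, R):
    """Score_f1(s) = P(s in S | PiSj) * average_rank(PiSj) / R.
    `pos` is a (paragraph, sentence) pair such as (1, 1) for P1S1;
    both tables are assumed to be estimated from the training corpus."""
    return p_in_summary[pos] * rank_table[pos] / R
```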
f₂: Positive Keyword
• For a sentence s, assume s contains Keyword₁, Keyword₂, ..., Keyword_n; the positive-keyword score is defined as
$$Score_{f_2}(s) = \frac{1}{length(s)} \sum_{i=1}^{n} tf_i \cdot P(s \in S \mid Keyword_i)$$
where tf_i is the occurrence frequency of Keyword_i in s.
f₃: Negative Keyword
• For a sentence s, assume s contains Keyword₁, Keyword₂, ..., Keyword_n; the negative-keyword score is defined as
$$Score_{f_3}(s) = \frac{1}{length(s)} \sum_{i=1}^{n} tf_i \cdot P(s \notin S \mid Keyword_i)$$
where tf_i is the occurrence frequency of Keyword_i in s.
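The two keyword features differ only in the conditional probability used. A minimal sketch of their shared form, assuming the per-keyword probabilities have been estimated from the training corpus:

```python
from collections import Counter

def keyword_score(sentence_words, p_keyword, length):
    """Shared form of Score_f2 / Score_f3:
    (1/length(s)) * sum_i tf_i * P(. | Keyword_i).
    `p_keyword` maps each keyword to P(s in S | kw) for the positive
    score, or to P(s not in S | kw) for the negative score."""
    tf = Counter(w for w in sentence_words if w in p_keyword)
    return sum(f * p_keyword[kw] for kw, f in tf.items()) / length
```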
f₄: Centrality
• For a sentence s, the score is defined as
$$Score_{f_4}(s) = \frac{|\text{Keywords in } s \,\cap\, \text{Keywords in other sentences}|}{|\text{Keywords in } s \,\cup\, \text{Keywords in other sentences}|}$$
f₅: Resemblance to the Title
• For a sentence s, the score is defined as
$$Score_{f_5}(s) = \frac{|\text{Keywords in } s \,\cap\, \text{Keywords in Title}|}{|\text{Keywords in } s \,\cup\, \text{Keywords in Title}|}$$
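Both f₄ and f₅ are keyword-overlap ratios of the same Jaccard form, so a single helper covers them; a minimal sketch:

```python
def jaccard(keywords_a, keywords_b):
    """|A ∩ B| / |A ∪ B| over keyword sets; used for both Centrality
    (s vs. the keywords of all other sentences) and Resemblance to
    the Title (s vs. the keywords of the title)."""
    a, b = set(keywords_a), set(keywords_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# centrality  = jaccard(kw_of_s, kw_of_all_other_sentences)
# title_score = jaccard(kw_of_s, kw_of_title)
```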
Word Aggregation for f₂, f₃, f₄, and f₅
• Use word co-occurrence to reshape the word unit.
• Assume A, B, C, D, E are keywords and E is composed of B and C in order. If MI(B, C) > threshold, then replace B and C with E: the sequence ABCD becomes AED. (E.g., 個人 "person" and 電腦 "computer" merge into 個人電腦 "personal computer".)
$$MI(x, y) = \log \frac{P(x, y)}{P(x) \times P(y)} \qquad \text{[Maosong et al., 1998]}$$
where P(x) is the probability that x occurs in the corpus, and P(x, y) is the probability that x and y occur adjacently in the corpus.
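A minimal sketch of the merge step, assuming unigram and adjacent-bigram probabilities have already been estimated from the corpus; the greedy left-to-right scan is an illustrative choice.

```python
import math

def mutual_information(x, y, p_uni, p_bi):
    """MI(x, y) = log( P(x, y) / (P(x) * P(y)) )."""
    return math.log(p_bi[(x, y)] / (p_uni[x] * p_uni[y]))

def aggregate(tokens, p_uni, p_bi, threshold):
    """Greedily merge adjacent keyword pairs whose MI exceeds the
    threshold, e.g. [A, B, C, D] -> [A, BC, D]."""
    out, i = [], 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if pair and pair in p_bi and \
           mutual_information(*pair, p_uni, p_bi) > threshold:
            out.append(tokens[i] + tokens[i + 1])  # compound word E
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```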
Train the Score Function by the Genetic Algorithm
• Helps to find a suitable combination of feature weights.
• Regard (w₁, w₂, w₃, w₄, w₅) as a genome and perform the genetic algorithm (GA) to determine the value of each w_i.
• Fitness: the average F-measure obtained with the genome when applied to the training corpus.
• 100 generations, each with 1000 genomes.
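A minimal sketch of such a GA loop, assuming a `fitness` function that runs the summarizer with a candidate weight vector on the training corpus and returns the average F-measure. The elitism, one-point crossover, and mutation rate here are illustrative choices, not details from the paper.

```python
import random

def evolve(fitness, pop_size=1000, generations=100, n_weights=5):
    """Search for (w1..w5) maximizing average F-measure."""
    pop = [[random.random() for _ in range(n_weights)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[:pop_size // 10]           # keep the top 10%
        pop = list(elite)
        while len(pop) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, n_weights)  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:             # occasional mutation
                child[random.randrange(n_weights)] = random.random()
            pop.append(child)
    return max(pop, key=fitness)
```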
Summary of Modified Corpus-based Approach
• Use a weighted score function to measure the importance of a sentence.
• Employ ranked positions to emphasize the significance of sentence positions.
• Train the score function by the genetic algorithm to find a suitable combination of feature weights.
LSA-based T.R.M. Approach
• Combine T.R.M. [Salton et al., 1997] with semantic representations derived by LSA to promote summarization to the semantic level.
[Figure: system architecture — Preprocessing (sentence identification; word segmentation & keyword frequency calculation) → Semantic Model Analysis (word-by-sentence matrix construction; singular value decomposition; dimension reduction; semantic matrix reconstruction) → Text Relationship Map Construction (semantic sentence/word representations; sentence relationship analysis; semantic sentence link calculation) → Sentence Selection (global bushy path construction; sentence selection) → Summary]
Semantic Representations
• Represent a document D as an M × N word-by-sentence matrix A (rows: keywords W₁ ... W_M, restricted to nouns and verbs; columns: sentences S₁ ... S_N) and apply SVD to A to derive the latent semantic structure of D.
$$a_{ij} = L_{ij} \cdot G_i, \qquad L_{ij} = \log\!\left(1 + \frac{c_{ij}}{n_j}\right), \qquad G_i = 1 - E_i$$
$$E_i = -\frac{1}{\log N} \sum_{j=1}^{N} f_{ij} \log f_{ij} \qquad \text{[Bellegarda et al., 1996]}$$
where c_ij is the frequency of W_i in S_j, n_j is the number of words in S_j, and E_i is the normalized entropy of W_i (with f_ij = c_ij / t_i, t_i being the total count of W_i in D, following Bellegarda et al.).
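A minimal sketch of the matrix construction and rank-k reduction with NumPy, under the definitions above; keyword filtering to nouns and verbs is omitted for brevity, and `vocab` is assumed to map each keyword to its row index.

```python
import numpy as np

def word_by_sentence_matrix(sentences, vocab):
    """Build A with a_ij = L_ij * G_i; `sentences` are token lists."""
    M, N = len(vocab), len(sentences)
    counts = np.zeros((M, N))
    for j, sent in enumerate(sentences):
        for w in sent:
            if w in vocab:
                counts[vocab[w], j] += 1
    n = np.array([len(s) for s in sentences], dtype=float)  # words per sentence
    L = np.log1p(counts / n)                                # local weight L_ij
    t = counts.sum(axis=1, keepdims=True)                   # total count t_i of W_i
    f = np.divide(counts, t, out=np.zeros_like(counts), where=t > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = np.where(f > 0, f * np.log(f), 0.0).sum(axis=1)
    E = -ent / np.log(N)                                    # normalized entropy E_i
    G = 1 - E                                               # global weight G_i
    return L * G[:, None]

def reduced_sentence_vectors(A, k):
    """SVD A = U S V^T, keep the top-k singular values; the columns
    of S_k V_k^T represent the sentences in the latent space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return s[:k, None] * Vt[:k]  # k x N matrix of sentence vectors
```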