Chinese Text Summarization Using A Trainable Summarizer and Latent Semantic Analysis
Jen-Yuan Yeh 1, Hao-Ren Ke 2, and Wei-Pang Yang 1
1 Department of Computer & Information Science, National Chiao-Tung University, Taiwan, R.O.C.
2 Digital Library & Information Section of Library, National Chiao-Tung University, Taiwan, R.O.C.
Outline
• Introduction and related works
• Modified Corpus-based approach
• LSA-based Text Relationship Map approach
• Evaluations
• Conclusions
2002/12/13
Text summarization
• The process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks) [Mani99].
[Figure: summarization pipeline — Documents → Analysis → Transformation → Synthesis → Summaries, controlled by a compression ratio]
Corpus-based Approach: A Trainable Document Summarizer [Kupiec95]
[Figure: training phase (training corpus + labeler → feature vectors → learning algorithm → rules) and test phase (test corpus → extractor → rule application → machine-generated summary)]
P(s ∈ S | f_1, f_2, ..., f_k) = P(s ∈ S) · ∏_{j=1}^{k} P(f_j | s ∈ S) / ∏_{j=1}^{k} P(f_j)
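The naive-Bayes scoring rule above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the probability tables (`p_in_summary`, `p_feat_given_in`, `p_feat`) are assumed to have been estimated from a labeled training corpus, and feature values are assumed to be discrete.

```python
def kupiec_score(feature_values, p_in_summary, p_feat_given_in, p_feat):
    """Naive-Bayes score from [Kupiec95]:
    P(s in S | f_1..f_k) = P(s in S) * prod_j P(f_j | s in S) / prod_j P(f_j).

    feature_values: discrete feature values observed for sentence s.
    p_feat_given_in / p_feat: hypothetical probability tables estimated
    from the training corpus (keyed by feature value).
    """
    score = p_in_summary
    for f in feature_values:
        score *= p_feat_given_in[f] / p_feat[f]
    return score
```

Because only the ranking of sentences matters, the constant factors can be dropped or replaced with log-probabilities in practice.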
Text Relationship Map (T.R.M.) Approach: Automated Text Structure and Summarization [Salton97]
[Figure: example text relationship map over paragraphs P_1..P_11, each labeled with its bushiness (number of links), e.g. P_1:6, P_8:9]
Sim(P_i, P_j) = (P_i · P_j) / (|P_i| |P_j|)
Three heuristic methods:
• Global bushy path
• Depth-first path
• Segmented bushy path
• Each node is represented as P_i = (k_1, k_2, ..., k_n).
• P_i and P_j are said to be connected when their vector similarity is greater than the threshold.
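The map construction and the global bushy path heuristic can be sketched as below. This is a simplified reading of [Salton97]: a node's bushiness is its number of links above the similarity threshold, and the k bushiest nodes are returned in document order. The function names are illustrative, not from the paper.

```python
import math

def cosine(p, q):
    """Sim(P_i, P_j) = (P_i . P_j) / (|P_i| |P_j|)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def global_bushy_path(vectors, threshold, k):
    """Rank nodes by bushiness (links whose similarity exceeds threshold)
    and return the indices of the k bushiest nodes in document order."""
    n = len(vectors)
    bushiness = [sum(1 for j in range(n)
                     if j != i and cosine(vectors[i], vectors[j]) > threshold)
                 for i in range(n)]
    top = sorted(range(n), key=lambda i: bushiness[i], reverse=True)[:k]
    return sorted(top)  # restore document order
```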
Modified Corpus-based Approach
• We use a "Score Function" to measure the significance of a sentence:
Score_Overall(s) = w_1 · Score_f1(s) + w_2 · Score_f2(s) − w_3 · Score_f3(s) + w_4 · Score_f4(s) + w_5 · Score_f5(s)
where f_1 represents "Position", f_2 represents "Positive Keyword", f_3 represents "Negative Keyword", f_4 represents "Resemblance to the Title", f_5 represents "Centrality", and w_i indicates the importance of each feature.
• The original approach computes the probability that a sentence will be included in the summary:
P(s ∈ S | f_1, f_2, ..., f_k) = P(s ∈ S) · ∏_{j=1}^{k} P(f_j | s ∈ S) / ∏_{j=1}^{k} P(f_j)
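The weighted score function can be sketched directly; note that the negative-keyword score f_3 is subtracted rather than added. The five feature functions are passed in by the caller (they are defined on the following slides).

```python
def overall_score(sentence, weights, feature_fns):
    """Score_Overall(s) = w1*f1(s) + w2*f2(s) - w3*f3(s) + w4*f4(s) + w5*f5(s).

    weights: (w1, w2, w3, w4, w5); feature_fns: the five feature-score
    functions (Position, Positive Keyword, Negative Keyword,
    Resemblance to the Title, Centrality).
    """
    w1, w2, w3, w4, w5 = weights
    f1, f2, f3, f4, f5 = (fn(sentence) for fn in feature_fns)
    return w1 * f1 + w2 * f2 - w3 * f3 + w4 * f4 + w5 * f5
```

Sentences are then ranked by this score and the top ones are extracted up to the compression ratio.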
f_1: Position
• For a sentence s coming from Position i, this feature-score is obtained as
Score_f1(s) = P(s ∈ S | Position i) × (Average rank of Position i) / 5.0
where the rank is a five-level rank from 1 to 5 used to emphasize the significance of positions.
Word Aggregation for f_2, f_3, f_4, and f_5
• Use Word Co-occurrence to reshape the word unit.
• Assume A, B, C, D, E are keywords and E is composed of B and C in order. If WC(B, C) > threshold, then replace B and C with E (e.g. merging 個人 and 電腦 into 個人電腦 turns the sequence ABCD into AED).
WC(B, C) = freq_E / (freq_B × freq_C)   [Kowalski97]
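The aggregation rule above can be sketched as follows; this is a minimal illustration of merging one compound at a time, with hypothetical function names.

```python
def word_cooccurrence(freq_e, freq_b, freq_c):
    """WC(B, C) = freq_E / (freq_B * freq_C)  [Kowalski97]."""
    return freq_e / (freq_b * freq_c)

def merge_adjacent(tokens, b, c, e, wc, threshold):
    """Replace each adjacent pair (b, c) in tokens with the compound e
    when the co-occurrence score wc exceeds the threshold."""
    if wc <= threshold:
        return tokens
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == b and tokens[i + 1] == c:
            out.append(e)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```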
f_2: Positive Keyword
• For a sentence s, assume s contains Keyword_1, Keyword_2, ..., Keyword_n; this feature-score is obtained as
Score_f2(s) = Σ_{k=1,2,...,n} c_k · P(s ∈ S | Keyword_k)
where c_k is the number of occurrences of Keyword_k in s.
f_3: Negative Keyword
• For a sentence s, assume s contains Keyword_1, Keyword_2, ..., Keyword_n; this feature-score is obtained as
Score_f3(s) = Σ_{k=1,2,...,n} c_k · P(s ∉ S | Keyword_k)
where c_k is the number of occurrences of Keyword_k in s.
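Both keyword features share the same form: a count-weighted sum of per-keyword probabilities. A minimal sketch, where the caller passes P(s ∈ S | Keyword_k) estimates for f_2 or P(s ∉ S | Keyword_k) estimates for f_3 (the probability tables are assumed to come from the training corpus):

```python
def keyword_feature_score(sentence_tokens, p_keyword):
    """Sum over keywords of c_k * P(. | Keyword_k), where c_k is the
    number of occurrences of Keyword_k in the sentence.

    p_keyword: hypothetical table mapping keyword -> probability estimate.
    """
    score = 0.0
    for kw, p in p_keyword.items():
        score += sentence_tokens.count(kw) * p
    return score
```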
f_4: Resemblance to the Title
• For a sentence s, this feature-score is obtained as
Score_f4(s) = |Keywords in s ∩ Keywords in Title| / |Keywords in s ∪ Keywords in Title|
f_5: Centrality
• For a sentence s, this feature-score is obtained as
Score_f5(s) = |Keywords in s ∩ Keywords in other sentences| / |Keywords in s ∪ Keywords in other sentences|
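Both f_4 and f_5 are the same intersection-over-union (Jaccard) measure, differing only in the second keyword set (the title's keywords for f_4, all other sentences' keywords for f_5). A minimal sketch:

```python
def jaccard(keywords_a, keywords_b):
    """|A ∩ B| / |A ∪ B|; used for f_4 (vs. title keywords) and
    f_5 (vs. keywords of all other sentences)."""
    a, b = set(keywords_a), set(keywords_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```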
Train the Score Function by the Genetic Algorithm
• Helps to find a suitable combination of feature-weights.
• Represent a genome as (w_1, w_2, w_3, w_4, w_5), and perform the genetic algorithm (GA) to determine the value of each w_i.
• Fitness: the average recall obtained with the genome when applied to the training corpus.
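A minimal GA sketch for this weight search, under assumptions not stated on the slide: population size, truncation selection, one-point crossover, and mutation rate are all illustrative choices, and the fitness callback stands in for "average recall on the training corpus".

```python
import random

def evolve_weights(fitness, pop_size=20, generations=50, seed=0):
    """Evolve genomes (w1..w5); fitness(genome) is assumed to return the
    average recall of summaries produced with those weights on the
    training corpus (supplied by the caller)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(5)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]          # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, 5)            # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:               # mutation
                child[rng.randrange(5)] = rng.random()
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Toy fitness for illustration: prefer weights close to (1, 1, 1, 1, 1).
best = evolve_weights(lambda w: -sum((x - 1.0) ** 2 for x in w))
```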
LSA-based T.R.M. Approach
• Combine T.R.M. [Salton97] and the semantic representations derived by LSA to promote summarization to the semantic level.
[Figure: system architecture — Preprocessing (sentence identification; word segmentation & keyword-frequency calculation), Semantic Model Analysis (word-by-sentence matrix construction; singular value decomposition; dimension reduction; semantic matrix reconstruction), Text Relationship Map Construction (semantic sentence/word representations; sentence relationship analysis; semantic-related sentence links), Sentence Selection (global bushy path construction)]
Semantic Representations
• Represent a document D as a Word-by-Sentence matrix A (rows W_1, ..., W_M; columns S_1, ..., S_N) and apply SVD to A to derive latent semantic structures of D from A.
a_ij = L_ij · G_i
L_ij = log(1 + c_ij / n_j)
G_i = 1 − E_i
E_i = −(1 / log N) Σ_{j=1}^{N} f_ij · log f_ij   [Bellegarda96]
where c_ij is the frequency of W_i in S_j, n_j is the number of words in S_j, and E_i is the normalized entropy of W_i.
• Keywords: nouns & verbs.
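The weighting and decomposition above can be sketched with NumPy. This is a sketch under one assumption the slide leaves implicit: following [Bellegarda96], f_ij is taken to be c_ij / t_i, where t_i is the total frequency of word W_i in the document. Keeping the top-k singular dimensions yields a k-dimensional semantic vector per sentence.

```python
import numpy as np

def lsa_sentence_vectors(counts, k):
    """counts: M x N array, counts[i, j] = frequency of word W_i in sentence S_j.
    Builds a_ij = L_ij * G_i with L_ij = log(1 + c_ij / n_j) and G_i = 1 - E_i,
    where E_i is the normalized entropy of W_i (assuming f_ij = c_ij / t_i),
    then keeps the top-k singular dimensions as sentence representations."""
    c = np.asarray(counts, dtype=float)
    n_j = c.sum(axis=0)                          # words per sentence
    L = np.log1p(c / n_j)                        # local weight L_ij
    t_i = c.sum(axis=1, keepdims=True)           # total frequency of each word
    f = np.divide(c, t_i, out=np.zeros_like(c), where=t_i > 0)
    with np.errstate(divide='ignore'):
        logf = np.where(f > 0, np.log(f), 0.0)
    N = c.shape[1]
    E = -(f * logf).sum(axis=1) / np.log(N)      # normalized entropy E_i
    A = L * (1.0 - E)[:, None]                   # a_ij = L_ij * G_i
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T           # one k-dim vector per sentence
```

Cosine similarity between these reduced sentence vectors then replaces the keyword-overlap similarity when building the semantic text relationship map.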