LexPageRank: Prestige in Multi-Document Text Summarization

Güneş Erkan, Dragomir R. Radev
Department of EECS, School of Information
University of Michigan
{gerkan,radev}@umich.edu

Abstract
Multi-document extractive summarization relies on the concept of sentence centrality to identify the most important sentences in a document. Centrality is typically defined in terms of the presence of particular important words or in terms of similarity to a centroid pseudo-sentence. We are now considering an approach for computing sentence importance based on the concept of eigenvector centrality (prestige) that we call LexPageRank. In this model, a sentence connectivity matrix is constructed based on cosine similarity. If the cosine similarity between two sentences exceeds a particular predefined threshold, a corresponding edge is added to the connectivity matrix. We provide an evaluation of our method on DUC 2004 data. The results show that our approach outperforms centroid-based summarization and is quite successful compared to other summarization systems.

1 Introduction

Text summarization is the process of automatically creating a compressed version of a given text that provides useful information for the user. In this paper, we focus on multi-document generic text summarization, where the goal is to produce a summary of multiple documents about the same, but unspecified, topic.

Our summarization approach is to assess the centrality of each sentence in a cluster and include the most important ones in the summary. In Section 2, we present centroid-based summarization, a well-known method for judging sentence centrality. Then we introduce two new measures for centrality, Degree and LexPageRank, inspired by the "prestige" concept in social networks and based on our new approach. We compare our new methods and centroid-based summarization using a feature-based generic summarization toolkit, MEAD, and show that the new features outperform Centroid in most cases. Test data for our experiments is taken from the Document Understanding Conferences (DUC) 2004 summarization evaluation, which also lets us compare our system with other state-of-the-art summarization systems.

2 Sentence centrality and centroid-based summarization

Extractive summarization produces summaries by choosing a subset of the sentences in the original documents. This process can be viewed as choosing the most central sentences in a (multi-document) cluster, those that give the necessary and sufficient amount of information related to the main theme of the cluster. Centrality of a sentence is often defined in terms of the centrality of the words that it contains. A common way of assessing word centrality is to look at the centroid. The centroid of a cluster is a pseudo-document consisting of the words that have frequency*IDF scores above a predefined threshold. In centroid-based summarization (Radev et al., 2000), the sentences that contain more words from the centroid of the cluster are considered central. Formally, the centroid score of a sentence is the cosine of the angle between the centroid vector of the whole cluster and the individual centroid of the sentence. This is a measure of how close the sentence is to the centroid of the cluster. Centroid-based summarization has given promising results in the past (Radev et al., 2001).

3 Prestige-based sentence centrality

In this section, we propose a new method to measure sentence centrality based on prestige in social networks, which has also inspired many ideas in computer networks and information retrieval.

A cluster of documents can be viewed as a network of sentences that are related to each other. Some sentences are more similar to each other, while others may share only a little information with the rest of the sentences. We hypothesize that the sentences that are similar to many of the other sentences in a cluster are more central (or prestigious) to the topic. There are two points to clarify in this definition of centrality. First is how to define similarity between two sentences. Second is how to

compute the overall prestige of a sentence given its similarity to other sentences. For the similarity metric, we use cosine. A cluster may be represented by a cosine similarity matrix where each entry is the similarity between the corresponding sentence pair. Figure 1 shows a subset of a cluster used in DUC 2004 and the corresponding cosine similarity matrix. A sentence ID of the form dXsY denotes the Yth sentence in the Xth document. In the following sections, we discuss two methods to compute sentence prestige using this matrix.

3.1 Degree centrality

In a cluster of related documents, many of the sentences are expected to be somewhat similar to each other since they are all about the same topic. This can be seen in Figure 1, where the majority of the values in the similarity matrix are nonzero. Since we are interested in significant similarities, we can eliminate the low values in this matrix by defining a threshold, so that the cluster can be viewed as an (undirected) graph in which each sentence of the cluster is a node and significantly similar sentences are connected to each other. Figure 2 shows the graphs that correspond to the adjacency matrices derived by assuming that the pairs of sentences with a similarity above 0.1, 0.2, and 0.3, respectively, in Figure 1 are similar to each other. We define degree centrality as the degree of each node in the similarity graph. As seen in Table 1, the choice of cosine threshold dramatically influences the interpretation of centrality. Too low a threshold may mistakenly take weak similarities into consideration, while too high a threshold may lose many of the similarity relations in a cluster.

  ID     Degree (0.1)   Degree (0.2)   Degree (0.3)
  d1s1   4              3              1
  d2s1   6              2              1
  d2s2   1              0              0
  d2s3   5              2              0
  d3s1   4              1              0
  d3s2   6              3              0
  d3s3   1              1              0
  d4s1   8              4              0
  d5s1   4              3              1
  d5s2   5              3              0
  d5s3   4              1              1

Table 1: Degree centrality scores for the graphs in Figure 2. Sentence d4s1 is the most central sentence for thresholds 0.1 and 0.2.

3.2 Eigenvector centrality and LexPageRank

When computing degree centrality, we have treated each edge as a vote to determine the overall prestige value of each node. This is a totally democratic method where each vote counts the same. However, this may have a negative effect on the quality of the summaries in some cases where several unwanted sentences vote for each other and raise their prestige. As an extreme example, consider a noisy cluster where all the documents are related to each other, but only one of them is about a somewhat different topic. Obviously, we wouldn't want any of the sentences in the unrelated document to be included in a generic summary of the cluster. However, assume that the unrelated document contains some sentences that are very prestigious considering only the votes within that document. These sentences will get artificially high centrality scores from the local votes of a specific set of sentences. This situation can be avoided by considering where the votes come from and taking the prestige of the voting node into account in weighting each vote. Our approach is inspired by a similar idea used in computing web page prestige.

One of the most successful applications of prestige is PageRank (Page et al., 1998), the underlying technology behind the Google search engine. PageRank is a method proposed for assigning a prestige score to each page in the Web independent of a specific query. In PageRank, the score of a page is determined by the number of pages that link to that page as well as the individual scores of the linking pages. More formally, the PageRank of a page A is given as follows:

  PR(A) = (1 - d) + d * ( PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n) )    (1)

where T_1, ..., T_n are the pages that link to A, C(T_i) is the number of outgoing links from page T_i, and d is the damping factor, which can be set between 0 and 1. This recursively defined value can be computed by forming the binary adjacency matrix, B, of the Web, where B(i, j) = 1 if there is a link from page i to page j, normalizing this matrix so that its row sums equal 1, and finding the principal eigenvector of the normalized matrix. The PageRank of the ith page equals the ith entry in the eigenvector. The principal eigenvector of a matrix can be computed with a simple iterative power method.

This method can be directly applied to the cosine similarity graph to find the most prestigious sentences in a document. We use PageRank to weight each vote so that a vote that comes from a more prestigious sentence has a greater value in the centrality of a sentence. Note that unlike the original PageRank method, the graph is undirected since cosine similarity is a symmetric relation. However,
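As a concrete illustration, the two centrality measures described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' MEAD implementation: it uses a plain bag-of-words cosine (the paper weights terms, e.g. with frequency*IDF), and the tokenizer, threshold value, and function names are illustrative assumptions. The iteration in `lexpagerank` applies Equation (1) to the undirected similarity graph.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_graph(sentences, threshold):
    """Binary adjacency matrix: an edge wherever cosine similarity
    between two distinct sentences exceeds the threshold."""
    vecs = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(vecs)
    return [[1 if i != j and cosine(vecs[i], vecs[j]) > threshold else 0
             for j in range(n)] for i in range(n)]

def degree_centrality(adj):
    """Degree of each node in the similarity graph."""
    return [sum(row) for row in adj]

def lexpagerank(adj, d=0.85, iterations=50):
    """Power-method iteration of Eq. (1) on the similarity graph.
    Each sentence's score is (1 - d) plus the damped, degree-normalized
    scores of its neighbors, so votes from prestigious sentences count more."""
    n = len(adj)
    degrees = [max(sum(row), 1) for row in adj]  # avoid division by zero
    pr = [1.0] * n
    for _ in range(iterations):
        pr = [(1 - d) + d * sum(pr[j] * adj[j][i] / degrees[j]
                                for j in range(n))
              for i in range(n)]
    return pr
```

On a toy cluster, two sentences that share vocabulary link to and reinforce each other, while an off-topic sentence with no neighbors receives only the baseline score (1 - d).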
