A DOCUMENT SUMMARIZER FOR NOVICES REX RUBIN
WHY A DOCUMENT SUMMARIZER?
Getting into a field of research is:
- Daunting because of the amount of information presented
- Difficult when it comes to discerning what is important and what isn't
How a summarizer will help:
- It presents the most relevant information and removes the excess
EXTRACTION VS ABSTRACTION
Extraction[1]:
- Pulls sentences straight from the input
- Does not make its own sentences
Abstraction[1]:
- Creates sentences by joining several together
- Works better for several documents at once
TEXTRANK
- Extraction based[2]
- Creates a web of sentences
- This web is used as the input for PageRank
- PageRank ranks the sentences[3]
- The top-ranked sentences are given as the output summary
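The sketch below illustrates this pipeline in Python. It is a minimal sketch, not the program used in this project: networkx supplies PageRank, while the sentence splitter and the word-overlap similarity are stand-in choices for whatever tokenizer and similarity measure the actual system uses.

```python
import itertools
import math
import re

import networkx as nx


def split_sentences(text):
    """Naive sentence splitter; a real system would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word-overlap similarity normalized by sentence length, as in TextRank."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def summarize(text, n_sentences=5):
    """Build the web of sentences, rank it with PageRank, return the top sentences."""
    sentences = split_sentences(text)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    # Connect every pair of sentences whose similarity is non-zero.
    for i, j in itertools.combinations(range(len(sentences)), 2):
        w = similarity(sentences[i], sentences[j])
        if w > 0:
            graph.add_edge(i, j, weight=w)
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(top))  # keep document order
```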
HOW TO IMPROVE THIS MODEL?
- The glossary should consist of terms relevant to the original document
- Because of the way TextRank works, the glossary allows similar sentences to connect and score higher
- This should give more informative sentences in the summary
- Note that more informative does not mean easier to read
MY TEXTRANK MODIFICATION
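A minimal sketch of how the glossary could be folded into the TextRank input, reusing summarize() from the sketch above. The slides do not spell out the exact mechanism, so treating the glossary as plain text appended to the document before TextRank runs is an assumption here.

```python
# Assumption: the glossary is a list of plain-text entries appended to the
# document, so sentences that share glossary terms connect more strongly
# in the sentence web and score higher under PageRank.
def summarize_with_glossary(text, glossary_entries, n_sentences=5):
    # glossary_entries: hypothetical example input, e.g.
    #   ["Phishing: an attempt to trick a user into revealing credentials.",
    #    "Malware: software designed to damage or gain access to a system."]
    augmented = text + " " + " ".join(glossary_entries)
    return summarize(augmented, n_sentences=n_sentences)  # summarize() from the earlier sketch
```

Under this reading, the glossary adds extra word overlap between glossary terms and the sentences that use them, which raises those sentences' edge weights and, through PageRank, their final scores.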
RESEARCH QUESTION Will including a glossary of related terms in the original document bring about more informative sentences?
HYPOTHESIS Including a glossary in the original document will produce more informative sentences in the final summary
EXPERIMENT OVERVIEW
Two experimental groups:
- Control Group (Y)
- Test Group (X)
Both groups take a test on the original document
MY SUMMARY The summaries were generated from a document focused on cybersecurity, and the glossary was filled with related cybersecurity terms
PARTICIPANTS
Participants: Union College students aged 18-22, a mixed group of CS and non-CS students
Two groups:
- Control (Y) read the summary produced by the original TextRank program
- Test (X) read the summary produced by my modified TextRank program
TEST GIVEN TO PARTICIPANTS
- The test given to participants was based on the main points of the original document
- Why the main points? The main points should appear in the summary
Question types:
- 3 Multiple Choice
- 3 Open Answer
AVERAGE SCORES OF QUESTIONS
[Bar chart: average scores for Questions 1-6 and the total score. For each question, the left bar is the control group (Y) and the right bar is the test group (X).]
AVERAGE SCORES OF QUESTIONS, OUTLIERS REMOVED
[Bar chart: average scores for Questions 1-6 and the total score with outliers removed, same layout as the previous chart.]
DIFFERENCES IN RESULTS, X-Y
[Bar chart: difference in average score (X minus Y) for Questions 1-6 and the total score.]
DIFFERENCES X-Y, OUTLIERS REMOVED
[Bar chart: difference in average score (X minus Y) for Questions 1-6 and the total score, outliers removed.]
WAS MY HYPOTHESIS CORRECT? With these results, I can say my hypothesis is incorrect
SOMETHING ELSE?
The differences for Questions 4 and 6 were significant:
- Question 4: X average 0.44, Y average 0.89, difference (Y-X) 0.45
- Question 6: X average 0.89, Y average 0.56, difference (X-Y) 0.33
CITATIONS
[1] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 68-73, 1995.
[2] Rada Mihalcea and Paul Tarau. TextRank: Bringing order into texts. 2004.
[3] Mario Kubek and Herwig Unger. Topic detection based on the PageRank's clustering property. 2011.