a document summarizer for novices


  1. A DOCUMENT SUMMARIZER FOR NOVICES REX RUBIN

  2. WHY A DOCUMENT SUMMARIZER?  Getting into a field of research is:  Daunting with the amount of information presented  Difficult to discern what is important and what isn’t  How a summarizer will help:  Present the most relevant information and remove the excess

  3. EXTRACTION VS ABSTRACTION  Extraction[1]:  Pulls sentences straight from the input  Does not make its own sentences  Abstraction[1]:  Creates sentences by joining several together  Works better for several documents at once

  4. TEXTRANK  Extraction based[2]  Creates a web of sentences  This web is used as an input for PageRank  PageRank will rank the sentences[3]  Gives the summary as the output
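
The TextRank steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the study: the sentence splitter, the word-overlap `similarity` function (normalized by log sentence length, computed here over sets of distinct words), and the `damping` and `iterations` defaults are all assumptions chosen for brevity.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Set of lowercase words in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Shared-word count, normalized by the log of each sentence's size."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # keeps the denominator strictly positive
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank over the similarity web."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # The "web of sentences": weights[i][j] is the edge between i and j.
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence collects score from its neighbours, in proportion
        # to the share of the neighbour's total edge weight it holds.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(text, k=2)` on a short passage returns the two sentences that share the most vocabulary with the rest of the text; a sentence with no shared words keeps only the baseline score of `1 - damping`.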

  5. HOW TO IMPROVE THIS MODEL?  Add a glossary to the document; its terms should be relevant to the original document  Because TextRank links similar sentences, the glossary lets sentences that use its terms connect to more of the web and score higher  This should help by surfacing more informative sentences  More informative does not mean easier to read
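
The effect described on this slide can be illustrated with a small sketch. It assumes glossary entries join the sentence web as extra nodes and are left out of the reported results; whether the modified program did this is not stated in the slides. The function `edge_weight_totals`, the sample sentences, and the use of summed edge weight as a rough proxy for a sentence's eventual rank are all hypothetical choices made for illustration.

```python
import math
import re


def tokenize(sentence):
    """Set of lowercase words in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Shared-word count, normalized by the log of each sentence's size."""
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def edge_weight_totals(doc_sentences, glossary_entries=()):
    """Sum of similarity edges for each document sentence.

    Glossary entries are added to the graph as extra nodes but are not
    reported, so they can raise a sentence's total without being
    candidates for the summary themselves (an assumption of this sketch).
    """
    nodes = list(doc_sentences) + list(glossary_entries)
    tokens = [tokenize(s) for s in nodes]
    totals = []
    for i in range(len(doc_sentences)):
        totals.append(sum(similarity(tokens[i], tokens[j])
                          for j in range(len(nodes)) if j != i))
    return totals


# Hypothetical example text, not taken from the study's document.
doc = [
    "A firewall filters traffic entering the network.",
    "The office moved to a new building last year.",
]
glossary = ["Firewall: a device that filters network traffic."]

before = edge_weight_totals(doc)
after = edge_weight_totals(doc, glossary)
```

Comparing `before` and `after` shows the on-topic sentence gaining far more edge weight from the glossary entry than the off-topic one. Common words such as "a" still create weak links, so a real implementation would probably filter stopwords first.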

  6. MY TEXTRANK MODIFICATION

  7. RESEARCH QUESTION  Will including a glossary of related terms in the original document bring about more informative sentences?

  8. HYPOTHESIS  Having a glossary included in the original document will bring out more informative sentences in the final summary

  9. EXPERIMENT OVERVIEW  Two experimental groups:  Control Group (Y)  Test Group (X)  Have the groups take a test on the original document

  10. MY SUMMARY  My summary was made using a document focused on cybersecurity and the glossary was filled with similar cybersecurity terms

  11. PARTICIPANTS  Participants:  Union College students aged 18-22  Mixed group of CS students and non-CS students  2 Groups:  Control (Y) read the summary produced by the original TextRank program  Test (X) read the summary produced by my modified TextRank program

  12. TEST GIVEN TO PARTICIPANTS  The test given to participants was based on the main points of the original document  Why the main points?  The main points should be in the summary  Question types:  3 Multiple Choice  3 Open Answer

  13. AVERAGE SCORES OF QUESTIONS  [Bar chart: average score for Questions 1 to 6 and Total Score, with multiple-choice and open-answer questions marked. In each pair, the data on the left is group Y and the data on the right is group X. Total Score labels: 3.22 and 3.06.]

  14. AVERAGE SCORES OF QUESTIONS OUTLIERS REMOVED  [Bar chart: average score for Questions 1 to 6 and Total Score, with outliers removed.]

  15. DIFFERENCES IN RESULTS X-Y  [Bar chart: difference in average score, X minus Y, for Questions 1 to 6 and Total Score. Labeled differences range from -0.45 to 0.33.]
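
For reference, an X-Y difference of this kind is the test group's average score on a question minus the control group's average. A minimal sketch follows; the function name `mean_differences` and the sample scores are hypothetical and are not the study's data.

```python
from statistics import mean


def mean_differences(x_scores, y_scores):
    """Per-question mean of group X minus mean of group Y.

    Each argument maps a question label to that group's list of scores.
    """
    return {q: mean(x_scores[q]) - mean(y_scores[q]) for q in x_scores}


# Hypothetical scores, for illustration only; not the study's data.
x = {"Question 1": [1, 1, 0, 1], "Question 2": [0, 1, 0, 0]}
y = {"Question 1": [1, 0, 0, 1], "Question 2": [1, 1, 0, 0]}
diffs = mean_differences(x, y)
```

A positive value means group X scored higher on that question on average; a negative value means group Y did.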

  16. DIFFERENCES X-Y OUTLIERS REMOVED  [Bar chart: difference in average score, X minus Y, for Questions 1 to 6 and Total Score, with outliers removed. Labeled differences range from -0.5 to 0.625.]

  17. WAS MY HYPOTHESIS CORRECT? With these results, I can say my hypothesis is incorrect

  18. SOMETHING ELSE?  Differences in 4 and 6 were significant  [Two bar charts. Question 4: X Average, Y Average, Difference Y-X. Question 6: X Average, Y Average, Difference X-Y. Data labels across the two charts: 0.89, 0.89, 0.56, 0.45, 0.44, 0.33.]

  19. CITATIONS  [1] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. ACM SIGIR conference on Research and development in information retrieval, (15):68-73, 1995.  [2] Rada Mihalcea and Paul Tarau. TextRank: Bringing order into texts. 2011.  [3] Mario Kubek and Herwig Unger. Topic detection based on the PageRank's clustering property. 2011.
