An experiment on the Reddit dataset
Shahbaz Ahmed (1515594), Viorel Morari (1515629)
The Need for Summarization
• Goal – to capture the important information contained in large volumes of text, and present it in a brief, representative, and consistent summary
• TL;DR – an acronym for "Too Long; Didn't Read"
Types of Summarization
• Automatic summarization – reducing a text document, or a larger corpus of multiple documents, into a short set of words or a paragraph that conveys the main meaning of the text
Extractive vs. Abstractive
• Extractive methods – work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary
• Abstractive methods – build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate
Example
The Army Corps of Engineers, rushing to meet President Bush's promise to protect New Orleans by the start of the 2006 hurricane season, installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during a storm, according to documents obtained by The Associated Press.
Extractive summary: "Army Corps of Engineers," "President Bush," "New Orleans," and "defective flood-control pumps"
Abstractive summary: "political negligence" or "inadequate protection from floods"
Existing Work
• Gupta and Lehal, 2010 – single-document summarization
• Goldstein et al., 2000 – summarization of multiple documents on the same topic (~200 documents)
• Cselle, Albrecht, and Wattenhofer, 2007 – summarizing discussions such as email conversations (~200 comments)
• Hu, Sun, and Lim, 2007 – blog summarization (~1500 blog posts)
• Chakrabarti and Punera, 2011 – tweet summarization (440K tweets; over 150 games)
• Brody and Elhadad, 2010 – review summarization
Our Dataset: The Reddit Universe!
• Comments – 149.6 GB: 1,659,361,605 (~1.66 billion) entries
• Submissions – 39.7 GB: 196,531,736 (~196.5 million) entries
Comment
• Comment – a statement of fact or opinion, especially a remark that expresses a personal reaction or attitude.
{"archived":true,"author":"jaquehamr","body":"Thanks for proving the point of the quote.\n\nTL;DR: WOOSH","controversiality":0,"created_utc":"1239192802","downs":0,"edited":"false","gilded":0,"id":"c08q8en","link_id":"t3_8auok","name":"t1_c08q8en","parent_id":"t1_c08q4sz","retrieved_on":1425950159,"score":3,"score_hidden":false,"subreddit":"atheism","subreddit_id":"t5_2qh2p","ups":3}
Submission
• Submission – a statement of fact or opinion posted by a registered user with the intention of being elaborated on by other users.
{"archived":true,"author":"[deleted]","created":1297290547,"created_utc":"1297290547","domain":"self.WeAreTheFilmMakers","downs":0,"edited":"false","gilded":0,"hide_score":false,"id":"fibse","is_self":true,"media_embed":{},"name":"t3_fibse","num_comments":2,"over_18":false,"permalink":"/r/WeAreTheFilmMakers/comments/fibse/question_about_resumes/","quarantine":false,"retrieved_on":1442846972,"saved":false,"score":2,"secure_media_embed":{},"selftext":"I'm currently a film student at the University of Cincinnati and I'm going to start applying for internships soon so I was wondering what I should put on my resume when applying.\n\ntl;dr I'm going to be sending out my resume soon and I'm looking for help on what I should include on it","stickied":false,"subreddit":"WeAreTheFilmMakers","subreddit_id":"t5_2qngr","thumbnail":"default","title":"Question about resumes","ups":2,"url":"http://www.reddit.com/r/WeAreTheFilmMakers/comments/fibse/question_about_resumes/"}
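Each line of the dump is one JSON object like the records above. A minimal sketch of loading a record and detecting the tl;dr marker (the `has_tldr` helper is a hypothetical illustration, not the project's actual code):

```python
import json

# One abbreviated line from the comments dump (fields taken from the
# example record above; most fields omitted for brevity).
line = ('{"author": "jaquehamr", '
        '"body": "Thanks for proving the point of the quote.\\n\\nTL;DR: WOOSH", '
        '"subreddit": "atheism", "score": 3}')

record = json.loads(line)

def has_tldr(text):
    """Case-insensitive check for a tl;dr marker ("tl;dr" or "tldr")."""
    lowered = text.lower()
    return "tl;dr" in lowered or "tldr" in lowered

print(has_tldr(record["body"]))  # True
```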
Process of Summarization
Step 1: Extracting a clean dataset
"The most important tasks with regard to understanding the information available in comments are filtering, ranking and summarizing the comments." – (Potthast et al., 2012)
• Extract only the items which contain "tl;dr"
• Challenge – the marker may appear without an actual summary, e.g. "body":"It's pretty sad that someone can sum up ten years of your life with a tl;dr"
Process of Summarization
Step 2: Filtering the targets
• Filter out comments/submissions with content length < 50 characters (our approach)
E.g. "body":"Thanks for proving the point of the quote.\n\nTL;DR: WOOSH" – invalid
• Filter out tl;dr's with content length < 5 characters (our approach)
E.g. "body":"It's pretty sad that someone can sum up ten years of your life with a tl;dr" – invalid
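The two length thresholds above can be sketched as a filter that splits each body at the first tl;dr marker and validates both halves (a minimal sketch under the stated 50/5-character assumptions; the regex and helper names are illustrative, not the project's implementation):

```python
import re

# Matches "tl;dr" or "tldr", optionally followed by a colon and spaces.
TLDR_RE = re.compile(r"tl;?dr\s*:?\s*", re.IGNORECASE)

def split_on_tldr(body):
    """Split a body into (content, tldr) at the first tl;dr marker;
    return None if no marker is present."""
    match = TLDR_RE.search(body)
    if match is None:
        return None
    return body[:match.start()].strip(), body[match.end():].strip()

def is_valid(body, min_content=50, min_tldr=5):
    """Apply the length thresholds described in the slides."""
    parts = split_on_tldr(body)
    if parts is None:
        return False
    content, tldr = parts
    return len(content) >= min_content and len(tldr) >= min_tldr
```

Both invalid examples from the slide fail this check: the first has only 42 characters of content before the marker, the second has nothing after it.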
Process of Summarization
Step 3: Processing & ranking the target content
• TF-IDF – ranking models
• Reduces the influence of more common words
Process of Summarization
Step 4: Extracting relevant information by ranks
• The highest-ranked terms form the summarization (tl;dr)
Process of Summarization
Step 5: Presentation of the retrieved content
Input (12 sentences; excerpt):
1. there used to be several channels related to technology and geek culture.
2. then it merged with g4tv, a shitty comcast channel of little note that wanted techtv's audience and cancelled all the decent reasons to ever tune into techtv.
3. there is nothing decent that comes on cable television that you can't watch for free (and legally) on either hulu or the comedy channel's website.
Original tl;dr (2 sentences): there was a decent one. comcast more or less bought it out and axed all it's programming to get viewers for its gaming channel but only succeeded in destroying the market and causing kevin rose to run off and create digg.
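Steps 3–5 above can be sketched as a small TF-IDF extractive summarizer: score each sentence by the average TF-IDF weight of its terms and keep the top-ranked ones in original order (a pure-stdlib sketch with naive sentence splitting and tokenization, not the authors' implementation):

```python
import math
import re
from collections import Counter

def sentences(text):
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def summarize(text, n=2):
    sents = sentences(text)
    docs = [tokens(s) for s in sents]
    # Document frequency over sentences: rare terms get higher IDF,
    # which reduces the influence of more common words.
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(len(docs) / df[t]) for t in df}

    def score(doc):
        tf = Counter(doc)
        return sum(tf[t] / len(doc) * idf[t] for t in tf) if doc else 0.0

    ranked = sorted(range(len(sents)), key=lambda i: score(docs[i]), reverse=True)
    keep = sorted(ranked[:n])  # restore original sentence order
    return " ".join(sents[i] for i in keep)
```

The extracted sentences are always verbatim subsets of the input, which is what makes the method extractive rather than abstractive.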
Statistics about the dataset
• Comments distribution: 0.1% valid – 1,850,031 (~1.85 million); 99.9% invalid
• Submissions distribution: 0.4% valid – 749,376 (~0.75 million); 99.6% invalid
Statistics about the dataset
Average length (number of words):
• Submissions: 405.1
• Comments: 228.2
Statistics about the dataset
[Chart: distribution of comments by length (number of comments vs. length)]
[Chart: distribution of submissions by length]
Further ideas
• Developing the automatic extractive summarizer on the valid comments & submissions (in progress)
• Dealing with tl;dr at the semantic level, e.g. "body":"It's pretty sad that someone can sum up ten years of your life with a tl;dr"
• Keyphrase extraction for summarization
• Forming a proper representation of valid and invalid comments/submissions/tl;dr's
• Dealing with encountered anomalies and faults in the detection process
Examples
• Valid submission (screenshot)
Examples
• Valid comment: "omgwtfthatistotallyhitleri'llneverbeabletobuythatbrandoflotioneveragainthankyouforpointingthisouttomei'llsendanemailoutoeverylawyericanfindsothatthiscompanycanbebroughttojustice!)"
• "wordcount":2 – the comment is written without spaces, so a naive word count fails