

  1. Topic Modeling on ABC News
     Mining Unstructured Data, Team 2: Steve Barnard, Charles Huang, Shyam Senthilkumar, Eda Wang
     May 21, 2019

  2. ABC News headlines are used for topic modeling
     78,000 headlines from news articles published in 2015, derived from the Kaggle dataset linked below:
     ● Sourced from the Australian Broadcasting Corp. (ABC)
     ● ~200 articles published each day
     ● A mix of international and Australian news
     ● Sample headlines:
       “Egyptian court orders retrial for journalist Peter Greste”
       “Boat on fire at Jacobs Well Marina”
       “Woman arrested after explosives allegedly found in car”
     We expected LDA to find common news topics: Sports, Politics, Local News, Health, Science, etc.
     ● https://www.kaggle.com/therohk/million-headlines
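The dataset load described above can be sketched as follows. This assumes the Kaggle file's layout (a `publish_date` column as a YYYYMMDD integer and a `headline_text` column); the tiny inline frame stands in for reading the real CSV, and the helper name is ours, not the team's:

```python
# Sketch of filtering the Kaggle "A Million News Headlines" data to 2015,
# as the slides describe. Assumed schema: publish_date (YYYYMMDD int),
# headline_text (string).
import pandas as pd

def headlines_for_year(df: pd.DataFrame, year: int) -> pd.Series:
    """Return headlines whose YYYYMMDD publish_date falls in the given year."""
    return df.loc[df["publish_date"] // 10000 == year, "headline_text"]

# Tiny inline sample standing in for pd.read_csv on the full dataset.
sample = pd.DataFrame({
    "publish_date": [20141231, 20150102, 20150630, 20160101],
    "headline_text": [
        "old headline",
        "egyptian court orders retrial for journalist peter greste",
        "boat on fire at jacobs well marina",
        "new year headline",
    ],
})
print(list(headlines_for_year(sample, 2015)))
```

On the full file, the same filter yields the ~78,000 headlines from 2015 used throughout.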

  3. Number of topics chosen via a heuristic approach
     Attempted domain-knowledge/heuristic-based topic selection:
     ● The Australian Broadcasting Corp. (ABC), our data source, lists 12 topics on its website: https://www.abc.net.au/
     ● CNN lists 10 topics (excluding video)
     The difference in topic counts, with one outlet US-centric and the other Australia-centric, added some difficulty in interpreting topic themes (i.e., no business/politics, and the addition of indigenous/rural). As an experiment to normalize the differences, Australian words were manually added to the stop-word list, with limited success (common words such as “man”/“woman” were also added). Both topic lists are as of 2019; the data used for our LDA variants is from 2015, so the topics may have changed.

  4. Number of topics chosen via coherence scores
     Coherence-based topic selection:
     ● Coherence scores fluctuated throughout our multiple iterations of LDA and its variants
       ○ Coherence ranged from ~0.26 to 0.56, depending on model type, number of topics selected, stop words, etc.
     ● Overall, models with a higher number of topics tended to perform better in terms of coherence
     ● We settled on k = 10 topics, at the lower end of the possible range, as a comfortable starting point
     [Figure: Coherence scores vs. number of topics for one of our models, illustrating the iterative search for an optimal topic count based on coherence.]
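The search loop above can be sketched as follows. The project scored coherence through Gensim; as a dependency-light stand-in, this sketch fits scikit-learn's LDA for several k and computes a hand-rolled UMass coherence on a toy corpus, just to show the shape of the sweep:

```python
# Sweep k and score each LDA fit with UMass coherence:
# sum over top-word pairs of log((D(wi, wj) + 1) / D(wj)).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "police charge man over sydney fire",
    "court orders murder trial for man",
    "australia win cricket world cup final",
    "rugby final sees australia beat england",
    "council funding boosts hospital services",
    "government announces health funding plan",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)
presence = (X.toarray() > 0)  # doc-term incidence for co-occurrence counts

def umass_coherence(components, top_n=5):
    """Mean UMass coherence over topics, from top-n words per topic."""
    scores = []
    for topic in components:
        top = np.argsort(topic)[::-1][:top_n]
        s = 0.0
        for i in range(1, len(top)):
            for j in range(i):
                d_ij = np.sum(presence[:, top[i]] & presence[:, top[j]])
                d_j = np.sum(presence[:, top[j]])
                s += np.log((d_ij + 1) / d_j)
        scores.append(s)
    return float(np.mean(scores))

for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    print(k, round(umass_coherence(lda.components_), 3))
```

In the project the analogous scores were plotted against k and k = 10 was chosen; note Gensim's c_v coherence (the ~0.26-0.56 range above) is on a different scale than UMass.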

  5. Summary of topic modeling methods and observations

                      LSA                  LDA v1               LDA v2
     Dataset          ABC 2015             ABC 2015             ABC 2015
     Stop words       'english' plus       'english'            'english' plus
                      variations                                variations
     Pkg/Method       Sklearn KMeans       Gensim + Mallet's    Gensim
                                           LDA
     # of Topics      10                   10                   10
     Shared Themes    Politics, Local News, Crime, International (all models)
     Unique Themes    Sports               Economy, Education,  Agriculture,
                                           Car Accident,        Car Accident,
                                           Health, Sports       Infrastructure

  6. Observations – LSA (K-Means clustering)
     Topic 1 – police, wa, court, charged, death, sa, murder, fire, car, crash (Crime)
     Topic 2 – australia, world, cup, win, final, south, rugby, one, cricket, ntch (Sports)
     Topic 3 – country, hour, nt, vic, tas, nsw, march, qld (Local)
     Topic 4 – government, council, coast, sydney, health, funding, tasmanian, canberra (Politics)
     Topic 5 – australian, us, market, china, west, open, share (International)
     Topic 6 – news, national, exchange, rural, abc, quiz, press, club, march, park (Local)
     Topic 7 – live, nrl, league, afl, streaming, updates, super, blog, export, final (Sports)
     Topic 8 – grandstand, drum, hill, capital, breakfast, march, broken, stumps, digital, confab (Local)
     Topic 9 – rural, qld, sach, reporter, countrywide, sa, north, outback, drought, central (Local)
     Topic 10 – nsw, interview, election, extended, rural, rain, wrap, shark, baird, police (Politics)
     [Image: word cloud based on the news data]
     Based on word frequency, we added additional stop words such as time-frame terms (e.g., Jan, days, Sept, Christmas), nouns (e.g., woman, man), and verbs (e.g., say). We applied KMeans from the sklearn package with n_clusters=10. K-Means clustering on the data with stop words removed produced interpretable topics such as Sports (topics #2, #7), Politics (#4, #10), local stories (#3, #6, #8, #9), Crime (#1), and International (#5).

  7. Observations – LDA v1 (Gensim + Mallet's LDA)
     Topic 1 – Australia, Australian, day, farmer, China, water, test, market, price, rise (International)
     Topic 2 – government, call, change, act, urge, election, labor, support, law, group (Politics)
     Topic 3 – win, open, world, set, lead, beat, record, final, return, cup (Sports)
     Topic 4 – Canberra, report, home, family, Perth, child, service, worker, work, leave (Economy)
     Topic 5 – make, show, talk, Adelaide, png, head, ban, centre, food, project (Local)
     Topic 6 – plan, council, WA, school, interview, community, high, hunter, cattle, student (Education)
     Topic 7 – hospital, minister, Tasmanian, budget, concern, cut, funding, job, time (Health)
     Topic 8 – year, Queensland, Sydney, kill, south, crash, car, hit, die, Melbourne (Car Accident)
     Topic 9 – man, police, charge, women, find, court, death, face, murder, miss (Crime)
     Topic 10 – rural, fire, gld, nsw, country_hour, national, nt, hour, warn, podcast (Local)
     Standard English stop words were used; lemmatization kept only nouns, adjectives, verbs, and adverbs. The Gensim package with Mallet's version of the LDA algorithm was applied to the data. Interpretable topics include International (#1), Politics (#2), Sports (#3), Economy (#4), local stories (#5, #10), Education (#6), Health (#7), Car Accident (#8), and Crime (#9).

  8. Observations – LDA v2 (Gensim)
     Topic 1 – australian, tasmanian, top, school, national, industry, victim, defence (Local)
     Topic 2 – australia, council, canberra, community, victorian, review, life, murder, concern, adelaide (Politics)
     Topic 3 – road, market, car, hospital, local, funding, law, ban, turnbull, people (Infrastructure)
     Topic 4 – melbourne, china, water, company, group, claim, qld, act (Local)
     Topic 5 – former, wa, home, good, hunter, leader, islamic state, turkey, grandstand, call (International)
     Topic 6 – queensland, perth, farmer, record, flood, high, worker, drought, force, price (Agriculture)
     Topic 7 – police, race, sale, warning, drug, number, deal, court, storm, star (Crime)
     Topic 8 – sydney, plan, death, test, family, government, change, crash, driver, big (Car Accident)
     Topic 9 – hour, hobart, league, child, report, season, charge, wa_country, vic_country (Local)
     Topic 10 – fire, resident, darwin, attack, student, dog, story, award, mine, service (Local)
     Based on word frequency, we added additional stop words such as time-frame terms (e.g., Jan, days, Sept, Christmas), nouns (e.g., woman, man), and verbs (e.g., say); lemmatization kept only nouns and adjectives. Gensim's LDA implementation was applied to the data. Interpretable topics include Local (#1, #4, #9, #10), Politics (#2), Infrastructure (#3), International (#5), Agriculture (#6), Crime (#7), and Car Accident (#8). This model had a coherence score of 0.57.

  9. Strengths and weaknesses of LDA
     Strengths:
     ❖ Unsupervised model without any labeling requirements
     ❖ Treats each document as a mixture of different topics, and each topic as a mixture of different words
     ❖ Provides an understanding of the underlying topic distributions that drive news headlines
     Weaknesses:
     ➢ Results vary with the choice of the number of topics
     ➢ Topic interpretability and overlap

  10. Applications of LDA on news
      Our analysis was performed on a single Australian news source (ABC) in the year 2015.
      ● The method could be applied across numerous years to study how the news topics covered in Australia have changed over time
      ● Topics learned from the Australian news could be compared to other countries' to learn how the topics of interest vary around the world
      ● The methods could be applied to different news sources, for example those with contrasting political ideologies or differing reader bases, to understand topic distributions across different demographics

  11. Q&A
