

  1. Topic Modeling on ABC News
     Mining Unstructured Data, Team 2: Steve Barnard, Charles Huang, Shyam Senthilkumar, Eda Wang
     May 21, 2019

  2. ABC News headlines are used for topic modeling
     78,000 headlines from news articles published in 2015, derived from the Kaggle dataset linked below:
     ● Sourced from the Australian Broadcasting Corp. (ABC)
     ● ~200 articles published each day
     ● A mix of international and Australian news
     ● Sample headlines:
       “Egyptian court orders retrial for journalist Peter Greste”
       “Boat on fire at Jacobs Well Marina”
       “Woman arrested after explosives allegedly found in car”
     We expected LDA to find common news topics: Sports, Politics, Local News, Health, Science, etc.
     ● https://www.kaggle.com/therohk/million-headlines
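The dataset load described above can be sketched as follows. This assumes the Kaggle file's layout (a `publish_date` column as a YYYYMMDD integer and a `headline_text` column); the tiny inline frame stands in for reading the real CSV, and the helper name is ours, not the team's:

```python
# Sketch of filtering the Kaggle "A Million News Headlines" data to 2015,
# as the slides describe. Assumed schema: publish_date (YYYYMMDD int),
# headline_text (string).
import pandas as pd

def headlines_for_year(df: pd.DataFrame, year: int) -> pd.Series:
    """Return headlines whose YYYYMMDD publish_date falls in the given year."""
    return df.loc[df["publish_date"] // 10000 == year, "headline_text"]

# Tiny inline sample standing in for pd.read_csv on the full dataset.
sample = pd.DataFrame({
    "publish_date": [20141231, 20150102, 20150630, 20160101],
    "headline_text": [
        "old headline",
        "egyptian court orders retrial for journalist peter greste",
        "boat on fire at jacobs well marina",
        "new year headline",
    ],
})
print(list(headlines_for_year(sample, 2015)))
```

On the full file, the same filter yields the ~78,000 headlines from 2015 used throughout.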

  3. Number of topics chosen via a heuristic approach
     Attempted domain-knowledge/heuristic-based topic selection:
     ● The Australian Broadcasting Corp. (ABC), our data source, lists 12 topics on its website: https://www.abc.net.au/
     ● CNN lists 10 topics (excluding video)
     The difference in topic counts, with one outlet US-centric and the other Australia-centric, added some difficulty in interpreting topic themes (i.e., no business/politics, and the addition of indigenous/rural). As an experiment to normalize the differences, Australian words were manually added to the stop-word list, with limited success (common words such as “man”/“woman” were also added). Both topic lists are as of 2019; the data used for our LDA variants is from 2015, so the topics may have changed.

  4. Number of topics chosen via coherence scores
     Coherence-based topic selection:
     ● Coherence scores fluctuated throughout our multiple iterations of LDA and its variants
       ○ Coherence ranged from ~0.26 to 0.56, depending on model type, number of topics selected, stop words, etc.
     ● Overall, models with a higher number of topics tended to perform better in terms of coherence
     ● We settled on k = 10 topics, at the lower end of the possible range, as a comfortable starting point
     [Figure: Coherence scores vs. number of topics for one of our models, illustrating the iterative search for an optimal topic count based on coherence.]
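The search loop above can be sketched as follows. The project scored coherence through Gensim; as a dependency-light stand-in, this sketch fits scikit-learn's LDA for several k and computes a hand-rolled UMass coherence on a toy corpus, just to show the shape of the sweep:

```python
# Sweep k and score each LDA fit with UMass coherence:
# sum over top-word pairs of log((D(wi, wj) + 1) / D(wj)).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "police charge man over sydney fire",
    "court orders murder trial for man",
    "australia win cricket world cup final",
    "rugby final sees australia beat england",
    "council funding boosts hospital services",
    "government announces health funding plan",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)
presence = (X.toarray() > 0)  # doc-term incidence for co-occurrence counts

def umass_coherence(components, top_n=5):
    """Mean UMass coherence over topics, from top-n words per topic."""
    scores = []
    for topic in components:
        top = np.argsort(topic)[::-1][:top_n]
        s = 0.0
        for i in range(1, len(top)):
            for j in range(i):
                d_ij = np.sum(presence[:, top[i]] & presence[:, top[j]])
                d_j = np.sum(presence[:, top[j]])
                s += np.log((d_ij + 1) / d_j)
        scores.append(s)
    return float(np.mean(scores))

for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    print(k, round(umass_coherence(lda.components_), 3))
```

In the project the analogous scores were plotted against k and k = 10 was chosen; note Gensim's c_v coherence (the ~0.26-0.56 range above) is on a different scale than UMass.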

  5. Summary of topic modeling methods and observations

                      LSA                  LDA v1               LDA v2
     Dataset          ABC 2015             ABC 2015             ABC 2015
     Stop words       'english' plus       'english'            'english' plus
                      variations                                variations
     Pkg/Method       Sklearn KMeans       Gensim + Mallet's    Gensim
                                           LDA
     # of Topics      10                   10                   10
     Shared Themes    Politics, Local News, Crime, International (all models)
     Unique Themes    Sports               Economy, Education,  Agriculture,
                                           Car Accident,        Car Accident,
                                           Health, Sports       Infrastructure

  6. Observations – LSA (K-Means clustering)
     Topic 1 – police, wa, court, charged, death, sa, murder, fire, car, crash (Crime)
     Topic 2 – australia, world, cup, win, final, south, rugby, one, cricket, ntch (Sports)
     Topic 3 – country, hour, nt, vic, tas, nsw, march, qld (Local)
     Topic 4 – government, council, coast, sydney, health, funding, tasmanian, canberra (Politics)
     Topic 5 – australian, us, market, china, west, open, share (International)
     Topic 6 – news, national, exchange, rural, abc, quiz, press, club, march, park (Local)
     Topic 7 – live, nrl, league, afl, streaming, updates, super, blog, export, final (Sports)
     Topic 8 – grandstand, drum, hill, capital, breakfast, march, broken, stumps, digital, confab (Local)
     Topic 9 – rural, qld, sach, reporter, countrywide, sa, north, outback, drought, central (Local)
     Topic 10 – nsw, interview, election, extended, rural, rain, wrap, shark, baird, police (Politics)
     [Image: word cloud based on the news data]
     Based on word frequency, we added additional stop words such as time-frame terms (e.g., Jan, days, Sept, Christmas), nouns (e.g., woman, man), and verbs (e.g., say). We applied KMeans from the sklearn package with n_clusters=10. K-Means clustering on the data with stop words removed produced interpretable topics such as Sports (topics #2, #7), Politics (#4, #10), local stories (#3, #6, #8, #9), Crime (#1), and International (#5).

  7. Observations – LDA v1 (Gensim + Mallet's LDA)
     Topic 1 – Australia, Australian, day, farmer, China, water, test, market, price, rise (International)
     Topic 2 – government, call, change, act, urge, election, labor, support, law, group (Politics)
     Topic 3 – win, open, world, set, lead, beat, record, final, return, cup (Sports)
     Topic 4 – Canberra, report, home, family, Perth, child, service, worker, work, leave (Economy)
     Topic 5 – make, show, talk, Adelaide, png, head, ban, centre, food, project (Local)
     Topic 6 – plan, council, WA, school, interview, community, high, hunter, cattle, student (Education)
     Topic 7 – hospital, minister, Tasmanian, budget, concern, cut, funding, job, time (Health)
     Topic 8 – year, Queensland, Sydney, kill, south, crash, car, hit, die, Melbourne (Car Accident)
     Topic 9 – man, police, charge, women, find, court, death, face, murder, miss (Crime)
     Topic 10 – rural, fire, gld, nsw, country_hour, national, nt, hour, warn, podcast (Local)
     Standard English stop words were used; lemmatization kept only nouns, adjectives, verbs, and adverbs. The Gensim package with Mallet's version of the LDA algorithm was applied to the data. Interpretable topics include International (#1), Politics (#2), Sports (#3), Economy (#4), local stories (#5, #10), Education (#6), Health (#7), Car Accident (#8), and Crime (#9).

  8. Observations – LDA v2 (Gensim)
     Topic 1 – australian, tasmanian, top, school, national, industry, victim, defence (Local)
     Topic 2 – australia, council, canberra, community, victorian, review, life, murder, concern, adelaide (Politics)
     Topic 3 – road, market, car, hospital, local, funding, law, ban, turnbull, people (Infrastructure)
     Topic 4 – melbourne, china, water, company, group, claim, qld, act (Local)
     Topic 5 – former, wa, home, good, hunter, leader, islamic state, turkey, grandstand, call (International)
     Topic 6 – queensland, perth, farmer, record, flood, high, worker, drought, force, price (Agriculture)
     Topic 7 – police, race, sale, warning, drug, number, deal, court, storm, star (Crime)
     Topic 8 – sydney, plan, death, test, family, government, change, crash, driver, big (Car Accident)
     Topic 9 – hour, hobart, league, child, report, season, charge, wa_country, vic_country (Local)
     Topic 10 – fire, resident, darwin, attack, student, dog, story, award, mine, service (Local)
     Based on word frequency, we added additional stop words such as time-frame terms (e.g., Jan, days, Sept, Christmas), nouns (e.g., woman, man), and verbs (e.g., say); lemmatization kept only nouns and adjectives. Gensim's LDA implementation was applied to the data. Interpretable topics include Local (#1, #4, #9, #10), Politics (#2), Infrastructure (#3), International (#5), Agriculture (#6), Crime (#7), and Car Accident (#8). This model had a coherence score of 0.57.

  9. Strengths and weaknesses of LDA
     Strengths:
     ❖ Unsupervised model without any labeling requirements
     ❖ Treats each document as a mixture of different topics, and each topic as a mixture of different words
     ❖ Provides an understanding of the underlying topic distributions that drive news headlines
     Weaknesses:
     ➢ Results vary with the choice of the number of topics
     ➢ Topic interpretability and overlap

  10. Applications of LDA on news
      Our analysis was performed on a single Australian news source (ABC) in the year 2015.
      ● The method could be applied across numerous years to study how the news topics covered in Australia have changed over time
      ● Topics learned from the Australian news could be compared to other countries' to learn how the topics of interest vary around the world
      ● The methods could be applied to different news sources, for example those with contrasting political ideologies or differing reader bases, to understand topic distributions across different demographics

  11. Q&A
