Review Topic Discovery with Phrases using the Pólya Urn Model
Geli Fei, Zhiyuan Chen, Bing Liu, University of Illinois at Chicago
Presenter: Alan Akbik, IBM Research Almaden / Berlin Institute of Technology
Product Aspects
Large collection of product reviews
◦ Example domain: Smartphones
Task: Discover aspects that are being discussed in the reviews
◦ Battery - battery life, AAA batteries
  "The battery life of this smartphone is great." "It uses AAA batteries."
◦ Screen - screen size, touch screen
◦ Camera - resolution, image quality
Topic Models
Widely used in review topic / aspect discovery
Most models regard each topic as a distribution over individual terms (unigrams)
Terms in each document are assigned to topics
◦ Documents are assigned to topics via their terms
The generation of topics is mostly governed by "higher-order co-occurrence" (Heinrich 2009)
◦ i.e., how often words co-occur in different contexts
Topic Models
Major issue: individual words may not convey the same information as natural phrases
◦ e.g. "battery life" vs. "life"
This leads to three problems:
◦ Interpretability - topics are hard for users to interpret unless they are domain experts
◦ Ambiguity - hard to directly make use of the topical words
◦ False evidence - causes extra or wrong co-occurrences in topic generation, leading to poorer topics
Possible Solutions (1)
Treat each whole phrase as one term
"The battery life of this smartphone is great"
<the> <battery_life> <of> <this> <smartphone> <is> <great>
Problems:
◦ Many phrases are very rare
◦ Important words are removed: "battery life" may not end up in the same topic as "battery", because we no longer observe their co-occurrence
Possible Solutions (2)
Keep individual words and add extra terms for phrases
"The battery life of this smartphone is great"
<the> <battery> <life> <battery_life> <of> <this> <smartphone> <is> <great>
Problems:
◦ False evidence still exists
◦ Many phrases are rare: "battery life" is much less frequent than "life", so it is unlikely to be ranked at the top of a topic
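To make the two preprocessing choices concrete, here is a minimal sketch of both tokenization strategies, assuming the phrase list has already been extracted; all function names are illustrative only.

```python
def tokenize_phrase_as_term(sentence, phrases):
    """Solution (1): replace each detected phrase with a single term."""
    tokens = sentence.lower().split()
    out, i = [], 0
    while i < len(tokens):
        # greedily match the longest known phrase starting at position i
        match = None
        for phrase in phrases:
            words = phrase.split()
            if tokens[i:i + len(words)] == words:
                if match is None or len(words) > len(match):
                    match = words
        if match:
            out.append("_".join(match))
            i += len(match)
        else:
            out.append(tokens[i])
            i += 1
    return out

def tokenize_word_plus_phrase(sentence, phrases):
    """Solution (2): keep the component words and add each phrase as an extra term."""
    tokens = sentence.lower().split()
    extra = ["_".join(p.split()) for p in phrases if p in sentence.lower()]
    return tokens + extra

phrases = ["battery life"]
print(tokenize_phrase_as_term("The battery life of this smartphone is great", phrases))
print(tokenize_word_plus_phrase("The battery life of this smartphone is great", phrases))
```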
Challenge
How can we retain the connections between phrases and words while removing wrong co-occurrences?
Related Work
Using n-grams in topic modeling (Mukherjee and Liu 2013; Mukherjee et al. 2013)
Identifying key phrases in a post-processing step based on the discovered topical unigrams (Blei and Lafferty 2009; Liu et al. 2010; Zhao et al. 2011)
Directly modeling word order in the topic model (Wallach 2006; Wang et al. 2007)
◦ This breaks the "bag-of-words" assumption
◦ Although the "bag-of-words" assumption does not always hold, it offers a great computational advantage
◦ Our method still follows the "bag-of-words" assumption
Gibbs Sampling for LDA
One of the most commonly used inference techniques for topic models
◦ Considers each term in the documents in turn
◦ Samples a topic for the current term, conditioned on the topic assignments of all other terms
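A minimal sketch of one sweep of collapsed Gibbs sampling for LDA, for reference. The count arrays `n_dk` (document-topic), `n_kw` (topic-term), `n_k` (topic totals), the assignment structure `z`, and the symmetric priors `alpha`/`beta` are illustrative names, not the authors' code.

```python
import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, V, rng):
    """One pass over all terms: re-sample each term's topic assignment."""
    K = n_k.shape[0]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # remove the current term's assignment from the counts
            n_dk[d, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
            # conditional distribution over topics given all other assignments
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k_new = rng.choice(K, p=p / p.sum())
            # record the sampled topic and restore the counts
            z[d][i] = k_new
            n_dk[d, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1
```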
Simple Pólya Urn Model (SPU)
Designed in the context of colored balls and urns
In the context of topic models:
◦ A ball of a certain color: a term
◦ The urn: contains a mixture of balls of various colors (terms)
The topic-word (topic-term) distribution is reflected by the proportion of balls of a certain color in the urn
Simple Pólya Urn Model (SPU)
◦ Left: initial state
◦ Middle: draw a ball of a certain color
◦ Right: put two balls of the same color back
Self-reinforcing property known as "the rich get richer"
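A toy simulation of the simple Pólya urn, purely for illustration: each draw returns the ball plus one extra ball of the same color, so colors that are drawn early become more and more likely ("the rich get richer").

```python
import random

def simple_polya_urn(initial_colors, draws, seed=0):
    rng = random.Random(seed)
    urn = list(initial_colors)
    for _ in range(draws):
        ball = rng.choice(urn)   # draw a ball of a certain color
        urn.append(ball)         # put it back plus another ball of the same color
    return urn

urn = simple_polya_urn(["red", "blue", "green"], draws=100)
print({color: urn.count(color) for color in set(urn)})
```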
Generalized Pólya Urn Model (GPU)
GPU vs. SPU: besides putting back two balls of the drawn color, a certain number of balls of some other colors are also put into the urn
◦ We call this the promotion of those colored balls
Using this idea in the sampling process:
◦ SPU: seeing "staff" under a topic only increases the chance of seeing it again under the same topic
◦ GPU: it also increases the chance of seeing "hotel staff" under that topic
Generalized Pólya Urn Model (GPU)
In our application:
◦ We use each whole phrase as a single term to remove wrong co-occurrences
◦ We use the GPU model to regain the connection between phrases and words
Two directions of promotion (see the sketch below):
◦ Word to phrase: when a topic is assigned to an individual word, phrases containing that word are promoted
◦ Phrase to word: when a topic is assigned to a phrase, each of its component words is promoted
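A hedged sketch of how GPU promotion can be folded into the Gibbs sampler's count updates: when term `w` is assigned topic `k`, every related term `w_rel` also receives `promotion[w][w_rel]` pseudo-counts under `k`. The `promotion` table layout and the function name are assumptions for illustration; the matching decrement before re-sampling is omitted here.

```python
def increment(n_kw, n_k, k, w, promotion, amount=1.0):
    """Add `amount` counts of term w under topic k, plus promoted pseudo-counts."""
    n_kw[k, w] += amount
    n_k[k] += amount
    for w_rel, weight in promotion.get(w, {}).items():
        n_kw[k, w_rel] += amount * weight   # promote related phrases / component words
        n_k[k] += amount * weight
```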
Datasets and Preprocessing
Datasets:
◦ 30 categories of electronics reviews from Amazon (1,000 reviews in each category)
◦ Hotel reviews from TripAdvisor (101,234 reviews)
◦ Restaurant reviews from Yelp (25,459 reviews)
Preprocessing:
◦ Review sentences are treated as documents, since standard topic models cannot discover product aspects well when applied directly to whole reviews (Titov and McDonald, 2008)
◦ Rule-based method for noun phrase detection, chosen for efficiency (see the sketch below)
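A minimal sketch of a rule-based noun phrase detector using a POS-pattern chunker: sequences of optional adjectives followed by nouns are kept as phrases. The grammar is a generic pattern, not necessarily the authors' exact rules, and it assumes NLTK with its tokenizer and POS tagger data installed.

```python
import nltk

# chunk optional adjectives followed by one or more nouns, e.g. "battery life"
grammar = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")

def noun_phrases(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = grammar.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "NP" and len(subtree.leaves()) > 1]

print(noun_phrases("The battery life of this smartphone is great"))
# -> ['battery life']  (only multi-word chunks are kept as phrases)
```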
Experiments
Four sets of experiments on 32 domains
◦ Baseline #1, LDA(w): without considering phrases
◦ Baseline #2, LDA(p): considers phrases, uses each whole phrase as a term
◦ Baseline #3, LDA(w_p): considers phrases, keeps individual component words, and adds phrases as extra terms
◦ LDA(p_GPU): our proposed method
Parameter Setting
We use the same set of parameters for all experiments
◦ Dirichlet priors are set as in (Griffiths and Steyvers, 2004): document-topic prior β = 50/L, where L is the number of topics, and topic-term prior γ = 0.1
◦ Number of topics L = 15
◦ Posterior inference was drawn after 2,000 Gibbs sampling iterations with 400 iterations of burn-in
Parameters for GPU Model
Not all words in a phrase are equally important
◦ e.g. "staff" is more important than "hotel" in "hotel staff"
Determining head nouns
◦ Following (Wang et al., 2007), we take the last word of a noun phrase as its head noun
GPU promotion (see the sketch below)
◦ Word to phrase: promote a phrase by virtual count when a topic is assigned to its head noun
◦ Phrase to word: promote the head noun by 0.5 * virtual count and all other words by 0.25 * virtual count when a topic is assigned to a phrase
◦ virtual count = 0.1, set empirically based on how much to promote phrases
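A sketch of how the two promotion directions above could be encoded as a lookup table, using the last word of each phrase as its head noun. The virtual count and the 0.5/0.25 split follow the slide; the dictionary layout itself is an assumption.

```python
def build_promotion(phrases, virtual_count=0.1):
    promotion = {}
    for phrase in phrases:                      # e.g. "hotel_staff"
        words = phrase.split("_")
        head = words[-1]                        # head noun = last word
        # word -> phrase: assigning a topic to the head noun promotes the phrase
        promotion.setdefault(head, {})[phrase] = virtual_count
        # phrase -> word: assigning a topic to the phrase promotes its words
        promotion.setdefault(phrase, {})[head] = 0.5 * virtual_count
        for w in words[:-1]:
            promotion[phrase][w] = 0.25 * virtual_count
    return promotion

print(build_promotion(["hotel_staff", "battery_life"]))
```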
Statistical Evaluation
Two commonly used evaluation statistics:
◦ Perplexity: measures the likelihood of unseen documents
◦ KL-divergence: measures the distinctiveness of topics
◦ Neither of them correlates well with human judgments
We use topic coherence (Mimno et al. 2011), sketched below
◦ It measures the degree of co-occurrence of topical words under a topic
◦ It has been shown to correlate quite well with human judgment
◦ It produces a negative value; the higher, the better
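A sketch of the topic coherence of Mimno et al. (2011): for the top-N terms of a topic, sum log((co-document frequency + 1) / document frequency) over all term pairs. The `doc_freq` and `co_doc_freq` inputs are assumed to be precomputed from the corpus, with co-document frequencies keyed by term pairs.

```python
import math

def topic_coherence(top_terms, doc_freq, co_doc_freq):
    """UMass-style coherence; negative, and higher (closer to zero) is better."""
    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            v_m, v_l = top_terms[m], top_terms[l]
            score += math.log((co_doc_freq.get((v_m, v_l), 0) + 1)
                              / doc_freq[v_l])
    return score
```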
Statistical Evaluation Topic Coherence using top 15 topical terms
Statistical Evaluation Topic Coherence using top 30 topical terms
Human Evaluation
Done by two annotators in two sequential stages
◦ Topic labeling (Kappa score: 0.838)
◦ Topical term labeling, evaluated by computing precision@n (Kappa score: 0.846)
◦ We compute the average p@15 and p@30 for each model on each domain (see the sketch below)
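A minimal sketch of precision@n over the annotated topical terms: the fraction of the top-n terms that annotators judged correct for the topic. The 0/1 label encoding is an assumption.

```python
def precision_at_n(labels, n):
    """labels: 1 if the term was judged correct for the topic, 0 otherwise."""
    return sum(labels[:n]) / n

# e.g. 12 of the top 15 terms judged correct
print(precision_at_n([1] * 12 + [0] * 3, 15))   # -> 0.8
```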
Human Evaluation
Human evaluation on five domains
◦ Hotel, Restaurant, Watch, Tablet, MP3Player
Example Topics
Example topics by LDA(w) and LDA(p_GPU)
Future Work
◦ Design a topic quality metric for topics with phrases
◦ Systematically set the amount of promotion based on the designed metric
Thank You!