Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. A paper by Kushal Dave, Steve Lawrence, and David M. Pennock. Presented by Ledao Chen and David Zhao.
Problem ● Product reviews are everywhere! ● How can you possibly read them all?
Relevant background ● Objectivity classification ○ Separating reviews from other content ● Word classification ○ How similar are two words? ● Sentiment classification ○ What emotion is a word associated with?
Data ● CNET ○ 7 categories, all electronics ○ Reviews with a binary good/bad rating ● Amazon ○ 7 categories, varied ○ Reviews with a 5-star rating
[Figure: Test 1 setup. Positive and negative reviews from the seven categories are split into train and test folds, with a portion held out for evaluation.]
[Figure: Test 2 setup. 10x sets: each set repeats the split of the seven categories' positive and negative reviews into evaluation, train, and test portions.]
Tokenization ● Strip HTML ● Tokenize document into sentences ● Tokenize sentences into words [ [“Peace”, “cannot”, “be”, “kept”, “by”, “force”, “;”, “it”, “can”, “only”, ...], [“Darkness”, “cannot”, “drive”, “out”, “darkness”, “;”, “only”, “light”, ...], [“Hate”, “cannot”, “drive”, “out”, “hate”, “;”, “only”, “love”, “can”, “do”, ...] ]
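A minimal sketch of this pipeline, assuming a simple regex-based tokenizer rather than the authors' exact tools:

```python
import re

def tokenize(html):
    text = re.sub(r"<[^>]+>", " ", html)            # strip HTML tags
    sentences = re.split(r"(?<=[.!?;])\s+", text)   # naive sentence split
    # split each sentence into word and punctuation tokens
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences if s.strip()]

doc = "Peace cannot be kept by force; it can only be achieved by understanding."
print(tokenize(doc))
# [['Peace', 'cannot', 'be', 'kept', 'by', 'force', ';'],
#  ['it', 'can', 'only', 'be', 'achieved', 'by', 'understanding', '.']]
```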
Metadata and statistical substitution ● Numerical tokens ‒ “I have 35” → “I have number” ● Product names ‒ “I like Nikon” & “I like Kodak” → “I like productname” ● Low-frequency terms ‒ “Peach fuzz” and “Pollen fuzz” → “unique fuzz” ● Product-specific terms ‒ “Lens is bad” and “RAM is bad” → “producttypeword is bad”
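A sketch of these substitutions; the name and type-word lists and the frequency cutoff are illustrative assumptions, not the paper's actual lists:

```python
import re
from collections import Counter

PRODUCT_NAMES = {"nikon", "kodak"}       # assumed per-product name list
PRODUCT_TYPE_WORDS = {"lens", "ram"}     # assumed product-specific terms

def substitute(tokens, corpus_freq, min_freq=2):
    out = []
    for t in tokens:
        low = t.lower()
        if re.fullmatch(r"\d+", t):                 # numerical tokens
            out.append("number")
        elif low in PRODUCT_NAMES:                  # product names
            out.append("productname")
        elif low in PRODUCT_TYPE_WORDS:             # product-specific terms
            out.append("producttypeword")
        elif corpus_freq[low] < min_freq:           # low-frequency terms
            out.append("unique")
        else:
            out.append(t)
    return out

corpus = ["I", "like", "Nikon", "I", "like", "Kodak", "I", "have", "35"]
freq = Counter(t.lower() for t in corpus)
print(substitute(["I", "like", "Nikon"], freq))    # ['I', 'like', 'productname']
```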
Linguistic substitution ● WordNet from tokens with part-of-speech tags ● Collocation of nouns and modifying adjectives ● Stemming of tokens ● Negation propagation ○ “not good or useful” → “not NOT good NOT or NOT useful”
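A sketch of the negation-propagation step alone, assuming negation scope ends at clause punctuation:

```python
NEGATIONS = {"not", "no", "never"}
STOPPERS = {".", "!", "?", ";", ","}   # assumed clause boundaries

def propagate_negation(tokens):
    out, negating = [], False
    for t in tokens:
        if t.lower() in NEGATIONS:     # start of a negated span
            out.append(t)
            negating = True
        elif t in STOPPERS:            # end of the negated span
            out.append(t)
            negating = False
        elif negating:                 # mark tokens inside the span
            out.append("NOT " + t)
        else:
            out.append(t)
    return out

print(propagate_negation(["not", "good", "or", "useful"]))
# ['not', 'NOT good', 'NOT or', 'NOT useful']
```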
N-gram and proximity ● For Test 1, trigrams performed best ● For Test 2, bigrams performed best ● Mixing n-grams with lower-order features ○ e.g. bigrams mixed with unigrams ● Smoothing using a lower-order reference model ● Proximity features: in [Peace cannot be kept by force it can only be achieved by understanding], “achieved” and “understanding” are combined into an “achieved-understanding” feature
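A minimal n-gram extractor; "mixing" here simply means emitting unigrams alongside higher-order n-grams (a sketch, not the paper's exact smoothing scheme):

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mixed_features(tokens, max_n=2):
    feats = []
    for n in range(1, max_n + 1):      # unigrams up through max_n-grams
        feats.extend(ngrams(tokens, n))
    return feats

print(mixed_features(["kept", "by", "force"], max_n=2))
# [('kept',), ('by',), ('force',), ('kept', 'by'), ('by', 'force')]
```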
Substrings
Substring trade-off: as substrings become longer, their frequency decreases; they are generally more discriminatory, but there is less evidence for considering them relevant.
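A sketch of substring features with a frequency cutoff, which makes the trade-off concrete (the length bounds and cutoff are illustrative assumptions):

```python
from collections import Counter

def substring_features(text, min_len=3, max_len=8, min_count=2):
    counts = Counter()
    for i in range(len(text)):
        for n in range(min_len, max_len + 1):
            if i + n <= len(text):
                counts[text[i:i + n]] += 1     # count every substring
    # longer substrings are rarer, so fewer survive the cutoff
    return {s: c for s, c in counts.items() if c >= min_count}

print(substring_features("great lens, great price"))
```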
Thresholding 1. Count the frequency of features 2. Normalize (optional) 3. Threshold Results across different threshold values were not significantly different.
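A minimal sketch of these three steps, with an assumed cutoff value:

```python
from collections import Counter

def threshold(features, min_count=3, normalize=False):
    counts = Counter(features)                       # 1. count frequencies
    total = sum(counts.values())
    kept = {f: c for f, c in counts.items() if c >= min_count}  # 3. threshold
    if normalize:                                    # 2. optional normalization
        kept = {f: c / total for f, c in kept.items()}
    return kept
```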
Smoothing ● Add-one smoothing: P(f) = (c(f) + 1) / (N + V) ● Witten-Bell smoothing: P(f) = c(f) / (N + T) for seen features, reserving mass T / (N + T) for unseen ones ● Good-Turing smoothing: P(f) = c*(f) / N, where c* = (c + 1) · N_{c+1} / N_c (c = feature count, N = total tokens, V = vocabulary size, T = number of distinct observed features, N_c = number of features occurring exactly c times)
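A minimal sketch of the simplest of these, add-one smoothing over feature counts:

```python
from collections import Counter

def add_one_prob(feature, counts, total, vocab_size):
    # every count is incremented by one, so unseen features
    # receive a small nonzero probability
    return (counts.get(feature, 0) + 1) / (total + vocab_size)

counts = Counter(["great", "great", "bad"])
total, vocab = sum(counts.values()), len(counts)
print(add_one_prob("great", counts, total, vocab))   # 3/5 = 0.6
print(add_one_prob("unseen", counts, total, vocab))  # 1/5 = 0.2
```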
Scoring Baseline: the normalized term frequency, obtained by taking the number of times a feature f_i occurs in class C and dividing it by the total number of tokens in C. A term's score compares these frequencies across the two classes and is thus a measure of bias ranging from –1 to 1.
Scoring score(f_i) = (P(f_i | C) − P(f_i | C′)) / (P(f_i | C) + P(f_i | C′))
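A sketch of this bias score, comparing a feature's normalized frequency in class C against class C′:

```python
from collections import Counter

def bias_scores(pos_tokens, neg_tokens):
    pos, neg = Counter(pos_tokens), Counter(neg_tokens)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    scores = {}
    for f in set(pos) | set(neg):
        p, q = pos[f] / n_pos, neg[f] / n_neg    # P(f|C) and P(f|C')
        scores[f] = (p - q) / (p + q)            # bias in [-1, 1]
    return scores

scores = bias_scores(["great", "great", "fine"], ["bad", "fine"])
print(scores["great"], scores["bad"])   # 1.0 and -1.0
```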
Scoring Alternatives: odds ratio ● Performs on par with an SVM ● Sensitive to different class sizes, and thus performs poorly on Test 1
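A sketch using the standard odds-ratio formulation (the paper's exact variant may differ); eps is an assumed guard against division by zero:

```python
def odds_ratio(p_pos, p_neg, eps=1e-9):
    # odds of the feature under C divided by its odds under C'
    return (p_pos * (1 - p_neg) + eps) / (p_neg * (1 - p_pos) + eps)
```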
Scoring Alternatives: Fisher discriminant ● Performs poorly on both tests
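A sketch of a per-feature Fisher discriminant, assuming the usual two-class formulation (squared difference of class means over the sum of class variances):

```python
import statistics

def fisher_score(pos_values, neg_values):
    mp, mn = statistics.mean(pos_values), statistics.mean(neg_values)
    vp, vn = statistics.pvariance(pos_values), statistics.pvariance(neg_values)
    # larger when class means are far apart relative to in-class spread
    return (mp - mn) ** 2 / (vp + vn) if (vp + vn) else 0.0
```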
Reweighting ● Multiplying by document frequency, dampened by a logarithm, improved results on Test 1 ● A Gaussian weighting scheme on term frequency also improved results on Test 1
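Sketches of both reweighting schemes; the Gaussian's center and width are illustrative assumptions:

```python
import math

def df_reweighted(score, doc_freq):
    # dampen document frequency with a log so common features don't dominate
    return score * math.log(1 + doc_freq)

def gaussian_tf_weight(tf, mu=2.0, sigma=1.0):
    # bell-shaped weight centered on a "typical" term frequency
    return math.exp(-((tf - mu) ** 2) / (2 * sigma ** 2))
```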
Classifying Basic idea: sum the scores of the words in an unknown document and use the sign of the total to determine its class
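A minimal sketch of this classifier over the per-feature scores built earlier:

```python
def classify(tokens, scores):
    # unknown words contribute nothing; the sign of the sum decides the class
    total = sum(scores.get(t, 0.0) for t in tokens)
    return "positive" if total > 0 else "negative"

print(classify(["great", "fine"], {"great": 1.0, "fine": -0.2}))  # positive
```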
Mining Basic idea: crawl search-engine results for a given product’s name and attempt to identify and analyze product reviews within this set, scoring them with the model trained on the review data described earlier. Discard certain pages, paragraphs, and sentences (such as pages without “review” in their title, paragraphs not containing the name of the product, and excessively long or short sentences).
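A sketch of the filtering heuristics listed above; the sentence-length bounds are assumed values:

```python
def keep_page(title):
    return "review" in title.lower()            # page must look like a review

def keep_paragraph(paragraph, product):
    return product.lower() in paragraph.lower() # must mention the product

def keep_sentence(sentence, min_words=3, max_words=50):
    n = len(sentence.split())                   # drop very short/long sentences
    return min_words <= n <= max_words
```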
Mining Evaluation Randomly selected 600 sentences (200 for each of 3 products) from search-engine results, as parsed and thresholded by the mining tool. Manually tagged as positive (P), negative (N), or ambiguous (I); tagging was very subjective. “Ambiguous” means the sentence was ambiguous when taken out of context, did not express an opinion at all, or was not describing the product. Counts: P: 173, N: 71, I: 356.
Mining Evaluation Classification of these sentences performed worse than tossing a coin.
Summary and Conclusions ● Obtained fairly good results for the review classification task through the choice of appropriate features and metrics ● Identified a number of issues that make this problem difficult
The Issues ● Rating inconsistency ○ Different understandings of the 1–5 star scale ● Ambivalence and comparison ○ Some reviewers use terms with negative connotations, but then write an equivocating final sentence explaining that overall they were satisfied.
The Issues ● Sparse data ○ Many of the reviews are very short ○ Amazon is OK, but many features in the C|net reviews occur in 3 or fewer documents ● Skewed distribution ○ Positive reviews are predominant ○ Certain products and product types have more reviews ■ e.g. “camera” ends up scored as positive ■ Negative reviews are longer, and their language is more varied.
Questions?