

  1. Movie Review Classifications CONG “OLAF” CHEN

  2. Objective • Genres according to IMDb: animation, adventure, comedy, drama, family, fantasy. What a mouthful! • Excerpt from sample critic review: “The level of invention is so high, and the density of detail is so great, that it’s impossible to absorb everything in a single viewing.” –Joe Morgenstern, Wall Street Journal • Can we recover the movie’s genre(s) if the review snippet is all we have? How about the mystery reviewer’s mood?

  3. Methods used (and also considered) • PRAW (Reddit API) • Gensim’s LDA model • IMDbPy (dead end) • NLTK • Pinterest API • Scikit Learn

  4. Dataset of genres • ~75,000 reviews to train with plus ~25,000 to test, each containing the IMDb ID of the corresponding movie • Distribution of genres in the test set as follows, ignoring all genres present in fewer than 1000 reviews • On average, a movie corresponds to 7 reviews and a review corresponds to 2.5 genre labels
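A minimal sketch of how these dataset statistics could be computed, assuming the reviews sit in a CSV file named reviews_train.csv with hypothetical columns imdb_id and genres (pipe-separated labels); neither the file name nor the layout comes from the slides:

```python
import csv
from collections import Counter

genre_counts = Counter()   # how many reviews carry each genre label
movie_counts = Counter()   # how many reviews each movie has
label_total = 0            # total number of (review, genre) pairs

with open("reviews_train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        genres = row["genres"].split("|")   # e.g. "Animation|Comedy|Family"
        movie_counts[row["imdb_id"]] += 1
        genre_counts.update(genres)
        label_total += len(genres)

n_reviews = sum(movie_counts.values())
print("reviews per movie:      ", n_reviews / len(movie_counts))  # ~7 on the slide
print("genre labels per review:", label_total / n_reviews)        # ~2.5 on the slide

# Ignore genres present in fewer than 1000 reviews, as on the slide.
frequent = {g: c for g, c in genre_counts.items() if c >= 1000}
for genre, count in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(f"{genre:12s} {count}")
```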

  5. Libshorttext model • Popular and trusted package for NLP needs at LTH • 8 ways to preprocess data: unigram/bigram, with/without stopword removal, with/without Porter stemming • 4 classification mechanisms: standard/L1/L2 SVC, logistic regression • 4 ways to weight features: binary, word count, term frequency, TFIDF • 6-7 minutes to train on a corpus of 190,000 reviews
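LibShortText is driven through its own training and prediction scripts; as a rough, hedged equivalent, a comparable configuration (bigram features, stopword removal, TF-IDF weighting, L2-regularised linear SVC) can be sketched with scikit-learn, which the methods slide also lists. Everything below, including the toy training data, is illustrative rather than the actual setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data: one review snippet and one genre label per example.
train_texts = [
    "The level of invention is so high and the density of detail is so great",
    "A slow, heartfelt story about family and loss",
]
train_labels = ["Fantasy", "Drama"]

model = Pipeline([
    # Unigrams + bigrams with English stopword removal; Porter stemming would
    # need a custom tokenizer (e.g. NLTK's PorterStemmer) and is left out here.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    # L2-regularised linear SVC, one of the better-performing learners on the next slide.
    ("clf", LinearSVC()),
])

model.fit(train_texts, train_labels)
print(model.predict(["impossible to absorb everything in a single viewing"]))
```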

  6. Classification Accuracy • Accuracy obtained by comparing libshorttext predictions to IMDb listings (in this table, we use bigram features, stemming and stopword filtering):

             Bin.    Word Count   Term Freq.   TFIDF
    SVM      .7164   .7071        .7074        .7190
    L1SVM    .7324   .7362        .7362        .7409
    L2SVM    .7811   .7791        .7791        .7843
    LogReg   .7730   .7761        .7761        .7745

  • Preprocessing and feature selection do not make a major difference by themselves (+/- 1%), but the classification mechanism can: L2SVM and LogReg are better • However, we have a problem…
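The slides do not spell out exactly how a single predicted label is scored against a multi-genre IMDb listing; one plausible reading, counting a prediction as correct when it appears among the movie's listed genres, can be sketched as follows (all names hypothetical):

```python
def accuracy(predicted_labels, true_genre_sets):
    """Share of reviews whose predicted genre appears among the movie's IMDb genres.

    predicted_labels: one predicted label per review.
    true_genre_sets:  the set of IMDb genres of that review's movie, same order.
    """
    hits = sum(pred in truth for pred, truth in zip(predicted_labels, true_genre_sets))
    return hits / len(predicted_labels)

# Toy example: the first prediction matches a listed genre, the second does not.
print(accuracy(["Drama", "Horror"], [{"Drama", "Romance"}, {"Comedy"}]))  # 0.5
```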

  7. Confusion table snippets

    Actual / Predicted   Romance   Horror
    Romance                  257       70
    Horror                     5     2444

    Actual / Predicted   Comedy   Sci-Fi
    Comedy                 4086       79
    Sci-Fi                  399      631

    Actual / Predicted   Drama   Action   Thriller
    Drama                 8145      330        574
    Action                1180      954        438
    Thriller              2192      493       1008

  8. The “Drama” label • 47% of all our test reviews come from a movie listed by IMDb as a “Drama”. We would expect a similar proportion in our training set. Not surprising given that all movies have to be at least a little dramatic, but this is too high a percentage given we have 27 different genre labels! • This is not good for our model! We output as many genre labels per review as IMDb lists for its movie, but the same review yields the same label every time. • Frequency bias/cold start: as a result, 42.75% of all output labels are “Drama”; it’s like a security blanket for our model when it sees things it doesn’t recognize. Meanwhile, not a single instance had the predicted label “News”, “Talk Show”, “Game Show”, or “Adult”. • We must retrain! Remove all instances labelled “Drama” (a sketch of this step follows below). This alone should not take any individual reviews or movies out of the training or testing sets.
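A minimal sketch of the retraining step above, assuming each review carries a set of genre labels; this reading (strip the label, keep the review) is an interpretation of the slide, and the data layout is hypothetical:

```python
def drop_drama(genre_sets):
    """Strip the "Drama" label from every review's genre set before retraining."""
    cleaned = [genres - {"Drama"} for genres in genre_sets]
    # Per the slide, no review or movie should fall out of the data this way,
    # i.e. nothing should have been labelled *only* as Drama.
    assert all(cleaned), "a review was labelled only as Drama"
    return cleaned

print(drop_drama([{"Drama", "Romance"}, {"Comedy", "Drama"}, {"Horror"}]))
# [{'Romance'}, {'Comedy'}, {'Horror'}]
```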

  9. Precision and Recall
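For reference, the per-label precision and recall plotted on this slide follow the standard definitions (not restated on the slides themselves):

```latex
\[
  \text{Precision}(g) = \frac{TP_g}{TP_g + FP_g},
  \qquad
  \text{Recall}(g) = \frac{TP_g}{TP_g + FN_g}
\]
% TP_g: reviews correctly assigned genre g
% FP_g: reviews wrongly assigned genre g
% FN_g: reviews with genre g that the model missed
```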

  10. Some observations • Every tidbit of extraneous info we cut out makes a difference! The vast majority of the points on each scatterplot lie above the blue line in both cases. • Comedy and horror have excellent recall rates compared to precision. They are the 2nd and 4th most popular labels in the test set. For romance (5th), this rate somehow went up from 5.95% to 36.53%! Less common labels generally have a higher precision than recall rate, because our model is less likely to guess them; but when that actually happens, it knows. • The overall accuracy rate actually dropped from 77.59% to 65.47%, once our model could no longer simply output “Drama” knowing it had a decent chance of being correct. Keep in mind that “Drama” is not a very informative label!

  11. Future exploration • Our model was trained with relatively high-end data. But the same is not necessarily true for data we crawl from clients’ social media accounts, such as Reddit. How does our model adapt to different “languages”? • The next most frequent genres on the “chopping block” are comedy and thriller. Do they provide as much confounding information about a review as “drama” does? • If each review were given a multidimensional sentiment rating, each on a scale from 1 to 10, would an SVM or linear regression better classify a client’s current mood?

  12. Acknowledgements • Pierre Nugues • For critical training data, but more importantly your invaluable real-life application insight: – Lars Hård – Axel Antonsson – Kateryna Wikström – Ola Lindberg Best of luck in Silicon Valley! • Andrew Maas, Stanford University (IMDb training data) • Chih-Jen Lin, National Taiwan University (libshorttext) • Joe Morgenstern, Wall Street Journal (sample review)
