CSE217 INTRODUCTION TO DATA SCIENCE LECTURE 3: SENTIMENT ANALYSIS Spring 2019 Marion Neumann
SENTIMENT ANALYSIS
…discover people’s opinions, emotions, and feelings about a subject, topic, product, or service from text
Step 1: Get the data
Step 2: Process text into features
Step 3: Infer sentiment
SENTIMENT ANALYSIS
Recap: Data Science Workflow
scientific, social, or business problem → data problem → collect & understand data → clean & format data → use data to create solution
Example: improve movie recommender or gauging brand perception → sentiment analysis → scrape web/twitter → working with text data → rule-based predictor or machine learning classifier
SENTIMENT ANALYSIS WORKFLOW
Text processing steps: Negation Handling, Stemming, Feature Extraction
→ rule-based prediction
→ machine learning classifier
[Figure: example tokens "bad ping pong excluded rio 2016" become "bad ping pong exclude rio 2016" after stemming]
RULE-BASED APPROACH
→ Lab 3
DSFS p25: Control Flow
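A rule-based predictor can be sketched with plain control flow. The following is a minimal illustration, not the Lab 3 solution: the word lists, the helper names (`tokenize`, `word_score`, `rule_based_sentiment`), and the rule of summing per-word scores are all assumptions made for this example.

```python
import re

# Hypothetical word lists for illustration only; a real lexicon is far larger.
POSITIVE_WORDS = {"great", "friendly", "good", "love"}
NEGATIVE_WORDS = {"bad", "horrible", "noise", "awful"}


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_score(word):
    """Fixed score per word: +1 (positive), -1 (negative), 0 (unknown)."""
    if word in POSITIVE_WORDS:
        return 1
    if word in NEGATIVE_WORDS:
        return -1
    return 0


def rule_based_sentiment(text):
    """Sum the word scores and map the total to a sentiment label."""
    total = sum(word_score(word) for word in tokenize(text))
    if total > 0:
        return "positive"
    elif total < 0:
        return "negative"
    else:
        return "neutral"
```

For example, `rule_based_sentiment("Great flavor and friendly service")` sums two +1 scores and returns `"positive"`. A rule like this ignores negation ("not great"), which is one reason the workflow includes a negation handling step.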
TEXT DATA
• Data representation → strings
• four kinds of string data
  1) categorical data
  2) free strings (that can be semantically mapped to categories)
  3) structured string data
  4) free-form text data
→ What makes text different?
TEXT DATA
…is Big Data!
MACHINE LEARNING APPROACH
• Classification
FEATURES FOR TEXT DATA
• bag of words → does word occur in document yes / no → binary feature
  Example review (highlighted feature words: location, great, friends, small):
  "Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise. …"
• word counts → how often does word occur? → count feature
• more advanced: n-grams, TF-IDF
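The two basic feature types can be sketched in a few lines of Python. This is an illustrative example, not course code: the small `vocabulary` list, the regular-expression tokenizer, and the function names are assumptions; the review text is the one from the slide.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_counts(text, vocabulary):
    """Count feature: how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def bag_of_words(text, vocabulary):
    """Binary feature: 1 if the vocabulary word occurs in the text, else 0."""
    return [1 if count > 0 else 0 for count in word_counts(text, vocabulary)]


# Hypothetical five-word vocabulary, chosen only for this example.
vocabulary = ["location", "great", "friends", "small", "horrible"]

review = (
    "Same great flavor and friendly service as in the S 18th street location. "
    "This location is not as small but it's hard to talk to friends. "
    "Thankfully there is great outdoor seating to escape the noise."
)

count_features = word_counts(review, vocabulary)    # [2, 2, 1, 1, 0]
binary_features = bag_of_words(review, vocabulary)  # [1, 1, 1, 1, 0]
```

Both feature vectors have one entry per vocabulary word, in vocabulary order. The count vector records that "location" and "great" each appear twice, while the binary vector only records presence.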
FEATURE REPRESENTATION
• bag-of-words and word counts are vectors of review features (binary features or counts)
• one vector entry per word in the vocabulary (dictionary), e.g. one entry for "great", one for "horrible"
• D = number of words in the dictionary (about 170,000)
• extremely sparse features: many zeros since most words do not appear in a review
PDSH p38: Arrays
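A small NumPy sketch of this representation follows. The eight-word `vocabulary`, the `count_vector` helper, and the pre-tokenized example review are assumptions made for illustration; a real vocabulary would have far more entries, which makes the vectors even sparser.

```python
import numpy as np

# Hypothetical tiny vocabulary; its length plays the role of D.
vocabulary = ["great", "horrible", "location", "service",
              "noise", "friendly", "small", "bad"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def count_vector(tokens):
    """Return a length-D array with the count of each vocabulary word."""
    x = np.zeros(len(vocabulary), dtype=int)
    for token in tokens:
        index = word_to_index.get(token)
        if index is not None:  # words outside the vocabulary are ignored
            x[index] += 1
    return x


tokens = "great flavor and great service".split()

x_counts = count_vector(tokens)         # word-count features
x_binary = (x_counts > 0).astype(int)   # bag-of-words (binary) features
sparsity = np.mean(x_counts == 0)       # fraction of zero entries
```

Here only two of the eight entries are nonzero, so `sparsity` is 0.75. With a dictionary-sized vocabulary and a short review, nearly every entry is zero.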
WHAT IS A CLASSIFIER?
• Rule-based
  • list of positive and negative words results in fixed score (+1, -1, or 0) for each word
• Classifier
  • no fixed lists of positive/negative words
  • each word gets a weight parameter $w_i$ assigned
  • classifier = parameterized model of the relationship between input and output/label
  • e.g. label $= \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)$ using a linear relationship
  • $\mathbf{w} \cdot \mathbf{x}$ is referred to as the dot product, inner product, or scalar product
  • classifier learns the weights from labeled training data
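The prediction step of a linear classifier can be sketched with NumPy as follows. This assumes a decision rule of the form sign(w · x + b), where b is a bias term. The weights and bias below are made-up numbers for illustration; in practice they are learned from labeled training data, which this sketch does not cover.

```python
import numpy as np

# Hypothetical vocabulary and hand-picked parameters (not learned values).
vocabulary = ["great", "friendly", "horrible", "bad"]
w = np.array([1.5, 0.8, -2.0, -1.2])  # one weight per vocabulary word
b = 0.1                               # bias term


def predict(x):
    """Return +1 or -1 from the sign of the dot product plus bias.

    Returns 0 in the edge case where w . x + b is exactly zero.
    """
    return int(np.sign(np.dot(w, x) + b))


x_positive = np.array([1, 1, 0, 0])  # review containing "great" and "friendly"
x_negative = np.array([0, 0, 1, 1])  # review containing "horrible" and "bad"

label_positive = predict(x_positive)  # sign(2.3 + 0.1)  -> 1
label_negative = predict(x_negative)  # sign(-3.2 + 0.1) -> -1
```

Unlike the fixed +1/-1 word scores of the rule-based approach, each weight here can take any real value, so some words can count more than others.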
CLASSIFIER
• output (sentiment) is a binary class
Is this new review positive or negative?
EVALUATION
• Which approach (rule-based or machine learning) performs better?
→ How can we measure this?
• Measures:
  • error rate (or misclassification rate) $= \dfrac{\#\,\text{misclassified test points}}{\#\,\text{test points}}$
  • average accuracy $= 1 - \text{error rate}$
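Both measures are straightforward to compute from true and predicted labels on a test set. The sketch below is illustrative; the function names and the four example labels are assumptions, not course data.

```python
def error_rate(y_true, y_pred):
    """Fraction of test points whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one test point")
    misclassified = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return misclassified / len(y_true)


def accuracy(y_true, y_pred):
    """Accuracy is one minus the error rate."""
    return 1 - error_rate(y_true, y_pred)


# Made-up labels for four test reviews (+1 = positive, -1 = negative).
y_true = [1, -1, 1, -1]
y_pred = [1, 1, 1, -1]

test_error = error_rate(y_true, y_pred)  # 1 of 4 misclassified -> 0.25
test_accuracy = accuracy(y_true, y_pred)  # 1 - 0.25 -> 0.75
```

Running the same two functions on the predictions of the rule-based predictor and of the trained classifier, using the same test set, gives a direct comparison of the two approaches.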
SUMMARY & READING
• Sentiment Analysis automatically identifies, extracts, and analyzes emotions in text data.
• Text data needs to be preprocessed to get features that can be used for prediction and learning.
• Linear classification is used to predict binary or categorical targets.
• DSFS
  • Ch4: Linear Algebra → Vectors (p49-53). Do not use the implementations introduced in this chapter → use NumPy Arrays! (PDSH p38)
  • Ch9: Getting Data (p105-108, p114-120)
  • Ch20: Natural Language Processing (p239-244)