Opinion Mining in GATE
Horacio Saggion & Adam Funk

• Opinion mining is interested in the opinion a particular piece of discourse expresses
  – Opinions are subjective statements reflecting people's sentiments or perceptions of entities or events
• There are various problems associated with opinion mining
  – Identify whether a piece of text is opinionated or not (factual news vs. editorial)
  – Identify the entity expressing the opinion
  – Identify the polarity and degree of the opinion (in favour vs. against)
  – Identify the theme of the opinion (opinion about what?)

• Extract factual data with information extraction from the company Web site
• Extract opinions using opinion mining from Web fora

• Combine information extraction from the company Web site with OM findings
  – Given a review, find the company's web pages and extract factual information from them, including products and services
  – Associate the opinion with the information found
• Use information extraction to identify positive/negative phrases and the "object" of the opinion
  – Positive: "correctly packed bulb", "a totally free service", "a very efficient management" …
  – Negative: "the same disappointing experience", "unscrupulous double glazing sales", "do not buy a sofa from DFS Poole or DFS anywhere", "the utter inefficiency" …

[Figure labels: sentiment; opinion]

[Figure labels: positive opinions; negative opinions; negative opinion, but less evident]

• Because we have access to documents that already have an associated class, we see OM as a classification problem
  – we consider our data "opinionated"
• We are interested in:
  – differentiating between positive and negative opinion
    • "customer service is diabolical"
    • "I have always been impressed with this company"
  – recognising fine-grained evaluative texts (1-star to 5-star classification)
    • "one of the easiest companies to order with" (5 stars)
    • "STAY AWAY FROM THIS SUPPLIER!!!" (1 star)
• We use a supervised learning approach (Support Vector Machines) that uses linguistic features; the system decides which features are most valuable for classification
• We use precision, recall, and F-score to assess classification accuracy

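The three evaluation measures can be sketched for a single class as follows. This is a minimal illustration in plain Python, not the evaluation code used in the experiments; the function name `precision_recall_f1` and the toy gold/predicted labels are assumptions made up for the example.

```python
def precision_recall_f1(gold, predicted, target):
    """Precision, recall and F1 (balanced F-score) for one class, `target`."""
    pairs = list(zip(gold, predicted))
    # True positives: predicted `target` and the gold label agrees.
    tp = sum(1 for g, p in pairs if g == target and p == target)
    # False positives: predicted `target` but the gold label differs.
    fp = sum(1 for g, p in pairs if g != target and p == target)
    # False negatives: gold label is `target` but something else was predicted.
    fn = sum(1 for g, p in pairs if g == target and p != target)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical toy data: thumbs up/down labels for six reviews.
gold = ["down", "down", "up", "down", "up", "down"]
predicted = ["down", "up", "up", "down", "down", "down"]

p, r, f = precision_recall_f1(gold, predicted, "down")
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

On skewed data such as these reviews, per-class precision and recall say more than plain accuracy, because a classifier that always predicts the majority class can still look accurate.
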
• We have a customisable crawling process to collect texts from Web fora
• 92 texts from a Web consumer forum
  – each text contains a review of a particular company/service/product and a thumbs up/down rating
  – texts are short (one or two paragraphs)
  – 67% negative and 33% positive
• 600 texts from another Web forum containing reviews of companies or products
  – each text is short and is associated with a 1-to-5-star rating
  – * ~ 8%; ** ~ 2%; *** ~ 3%; **** ~ 20%; ***** ~ 67%
• Each document is analysed to separate the commentary/review from the rest of the document and to associate a class with each review
• After this, the documents are processed with GATE processing resources:
  – tokenisation; sentence identification; part-of-speech tagging; morphological analysis; named entity recognition; and sentence parsing

• Support Vector Machines (SVM) are effective, widely used classification algorithms and have also been used in information extraction
• Learning in SVM is treated as a binary classification problem; a multiclass problem is transformed into a set of n binary classification problems
• Given a set of training examples, each represented as a vector in a feature space, SVM tries to find a hyperplane that separates positive from negative instances
• Given a new instance, SVM identifies on which side of the hyperplane the instance lies and produces the classification accordingly
• The distance from the hyperplane to the closest positive and negative instances is the margin; we use the SVM with uneven margins available in GATE
• To use them, we need to specify how instances are represented and decide on a number of parameters, usually adjusted experimentally over training data

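The classification step described above can be sketched in plain Python: a linear model scores an instance by which side of the hyperplane it falls on, and a multiclass problem is handled with one binary model per class. This is an illustration only. It does not train anything, it does not model the uneven margins, and it is not the GATE learning API. The function names, weights and biases are invented for the example.

```python
def decision_value(weights, bias, features):
    """Signed score w.x + b for a sparse feature dictionary."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items()) + bias


def classify_binary(weights, bias, features):
    """The side of the hyperplane decides the class."""
    return "positive" if decision_value(weights, bias, features) >= 0 else "negative"


def classify_one_vs_rest(models, features):
    """models maps each label to its own (weights, bias) binary model.

    The label whose model gives the largest score wins.
    """
    return max(models, key=lambda label: decision_value(*models[label], features))


# Hypothetical hand-set models; a real SVM would learn these from training data.
models = {
    "1*": ({"worst": 1.5, "avoid": 1.2, "excellent": -1.0}, -0.5),
    "5*": ({"excellent": 1.4, "great": 1.1, "worst": -1.0}, -0.5),
}

praise = {"excellent": 1.0, "great": 1.0}
complaint = {"worst": 1.0, "avoid": 1.0}

print(classify_one_vs_rest(models, praise))
print(classify_one_vs_rest(models, complaint))
print(classify_binary(*models["5*"], complaint))
```
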
• We decided to start by investigating a very simple word-based (bag-of-words) approach, which usually works very well in text classification. For each word we consider:
  – the original word
  – the root or lemma of the word (for "running" we use "run")
  – the part-of-speech category of the word (determiner, noun, verb, etc.)
  – the orthography of the word (all uppercase, lowercase, etc.)
• Each sentence/text is represented as a vector of features and values
  – we tried different combinations of features (different n-grams)
  – 10-fold cross-validation experiments were run over the corpus with binary classification (up/down)
  – the combination of root and orthography (unigram) provides the best classifier: around 80% F-score
  – higher-order n-grams decrease the performance of the classifier
  – using more features does not necessarily improve performance
  – an uninformed classifier (always predicting the majority class, negative) would have 67% accuracy

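A unigram root + orthography representation of a text could look roughly like the sketch below. It assumes each token already carries its root, as a morphological analyser would supply; here the roots are typed in by hand. The feature naming (`root=…`, `orth=…`), the orthography categories, and the choice to count root and orthography as separate features are illustrative assumptions, not the exact configuration used in the experiments.

```python
from collections import Counter


def orthography(token):
    """Coarse orthographic category of a token string."""
    if not any(ch.isalpha() for ch in token):
        return "other"          # punctuation, digits, symbols
    if token.isupper():
        return "allCaps"
    if token.islower():
        return "lowercase"
    if token[0].isupper() and token[1:].islower():
        return "upperInitial"
    return "mixedCaps"


def unigram_features(tokens):
    """Turn (string, root) pairs into a sparse bag of feature counts."""
    features = Counter()
    for string, root in tokens:
        features["root=" + root] += 1
        features["orth=" + orthography(string)] += 1
    return features


# Hand-annotated tokens for the 1-star example "STAY AWAY FROM THIS SUPPLIER!"
tokens = [
    ("STAY", "stay"), ("AWAY", "away"), ("FROM", "from"),
    ("THIS", "this"), ("SUPPLIER", "supplier"), ("!", "!"),
]

features = unigram_features(tokens)
for name, count in sorted(features.items()):
    print(name, count)
```

Word order is discarded, which is what makes this a bag-of-words representation; an n-gram variant would add features built from adjacent tokens.
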
• The same learning system was used to produce the 5-star classification over the fine-grained dataset
• The same feature combinations were studied:
  – 74% overall classification accuracy using word root only
  – other combinations degrade performance
  – 1* classification accuracy = 80%; 5* classification accuracy = 75%
  – 2* = 2%; 3* = 3%; 4* = 19%
  – 2*, 3*, 4* are difficult to classify because they either share vocabulary with the extreme cases or are vague

• word-based binary classification
  – thumbs-down: !, not, that, will, …
  – thumbs-up: excellent, good, www, com, site, …
• word-based fine-grained classification
  – 1*: worst, not, cancelled, avoid, …
  – 2*: shirt, ball, waited, …
  – 3*: another, didn't, improve, fine, wrong, …
  – 4*: ok, test, wasn't, but, however, …
  – 5*: very, excellent, future, experience, always, great, …