Mining Sentiment Classification from Political Web Logs
Kathleen Durant
WebKDD '06
August 20, 2006
WebKDD 2006 Workshop on Knowledge Discovery on the Web, Aug. 20, 2006, at KDD 2006, Philadelphia, PA, USA

Explosion of News and Opinions on the Web
• Substantial growth of people accessing the Internet for news
  – 3% in 1995, 20% in 2004
• Growth of web logs on the Web
  – 100,000 in 2002 to 4.8 million in 2004
• Growth in people reading web logs
  – 2004 saw a 58% increase in readers of web logs

Sentiment Topic View of the Blog Space
• Web logs provide readily available opinions on a myriad of topics
• Sentiment classification separates opinions into two opposing camps
• Take advantage of opinions and tools to build a custom view of blog space by topic and opinion

Questions Investigated
• Can existing machine learning techniques be successfully applied?
• Which techniques work well?
  – Naïve Bayes, Support Vector Machines
• What's the effect of unbalanced class compositions on results?
  – Different camps write at different rates on particular topics

Research Statement
• Apply sentiment classification to political web log posts
  – Topic-specific corpus
    • George W. Bush and the Iraq War
  – Domain specific
    • Political web log posts
    • Judge: Joe Gandelman
      – Classified over 250 web logs
• Classify web log posts according to our judge's sentiment class
  – Right-voice
  – Left-voice

Segmentation of Data
• Data segmented by the month
  – Leading to 25 different models
  – Small enough to limit the events discussed
  – Large enough to generate enough posts on topic
[Figure: timeline of monthly segments, Mar 03 through Mar 05]
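
The monthly segmentation can be sketched as follows. This is a minimal illustration, not the author's pipeline: the function names and the sample posts are hypothetical, and the March 2003 through March 2005 range is read off the timeline labels on the slide.

```python
from collections import defaultdict
from datetime import date


def month_keys(start, end):
    """Return every (year, month) pair from start to end, inclusive."""
    keys = []
    year, month = start
    while (year, month) <= end:
        keys.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


def segment_by_month(posts):
    """Group (date, text) pairs into per-month lists of post texts."""
    segments = defaultdict(list)
    for posted_on, text in posts:
        segments[(posted_on.year, posted_on.month)].append(text)
    return dict(segments)


# Hypothetical posts, for illustration only.
posts = [
    (date(2003, 3, 20), "first example post"),
    (date(2003, 3, 28), "second example post"),
    (date(2003, 4, 2), "third example post"),
]
segments = segment_by_month(posts)

# One model is trained per monthly segment; this range yields 25 segments.
months = month_keys((2003, 3), (2005, 3))
```

Counting the keys in `months` gives 25, which matches the number of models stated on the slide.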

Dataset Representation via the Vector Space Model
• Feature set: terms occurring at least 5 times within the month's corpus
  – Unigrams with polarity of environment
    • Differentiate between "not support" and "support"
  – Bag-of-words framework
    • Order not important: "Bush is" = "Is Bush"
  – Presence vectors
    • Given n features, the post is represented as an n-dimensional vector
      – 0: feature not present in post
      – 1: feature is present
      – Example: {0,1,1,1,0} has 5 features; features 1 and 5 are not present, features 2, 3 and 4 are.
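
The presence-vector representation above can be sketched in a few lines. This is an illustration under stated assumptions: tokenization is plain lowercase whitespace splitting, the polarity tagging of unigrams ("not support" vs. "support") is not implemented, and the feature names and sample post are hypothetical.

```python
from collections import Counter


def build_feature_set(posts, min_count=5):
    """Keep terms occurring at least min_count times in the month's posts."""
    counts = Counter(term for post in posts for term in post.lower().split())
    return sorted(term for term, n in counts.items() if n >= min_count)


def presence_vector(post, features):
    """Return a 0/1 vector: 1 if the feature occurs in the post, else 0."""
    terms = set(post.lower().split())
    return [1 if feature in terms else 0 for feature in features]


# Hypothetical five-term feature set, chosen to reproduce the slide's example.
features = ["troops", "bush", "is", "war", "senate"]
vector = presence_vector("Is Bush winning the war", features)

# Word order does not matter in a bag-of-words model.
same = presence_vector("Bush is", features) == presence_vector("Is Bush", features)
```

Here `vector` is `[0, 1, 1, 1, 0]`: features 1 and 5 are absent and features 2, 3 and 4 are present, as in the slide's example.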

Naïve Bayes Classification
• Choose the category with the maximum posterior probability
• Each class has a prior: a prior for the red class and a prior for the blue class
• Calculate the product of the probabilities for each term in a post
• Likelihood that term i appears in a class = total number of occurrences of the term in the class / total number of words in the class (the slide's example uses the red category)
• Posterior probability = prior * likelihood
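
A minimal sketch of this decision rule follows. It departs from the slide in two labeled ways: it sums log probabilities instead of multiplying raw probabilities (to avoid numeric underflow), and it applies add-one smoothing so that an unseen term does not zero out the product. The function names and training posts are hypothetical.

```python
import math
from collections import Counter


def train(posts_by_class):
    """posts_by_class maps a class label to a list of tokenized posts."""
    total_posts = sum(len(posts) for posts in posts_by_class.values())
    vocab = {t for posts in posts_by_class.values() for post in posts for t in post}
    model = {}
    for label, posts in posts_by_class.items():
        counts = Counter(t for post in posts for t in post)
        model[label] = {
            "log_prior": math.log(len(posts) / total_posts),
            "counts": counts,
            "total": sum(counts.values()),
        }
    return model, vocab


def log_posterior(model, vocab, label, post):
    """Log of prior * likelihood for one class (unnormalized posterior)."""
    entry = model[label]
    score = entry["log_prior"]
    for term in post:
        if term in vocab:
            # Assumption: add-one smoothing, not stated on the slide.
            numerator = entry["counts"][term] + 1
            denominator = entry["total"] + len(vocab)
            score += math.log(numerator / denominator)
    return score


def classify(model, vocab, post):
    """Choose the class with the maximum posterior probability."""
    return max(model, key=lambda label: log_posterior(model, vocab, label, post))


# Hypothetical training posts, for illustration only.
training = {
    "left": [["war", "wrong"], ["war", "mistake"]],
    "right": [["support", "troops"], ["support", "war"]],
}
model, vocab = train(training)
prediction = classify(model, vocab, ["support", "troops"])
```

With this toy data, `prediction` is `"right"`, because "support" and "troops" occur only in the right-voice training posts.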

Support Vector Machines
[Figure: separating hyperplane with margin]
• Decision boundary: Wx + b = 0
• Margin boundaries: Wx + b = -1 and Wx + b = 1
• Class regions: Wx + b < 0 and Wx + b > 0
• Margin = 2/|W|
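
The quantities on this slide can be computed directly once W and b are known. The sketch below assumes a linear SVM that has already been trained; the weight vector and bias are hypothetical values, and no training is shown.

```python
import math


def decision_value(w, b, x):
    """Compute Wx + b for a single input vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 when Wx + b > 0, otherwise -1."""
    return 1 if decision_value(w, b, x) > 0 else -1


def margin_width(w):
    """Distance between the planes Wx + b = -1 and Wx + b = 1, i.e. 2/|W|."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical trained parameters, for illustration only.
w = [3.0, 4.0]
b = -5.0

label_a = classify(w, b, [3.0, 4.0])  # Wx + b = 20, positive side
label_b = classify(w, b, [0.0, 0.0])  # Wx + b = -5, negative side
width = margin_width(w)               # |W| = 5, so the margin is 0.4
```

A point lying exactly on the boundary (Wx + b = 0) is assigned to the negative class here; that tie-breaking choice is arbitrary.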

Web logs to Classifiers
[Figure: flow from web logs to classifiers. Labels: On Topic, Off Topic; Balanced, Unbalanced; Inflated, Small; SVM, Naïve Bayes (NB)]

Comparing Machine Learning Techniques
• Off-the-shelf machine learning techniques perform well
• Naïve Bayes significantly outperforms Support Vector Machines
  – 99.9% confidence level, CI [1.425, 3.489]
[Figure: predictability (y-axis, 30 to 100) by month, 2003-03 through 2005-03]

Class Composition found on the Web
• Imbalance in the class ratio
  – 14% of right-voice posts on topic
  – 24% of left-voice posts on topic
[Figure: percentage (y-axis, 0 to 100) by month, 2003-03 through 2005-03; series: Right voice, Left voice]