
TERM PROJECT: Classifying Tweets Using Naïve Bayes Classifier, CSC 177

TERM PROJECT: Classifying Tweets Using Naïve Bayes Classifier. CSC 177, Spring 2020. Andrew Flores, Hera Flores. Agenda: Demo, Motivation, Background Knowledge, Lessons Learned, Scope of the Project, Future Work, Approach.


  1. TERM PROJECT Classifying Tweets Using Naïve Bayes Classifier CSC 177 – Spring 2020 Andrew Flores, Hera Flores

  2. Agenda  Demo  Motivation  Background Knowledge  Lessons Learned  Scope of the Project  Future work  Approach  References  Acknowledgment  Implementation Details

  3. Extensive amounts of data  IBM has estimated that 80% of the world's data is unstructured. [7]  Every day, roughly 2.5 billion GB of new data is created.  In particular, Twitter generates roughly 12 TB of data every day. [9]

  4. Knowledge is power (1)  All of this unstructured data is a goldmine of knowledge.

  5. Knowledge is power (2)  The data is out there. We just have to mine it.

  6. Background Knowledge  Sentiment analysis [6]

  7. Why sentiment analysis matters  Many companies apply sentiment analysis techniques to social media to understand:  Service performance  Investor sentiment  Public opinion  The following slides show examples from Sentdex, an online algorithm that indexes financial, political, and geographical sentiment. [10]

  8. EBAY [10]

  9. Charles Schwab Corp [10]

  10. Public Enterprise Group Inc [10]

  11. Scope of the Project  We downloaded a dataset from Kaggle consisting of roughly 14,000 tweets. [11]  Our goal was to analyze, transform, and classify this data.  We ran multiple trials, adjusting parameters to find the best-performing configuration.

  12. Naïve Bayes Classifier  The Naïve Bayes classifier is a supervised machine learning algorithm based on statistical methods that originate with the mathematician Thomas Bayes.

  13. Naïve Bayes properties  Principle of the Naïve Bayes classifier  Probabilistic machine learning model  Based on Bayes' theorem [4]
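For reference, Bayes' theorem and the decision rule it leads to can be written as below. The notation is an illustrative assumption and does not come from the slides: $c$ is a sentiment class and $w_1, \dots, w_n$ are the words of a tweet. The "naïve" step is the second line, which treats words as conditionally independent given the class.

```latex
\begin{align}
P(c \mid w_1, \dots, w_n) &= \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)} \\
P(w_1, \dots, w_n \mid c) &\approx \prod_{i=1}^{n} P(w_i \mid c) \\
\hat{c} &= \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
\end{align}
```

The denominator is the same for every class, so it can be dropped when only the most probable class is needed.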

  14. The problem  It is difficult to analyze sentiment with traditional surveys, which are inefficient in terms of time and effort and can also be error-prone.  Airline companies cannot sift through the thousands of social media posts available at any given time, even though those posts might hold valuable information about service performance or critical concerns.

  15. The solution  We'll build a sentiment text classifier that puts airline-related tweets into one of two categories: negative or positive sentiment. [1]
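The slides do not show the classifier's code. As a rough illustration of the idea, here is a from-scratch sketch of a multinomial Naïve Bayes classifier with add-one smoothing. The class name `NaiveBayesSketch`, the toy training tweets, and the choice of smoothing are illustrative assumptions, not the project's implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative only)."""

    def fit(self, documents, labels):
        # documents: list of token lists; labels: list of class names
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior P(c), then add log P(w | c) per word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # skip words never seen during training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data (not taken from the Kaggle dataset).
train_docs = [
    ["great", "flight", "thanks"],
    ["love", "the", "crew"],
    ["flight", "delayed", "again"],
    ["lost", "my", "bag"],
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesSketch().fit(train_docs, train_labels)
prediction = model.predict(["delayed", "bag"])
print(prediction)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.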

  16. Methods of Implementation (1)  Data from Kaggle  Anaconda to run Python code  Jupyter Notebook for exploratory data analysis (EDA)  Naïve Bayes classifier in Jupyter Notebook and Spyder
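A preprocessing step along these lines might look like the sketch below. It assumes the CSV has `airline_sentiment` and `text` columns, drops neutral tweets to match the two-class goal, and uses a tiny hypothetical stop-word list in place of the fuller NLTK list mentioned in reference [8]. The function names are illustrative.

```python
import csv
import io
import re

# Hypothetical stop-word list for illustration; NLTK provides a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "my", "i", "was"}


def tokenize(text):
    """Lowercase, strip @mentions and URLs, keep word tokens, drop stop words."""
    text = re.sub(r"@\w+|https?://\S+", " ", text.lower())
    return [w for w in re.findall(r"[a-z']+", text) if w not in STOP_WORDS]


def load_labelled_tweets(handle):
    """Yield (tokens, label) pairs, skipping neutral tweets."""
    for row in csv.DictReader(handle):
        # Column names are assumed; adjust them to the actual file.
        if row["airline_sentiment"] in ("positive", "negative"):
            yield tokenize(row["text"]), row["airline_sentiment"]


# Made-up sample rows standing in for the real CSV file.
sample_csv = io.StringIO(
    "airline_sentiment,text\n"
    "positive,@VirginAmerica the crew was great!\n"
    "neutral,@united what is my gate?\n"
    "negative,@united my flight is delayed again http://t.co/x\n"
)
rows = list(load_labelled_tweets(sample_csv))
print(rows)
```

To read a real file, replace `sample_csv` with an open file handle, for example `open("tweet.csv", newline="", encoding="utf-8")`.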

  17. Methods of Implementation (2)

  18. Let’s visualize some data!

  19. Tweet Wordcloud  Consists of all tweets from our tweet.csv file

  20. Positive Tweet Wordcloud  Consists of the most positively associated words from the entire data set.

  21. Negative Tweet Wordcloud  Consists of the most negatively associated words from the entire data set.

  22. Overall Sentiment Distribution  Chart of tweet counts by sentiment. Values shown: 2363, 3099, 9178. Legend: Positive, Negative, Neutral.

  23. Distribution of Airlines  Chart of tweet counts by airline. Values shown: 504, 2759, 3822, 2913, 2420, 2222. Legend: Virgin America, United, Southwest, Delta, US Airways, American.

  24. Individual Airline Sentiment Distribution

  25. Classification Report (1)
                   Size  Accuracy  Precision  Recall  F1-score  Support  Time (sec)
      Input 1 avg  2000  0.88      0.88       0.88    0.87      400      12.76
      Input 2 avg  3000  0.91      0.91       0.91    0.91      600      15.40
      Input 3 avg  3600  0.92      0.92       0.92    0.92      720      24.55
      Input 4 avg  4726  0.92      0.93       0.92    0.92      946      30.53
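The columns of the report follow the standard definitions of accuracy, precision, recall, and F1-score. The sketch below computes them for a single class treated as the "positive" class. The labels are made up, the function name `binary_report` is hypothetical, and the "avg" rows in the slides are presumably averages across both classes, which this sketch does not reproduce.

```python
def binary_report(y_true, y_pred, positive_label="positive"):
    """Compute accuracy, precision, recall, and F1 for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    tn = sum(1 for t, p in pairs if t != positive_label and p != positive_label)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels, only to exercise the function.
y_true = ["positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]
report = binary_report(y_true, y_pred)
print(report)
```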

  26. Classification Report (2)  Classification chart: accuracy/recall, precision, and F1-score for Input 1 through Input 4.

  27. Classification Report (3)
      Input  Size  Time (sec)
      1      2000  12.76
      2      3000  15.40
      3      3600  24.55
      4      4726  30.53
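One way to read these timings is to divide each run time by its input size. The short sketch below does that arithmetic with the numbers from the table. The slides do not say whether the time covers training, prediction, or both, so the per-tweet figure is only a rough normalisation.

```python
# (size, seconds) pairs copied from the timing table.
trials = [(2000, 12.76), (3000, 15.40), (3600, 24.55), (4726, 30.53)]

# Milliseconds per tweet for each trial, rounded to two decimals.
ms_per_tweet = [round(1000 * seconds / size, 2) for size, seconds in trials]
print(ms_per_tweet)
```

The per-tweet cost stays within roughly 5 to 7 ms across all four inputs, which suggests the total time grows approximately in proportion to the input size.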

  28. Confusion Matrices

  29. DEMO TIME!!!!!

  30. Lessons Learned  How to implement text mining techniques using Python and its associated libraries.  Running multiple trials with different input sizes yields different results.  Run time increased as input size increased.  Accuracy increased with input size, from 0.88 at 2000 tweets to 0.92 at 3600 and 4726 tweets.

  31. Future Work  In the future we plan to:  Incorporate neutral tweets into the classification.  Compare the classifier's efficiency against other supervised ML algorithms.  Implement web scraping to collect new Twitter data sets.

  32. References
      [1] V. Valkov. "Movie review sentiment." Curiousily, www.curiousily.com/posts/movie-review-sentiment-analysis-with-naive-bayes/. Accessed 18 Apr. 2020.
      [2] K. DeGrave. "A Naive Bayes Tweet Classifier." Kaggle, www.kaggle.com/degravek/a-naive-bayes-tweet-classifier. Accessed 19 Apr. 2020.
      [3] N.K. Sharma, S. Rahamatkar, S. Sharma. "Classification of Airline Tweet using Naïve-Bayes classifier for Sentiment Analysis." IEEE Xplore. Accessed 5 Apr. 2020.
      [4] R. Gandhi. "Naive Bayes Classifier." TowardsDataScience, towardsdatascience.com/naive-bayes-classifier-81d512f50a7c. Accessed 27 Apr. 2020.
      [5] C. Masolo. "Sentiment analysis on US Twitter Airline dataset." TowardsDataScience, towardsdatascience.com/sentiment-analysis-on-us-twitter-airline-dataset-1-of-2-2417f204b971. Accessed 5 Apr. 2020.
      [6] MonkeyLearn. "Sentiment Analysis." monkeylearn.com/sentiment-analysis/. Accessed 5 Apr. 2020.
      [7] C. Schneider. "The biggest data challenges that you might not even know you have." IBM, www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/. Accessed 22 Apr. 2020.
      [8] P. Upadhyay. "Removing stop words with NLTK in Python." GeeksforGeeks, www.geeksforgeeks.org/removing-stop-words-nltk-python/. Accessed 15 Apr. 2020.
      [9] D. Gura. "All Those 140-Character Twitter Messages Amount To Petabytes Of Data Every Year." NPR, www.npr.org/sections/thetwo-way/2010/09/28/130199229/all-those-140-character-twitter-messages-yield-four-petrabytes-of-data-annually. Accessed 15 Apr. 2020.
      [10] H. Kinsley. "Sentiment analysis accuracy." Sentdex, sentdex.com/how-accurate-is-sentiment-analysis-for-stocks/. Accessed 15 Apr. 2020.
      [11] Kaggle. "Twitter US Airline Sentiment." www.kaggle.com/crowdflower/twitter-airline-sentiment. Accessed 15 Mar. 2020.
      [12] Y. Tsuruoka et al. "Highly Scalable Text Mining – Parallel Tagging Application." IEEE Xplore, Oxford Journals, 2004.

  33. Acknowledgments (1) Thank you to Thomas Bayes for laying down the theoretical groundwork for future engineers, scientists, and mathematicians.

  34. Acknowledgments (2) Thank you to Dr. Lu for providing us with a great learning environment and teaching us plenty about Data Warehousing/Data Mining.
