(Simple) Text Analysis using Twitter Data
Bayu Distiawan Trisedya
Slide Acknowledgement: Alfan Farizki W
Information Retrieval Lab., Faculty of Computer Science, University of Indonesia
Big Data Workshop, Bank Indonesia, Surabaya, Indonesia, 10/16/2015
We will do a fun programming task. If you're not a programmer, don't worry!
Crawling Tweets & Simple Processing
1. Getting Twitter API
• Create a Twitter account if you do not already have one.
• Go to https://apps.twitter.com/ and log in with your Twitter credentials.
• Click "Create New App".
• Fill out the form, agree to the terms, and click "Create your Twitter application".
• On the next page, click the "API keys" tab and copy your "API key" and "API secret".
• Scroll down, click "Create my access token", and copy your "Access token" and "Access token secret".
2. Install Python (and libraries)
• Option A: install Anaconda, then install Tweepy
• Option B: install Python, install the libraries (Tweepy, pandas, matplotlib, numpy), and set the environment variables
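A quick way to verify that the installation worked (this check is not part of the original slides, just a convenience) is to import each library and print its version:

# check_install.py - sanity check that the required libraries are importable
import tweepy
import pandas
import matplotlib
import numpy

for name, module in [("tweepy", tweepy), ("pandas", pandas),
                     ("matplotlib", matplotlib), ("numpy", numpy)]:
    print(name + " " + module.__version__)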
3. Creating crawler
• Open the tweepy installation folder and find the streaming example:
• \installation_dir\tweepy\examples\streaming.py
3. Creating crawler (2)
• Copy your key & token from the Twitter API into the code:

# Go to http://apps.twitter.com and create an app.
# The consumer key and secret will be generated for you after
consumer_key = ""
consumer_secret = ""

# After the step above, you will be redirected to your app's page.
# Create an access token under the "Your access token" section
access_token = ""
access_token_secret = ""
3. Creating crawler (3)

class StdOutListener(StreamListener):

    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)

• This part is a listener that prints every received tweet to standard output, for example the command line.
3. Creating crawler (4)

stream.filter(track=['basketball'])

• This is where we put the keywords for the tweets we are interested in. With the example above, we will receive tweets that contain "basketball".
• You can also track multiple keywords, for example:
• stream.filter(track=['jokowi', 'prabowo'])
• We do not cover techniques for refining the keywords to get better results. A complete crawler script is sketched below.
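Putting the pieces from these slides together, a minimal crawler.py could look like the following sketch, based on the tweepy streaming example as it existed around 2015 (the old StreamListener API); fill in the empty strings with your own credentials:

# crawler.py - minimal streaming crawler (sketch, old tweepy StreamListener API)
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

class StdOutListener(StreamListener):

    def on_data(self, data):
        print(data)          # each tweet arrives as one JSON string per line
        return True

    def on_error(self, status):
        print(status)        # e.g. 420 means the stream is being rate limited
        return False

if __name__ == '__main__':
    listener = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, listener)
    stream.filter(track=['jokowi', 'prabowo'])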
3. Creating crawler (5)
• Run the crawler:
• Open your command line or terminal
• Change the active directory to the folder that contains the crawler file
• Command: python crawler.py
• See what happens
• Then change the command to: python crawler.py > output.json
4. Preparing corpus (1)
• Filter the tweet stream and pick the attribute we want to analyze.
• In this example we only want to analyze the text of the tweets.
4. Preparing corpus (2)

import json
fo = open('file_path\output.json', 'r')
fw = open('file_path\corpus.txt', 'a')

• Create a new Python file (transform2.py)
• Import json, because we want to read the JSON format
• fo -> reads the file the crawler produced
• fw -> creates the new corpus file
4. Preparing corpus (3)

for line in fo:
    try:
        tweet = json.loads(line)
        fw.write(tweet['text'] + "\n")
    except:
        continue

• Read every line in fo
• Write the tweet text to fw
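For reference, the whole transform2.py from the two previous slides, with the files closed at the end, might look like this sketch (the paths here are placeholders):

# transform2.py - keep only the tweet text from the raw JSON stream (sketch)
import json

fo = open('output.json', 'r')    # file produced by the crawler
fw = open('corpus.txt', 'a')     # one tweet text per line

for line in fo:
    try:
        tweet = json.loads(line)
        # On Python 2 you may need tweet['text'].encode('utf-8') here.
        fw.write(tweet['text'] + "\n")
    except:
        # skip lines that are not valid JSON (e.g. keep-alive newlines)
        continue

fo.close()
fw.close()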
5. Simple Analysis (1)

import json
import pandas as pd
import matplotlib.pyplot as plt
import re

• Count how many tweets contain "jokowi" and "prabowo"
• Create a new Python file (simple_analysis.py)
• Import the libraries needed
5. Simple Analysis (2)

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False

• Create a function to check whether the word is contained in a text or not
5. Simple Analysis (3)

corpus_path = 'corpus.txt'
tweets = []
corpus_file = open(corpus_path, "r")
for line in corpus_file:
    tweets.append(line)
print("Tweets count: " + str(len(tweets)))

• Read the corpus file line by line and store each line in a list
• Count how many tweets it contains
5. Simple Analysis (4)

tweets_frame = pd.DataFrame()
tweets_frame['jokowi'] = list(map(lambda tweet: word_in_text('jokowi', tweet), tweets))
tweets_frame['prabowo'] = list(map(lambda tweet: word_in_text('prabowo', tweet), tweets))
print(tweets_frame['jokowi'].value_counts()[True])
print(tweets_frame['prabowo'].value_counts()[True])

• Calculate how many tweets contain "jokowi" and "prabowo"
• Print the result
5. Simple Analysis (5)

candidates = ['jokowi', 'prabowo']
tweets_candidates = [tweets_frame['jokowi'].value_counts()[True],
                     tweets_frame['prabowo'].value_counts()[True]]

x_pos = list(range(len(candidates)))
width = 0.8
fig, ax = plt.subplots()
plt.bar(x_pos, tweets_candidates, width, alpha=1, color='g')

# Set axis labels and ticks
ax.set_ylabel('Number of tweets', fontsize=15)
ax.set_title('Jokowi vs. Prabowo', fontsize=10, fontweight='bold')
ax.set_xticks([p + 0.4 * width for p in x_pos])
ax.set_xticklabels(candidates)
plt.grid()
plt.show()

• Show the result as a bar chart
Simple Political Sentiment Analysis on Tweets
The Data
• We collected tweets on June 9th, 2014, when the first presidential election debate was held.
• The data consists of 2,456,465 tweets, crawled over roughly 24 hours.
• You can find this data in "debatcapres_2014_sesi1.txt". Please do not open it directly in your text editor!
The Data
• If you want to peek at the data, you can use the following Python code:

dataFile = open("debatcapres_2014_sesi1.txt", "r")
lines = 10
for i in range(lines):
    print(dataFile.readline())
dataFile.close()
Our Task
• Sentiment analysis pipeline: Raw Tweets -> Preprocessed Tweets -> split into "jokowi" and "prabowo" subsets -> positive and negative tweets for each candidate.
Steps
• Preprocessing our corpus
• Splitting our corpus: "jokowi" and "prabowo"
• Simple Sentiment Analysis
• N-gram Frequency Analysis: top 100 unigrams and top 100 bigrams (a counting sketch follows this list)
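The n-gram step is not detailed in this section of the slides; a minimal counting sketch using collections.Counter, assuming a one-tweet-per-line input file such as the hypothetical corpus_jokowi.txt produced by the splitting step, could be:

# ngram_freq.py - most frequent unigrams and bigrams (sketch)
from collections import Counter

unigrams = Counter()
bigrams = Counter()

with open('corpus_jokowi.txt', 'r') as f:
    for line in f:
        tokens = line.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

print(unigrams.most_common(100))
print(bigrams.most_common(100))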
Preprocessing
• Before we analyze the text, we need to clean and normalize our raw data.
• We usually do the following steps (but are not limited to them):
• Normalization
• Stop Word Removal
• Stemming
Preprocessing: Normalization
• We transform our text data into a standard form.
• Example (informal Indonesian): "yawdh gw yg akan pergi ksn" -> "Ya sudah, saya yang akan pergi ke sana" (roughly, "Alright then, I am the one who will go there")
Preprocessing: Normalization
• We leverage a special dictionary.
• Resource: singkatan.dic
• Code: normalizer.py
• Example entries (informal -> standard): aje -> saja, ajh -> saja, ak -> aku, alesan -> alasan, ancur -> hancur, ane -> saya, anget -> hangat, ank -> anak, apah -> apa, apo -> apa, aq -> aku, asek -> asik, ati2 -> hati-hati, atit -> sakit
Preprocessing: Normalization
• Let's try it from the command prompt!

> python
>>> from normalizer import Normalizer
>>> norm = Normalizer()
>>> norm.normalize("yawdh gw yg akan pergi ksn")
'ya sudah saya yang akan pergi ke sana'
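normalizer.py is supplied with the workshop materials; as a rough idea of what a dictionary-based normalizer does, here is a minimal sketch (it assumes singkatan.dic stores one "informal standard-form" pair per line, which is an assumption about the file format, not a confirmed detail):

# normalizer.py - dictionary-based normalization (sketch, not the workshop's exact code)
class Normalizer(object):

    def __init__(self, dict_path='singkatan.dic'):
        # Assumed format: informal word, whitespace, standard form (possibly several words).
        self.mapping = {}
        with open(dict_path, 'r') as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 2:
                    self.mapping[parts[0]] = ' '.join(parts[1:])

    def normalize(self, text):
        # Replace every known informal token with its standard form.
        tokens = text.lower().split()
        return ' '.join(self.mapping.get(tok, tok) for tok in tokens)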
Preprocessing: Stop Word Removal
• Stop words: the most common words; they usually have little value.
• Reference: http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html
Preprocessing: Stop Word Removal
• Resource: twitter_stp.dic
• Code: stpremoval.py

> python
>>> from stpremoval import StpRemoval
>>> st = StpRemoval()
>>> st.removeStp("budi dan rani pergi ke bandung")
'budi rani pergi bandung'
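stpremoval.py is likewise a provided resource; a minimal sketch of what it does, assuming twitter_stp.dic lists one stop word per line (an assumed format), might be:

# stpremoval.py - simple stop word removal (sketch, not the workshop's exact code)
class StpRemoval(object):

    def __init__(self, dict_path='twitter_stp.dic'):
        # Assumed format: one stop word per line.
        with open(dict_path, 'r') as f:
            self.stopwords = set(line.strip().lower() for line in f if line.strip())

    def removeStp(self, text):
        # Drop every token that appears in the stop word list.
        tokens = text.lower().split()
        return ' '.join(tok for tok in tokens if tok not in self.stopwords)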
Preprocessing
• Let's combine Normalization and Stop Word Removal over our whole corpus: Raw Tweets -> Preprocessed Tweets.
• You just need to run preprocesscorp.py! A sketch of what that script does is shown below.
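preprocesscorp.py is part of the workshop materials; conceptually it just applies both preprocessing steps to every line of the corpus, roughly like this sketch (the input and output file names here are assumptions):

# preprocesscorp.py - normalize and remove stop words from the whole corpus (sketch)
from normalizer import Normalizer
from stpremoval import StpRemoval

norm = Normalizer()
stp = StpRemoval()

fin = open('debatcapres_2014_sesi1.txt', 'r')
fout = open('preprocessed.txt', 'w')

for line in fin:
    cleaned = stp.removeStp(norm.normalize(line.strip()))
    fout.write(cleaned + "\n")

fin.close()
fout.close()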
Splitting Corpus
• Simple approach: suppose we want to find the tweets that mention "prabowo".
• Idea:
  for each tweet in the preprocessed corpus:
      if tweet contains "prabowo" then print tweet
• Use select.py for "prabowo" and "jokowi"! A sketch of such a script follows.
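select.py is provided with the materials; in spirit it is just the loop above. A sketch, with assumed file names, taking the keyword as a command-line argument:

# select.py - keep only the tweets that mention a given keyword (sketch)
import sys

keyword = sys.argv[1] if len(sys.argv) > 1 else 'prabowo'

fin = open('preprocessed.txt', 'r')
fout = open('corpus_' + keyword + '.txt', 'w')

for tweet in fin:
    if keyword in tweet.lower():
        fout.write(tweet)

fin.close()
fout.close()

Running, for example, python select.py jokowi and python select.py prabowo would then produce the two candidate-specific subsets.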
Sentiment Analysis
• One of the tasks: Polarity Classification
• Positive tweets:
  "@IMKristenBell I absolutely love the samsung commercials with you and Dax xD so cute and funny. ♡♡ Hope you have nice week :)"
  "I love my iPhone 6! :D The case on it makes it look so nice!"
• Negative tweets:
  "#samsung chargers are just as bad as #apple ones. holy hell. >:("
  "snapchat looks so bad on the new iphone :-("
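This section of the slides does not show the classifier itself. One very simple approach, sketched here under the assumption that you have small word lists for each polarity (the files positive.dic and negative.dic are hypothetical, one word per line), is to count sentiment words in each tweet:

# simple_sentiment.py - lexicon-based polarity classification (sketch, not the workshop's code)
def load_words(path):
    with open(path, 'r') as f:
        return set(line.strip().lower() for line in f if line.strip())

positive_words = load_words('positive.dic')   # hypothetical resource files
negative_words = load_words('negative.dic')

def polarity(tweet):
    tokens = tweet.lower().split()
    score = sum(1 for t in tokens if t in positive_words)
    score -= sum(1 for t in tokens if t in negative_words)
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'

for line in open('corpus_jokowi.txt', 'r'):
    print(polarity(line) + "\t" + line.strip())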