
6/11/2020

Collecting and Analyzing Twitter Data
Best Practices
Ramon Villa-Cox
rvillaco@andrew.cmu.edu
The CASOS Center, School of Computer Science, Carnegie Mellon
Summer Institute 2020
June 2020, CASOS Summer Institute 2020

Collecting Data on the Web in General
• What platform should I use?
• Should I collect everything?
• How much should I pay?
• Is my collection method ethical?
• Can I share this data?
• Real-time vs. Historical
• API vs. Scraping

Why Twitter?
• A popular social website: more users, more data
• Various ways to collect data, depending on your research purpose
• Easy to collect, though there are certain limitations on sharing the data

Ways to Collect Twitter Data
• Real-time?
  – Yes: Streaming API
    • Following users
    • Following keywords
    • Following locations (geo-bounding boxes)
    • Sampling tweets without filters
  – No: Search by users (certain rate limits)
    • Get follower ids
    • Get followee ids
    • Get user timeline

What format is my data in?
• JSON!
• Related question: what is it?
• JSON is a simple format for sharing unstructured data
• Typically: one JSON “object” per tweet/line of file

Tweets to meta-networks

Twitter JSON Structure
• text
• coordinates
• created_at
• favorite_count
• favorited
• id
• lang
• user (another JSON object)
• …
Full list of fields at: https://dev.twitter.com/overview/api/tweets

Networks
• User x User
  – Mention
  – Following
  – Retweet
• Hashtag Graphs
  – Co-occurrence
  – Bipartite graph: user x hashtag
• Node attributes
  – Profile features: following count, creation date, …
  – Language patterns, geo coordinates, etc.
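
As a rough illustration of going from one-tweet-per-line JSON to the networks listed above, the following sketch builds a user x user mention edge list and hashtag co-occurrence counts. It assumes the tweets follow the standard layout in which mentions and hashtags sit under an `entities` object and the author's handle is `user.screen_name`; the function name `tweet_networks` and the sample tweets are made up for this example.

```python
import json
from collections import Counter
from itertools import combinations


def tweet_networks(lines):
    """Build mention edges and hashtag co-occurrence counts from JSON lines."""
    mention_edges = Counter()  # (author, mentioned user) -> count
    hashtag_pairs = Counter()  # (hashtag_a, hashtag_b) -> count, sorted pair
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between tweets
        tweet = json.loads(line)
        author = tweet["user"]["screen_name"]
        # Assumption: mentions and hashtags are stored under "entities".
        entities = tweet.get("entities", {})
        for mention in entities.get("user_mentions", []):
            mention_edges[(author, mention["screen_name"])] += 1
        # Lowercase and de-duplicate hashtags before pairing them up.
        tags = sorted({h["text"].lower() for h in entities.get("hashtags", [])})
        for pair in combinations(tags, 2):
            hashtag_pairs[pair] += 1
    return mention_edges, hashtag_pairs


# Two invented tweets, one JSON object per line.
sample_lines = [
    '{"id": 1, "text": "@bob hi #a #b", "user": {"screen_name": "alice"}, '
    '"entities": {"user_mentions": [{"screen_name": "bob"}], '
    '"hashtags": [{"text": "A"}, {"text": "b"}]}}',
    '{"id": 2, "text": "@alice @bob #b", "user": {"screen_name": "carol"}, '
    '"entities": {"user_mentions": [{"screen_name": "alice"}, '
    '{"screen_name": "bob"}], "hashtags": [{"text": "b"}]}}',
]
mentions, cooccurrence = tweet_networks(sample_lines)
```

The two counters can be loaded as weighted edge lists into whatever network tool you prefer; a user x hashtag bipartite graph could be built the same way by pairing each author with each hashtag.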

How to do it?
• Option 1: Use a commercial data collection service
• Option 2: Get the ASU team to do it (TweetTracker)
• Option 3: Do it yourself!
  – What you’ll need:
    • API credentials (https://apps.twitter.com/)
    • A programming language you’re comfortable with
      – R: the twitteR package
      – Python: tweepy is the most popular tool
      – Java: Hosebird is Twitter’s own tool for connecting to the Streaming API

Common approaches
• Track all tweets within the U.S. for 6 months
• Follow 1000 users I think are interesting for 6 months, do a network analysis
• Follow #coronavirus for 6 months, do a network analysis
• …

Common practice 1
1. Hook into the Streaming API with keywords and/or a bounding box for a bit
2. Find users that are “interesting”
3. Use the Search API to collect all of these users’ data
4. Try to get rid of bots, celebrities, etc.
Pros: Relatively easy, fast.
Cons: Results are limited to the streaming keywords/locations. The resulting mention/retweet networks are usually sparse.

Common practice 2: snowball sampling
1. Start with a set of seed users of interest
2. Collect timelines for these users
3. Find new users within a one-step connection (mentioning, following, retweeting)
4. Repeat from step 1, using the new users as seeds
Pros: Gets comprehensive social links for a group of users.
Cons: Time-consuming; relies on the choice of seed users.
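
The snowball sampling loop can be sketched in a few lines. This is only an outline under stated assumptions: `get_neighbors` is a hypothetical callback standing in for "collect this user's timeline and return the users they mention, follow, or retweet", and `rounds` is an illustrative cap on how many times the expansion repeats. No Twitter API calls are made here.

```python
def snowball_sample(seeds, get_neighbors, rounds=2):
    """Expand a seed set by repeatedly adding one-step connections."""
    collected = set(seeds)
    frontier = set(seeds)
    for _ in range(rounds):
        new_users = set()
        for user in frontier:
            # Hypothetical callback: in practice this would read the user's
            # timeline and return mentioned, retweeted, or followed users.
            new_users.update(get_neighbors(user))
        frontier = new_users - collected  # keep only users not seen before
        if not frontier:
            break  # nothing new to expand
        collected |= frontier
    return collected


# Toy link structure standing in for real timeline data.
toy_links = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": ["e"]}
result = snowball_sample(["a"], lambda u: toy_links.get(u, []), rounds=2)
```

Capping the number of rounds matters in practice: each round multiplies the number of timelines to fetch, which is where the rate limits and the "time-consuming" drawback come from.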

Demo
– Step 1: Go to https://apps.twitter.com/ and apply for a developer account. The process can take some days to complete.
– Step 2: Install tweepy for Python:
    pip install tweepy --user
  Or (if you use Anaconda as a package manager):
    conda install -c conda-forge tweepy
– Step 3: Fill in the access token and filtering criteria in stream.py. The code takes a list of strings (queries). The elements of the list are combined as an OR query; the words within one element form an AND query.
– Step 4: Run stream.py:
    python stream.py
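
To make the OR/AND semantics of the query list in Step 3 concrete, here is a small local illustration. It is not the code in stream.py: the function `matches` is a made-up helper, and it assumes simple whitespace tokenization, whereas the real Streaming API matching handles punctuation and other cases differently.

```python
def matches(text, queries):
    """True if the text satisfies any query (OR); a query is satisfied
    only when every one of its words appears in the text (AND)."""
    words = set(text.lower().split())
    return any(
        all(term.lower() in words for term in query.split())
        for query in queries
        if query.split()  # ignore empty query strings
    )


# Illustrative query list: ("covid" AND "vaccine") OR "#coronavirus".
queries = ["covid vaccine", "#coronavirus"]
both_words = matches("New COVID vaccine trial", queries)
one_word_only = matches("covid cases rising", queries)
hashtag_hit = matches("stay safe #coronavirus", queries)
```

Here `both_words` and `hashtag_hit` are true, while `one_word_only` is false because "vaccine" is missing from the text.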
