Opportunities and Challenges in using Big Data to Understand Human Systems



  1. Opportunities and Challenges in using Big Data to Understand Human Systems Ryan Kennedy University of Houston

  2. How Big is Big Data? • In the 3rd century BC, the Library of Alexandria was thought to contain the entire sum of human knowledge. • Today, there is enough digitally stored information to give every person alive 320 times as much information as we think was stored in that library.

  3. Opportunities in Big Data • Microtargeting • Real-time feedback • Information on hard-to-find groups • Data on social systems

  4. Big Data Hype • (Mayer-Schönberger and Cukier 2014): • Approaching “N = all” • Correlation is enough (no need for theory)

  5. Problems in Big Data • Big Data Hubris • Overfitting • Vulnerability to Artifacts • Non-ideal Users • Blue Team Issues • Red Team Issues

  6. Big Data Hubris • The belief that volume can solve all problems. • Don’t get me wrong, it can solve some (e.g. Xbox project). • But there are several things that have to be clear: • Sampling frame • Convenience samples are still convenience samples • Generalizability still has the standard limitations • Motivations for uses of technology not always clear • Behavioral analogue must be clear (and often requires small data)

  7. Can Twitter Predict Elections?

     | Party | Election Results | Share of Twitter Mentions (Original Study) | Share of Twitter Mentions (Replication) |
     |---|---|---|---|
     | Christian Democrats (CDU) | 28.4% | 30.1% | 18.6% |
     | Christian Social Union (CSU) | 6.8% | 5.6% | 3.0% |
     | Social Democratic Party (SPD) | 24.0% | 26.6% | 14.7% |
     | Free Democratic Party (FDP) | 15.2% | 17.3% | 11.2% |
     | The Left (Die Linke) | 12.4% | 12.4% | 8.3% |
     | Green Party (Grüne) | 11.1% | 8.0% | 9.3% |
     | Pirate Party (Piraten) | 2.1% | -- | 34.8% |
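One way to read the figures on slide 7 is as a mean absolute error (MAE) between each party's vote share and its share of Twitter mentions. The sketch below is illustrative only: the numbers are copied from the slide, while the `mae` helper and the choice to compare the two studies on the six parties that both report are assumptions made here, not part of the original analysis.

```python
# Vote share and Twitter-mention share (percent), copied from slide 7.
# Each tuple is (election result, original study, replication).
# None marks the value the slide reports as "--".
results = {
    "CDU": (28.4, 30.1, 18.6),
    "CSU": (6.8, 5.6, 3.0),
    "SPD": (24.0, 26.6, 14.7),
    "FDP": (15.2, 17.3, 11.2),
    "Linke": (12.4, 12.4, 8.3),
    "Grüne": (11.1, 8.0, 9.3),
    "Piraten": (2.1, None, 34.8),
}


def mae(rows, column):
    """Mean absolute error between vote share (index 0) and rows[column]."""
    errors = [abs(r[0] - r[column]) for r in rows if r[column] is not None]
    return sum(errors) / len(errors)


all_rows = list(results.values())
# Parties for which both studies report a share (hypothetical comparison set).
common = [r for r in all_rows if r[1] is not None and r[2] is not None]

mae_original = mae(common, 1)
mae_replication = mae(common, 2)
mae_replication_all = mae(all_rows, 2)

print(f"original study, six parties:  {mae_original:.2f} points")
print(f"replication, six parties:     {mae_replication:.2f} points")
print(f"replication, with Piraten:    {mae_replication_all:.2f} points")
```

On these figures the error is about 1.8 percentage points for the original study and about 5.5 for the replication on the same six parties, rising to about 9.4 once the Pirate Party is included.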

  8. Can Twitter Predict Elections? [Figure: tweets mentioning Chavez vs. tweets mentioning Capriles]

  9. Better Small Data = Better Big Data

  10. Using Big Data to Supplement Small Data

  11. Overfitting • When a large number of candidate variables is fit to a relatively small number of data points, the number of strong correlations that will be found by chance alone increases dramatically. • This danger is made even worse by modern algorithms that can find very non-linear and highly interactive relationships. • Out-of-sample prediction helps, but it doesn’t completely solve the problem. • Causality is still important.
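The first point on slide 11 can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: both the target and every predictor are independent Gaussian noise, so any correlation is spurious by construction. The function names, the 30-point sample and the predictor counts are hypothetical choices for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_noise_predictor(n_points, n_predictors, rng):
    """Select the noise predictor most correlated with a noise target.

    Selection uses the first n_points observations; the same predictor is
    then scored on a further n_points held-out observations.
    """
    target = [rng.gauss(0, 1) for _ in range(2 * n_points)]
    best_in, best_out = 0.0, 0.0
    for _ in range(n_predictors):
        pred = [rng.gauss(0, 1) for _ in range(2 * n_points)]
        r_in = pearson(pred[:n_points], target[:n_points])
        if abs(r_in) > abs(best_in):
            best_in = r_in
            best_out = pearson(pred[n_points:], target[n_points:])
    return best_in, best_out


rng = random.Random(0)
for k in (10, 100, 1000):
    r_in, r_out = best_noise_predictor(n_points=30, n_predictors=k, rng=rng)
    print(f"{k:5d} predictors: in-sample r = {r_in:+.2f}, held-out r = {r_out:+.2f}")
```

The in-sample correlation of the selected predictor tends to grow as more candidates are searched, even though nothing is related to anything. A single hold-out check exposes most such winners, but a held-out correlation can itself be large by chance, which is one reason out-of-sample testing does not settle the question on its own.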

  12. Google Flu Trends (GFT) • Examined 50 million search terms. • Used those most heavily correlated with flu prevalence, as measured by CDC regional reports, but curated to weed out non-flu-related searches. • Released in 2008, updated in 2009, updated again in 2013.

  13. Big data means big danger of overfitting … • In Google Flu the search terms are identified through “brute force” (50 million search terms fit to 1152 data points). • Similar problems for other “big data” resources (e.g. Twitter). #feelingalittlesick #nyquilfixeseverything #blowingmyownbrainsout #fallingapart #shamblingnose #YOLO?
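To give a rough sense of scale for "50 million search terms fit to 1152 data points", the sketch below estimates how many completely unrelated terms would exceed a given correlation by chance. It rests on strong assumptions that are not from the slides: every term is independent Gaussian noise, and the Fisher z-transform approximation holds. The function name and the thresholds are hypothetical.

```python
import math

N_TERMS = 50_000_000  # candidate search terms (from slide 13)
N_POINTS = 1152       # data points (from slide 13)


def expected_chance_hits(threshold, n_terms=N_TERMS, n_points=N_POINTS):
    """Expected number of unrelated terms with |r| above threshold.

    Assumes independent normal data, for which atanh(r) is approximately
    Normal(0, 1 / (n_points - 3)) when the true correlation is zero.
    """
    z = math.atanh(threshold) * math.sqrt(n_points - 3)
    two_sided_p = math.erfc(z / math.sqrt(2))
    return n_terms * two_sided_p


for t in (0.05, 0.10, 0.15, 0.20):
    hits = expected_chance_hits(t)
    print(f"|r| > {t:.2f}: about {hits:,.1f} terms expected by chance")
```

Under these assumptions, millions of unrelated terms clear |r| > 0.05 and tens of thousands clear |r| > 0.10. Real search and flu series are seasonal rather than independent noise, so this calculation probably understates how easily strong chance correlations arise in practice.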

  14. Vulnerability to Artifacts • Need to know how your system turns unstructured data into quantitative data. • Examples: • Google NGram project • Stanford Sentiment Analyzer (http://nlp.stanford.edu:8080/sentiment/rntnDemo.html)

  15. Vulnerability to Artifacts

  16. Vulnerability to Artifacts

  17. Vulnerability to Artifacts

  18. Non-Ideal Users • Analyses often assume an ideal user, but: • Users often create multiple accounts • Users will sometimes not answer particular questions • Users will modify their behavior if they know they are being observed • Users will modify their behavior due to irrelevant events

  19. Trends: abdominal pain on my right side

  20. Blue Team Issues • “Data exhaust” is not designed for analysis. • The process generating the data is always changing and is geared towards goals other than collecting accurate data. • This means that the data-generating process can change without warning and in unpredictable ways (e.g. Google’s Search Algorithm). • Even if the data-generating process remains relatively stable, it can still be idiosyncratic (e.g. event data).

  21. Changes that could affect Google Flu… • Improved Trends geolocation. • Recommended searches. • Related search listings. • Diagnosis related to search terms for symptoms. • Customized search results by time and location. • Popularization of particular terms. • And many many more.

  22. Red Team Issues • As we become better at monitoring systems, individuals have incentives to manipulate those signals. • Bots and puppets. • Purchased support.

  23. Red Team Issues

  24. Methods to Address Issues • Strong ground-truth measures (need small data to have good big data). • Causality still matters. • Critically evaluate the sample from which big data is derived. • Dynamic re-estimation. • Understanding the data-generating process and critically tracking system changes. • Anticipating manipulation and setting up early warnings. • “Rosetta Stone” for linking digital behavior to specific social context.
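A minimal sketch of what dynamic re-estimation can look like, assuming a single predictor and a rolling-window least-squares refit. This is not the method of any study cited in these slides; `simulate`, `compare`, the window length and the size of the slope change are all hypothetical, and the data are synthetic.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def simulate(n_weeks=200, shift_at=100, seed=0):
    """Synthetic signal whose relationship to the target changes at shift_at."""
    rng = random.Random(seed)
    xs, ys = [], []
    for t in range(n_weeks):
        x = rng.uniform(0, 10)
        slope = 1.0 if t < shift_at else 2.0  # the data-generating process changes
        xs.append(x)
        ys.append(slope * x + rng.gauss(0, 0.5))
    return xs, ys


def compare(window=20, shift_at=100):
    """Mean absolute error of a fit-once model vs. a rolling refit."""
    xs, ys = simulate(shift_at=shift_at)
    a0, b0 = fit_line(xs[:window], ys[:window])  # static: estimated once
    static_err, dynamic_err = [], []
    for t in range(window, len(xs)):
        # Dynamic: re-estimate on the trailing window before each prediction.
        a, b = fit_line(xs[t - window:t], ys[t - window:t])
        static_err.append(abs(ys[t] - (a0 + b0 * xs[t])))
        dynamic_err.append(abs(ys[t] - (a + b * xs[t])))
    return sum(static_err) / len(static_err), sum(dynamic_err) / len(dynamic_err)


static_mae, dynamic_mae = compare()
print(f"static model MAE:      {static_mae:.2f}")
print(f"re-estimated model MAE: {dynamic_mae:.2f}")
```

The fit-once model keeps applying the old relationship after the change, while the rolling refit recovers once the window contains only post-change data. The refit still depends on a trustworthy ground-truth series to fit against, which is the "small data" requirement listed first on the slide.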

  25. Dynamic re-estimation… Santillana et al. Forthcoming. American Journal of Preventive Medicine. Can we reach Asymptopia?

  26. Understanding Causality and the Data Generating Process

  27. Linking Digital Behavior to Context

  28. Thank You Ryan Kennedy University of Houston
