Opportunities and Challenges in using Big Data to Understand Human Systems Ryan Kennedy University of Houston
How Big is Big Data? • In the 3rd century BC, the Library of Alexandria was thought to contain the entire sum of human knowledge. • Today, there is enough digitally stored information to give every person alive 320 times as much information as we think was stored in that library.
Opportunities in Big Data • Microtargeting • Real-time feedback • Information on hard-to-find groups • Data on social systems
Big Data Hype • (Mayer-Schönberger and Cukier 2014): • Approaching “N = all” • Correlation is enough (no need for theory)
Problems in Big Data • Big Data Hubris • Overfitting • Vulnerability to Artifacts • Non-ideal Users • Blue Team Issues • Red Team Issues
Big Data Hubris • The belief that volume can solve all problems. • Don’t get me wrong, it can solve some (e.g. Xbox project). • But there are several things that have to be clear: • Sampling frame • Convenience samples are still convenience samples • Generalizability still has the standard limitations • Motivations for uses of technology not always clear • Behavioral analogue must be clear (and often requires small data)
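The point that convenience samples remain convenience samples can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population size, the 20% platform-usage rate, and the support rates (65% among users, 46.25% among non-users) are hypothetical numbers chosen so that overall support is about 50%.

```python
import random

def compare_samples(pop_size=200_000, seed=0):
    """Compare a huge convenience sample with a small random sample."""
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        # Hypothetical: 20% of people use the platform.
        is_user = rng.random() < 0.20
        # Hypothetical: users lean toward the candidate more than non-users.
        p_support = 0.65 if is_user else 0.4625
        population.append((is_user, rng.random() < p_support))

    true_share = sum(s for _, s in population) / pop_size

    # "Big data": every platform user, tens of thousands of records.
    convenience = [s for u, s in population if u]
    conv_est = sum(convenience) / len(convenience)

    # "Small data": 1,000 people drawn at random from the whole population.
    random_sample = rng.sample(population, 1_000)
    rand_est = sum(s for _, s in random_sample) / len(random_sample)

    return true_share, conv_est, len(convenience), rand_est

true_share, conv_est, n_conv, rand_est = compare_samples()
print(f"true share:          {true_share:.3f}")
print(f"convenience (n={n_conv}): {conv_est:.3f}")
print(f"random (n=1000):     {rand_est:.3f}")
```

In this setup the convenience sample is far larger, yet its estimate stays near the users' support rate rather than the population's, while the small random sample lands much closer to the truth. Volume shrinks sampling noise but does nothing about the sampling frame.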
Can Twitter Predict Elections?
Party | Election Results | Share of Twitter Mentions (Original Study) | Share of Twitter Mentions (Replication)
Christian Democrats (CDU) | 28.4% | 30.1% | 18.6%
Christian Social Union (CSU) | 6.8% | 5.6% | 3.0%
Social Democratic Party (SPD) | 24.0% | 26.6% | 14.7%
Free Democratic Party (FDP) | 15.2% | 17.3% | 11.2%
The Left (Die Linke) | 12.4% | 12.4% | 8.3%
Green Party (Grüne) | 11.1% | 8.0% | 9.3%
Pirate Party (Piraten) | 2.1% | -- | 34.8%
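One way to summarize the table is the mean absolute error (MAE) between each party's share of Twitter mentions and its election result. The sketch below uses only the figures from the slide; the choice of MAE as the summary measure and the variable names are assumptions made for illustration.

```python
# (party, election result, original-study share, replication share), in percent.
# None marks the value the original study did not report ("--" on the slide).
ROWS = [
    ("CDU",     28.4, 30.1, 18.6),
    ("CSU",      6.8,  5.6,  3.0),
    ("SPD",     24.0, 26.6, 14.7),
    ("FDP",     15.2, 17.3, 11.2),
    ("Linke",   12.4, 12.4,  8.3),
    ("Grüne",   11.1,  8.0,  9.3),
    ("Piraten",  2.1, None, 34.8),
]

def mean_abs_error(pairs):
    """Average absolute gap between paired values, skipping missing ones."""
    kept = [(a, b) for a, b in pairs if a is not None and b is not None]
    return sum(abs(a - b) for a, b in kept) / len(kept)

# Original study: six parties (the Pirate Party is missing).
mae_original = mean_abs_error((r[1], r[2]) for r in ROWS)
# Replication: all seven parties, including the Pirate Party.
mae_replication_all = mean_abs_error((r[1], r[3]) for r in ROWS)
# Replication restricted to the same six parties as the original study.
mae_replication_six = mean_abs_error(
    (r[1], r[3]) for r in ROWS if r[2] is not None
)

print(f"original study (6 parties): {mae_original:.2f}")
print(f"replication (7 parties):    {mae_replication_all:.2f}")
print(f"replication (6 parties):    {mae_replication_six:.2f}")
```

On these figures the error is roughly 1.8 percentage points for the original study, about 9.4 for the replication with all seven parties, and about 5.5 for the replication restricted to the original six. Which parties are counted matters a great deal.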
Can Twitter Predict Elections? [Figure with two series: tweets mentioning Chavez; tweets mentioning Capriles]
Better Small Data = Better Big Data
Using Big Data to Supplement Small Data
Overfitting • When a large number of candidate predictors is fit to a relatively small number of data points, the number of strong correlations that will be found by chance alone increases dramatically. • This danger is made even worse by modern algorithms that can find very non-linear and highly interactive relationships. • Out-of-sample prediction helps, but it doesn’t completely solve the problem. • Causality is still important.
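The first bullet can be checked with a short simulation: correlate a target made of pure noise with more and more predictors that are also pure noise, and count how many look "strong". The sample size of 20, the |r| > 0.5 threshold, and the function names are arbitrary choices for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def count_chance_hits(n_predictors, n_points=20, threshold=0.5, seed=1):
    """Count noise predictors whose |r| with a noise target exceeds threshold."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson(predictor, target)) > threshold:
            hits += 1
    return hits

for k in (100, 1_000, 10_000):
    print(k, "predictors ->", count_chance_hits(k), "strong correlations by chance")
```

Nothing here is related to anything else by construction, so every "strong" correlation is spurious, and their count grows roughly in proportion to the number of predictors tried.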
Google Flu Trends (GFT) • Examined 50 million search terms. • Utilized those most heavily correlated with flu prevalence, as measured by CDC regional reports, but curated to weed out non-flu-related searches. • Released in 2008, updated in 2009, updated again in 2013.
Big data means big danger of overfitting … • In Google Flu the search terms are identified through “brute force” (50 million search terms fit to 1152 data points). • Similar problems for other “big data” resources (e.g. Twitter). #feelingalittlesick #nyquilfixeseverything #blowingmyownbrainsout #fallingapart #shamblingnose #YOLO?
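A scaled-down sketch of the "brute force" problem: pick the best-correlated term out of many candidates, then check it on data that were held out. This is not Google's procedure. The 5,000 candidate "terms", the 40 data points, and the even split are hypothetical stand-ins for 50 million terms and 1152 points, and every series is pure noise.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def brute_force_select(n_terms=5_000, n_points=40, seed=2):
    """Select the noise 'term' with the highest in-sample |r|, then test it."""
    rng = random.Random(seed)
    split = n_points // 2
    flu = [rng.gauss(0, 1) for _ in range(n_points)]  # noise "flu" series
    best_term, best_r = None, 0.0
    for _ in range(n_terms):
        term = [rng.gauss(0, 1) for _ in range(n_points)]  # noise "search term"
        r = pearson(term[:split], flu[:split])  # fit on the first half only
        if abs(r) > abs(best_r):
            best_term, best_r = term, r
    out_r = pearson(best_term[split:], flu[split:])  # evaluate on held-out half
    return best_r, out_r

in_sample_r, out_of_sample_r = brute_force_select()
print(f"in-sample r of selected term:     {in_sample_r:+.2f}")
print(f"out-of-sample r of selected term: {out_of_sample_r:+.2f}")
```

The selected term looks impressive in-sample only because it was the winner among thousands of tries; on held-out data it behaves like the noise it is. A holdout check catches this case, though it cannot reveal terms that track flu for non-causal reasons such as shared seasonality.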
Vulnerability to Artifacts • Need to know how your system turns unstructured data into quantitative data. • Examples: • Google NGram project • Stanford Sentiment Analyzer (http://nlp.stanford.edu:8080/sentiment/rntnDemo.html)
Non-Ideal Users • Often an ideal-user assumption, but: • Users often create multiple accounts • Users will sometimes not answer particular questions • Users will modify behavior if they know they are being observed • Users will modify behavior due to irrelevant events
Trends: abdominal pain on my right side
Blue Team Issues • “Data exhaust” not designed for analysis. • Process generating data always changing and geared towards goals other than collecting accurate data. • This means that the data-generating process can change without warning and in unpredictable ways (e.g. Google’s Search Algorithm). • Even if the data-generating process remains relatively stable, it can still be idiosyncratic (e.g. event data).
Changes that could affect Google Flu… • Improved Trends geolocation. • Recommended searches. • Related search listings. • Diagnosis related to search terms for symptoms. • Customized search results by time and location. • Popularization of particular terms. • And many, many more.
Red Team Issues • As we become better at monitoring systems, individuals have incentives to manipulate those signals. • Bots and puppets. • Purchased support.
Methods to Address Issues • Strong ground truth measures (need small data to have good big data). • Causality still matters. • Critically evaluate the sample from which big data is derived. • Dynamic re-estimation. • Understanding data generating process and critically tracking system changes. • Anticipating manipulation and setting up early warnings. • “Rosetta Stone” for linking digital behavior to specific social context.
Dynamic re-estimation… Santillana et al. Forthcoming. American Journal of Preventive Medicine. Can we reach Asymptopia?
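A minimal sketch of what dynamic re-estimation can look like, assuming a simple linear model refit on a rolling window. This is not the method of Santillana et al.; the synthetic data, the mid-series change in slope (standing in for an unannounced change in the data-generating process), and the window length are all hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def compare_static_vs_dynamic(n=200, train=50, window=30, seed=3):
    """Return (static MAE, dynamic MAE) on data whose slope shifts midway."""
    rng = random.Random(seed)
    xs = [rng.uniform(1, 10) for _ in range(n)]  # e.g. search volume
    # Hypothetical drift: the signal-to-outcome slope doubles halfway through.
    ys = [(1.0 if t < n // 2 else 2.0) * x + rng.gauss(0, 0.5)
          for t, x in enumerate(xs)]

    a0, b0 = fit_line(xs[:train], ys[:train])  # static model: fit once
    static_err, dynamic_err = [], []
    for t in range(train, n):
        static_err.append(abs(ys[t] - (a0 + b0 * xs[t])))
        # Dynamic model: refit on the most recent `window` observations.
        a, b = fit_line(xs[t - window:t], ys[t - window:t])
        dynamic_err.append(abs(ys[t] - (a + b * xs[t])))
    return sum(static_err) / len(static_err), sum(dynamic_err) / len(dynamic_err)

static_mae, dynamic_mae = compare_static_vs_dynamic()
print(f"static model MAE:  {static_mae:.2f}")
print(f"dynamic model MAE: {dynamic_mae:.2f}")
```

The static model keeps applying the old relationship after the shift, while the rolling refit recovers within roughly one window length. The trade-off is the window size: shorter windows adapt faster but give noisier estimates.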
Understanding Causality and the Data Generating Process
Linking Digital Behavior to Context
Thank You Ryan Kennedy University of Houston