
User-generated content mining: From collective disease rates to individual demographics
Vasileios Lampos, Computer Science @ UCL (@lampos | lampos.net)
Language Technology Lab, University of Cambridge, Oct. 27, 2016

Structure of the presentation
1. Introductory remarks
2. Collective disease surveillance from search query data
   - Google Flu Trends and inference inaccuracies
   - Steps towards improvement
3. Mining socio-economic demographics from social media users
   - Occupational class
   - Income
   - Socioeconomic status
4. Concluding remarks

Context and Motivation
How can we use online user-generated content (UGC) to our benefit?

User-generated content for health. WHY?
+ Online content can potentially access a larger and more representative part of the population
  Note: Health surveillance systems are based on the subset of people who actively seek medical attention
+ More timely information (almost instant)
+ Geographical regions with less established health monitoring systems could benefit
+ Small cost when data access and modelling expertise are in place

Google Flu Trends: The idea
Can we turn online search query statistics into estimates of the rate of influenza-like illness (ILI) in the real-world population?

Google Flu Trends: Supervised learning
Inputs: search query frequency time series, $X \in \mathbb{R}^{M \times N}$
Targets: flu rates from a health agency, representing doctor consultations, $y \in \mathbb{R}^{M}$
$q$ is the aggregate frequency of a selected subset of the $N$ candidate search queries
$\operatorname{logit}(y) = \beta_0 + \beta_1 \times \operatorname{logit}(q) + \varepsilon$ (Ginsberg et al., 2009)
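The single-variable model on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`fit_logit_linear`, `predict_ili`) and the toy numbers are made up here, the fit is plain ordinary least squares in logit space, and the query selection step that produces $q$ is not shown.

```python
import math

def logit(p):
    """Log-odds of a proportion p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit: maps a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit_linear(q, y):
    """Least-squares fit of logit(y) = b0 + b1 * logit(q)."""
    lq = [logit(v) for v in q]
    ly = [logit(v) for v in y]
    n = len(lq)
    mean_q = sum(lq) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_q) ** 2 for a in lq)
    sxy = sum((a - mean_q) * (b - mean_y) for a, b in zip(lq, ly))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_q
    return b0, b1

def predict_ili(q_new, b0, b1):
    """Map an aggregate query frequency to an estimated ILI rate."""
    return inv_logit(b0 + b1 * logit(q_new))

# Synthetic, noise-free toy data (NOT real query or CDC figures).
q = [0.001, 0.002, 0.004, 0.008, 0.016]
y = [inv_logit(-1.0 + 0.8 * logit(v)) for v in q]

b0, b1 = fit_logit_linear(q, y)
print(b0, b1, predict_ili(0.004, b0, b1))
```

Because the toy targets are generated from the same functional form without noise, the fit should recover the generating coefficients; real data would of course carry the error term $\varepsilon$.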

Google Flu Trends: Failure
[Figure: % ILI over time, 07/01/09 to 07/01/13; series: Lagged CDC, Google Flu, Google Flu + CDC, CDC; annotation: Google estimates more than double CDC estimates (Lazer et al., 2014)]
The estimates of the online Google Flu Trends tool were approximately two times larger than those from the CDC in 2012/13.

Google Flu Trends: Hypotheses for failure
- “Big Data” criticism
- The statistical learning model was not good enough
- Feature selection was not good enough, bringing in spurious search queries
- Media hype about flu significantly affects inference accuracy
- The ground truth is not perfect; it is rather a “silver” standard

Google Flu Trends: Hypotheses for failure (assessed)
X “Big Data” criticism
✓ The statistical learning model was not good enough
✓ Feature selection was not good enough, bringing in spurious search queries
? Media hype about flu significantly affects inference accuracy
✓ ? The ground truth is not perfect; it is rather a “silver” standard

Advances in nowcasting influenza-like illness rates using online search logs Lampos, Miller, Crossan & Stefansen (Nature Scientific Reports, 2015)

Data
Google search logs
- weekly search counts of 49,708 search queries
- corresponding total volume of weekly searches
- user search sessions geolocated in the US
- anonymised and aggregate data
- Jan. 2004 to Dec. 2013 (521 weeks, about a decade)
ILI rates from CDC

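Elastic Net for linear regularised regression
query frequency: $x_i \in \mathbb{R}^m$, $i \in \{1, \dots, n\}$ (matrix $X$)
ILI rates: $y_i \in \mathbb{R}$, $i \in \{1, \dots, n\}$ (vector $y$)
weights, bias: $w_j, \beta \in \mathbb{R}$, $j \in \{1, \dots, m\}$ ($w^* = [w; \beta]$)
$$\operatorname*{argmin}_{w, \beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta - \sum_{j=1}^{m} x_{ij} w_j \Big)^2 + \lambda_1 \sum_{j=1}^{m} |w_j| + \lambda_2 \sum_{j=1}^{m} w_j^2 \right\}$$
The $\lambda_1$ term is the L1-norm penalty and the $\lambda_2$ term is the L2-norm penalty; a sparse set of weights ($w$) is encouraged (Zou & Hastie, 2005).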
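To make the sparsity effect concrete, here is a minimal coordinate-descent sketch of the elastic net objective (squared error plus $\lambda_1$ L1 penalty plus $\lambda_2$ squared L2 penalty). It is a didactic solver written for this note, not the implementation used in the paper; the names `elastic_net` and `soft_threshold`, the iteration count, and the toy data are all assumptions.

```python
def soft_threshold(a, t):
    """Shrink a towards zero by t; values within [-t, t] become exactly 0."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0

def elastic_net(X, y, lam1, lam2, n_iter=200):
    """Minimise sum_i (y_i - beta - x_i.w)^2 + lam1*|w|_1 + lam2*|w|_2^2.

    Assumes every feature column has a non-zero entry or lam2 > 0.
    """
    n, m = len(X), len(X[0])
    w = [0.0] * m
    beta = sum(y) / n
    for _ in range(n_iter):
        for j in range(m):
            rho = 0.0  # correlation of feature j with the partial residual
            sq = 0.0   # squared norm of feature j
            for i in range(n):
                pred_without_j = beta + sum(
                    X[i][k] * w[k] for k in range(m) if k != j
                )
                rho += X[i][j] * (y[i] - pred_without_j)
                sq += X[i][j] ** 2
            # Closed-form minimiser of the objective in w_j alone.
            w[j] = soft_threshold(rho, lam1 / 2.0) / (sq + lam2)
        # The unpenalised bias is the mean residual.
        beta = sum(
            y[i] - sum(X[i][k] * w[k] for k in range(m)) for i in range(n)
        ) / n
    return w, beta

# Toy data: y depends on the first feature only (y = 2*x1 + 1).
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
y = [3.0, 5.0, 7.0, 9.0]

w, beta = elastic_net(X, y, lam1=1.0, lam2=0.1)
print(w, beta)
```

On this toy problem the weight of the irrelevant second feature is driven to exactly zero by the soft-thresholding step, while the first weight is shrunk slightly below its unregularised value of 2.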

Nonlinearities in the data (1)
[Figure: ILI rate against query frequency for the queries “flu symptoms in children” and “flu symptoms in adults”; one set of panels is labelled “logit space”]

Nonlinearities in the data (2)
[Figure: ILI rate against query frequency for the queries “flu remedies” and “tamiflu dosage”; one set of panels is labelled “logit space”]

Gaussian Processes for nonlinear modelling
Say $x \in \mathbb{R}^d$ are the inputs and we want to learn $f: \mathbb{R}^d \to \mathbb{R}$.
$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$
- $m(x)$: mean function, drawn on inputs
- $k(x, x')$: covariance function (kernel), drawn on pairs of inputs
Formally: sets of random variables, any finite number of which have a multivariate Gaussian distribution
Why do we use Gaussian Processes?
+ Kernelised, models nonlinearities
+ Interpretability (Auto Relevance Determination)
+ Performance
(Rasmussen & Williams, 2006)
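As a rough illustration of what "learning $f$" means in practice, the sketch below computes the posterior mean of a zero-mean GP with a squared-exponential kernel on one-dimensional inputs. Everything here is an assumption made for the example: the zero mean function, the fixed hyperparameters, the small `noise_var` jitter, the hand-rolled linear solver, and the toy sine data. Hyperparameter learning is not shown.

```python
import math

def se_kernel(a, b, sigma_f=1.0, ell=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return sigma_f ** 2 * math.exp(-((a - b) ** 2) / (2.0 * ell ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def gp_posterior_mean(x_train, y_train, x_test, noise_var=1e-6):
    """Posterior mean k_*^T (K + noise_var * I)^{-1} y of a zero-mean GP."""
    n = len(x_train)
    K = [
        [se_kernel(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    alpha = solve(K, y_train)
    return [
        sum(se_kernel(xs, x_train[i]) * alpha[i] for i in range(n))
        for xs in x_test
    ]

# Toy data: a few samples of a sine curve.
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [math.sin(v) for v in x_train]

mean_at_train = gp_posterior_mean(x_train, y_train, x_train)
mean_far_away = gp_posterior_mean(x_train, y_train, [100.0])[0]
print(mean_at_train, mean_far_away)
```

With near-zero noise the posterior mean passes (almost) through the training targets, and far from any training input it falls back to the prior mean of zero.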

Common covariance functions (kernels)
- Squared-exp (SE): $k(x, x') = \sigma_f^2 \exp\left( -\frac{(x - x')^2}{2\ell^2} \right)$; type of structure: local variation
- Periodic (Per): $k(x, x') = \sigma_f^2 \exp\left( -\frac{2}{\ell^2} \sin^2\left( \pi \frac{x - x'}{p} \right) \right)$; type of structure: repeating structure
- Linear (Lin): $k(x, x') = \sigma_f^2 (x - c)(x' - c)$; type of structure: linear functions
[Figure: for each kernel, a plot of $k(x, x')$ and functions $f(x)$ sampled from the GP prior]
(Duvenaud, 2014)
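The three kernels translate directly into code. The sketch below assumes scalar inputs and uses illustrative default hyperparameters; the function names `k_se`, `k_per` and `k_lin` are chosen here and do not come from the slides.

```python
import math

def k_se(x, x2, sigma_f=1.0, ell=1.0):
    """Squared-exponential kernel: similarity decays with distance."""
    return sigma_f ** 2 * math.exp(-((x - x2) ** 2) / (2.0 * ell ** 2))

def k_per(x, x2, sigma_f=1.0, ell=1.0, p=1.0):
    """Periodic kernel: inputs a whole number of periods p apart look identical."""
    s = math.sin(math.pi * (x - x2) / p)
    return sigma_f ** 2 * math.exp(-2.0 * s ** 2 / ell ** 2)

def k_lin(x, x2, sigma_f=1.0, c=0.0):
    """Linear kernel with offset c."""
    return sigma_f ** 2 * (x - c) * (x2 - c)

print(k_se(0.0, 1.0))          # exp(-0.5)
print(k_per(0.3, 2.3, p=1.0))  # two full periods apart, so close to 1
print(k_lin(2.0, 3.0, c=1.0))  # (2 - 1) * (3 - 1)
```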

Combining kernels in a GP
It is possible to add or multiply kernels (among other operations):
- Lin × Lin: quadratic functions
- SE × Per: locally periodic
- Lin × SE: increasing variation
- Lin × Per: growing amplitude
(Duvenaud, 2014)
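Since sums and products of valid kernels are again valid kernels, composition can be expressed with two small helpers. This is a sketch with hypothetical names (`add`, `mul`) and unit hyperparameters, restricted to scalar inputs.

```python
import math

def k_se(x, x2):
    """Squared-exponential kernel with unit variance and lengthscale."""
    return math.exp(-((x - x2) ** 2) / 2.0)

def k_per(x, x2, p=1.0):
    """Periodic kernel with unit variance and lengthscale, period p."""
    return math.exp(-2.0 * math.sin(math.pi * (x - x2) / p) ** 2)

def k_lin(x, x2):
    """Linear kernel with unit variance and zero offset."""
    return x * x2

def add(k1, k2):
    """Kernel whose value is the sum of two kernels."""
    return lambda x, x2: k1(x, x2) + k2(x, x2)

def mul(k1, k2):
    """Kernel whose value is the product of two kernels."""
    return lambda x, x2: k1(x, x2) * k2(x, x2)

lin_lin = mul(k_lin, k_lin)  # quadratic functions
se_per = mul(k_se, k_per)    # locally periodic structure
se_plus_lin = add(k_se, k_lin)

# Gram matrix of the locally periodic kernel on a few points.
points = [0.0, 0.5, 1.0, 1.5]
gram = [[se_per(a, b) for b in points] for a in points]
print(lin_lin(2.0, 3.0), gram[0][2])
```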

Exploring nonlinearities with Gaussian Processes
GP kernel on query clusters:
$$k(x, x') = \left( \sum_{i=1}^{C} k_{\text{SE}}(c_i, c_i') \right) + \sigma_n^2 \cdot \delta(x, x')$$
+ protects inferences from radical changes in the frequency of isolated queries
+ models the contribution of various themes (clusters) to the final prediction (by-product: interpretability)
+ learns a sum of lower-dimensional functions: smaller input space, easier learning task, fewer samples required, more statistical traction obtained
- [trade-off] assumption that relationships between queries in separate clusters provide no information about ILI
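A sketch of the clustered kernel follows, assuming that $c_i$ denotes the entries of the query-frequency vector $x$ that belong to cluster $i$ and that $\delta$ is 1 only when both arguments are the same input. The cluster assignment, the shared hyperparameters, and the name `cluster_kernel` are illustrative; how the queries are clustered is not shown.

```python
import math

def se_vec(a, b, sigma_f=1.0, ell=1.0):
    """Squared-exponential kernel on two equal-length vectors."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
    return sigma_f ** 2 * math.exp(-sq_dist / (2.0 * ell ** 2))

def cluster_kernel(x, x2, clusters, sigma_n=0.1):
    """Sum of per-cluster SE kernels plus a noise term on identical inputs.

    clusters is a list of index lists; each list selects the dimensions
    (queries) that form one cluster.
    """
    total = 0.0
    for idx in clusters:
        c = [x[j] for j in idx]
        c2 = [x2[j] for j in idx]
        total += se_vec(c, c2)
    if x == x2:  # delta(x, x'): assumption that equal vectors are the same point
        total += sigma_n ** 2
    return total

# Toy example: three queries, the first two grouped into one cluster.
clusters = [[0, 1], [2]]
xa = [0.1, 0.2, 0.3]
xb = [0.1, 0.2, 0.9]

k_same = cluster_kernel(xa, xa, clusters)
k_diff = cluster_kernel(xa, xb, clusters)
print(k_same, k_diff)
```

In the toy example only the query in the second cluster changes between `xa` and `xb`, so only that cluster's term drops below 1. This is the sense in which a radical change in an isolated query affects one additive component rather than the whole kernel value.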

Inference performance
[Figure: bar chart of MAPE (%) for three models (Google Flu Trends old model, Elastic Net, Gaussian Process with 10 clusters) on two evaluation sets (test data; test data, peaking moments). Bar labels, in descending order: 24.8%, 20.4%, 15.8%, 11.9%, 11%, 10.8%]
Mean absolute percentage (%) of error (MAPE) in flu rate estimates (2008-2013)
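For reference, MAPE can be computed as below. This is a minimal sketch with made-up numbers; the function name `mape` and the input validation are choices made here, and the example values are not taken from the study.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumes no true value is zero (ILI rates are positive).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)

# Toy check: errors of 50% and 25% average to 37.5%.
example = mape([2.0, 4.0], [1.0, 5.0])
print(example)
```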

Comparative inference plots
What happened here?

From 4 Dec. 2011 to 28 Apr. 2012…
Top-5 most influential search queries for flu rate inferences
- GFT original model: rsv, flu symptoms, benzonatate, symptoms of pneumonia, upper respiratory infection
- Elastic Net: ear thermometer, musinex, how to break a fever, flu like symptoms, fever reducer
[Figure: horizontal bar chart of query influence, axis from 0% to 25%]

I am skipping…
(1) How, and, hence, why the GP-clustering works
(2) The obvious auto-regressive extensions
(3) How we incorporated statistical NLP to further improve models (submitted paper)

Inferring user-level information from user-generated content
- occupational class: Preotiuc-Pietro, Lampos & Aletras (ACL 2015)
- income: Preotiuc-Pietro, Volkova, Lampos, Bachrach & Aletras (PLOS ONE, 2015)
- socio-economic status (SES): Lampos, Aletras, Geyti, Zou & Cox (ECIR 2016)

About Twitter
> 140 characters per published status (tweet)
> users can follow and be followed
> embedded usage of topics (using #hashtags)
> user interaction (re-tweets, @mentions, likes)
> real-time nature
> biased demographics (13-15% of UK’s population, age bias etc.)
> information is noisy and not always accurate

Linguistic expression and demographics
“Socioeconomic variables are influencing language use.” (Bernstein, 1960; Labov, 1972/2006)
+ Validate this hypothesis on a broader, larger data set using social media
+ Applications
  > research, as in computational social science, health, and psychology
  > commercial
