Discovering the multifaceted information hidden within large user-generated text streams Daniel Preotiuc-Pietro daniel@dcs.shef.ac.uk 23.04.2014
Context • vast increase in user-generated content • Online social networks: the most time-consuming activity on the Internet • multiple modalities: text, time, location, user info, images, etc. • social network structure • Challenges: • Engineering: data volume • Algorithmic: restricted information, grounded in context, streaming, noise
Motivation Assumption: Text is used differently depending on factors such as time, location, etc. Aim: Build models which incorporate these factors Tasks: • Supervised prediction applications • internal, external • Study the effect of these factors in text use • Improve performance of downstream applications
Outline i. Introduction ii. Data processing iii. Temporal patterns iv. Text forecasting real-world outcomes v. Spatio-temporal clustering vi. User level properties
TrendMiner project • 'Large scale, cross-lingual trend mining and summarization of real time media streams' • 6+4 organisations; we work with University of Southampton and SORA on machine learning • application to predicting political polls and aiding political analysts to make sense of social media data www.trendminer-project.eu
Text Processing • Challenges: new conventions, lack of context, creative spellings, shortenings, unorthodox capitalisation, OOV words • Example: RT @MediaScotland greeeat!!!lvly speech by cameron on scott's indy :) #indyref
Processing Architecture • Fast: real time processing, Hadoop MapReduce (I/O bound), online and batch processing • Scalable: adding more machines • Modular: easy to add new modules • Pipeline: the user specifies his needs • Extensible: different sources of data (USMF format) • Data consistency: JSON format, append to ‘analysis’ • Reusable: open-source (ICWSM 2012)
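The modular, append-only design described on this slide can be illustrated with a minimal sketch. It assumes each record is one JSON object with a "text" field and that every module returns a dictionary that is merged under the 'analysis' key. The module names `tokenise` and `find_hashtags` are hypothetical, and the sketch runs on a single record rather than on Hadoop MapReduce.

```python
import json


def tokenise(record):
    """Hypothetical module: whitespace tokenisation of the record text."""
    return {"tokens": record["text"].split()}


def find_hashtags(record):
    """Hypothetical module: collect the tokens that start with '#'."""
    return {"hashtags": [t for t in record["text"].split() if t.startswith("#")]}


def run_pipeline(line, modules):
    """Apply each module to one JSON record, appending results under 'analysis'."""
    record = json.loads(line)
    analysis = record.setdefault("analysis", {})
    for module in modules:
        # Original fields are never overwritten; modules only add to 'analysis'.
        analysis.update(module(record))
    return json.dumps(record)


raw = json.dumps({"text": "greeeat speech by cameron #indyref"})
processed = json.loads(run_pipeline(raw, [tokenise, find_hashtags]))
```

Because every module only reads the record and appends to 'analysis', modules can be added, removed or reordered without changing the input format.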
Components
Gaussian Processes Task: Forecast hashtag frequency in Social Media - identify and categorise complex temporal patterns (EMNLP 2013) Non-parametric Bayesian framework • kernelised • probabilistic formulation • propagation of uncertainty • exact posterior inference for regression • Non-parametric extension of Bayesian regression • very good results, but hardly used in NLP
Gaussian Processes • Define prior over functions • Compute posterior (ACL 2014 Tutorial)
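The two steps named here, defining a prior over functions and computing the posterior, can be sketched for regression as follows. This is a minimal illustration, assuming a zero-mean prior, a squared-exponential (SE) kernel with fixed hyperparameters, a small noise term and toy data; none of these values come from the talk, and hyperparameter learning is not shown.

```python
import math


def se_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance between two scalar inputs (the prior)."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(train_x, train_y, test_x, noise=1e-4):
    """Exact posterior mean and variance of a zero-mean GP at each test input."""
    K = [[se_kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(train_x)] for i, a in enumerate(train_x)]
    L = cholesky(K)
    alpha = solve_cholesky(L, train_y)
    means, variances = [], []
    for x_star in test_x:
        k_star = [se_kernel(x_star, a) for a in train_x]
        v = solve_cholesky(L, k_star)
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        variances.append(se_kernel(x_star, x_star)
                         - sum(k * vi for k, vi in zip(k_star, v)))
    return means, variances


# Toy data: four observations; predict at an observed point and far away.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.0, 0.8, 0.9, 0.1]
means, variances = gp_posterior(train_x, train_y, [1.0, 10.0])
```

At the observed input the posterior mean stays close to the observation and the variance is small. Far from the data the prediction falls back to the prior mean with the prior variance, which is the propagation of uncertainty listed among the properties of the framework.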
Extrapolation
Examples of time series #FYI #SNOW SE #FAIL #RAW
Experimental results
Experimental results Compared to Mean prediction
Text classification Task: Assign the hashtag to a given tweet • Most frequent (MF) • Naive Bayes model (NB-E) • Naive Bayes with GP forecast as prior (NB-P)
Metric     MF       NB-E     NB-P
Match@1    7.28%    16.04%   17.39%
Match@5    19.90%   29.51%   31.91%
Match@50   44.92%   59.17%   60.85%
MRR        0.144    0.237    0.252
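The difference between a Naive Bayes model with a fixed prior and one whose prior comes from a forecast can be sketched as below. This is an illustration only: the training tweets, the add-one smoothing and the forecast values are invented for the example, and the forecast dictionary merely stands in for the output of a GP model.

```python
import math
from collections import Counter, defaultdict


def train_word_likelihoods(tweets, smoothing=1.0):
    """Estimate smoothed log P(word | hashtag) from (tokens, hashtag) pairs."""
    counts = defaultdict(Counter)
    vocab = set()
    for tokens, tag in tweets:
        counts[tag].update(tokens)
        vocab.update(tokens)
    log_lik = {}
    for tag, c in counts.items():
        total = sum(c.values()) + smoothing * len(vocab)
        log_lik[tag] = {w: math.log((c[w] + smoothing) / total) for w in vocab}
    return log_lik


def rank_hashtags(tokens, log_lik, prior):
    """Rank hashtags by log prior plus word log-likelihoods (unseen words ignored)."""
    scores = {}
    for tag, word_ll in log_lik.items():
        scores[tag] = math.log(prior[tag]) + sum(
            word_ll[w] for w in tokens if w in word_ll)
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank(ranking, gold):
    """1 / rank of the gold hashtag, or 0 if it is not ranked."""
    return 1.0 / (ranking.index(gold) + 1) if gold in ranking else 0.0


# Invented toy training data.
tweets = [
    (["cold", "morning"], "#snow"),
    (["cold", "roads"], "#snow"),
    (["cold", "coffee"], "#fail"),
    (["late", "again"], "#fail"),
]
log_lik = train_word_likelihoods(tweets)

empirical_prior = {"#snow": 0.5, "#fail": 0.5}   # fixed prior from training counts
forecast_prior = {"#snow": 0.1, "#fail": 0.9}    # placeholder for a forecast at test time

ranking_fixed = rank_hashtags(["cold"], log_lik, empirical_prior)
ranking_forecast = rank_hashtags(["cold"], log_lik, forecast_prior)
```

With the fixed prior the word evidence decides the ranking. With the forecast prior a hashtag expected to be frequent at test time can overtake it, which is the mechanism by which a time-aware prior can change Match@k and MRR.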
User behaviour Task: Predict venue check-in frequencies • Modelled using GPs • Compared to Mean (Figure: results for Linear, SE, PER, PS and Select; vertical axis from -150 to 100)
Individual user behaviour Task: Predict venue type of user check-in • highly periodic • compared to standard Markov predictors (WebScience 2013)
Method            Accuracy
Random            11.11%
M.Freq Categ.     35.21%
Markov-1          36.13%
Markov-2          34.21%
Daily period      38.92%
Weekly period     40.65%
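The contrast between a first-order Markov predictor and a periodic predictor can be sketched as follows. This is a simplified illustration, assuming check-ins are given as (hour index, venue type) pairs and that a periodic predictor returns the most frequent venue type for the same slot of the period. The venue types and check-in times are invented, and the predictors evaluated in the talk may differ in detail.

```python
from collections import Counter, defaultdict


def fit_markov1(sequence):
    """Count transitions between consecutive venue types."""
    transitions = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        transitions[prev][nxt] += 1
    return transitions


def predict_markov1(transitions, last_venue, fallback):
    """Most frequent successor of the last venue type, or the fallback if unseen."""
    successors = transitions.get(last_venue)
    return successors.most_common(1)[0][0] if successors else fallback


def fit_periodic(checkins, period):
    """Count venue types per slot of the period (24 = daily, 168 = weekly)."""
    slots = defaultdict(Counter)
    for hour, venue in checkins:
        slots[hour % period][venue] += 1
    return slots


def predict_periodic(slots, hour, period, fallback):
    """Most frequent venue type seen in the same slot of earlier periods."""
    slot = slots.get(hour % period)
    return slot.most_common(1)[0][0] if slot else fallback


# Invented check-ins over three days: (hours since start, venue type).
checkins = [(8, "Work"), (12, "Food"), (19, "Home"),
            (32, "Work"), (36, "Food"), (43, "Home"),
            (56, "Work"), (60, "Food")]

transitions = fit_markov1([venue for _, venue in checkins])
slots = fit_periodic(checkins, period=24)

markov_guess = predict_markov1(transitions, "Food", fallback="Work")
daily_guess = predict_periodic(slots, hour=67, period=24, fallback="Work")
unseen_guess = predict_periodic(slots, hour=3, period=24, fallback="Work")
```

The Markov predictor conditions only on the previous venue type, while the periodic predictor conditions only on the time slot, so the two exploit different regularities in the same check-in history.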
Text based forecasting Task: predicting real world outcomes Aim: replace expensive polls with streaming text • predict political voting intention (not elections!) • based on social media (Twitter) text • strong baselines (last day, mean) • 2 different use cases (UK and Austria) • UK: 42k users, 60m tweets, 3 parties, 2 years (ACL 2013)
Linear regression $w x_t + \beta = y_t$
Linear regression $w, \beta = \arg\min \sum_{i=1}^{n} (w x_i + \beta - y_i)^2$
Linear regression $w, \beta = \arg\min \sum_{i=1}^{n} (w x_i + \beta - y_i)^2 + \psi_{el}(w, \rho)$ LEN – Elastic Net
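How an elastic net penalty yields sparse weights can be shown with a small coordinate-descent sketch. It assumes the penalty takes the common form l1 * ||w||_1 + l2 * ||w||_2^2 added to the summed squared error. The slides do not spell out the exact parameterisation, so the names `l1` and `l2`, the solver and the toy data are all assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, l1=0.1, l2=0.1, sweeps=50):
    """Minimise sum_i (w.x_i + beta - y_i)^2 + l1*|w|_1 + l2*|w|_2^2."""
    n, m = len(X), len(X[0])
    w, beta = [0.0] * m, 0.0
    for _ in range(sweeps):
        # The unpenalised bias is the mean residual.
        beta = sum(y[i] - sum(w[k] * X[i][k] for k in range(m))
                   for i in range(n)) / n
        for j in range(m):
            rho = norm = 0.0
            for i in range(n):
                # Residual with feature j left out of the prediction.
                partial = y[i] - beta - sum(
                    w[k] * X[i][k] for k in range(m) if k != j)
                rho += X[i][j] * partial
                norm += X[i][j] ** 2
            # Exact minimiser of the objective in coordinate j.
            w[j] = soft_threshold(rho, l1 / 2) / (norm + l2)
    return w, beta


# Toy data: y = 2 * x0 + 1; features x1 and x2 carry no signal.
X = [[1.0, 1.0, 1.0],
     [-1.0, 1.0, -1.0],
     [1.0, -1.0, -1.0],
     [-1.0, -1.0, 1.0]]
y = [3.0, -1.0, 3.0, -1.0]
w, beta = elastic_net(X, y)
```

On this toy problem the two uninformative features receive a weight of exactly zero, while the informative weight is shrunk slightly below its true value of 2. This combination of selection and shrinkage is what the penalty contributes.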
Bilinear regression • main issue is noise: many non-informative users • we look for a model of sparse words & sparse users • bi-convex optimisation problem • solved by alternately fixing each set of weights and iterating until convergence
Bilinear regression $u X_t w^{T} + \beta = y_t$
Bilinear regression $w, u, \beta = \arg\min \sum_{i=1}^{n} (u X_i w^{T} + \beta - y_i)^2$
Bilinear regression $w, u, \beta = \arg\min \sum_{i=1}^{n} (u X_i w^{T} + \beta - y_i)^2 + \psi_{el}(w, \rho_1) + \psi_{el}(u, \rho_2)$ BEN – Bilinear Elastic Net
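The alternating scheme for the bilinear model can be sketched as below: with the user weights fixed the problem becomes an ordinary regression on word features, and with the word weights fixed it becomes a regression on user features. This is a minimal sketch under assumptions: the elastic net penalty is taken as l1 * |.|_1 + l2 * |.|_2^2, each half-step is a single coordinate-descent sweep rather than a full solve, and the data, initialisation and fixed number of rounds are invented for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_threshold(value, threshold):
    return max(value - threshold, 0.0) + min(value + threshold, 0.0)


def en_sweep(features, y, weights, l1, l2):
    """One elastic-net coordinate sweep; updates weights in place, returns the bias."""
    n = len(features)
    beta = sum(y[i] - dot(features[i], weights) for i in range(n)) / n
    for j in range(len(weights)):
        rho = norm = 0.0
        for i in range(n):
            partial = (y[i] - beta - dot(features[i], weights)
                       + features[i][j] * weights[j])
            rho += features[i][j] * partial
            norm += features[i][j] ** 2
        weights[j] = soft_threshold(rho, l1 / 2) / (norm + l2)
    return beta


def objective(X, y, u, w, beta, l1, l2):
    """Squared error of u X_i w^T + beta plus elastic net penalties on u and w."""
    loss = sum((dot(u, [dot(row, w) for row in X_i]) + beta - y_i) ** 2
               for X_i, y_i in zip(X, y))
    return loss + sum(l1 * abs(v) + l2 * v * v for v in u + w)


def fit_bilinear(X, y, l1=0.1, l2=0.1, rounds=20):
    p, m = len(X[0]), len(X[0][0])
    u, w, beta = [1.0] * p, [0.0] * m, 0.0
    for _ in range(rounds):
        # Fix u: each time step collapses to a word vector z_i = u X_i.
        Z = [[sum(u[a] * X_i[a][b] for a in range(p)) for b in range(m)]
             for X_i in X]
        beta = en_sweep(Z, y, w, l1, l2)
        # Fix w: each time step collapses to a user vector v_i = X_i w^T.
        V = [[dot(row, w) for row in X_i] for X_i in X]
        beta = en_sweep(V, y, u, l1, l2)
    return u, w, beta


# Toy data: 4 time steps, 2 users (rows) x 2 words (columns) per step.
X = [[[1.0, 0.0], [0.0, 1.0]],
     [[2.0, 0.0], [1.0, 0.0]],
     [[3.0, 1.0], [0.0, 1.0]],
     [[4.0, 0.0], [1.0, 1.0]]]
y = [1.0, 2.0, 3.0, 4.0]

start = objective(X, y, [1.0, 1.0], [0.0, 0.0], 0.0, 0.1, 0.1)
u, w, beta = fit_bilinear(X, y)
end = objective(X, y, u, w, beta, 0.1, 0.1)
```

Every update minimises the joint objective exactly in one coordinate, so the objective never increases from one step to the next. In practice one would stop when the decrease falls below a tolerance instead of running a fixed number of rounds.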
Bilinear regression $w_t, u_t, \beta = \arg\min \sum_{i=1}^{n} (u_t X_i w_t^{T} + \beta - y_{ti})^2 + \psi_{el}(w_t, \rho_1) + \psi_{el}(u_t, \rho_2)$
Bilinear regression $w, u, \beta = \arg\min \sum_{t=1}^{\tau} \sum_{i=1}^{n} (u_t X_i w_t^{T} + \beta - y_{ti})^2 + \psi_{l_1 l_2}(w, \rho_1) + \psi_{l_1 l_2}(u, \rho_2)$ BGL – Bilinear Group LASSO
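The effect of a group penalty across tasks can be illustrated with a short sketch. It assumes the usual l1/l2 mixed norm: weights are arranged with one row per feature (word or user) and one column per task, and the penalty sums the Euclidean norms of the rows. The slides do not define the regulariser in detail, so this layout and the toy weights are assumptions.

```python
import math


def l1_l2_norm(W):
    """Sum over rows (one per feature) of the Euclidean norm across tasks."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)


def group_soft_threshold(row, threshold):
    """Proximal step for threshold * ||row||_2: shrink the whole row or zero it."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm <= threshold:
        return [0.0] * len(row)
    scale = 1.0 - threshold / norm
    return [scale * v for v in row]


# Toy weights: 2 features x 3 tasks (for example, three parties).
W = [[3.0, 4.0, 0.0],
     [0.1, 0.0, 0.1]]

penalty = l1_l2_norm(W)
kept = group_soft_threshold(W[0], 1.0)
dropped = group_soft_threshold(W[1], 1.0)
```

A feature with small weights in every task is removed from all tasks at once, while a feature that matters for at least one task keeps a shrunken weight in each. This is the shared sparsity pattern that separates a group penalty from applying an elastic net to each task independently.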
Quantitative results Root Mean Squared Error (RMSE) forecasting results over 50 testing polls (in VI %) (Figure legend: Polls, BEN, BGL)
Quantitative results
Party  Score    Author               Tweet
CON    1.334    Journalist           PM in friendly chat with top EU mate, Sweden’s Fredrik Reinfeldt, before family photo
CON    -0.991   Journalist           Have Liberal Democrats broken electoral rules? Blog on Labour complaint to cabinet secretary
LAB    1.954    Art Fanzine          Blog Post Liverpool: City of Radicals Website now Live <link> #liverpool #art
LAB    -0.552   Politician (Labour)  I am so pleased to head Paul Savage who worked for the Labour group has been Appointed the Marketing manager for the baths hall GREAT NEWS
LBD    0.874    LibDem MP            RT @user: Must be awful for TV bosses to keep getting knocked back by all the women they ask to host election night (via @user)
LBD    -0.521   Art Fanzine          Blog Post Liverpool: City of Radicals 2011 – More Details Announced #liverpool #art
User features • The real-world outcome and users share: i. region info: London (L), South England (S), Midlands & Wales (MW), North (N), Scotland (Sc) - observed ii. gender: Male (M), Female (F) - inferred using statistical text-based classifier iii. age: 18-24, 25-39, 40-59, 60+ - unknown
Recap: Bilinear regression $w, u, \beta = \arg\min \sum_{t=1}^{\tau} \sum_{i=1}^{n} (u_t X_i w_t^{T} + \beta - y_{ti})^2 + \psi_{l_1 l_2}(w, \rho_1) + \psi_{l_1 l_2}(u, \rho_2)$ BGL – Bilinear Group LASSO
Region & Demographics $w, u, \beta = \arg\min \sum_{t=1}^{\tau} \sum_{r} \sum_{i=1}^{n} (u_{tr} X_{ir} w_{tr}^{T} + \beta_{tr} - y_{tir})^2 + \sum_{r} \psi_{l_1 l_2}(w_r, \rho_1) + \psi_{l_1 l_2}(w_t, \rho_1) + \psi_{l_1 l_2}(u_r, \rho_2)$ BGGR
Region & Demographics
Regional model
Model    S     L     MW    N     Sc    μ
B_μ      2.9   3.9   3.2   3.2   3.8   3.4
B_last   3.0   4.9   4.3   4.0   5.3   4.3
BGGR     2.6   3.9   3.2   3.0   3.7   3.3
Gender model
Model    M     F     μ
B_μ      2.6   2.1   2.4
B_last   2.6   2.4   2.5
BGGR     2.1   2.1   2.1
Region & Demographics London Predictions Female Predictions
Region & Demographics Conservatives, Positive London
NewsSummaries dataset Task: Predict socioeconomic EU indicators Dataset: • News summaries from Open Europe think tank • Daily summaries of EU and member states related news together with their news source • Feb 2006 – Nov 2013; 1,913 days; 94 months • 296 news outlets (with >10 summaries) • Features: unigrams + bigrams (LACSS 2014)
Predictions: ESI (Economic Sentiment Indicator) and Unemployment
Model   ESI              Unemployment
LEN     9.253 (9.89%)    0.9275 (8.75%)
BEN     8.209 (8.77%)    0.9047 (8.52%)
Economic Sentiment Indicator
Unemployment
Deep linguistic features • Unigrams (8,912) (cameron) • Bigrams (33,206) (david__cameron) • POS (10,277) : Unigrams together with their part-of-speech (cameron/NNP) • NE (1,013) : Entities - Location, Person or Organisation (Person:David_Cameron) • Annotations (3,392) : Link entities to DBpedia e.g. political party (Org:Conservative_Party), office held (Office:Prime_minister)
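The two surface feature types can be extracted with a few lines of code. This is a minimal sketch, assuming lowercased whitespace tokenisation and the double-underscore joiner seen in the example `david__cameron`. The POS, entity and DBpedia annotation features would need a tagger and an entity linker, which are not shown.

```python
from collections import Counter


def ngram_features(text):
    """Count unigrams and bigrams, joining bigram tokens with '__'."""
    tokens = text.lower().split()
    bigrams = [f"{a}__{b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


features = ngram_features("David Cameron visits Brussels")
```

Each news summary becomes a sparse count vector over such features, which is the kind of input the regression models above operate on.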
Deep linguistic features
Features                                 ESI     Unempl.
Unigrams                                 8.21    1.27
Bigrams                                  9.66    1.61
Unigrams + Bigrams                       8.91    1.47
POS                                      7.87    1.14
Entities                                 9.59    1.45
POS + NE                                 8.09    1.12
NE + Annotations                         12.67   1.62
POS + NE + Annotations                   10.50   1.31
Unigrams + NE + Annotations              10.92   1.31
Unigrams + Bigrams + NE + Annotations    10.81   1.53