Discovering the multifaceted information hidden within large user-generated text streams


  1. Discovering the multifaceted information hidden within large user-generated text streams Daniel Preotiuc-Pietro daniel@dcs.shef.ac.uk 23.04.2014

  2. Context • vast increase in user generated content • Online Social Networks most time-consuming activity on Internet • multiple modalities: text, time, location, user info, images, etc. • social network structure • Challenges: • Engineering: data volume • Algorithmic: restricted information, grounded in context, streaming, noise

  3. Motivation Assumption: Text has different use conditioned on factors such as time, location, etc. Aim: Build models which incorporate these factors Tasks: • Supervised prediction applications • internal, external • Study the effect of these factors in text use • Improve performance of downstream applications

  4. Outline i. Introduction ii. Data processing iii. Temporal patterns iv. Text forecasting real-world outcomes v. Spatio-temporal clustering vi. User level properties

  5. TrendMiner project • 'Large scale, cross-lingual trend mining and summarization of real time media streams' • 6+4 organisations; we work with University of Southampton and SORA on machine learning • application to predicting political polls and aiding political analysts to make sense of social media data www.trendminer-project.eu

  6. Text Processing • Example tweet: RT @MediaScotland greeeat!!!lvly speech by cameron on scott's indy :) #indyref • Challenges illustrated: new conventions, lack of context, creative spellings, shortenings, unorthodox capitalisation, OOV words

  7. Processing Architecture • Fast: real time processing, Hadoop MapReduce (I/O bound), online and batch processing • Scalable: adding more machines • Modular: easy to add new modules • Pipeline: the user specifies his needs • Extensible: different sources of data (USMF format) • Data consistency: JSON format, append to ‘analysis’ • Reusable: open-source (ICWSM 2012)
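
The modular, append-only design listed on this slide can be illustrated with a minimal sketch. It assumes each message is one JSON object per line with a `text` field, and that every module writes its output under the `analysis` key. The module names (`tokenise`, `count_hashtags`) and the `run_pipeline` helper are hypothetical and are not taken from the released toolkit.

```python
import json
from typing import Callable, Dict, List

# Hypothetical module type: reads the message dict, returns one annotation.
Module = Callable[[dict], object]


def tokenise(message: dict) -> List[str]:
    # Whitespace tokenisation, used here only as a stand-in module.
    return message["text"].split()


def count_hashtags(message: dict) -> int:
    # Count tokens that start with '#'.
    return sum(1 for tok in message["text"].split() if tok.startswith("#"))


def run_pipeline(line: str, modules: Dict[str, Module]) -> str:
    """Run the user-selected modules on one JSON line.

    Original fields are left untouched; results are appended under 'analysis'.
    """
    message = json.loads(line)
    analysis = message.setdefault("analysis", {})
    for name, module in modules.items():
        analysis[name] = module(message)
    return json.dumps(message)


# The user specifies which modules to run, in order.
modules = {"tokens": tokenise, "hashtag_count": count_hashtags}
record = json.dumps({"text": "lvly speech by cameron #indyref"})
result = json.loads(run_pipeline(record, modules))
print(result["analysis"])
```

Because each module only adds a key under `analysis`, modules can be added or removed without changing the record format that downstream steps read.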

  8. Components

  9. Gaussian Processes Task: Forecast hashtag frequency in Social Media - identify and categorise complex temporal patterns (EMNLP 2013) Non-parametric Bayesian framework • kernelised • probabilistic formulation • propagation of uncertainty • exact posterior inference for regression • Non-parametric extension of Bayesian regression • very good results, but hardly used in NLP

  10. Gaussian Processes Define prior over functions Compute posterior (ACL 2014 Tutorial)
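
The two steps named on this slide (define a prior over functions, compute the posterior) can be sketched for GP regression with NumPy. This is a minimal illustration, assuming a zero-mean prior with a squared exponential (SE) kernel and fixed, hand-picked hyperparameters. The function names, the toy data and the hyperparameter values are assumptions, not the setup used in the talk.

```python
import numpy as np


def se_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior(x_train, y_train, x_test, noise=1e-2, **kernel_args):
    """Exact posterior mean and covariance of a zero-mean GP at x_test."""
    K = se_kernel(x_train, x_train, **kernel_args) + noise * np.eye(len(x_train))
    K_s = se_kernel(x_train, x_test, **kernel_args)
    K_ss = se_kernel(x_test, x_test, **kernel_args)
    L = np.linalg.cholesky(K)  # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, cov


# Toy data (made up): four observations, one test point inside the
# observed range and one far outside it.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.sin(x_train)
x_test = np.array([1.5, 10.0])
mean, cov = gp_posterior(x_train, y_train, x_test)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
print(mean, std)
```

Far from the training inputs the SE kernel carries almost no information, so the posterior mean returns to the prior mean and the predictive uncertainty grows back towards the prior variance.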

  11. Extrapolation

  12. Examples of time series #FYI #SNOW SE #FAIL #RAW

  13. Experimental results

  14. Experimental results Compared to Mean prediction

  15. Text classification Task: Assign the hashtag to a given tweet • Most frequent (MF) • Naive Bayes model (NB-E) • Naive Bayes with GP forecast as prior (NB-P)

      | Metric   | MF     | NB-E   | NB-P   |
      |----------|--------|--------|--------|
      | Match@1  | 7.28%  | 16.04% | 17.39% |
      | Match@5  | 19.90% | 29.51% | 31.91% |
      | Match@50 | 44.92% | 59.17% | 60.85% |
      | MRR      | 0.144  | 0.237  | 0.252  |
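
The difference between an empirical prior and a forecast prior in a Naive Bayes hashtag classifier can be sketched as follows. This is an illustration only: it assumes a multinomial model with add-one smoothing, and the tiny training set and both prior dictionaries are made-up values rather than GP forecasts from the talk.

```python
import math
from collections import Counter, defaultdict


def train_word_counts(tweets):
    """tweets: list of (tokens, hashtag). Returns per-hashtag word counts and the vocabulary."""
    counts = defaultdict(Counter)
    vocab = set()
    for tokens, tag in tweets:
        counts[tag].update(tokens)
        vocab.update(tokens)
    return counts, vocab


def rank_hashtags(tokens, counts, vocab, prior):
    """Rank hashtags by log prior plus add-one smoothed log likelihoods."""
    scores = {}
    for tag, word_counts in counts.items():
        total = sum(word_counts.values()) + len(vocab)
        score = math.log(prior[tag])
        for tok in tokens:
            score += math.log((word_counts[tok] + 1) / total)
        scores[tag] = score
    return sorted(scores, key=scores.get, reverse=True)


train = [
    (["snow", "cold"], "#snow"),
    (["snow", "today"], "#snow"),
    (["epic", "mistake"], "#fail"),
]
counts, vocab = train_word_counts(train)

# Empirical prior: hashtag frequency in the training data.
empirical_prior = {"#snow": 2 / 3, "#fail": 1 / 3}
# Forecast prior: stand-in for a predicted frequency in the next time interval.
forecast_prior = {"#snow": 0.1, "#fail": 0.9}

print(rank_hashtags(["today"], counts, vocab, empirical_prior))
print(rank_hashtags(["today"], counts, vocab, forecast_prior))
```

Only the prior changes between the two calls, so any change in the ranking comes from the forecast alone.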

  16. User behaviour Task: Predict venue check-in frequencies • Modelled using GPs • Compared to Mean [Figure: results per kernel: Linear, SE, PER, PS, Select]

  17. Individual user behaviour Task: Predict venue type of user check-in • highly periodic • compared to standard Markov predictors (WebScience 2013)

      | Method         | Accuracy |
      |----------------|----------|
      | Random         | 11.11%   |
      | M.Freq Categ.  | 35.21%   |
      | Markov-1       | 36.13%   |
      | Markov-2       | 34.21%   |
      | Daily period   | 38.92%   |
      | Weekly period  | 40.65%   |
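
For reference, a first-order Markov predictor of the kind used as a comparison here can be sketched in a few lines. The venue categories and the check-in sequence are invented for illustration, and falling back to the most frequent category for unseen states is an assumption.

```python
from collections import Counter, defaultdict


def fit_markov1(sequence):
    """Count transitions between consecutive venue categories."""
    transitions = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        transitions[prev][nxt] += 1
    return transitions


def predict_next(transitions, current, fallback):
    """Most likely next category given the current one; fallback if unseen."""
    if current in transitions:
        return transitions[current].most_common(1)[0][0]
    return fallback


# Made-up check-in history for one user.
checkins = ["Home", "Work", "Food", "Work", "Home", "Work", "Food", "Work"]
model = fit_markov1(checkins)
most_frequent = Counter(checkins).most_common(1)[0][0]

print(predict_next(model, "Home", most_frequent))
print(predict_next(model, "Gym", most_frequent))
```

A Markov-2 predictor would condition on the previous two categories instead of one, which needs more history per state.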

  18. Text based forecasting Task: predicting real world outcomes Aim: replace expensive polls with streaming text • predict political voting intention (not elections!) • based on social media (Twitter) text • strong baselines (last day, mean) • 2 different use cases (UK and Austria) • UK: 42k users, 60m tweets, 3 parties, 2 years (ACL 2013)
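
The two baselines mentioned on this slide (last day, mean) and the RMSE metric can be sketched as below. The poll values are invented, and defining the mean baseline as the average of all earlier values is an assumption; the talk does not specify how it was computed.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def last_value_baseline(series):
    """Predict each value (from the second onward) as the value just before it."""
    return series[:-1]


def mean_baseline(series):
    """Predict each value (from the second onward) as the mean of all earlier values."""
    return [sum(series[:t]) / t for t in range(1, len(series))]


# Made-up voting intention percentages for one party.
polls = [30.0, 32.0, 31.0, 35.0]
targets = polls[1:]

print(rmse(last_value_baseline(polls), targets))
print(rmse(mean_baseline(polls), targets))
```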

  19. Linear regression $\mathbf{w}\,\mathbf{x}_t + \beta = y_t$

  20. Linear regression $\mathbf{w}, \beta = \operatorname{argmin} \sum_{i=1}^{n} (\mathbf{w}\,\mathbf{x}_i + \beta - y_i)^2$

  21. Linear regression $\mathbf{w}, \beta = \operatorname{argmin} \sum_{i=1}^{n} (\mathbf{w}\,\mathbf{x}_i + \beta - y_i)^2 + \psi_{el}(\mathbf{w}, \rho)$ LEN – Elastic Net
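
To make the regularised objective concrete, the sketch below evaluates a squared-error loss plus an elastic net penalty for given weights. The slide does not spell out the penalty, so the parameterisation used here (an overall strength `lam` and an L1/L2 mixing weight `alpha`) is an assumption borrowed from common practice, and the function name is hypothetical.

```python
import numpy as np


def len_objective(w, beta, X, y, lam, alpha):
    """Squared error plus an elastic net penalty on w.

    Assumed penalty: lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).
    """
    residuals = X @ w + beta - y
    penalty = lam * (alpha * np.abs(w).sum() + 0.5 * (1.0 - alpha) * (w @ w))
    return float(residuals @ residuals + penalty)


# Toy example where the weights fit the data exactly, so only the penalty remains.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])
value = len_objective(w, 0.0, X, y, lam=0.1, alpha=0.5)
print(value)
```

The L1 term pushes individual weights to exactly zero, while the L2 term keeps correlated features from being dropped arbitrarily.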

  22. Bilinear regression • main issue is noise: many non-informative users • we look for a model of sparse words & sparse users • bi-convex optimisation problem • solved by alternatively fixing each set of weights and iterating until convergence

  23. Bilinear regression $\mathbf{u}\,\mathbf{X}_t\,\mathbf{w}^{\mathsf{T}} + \beta = y_t$

  24. Bilinear regression $\mathbf{w}, \mathbf{u}, \beta = \operatorname{argmin} \sum_{i=1}^{n} (\mathbf{u}\,\mathbf{X}_i\,\mathbf{w}^{\mathsf{T}} + \beta - y_i)^2$

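  25. Bilinear regression $\mathbf{w}, \mathbf{u}, \beta = \operatorname{argmin} \sum_{i=1}^{n} (\mathbf{u}\,\mathbf{X}_i\,\mathbf{w}^{\mathsf{T}} + \beta - y_i)^2 + \psi_{el}(\mathbf{w}, \rho_1) + \psi_{el}(\mathbf{u}, \rho_2)$ BEN – Bilinear Elastic Net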
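
The alternating scheme for the bi-convex problem (fix one set of weights, solve for the other, repeat) can be sketched with NumPy. To keep the example short and dependency-free, each sub-problem is solved with ridge regression rather than the elastic net named on the slide; that substitution, the array layout (time steps × users × words), the fixed iteration count and all function names are assumptions.

```python
import numpy as np


def ridge_fit(Z, y, penalty):
    """Ridge regression with an unpenalised intercept; returns (weights, intercept)."""
    Z_mean, y_mean = Z.mean(axis=0), y.mean()
    Zc, yc = Z - Z_mean, y - y_mean
    weights = np.linalg.solve(Zc.T @ Zc + penalty * np.eye(Z.shape[1]), Zc.T @ yc)
    return weights, y_mean - Z_mean @ weights


def fit_bilinear(X, y, penalty=1e-3, iterations=50):
    """X has shape (n, users, words). Alternate between word and user weights."""
    n, n_users, n_words = X.shape
    u = np.ones(n_users) / n_users  # start from uniform user weights
    for _ in range(iterations):
        # Fix u: the model is linear in w with features u X_i.
        w, beta = ridge_fit(np.einsum("p,npm->nm", u, X), y, penalty)
        # Fix w: the model is linear in u with features X_i w.
        u, beta = ridge_fit(np.einsum("npm,m->np", X, w), y, penalty)
    return u, w, beta


def predict(X, u, w, beta):
    """Evaluate u X_i w^T + beta for every time step i."""
    return np.einsum("p,npm,m->n", u, X, w) + beta


# Synthetic data generated from a known bilinear model (made-up values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3, 4))
y = predict(X, np.array([1.0, 0.0, -0.5]), np.array([0.5, 0.0, 0.0, 2.0]), 0.5)

u, w, beta = fit_bilinear(X, y)
error = np.sqrt(np.mean((predict(X, u, w, beta) - y) ** 2))
print(error, np.std(y))
```

Each sub-problem is an ordinary convex regression, which is what makes the alternating approach practical even though the joint problem is not convex. Note that `u` and `w` are only identified up to a shared scale factor.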

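  26. Bilinear regression $\mathbf{w}_t, \mathbf{u}_t, \beta = \operatorname{argmin} \sum_{i=1}^{n} (\mathbf{u}_t\,\mathbf{X}_i\,\mathbf{w}_t^{\mathsf{T}} + \beta - y_{ti})^2 + \psi_{el}(\mathbf{w}_t, \rho_1) + \psi_{el}(\mathbf{u}_t, \rho_2)$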

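  29. Bilinear regression $\mathbf{w}, \mathbf{u}, \beta = \operatorname{argmin} \sum_{t=1}^{\tau} \sum_{i=1}^{n} (\mathbf{u}_t\,\mathbf{X}_i\,\mathbf{w}_t^{\mathsf{T}} + \beta - y_{ti})^2 + \psi_{\ell_1\ell_2}(\mathbf{w}, \rho_1) + \psi_{\ell_1\ell_2}(\mathbf{u}, \rho_2)$ BGL – Bilinear Group LASSO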
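
The group penalty in this objective can be illustrated by computing an ℓ1/ℓ2 mixed norm of a weight matrix. The sketch assumes rows correspond to features (words or users) and columns to tasks, so each row forms one group; the slide does not state this layout, and the function name is hypothetical.

```python
import numpy as np


def l1_l2_norm(W):
    """Sum over rows of the Euclidean norm of each row (rows are groups)."""
    return float(np.linalg.norm(W, axis=1).sum())


# Three features, two tasks (made-up weights).
W = np.array([
    [3.0, 4.0],   # used by both tasks: contributes 5
    [0.0, 0.0],   # dropped for every task: contributes 0
    [1.0, 0.0],   # used by one task: contributes 1
])
print(l1_l2_norm(W))
```

Since a row only stops contributing when all of its entries are zero, this penalty favours selecting or discarding a feature jointly across tasks.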

  30. Quantitative results • Root Mean Squared Error (RMSE) forecasting results over 50 testing polls (in VI %) [Figure panels: Polls, BEN, BGL]

  31. Quantitative results

      | Party | Tweet | Score | Author |
      |-------|-------|-------|--------|
      | CON | PM in friendly chat with top EU mate, Sweden’s Fredrik Reinfeldt, before family photo | 1.334 | Journalist |
      | CON | Have Liberal Democrats broken electoral rules? Blog on Labour complaint to cabinet secretary | -0.991 | Journalist |
      | LAB | Blog Post Liverpool: City of Radicals Website now Live <link> #liverpool #art | 1.954 | Art Fanzine |
      | LAB | I am so pleased to head Paul Savage who worked for the Labour group has been Appointed the Marketing manager for the baths hall GREAT NEWS | -0.552 | Politician (Labour) |
      | LBD | RT @user: Must be awful for TV bosses to keep getting knocked back by all the women they ask to host election night (via @user) | 0.874 | LibDem MP |
      | LBD | Blog Post Liverpool: City of Radicals 2011 – More Details Announced #liverpool #art | -0.521 | Art Fanzine |

  32. User features • The real-world outcome and users share: i. region info: London (L), South England (S), Midlands & Wales (MW), North (N), Scotland (Sc) - observed ii. gender: Male (M), Female (F) - inferred using statistical text-based classifier iii. age: 18-24, 25-39, 40-59, 60+ - unknown

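  33. Recap: Bilinear regression $\mathbf{w}, \mathbf{u}, \beta = \operatorname{argmin} \sum_{t=1}^{\tau} \sum_{i=1}^{n} (\mathbf{u}_t\,\mathbf{X}_i\,\mathbf{w}_t^{\mathsf{T}} + \beta - y_{ti})^2 + \psi_{\ell_1\ell_2}(\mathbf{w}, \rho_1) + \psi_{\ell_1\ell_2}(\mathbf{u}, \rho_2)$ BGL – Bilinear Group LASSO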

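  34. Region & Demographics $\mathbf{w}, \mathbf{u}, \beta = \operatorname{argmin} \sum_{t=1}^{\tau} \sum_{r} \sum_{i=1}^{n} (\mathbf{u}_{tr}\,\mathbf{X}_{ir}\,\mathbf{w}_{tr}^{\mathsf{T}} + \beta_{tr} - y_{tir})^2 + \sum_{r} \psi_{\ell_1\ell_2}(\mathbf{w}_r, \rho_1) + \psi_{\ell_1\ell_2}(\mathbf{w}_t, \rho_1) + \psi_{\ell_1\ell_2}(\mathbf{u}_r, \rho_2)$ BGGR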

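  35. Region & Demographics

      Regional model

      | Model      | S   | L   | MW  | N   | Sc  | μ   |
      |------------|-----|-----|-----|-----|-----|-----|
      | $B_{\mu}$  | 2.9 | 3.9 | 3.2 | 3.2 | 3.8 | 3.4 |
      | $B_{last}$ | 3.0 | 4.9 | 4.3 | 4.0 | 5.3 | 4.3 |
      | BGGR       | 2.6 | 3.9 | 3.2 | 3.0 | 3.7 | 3.3 |

      Gender model

      | Model      | M   | F   | μ   |
      |------------|-----|-----|-----|
      | $B_{\mu}$  | 2.6 | 2.1 | 2.4 |
      | $B_{last}$ | 2.6 | 2.4 | 2.5 |
      | BGGR       | 2.1 | 2.1 | 2.1 |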

  36. Region & Demographics London Predictions Female Predictions

  37. Region & Demographics Conservatives, Positive London

  38. NewsSummaries dataset Task: Predict socioeconomic EU indicators Dataset: • News summaries from Open Europe think tank • Daily summaries of EU and member states related news together with their news source • Feb 2006 – Nov 2013; 1,913 days; 94 months • 296 news outlets (with >10 summaries) • Features: unigrams + bigrams (LACSS 2014)

  39. Predictions [Figure panels: Unemployment; ESI (Economic Sentiment Indicator)]

      | Model | ESI           | Unemployment   |
      |-------|---------------|----------------|
      | LEN   | 9.253 (9.89%) | 0.9275 (8.75%) |
      | BEN   | 8.209 (8.77%) | 0.9047 (8.52%) |

  40. Economic Sentiment Indicator

  41. Unemployment

  42. Deep linguistic features • Unigrams (8,912) (cameron) • Bigrams (33,206) (david__cameron) • POS (10,277) : Unigrams together with their part-of-speech (cameron/NNP) • NE (1,013) : Entities - Location, Person or Organisation (Person:David_Cameron) • Annotations (3,392) : Link entities to DBpedia e.g. political party (Org:Conservative_Party), office held (Office:Prime_minister)
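
The unigram and bigram feature names shown on this slide (`cameron`, `david__cameron`) suggest a simple construction, sketched below. Lowercasing and whitespace tokenisation are assumptions made for brevity, and the example sentence is invented; POS, entity and DBpedia features would need a tagger and entity linker and are not shown.

```python
from collections import Counter


def ngram_features(text):
    """Return unigram and bigram counts; bigrams are joined with a double underscore."""
    tokens = text.lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter("__".join(pair) for pair in zip(tokens, tokens[1:]))
    return unigrams, bigrams


unigrams, bigrams = ngram_features("David Cameron meets Fredrik Reinfeldt")
print(unigrams)
print(bigrams)
```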

  43. Deep linguistic features

      | Features                              | ESI   | Unempl. |
      |---------------------------------------|-------|---------|
      | Unigrams                              | 8.21  | 1.27    |
      | Bigrams                               | 9.66  | 1.61    |
      | Unigrams + Bigrams                    | 8.91  | 1.47    |
      | POS                                   | 7.87  | 1.14    |
      | Entities                              | 9.59  | 1.45    |
      | POS + NE                              | 8.09  | 1.12    |
      | NE + Annotations                      | 12.67 | 1.62    |
      | POS + NE + Annotations                | 10.50 | 1.31    |
      | Unigrams + NE + Annotations           | 10.92 | 1.31    |
      | Unigrams + Bigrams + NE + Annotations | 10.81 | 1.53    |
