Mining the social web: A series of statistical NLP case studies
Vasileios Lampos
Department of Computer Science, University College London
May 2014
v.lampos@ucl.ac.uk
Slides: http://bit.ly/1v3Jeiy

Key assumptions about social media
• a significant sample of the population uses them
• a significant amount of the published content is geo-located
• this content reflects collective aspects of real life (opinions, events, phenomena)
  ◦ usually in a real-time relationship
• it is easy (?) to collect, store and process this content
• and everyone seems to know how to use this “big data”

Twitter in one slide
• 140 characters per published status (tweet)
• users can follow and be followed
• embedded usage of topics (#rbnews, #inception in statistics)
• retweets (RT), @replies, @mentions, favourites
• real-time nature
• biased user demographics (13-15% of the UK’s population is now on Twitter)

In this talk
Ways of harnessing social media information...
• to extract simplified collective mood patterns (Lansdall et al., 2012)
• to nowcast phenomena, such as an infectious disease or rainfall rates (Lampos, Cristianini, 2010 & 2012)
• to model voting intention (Lampos et al., 2013)
• to understand characteristics related to user impact (Lampos et al., 2014)

Proof of concept and a little more: extracting collective mood patterns

Time series of joy and anger based on UK tweets
[Figure, top panel: 933-day time series for joy in Twitter content (raw joy signal and 14-day smoothed joy). Axes: normalised emotional valence vs date (Jul 09 to Jan 12). Annotated events: Christmas, Valentine's Day, Easter, Halloween, the royal wedding, the budget cuts and the riots. Example joy terms: happy, enjoy, love, glad, joyful, elated...]
[Figure, bottom panel: difference in the mean derivative of anger and fear vs date (Jul 09 to Jan 12), with the dates of the budget cuts and the riots marked.]
(Lansdall et al., 2012), (Strapparava, Valitutti, 2004) → WordNet Affect

Mood projections
Projections of 4-dimensional mood score signals (joy, sadness, anger and fear) on their top-2 principal components (2011 Twitter data)
[Figure, left panel: days of the week (Monday to Sunday). Axes: 1st principal component vs 2nd principal component.]
[Figure, right panel: days in 2011, each labelled by day number (1 to 365). Axes: 1st principal component vs 2nd principal component.]
Highlighted days: New Year (1), Valentine’s (45), Christmas Eve (358), New Year’s Eve (365), O.B. Laden’s death (122), Winehouse’s death & Breivik (204), UK riots (221)
(Lampos, 2012), (Strapparava, Valitutti, 2004) → WordNet Affect
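
The projection on the mood slide can be illustrated with a small principal component analysis. The sketch below is not the code behind the slides: it assumes each day is a row of four mood scores (joy, sadness, anger, fear), uses made-up numbers, and finds the top-2 components by power iteration with deflation, in plain Python. The function names and the `mood` data are hypothetical.

```python
import math

def top_principal_components(rows, k=2, iters=500):
    """Return the column means and the top-k principal directions of `rows`."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _ in range(k):
        # Fixed start vector keeps the run deterministic (assumes it is not
        # exactly orthogonal to the leading eigenvector).
        v = [1.0 / math.sqrt(d)] * d
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue of the direction just found.
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
        components.append(v)
        # Deflation: remove the found direction so the next pass finds the next one.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return mean, components

def project(rows, mean, components):
    """Coordinates of each centred row along each principal direction."""
    return [[sum((r[j] - mean[j]) * c[j] for j in range(len(mean)))
             for c in components] for r in rows]

# Hypothetical daily mood scores: [joy, sadness, anger, fear].
mood = [
    [0.9, 0.1, 0.2, 0.1],
    [0.8, 0.2, 0.1, 0.2],
    [0.2, 0.7, 0.6, 0.5],
    [0.1, 0.8, 0.7, 0.6],
    [0.5, 0.4, 0.3, 0.3],
    [0.4, 0.5, 0.9, 0.2],
]
mean, comps = top_principal_components(mood, k=2)
points_2d = project(mood, mean, comps)  # one (PC1, PC2) pair per day
```

Plotting `points_2d` with each point labelled by its day would give a picture of the same kind as the "Days in 2011" panel, where unusual days sit far from the main cluster.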

Supervised learning
Primary outcomes

Regression basics: Ordinary Least Squares
• observations $\mathbf{x}_i \in \mathbb{R}^m$, $i \in \{1, ..., n\}$ → $\mathbf{X}$
• responses $y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$ → $\mathbf{y}$
• weights, bias $w_j, \beta \in \mathbb{R}$, $j \in \{1, ..., m\}$ → $\mathbf{w}_* = [\mathbf{w}; \beta]$
Ordinary Least Squares (OLS)
$$\operatorname*{argmin}_{\mathbf{w}_*} \|\mathbf{X}_*\mathbf{w}_* - \mathbf{y}\|^2_{\ell_2} \;\Rightarrow\; \mathbf{w}_* = \left(\mathbf{X}_*^{\mathsf{T}}\mathbf{X}_*\right)^{-1}\mathbf{X}_*^{\mathsf{T}}\mathbf{y}$$
Why not?
− $\mathbf{X}_*^{\mathsf{T}}\mathbf{X}_*$ may be singular (thus difficult to invert)
− high-dimensional models are difficult to interpret
− unsatisfactory prediction accuracy (estimates have large variance)
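
The OLS closed form follows from setting the gradient of the squared error to zero. The short derivation below assumes that $\mathbf{X}_*$ denotes $\mathbf{X}$ with a constant column of ones appended, so that the bias $\beta$ is absorbed into $\mathbf{w}_*$. The slide does not spell this out, so treat it as an assumption.

```latex
\begin{align}
J(\mathbf{w}_*) &= \|\mathbf{X}_*\mathbf{w}_* - \mathbf{y}\|^2_{\ell_2} \\
\nabla_{\mathbf{w}_*} J &= 2\,\mathbf{X}_*^{\mathsf{T}}\left(\mathbf{X}_*\mathbf{w}_* - \mathbf{y}\right) = \mathbf{0} \\
\mathbf{X}_*^{\mathsf{T}}\mathbf{X}_*\,\mathbf{w}_* &= \mathbf{X}_*^{\mathsf{T}}\mathbf{y} \\
\mathbf{w}_* &= \left(\mathbf{X}_*^{\mathsf{T}}\mathbf{X}_*\right)^{-1}\mathbf{X}_*^{\mathsf{T}}\mathbf{y}
\end{align}
```

The last step is only valid when $\mathbf{X}_*^{\mathsf{T}}\mathbf{X}_*$ is invertible, which is the first objection listed on the slide.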

Regression basics: Ridge Regression
• observations $\mathbf{x}_i \in \mathbb{R}^m$, $i \in \{1, ..., n\}$ → $\mathbf{X}$
• responses $y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$ → $\mathbf{y}$
• weights, bias $w_j, \beta \in \mathbb{R}$, $j \in \{1, ..., m\}$ → $\mathbf{w}_* = [\mathbf{w}; \beta]$
Ridge Regression (RR)
$$\operatorname*{argmin}_{\mathbf{w}_*} \left\{ \|\mathbf{X}_*\mathbf{w}_* - \mathbf{y}\|^2_{\ell_2} + \lambda \|\mathbf{w}\|^2_{\ell_2} \right\}$$
+ size constraint on the weight coefficients (regularisation) → resolves problems caused by collinear variables
+ fewer degrees of freedom, better predictive accuracy than OLS
− does not perform feature selection (all coefficients remain nonzero)
(Hoerl, Kennard, 1970)
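
To make the collinearity point concrete, here is a minimal sketch of ridge regression solved through its normal equations, $(\mathbf{X}_*^{\mathsf{T}}\mathbf{X}_* + \lambda\mathbf{D})\,\mathbf{w}_* = \mathbf{X}_*^{\mathsf{T}}\mathbf{y}$, where $\mathbf{D}$ is the identity with a zero in the bias position. It assumes $\mathbf{X}_*$ is $\mathbf{X}$ with a column of ones appended. The helper names and the toy data are hypothetical, and the small Gaussian elimination solver stands in for a linear algebra library.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def ridge(X, y, lam):
    """Return [w_1, ..., w_m, bias]; the bias is left unpenalised."""
    Xs = [row + [1.0] for row in X]  # append the constant column for the bias
    p = len(Xs[0])
    A = [[sum(r[a] * r[b] for r in Xs) for b in range(p)] for a in range(p)]
    for j in range(p - 1):           # add lambda to the weight entries only
        A[j][j] += lam
    rhs = [sum(r[a] * t for r, t in zip(Xs, y)) for a in range(p)]
    return solve(A, rhs)

# Toy data with perfectly collinear features: second column = 2 * first.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [3.0, 6.0, 9.0, 12.0]

try:
    ridge(X, y, lam=0.0)   # lam = 0 is plain OLS
    ols_failed = False
except ValueError:
    ols_failed = True      # the normal equations are singular

w1, w2, bias = ridge(X, y, lam=1.0)  # regularisation makes the system solvable
```

With `lam=0.0` the system is singular and the solver raises, while any positive `lam` yields a unique solution. Both weights stay nonzero, which matches the slide's remark that ridge does not select features.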

Regression basics: Lasso
• observations $\mathbf{x}_i \in \mathbb{R}^m$, $i \in \{1, ..., n\}$ → $\mathbf{X}$
• responses $y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$ → $\mathbf{y}$
• weights, bias $w_j, \beta \in \mathbb{R}$, $j \in \{1, ..., m\}$ → $\mathbf{w}_* = [\mathbf{w}; \beta]$
$\ell_1$-norm regularisation or lasso (Tibshirani, 1996)
$$\operatorname*{argmin}_{\mathbf{w}_*} \left\{ \|\mathbf{X}_*\mathbf{w}_* - \mathbf{y}\|^2_{\ell_2} + \lambda \|\mathbf{w}\|_{\ell_1} \right\}$$
− no closed-form solution; it is a quadratic programming problem
+ Least Angle Regression (LAR) explores the entire regularisation path (Efron et al., 2004)
+ sparse $\mathbf{w}$, interpretability, better performance (Hastie et al., 2009)
− if $m > n$, at most $n$ variables can be selected
− strongly correlated predictors → model-inconsistent (Zhao, Yu, 2009)
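
Since the lasso has no closed form, it is solved iteratively. The sketch below uses cyclic coordinate descent with soft-thresholding, which is a different solver from the LAR algorithm cited on the slide and is shown only because it is short. It minimises the objective above with an unpenalised bias, handled by centring the data. Function names and the toy data are hypothetical.

```python
def soft_threshold(a, t):
    """Shrink a towards zero by t; values within [-t, t] become exactly 0."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0

def lasso_cd(X, y, lam, sweeps=100):
    """Minimise ||X w + bias - y||^2 + lam * ||w||_1 by coordinate descent."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    # Centring removes the bias from the optimisation; it is recovered at the end.
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [t - y_mean for t in y]
    w = [0.0] * m
    for _ in range(sweeps):
        for j in range(m):
            # rho: inner product of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                partial = sum(Xc[i][k] * w[k] for k in range(m) if k != j)
                rho += Xc[i][j] * (yc[i] - partial)
            z = sum(Xc[i][j] ** 2 for i in range(n))
            # The threshold is lam / 2 because the squared loss carries no 1/2 factor.
            w[j] = soft_threshold(rho, lam / 2.0) / z if z > 0.0 else 0.0
    bias = y_mean - sum(w[j] * x_mean[j] for j in range(m))
    return w, bias

# Toy data: y = 2 * x0 + 1, while x1 is an irrelevant feature.
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.8], [4.0, -0.6], [5.0, 0.1]]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
w, bias = lasso_cd(X, y, lam=1.0)
```

On this toy input the weight of the irrelevant feature is set exactly to zero, while the weight of the relevant one is shrunk slightly below 2. This is the sparsity effect listed on the slide.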

Lasso for text regression
• n-gram frequencies $\mathbf{x}_i \in \mathbb{R}^m$, $i \in \{1, ..., n\}$ → $\mathbf{X}$
• target phenomenon $y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$ → $\mathbf{y}$
• weights, bias $w_j, \beta \in \mathbb{R}$, $j \in \{1, ..., m\}$ → $\mathbf{w}_* = [\mathbf{w}; \beta]$
$\ell_1$-norm regularisation or lasso
$$\operatorname*{argmin}_{\mathbf{w}_*} \left\{ \|\mathbf{X}_*\mathbf{w}_* - \mathbf{y}\|^2_{\ell_2} + \lambda \|\mathbf{w}\|_{\ell_1} \right\}$$

Nowcasting ILI rates from Twitter (1/2)
Assumptions
• Twitter users post about their health condition
• We can turn this information into an influenza-like illness (ILI) rate
Is there a signal in the data?
• 41 illness-related keyphrases (e.g. flu, fever, sore throat, headache)
• z-scored cumulative frequency vs z-scored official ILI rates
[Figure: Twitter’s flu score vs HPA’s flu rate for region D (England & Wales), both as z-scores, against day number (2009); r = .856]
(Lampos, Cristianini, 2010)
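
The signal check on this slide comes down to z-scoring two daily series and measuring their linear correlation. A minimal sketch follows; the two series are made-up numbers, not the data behind the reported r = .856, and the function names are hypothetical.

```python
import math

def z_scores(values):
    """Standardise a series to zero mean and unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]

def pearson_r(a, b):
    """Pearson correlation, computed as the mean product of z-scores."""
    za, zb = z_scores(a), z_scores(b)
    return sum(x * y for x, y in zip(za, zb)) / (len(a) - 1)

# Hypothetical daily values: keyphrase frequency on Twitter and an official ILI rate.
twitter_flu_score = [0.010, 0.012, 0.020, 0.035, 0.030, 0.018]
official_ili_rate = [12.0, 15.0, 28.0, 50.0, 41.0, 22.0]

r = pearson_r(twitter_flu_score, official_ili_rate)
```

Because both series are standardised first, their different units (a frequency and a rate) do not affect the comparison.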

Nowcasting ILI rates from Twitter (2/2)
• create a pool of unigram features by indexing all words in relevant web pages (Wikipedia, NHS pages)
• remove stop-words and apply Porter stemming
• automatic unigram selection and weighting via lasso
Selected unigrams
‘unwel’, ‘temperatur’, ‘headach’, ‘appetit’, ‘symptom’, ‘diarrhoea’, ‘muscl’, ‘feel’, ‘flu’, ‘cough’, ‘nose’, ‘vomit’, ‘diseas’, ‘sore’, ‘throat’, ‘fever’, ‘ach’, ‘runni’, ‘sick’, ‘ill’, ...
[Figure: HPA flu rate vs inferred flu rate for England & Wales against day number (2009); r = .968]
(Lampos, Cristianini, 2010)
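
The first two steps of this pipeline can be sketched as follows. This is an illustration under several assumptions, not the original implementation: the stop-word list is a tiny stand-in, Porter stemming is omitted (a real stemmer would be applied inside `tokenise`), and daily frequencies are normalised by the number of tweets, which the slide does not specify. All names and texts are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "to", "in", "or", "with"}

def tokenise(text):
    """Lower-case and split into alphabetic tokens (no stemming in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())

def build_pool(reference_texts):
    """Candidate unigrams: every non-stop-word seen in the reference pages."""
    pool = set()
    for text in reference_texts:
        pool.update(t for t in tokenise(text) if t not in STOP_WORDS)
    return sorted(pool)

def daily_frequencies(tweets, pool):
    """One feature vector per day: occurrences of each candidate per tweet."""
    counts = Counter(t for tweet in tweets for t in tokenise(tweet))
    return [counts[term] / len(tweets) for term in pool]

# Hypothetical reference page and one day's tweets.
pages = ["Flu symptoms: fever, cough and a sore throat."]
pool = build_pool(pages)

tweets = [
    "Stuck in bed with flu and a fever",
    "sore throat again",
    "great match tonight",
]
x_day = daily_frequencies(tweets, pool)
```

Stacking one such vector per day gives the frequency matrix $\mathbf{X}$ on which the lasso then selects and weights unigrams.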