
Big Data (and official statistics): Piet Daas and Mark van der Loo, Statistics Netherlands (PDF document)



  1. Big Data (and official statistics)
     Piet Daas and Mark van der Loo*, Statistics Netherlands
     * With contributions of: Edwin de Jonge and Paul van den Hurk
     MSIS 2013, April 25, Paris
     Overview
     • What’s Big Data?
     • Definition and the 3 V’s
     • Can Big Data be used for official statistics?
     • Examples from Statistics Netherlands
     • Future challenges
     • What has to change?

  2. Data, data everywhere!
     What is Big Data?
     • According to a group of experts, Big Data are data sources that can be (generally) described as: “high volume, velocity and variety of data that demand cost-effective, innovative forms of processing for enhanced insight and decision making.”
     • According to a user: “Data so big that it becomes awkward to work with”

  3. The 3 most important characteristics of Big Data
     • Amount
     • Complexity
     • Rapid availability
     (other diagram labels: Unstructured data, Text)
     Big Data case studies
     Can Big Data be used for official statistics? Examples from Statistics Netherlands:
     1. Traffic loop detection data (100 million records/day)
        • Traffic & transport statistics
     2. Mobile phone data (35 million records/day)
        • Daytime population, tourism
     3. Dutch social media messages (1~2 million messages/day)
        • Topics and sentiment

  4. 1. Traffic loop detection data
     • Traffic ‘loops’
     • Every minute (24/7) the number of passing vehicles is counted by >10,000 road sensors & cameras in the Netherlands
     • Total vehicles and in different length classes
     • Interesting source to produce traffic and transport statistics (and more)
     • Huge amounts of data, about 100 million records a day
     [Figure: Locations]
     Number of detected vehicles on a single day, by all loops
     [Figure: Total = ~295 million]

  5. Traffic loop detection activity (only first 10 min.)
     [Figure]
     Correct for missing data
     • ‘Corrected’ data (for blocks of 5 min)
     • Before: Total = ~295 million
     • After: Total = ~330 million (+12%)
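The slides say only that the correction works on blocks of 5 minutes, not how it is done. As a minimal sketch of one possible approach, assume (hypothetically) that a missing minute count is filled with the mean of the minutes that were observed in the same block. The function name `fill_block` and the sample counts are illustrative and do not come from the presentation.

```python
from statistics import mean
from typing import List, Optional, Sequence


def fill_block(counts: Sequence[Optional[int]]) -> List[float]:
    """Fill missing minute counts (None) within one 5-minute block.

    Assumption (not from the slides): a missing minute is replaced by the
    mean of the observed minutes in the same block. A block without any
    observation is filled with zeros.
    """
    observed = [c for c in counts if c is not None]
    fill = float(mean(observed)) if observed else 0.0
    return [float(c) if c is not None else fill for c in counts]


# Hypothetical block from a single sensor: minutes 2 and 4 were not reported.
block = [12, None, 10, None, 14]
corrected = fill_block(block)

raw_total = sum(c for c in block if c is not None)
corrected_total = sum(corrected)
print(raw_total, corrected_total)
```

For the totals on the slide, the increase from ~295 million to ~330 million is (330 - 295) / 295 ≈ 11.9%, which matches the rounded +12%.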

  6. For different vehicle lengths
     • 1 category: Total
     • 3 categories: Total; <= 5.6 m; > 5.6 & <= 12.2 m; > 12.2 m
     • 5 categories: Total; > 1.85 & <= 2.4 m; > 2.4 & <= 5.6 m; > 5.6 & <= 11.5 m; > 11.5 & <= 12.2 m; > 12.2 m
     Classes used:
     • Small vehicles: <= 5.6 m
     • Medium sized vehicles: > 5.6 m & <= 12.2 m
     • Large vehicles: > 12.2 m
     Small vehicles: ~75% of total
     [Figure]
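The three-class scheme above can be sketched as a small classifier. The thresholds (5.6 m and 12.2 m) are taken from the slide; the function names and the sample lengths are hypothetical and only illustrate how shares per class could be computed.

```python
from collections import Counter
from typing import Dict, Iterable

SMALL_MAX_M = 5.6    # upper bound for small vehicles (from the slide)
MEDIUM_MAX_M = 12.2  # upper bound for medium sized vehicles (from the slide)


def length_class(length_m: float) -> str:
    """Map a vehicle length in metres to small, medium or large."""
    if length_m <= SMALL_MAX_M:
        return "small"
    if length_m <= MEDIUM_MAX_M:
        return "medium"
    return "large"


def class_shares(lengths: Iterable[float]) -> Dict[str, float]:
    """Return the share of each length class among the given vehicles."""
    counts = Counter(length_class(x) for x in lengths)
    total = sum(counts.values())
    names = ("small", "medium", "large")
    if total == 0:
        return {name: 0.0 for name in names}
    return {name: counts[name] / total for name in names}


# Hypothetical vehicle lengths in metres, not data from the presentation.
sample_lengths = [4.2, 4.5, 3.9, 5.6, 7.0, 12.2, 16.5, 4.8]
shares = class_shares(sample_lengths)
print(shares)
```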

  7. [Figure: Small & medium vehicles]
     [Figure: Small, medium & large vehicles]

  8. 2. Mobile phone data
     • Nearly every person in the Netherlands has a mobile phone
     • Carried on them and almost always switched on!
     • An increasing number of people have a smartphone
     • Ideal source of information
     • Use mobile phone data of mobile phone companies for:
       • Travel behaviour (‘daytime’ population)
       • Tourism (new phones that register to the network)
       • Crowd info (for example during events)
     Travel behaviour of mobile phones
     Mobility of very active mobile phone users
     - during a 14-day period
     - data of a single mobile phone company
     Based on:
     - Call and text activity multiple times a day
     - Location based on phone masts
     Clearly selective:
     - Includes major cities
     - But the North and South-east of the country much less

  9. 3. Social media messages
     • Dutch are very active on social media platforms
     • Almost always carried and almost always switched on
     • More and more people have a smartphone!
     • Possible source of information for:
       • Which topics are current:
       • Number of messages and the sentiment about them
     • To be used as a measuring instrument for:
     Map by Eric Fischer (via Fast Company)
     3. Social media messages
     • Dutch are very active on social media platforms
     • Potential information source for:
       • Topics discussed and sentiment about these topics (quickly available!) and probably more?
     • Investigate it to obtain an answer on its potential use
     3a. Content
     - Collected Dutch Twitter messages for study: ‘selection’ of 12 million
     3b. Sentiment
     - Sentiment in Dutch social media messages: ‘all’ ~2 billion

  10. Social media: Dutch Twitter topics
     [Figure: shares of topics in 12 million messages; topic labels not recoverable: 3%, 7%, 3%, 10%, 7%, 3%, 3%, 5%, 46%]
     Sentiment in social media
     • Access to Coosto database
       • ~2 billion publicly available messages
       • Twitter, Facebook, Hyves, Webfora, Blogs etc.
     • Sentiment of each message
       • Positive, negative or neutral
     • Interesting finding
       • Looked at so-called ‘Mood of the nation’ compared to Consumer confidence of Statistics Netherlands

  11. Consumer confidence, survey data
     [Figure: Sentiment towards the economic climate, (pos - neg) as % of total; ~1000 respondents/month]
     Sentiment in social media messages
     [Figure: Sentiment towards the economic climate & social media message sentiment, (pos - neg) as % of total; Corr: 0.88; ~25 million messages/month]
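Both series on these slides are expressed as (pos - neg) as % of total, and the slide reports a correlation of 0.88 between them. A minimal sketch of the two calculations follows, assuming a plain Pearson correlation (the slides do not say which coefficient was used). The function names and the monthly values are hypothetical and are not the data behind the reported 0.88.

```python
from math import sqrt
from typing import Sequence


def net_sentiment(pos: int, neg: int, total: int) -> float:
    """Net sentiment: (positive - negative) as a percentage of the total."""
    return 100.0 * (pos - neg) / total


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical monthly counts of (positive, negative, total) messages.
monthly_messages = [(40, 60, 250), (45, 55, 250), (50, 50, 250), (48, 62, 250)]
social_sentiment = [net_sentiment(p, n, t) for p, n, t in monthly_messages]

# Hypothetical consumer confidence values for the same months.
consumer_confidence = [-30.0, -24.0, -20.0, -27.0]

r = pearson(social_sentiment, consumer_confidence)
print(round(r, 2))
```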

  12. Challenges: Big Data and statistics
     • Legal
       • Is access routinely allowed (not only for research)?
     • Privacy
       • With more and more data, privacy demands increase
       • We have to be careful here!
     • Costs
       • In the Netherlands we don’t pay for admin data.
       • Should we pay for Big Data?
     • Manage
       • Who owns the data? Stability of delivery/source
       • Because of its volume, run queries in the database of the data source holder
     Challenges: Big Data and statistics (2)
     • Methodological
       • Big data sources register events, not units, and they are selective!
       • Methods & models specific for large datasets (fast and ‘robust’)
       • Try to ‘make big data small’ ASAP (noise reduction)
     • Technological
       • Learn from ‘computational statistical’ research areas
       • High Performance Computing needs, parallel processing
     • People
       • Need ‘data scientists’ (statistically minded people with programming skills who are curious)
       • Who are able to think outside the traditional sample-survey-based paradigm!

  13. The future of Stat Neth?
