A Shared Computation Environment for International Cooperation on Big Data
United Nations Economic Commission for Europe, Statistical Division
NTTS 2015, March 10, 2015
Matjaz Jug, Project Consultant, UNECE (Statistics Netherlands)
Carlo Vaccari, Project Consultant, UNECE (Istat)
Antonino Virgillito, Project Consultant, UNECE (Istat)
BACKGROUND
The Role of Big Data in the Modernisation of Statistical Production and Services
Introduction
• The High-Level Group for the Modernisation of Statistical Production and Services (HLG) promotes activities for the modernisation of statistical production and services
– Reports directly to the Conference of European Statisticians
• Collaboration projects
– 2013: Generic Statistical Information Model
– 2013: Common Statistical Production Architecture
– 2014: Big Data
The HLG Big Data Project
• Objectives
– to identify the main possibilities offered by Big Data to statistical organizations
– to demonstrate the feasibility of efficient production of both novel products and 'mainstream' official statistics using Big Data sources
• 75 participants from 20 organizations
– National Statistical Offices and International Organizations
• Ran from January to December 2014
• 4 task teams
– Quality
– Partnership
– Privacy
– Technology: hands-on work on Big Data tools and datasets in a common, shared computation environment (the Sandbox)
The Sandbox
Shared computation environment for the storage and the analysis of large-scale datasets
Used as a platform for collaboration across participating institutions
Created with support from:
- CSO: Central Statistics Office of Ireland
- ICHEC: Irish Centre for High-End Computing
Cluster of 28 machines
Accessible through web and SSH
Software: full Hadoop stack, visual analytics, R, RDBMS, NoSQL DB (Hortonworks Data Platform, Pentaho, RHadoop)
Objectives
• Explore tools and methods
• Test feasibility of producing Big Data-derived statistics
• Replicate outputs across countries
The Sandbox Web Interface
EXPERIMENTS
Experiments
• Social Media
• Job Vacancies Ads
• Mobile Phones
• Web Scraping
• Prices
• Traffic Loops
• Smart Meters
Legend: positive indication; “mixed” indication; negative indication; more work needed / ongoing
Each experiment team produced a detailed report on its activity, available on the UNECE wiki
A summary of the results is presented in the appendix
FINDINGS
Statistics
We showed some of the possible improvements that can be obtained using Big Data sources
• Cheaper
• More timely
• Novel
Skills
All available tools were used in the experiments by both researchers and technicians with no previous experience
The Sandbox can represent a capacity building platform for participating institutions
Crucial for building “data scientist” skills
“At present there is insufficient training in the skills that were identified as most important for people working with Big Data” (UNECE Big Data Questionnaire)
- Skills on Hadoop/NoSQL DBs indicated as “planned in the near future” by majority of organizations
“Projects in planning were less likely to use tools generally associated with “Big Data”. Often this decision was made due to a lack of familiarity with new tools or a deficit of secure “Big Data” infrastructure (e.g. parallel processing no-SQL data stores such as Hadoop).” (UNSD Big Data Questionnaire)
Technology
• Big Data tools are necessary when dealing with data of hundreds of GB or more
– Effective starting from tens of GB
– “Traditional” tools perform better with smaller datasets
• Researchers/technicians should be able to master different tools and be ready to deal with immature software
– Highly dynamic situation, with frequent updates and new tools appearing often
• Need strong IT skills for managing the tools
– Support from software companies might be required in early phases
Acquisition
• 7 datasets were loaded
– Initial project proposal required “one or more”
• Difficult to retrieve “interesting” (i.e., meaningful, disaggregated…) datasets
– Privacy and size issues
• This also applies to web sources that are only apparently easy to retrieve
– Issues with quality, in terms of coverage and representativeness
Sharing
• Sharing of methods and datasets was achieved naturally
• Many data sets have the same form in all countries
– Methods can be developed and tested in the shared environment and then applied to real counterparts within each NSI
• Privacy constraints on datasets limit the possibility of sharing
– Can be partly bypassed through the use of synthetic data sets
Extension of the Project in 2015
• Production of multinational statistics based only on Big Data sources
– Objective: present results in a press conference in November 2015
• Continuation of experiments
– Consolidated technical skills that can now be used more effectively in experiments
• Possibility of testing new models of partnership
– Moving data is too difficult. Why not try to involve partners in running our programs on their data in their data centers?
Project output available on UNECE Wiki
http://www1.unece.org/stat/platform/display/bigdata/2014+Project
QUESTIONS
Appendix: EXPERIMENTS
UNFPA Big Data Bootcamp, February 3, 2015
Social Media: Mobility Studies
Dataset: Tweets generated in Mexico, Jan14/Jul14
Records: 42M
Size: 9.2 GB
Analysis of mobility starting from georeference data of single tweets
• Patterns of mobility to touristic cities
• Trans-border mobility
Mobility statistics computed at detailed territorial level
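The mobility analysis described on this slide can be illustrated with a minimal sketch. It is not the project's actual method: the region bounding boxes, the tuple layout `(user_id, timestamp, lat, lon)` and the function names are assumptions made for illustration. Each tweet is assigned to a region, and a trip is counted whenever two consecutive tweets by the same user fall in different regions.

```python
from collections import Counter

# Hypothetical bounding boxes (min_lat, min_lon, max_lat, max_lon).
# These are illustrative only and do not correspond to real boundaries.
REGIONS = {
    "A": (0.0, 0.0, 10.0, 10.0),
    "B": (0.0, 10.0, 10.0, 20.0),
}


def region_of(lat, lon):
    """Return the name of the region containing the point, or None."""
    for name, (lat0, lon0, lat1, lon1) in REGIONS.items():
        if lat0 <= lat < lat1 and lon0 <= lon < lon1:
            return name
    return None


def count_trips(tweets):
    """Count origin/destination pairs from georeferenced tweets.

    tweets: iterable of (user_id, timestamp, lat, lon) tuples (assumed layout).
    A trip is counted when consecutive tweets of one user change region.
    """
    last_region = {}
    trips = Counter()
    for user, _ts, lat, lon in sorted(tweets, key=lambda t: (t[0], t[1])):
        region = region_of(lat, lon)
        if region is None:
            continue  # tweet outside every known region
        prev = last_region.get(user)
        if prev is not None and prev != region:
            trips[(prev, region)] += 1
        last_region[user] = region
    return trips


sample = [
    ("u1", 1, 5.0, 5.0),
    ("u1", 2, 5.0, 15.0),
    ("u2", 1, 5.0, 15.0),
    ("u2", 2, 5.0, 16.0),
    ("u1", 3, 5.0, 5.0),
]
trips = count_trips(sample)
```

On a cluster the same grouping by user and region pair would be expressed in one of the Hadoop high-level languages; the logic stays the same.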
Social Media: Sentiment Analysis
Dataset: Tweets generated in Mexico, Jan14/Jul14
Records: 42M
Size: 9.2 GB
Derived sentiment indicator from analysis of Mexican tweets
• Emoticons and media acronyms
Cross-country sharing of method: Statistics Netherlands applied its methodology to relate sentiment to consumer confidence
Correlation is not as good as in previous study based on Dutch data
- Only emoticons were considered
- Dutch study also used Facebook as a source
More accurate, language-based computation of sentiment currently carried out in Mexico, based on partnership with university
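A rough sketch of an emoticon-based sentiment indicator follows. The slides do not give the formula used, so the emoticon lists, the `(month, text)` input layout and the balance `(positive - negative) / (positive + negative)` are assumptions chosen for illustration, not the Statistics Netherlands methodology.

```python
from collections import defaultdict

# Hypothetical emoticon lists; a real study would use a much larger set.
POSITIVE = {":)", ":-)", ":D"}
NEGATIVE = {":(", ":-(", ":'("}


def classify(text):
    """Return 1, -1 or 0 depending on which emoticons dominate the text."""
    tokens = text.split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return 1
    if neg > pos:
        return -1
    return 0


def monthly_indicator(tweets):
    """Compute a sentiment balance per month.

    tweets: iterable of (month, text) pairs (assumed layout).
    Returns {month: (positive - negative) / (positive + negative)},
    skipping months with no emoticon-bearing tweets.
    """
    counts = defaultdict(lambda: [0, 0])  # month -> [positive, negative]
    for month, text in tweets:
        label = classify(text)
        if label > 0:
            counts[month][0] += 1
        elif label < 0:
            counts[month][1] += 1
    return {m: (p - n) / (p + n) for m, (p, n) in counts.items() if p + n > 0}


sample = [
    ("2014-01", "buen dia :)"),
    ("2014-01", "que mal :("),
    ("2014-01", "genial :D"),
    ("2014-02", "sin emoticonos"),
    ("2014-02", "triste :("),
]
indicator = monthly_indicator(sample)
```

The resulting monthly series is the kind of indicator that could then be compared with a consumer confidence series.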
Mobile Phones
Dataset: Four datasets from Orange. Call data from Ivory Coast.
Records: 865M
Size: 31.4 GB
Analysis of mobility from aggregate phone data
• Visual analysis of call location data
• User categories from call intensity patterns
Consumer Price Index
Dataset: Synthetic scanner data
Records: 11G
Size: 260 GB
Test performance of big data technologies on big data sets through the computation of a simplified consumer price index on synthetic price data
Comparison between “traditional” and Big Data technologies
• Could write index computation script with one of the high-level languages that are part of the Hadoop environment
• Big Data tools are necessary and achieve good scalability when data grow over tens of GB
Future work on methodology
• Work on scanner data is active in several NSIs. Data has same structure and methods can be shared.
• Novel statistics can be computed working on large scale data (no sampling)
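The slides do not say which formula the simplified index used. As one possible illustration, the sketch below computes unit values (turnover divided by quantity) per product and period, then a Jevons index, the geometric mean of price relatives over products present in both periods. The record layout `(period, product_id, quantity, turnover)` and the function names are assumptions.

```python
import math
from collections import defaultdict


def unit_values(records):
    """Aggregate scanner records into unit values.

    records: iterable of (period, product_id, quantity, turnover) tuples
    (assumed layout). Returns {period: {product_id: turnover / quantity}}.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # (period, product) -> [qty, turnover]
    for period, product, qty, turnover in records:
        totals[(period, product)][0] += qty
        totals[(period, product)][1] += turnover
    prices = defaultdict(dict)
    for (period, product), (qty, turnover) in totals.items():
        if qty > 0:
            prices[period][product] = turnover / qty
    return prices


def jevons_index(prices, base, current):
    """Geometric mean of price relatives for matched products, base = 100."""
    matched = prices[base].keys() & prices[current].keys()
    if not matched:
        raise ValueError("no products are present in both periods")
    log_sum = sum(math.log(prices[current][p] / prices[base][p]) for p in matched)
    return 100.0 * math.exp(log_sum / len(matched))


sample = [
    ("2014-01", "milk", 10, 10.0),
    ("2014-01", "milk", 10, 10.0),
    ("2014-01", "bread", 5, 10.0),
    ("2014-02", "milk", 10, 20.0),
    ("2014-02", "bread", 4, 4.0),
]
prices = unit_values(sample)
index = jevons_index(prices, "2014-01", "2014-02")
```

The aggregation step is a plain group-by, which is why it maps naturally onto Hadoop tools when the input grows to billions of records.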
Smart Meters
Dataset:
- Real data from Ireland
- Synthetic data from Canada
Records: 160M
Size: 2.5 GB
Test of aggregation using Big Data tools
• Quickly wrote aggregation scripts that could be used on both datasets
Future work on sharing methods through the use of synthetic data sets
Charts: weekly consumption per hour of day over a year (IE); hourly consumption per day (CAN), for winter, mid-seasons and summer
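An aggregation of the kind charted on this slide can be sketched as follows. The input layout `(meter_id, ISO timestamp, kWh)` and the function name are assumptions; the actual scripts ran on the Sandbox tools and are not reproduced here. Because the function depends only on that layout, the same code would apply to any dataset mapped into it.

```python
from collections import defaultdict
from datetime import datetime


def mean_by_hour(readings):
    """Average consumption per hour of day.

    readings: iterable of (meter_id, ISO 8601 timestamp, kWh) tuples
    (assumed layout). Returns {hour: mean kWh}, ordered by hour.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _meter, timestamp, kwh in readings:
        hour = datetime.fromisoformat(timestamp).hour
        sums[hour] += kwh
        counts[hour] += 1
    return {h: sums[h] / counts[h] for h in sorted(sums)}


sample = [
    ("m1", "2014-01-01T08:00:00", 1.0),
    ("m1", "2014-01-01T08:30:00", 3.0),
    ("m2", "2014-01-02T08:15:00", 2.0),
    ("m2", "2014-01-01T20:00:00", 4.0),
]
hourly = mean_by_hour(sample)
```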