USE OF GEOSPATIAL AND WEB DATA FOR OECD STATISTICS
CCSA SPECIAL SESSION ON SHOWCASING BIG DATA, 1 OCTOBER 2015
Paul Schreyer, Deputy-Director, Statistics Directorate, OECD
OECD APPROACH
• OECD:
  – Facilitator of discussion on new data sources for NSOs
  – OECD’s own use of new data sources
• From Big Data to Smart Data
  – Not every new data source is big
  – Not every big data source is new
Business value analysis: why are we working on this?
• More granularity or coverage of existing data (e.g. spatial disaggregation)
• New output (e.g. measuring trust, inequalities)
• Greater timeliness – nowcasting
• Increased impact – analysis supporting OECD mission, possibility to link areas
• Increased responsiveness – capacity to address new topics quickly, respond to what-if questions
Business process analysis: necessary capabilities
– Capacity to identify, evaluate and access new data sources
– Command of methodology
– Proven quality and metadata frameworks
– Suitable IT infrastructures
– Established legal and ethical frameworks
– Skills and training capacity
4 types of new sources and examples of use cases
Types of source:
• Content analysis
• Mobility studies
• Web crawling, web scraping
• Sensor and geospatial data
Examples of use cases:
• Online real estate prices (OECD GOV)
• African Economic Outlook (AEO): civil tensions and political governance indicators (OECD DEV)
• Air quality and land cover data (OECD GOV)
• Measure transport reliability from geolocalisation logs (ITF)
• Measuring trade restrictiveness by scraping and analysing trade laws (OECD TAD)
• Enriching the metropolitan database using geo-spatial data (OECD GOV)
• Big Data Measures of Human Well-Being – Evidence from US Google Index (OECD STD)
• PIAAC log file data (OECD EDU)
EXAMPLE 1 ENVIRONMENTAL INDICATORS Using geospatial data (satellite data)
Average population exposure to air pollution (PM2.5)
Key messages that the indicator should communicate:
– Where air pollution is above recommended levels
– Where improvements in air quality have happened
– Linking air pollution to health
Source: Raster (satellite observations)
• Raster: van Donkelaar et al. (2014)
• Resolution: ~10 km²
• Years: 1998-2012

Ground-based stations
• Advantages
  – Direct measures
  – Offer regular levels of air pollution over time
  – More pollutants are available
• Disadvantages
  – Low coverage in developing countries
  – Uneven coverage within and across countries
  – PM2.5 concentration rarely monitored
  – Site selection, measurement techniques, and reporting methods differ across regions and countries

Satellite observations
• Advantages
  – Global coverage
  – Consistent method to compute air pollution in cities, regions and countries
  – Consistent time-series data, spanning more than a decade
• Disadvantages
  – Modelled data
  – Satellite observations are less precise for bright surfaces (snow or desert)
  – Current data are on a multi-year average; evaluation of short-term events is often unavailable
Basic methodology
1. The satellite-based values of air pollution are multiplied by the population living in the area (using a 1 km² resolution grid).
2. The exposure to air pollution in a region is given by the sum of the population-weighted values of PM2.5 in the 1 km² grid cells falling within the boundaries of the region.
3. Finally, dividing this aggregated value by the total population in the region gives the average exposure to PM2.5 concentration in the region.
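The three steps amount to a population-weighted mean of PM2.5 per region. The sketch below is only an illustration of that arithmetic, not the OECD implementation: the function name `average_exposure`, the tuple layout and the sample numbers are all assumptions made up for the example, and it presumes each grid cell has already been assigned to a region.

```python
from collections import defaultdict


def average_exposure(cells):
    """Population-weighted mean PM2.5 per region.

    `cells` is an iterable of (region, population, pm25) tuples, one per
    grid cell. This layout is a hypothetical one chosen for illustration.
    """
    weighted_sum = defaultdict(float)   # sum of population * PM2.5 per region
    population = defaultdict(float)     # total population per region
    for region, pop, pm25 in cells:
        weighted_sum[region] += pop * pm25   # steps 1 and 2
        population[region] += pop
    # Step 3: divide by total population; skip regions with no inhabitants.
    return {
        region: weighted_sum[region] / population[region]
        for region in weighted_sum
        if population[region] > 0
    }


# Made-up grid cells: (region, population, PM2.5 value)
cells = [
    ("A", 1000, 10.0),
    ("A", 3000, 20.0),
    ("B", 500, 8.0),
]
exposure = average_exposure(cells)
print(exposure)
```

In the made-up region "A", the densely populated cell pulls the average towards its higher PM2.5 value, which is the point of weighting by population rather than by area.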
Levels and trends in OECD cities
• 68% of the urban population in OECD countries (376 million people) are exposed to pollution above the WHO’s recommended levels.
• OECD estimates show wide variation in PM2.5 exposure levels across cities within countries, the largest in Mexico, Italy, Japan and Korea.
[Figure: metropolitan minimum, country average and metropolitan maximum of PM2.5 exposure, by country (number of cities in parentheses), with the lowest and highest city labelled for each country.]
Source: Brezzi and Sanchez-Serra (2014)
Other example: raster sources used for land cover
• Europe: Corine land cover; resolution 25 metres; years 2000-06; 44 land cover classes
• USA: National land cover dataset (NLCD); resolution 30 metres; years 2001-06; 21 land cover classes
• Japan: Japan National Land Service Information data; resolution 100 metres; years 1997-2006; 11 land cover classes
• World: MODIS 500 Map of Global Urban Extent; resolution 500 metres; year 2008; 17 land cover classes
The classification row of the slide also lists urban land and water.
…feeds into the OECD Regional Well-Being Database
Links: Regional Well-Being database; Regional Well-Being web tool
EXAMPLE 2 TRADE POLICY ANALYSIS Using qualitative data from government websites
Basic idea
Traditionally:
• Policy questionnaires to countries
• ‘Manual’ screening of government websites
New:
• Machine-based monitoring of government websites
• Automatic check for changes or addition of rules and regulations
Test case: qualitative information for the OECD’s trade restrictiveness information and index
How?
Text comparison: initial discovery
• Run a text comparison between the original document and the updated document
• Detect and flag the specific paragraphs that were changed or updated inside long documents
Text comparison: advanced discovery
• Changes in rules and regulations can also happen through new pages
• Use ‘big data’ techniques to compare in-house structured information with the universe of laws and regulations in a given country
• Work on text definitions similar to the original ones to help identify potentially relevant documents
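As a rough illustration of the initial-discovery step, the sketch below flags changed, added or removed paragraphs between two versions of a document using Python's standard `difflib` module. It is not the tool used in the OECD pilot: the function name `changed_paragraphs`, the assumption that paragraphs are separated by blank lines, and the two sample texts are all invented for the example.

```python
import difflib


def split_paragraphs(text):
    """Split on blank lines (an assumed convention) and drop empty chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def changed_paragraphs(old_text, new_text):
    """Return (tag, old_paragraphs, new_paragraphs) for every non-equal block.

    `tag` is one of difflib's opcode tags: 'replace', 'delete' or 'insert'.
    """
    old = split_paragraphs(old_text)
    new = split_paragraphs(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented sample texts, for illustration only.
original = (
    "Article 1. Foreign equity is capped at 49%.\n\n"
    "Article 2. Licences are issued by the ministry."
)
updated = (
    "Article 1. Foreign equity is capped at 25%.\n\n"
    "Article 2. Licences are issued by the ministry.\n\n"
    "Article 3. Applications are processed within 30 days."
)
flags = changed_paragraphs(original, updated)
for tag, before, after in flags:
    print(tag, before, after)
```

Comparing whole paragraphs keeps the output short for long legal texts; a second pass with a character-level diff on each flagged pair could then show exactly what changed.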
IT Tools
• Web-crawling: scripts to systematically scan governmental websites where regulations can be found (federal, provincial, regional, etc.)
• Web-scraping: scripts to extract the relevant information in documents, possibly based on articles and paragraphs (text analysis)
• Document conversion: most laws and regulations are published as PDF, but some come in other formats; all need to be converted to text documents before text analysis can run
• Text comparison: tools and dictionaries to compare the text of updated documents with the original text and to calculate similarity coefficients with other documents, in a variety of languages, with the option to also use the proximity of similar words
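One common way to compute a similarity coefficient between two documents is the cosine similarity of their word-count vectors. The slides do not say which coefficient the pilot used, so the sketch below is an assumption chosen for illustration; the function names and sample sentences are invented, and the tokeniser is a deliberately simple one that ignores the dictionaries and word-proximity options mentioned above.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case word tokens; a simple stand-in for real text preparation."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity of word-count vectors, between 0.0 and 1.0."""
    a, b = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Invented sample sentences, for illustration only.
reference = "Foreign equity in telecom operators is capped"
candidates = {
    "doc_1": "The cap on foreign equity in telecom operators is lifted",
    "doc_2": "Opening hours of public libraries",
}
scores = {name: cosine_similarity(reference, text) for name, text in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Ranking candidate documents by such a score is one way to shortlist potentially relevant texts for manual review.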
Web scraping / Text analysis Promising results on French legal texts (Legifrance)
Summary
• Significant potential
• Use cases and pilots provide important reality checks
• Smart data and multiple sources, not necessarily big data
• Initiatives have sprung up in many parts of the OECD
• Need to be accompanied by the overall strategy being developed at the OECD
Thank you!