The Impact of the Data Revolution on Official Statistics: Opportunities, Challenges and Risks Prof. Rob Kitchin NIRSA, Maynooth University
Background • All-Island Research Observatory (AIRO; www.airo.ie) • Dublin Dashboard (www.dublindashboard.ie) • Digital Repository of Ireland (DRI; www.dri.ie) • The Programmable City
The Data Revolution book • A synoptic overview of big data, open data and data infrastructures • An introduction to thinking conceptually about data, data infrastructures, data analytics and data markets • A critical discussion of the technical issues and the social, political and ethical consequences of the data revolution • An analysis of the implications of the data revolution to academic, business and government practices
The data revolution • Data infrastructures • Open and linked data • Big data • Data analytics • Data markets • Conceptualisation of data • Disruptive innovations that offer opportunities, challenges and risks for government, business and academy
Data infrastructures • Actively planned, curated and managed • Enables storing, scaling, combining, sharing and consuming data across networked archives and repositories • Produces ‘data amplification’ • NSIs long and loosely operated as such (trusted) infrastructures, but now organising into more coordinated platforms with: • dedicated and integrated hardware and networked technologies; interoperable software and middleware services and tools; shared standards, protocols, metadata; shared services (relating to data management and processing), analysis tools & policies (concerning access, use, IPR, etc) • Such infrastructures are being federated into larger pan-national infrastructures (Eurostat, ESPON, UN, etc). • Many other institutions catching up
Open and linked data • Opening PSI (and other) data for re-use: driven by transparency, participation, collaboration, economic arguments • Linking data/metadata using non-proprietary formats, URIs and RDF so that data can be referenced and conjoined • NSIs already very active in this space; other government data providers much further behind • More to be done, especially retro opening and linking historical records; producing APIs; upgrading extent of openness (licensing re. re-use, reworking, redistribution, reselling); using non-proprietary formats; opening data about the organizations themselves
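To make "referenced and conjoined" concrete, here is a minimal sketch of how two separately published datasets can be joined once they identify the same thing with the same URI. The `http://example.org/` namespace, the area URI, the predicate names and the values are all hypothetical placeholders, and the serialiser covers only plain string literals with minimal escaping; it illustrates the idea rather than any NSI's actual publishing pipeline.

```python
from collections import defaultdict

BASE = "http://example.org/"      # hypothetical namespace, for illustration only
AREA = BASE + "area/A1"           # hypothetical URI identifying one statistical area

# Two datasets, published separately, that both refer to the same area URI.
# Each statement is a (subject URI, predicate URI, literal value) tuple.
census = [(AREA, BASE + "def/population", "1000")]   # made-up value
housing = [(AREA, BASE + "def/dwellings", "400")]    # made-up value


def to_ntriples(triples):
    """Serialise (subject, predicate, literal) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        # Minimal escaping of backslashes and double quotes in the literal.
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


def conjoin(*datasets):
    """Merge datasets by grouping all statements under their shared subject URI."""
    merged = defaultdict(dict)
    for dataset in datasets:
        for s, p, o in dataset:
            merged[s][p] = o
    return {s: dict(props) for s, props in merged.items()}


linked = conjoin(census, housing)
print(to_ntriples(census + housing))
print(linked[AREA])
```

Because both datasets use the same subject URI, no bespoke matching key is needed: the join falls out of the shared identifier.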
Big data
Characteristic | Small data | Big data
Volume | Limited to large | Very large
Exhaustivity | Samples | Entire populations
Resolution and indexicality | Coarse & weak to tight & strong | Tight & strong
Relationality | Weak to strong | Strong
Velocity | Slow, freeze-framed | Fast
Variety | Limited to wide | Wide
Flexible and scalable | Low to middling | High
Big data and official statistics (source ESSC 2014)
Data analytics • Challenge of making sense of big data is coping with its abundance and exhaustivity, timeliness and dynamism, messiness and uncertainty, semi-structured or unstructured nature • Solution has been machine learning made possible by advances in computation and computational techniques • Four broad classes of analytics: • data mining and pattern recognition • statistical analysis • prediction, simulation, and optimization • data visualization and visual analytics
Conceptualising data • Technically and methodologically: data generation, handling, processing, storing, analyzing, sharing, etc. • Philosophically: ontology, epistemology, ideology • what can we know about the world, how can we know it, what should we do with such knowledge • Critical data studies • rather than understanding data as objective, neutral, pre-analytic & commonsensical, data are understood as being framed socially, politically, ethically and philosophically in terms of their form, selection, analysis and deployment • data do not exist independently of the ideas, instruments, practices, contexts, knowledges and systems used to generate, process and analyze them • data express a normative notion about what should be measured, for what reasons, and what they should tell us; they have normative effects; they do not simply reflect the world but actively produce it • data are framed by and situated within data assemblages – NSIs constitute such assemblages
Data assemblage
Attributes | Elements
Systems of thought | Modes of thinking, philosophies, theories, models, ideologies, rationalities, etc.
Forms of knowledge | Research texts, manuals, magazines, websites, experience, word of mouth, chat forums, etc.
Finance | Business models, investment, venture capital, grants, philanthropy, profit, etc.
Political economy | Policy, tax regimes, public and political opinion, ethical considerations, etc.
Governmentalities / Legalities | Data standards, file formats, system requirements, protocols, regulations, laws, licensing, intellectual property regimes, etc.
Materialities & infrastructures | Paper/pens, computers, digital devices, sensors, scanners, databases, networks, servers, etc.
Practices | Techniques, ways of doing, learned behaviours, scientific conventions, etc.
Organisations & institutions | Archives, corporations, consultants, manufacturers, retailers, government agencies, universities, conferences, clubs and societies, committees and boards, communities of practice, etc.
Subjectivities & communities | Of data producers, curators, managers, analysts, scientists, politicians, users, citizens, etc.
Places | Labs, offices, field sites, data centres, server farms, business parks, etc., and their agglomerations
Marketplace | For data, its derivatives (e.g., text, tables, graphs, maps), analysts, analytic software, interpretations, etc.
Implications and uses of data • Scaled, open, linked, big data and associated analytics produce knowledge that enhances governing of people, managing organisations, leveraging value and producing capital, creating better places, improving health and well-being, tackling social and ecological issues, fostering civic participation, etc. • They improve insight and wisdom, productivity, competitiveness, efficiency, effectiveness, utility, sustainability, safety & security, transparency ... • Challenge established epistemologies in the academy • “Revolutions in science have often been preceded by revolutions in measurement” (Sinan Aral) • new empiricism, data-driven science, computational social sciences, digital humanities • transforming how we frame, ask and answer questions
Opportunities for OS/NSIs • New sources of dynamic and linked data and more timely outputs • Complement/replace/improve/add to existing data/approaches • New forms of data analytics can provide greater insights from existing and new datasets • Optimize working practices, gain efficiencies, redeploy staff • Stronger links/partnerships with computational social science, data science (esp. viz), and data industries • Drive creation of data-driven institutions and evidence-informed governance • Greater visibility and use of products
Challenges for OS/NSIs • Sourcing data from third parties and associated partnering, legal and financial issues, including opening OSs derived from private data • Experimenting and trialing to determine: • suitability for official statistics, esp. when data are being repurposed, are not representatively sampled, are flexible (thus potentially altering continuity), and have undefined data quality (re. veracity (accuracy, fidelity), uncertainty, error, bias, reliability, calibration) • technological feasibility re. transferring, storing, cleaning, checking, and linking big data • methodological feasibility re. augmenting/producing OSs
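The concern about data that is not representatively sampled can be illustrated with a toy calculation. The sketch below assumes a hypothetical repurposed source that over-represents one group, and compares the naive mean with a mean reweighted to known population shares (post-stratification, a standard survey technique chosen here for illustration; the slides do not prescribe it). All group names, shares and values are made up.

```python
# Hypothetical population shares by group (e.g. known from a census).
population_share = {"young": 0.5, "old": 0.5}

# Hypothetical repurposed source in which the "young" group is over-represented.
# Each record is (group, observed value).
records = [("young", 10.0)] * 8 + [("old", 20.0)] * 2


def naive_mean(records):
    """Mean of all records, ignoring who is over- or under-represented."""
    return sum(value for _, value in records) / len(records)


def poststratified_mean(records, population_share):
    """Weight each group's mean by its known share of the population."""
    totals, counts = {}, {}
    for group, value in records:
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return sum(population_share[g] * totals[g] / counts[g] for g in counts)


print(naive_mean(records))                             # 12.0
print(poststratified_mean(records, population_share))  # 15.0
```

The gap between the two figures is the bias introduced by the skewed coverage. Reweighting only corrects for characteristics that are observed and whose population shares are known, which is one reason suitability has to be established by trialing rather than assumed.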
Challenges for OS/NSIs • Building and maintaining new IT infrastructure, retro work on older data (opening, linking); ensuring security/data protection, deploying new data analytics • Sourcing additional resourcing (financial and staffing) for dealing with new data streams and opening/linking data • Developing new technical and methodological skills and sourcing/retaining trained/skilled staff • Establishing standards, standardization, interoperability across jurisdictions
Risks for OS/NSIs • Undermining of reputation and trust • quantity and utility of data opened (moving beyond low-hanging fruit) • quality of data (big data often messy & dirty) and losing control of generation/sampling/processing • established statistical products become undermined or discontinued before alternatives fully established/verified • partnering with third parties (tarnished by their reputation) • public perception and resistance to use of big data • Privacy and security • Access and continuity (will private sources of data be available over the long term? will flexibility alter/break time-series?); resistance from third parties to sharing data (gratis) • Fragmented landscape across jurisdictions • Pressure to reduce staff/budget rather than redeploy • Competition and privatisation (data brokers)