It’s not the documents; it’s the DATA! Tom Johnson, Managing Director, Inst. for Analytic Journalism, Santa Fe, New Mexico USA, tom@jtjohnson.com
It’s not the documents, it’s the DATA! Presentation at “2011 Open Government Academy,” March 26, 2011. Presented by the New Mexico Foundation for Open Government, New Mexico Press Association and New Mexico Broadcasters Association. This PowerPoint deck and Tipsheet posted at: http://johnson-fog.notlong.com Licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
Important point: Nothing is as important – and valuable – as a good theory!
Theory of Journalistic Process: Data In → Analysis → Info Out • Data = that which, upon Analysis, yields Information. “Data” has many forms. • Analysis = Examination of data and facts to uncover and understand cause-effect and contextual relationships and patterns, thus providing a basis for problem solving and decision making. • Information = that which aids in making decisions
Important point: The document is not the data.
Bertillon system: Public Records DB Early public records • Intricate data collection • Potential for error in data entry • Potential for error in filing • No machine retrieval or analysis • Even today, OCR would be impossible
Bertillon system: Public Records DB By 1910… • Indexing system has improved • Typewriters instead of pen • Better haircuts But still … • Null fields • Subject to data entry errors; lost or misfiled cards/data • Limited large-scale analysis resources
Bertillon system: Public Records DB Early “hard drives,” data retrieval and data analysis of public records
Bertillon system: Public Records DB • A public record, but one of limited usage • A DOCUMENT, but no efficient, productive, insightful way to FIND the data • A DOCUMENT, but no efficient, productive, insightful way to EXTRACT the data • Sorta like a PDF
Traditional: Data In → Analysis → Info Out • Notes • Text • Numeric • Images • Maps • How? Who?
Digital Age: Data In → Analysis → Info Out • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video • Atoms → Bits • How? Who?
• New data is ubiquitous, shareable, scalable. • Retrieval, copying and storage costs trivial • Can be validated and explored by individuals and applications
Digital Age: Data In → Analysis → Info Out • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video • Atoms → Bits • How? Who?
• All data today requires NEW tools for ANALYSIS and STORY-TELLING • Statutes are usually adequate; the CULTURES are the challenge.
Important point: The document is not the data. Without analysis, the data are not the story.
Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden” • Waite: water, developers, land use = disappearing wetlands • UK: Investigate Your MPs Expenses “We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go” MP’s expense claims on Google spreadsheet
Journalism and GIS • Steve Doig [Miami Herald] 1992 Hurricane Andrew + damage reports + building inspection = jail terms
Doig: Hurricane Andrew
Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden”
Search • DB info • Analysis with real data • Sort
Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden” • Waite: water, developers, land use = “Vanishing Wetlands”
Vanishing Wetlands
Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden” • Waite: water, developers, land use = disappearing wetlands • UK: Investigate Your MPs Expenses “We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go” MP’s expense claims on Google spreadsheet • EFF Seeks Cooperating FOIA Reviewers
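The crowd-review figures quoted on this slide can be checked with a few lines of arithmetic. This is only an illustrative sketch in Python; the variable names are made up for the example, and the three input numbers are the ones quoted above.

```python
# Figures quoted on the slide for the UK MPs' expenses review.
total_pages = 458_832
reviewed_pages = 223_475
reviewers = 27_731

# Pages still waiting for a reader, and how far along the crowd is.
remaining_pages = total_pages - reviewed_pages
share_reviewed = reviewed_pages / total_pages
pages_per_reviewer = reviewed_pages / reviewers

print(f"Remaining: {remaining_pages:,}")   # Remaining: 235,357
print(f"Reviewed:  {share_reviewed:.1%}")  # Reviewed:  48.7%
print(f"Average pages per reviewer: {pages_per_reviewer:.1f}")
```

The subtraction reproduces the "235,357 to go" on the slide, a small case of the point that numbers pulled out of documents can be validated once they are treated as data.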
UK MP’s expenses • Solid search tools • These are PDFs, POST-search
Major questions? As participants in a liberal democracy… • How do we get the necessary data? • And from where? • And in appropriate forms?
Files, Transparency, Ease of Analysis [Figure: scale running from “Easier” to “Challenging”]
Files, Transparency, Ease of Analysis
Data In: Objectives/Requirements • Move data from “out there” to analytic site/tools • Looking for connections; patterns
Data In: Objectives/Requirements • Seeking fine-grained data, NOT aggregations • Seek data in original form (i.e., NO PDFs) • Get data in lowest common denominator format: comma-delimited files in ASCII or text • Who collected the data? Why? How? • Who proofed/edited the data? Why? How? • If from a database, first ask for the “record layout” or “code sheet” or “schema” • Definitions of variables or fields. Constant or ???
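The requirements above (comma-delimited text plus a record layout) can be sketched as a short check in Python. This is a minimal illustration, not a prescribed workflow: the field names in `RECORD_LAYOUT`, the sample rows in `SAMPLE_CSV`, and the helper `check_against_layout` are all hypothetical, invented for the example.

```python
import csv
import io

# Hypothetical record layout ("code sheet"): field name -> expected type.
RECORD_LAYOUT = {"agency": str, "fiscal_year": int, "amount": float}

# Hypothetical sample standing in for a comma-delimited ASCII file.
SAMPLE_CSV = """agency,fiscal_year,amount
Dept. of Health,2010,1250.50
Dept. of Transportation,2010,
Dept. of Education,20l0,980.00
"""

def check_against_layout(text, layout):
    """Validate each row against the layout; return (clean_rows, problems)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(layout) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"file lacks fields named in the layout: {sorted(missing)}")
    clean_rows, problems = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        converted, row_ok = {}, True
        for field, expected_type in layout.items():
            raw = (row[field] or "").strip()
            if raw == "":
                # Null field: the value was never entered.
                problems.append((line_no, field, "null field"))
                row_ok = False
                continue
            try:
                converted[field] = expected_type(raw)
            except ValueError:
                # Data entry error: the value does not match the layout's type.
                problems.append((line_no, field, f"not {expected_type.__name__}: {raw!r}"))
                row_ok = False
        if row_ok:
            clean_rows.append(converted)
    return clean_rows, problems

clean_rows, problems = check_against_layout(SAMPLE_CSV, RECORD_LAYOUT)
for problem in problems:
    print(problem)
```

In this made-up sample, one row has an empty amount and another has a letter "l" typed in place of a digit in the year, the same null-field and data-entry problems noted for the Bertillon cards. Without the record layout there would be nothing to check the rows against.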
Data In: “Typical” problems with gov sites. Barriers to data = barriers to analysis • NO site search capability; no site map • Failure to use open-standard HTML; using closed-standard Adobe Flash/Shockwave environment. • Page formats/layouts not consistent; too many drill-downs instead of search-driven generators • Jiggly roll-overs; too much effort spent on bling • Impossible to download or scrape data for analysis • Information available only in Adobe PDF files; notoriously unfriendly to data analysis.
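To show why open-standard HTML matters for the "download or scrape" point above, here is a small sketch that pulls rows out of an HTML table using only Python's standard library. The `TableExtractor` class and the sample markup in `SAMPLE_HTML` are hypothetical, written for this illustration; a real agency page would need its own handling, and nothing comparable works on a Flash application or an image-only PDF.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical markup standing in for a government page with a data table.
SAMPLE_HTML = """
<table>
  <tr><th>Bill</th><th>Sponsor</th></tr>
  <tr><td>HB 1</td><td>Legislator A</td></tr>
  <tr><td>SB 2</td><td>Legislator B</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(SAMPLE_HTML)
parser.close()
rows = parser.rows
print(rows)
```

Once the rows are plain lists, they can be written to a comma-delimited file and sorted or searched like any other data.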
Good NM sites • Feedback! • Español • Search!
NM Legis. Bill Finder • Download bill in TWO formats • Could be better: no way to find what bills were introduced by X legislator
Data In: Challenges • New site in New Mexico: www.sunshineportalnm.com • “Beta,” but a facade for taxpayers; a secondary tax because of minimal utility; torture for journos
Data In: Challenges in SunshinePort • Comprehensive Annual Financial Reports • Possible to machine download, but laborious to format for analysis • Investment Holdings reports are far worse • They are poor-quality static image files, not machine-readable. • Tabular data roughly formatted; makes conversion for analysis an arduous, if not impossible, task.
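As a rough illustration of the "laborious to format" step, the sketch below turns space-aligned report text into records. Everything in it is assumed for the example: the sample text in `SAMPLE_REPORT`, the column names, and the helper `parse_rough_table` are invented, and the approach only applies to text that is already machine-readable. Static image files would first need OCR, which this does not attempt.

```python
import re

# Hypothetical space-aligned text, as might be copied out of a downloaded report.
SAMPLE_REPORT = """\
Fund Name              Asset Class      Market Value
General Fund           Fixed Income     1,250,000.00
Land Grant Fund        Equities           980,500.25
"""

def parse_rough_table(text):
    """Split each line on runs of two or more spaces and pair cells with headers."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    records = []
    for ln in lines[1:]:
        cells = re.split(r"\s{2,}", ln.strip())
        if len(cells) != len(header):
            # Rough formatting often breaks here; flag the row rather than guess.
            raise ValueError(f"cannot align row with header: {ln!r}")
        records.append(dict(zip(header, cells)))
    return records

records = parse_rough_table(SAMPLE_REPORT)
# Strip thousands separators before treating the values as numbers.
total = sum(float(r["Market Value"].replace(",", "")) for r in records)
print(records)
print(total)
```

The two-or-more-spaces rule is a heuristic. It fails as soon as a column is separated by a single space or a cell is blank, which is part of why converting such reports is arduous.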
Bottom line on SunshinePortalNM.com “This is not even a web page, it’s a Flash application, so there’s not going to be much sunlight escaping from this portal.” “If the State of New Mexico takes the position that through this site it is discharging all of its disclosure obligations with respect to these particular records, open government is in trouble there.”
Bottom line on SunshinePortalNM.com “A perfect example of creating the appearance of transparency without actually being transparent.”
Good data sites – Gov and NGO • Data.gov [A beta site] www.data.gov/ • Metrics www.data.gov/metric • DataSF - http://datasf.org/ a clearinghouse of datasets available from the City & County of San Francisco • San Francisco Enterprise GIS Program - http://gispub02.sfgov.org/data.asp • Maplight.com – an example of how citizens can use data. Nonprofit, nonpartisan research organization; provides citizens and journalists the transparency tools to shine a light on the influence of money on politics. • Prize-winning gov’t agency web sites: http://www.centerdigitalgov.com/survey/88/2010
Common aspects? • All have up-front search capabilities • All are written in “data-accessible” code • All data can be downloaded with “relative” ease • Some have various languages available • ALL are run by GOVERNMENT; no commercial sites