
It’s not the documents; it’s the DATA!



  1. It’s not the documents; it’s the DATA! Tom Johnson, Managing Director, Inst. for Analytic Journalism, Santa Fe, New Mexico USA. tom@jtjohnson.com

  2. It’s not the documents, it’s the DATA! Presentation at “2011 Open Government Academy,” March 26, 2011. Presented by the New Mexico Foundation for Open Government, New Mexico Press Association and New Mexico Broadcasters Association. This PowerPoint deck and Tipsheet posted at: http://johnson-fog.notlong.com Licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

  3. Important point: Nothing is as important – and valuable – as a good theory!

  4. Theory of Journalistic Process: Data In → Analysis → Info Out • Data = that which, upon Analysis, yields Information. “Data” has many forms. • Analysis = Examination of data and facts to uncover and understand cause-effect and contextual relationships and patterns, thus providing a basis for problem solving and decision making. • Information = that which aids in making decisions

  5. Important point: The document is not the data.

  6. Bertillon system: Public Records DB Early public records • Intricate data collection • Potential for error in data entry • Potential for error in filing • No machine retrieval or analysis • Even today, OCR would be impossible

  7. Bertillon system: Public Records DB By 1910… • Indexing system has improved • Typewriters instead of pen • Better haircuts But still… • Null fields • Subject to data entry errors; lost or misfiled cards/data • Limited large-scale analysis resources

  8. Bertillon system: Public Records DB Early public records • Intricate data collection • Data entry potential for error • Filing potential for error • No machine retrieval or analysis • Even today, no OCR By 1910… • Indexing system has improved • Typewriters instead of pen • Better haircuts But still… • Null fields • Subject to data entry errors; lost or misfiled cards/data • Limited large-scale analysis resources Early “hard drives,” data retrieval and data analysis of public records

  9. Bertillon system: Public Records DB • A public record, but one of limited usage • A DOCUMENT, but no efficient, productive, insightful way to FIND the data • A DOCUMENT, but no efficient, productive, insightful way to EXTRACT the data • Sorta like a PDF Early public records • Intricate data collection • Data entry potential for error • Filing potential for error • No machine retrieval or analysis • Even today, no OCR By 1910… • Indexing system has improved • Typewriters instead of pen • Better haircuts But still… • Null fields • Subject to data entry errors; lost or misfiled cards/data • Limited large-scale analysis resources Early “hard drives,” data retrieval and data analysis of public records

  10. Traditional: Data In → Analysis → Info Out • Notes • Text • Numeric • Images • Maps • How? Who?

  11. Digital Age: Data In → Analysis → Info Out • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video • Atoms → Bits • How? Who? New data: • New data is ubiquitous, shareable, scaleable. • Retrieval, copying and storage costs trivial • Can be validated and explored by individuals and applications

  12. Digital Age: Data In → Analysis → Info Out • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video • Atoms → Bits • How? Who? Implications: • All data today requires NEW tools for ANALYSIS and STORY-TELLING • Statutes are usually adequate; the CULTURES are the challenge.

  13. Important point: The document is not the data. Without analysis, the data are not the story.

  14. Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden” • Waite: water, developers, land use = disappearing wetlands • UK: Investigate Your MP’s Expenses: “We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go” • MPs’ expense claims on Google spreadsheet

  15. Journalism and GIS • Steve Doig [Miami Herald] 1992: Hurricane Andrew + damage reports + building inspection = jail terms

  16. Doig: Hurricane Andrew

  17. Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden”

  18. Search • DB info • Analysis with real data • Sort

  19. Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden” • Waite: water, developers, land use = “Vanishing Wetlands”

  20. Vanishing Wetlands

  21. Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden” • Waite: water, developers, land use = disappearing wetlands • UK: Investigate Your MP’s Expenses: “We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go” • MPs’ expense claims on Google spreadsheet • EFF Seeks Cooperating FOIA Reviewers
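
The crowdsourced review figures quoted on the slide can be checked with a few lines of arithmetic. This is only a sketch: the function name `review_progress` and its output keys are made up for illustration, and the three input numbers are taken from the quote as written.

```python
# Figures as quoted on the slide for the UK MPs' expenses review.
TOTAL_PAGES = 458_832
REVIEWED_PAGES = 223_475
REVIEWERS = 27_731


def review_progress(total, reviewed, reviewers):
    """Summarize how far a crowdsourced document review has come.

    Hypothetical helper for illustration, not part of any real tool.
    """
    return {
        "remaining": total - reviewed,
        "percent_reviewed": round(100 * reviewed / total, 1),
        "pages_per_reviewer": round(reviewed / reviewers, 1),
    }


progress = review_progress(TOTAL_PAGES, REVIEWED_PAGES, REVIEWERS)
print(progress)
```

The `remaining` value comes out to 235,357, which matches the "to go" figure in the quote. The point of the example is that once page counts exist as data rather than as a stack of documents, this kind of check is trivial.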

  22. UK MPs’ expenses • Solid search tools • These are PDFs, POST-search

  23. Major questions? As participants in a liberal democracy… • How do we get the necessary data? • And from where? • And in appropriate forms?

  24. Files, Transparency, Ease of Analysis (scale: Easier to Challenging)

  25. Files, Transparency, Ease of Analysis

  26. Data In: Objectives/Requirements • Move data from “out there” to analytic site/tools • Looking for connections; patterns

  27. Data In: Objectives/Requirements • Seeking fine-grained data, NOT aggregations • Seek data in original form (i.e., NO PDFs) • Get data in lowest common denominator format: comma-delimited files in ASCII or text • Who collected the data? Why? How? • Who proofed/edited the data? Why? How? • If from a database, first ask for the “record layout” or “code sheet” or “schema” • Definitions of variables or fields. Constant or ???
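
The requirements above (fine-grained rows, comma-delimited text, and a code sheet obtained first) can be sketched in Python with only the standard library. Everything specific here is an assumption made for illustration: the field names, the sample rows, and the `load_records` helper are hypothetical and do not come from any real agency file.

```python
import csv
import io

# Hypothetical code sheet ("record layout"): field name -> expected type.
CODE_SHEET = {
    "permit_id": str,
    "inspector": str,
    "inspection_date": str,
    "damage_usd": float,
}

# Made-up sample standing in for a comma-delimited text file.
SAMPLE_CSV = """permit_id,inspector,inspection_date,damage_usd
A-001,Smith,1992-08-30,12000
A-002,Jones,1992-08-31,
A-003,Smith,1992-09-01,4500.50
"""


def load_records(text, code_sheet):
    """Parse CSV text, check it against the code sheet, and count null fields."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(code_sheet) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"fields missing from file: {sorted(missing)}")
    records = []
    null_counts = {name: 0 for name in code_sheet}
    for row in reader:
        record = {}
        for name, cast in code_sheet.items():
            raw = (row[name] or "").strip()
            if raw == "":
                null_counts[name] += 1  # empty field: record it, do not guess
                record[name] = None
            else:
                record[name] = cast(raw)
        records.append(record)
    return records, null_counts


records, null_counts = load_records(SAMPLE_CSV, CODE_SHEET)

# Fine-grained rows can be rolled up into any aggregation we choose;
# a pre-aggregated table could not be broken back down into rows.
totals = {}
for r in records:
    if r["damage_usd"] is not None:
        totals[r["inspector"]] = totals.get(r["inspector"], 0.0) + r["damage_usd"]

print(null_counts)
print(totals)
```

Checking the file against the code sheet before any analysis surfaces missing fields and null values early, which is why the slide suggests asking for the record layout first.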

  28. Data In: “Typical” problems with gov sites. Barriers to data = barriers to analysis • NO site search capability; no site map • Failure to use open-standard HTML; using closed-standard Adobe Flash/Shockwave environment. • Page formats/layouts not consistent; too many drill-downs instead of search-driven generators • Jiggly roll-overs; too much effort spent on bling • Impossible to download or scrape data for analysis • Information available only in Adobe PDF files; notoriously unfriendly to data analysis.

  29. Good NM sites • Feedback! • Español • Search!

  30. NM Legis. Bill Finder • Download bill in TWO formats • Could be better: no way to find what bills were introduced by X legislator

  31. Data In: Challenges • New site in New Mexico: www.sunshineportalnm.com • “Beta,” but a facade for taxpayers; a secondary tax because of minimal utility; torture for journos

  32. Data In: Challenges in SunshinePort • Comprehensive Annual Financial Reports • Possible to machine download, but laborious to format for analysis • Investment Holdings reports are far worse • They are poor-quality static image files, not machine-readable. • Tabular data roughly formatted; makes conversion for analysis an arduous, if not impossible, task.

  33. Bottom line on SunshinePortalNM.com “This is not even a web page, it’s a Flash application, so there’s not going to be much sunlight escaping from this portal.” “If the State of New Mexico takes the position that through this site it is discharging all of its disclosure obligations with respect to these particular records, open government is in trouble there.”

  34. Bottom line on SunshinePortalNM.com “This is not even a web page, it’s a Flash application, so there’s not going to be much sunlight escaping from this portal.” “If the State of New Mexico takes the position that through this site it is discharging all of its disclosure obligations with respect to these particular records, open government is in trouble there.” “A perfect example of creating the appearance of transparency without actually being transparent.”

  35. Good data sites – Gov and NGO • Data.gov [A beta site] www.data.gov/ • Metrics www.data.gov/metric • DataSF - http://datasf.org/ a clearinghouse of datasets available from the City & County of San Francisco • San Francisco Enterprise GIS Program - http://gispub02.sfgov.org/data.asp • Maplight.com – an example of how citizens can use data. A nonprofit, nonpartisan research organization that provides citizens and journalists the transparency tools to shine a light on the influence of money on politics. • Prize-winning gov’t agency web sites: http://www.centerdigitalgov.com/survey/88/2010

  36. Common aspects? • All have up-front search capabilities • All are written in “data-accessible” code • All data can be downloaded with “relative” ease • Some have various languages available • ALL are run by GOVERNMENT; no commercial sites
