
Principles of Data Management (for Biologists) Dr Joe Thorley



  1. Principles of Data Management (for Biologists) Dr Joe Thorley R.P.Bio. Poisson Consulting August 14th, 2017

  2. Introduction Biologists spend millions of dollars collecting data with little regard for its management.

  3. Study Design Study design should precede data management.
     ◮ Identify question(s)
        ◮ what do we want to know and why?
     ◮ Assess existing data/understanding
        ◮ what do we already know?
     ◮ Develop field protocol
        ◮ how much will it cost?
        ◮ how useful is the answer likely to be?

  4. Data Management Once a study design has been developed, data management begins. Data management cycles through 10 stages:
     1. data collection
     2. data backup
     3. data security
     4. data digitization
     5. data cleansing
     6. data tidying
     7. data documentation
     8. data analysis
     9. data reporting
     10. data archiving

  5. Data Collection Field crews should be trained, informed, and provided with standard protocols and data collection forms. Printed forms on waterproof paper provide a cheap, robust solution.

  6. Data Backup Duplicate data as soon as possible. A smartphone camera is a simple way to duplicate data and sync to the cloud.

  7. Data Security Ensure the right people have access. Dropbox (https://www.dropbox.com) provides simple data security and sharing.

  8. Data Digitization Get the data into a useable electronic form. Excel is a useful data entry tool in the hands of a trained user.

  9. Data Cleansing Correct the inevitable errors. At best, errors add noise; at worst, they invalidate subsequent analyses!
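
One way to catch such errors early is to run simple validity checks on the digitized records. The following is a minimal Python sketch, not the presenter's method; the field names (`site`, `depth_m`, `hour`) and the limits are hypothetical and would need to match the actual field protocol.

```python
# Hypothetical field records; the column names and limits are illustrative only.
records = [
    {"site": "A", "depth_m": 2.5, "hour": 9},
    {"site": "B", "depth_m": -1.0, "hour": 14},  # negative depth: likely a typo
    {"site": "", "depth_m": 3.1, "hour": 27},    # missing site, impossible hour
]


def find_errors(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    if not record["site"]:
        problems.append("missing site")
    if record["depth_m"] < 0:
        problems.append("negative depth")
    if not 0 <= record["hour"] <= 23:
        problems.append("hour outside 0-23")
    return problems


def flag_records(rows):
    """Map each row index to its problems, keeping only rows with problems."""
    flagged = {}
    for index, row in enumerate(rows):
        problems = find_errors(row)
        if problems:
            flagged[index] = problems
    return flagged


flagged = flag_records(records)
for index, problems in flagged.items():
    print(index, problems)
```

Flagging rows rather than silently correcting them leaves the decision with someone who can consult the original field forms.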

  10. Data Tidying "Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table." (Wickham 2014) SQLite (https://sqlite.org) is free, open-source, cross-platform, embedded database software.
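
As a rough illustration of the tidy structure, the sketch below stores two types of observational unit (sites and visits) in separate SQLite tables using Python's built-in `sqlite3` module. The table and column names (`site`, `visit`, `site_name`, `visit_date`) and all values are assumptions made for the example, not data from the presentation.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# One table per type of observational unit; one column per variable.
con.executescript("""
CREATE TABLE site (
    site_name TEXT PRIMARY KEY,
    depth REAL NOT NULL CHECK (depth >= 0)
);
CREATE TABLE visit (
    site_name TEXT NOT NULL REFERENCES site (site_name),
    visit_date TEXT NOT NULL,
    hour INTEGER NOT NULL CHECK (hour BETWEEN 0 AND 23),
    PRIMARY KEY (site_name, visit_date, hour)
);
""")

# One row per observation (hypothetical values).
con.executemany("INSERT INTO site VALUES (?, ?)", [("A", 2.5), ("B", 4.0)])
con.executemany(
    "INSERT INTO visit VALUES (?, ?, ?)",
    [("A", "2017-08-14", 9), ("A", "2017-08-14", 14), ("B", "2017-08-14", 10)],
)

# Site-level variables are joined to visits on demand, not copied into each row.
rows = con.execute("""
    SELECT visit.site_name, visit.hour, site.depth
    FROM visit
    JOIN site ON site.site_name = visit.site_name
    ORDER BY visit.site_name, visit.hour
""").fetchall()
print(rows)
```

The constraints let the database itself reject impossible values and visits to unknown sites, so some cleansing happens at entry time.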

  11. Relational Data Figure from R for Data Science (http://r4ds.had.co.nz), available under CC BY-NC-ND 3.0 US.

  12. Data Documentation Data are just numbers and categories unless people know what they mean. A simple metadata table can provide a description and units for each variable.

      | Table | Column | Units   | Description                 |
      |-------|--------|---------|-----------------------------|
      | Site  | Depth  | m       | The tidally corrected depth |
      | Visit | Hour   | PST8PDT | The hour of the visit       |
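
A metadata table like the one on this slide can also be checked mechanically. The Python sketch below is one possible approach under stated assumptions: the metadata rows are taken from the slide, while the `data_columns` listing (including the undocumented `Date` column) is hypothetical and exists only to show what a failed check looks like.

```python
# Metadata rows from the slide: (table, column, units, description).
metadata = [
    ("Site", "Depth", "m", "The tidally corrected depth"),
    ("Visit", "Hour", "PST8PDT", "The hour of the visit"),
]

# Hypothetical columns present in each data table;
# "Date" is deliberately left undocumented.
data_columns = {
    "Site": ["Depth"],
    "Visit": ["Hour", "Date"],
}


def undocumented_columns(columns_by_table, metadata_rows):
    """Return the (table, column) pairs that have no metadata entry."""
    documented = {(table, column) for table, column, _units, _desc in metadata_rows}
    return sorted(
        (table, column)
        for table, columns in columns_by_table.items()
        for column in columns
        if (table, column) not in documented
    )


missing = undocumented_columns(data_columns, metadata)
print(missing)
```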

  13. Data Analysis Analytic code can be shared on GitHub (https://github.com).

  14. GitHub bcgov The province already has a GitHub account for sharing code.

  15. Data Reporting An answer only has value if decision-makers are aware of it. Zotero (https://www.zotero.org) is a free, easy-to-use tool to help you collect, organize, cite, and share your research sources. ResearchGate (https://www.researchgate.net) is a free way to share and discover research.

  16. Data Archiving Ensure others are able to use the data in perpetuity. Zenodo (https://zenodo.org) is free, citable, discoverable, and long-term, with open, restricted and closed access options. It uses the same cloud infrastructure as CERN’s own Large Hadron Collider (LHC) research data.

  17. Summary Data management requires trained personnel with an understanding of the principles but does not have to be expensive and pays for itself many times over.

  18. DFO

  19. Parks

  20. DataBC The provincial government has DataBC.

  21. CKAN CKAN (https://ckan.org) is the world’s leading open-source data portal platform. It is free and supports teams and private data. A key feature is an API (application programming interface) that allows code to interact with the repository.
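
To give a sense of what interacting with the repository from code can look like, here is a minimal Python sketch built around CKAN's `package_search` action. The portal address `https://ckan.example.org`, the search term, and the sample reply are all hypothetical; the reply is hand-written so that the example runs without network access.

```python
import json
from urllib.parse import urlencode

# Hypothetical portal address; replace with a real CKAN site.
BASE_URL = "https://ckan.example.org"


def package_search_url(base_url, query, rows=10):
    """Build the URL for a dataset search via CKAN's Action API."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"


def dataset_names(response_text):
    """Extract dataset names from the JSON text of a package_search reply."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("CKAN reported an unsuccessful request")
    return [dataset["name"] for dataset in payload["result"]["results"]]


url = package_search_url(BASE_URL, "fish survey")

# A hand-written stand-in for a server reply, so the sketch runs offline.
sample_reply = json.dumps(
    {"success": True, "result": {"count": 1, "results": [{"name": "example-fish-survey"}]}}
)
names = dataset_names(sample_reply)
print(url)
print(names)
```

Against a live portal, the text passed to `dataset_names` would come from fetching `url`, for example with `urllib.request.urlopen(url).read()`.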
