Introduction to data management with applications in hydrobiology
david.kneis@tu-dresden.de
TU Dresden, Institute of Hydrobiology
Outline
◮ Motivation
◮ Basics about tables
◮ Example data set
◮ How to arrange data properly
◮ Software options for data storage
◮ Working with data in base R
◮ Topics not covered
Motivation (page 4): Typical sources of data
◮ Monitoring (e.g. water quality recorded over time)
◮ Snapshot sampling (e.g. abundance of river bed organisms)
◮ Experiments (e.g. response of system to treatment; with replication)
◮ Model outputs (e.g. scenario or sensitivity analysis)
Motivation (page 5): Why care about data management?
◮ the key to efficient data analysis
◮ avoids inconsistency / loss of information
◮ ensures re-usability by others (and yourself at a later time)
◮ a must for serious research (traceability of results)
◮ enables efficient version control and archiving
Investment in good data management always pays off.
Motivation (page 7): What is data management about?
1. Arranging data in tables with proper layout
2. Selecting a software for data storage and manipulation
3. Understanding operations on tables
   ◮ merging, filters, aggregation
4. Knowing how to create inputs for specific analysis
   ◮ plotting, statistical tests
These will be the main subjects of this course.
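The three table operations named in item 3 could look roughly like this in base R. This is only a sketch: the tables `samples` and `values`, their column names, and all numbers are invented for illustration and are not part of the course data.

```r
# Hypothetical toy tables (names and numbers are made up for illustration)
samples <- data.frame(
  sample_id = c(1, 2, 3),
  location  = c("A", "A", "B"),
  stringsAsFactors = FALSE
)
values <- data.frame(
  sample_id = c(1, 1, 2, 3),
  variable  = c("temp", "pH", "temp", "temp"),
  value     = c(12.5, 7.8, 13.1, 9.4),
  stringsAsFactors = FALSE
)

# Merging: attach the sample attributes to each measured value
merged <- merge(values, samples, by = "sample_id")

# Filter: keep only the rows of one variable
temp <- merged[merged$variable == "temp", ]

# Aggregation: mean temperature per location
avg <- aggregate(value ~ location, data = temp, FUN = mean)
print(avg)
```

The result of the aggregation is again a table, so it can be passed on directly to plotting or testing functions.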
Basics about tables (page 9): Data types
numeric: Weights, dimensions, concentrations, ...
integer: Number of offspring, ordinal and nominal data (classes), ID
character: Nominal data (classes), ID
logical: All kinds of dichotomous data
special types: Dates and times, images, ...
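As a small illustration, the basic types can be inspected in R with `typeof()`. The variable names and values below are arbitrary examples. Note that R reports the storage type of ordinary numeric values as "double".

```r
# One example value per basic data type (values are arbitrary)
weight    <- 2.75                   # numeric (stored as double)
offspring <- 4L                     # integer (note the L suffix)
site      <- "upstream"             # character
infected  <- TRUE                   # logical
sampled   <- as.Date("2020-05-17")  # special type: calendar date

types <- c(
  weight    = typeof(weight),
  offspring = typeof(offspring),
  site      = typeof(site),
  infected  = typeof(infected)
)
print(types)

# Dates are stored as numbers but carry a class attribute
print(class(sampled))
```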
Basics about tables (page 10): Tables
◮ Most common and versatile data container.
◮ Columns are vectors of a particular data type.
◮ A table row is, in general, not a vector but a list (because the column types differ).
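The difference between columns and rows can be checked directly in R. The sketch below uses an invented table `lakes`; lake names and depths are placeholders.

```r
# Toy table; lake names and depths are invented for illustration
lakes <- data.frame(
  name     = c("Lake A", "Lake B"),
  maxDepth = c(35.2, 12.0),
  natural  = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# A column is an atomic vector of a single type
col <- lakes$maxDepth
print(is.atomic(col))       # TRUE
print(typeof(col))          # "double"

# A row keeps one element per column, so it must be a list
row <- lakes[1, ]
print(is.list(row))         # TRUE
row_types <- sapply(row, typeof)
print(row_types)            # one type per column
```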
Basics about tables (page 11): Tables
Representation of tables in R
data.frame: Classic and commonly used, but its 'ugly' defaults will likely confuse beginners
tibble: Good alternative
data.table: Another alternative
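One `data.frame` default that may surprise beginners is sketched below. It was chosen here for illustration; the slide does not say which defaults it has in mind. The table `x` and its values are invented.

```r
# Invented example table
x <- data.frame(id = 1:3, maxDepth = c(35.2, 12.0, 8.5))

# Selecting a single column with [ , ] silently returns a vector ...
a <- x[, "maxDepth"]
print(is.data.frame(a))    # FALSE

# ... unless the simplification is switched off explicitly
b <- x[, "maxDepth", drop = FALSE]
print(is.data.frame(b))    # TRUE

# Selecting two columns always yields a data frame
two <- x[, c("id", "maxDepth")]
print(is.data.frame(two))  # TRUE
```

Code that works for two selected columns can therefore fail for one, which is worth keeping in mind when writing generic scripts.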
Basics about tables (page 12): Exercise: A simple data frame

    rm(list = ls())           # start from an empty workspace
    # Read a tab-separated text file with a header line;
    # keep text columns as character instead of factor
    x <- read.table(file = "data/lakedepth.txt", sep = "\t",
                    header = TRUE, stringsAsFactors = FALSE)
    print(typeof(x))          # type of object
    str(x)                    # structure (str prints by itself)
    print(lapply(x, typeof))  # type of columns
    print(head(x))            # top rows
    print(x$maxDepth)         # access a column (as a vector)
    print(x["maxDepth"])      # ... (as a one-column data frame)
    print(x[, "maxDepth"])    # ... (as a vector)
    print(x[1, ])             # access a row
Example data set (page 15): Screening a river for AMR genes
[Figure: sampling sites numbered 1 to 30, with labels POS, HIR, HER, RHG, GOM and HAU; panel title "mcr1"; legend scale from −Inf to −1]
Example data set (page 16): Summary
We sampled ...
◮ water and bottom sediment
◮ at multiple locations
◮ repeatedly, at monthly intervals
to analyze DNA extracts for ...
◮ the abundance of various antibiotic resistance genes
◮ the abundance of marker genes (e.g. 16S rRNA)
and we took physical and technical replicates.
Example data set (page 18): Why is this bad practice?
◮ Mixed information in columns and even in cells
◮ Multiple values per cell
◮ Many sub-tables on one spreadsheet
◮ Missing headers
◮ No software can read this out of the box
◮ Data soon become useless (missing headers and metadata)
How to arrange data properly (page 20): Objectives
Understand ...
◮ the main structure of a data set.
◮ how to split the data over separate tables.
◮ how individual tables are linked to each other.
◮ basic rules to achieve data integrity.
How to arrange data properly (page 25): Data dimensions
Consider the example data set (page 16). What are the major dimensions of the data?
◮ Compartment (water, sediment)
◮ Space (2-dimensional, sampling locations)
◮ Time
◮ Variable (here: the gene)
A very common case in hydro-biological field research.
If you are not sure about dimensions, imagine some plots of the data. Which item(s) would appear on the x-axis or in the legend?
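One possible way to reflect these dimensions in a table is to give each dimension its own column and to store one measured value per row. The sketch below assumes such a "long" layout. The table `obs`, the location codes, the variable name, the dates, and the numbers are all invented.

```r
# Invented observations: 2 compartments x 2 locations x 2 dates, 1 variable
obs <- data.frame(
  compartment = rep(c("water", "sediment"), times = 4),
  location    = rep(rep(c("L1", "L2"), each = 2), times = 2),
  date        = as.Date(rep(c("2020-01-15", "2020-02-15"), each = 4)),
  variable    = "geneA",
  value       = c(1.1, 2.3, 0.9, 2.8, 1.4, 2.1, 1.0, 3.0),
  stringsAsFactors = FALSE
)

# Each combination of the four dimensions should occur only once
dims <- c("compartment", "location", "date", "variable")
n_dupl <- anyDuplicated(obs[, dims])
print(n_dupl)   # 0 means: no duplicated combination

# Any two dimensions can span a cross table, e.g. for one sampling date
jan <- obs[obs$date == as.Date("2020-01-15"), ]
tab <- tapply(jan$value, list(jan$location, jan$compartment), mean)
print(tab)
```

In the cross table, one dimension forms the rows and another the columns, much like the x-axis and the legend of a plot.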
How to arrange data properly (page 29): Entities
Consider the example data set (page 16). What are the important entities?
◮ Samples
◮ Locations
◮ Compartments (dropped for simplicity)
◮ Variables
◮ Values (measured numerical properties)
This leads us to the entity-relationship model (ERM)
https://en.wikipedia.org/wiki/Entity-relationship_model
How to arrange data properly (page 32): Entities and relations
◮ Multiple values, each measured on one particular sample
◮ Multiple samples, each taken at one particular location
◮ Each value relates to just one variable
◮ ...
Relations of type 1:1 and n:m also exist and those need to be resolved (not discussed here).
How to arrange data properly (page 34): Attributes of entities
→ Attributes become table columns
How to arrange data properly (page 36): Tables and relations
◮ No orphaned records (e.g. only samples from known locations)
◮ No ambiguity (e.g. two samples cannot share the same ID)
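Both rules can be checked with a few lines of base R. The following is a minimal sketch with invented tables `locations` and `samples`; the column names `location_id` and `sample_id` are assumptions, and the data deliberately contain one violation of each rule.

```r
# Invented parent table (known locations)
locations <- data.frame(
  location_id = c("L1", "L2"),
  stringsAsFactors = FALSE
)
# Invented child table with two deliberate errors
samples <- data.frame(
  sample_id   = c(1, 2, 2, 3),
  location_id = c("L1", "L2", "L2", "L9"),
  stringsAsFactors = FALSE
)

# Orphaned records: samples whose location is missing in the parent table
orphans <- samples[!(samples$location_id %in% locations$location_id), ]
print(orphans)

# Ambiguity: sample IDs that occur more than once
dupl_ids <- unique(samples$sample_id[duplicated(samples$sample_id)])
print(dupl_ids)
```

A database system would reject such records at insertion time; in plain files, checks like these have to be run explicitly.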
How to arrange data properly (page 38): Additional constraints
◮ Each table needs a unique primary key (green color)
◮ Further columns may require uniqueness (blue color)
◮ Constraints can apply to a single column or to a set of columns
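Uniqueness over one column or over a set of columns could be tested as sketched below. The helper `check_key()` is a hypothetical function written for this example, not part of base R, and the table `values` with its column names is invented.

```r
# Invented table; the combination (sample 2, geneA) occurs twice on purpose
values <- data.frame(
  value_id  = 1:4,
  sample_id = c(1, 1, 2, 2),
  variable  = c("geneA", "geneB", "geneA", "geneA"),
  value     = c(0.5, 1.2, 0.9, 0.7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if the given column(s) contain no missing
# entries and no duplicated combination of entries
check_key <- function(tbl, cols) {
  key <- tbl[, cols, drop = FALSE]
  !any(is.na(key)) && anyDuplicated(key) == 0
}

# Single-column constraint (primary key)
pk_ok <- check_key(values, "value_id")
print(pk_ok)

# Constraint on a set of columns: one value per sample and variable
uniq_ok <- check_key(values, c("sample_id", "variable"))
print(uniq_ok)
```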
How to arrange data properly (page 39): Summary of basic steps
◮ Identify entities, attributes, and relations
◮ Optimize tables following the rules of 'normalization'
◮ Introduce single-table constraints (primary key, unique, non-emptiness) for data integrity
◮ Ensure integrity of table relations (foreign key constraints)
→ Look for courses and books on 'relational database design'
How to arrange data properly (page 40): Indicators of proper design
◮ Tables are strictly rectangular (well defined number of rows and columns)
◮ Data is self-contained (all relevant meta data included)
◮ Tables and columns have intuitive names
◮ No redundancies (eliminates risk of inconsistency)
◮ Limited number of explicit missing values (saves memory)
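The link between redundancy and inconsistency can be illustrated with a small sketch. All table names, column names, and river names below are invented. In the redundant layout an attribute of the location is repeated in every sample row; in the alternative layout it is stored once and joined on demand.

```r
# Redundant layout: the river name is repeated in every sample row
redundant <- data.frame(
  sample_id   = 1:3,
  location_id = c("L1", "L1", "L2"),
  river       = c("River X", "River X", "River Y"),
  stringsAsFactors = FALSE
)
# A correction that reaches only one of the two L1 rows ...
redundant$river[1] <- "River Z"
# ... leaves location L1 with two different river names
n_names <- tapply(redundant$river, redundant$location_id,
                  function(r) length(unique(r)))
print(n_names)

# Layout without redundancy: the attribute is stored exactly once
locations <- data.frame(
  location_id = c("L1", "L2"),
  river       = c("River X", "River Y"),
  stringsAsFactors = FALSE
)
samples <- redundant[, c("sample_id", "location_id")]
# The same correction now needs a single change ...
locations$river[locations$location_id == "L1"] <- "River Z"
# ... and every joined row is consistent
joined <- merge(samples, locations, by = "location_id")
print(joined)
```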
How to arrange data properly (page 41): Why is redundancy bad?