Data preparation & presentation
Gary Collins
EQUATOR Network, Centre for Statistics in Medicine, NDORMS, University of Oxford
EQUATOR Network – OUCAGS training course, 24 October 2015
Data preparation
Prior to ANY data analysis it is important to check your data
– often more time is spent ‘cleaning’ and examining the data than on the actual analysis
– reliable results rely on accurate and reliable data
  • garbage in, garbage out
Data accuracy can be improved by thinking about the data prospectively, before collecting them
– produce a data collection sheet (this also applies to systematic reviews)
– set up a database
– if possible, include basic checks in the database
  • often done for clinical trials
Prospective data collection
Data Definitions (CRASH dataset)

Variable    Label                             Comments                   Max length  Type    Codes
Patient ID  Six digit unique identifier       treatment box-pack number  7           string
Sex         Gender of the patient                                        1           Number  0=Male, 1=Female
DOB         Date of birth of the patient      DD/MM/YYYY                 10          Date    -7303=DOB not known
TRAND       Time of randomisation             HH:MM:SS                   8           Time
GCS_EYE     Glasgow Coma Scale: Eye opening                              1           Number  4=spontaneous, 3=to sound, 2=to pain, 1=none
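A data dictionary like the one above can double as a set of automatic entry checks. The following is a minimal sketch, assuming records arrive as dictionaries of strings. The field names (PATIENT_ID, SEX, GCS_EYE), the CODEBOOK structure and the validate_record helper are hypothetical and are not part of the CRASH database.

```python
# Hypothetical codebook mirroring a few rows of the data-definition table.
# Each entry holds the maximum field length and, where relevant, allowed codes.
CODEBOOK = {
    "PATIENT_ID": {"max_length": 7, "codes": None},
    "SEX": {"max_length": 1, "codes": {"0": "Male", "1": "Female"}},
    "GCS_EYE": {
        "max_length": 1,
        "codes": {"4": "spontaneous", "3": "to sound", "2": "to pain", "1": "none"},
    },
}


def validate_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, rule in CODEBOOK.items():
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"{field}: missing")
            continue
        if len(value) > rule["max_length"]:
            problems.append(f"{field}: longer than {rule['max_length']} characters")
        if rule["codes"] is not None and value not in rule["codes"]:
            problems.append(f"{field}: unknown code {value!r}")
    return problems


# Made-up records for illustration only.
good = {"PATIENT_ID": "123456", "SEX": "1", "GCS_EYE": "4"}
bad = {"PATIENT_ID": "12345678", "SEX": "2", "GCS_EYE": ""}

print(validate_record(good))
print(validate_record(bad))
```

Keeping the rules in one structure means the data collection sheet, the database checks and the later cleaning step can all be driven from the same definitions.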
Data preparation (after collection)
For each data type
– discrete (e.g. number of children): check for non-integer values
– continuous (e.g. height): check that values lie in a plausible range
– binary (e.g. male/female), nominal (e.g. blood type), ordinal (e.g. cancer stage): check for unlikely values
– dates: check for valid dates and impossible sequences (e.g. date of event before date of birth)
– text: most statistical software treats upper and lower case as different values by default (e.g. yes, Yes, yES, yeS, YeS, YES), so record text codes consistently
Missing data
– be consistent in how this is recorded
– check how much is missing
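The checks listed above can be sketched in a few lines. This is an illustration only: the records, the variable names and the plausible height range of 50 to 250 cm are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def parse_date(text):
    """Return a date for a valid DD/MM/YYYY string, or None if invalid."""
    try:
        return datetime.strptime(text, "%d/%m/%Y").date()
    except ValueError:
        return None


# Hypothetical raw records; every value here is made up for illustration.
rows = [
    {"children": "2", "height_cm": "172", "smoker": "Yes",
     "dob": "01/02/1960", "event": "15/03/2014"},
    {"children": "1.5", "height_cm": "17.2", "smoker": "yES",
     "dob": "31/02/1975", "event": "02/01/2015"},
    {"children": "0", "height_cm": "", "smoker": "no",
     "dob": "10/10/2015", "event": "09/09/2014"},
]

flags = []
for i, row in enumerate(rows):
    # Discrete: a count must be a whole, non-negative number.
    if not row["children"].isdigit():
        flags.append((i, "children is not a non-negative integer"))
    # Continuous: value should lie in a plausible range (assumed 50-250 cm).
    if row["height_cm"] and not 50 <= float(row["height_cm"]) <= 250:
        flags.append((i, "height outside plausible range"))
    # Dates: must be valid, and the event cannot precede birth.
    dob, event = parse_date(row["dob"]), parse_date(row["event"])
    if dob is None or event is None:
        flags.append((i, "invalid date"))
    elif event < dob:
        flags.append((i, "event before date of birth"))

# Text codes: normalise case so 'Yes', 'yES' and 'YES' count as one category.
smoker_counts = Counter(row["smoker"].strip().lower() for row in rows)

# Missing data: count empty fields per variable.
missing = Counter(key for row in rows for key, value in row.items() if value == "")

for flag in flags:
    print(flag)
print(dict(smoker_counts), dict(missing))
```

Flagged records are candidates for checking against the source documents, not for automatic correction.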
Van den Broeck et al, PLoS Med 2005
Implausible values

Cancer      Male  Female  Total
Breast         0      45     45
Colorectal    23      17     40
Lung           8       7     15
Prostate      12       1     13
Total         43      70    113
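A cross-tabulation like this one can be screened automatically for combinations that should not occur. The sketch below uses the counts from the table; the IMPOSSIBLE rule set is an assumption written for this example, and in practice such rules would come from clinical knowledge of the variables.

```python
# Counts taken from the table: (cancer, sex) -> number of patients.
counts = {
    ("Breast", "Male"): 0, ("Breast", "Female"): 45,
    ("Colorectal", "Male"): 23, ("Colorectal", "Female"): 17,
    ("Lung", "Male"): 8, ("Lung", "Female"): 7,
    ("Prostate", "Male"): 12, ("Prostate", "Female"): 1,
}

# Assumed rule for illustration: prostate cancer should not occur in females.
IMPOSSIBLE = {("Prostate", "Female")}

# Any non-empty cell that matches a rule needs checking at source.
suspicious = {cell: n for cell, n in counts.items() if cell in IMPOSSIBLE and n > 0}

# Margins should agree with the totals reported in the table.
male_total = sum(n for (_, sex), n in counts.items() if sex == "Male")
female_total = sum(n for (_, sex), n in counts.items() if sex == "Female")

print(suspicious)
print(male_total, female_total, male_total + female_total)
```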
Outliers
• Outliers will seem incompatible with the rest of the data
• May deviate from the main body of the data
• Usually extreme values (high or low)
• Can be genuine observations
• Can have considerable influence on the results
• Any suspicious values should be checked
• The decision to include or exclude outliers should be made with caution
• Provided all the measurements are valid, it is often useful to analyse with and without the outlier
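Analysing with and without a suspected outlier can be sketched as follows. The measurements are made up, and the 1.5 x IQR rule used to flag the extreme value is one common rule of thumb chosen for this illustration, not a rule taken from the slides. A flag is a prompt to check the value, not a reason to delete it.

```python
import statistics

# Hypothetical measurements; the last value is a made-up extreme observation.
values = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 19.0]


def summarise(data):
    """Return simple summary statistics for a list of numbers."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "sd": statistics.stdev(data),
    }


# Rule of thumb (an assumption here): flag points lying more than
# 1.5 x IQR below the lower quartile or above the upper quartile.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
is_outlier = [v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr for v in values]

with_outlier = summarise(values)
without_outlier = summarise([v for v, flag in zip(values, is_outlier) if not flag])

print("with:   ", with_outlier)
print("without:", without_outlier)
```

Comparing the two summaries shows how much a single value moves the mean and standard deviation, while the median changes little.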
Data presentation (exploration/presentation)
Plot the data, plot the data, plot the data!!!
– prior to any statistical analysis, look at the data!
– get familiar with the data: what do they look like?
– can highlight any errors (spurious values) in the data (part of data cleaning)
– what is the distribution of the data (univariate)?
  • do they have an approximately Normal distribution?
  • if not, can they be transformed (e.g. log transformation)?
Plot the data, plot the data, plot the data!!!
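Alongside a plot, a quick numerical check can show whether a log transformation brings skewed data closer to symmetry. This is a minimal sketch with made-up values; the skewness helper is written for the example and uses the simple moment-based definition.

```python
import math
import statistics


def skewness(data):
    """Moment-based skewness: mean cubed deviation divided by SD cubed."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(((x - mean) / sd) ** 3 for x in data) / len(data)


# Hypothetical positively skewed measurements (made up for illustration).
raw = [1.2, 1.5, 1.8, 2.0, 2.3, 2.9, 3.5, 4.8, 7.5, 15.0]

# The log transformation requires strictly positive values.
logged = [math.log(x) for x in raw]

raw_skew = skewness(raw)
log_skew = skewness(logged)

print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A skewness near zero suggests rough symmetry, but it does not replace looking at a histogram or dotplot of the transformed values.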
Unfortunately, not all data look like this
Positively skewed data (Bland & Altman. BMJ 1996)
CRASH-2 trial (distribution of baseline Glasgow Coma Score)
Results: Graphs
Statisticians like to (ideally) see the raw data!
Use graphs to describe results, for example
– Dotplots (or boxplots for large data sets)
  • good for comparing groups
– Scatter plots
  • good to accompany correlation analyses
– Survival curves
  • time-to-event analyses
– Forest plots
  • meta-analyses
What can you tell me about these data?
For each dataset, would you believe that the:
– Mean of x is 9
– Variance of x is 11
– Mean of y is 7.5
– Variance of y is about 4.12
– Correlation between x and y is 0.816
– Regression line is y = 3 + 0.5x
Called ‘Anscombe’s Quartet’
Illustrates the importance of showing your data
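The summary statistics can be recomputed directly. The sketch below uses the widely published values of Anscombe's quartet, which are not reproduced on the slide; the summarise helper is written for this example. All four datasets give nearly identical summaries even though their scatter plots look completely different.

```python
import math
import statistics

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Means, sample variances, Pearson correlation and least-squares line."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "r": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


results = {name: summarise(x, y) for name, (x, y) in quartet.items()}
for name, stats in results.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```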
Data presentation (exploration/presentation)
Where possible, plot the raw values
– there is NO reason to hide them
– let the data tell the story
Beware of ‘chartjunk’ (Edward Tufte)
– anything that distracts the viewer (including you) from the information that the graph is intended to present
– let the data speak for themselves, don’t clutter the plot
Dynamite plots
Dynamite plots: Mean=81.905 mm, Mean=77.445 mm (Schriger & Cooper. Ann Emerg Med 2001)
The same data presented differently (Schriger & Cooper. Ann Emerg Med 2001)
Summary
Think upfront about data collection
– How are the data coded?
– Be consistent throughout
– Have data checks to flag implausible values
Plot the data, plot the data, plot the data!
– Get to know your data
– Check for outliers
Avoid ‘chartjunk’
– Maximise the data:ink ratio