Data preparation & presentation, Gary Collins, EQUATOR Network


  1. Data preparation & presentation. Gary Collins, EQUATOR Network, Centre for Statistics in Medicine, NDORMS, University of Oxford. EQUATOR Network – OUCAGS training course, 24 October 2015

  2. Data preparation
     - Prior to ANY data analysis it is important to check your data
       - often more time is spent ‘cleaning’ and examining the data than on the actual analysis
       - reliable results rely on accurate and reliable data
         - garbage in, garbage out
     - Data accuracy can be improved by thinking prospectively about the data before collecting it
       - produce a data collection sheet (also for systematic reviews)
       - set up a database
       - if possible, include basic checks in the database
         - often done for clinical trials

  3. Prospective data collection

  4. Data Definitions (CRASH dataset)

     | Variable | Label | Comments | Max length | Type | Codes |
     |---|---|---|---|---|---|
     | Patient ID | Six digit unique identifier | Treatment box-pack number | 7 | String | |
     | Sex | Gender of the patient | | 1 | Number | 0=Male; 1=Female |
     | DOB | Date of birth of the patient | DD/MM/YYYY | 10 | Date | -7303=DOB not known |
     | TRAND | Time of randomisation | HH:MM:SSS | 8 | Time | |
     | GCS_EYE | Glasgow Coma Scale: Eye opening | | 1 | Number | 4=spontaneous; 3=to sound; 2=to pain; 1=none |
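
A data dictionary of this kind can also drive automated checks at data entry. The following is a minimal sketch in Python, assuming each record arrives as a dictionary of raw strings. The field names `patient_id`, `sex` and `gcs_eye`, the `DATA_DICTIONARY` structure and the function `validate_record` are hypothetical illustrations, not part of the CRASH database.

```python
# Hypothetical data dictionary modelled on the definitions above.
# "codes" is None when any value up to max_length is allowed.
DATA_DICTIONARY = {
    "patient_id": {"max_length": 7, "codes": None},
    "sex": {"max_length": 1, "codes": {"0": "Male", "1": "Female"}},
    "gcs_eye": {
        "max_length": 1,
        "codes": {"4": "spontaneous", "3": "to sound", "2": "to pain", "1": "none"},
    },
}


def validate_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, rule in DATA_DICTIONARY.items():
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"{field}: missing")
            continue
        if len(value) > rule["max_length"]:
            problems.append(f"{field}: longer than {rule['max_length']} characters")
        if rule["codes"] is not None and value not in rule["codes"]:
            problems.append(f"{field}: unknown code {value!r}")
    return problems
```

Calling `validate_record` on every row before analysis gives a list of queries to send back to whoever collected the data, rather than silently dropping or correcting values.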

  5. Data preparation (after collection)
     - For data types
       - discrete (e.g. number of children): check for non-integer values
       - continuous (e.g. height): check values lie in a plausible range
       - binary (e.g. male/female), nominal (e.g. blood type), ordinal (e.g. cancer stage): check for unlikely values
       - dates: check for valid dates, and for events dated before the date of birth
       - text entries: most statistical software is case-sensitive by default, so variants such as yes, Yes, yES, yeS, YeS and YES are treated as different values
     - Missing data
       - consistency in how this is recorded
       - how much is missing
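
The checks listed on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions: the column names (`children`, `height_cm`, `event_date`, `dob`), the plausible height range of 50 to 250 cm, ISO-formatted dates and the set of missing-value codes are all hypothetical choices, not taken from the slides.

```python
from datetime import date

# Assumed conventions for how missing values might have been recorded.
MISSING_CODES = {"", "NA", "N/A", "."}


def check_row(row):
    """Flag common problems in one row (a dict of raw string values)."""
    flags = []
    # Discrete: number of children must be a whole, non-negative number.
    children = float(row["children"])
    if children < 0 or children != int(children):
        flags.append("children is not a non-negative integer")
    # Continuous: height in cm should lie in a plausible range (assumed 50-250).
    if not 50 <= float(row["height_cm"]) <= 250:
        flags.append("height outside plausible range")
    # Dates: fromisoformat rejects invalid dates; the event cannot precede birth.
    if date.fromisoformat(row["event_date"]) < date.fromisoformat(row["dob"]):
        flags.append("event before date of birth")
    return flags


def normalise_yes_no(value):
    """Collapse case and whitespace variants such as 'yES' and ' Yes '."""
    return value.strip().lower()


def count_missing(values):
    """Count entries recorded with any of the assumed missing codes."""
    return sum(1 for v in values if v.strip().upper() in MISSING_CODES)
```

Flagging rather than correcting keeps the original entry available for checking against the source document.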

  6. Van den Broeck et al, PLoS Med 2005

  7. Implausible values

     | Cancer | Male | Female | Total |
     |---|---|---|---|
     | Breast | 0 | 45 | 45 |
     | Colorectal | 23 | 17 | 40 |
     | Lung | 8 | 7 | 15 |
     | Prostate | 12 | 1 | 13 |
     | Total | 43 | 70 | 113 |
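
A cross-tabulation like this can be screened automatically for combinations that should not occur. The sketch below uses the counts from the slide. The rule set `QUERY_IF_PRESENT` and the function `implausible_cells` are hypothetical, and the single rule shown (prostate cancer recorded for a female patient) is an assumption about which cell the slide is highlighting.

```python
# Counts from the slide: (cancer, sex) -> number of patients.
counts = {
    ("Breast", "Male"): 0, ("Breast", "Female"): 45,
    ("Colorectal", "Male"): 23, ("Colorectal", "Female"): 17,
    ("Lung", "Male"): 8, ("Lung", "Female"): 7,
    ("Prostate", "Male"): 12, ("Prostate", "Female"): 1,
}

# Assumed rule set: combinations that should be queried if any record has them.
QUERY_IF_PRESENT = {("Prostate", "Female")}


def implausible_cells(table, rules):
    """Return the rule cells that contain at least one observation."""
    return sorted(cell for cell in rules if table.get(cell, 0) > 0)


def column_total(table, sex):
    """Sum the counts for one sex, to cross-check the reported margins."""
    return sum(n for (_, s), n in table.items() if s == sex)
```

Here `implausible_cells(counts, QUERY_IF_PRESENT)` returns the prostate/female cell, which would then be checked against the patient's record rather than deleted.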

  8. Outliers
     - Outliers will seem incompatible with the rest of the data
     - May deviate from the main body of the data
     - Usually extreme values (high or low)
     - Can be genuine observations
     - Can have considerable influence on the results
     - Any suspicious values should be checked
     - The decision to include or exclude outliers should be made with caution
     - Provided all the measurements are valid, it is often useful to analyse with and without the outlier
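
Analysing with and without a suspect value can be sketched as follows. The slides do not name a rule for flagging outliers, so this example swaps in Tukey's fences (1.5 times the interquartile range beyond the quartiles) purely as an illustration. The function names and the height data are made up.

```python
from statistics import mean, quantiles


def tukey_fences(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def summarise_with_and_without(values):
    """Report flagged values and the mean with and without them."""
    low, high = tukey_fences(values)
    flagged = [v for v in values if v < low or v > high]
    kept = [v for v in values if low <= v <= high]
    return {
        "flagged": flagged,
        "mean_all": mean(values),
        "mean_without": mean(kept),
    }


# Made-up heights in cm; 1780 looks like a value entered in mm.
heights = [160, 162, 165, 167, 168, 170, 171, 173, 175, 1780]
result = summarise_with_and_without(heights)
```

The flagged value is reported, not removed from the dataset: whether to exclude it is a decision to make after checking the source record.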

  9. Data presentation (exploration/presentation)
     - Plot the data, plot the data, plot the data!!!
       - prior to any statistical analysis, look at the data!
       - get familiar with the data: what does it look like?
       - plotting can highlight errors (spurious values) in the data (part of data cleaning)
       - what is the distribution of the data (univariate)?
         - do they have an approximately Normal distribution?
         - if not, can they be transformed (e.g. log transformation)?
     - Plot the data, plot the data, plot the data!!!
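
The effect of a log transformation on skewed data can be checked numerically as well as by eye. This is a minimal sketch using made-up positive values and a simple moment-based skewness coefficient (no small-sample bias correction). The function name `skewness` is an illustrative choice, and a plot of the data before and after transformation remains the better first step.

```python
import math
from statistics import mean


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up positively skewed measurements (all strictly positive).
values = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
logged = [math.log(v) for v in values]

raw_skew = skewness(values)
log_skew = skewness(logged)
```

For these values the skewness is clearly smaller on the log scale, which is the pattern to look for before assuming approximate Normality.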

  10. Unfortunately, not all data look like this

  11. Positively skewed data (Bland & Altman. BMJ 1996)

  12. CRASH-2 trial (distribution of baseline Glasgow Coma Score)

  15. Results: Graphs
      - Statisticians like to (ideally) see the raw data!
      - Use graphs to describe results, for example
        - Dotplots (or boxplots for large data sets): good for comparing groups
        - Scatter plots: good to accompany correlation analyses
        - Survival curves: time-to-event analyses
        - Forest plots: meta-analyses

  16. What can you tell me about these data? For each dataset would you believe that the:
      - Mean of x is 9
      - Variance of x is 11
      - Mean of y is 7.5
      - Variance of y is 4.122
      - Correlation between x and y is 0.816
      - Regression line: y = 3 + 0.5x

      Called ‘Anscombe’s Quartet’. Illustrates the importance of showing your data.
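
The summary statistics quoted for Anscombe's Quartet can be reproduced directly. The sketch below uses the published Anscombe (1973) values and only the Python standard library. The helper `summarise` is a hypothetical name, and the results agree with the slide's figures only to the rounding shown there.

```python
from statistics import mean, variance

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Means, sample variances, Pearson r and least-squares line for y on x."""
    mx, my = mean(x), mean(y)
    # Sample covariance between x and y.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = sxy / variance(x)
    return {
        "mean_x": mx, "var_x": variance(x),
        "mean_y": my, "var_y": variance(y),
        "r": sxy / (variance(x) * variance(y)) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }


summaries = {name: summarise(x, y) for name, (x, y) in ANSCOMBE.items()}
```

All four datasets give near-identical summaries, yet scatter plots of them look completely different (a linear trend, a curve, a line with one outlier, and a vertical cluster with one influential point), which is why the numbers alone are not enough.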

  17. Data presentation (exploration/presentation)
      - Where possible, plot the raw values
        - there is NO reason to hide them
        - let the data tell the story
      - Beware of ‘chartjunk’ (Edward Tufte)
        - anything that distracts the viewer (including you) from the information that the graph is intended to present
        - let the data speak for themselves, don’t clutter the plot

  18. Dynamite plots

  19. Dynamite plots

  20. Dynamite plots: Mean=81.905 mm; Mean=77.445 mm (Schriger & Cooper. Ann Emerg Med 2001)

  21. The same data presented differently (Schriger & Cooper. Ann Emerg Med 2001)
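
A dynamite plot shows only a bar height (the mean) and an error bar, so samples with very different shapes can produce identical plots. The following sketch uses two made-up samples, not the Schriger & Cooper data, with the same mean and standard deviation. The helpers `bar_summary` and `text_dotplot` are hypothetical names, and the text dot plot is a stand-in for a proper graphical one.

```python
from statistics import mean, pstdev

group_a = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up, single cluster with a tail
group_b = [3, 3, 3, 3, 7, 7, 7, 7]  # made-up, two separate clusters


def bar_summary(values):
    """The only two numbers a dynamite plot conveys: mean and SD."""
    return mean(values), pstdev(values)


def text_dotplot(values):
    """One row per integer value, one 'o' per observation."""
    lines = []
    for v in range(min(values), max(values) + 1):
        lines.append(f"{v:>2} | " + "o" * values.count(v))
    return "\n".join(lines)
```

Both groups have mean 5 and standard deviation 2, so their bars would be indistinguishable, while `print(text_dotplot(group_b))` immediately shows that no observation in the second group is anywhere near the mean.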

  23. Summary
      - Think upfront about data collection
        - How are the data coded?
        - Be consistent throughout
        - Have data checks to flag implausible values
      - Plot the data, plot the data, plot the data!
        - Get to know your data
        - Check for outliers
      - Avoid ‘chartjunk’
        - Maximise the data-ink ratio
