LARGE DATASETS rogier.kievit@mrc-cbu.cam.ac.uk/@rogierK
Outline • 1) What is big data? • 2) Why bother? • 3) Where to find large datasets • 4) Challenges, pitfalls and opportunities
Big data? • ‘The Australian Square Kilometre Array Pathfinder (ASKAP) project currently acquires 7.5 terabytes/second of sample image data, a rate projected to increase 100-fold to 750 terabytes/second (~25 zettabytes per year) by 2025’
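As a quick sanity check on the quoted figures, the per-second rates can be converted to a yearly volume. This is only a sketch: it assumes decimal SI units (1 terabyte = 10^12 bytes, 1 zettabyte = 10^21 bytes) and a 365.25-day year, neither of which is stated in the quote, and the function name is made up for illustration.

```python
# Assumptions (not from the quote): decimal SI units and a 365.25-day year.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
BYTES_PER_TERABYTE = 10**12
BYTES_PER_ZETTABYTE = 10**21


def zettabytes_per_year(terabytes_per_second: float) -> float:
    """Convert a sustained rate in TB/s into a yearly volume in ZB."""
    bytes_per_year = terabytes_per_second * BYTES_PER_TERABYTE * SECONDS_PER_YEAR
    return bytes_per_year / BYTES_PER_ZETTABYTE


current = zettabytes_per_year(7.5)    # rate quoted as current
projected = zettabytes_per_year(750)  # rate quoted as projected for 2025

print(f"current:   {current:.2f} ZB/year")
print(f"projected: {projected:.2f} ZB/year")
```

Under these assumptions the projected rate comes out at roughly 24 zettabytes per year, in line with the quoted ~25.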
‘Well if I found an effect in a small sample, then there must be something there right?’
Why bother? (BF10 = 5.26 × 10^8315) • 1) It’s (almost) free • 2) More statistical power is always better • 3) Reproducible • 4) Generalizability/replicability • 5) Inspires new questions • 6) Look beyond your current domain • 7) Develop/apply/test new methods on real data
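To make the statistical power point concrete, here is a minimal sketch of how power grows with sample size for a two-group comparison. It uses a normal approximation rather than an exact t-test calculation, and the effect size (d = 0.2), the alpha level, and the sample sizes are all hypothetical values chosen for illustration, not figures from the talk.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size_d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison.

    Normal approximation: the test statistic is treated as normally
    distributed with mean d * sqrt(n / 2) under the alternative.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect is d.
    noncentrality = effect_size_d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    upper_tail = std_normal.cdf(noncentrality - z_crit)
    lower_tail = std_normal.cdf(-noncentrality - z_crit)
    return upper_tail + lower_tail


# Hypothetical small effect (d = 0.2) at increasing sample sizes.
for n in (20, 200, 2000, 20000):
    print(f"n per group = {n:>6}: power ≈ {approx_power(0.2, n):.3f}")
```

With a small effect, a sample of 20 per group has little chance of detecting it, while samples in the thousands detect it almost every time.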
Procedure • 1) Find suitable data • 2) Apply • 3) Wait • 4) (Wait some more) • 5) Data!
Open data types: Databases (cognitive neuro)
Dataset        Sample size   Cost     Age     Data
Biobank        500,000       £2000    43-73   everything
ABCD           10,000        free     9-11    cognitive, neural, mental health
HCP            1,000         <£1000   21-35   cognitive, neural, mental health
IMAGEN         1,500         free     0-3     cognitive, behavioural, some neural
PNC            800           free     11–17   cognitive, behavioural, neural
NKI Rockland   800           free     6-18    cognitive, behavioural, some neural
• OASIS, ADNI, HABS, ENIGMA, and many more • Reach out online/Google data
Integration with other open science practices? • Data sharing: By definition • Preregistration: Possible • Reproducibility
Large public datasets in practice
• https://openpsychometrics.org/_rawdata/ • Freely downloadable • e.g.: Stress, anxiety, depression • N=48,000 in 5 seconds • A (demanding) model fit excellently • Personality and demographic covariates explained >50% (!) of the variance in depression/anxiety/stress
Jacobucci, R., Brandmaier, A. M., & Kievit, R. A. (2018). Variable selection in structural equation models with regularized MIMIC models. In press, AMPPS.
Big data by leveraging technical tools • ‘Math Garden’ • Incentivisation • Free ‘participation’ • Accessible through signed form • ‘improving an online practice environment for math, currently containing over a billion responses’
Brinkhuis, M., Savi, A., Hofman, A., Coomans, F., van der Maas, H., & Maris, G. (2018). Learning As It Happens: A Decade of Analyzing and Shaping a Large-Scale Online Learning System.
Survey of Health, Ageing and Retirement in Europe • Freely and easily available • N=111,000 (!) in 60 minutes • 6 waves • 27 European countries and Israel
Cognitive health • Immediate recall of a word list of 10 words • 0-10, 4 waves • Proportion remembered
Mental health • EURO-D scale • Depressive symptoms • Inverted so that higher scores -> better mental health
Fit a series of complex growth models • Decline in mental health vs. decline in memory: r=.94
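The two outcome measures on this slide could be scored roughly as follows. This is a sketch, not the analysis code behind the results: the function names are hypothetical, and the EURO-D maximum of 12 is an assumption about the scale's range that the slide does not state.

```python
WORD_LIST_LENGTH = 10  # from the slide: word list of 10 words, scored 0-10
EURO_D_MAX = 12        # assumption: EURO-D total score ranges from 0 to 12


def proportion_remembered(words_recalled: int, list_length: int = WORD_LIST_LENGTH) -> float:
    """Turn a raw immediate-recall count into a proportion between 0 and 1."""
    if not 0 <= words_recalled <= list_length:
        raise ValueError(f"recall count must be between 0 and {list_length}")
    return words_recalled / list_length


def inverted_euro_d(symptom_score: int, max_score: int = EURO_D_MAX) -> int:
    """Reverse a depressive-symptom score so that higher means better mental health."""
    if not 0 <= symptom_score <= max_score:
        raise ValueError(f"symptom score must be between 0 and {max_score}")
    return max_score - symptom_score


# Hypothetical respondent: recalled 7 of 10 words, reported 3 depressive symptoms.
print(proportion_remembered(7))
print(inverted_euro_d(3))
```

Scoring both measures in the same direction (higher is better) means that declines in memory and declines in mental health have the same sign, which makes their correlation easier to read.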
• Case study: Biobank • Preregistered: age-related decline (3 waves) in fluid intelligence • Data access, acquisition: All excellent • But cognitive data: • Not 3 waves • Not fluid reasoning • Self-paced • Ceiling effects • Floor effects • Easy to remember • No slope variance in N=160,000 • At the mercy of the data available
[Images: ‘Me preregistering’ vs. ‘Me now’]
Kievit, R. A., Fuhrmann, D., Borgeest, G. S., Simpson-Kent, I. L., & Henson, R. N. (2018). The neural determinants of age-related changes in fluid intelligence: a pre-registered, longitudinal analysis in UK Biobank. Wellcome Open Research, 3.
• 1) Time • 2) Effort • 3) Requirements
Beyond CBU: 36 emails… 10 phone calls… Within CBU: 3 months…
- Anyone who shares an office with you has to sign an NDA to get a single signature.
- The computer cannot be on if anybody who has NOT signed the NDA is in the same room.
- The computer with the data cannot be connected to the internet or the CBU network.
- You have to enter a password every time you load the data.
Interim summary • Many benefits for researchers • Power, replication, precision • Widening access • Enrich existing paradigms • Learning/teaching data analysis • But… Where does this data come from?
18 citations
Cam-CAN data portal • 400 downloads • Managed access
Summary • There is an ocean of data out there • It can be your primary focus, or complement other (e.g. experimental) work
Benefits • Power • Generalisability • Extensions/scope
Challenges • Cost (modest) • Time/effort (negligible, relatively) • Suitability (low to high) • Slanted towards individual differences/epidemiological (some experimental exists!)
• Your data can, and where possible should, contribute to the ecosystem • Adapt your ethics forms to allow sharing • Don’t decide which data is valuable enough
Questions? rogier.kievit@mrc-cbu.cam.ac.uk/@rogierK