LARGE DATASETS rogier.kievit@mrc-cbu.cam.ac.uk/@rogierK


  1. LARGE DATASETS rogier.kievit@mrc-cbu.cam.ac.uk/@rogierK

  2. Outline • 1) What is big data? • 2) Why bother? • 3) Where to find large datasets • 4) Challenges, pitfalls and opportunities

  3. Big data? • ‘The Australian Square Kilometre Array Pathfinder (ASKAP) project currently acquires 7.5 terabytes/second of sample image data, a rate projected to increase 100-fold to 750 terabytes/second (~25 zettabytes per year) by 2025’
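The per-year figure in the quote can be sanity-checked with a little arithmetic. The sketch below assumes decimal units (1 zettabyte = 10^9 terabytes) and continuous acquisition all year; both are assumptions for illustration, not details given on the slide.

```python
# Rough check of the ASKAP numbers quoted on the slide.
# Assumptions (not from the slide): decimal units and year-round acquisition.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
TB_PER_ZB = 1e9  # 1 ZB = 10^21 bytes, 1 TB = 10^12 bytes


def zettabytes_per_year(terabytes_per_second: float) -> float:
    """Convert a sustained data rate in TB/s into ZB per year."""
    return terabytes_per_second * SECONDS_PER_YEAR / TB_PER_ZB


current_zb = zettabytes_per_year(7.5)
projected_zb = zettabytes_per_year(750)

print(f"current:   {current_zb:.2f} ZB/year")
print(f"projected: {projected_zb:.2f} ZB/year")
```

Under these assumptions the projected rate comes out a little under 25 zettabytes per year, which is in line with the rounded figure in the quote.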

  4. Outline • 1) What is big data? • 2) Why bother? • 3) Where to find them • 4) Challenges, pitfalls and opportunities

  5. ‘Well if I found an effect in a small sample, then there must be something there right?’

  6. Why bother • BF₁₀ = 5.26 × 10^8315 • 1) It’s (almost) free • 2) More statistical power is always better • 3) Reproducible • 4) Generalizability/replicability • 5) Inspires new questions • 6) Look beyond your current domain • 7) Develop/apply/test new methods on real data
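To make the power argument concrete, here is a minimal sketch of how the power to detect a correlation grows with sample size. It uses the standard Fisher z approximation for a two-sided test at alpha = .05; the effect size r = 0.1 and the sample sizes are illustrative assumptions, not values from the talk.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def correlation_power(r: float, n: int, z_crit: float = 1.959964) -> float:
    """Approximate power of a two-sided test of rho = 0 (alpha = .05 by default).

    Uses the Fisher z transformation: atanh(r) is approximately normal
    with standard error 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    effect = math.atanh(r) * math.sqrt(n - 3)
    return normal_cdf(effect - z_crit) + normal_cdf(-effect - z_crit)


# Illustrative (assumed) effect size and sample sizes.
for n in (30, 300, 1000, 10000):
    print(f"N={n:>6}: power to detect r=0.1 is about {correlation_power(0.1, n):.2f}")
```

With a small true effect, a sample of a few dozen has very little chance of detecting it, while samples in the thousands detect it almost always.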

  7. Procedure • 1) Find suitable data • 2) Apply • 3) Wait • 4) (Wait some more) • 5) Data!

  8. Open data types: Databases (cognitive neuro)
     • Biobank: sample size 500,000; cost £2000; age 43-73; data: everything
     • ABCD: sample size 10,000; cost free; age 9-11; data: cognitive, neural, mental health
     • HCP: sample size 1000; cost <£1000; age 21-35; data: cognitive, neural, mental health
     • IMAGEN: sample size 1500; cost free; age 0-3; data: cognitive, behavioural, some neural
     • PNC: sample size 800; cost free; age 11–17; data: cognitive, behavioural, neural
     • NKI Rockland: sample size 800; cost free; age 6-18; data: cognitive, behavioural, some neural
     • Reach out online/Google data
     • OASIS, ADNI, HABS, ENIGMA, and many more

  9. Integration with other open science practices? • Data sharing: By definition • Preregistration: Possible • Reproducibility

  10. Large public datasets in practice

  11. • https://openpsychometrics.org/_rawdata/ • Freely downloadable • e.g.: Stress, anxiety, depression • N=48,000 in 5 seconds • (Demanding) model fit excellently • Personality and demographic covariates explained >50% (!) of the variance in depression/anxiety/stress • Jacobucci, R., Brandmaier, A. M., & Kievit, R. A. (2018). Variable selection in structural equation models with regularized MIMIC Models. In press, AMPPS

  12. Big data by leveraging technical tools • ‘Math Garden’ • Incentivisation • Free ‘participation’ • Accessible through signed form • ‘improving an online practice environment for math, currently containing over a billion responses’ • Brinkhuis, M., Savi, A., Hofman, A., Coomans, F., van der Maas, H., & Maris, G. (2018). Learning As It Happens: A Decade of Analyzing and Shaping a Large-Scale Online Learning System.

  13. Survey of Health, Ageing and Retirement in Europe • Freely and easily available • N=111,000 (!) in 60 minutes • 6 waves • 27 European countries and Israel • Fit a series of complex growth models • Cognitive health: immediate recall of a word list of 10 words; 0-10, 4 waves; proportion remembered • Mental health: EURO-D scale; depressive symptoms; inverted so that higher scores -> better mental health • [Figure: decline in mental health vs. decline in memory, r=.94]
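The two scoring steps mentioned on this slide (proportion remembered, and inverting the depression score) can be sketched as below. This is an illustration only: the function names are hypothetical, and the 0-12 range for EURO-D is an assumption based on it being a 12-item scale; the exact scoring used in the talk is not given.

```python
EURO_D_MAX = 12  # assumption: EURO-D total score ranges from 0 to 12


def proportion_remembered(words_recalled: int, list_length: int = 10) -> float:
    """Convert a 0-10 immediate recall count into a proportion."""
    if not 0 <= words_recalled <= list_length:
        raise ValueError("words_recalled must be between 0 and list_length")
    return words_recalled / list_length


def inverted_euro_d(score: int, max_score: int = EURO_D_MAX) -> int:
    """Reverse the scale so that higher values mean better mental health."""
    if not 0 <= score <= max_score:
        raise ValueError("score must be between 0 and max_score")
    return max_score - score


# Hypothetical respondent: recalled 7 of 10 words, EURO-D score of 3.
print(proportion_remembered(7))
print(inverted_euro_d(3))
```

Putting both measures on a "higher is better" footing means that declines in memory and declines in mental health point in the same direction when their slopes are compared.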

  14. • Case study: Biobank • Age-related decline (3 waves) in fluid intelligence • Data access, acquisition: All excellent • But cognitive data: • Not 3 waves • Not fluid reasoning • Self-paced • Ceiling effects • Floor effects • Easy to remember • No slope variance in N=160,000 • At the mercy of the data available • [Images: ‘Me preregistering’ / ‘Me now’] • Kievit, R. A., Fuhrmann, D., Borgeest, G. S., Simpson-Kent, I. L., & Henson, R. N. (2018). The neural determinants of age-related changes in fluid intelligence: a pre-registered, longitudinal analysis in UK Biobank. Wellcome Open Research, 3.

  15. • 1) Time • 2) Effort • 3) Requirements • Beyond CBU: 36 emails… 10 phone calls… • Within CBU: 3 months… - Anyone who shares an office with you has to sign an NDA to get a single signature. - The computer cannot be on if anybody who has NOT signed the NDA is in the same room. - The computer with the data cannot be connected to the internet or the CBU network. - You have to enter a password every time you load the data.

  16. Interim summary • Many benefits for researchers • Power, replication, precision • Widening access • Enrich existing paradigms • Learning/teaching data analysis • But… Where does this data come from?

  17. 18 citations

  18. Cam-CAN data portal • 400 downloads • Managed access

  19. Summary • There is an ocean of data out there • It can be your primary focus, or complement other (e.g. experimental) work • Benefits: Power; Generalisability; Extensions/scope • Challenges: Cost (modest); Time/effort (negligible, relatively); Suitability (low to high); Slanted towards individual differences/epidemiological (some experimental exists!) • Your data can, and where possible should, contribute to the ecosystem • Adapt your ethics forms to allow sharing • Don’t decide which data is valuable enough

  20. Questions? rogier.kievit@mrc-cbu.cam.ac.uk/@rogierK
