A FRAMEWORK FOR STATISTICAL ANALYSIS OF MASS SPECTROMETRY IMAGING EXPERIMENTS Kylie Bemis Purdue University Department of Statistics
OUTLINE • Statement of the problem • Biotechnological problem • Statistical and computational problem • Statement of contributions • Open-source software • matter: Rapid prototyping with data on disk • Cardinal: Statistical toolbox for mass spectrometry imaging experiments • Statistical methods • Spatial shrunken centroids • Evaluation and case studies • Summary • Conclusions • Future work
MASS SPECTROMETRY IMAGING Investigate spatial distribution of analytes • Scan with laser/spray • Collect mass spectra • Reconstruct ion images • Data “cube” [Figure credit: R. Graham Cooks and lab]
BIOTECHNOLOGICAL PROBLEM • Rapidly advancing technology • Increasing mass resolutions • Greater mass accuracy and range • More features (larger P) • Increasing spatial resolutions • Approaching 1 µm resolution • More pixels (larger N) • More complex experiments • 3D experiments • Time-course experiments • Increasing sample size • More biological replicates • More pixels (larger N)
STATISTICAL & COMPUTATIONAL PROBLEM • Complex, high-dimensional data • Spatial x, y dimensions • Potentially z, t dimensions • Mass spectral features (m/z values) • Correlation structures • Spatial (and possibly temporal) • Between mass spectral features • Increasing mass and spatial resolutions • Larger(-than-memory) datasets • Can range from 100 MB to 100 GB • Experimental design • Variation across samples and slides • What counts as a replicate?
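The quoted range of dataset sizes follows from simple arithmetic on N pixels and P features. A minimal sketch, assuming a dense pixels-by-features matrix of 8-byte values; the function name and the two experiment sizes are hypothetical, chosen only to show the scaling:

```python
def dense_size_gb(n_pixels: int, n_features: int, bytes_per_value: int = 8) -> float:
    """Size in GB (10^9 bytes) of a dense pixels-by-features intensity matrix."""
    return n_pixels * n_features * bytes_per_value / 1e9

# Hypothetical experiment sizes (not taken from any real dataset).
small = dense_size_gb(10_000, 5_000)     # modest spatial and mass resolution
large = dense_size_gb(500_000, 25_000)   # high spatial and mass resolution

print(f"small experiment: {small:.1f} GB")
print(f"large experiment: {large:.1f} GB")
```

Because size grows with the product N × P, improving both resolutions at once is what pushes an experiment past available memory.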
PROBLEM STATEMENT • Biotechnological problem • Mass spectrometry (MS) imaging has advanced at a rapid pace • Computational tools have not advanced at a comparable pace • Lack of free, open-source tools for statistical analysis • Need for classification/segmentation with statistical inference: • Classification: Classify pixels based on their mass spectral profiles into pre-defined classes (such as healthy/disease status) • Segmentation: Assign pixels to newly discovered segments with relatively homogeneous and distinct mass spectral profiles • Select a subset of informative mass spectral features • Statistical and computational problem • MS imaging experiments result in complex, high-dimensional data • Spatial structure in datasets with large P and large N • Statistical computing on larger-than-memory data is a challenge
STATEMENT OF CONTRIBUTIONS • Statistical methods: spatial shrunken centroids • Classification and segmentation for MS imaging experiments • Probabilistic model using spatial information • Selection of most informative mass spectral features • Open-source software: Cardinal • Free, open-source R package for MS imaging experiments • Full pipeline including processing, visualization, and statistical analysis • For experimentalists, provides accessible statistical methods • For statisticians, provides infrastructure for method development • Open-source software: matter • Free, open-source R package for rapid prototyping with data on disk • Flexible statistical computing and method development for larger-than-memory datasets • Enables Cardinal to scale to high-resolution, high-throughput MS imaging experiments • Evaluation and case studies • Public datasets and reproducible results in CardinalWorkflows • Community impact of this work
PROBLEM: LARGER-THAN-MEMORY DATA challenges statistical method development • MS imaging experiments rapidly advancing • Increasing mass and spatial resolutions • Larger sample sizes, multiple files • Growing data size poses difficulty for statistics • Need to test methods on larger-than-memory data • Need to work with domain-specific formats • Current R solutions are inflexible
[Figure: 3D ion images at m/z = 715.03 for t = 4, t = 8, and t = 11 (axes x, y, z); Cardinal help Google group]
CONTRIBUTION: MATTER open-source statistical computing with data on disk • Work with larger-than-memory datasets on disk in R • Emphasizes flexibility with a minimal memory footprint • Adaptable to more datasets than bigmemory and ff • Potentially slower computation • Designed for statistical method development in R • Rapid prototyping with minimal additional effort • Works with many existing algorithms • Efficient calculation of summary statistics • Infrastructure for statistical computing on large data
[Figure: a matter object composed of Atoms 1 to 6, which map onto storage in Files 1 to 3]
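The idea behind computing summary statistics with a small memory footprint can be sketched as a streaming pass over a file. This is a minimal illustration in Python using only the standard library, not matter's implementation or API; the file layout (row-major 8-byte floats, one spectrum per row), the function name `mean_spectrum`, and the toy data are all assumptions made for the example:

```python
import array
import os
import tempfile

def mean_spectrum(path, n_features, chunk_rows=2):
    """Mean of each feature over all pixels, holding only chunk_rows spectra in memory."""
    row_bytes = n_features * 8           # assumed layout: 8-byte floats, one spectrum per row
    totals = [0.0] * n_features
    n_rows = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_rows * row_bytes)
            if not buf:
                break
            chunk = array.array("d")
            chunk.frombytes(buf)
            # Add each spectrum in the chunk to the running per-feature totals.
            for start in range(0, len(chunk), n_features):
                for j in range(n_features):
                    totals[j] += chunk[start + j]
                n_rows += 1
    return [t / n_rows for t in totals]

# Toy data: 5 pixels by 3 features, written to a temporary file.
spectra = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0],
           [4.0, 40.0, 400.0], [5.0, 50.0, 500.0]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spectra.bin")
    with open(path, "wb") as f:
        array.array("d", [v for row in spectra for v in row]).tofile(f)
    means = mean_spectrum(path, n_features=3)

print(means)
```

Peak memory depends on `chunk_rows`, not on the number of pixels, which is the trade-off the slide describes: a smaller footprint in exchange for potentially slower computation.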
NEED TO WORK WITH MS IMAGING FILES e.g., “processed” and “continuous” imzML • Open-source format for MS imaging experiments • XML metadata file defines binary data file structure • Binary data schema is incompatible with bigmemory and ff • Prefer to avoid additional file conversion • Need random access into different parts of the file • Often one-sample-per-file • Need to seamlessly work with multiple files in an experiment • Each file can be very large • matter solves these problems
[Figure: binary file layouts. “processed” imzML: UUID, mzArray 1, intensityArray 1, mzArray 2, intensityArray 2, mzArray 3, intensityArray 3. “continuous” imzML: UUID, mzArray, intensityArray 1, intensityArray 2, intensityArray 3, intensityArray 4, intensityArray 5.]
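Random access into a "continuous"-style binary file can be sketched with byte offsets. The sketch below is a simplified stand-in, not an imzML parser: it assumes a 16-byte UUID, one shared m/z array of 8-byte floats, then one array of 4-byte floats per pixel, all little-endian. In a real experiment the offsets and lengths would come from the XML metadata file; here the hypothetical `index` list plays that role:

```python
import os
import struct
import tempfile
import uuid

def read_array(f, offset, length, fmt="f"):
    """Read `length` little-endian values of struct type `fmt` starting at byte `offset`."""
    f.seek(offset)
    size = struct.calcsize("<" + fmt)
    return list(struct.unpack(f"<{length}{fmt}", f.read(length * size)))

# Toy data: 3 m/z values shared by 4 pixels (values chosen to be exact in 4-byte floats).
mz = [100.0, 200.0, 300.0]
spectra = [[1.0, 0.0, 2.5], [0.5, 4.0, 0.0], [3.0, 1.5, 0.25], [0.0, 0.0, 8.0]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.ibd")
    index = []  # (offset, length) per pixel; stands in for the XML metadata
    with open(path, "wb") as f:
        f.write(uuid.uuid4().bytes)              # 16-byte UUID header
        mz_offset = f.tell()
        f.write(struct.pack("<3d", *mz))         # shared m/z array
        for s in spectra:
            index.append((f.tell(), len(s)))
            f.write(struct.pack("<3f", *s))      # one intensity array per pixel

    # Jump straight to pixel 2 without reading the pixels before it.
    with open(path, "rb") as f:
        mz_read = read_array(f, mz_offset, len(mz), "d")
        offset, length = index[2]
        pixel2 = read_array(f, offset, length, "f")

print(mz_read, pixel2)
```

A "processed"-style file would need two index entries per pixel, one for its own m/z array and one for its intensity array, which is why a fixed rows-by-columns binary schema does not fit.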
FLEXIBLE ACCESS TO DATA ON DISK any binary format, any file structure • User-defined file structure • Data can come from anywhere • Any part of a file • Any combination of files • Representation in R can be different from on disk • Access as ordinary R vector/matrix • No need to worry about data size or memory management
[Figure: File 1 holds Metadata and Columns A to D; File 2 holds Metadata and Columns E to H; a matter matrix is assembled from Columns A, C, F, and H]
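The matrix assembled from columns in two files can be imitated in a few lines. This is a hedged Python analogy using only the standard library, not matter's R interface; the classes `DiskColumn` and `VirtualMatrix`, the 8-byte metadata header, and the file names are hypothetical:

```python
import os
import struct
import tempfile

def write_file(path, header, columns):
    """Write a metadata header followed by columns of little-endian 8-byte floats."""
    with open(path, "wb") as f:
        f.write(header)
        for col in columns:
            f.write(struct.pack(f"<{len(col)}d", *col))

class DiskColumn:
    """A column described only by where it lives: file, byte offset, length."""
    def __init__(self, path, offset, length):
        self.path, self.offset, self.length = path, offset, length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        with open(self.path, "rb") as f:
            f.seek(self.offset + 8 * i)          # read one value, nothing else
            return struct.unpack("<d", f.read(8))[0]

class VirtualMatrix:
    """Presents any combination of on-disk columns as one matrix, indexed m[i, j]."""
    def __init__(self, columns):
        self.columns = columns

    def __getitem__(self, index):
        i, j = index
        return self.columns[j][i]

header, n = b"METADATA", 3                       # 8-byte header, 3 rows per column
with tempfile.TemporaryDirectory() as tmp:
    f1, f2 = os.path.join(tmp, "file1.bin"), os.path.join(tmp, "file2.bin")
    write_file(f1, header, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0],
                            [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]])      # columns A-D
    write_file(f2, header, [[13.0, 14.0, 15.0], [16.0, 17.0, 18.0],
                            [19.0, 20.0, 21.0], [22.0, 23.0, 24.0]])   # columns E-H

    def col(path, k):
        return DiskColumn(path, len(header) + k * n * 8, n)

    m = VirtualMatrix([col(f1, 0), col(f1, 2), col(f2, 1), col(f2, 3)])  # A, C, F, H
    first_row = [m[0, j] for j in range(4)]
    last_row = [m[2, j] for j in range(4)]

print(first_row, last_row)
```

The matrix object stores only paths and offsets, so the in-memory representation is independent of how, or across how many files, the values are laid out on disk.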
EXAMPLE: LINEAR REGRESSION with a 1.2 GB simulated dataset and biglm • 1.2 GB dataset • N = 15,000,000 observations • P = 9 variables • Linear regression • Using biglm package • Specifically for large datasets
[Figure: bar chart of Memory Used (MB) and Memory Overhead (MB) by method, axis from 0 to 7000]
Method | Memory Used | Memory Overhead | Time
R matrices + lm | 7 GB | 1.4 GB | 33 sec
R matrices + biglm | 2.7 GB | 1.3 GB | 158 sec
bigmemory + biglm | 1.7 GB | 397 MB | 21 sec
matter + biglm | 466 MB | 319 MB | 42 sec
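Least squares can be fitted on data that never fits in memory because the fit only needs quantities that accumulate chunk by chunk. The sketch below accumulates the normal equations (X'X and X'y) over chunks and solves the small system at the end. It is a simplified stand-in chosen for brevity, not necessarily the algorithm biglm uses, and all function names and data are hypothetical:

```python
def accumulate(xtx, xty, chunk):
    """Add one chunk's contribution to X'X and X'y in place."""
    for x, y in chunk:
        for a in range(len(x)):
            xty[a] += x[a] * y
            for b in range(len(x)):
                xtx[a][b] += x[a] * x[b]

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in reversed(range(n)):                      # back substitution
        s = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta

# Toy data following y = 1 + 2x exactly; each row is ([intercept, x], y).
data = [([1.0, float(x)], 1.0 + 2.0 * x) for x in range(6)]

p = 2
xtx = [[0.0] * p for _ in range(p)]
xty = [0.0] * p
for start in range(0, len(data), 2):                  # process 2 rows at a time
    accumulate(xtx, xty, data[start:start + 2])

beta = solve(xtx, xty)
print(beta)
```

Only the P × P and P × 1 accumulators persist between chunks, so memory use depends on the number of variables rather than the number of observations.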
EXAMPLE: PRINCIPAL COMPONENTS ANALYSIS with a 1.2 GB simulated dataset and irlba • 1.2 GB dataset • N = 15,000,000 observations • P = 10 variables • PCA • Using irlba package • Not specifically for large datasets
[Figure: bar chart of Memory Used (MB) and Memory Overhead (MB) by method, axis from 0 to 4000]
Method | Memory Used | Memory Overhead | Time
R matrices + svd | 3.6 GB | 2.4 GB | 62 sec
R matrices + irlba | 2.3 GB | 961 MB | 9 sec
bigmemory + irlba | 3.5 GB | 962 MB | 9 sec
matter + irlba | 522 MB | 427 MB | 171 sec
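An algorithm that is not written for large datasets can still run on data on disk when it touches the data only through matrix-vector products. The sketch below uses plain power iteration to find the first principal axis, streaming rows in chunks on every pass. Power iteration is a simpler technique swapped in for illustration, not the method irlba implements, and the function names and toy data are hypothetical:

```python
import math

def chunks(rows, size):
    """Yield successive blocks of rows, standing in for reads from disk."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def column_means(rows, size):
    """One streaming pass to compute the mean of each column."""
    n, totals = 0, None
    for block in chunks(rows, size):
        for x in block:
            if totals is None:
                totals = [0.0] * len(x)
            totals = [t + xi for t, xi in zip(totals, x)]
            n += 1
    return [t / n for t in totals]

def first_pc(rows, size=2, n_iter=50):
    """First principal axis by power iteration on the centered cross-product matrix."""
    mu = column_means(rows, size)
    p = len(mu)
    v = [1.0 / math.sqrt(p)] * p                     # arbitrary unit-length start
    for _ in range(n_iter):
        w = [0.0] * p
        for block in chunks(rows, size):             # one pass over the data
            for x in block:
                c = [xi - m for xi, m in zip(x, mu)]
                proj = sum(ci * vi for ci, vi in zip(c, v))
                w = [wi + ci * proj for wi, ci in zip(w, c)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return v

# Toy data: most variance lies along the first coordinate.
pixels = [(12.0, 5.0), (8.0, 5.0), (10.0, 6.0), (10.0, 4.0)]
pc1 = first_pc(pixels)
print(pc1)
```

Every iteration re-reads the whole dataset, which is consistent with the pattern in the table: low memory use at the cost of longer run time when the data stay on disk.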
EXAMPLE: PRINCIPAL COMPONENTS ANALYSIS with a 2.85 GB microbial time-course experiment • 3D microbial time-course • 2.85 GB on disk • 17,672 pixels • 40,299 features (Oetjen et al., GigaScience, 2015) • 418 sec per PC • 234 MB to compute 3 PCs • 79 MB memory overhead
[Figure: PC1 loadings; 3D images of m/z 262 and PC1 scores at t = 4, t = 8, and t = 11 (axes x, y, z)]
EXAMPLE: VISUALIZATION of a 26.45 GB mouse pancreas experiment • 3D mouse pancreas • 26.45 GB on disk • 497,225 pixels • 13,312 features (Oetjen et al., GigaScience, 2015) • Cannot load at all without matter • 1.25 GB used in-memory • 223 MB to calculate mean spectrum
[Figure: mean spectrum; 3D ion images at m/z 5086, m/z 3121, and m/z 3922 (axes x, y, z)]