STATISTICAL ANALYSIS OF MASS SPECTROMETRY IMAGING EXPERIMENTS

A FRAMEWORK FOR STATISTICAL ANALYSIS OF MASS SPECTROMETRY IMAGING EXPERIMENTS
Kylie Bemis
Purdue University, Department of Statistics

OUTLINE
• Statement of the problem
  • Biotechnological problem
  • Statistical and computational problem
• Statement of contributions
• Open-source software
  • matter: Rapid prototyping with data on disk
  • Cardinal: Statistical toolbox for mass spectrometry imaging experiments
• Statistical methods
  • Spatial shrunken centroids
• Evaluation and case studies
• Summary
  • Conclusions
  • Future work

MASS SPECTROMETRY IMAGING
Investigate spatial distribution of analytes
• Scan with laser/spray
• Collect mass spectra
• Reconstruct ion images
• Data “cube”
[Figure credit: R. Graham Cooks and lab]
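The step from per-pixel mass spectra to ion images and a data "cube" can be sketched as follows. This is a minimal NumPy illustration on synthetic data, not Cardinal's implementation; the grid size, the m/z axis, and the helper name ion_image are assumptions made for the example.

```python
import numpy as np

# Toy experiment: a 3 x 4 pixel grid with 5 m/z bins (all values are synthetic).
n_rows, n_cols, n_mz = 3, 4, 5
mz = np.array([100.0, 200.0, 300.0, 400.0, 500.0])

# One spectrum per pixel, stored in acquisition order with its (x, y) position.
rng = np.random.default_rng(0)
spectra = rng.random((n_rows * n_cols, n_mz))
x = np.tile(np.arange(n_cols), n_rows)
y = np.repeat(np.arange(n_rows), n_cols)


def ion_image(spectra, x, y, mz, target_mz):
    """Return a 2D image of the intensity at the m/z bin nearest target_mz."""
    k = int(np.argmin(np.abs(mz - target_mz)))
    image = np.full((y.max() + 1, x.max() + 1), np.nan)  # NaN marks unscanned pixels
    image[y, x] = spectra[:, k]
    return image


# A single ion image, picked by the nearest m/z bin (300.0 here).
img = ion_image(spectra, x, y, mz, target_mz=305.0)

# Stacking the ion images for every m/z bin gives the (y, x, m/z) data cube.
cube = np.stack([ion_image(spectra, x, y, mz, m) for m in mz], axis=-1)
```

Each slice cube[:, :, k] is one ion image, and each cube[i, j, :] is the mass spectrum of one pixel.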

BIOTECHNOLOGICAL PROBLEM
• Rapidly advancing technology
• Increasing mass resolutions
  • Greater mass accuracy and range
  • More features (larger P)
• Increasing spatial resolutions
  • Approaching 1 µm resolution
  • More pixels (larger N)
• More complex experiments
  • 3D experiments
  • Time-course experiments
• Increasing sample size
  • More biological replicates
  • More pixels (larger N)

STATISTICAL & COMPUTATIONAL PROBLEM
• Complex, high-dimensional data
  • Spatial x, y dimensions
  • Potentially z, t dimensions
  • Mass spectral features (m/z values)
• Correlation structures
  • Spatial (and possibly temporal)
  • Between mass spectral features
• Increasing mass and spatial resolutions
  • Larger(-than-memory) datasets
  • Can range from 100 MB to 100 GB
• Experimental design
  • Variation across samples and slides
  • What counts as a replicate?

PROBLEM STATEMENT
• Biotechnological problem
  • Mass spectrometry (MS) imaging has advanced at a rapid pace
  • Computational tools have not advanced at a comparable pace
  • Lack of free, open-source tools for statistical analysis
  • Need for classification/segmentation with statistical inference:
    • Classification: Classify pixels based on their mass spectral profiles into pre-defined classes (such as healthy/disease status)
    • Segmentation: Assign pixels to newly discovered segments with relatively homogeneous and distinct mass spectral profiles
    • Select a subset of informative mass spectral features
• Statistical and computational problem
  • MS imaging experiments result in complex, high-dimensional data
  • Spatial structure in datasets with large P and large N
  • Statistical computing on larger-than-memory data is a challenge

STATEMENT OF CONTRIBUTIONS
• Statistical methods: spatial shrunken centroids
  • Classification and segmentation for MS imaging experiments
  • Probabilistic model using spatial information
  • Selection of most informative mass spectral features
• Open-source software: Cardinal
  • Free, open-source R package for MS imaging experiments
  • Full pipeline including processing, visualization, and statistical analysis
  • For experimentalists, provides accessible statistical methods
  • For statisticians, provides infrastructure for method development
• Open-source software: matter
  • Free, open-source R package for rapid prototyping with data-on-disk
  • Flexible statistical computing and method development for larger-than-memory datasets
  • Enables Cardinal to scale to high-resolution, high-throughput MS imaging experiments
• Evaluation and case studies
  • Public datasets and reproducible results in CardinalWorkflows
  • Community impact of this work

PROBLEM: LARGER-THAN-MEMORY DATA
challenges statistical method development
• MS imaging experiments rapidly advancing
  • Increasing mass and spatial resolutions
  • Larger sample sizes, multiple files
• Growing data size poses difficulty for statistics
  • Need to test methods on larger-than-memory data
  • Need to work with domain-specific formats
  • Current R solutions are inflexible
[Figure: ion images at m/z = 715.03 for t = 4, t = 8, and t = 11 (x, y, z axes). Label on slide: Cardinal help Google group]

CONTRIBUTION: MATTER
open-source statistical computing with data on disk
• Work with larger-than-memory datasets on disk in R
  • Emphasizes flexibility with a minimal memory footprint
  • Adaptable to more datasets than bigmemory and ff
  • Potentially slower computation
• Designed for statistical method development in R
  • Rapid prototyping with minimal additional effort
  • Works with many existing algorithms
  • Efficient calculation of summary statistics
• Infrastructure for statistical computing on large data
[Figure: storage diagram of a matter object made up of Atoms 1 to 6, which are stored across Files 1 to 3]

NEED TO WORK WITH MS IMAGING FILES
e.g., “processed” and “continuous” imzML
• Open-source format for MS imaging experiments
  • XML metadata file defines binary data file structure
  • Binary data schema is incompatible with bigmemory and ff
  • Prefer to avoid additional file conversion
  • Need random access into different parts of the file
• Often one-sample-per-file
  • Need to seamlessly work with multiple files in an experiment
  • Each file can be very large
• matter solves these problems

Binary file layout shown on the slide:

“processed” imzML     “continuous” imzML
UUID                  UUID
mzArray 1             mzArray
intensityArray 1      intensityArray 1
mzArray 2             intensityArray 2
intensityArray 2      intensityArray 3
mzArray 3             intensityArray 4
intensityArray 3      intensityArray 5
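The need for random access into different parts of a binary file can be illustrated with a toy "continuous"-style layout: a header, one shared m/z array, then one intensity array per pixel. This is a NumPy sketch, not matter's API and not a real imzML reader. The file is synthetic, the helper read_array is hypothetical, and the byte offsets are computed by hand here, whereas in a real experiment they would come from the XML metadata.

```python
import os
import tempfile

import numpy as np

# Synthetic data: 3 pixels sharing one m/z axis of 4 values.
n_mz, n_pixels = 4, 3
mz = np.array([100.0, 200.0, 300.0, 400.0], dtype="<f8")
intensities = np.arange(n_pixels * n_mz, dtype="<f4").reshape(n_pixels, n_mz)

# Write the toy binary file: 16 header bytes standing in for the UUID,
# then the shared m/z array, then the intensity arrays back to back.
header_bytes = 16
path = os.path.join(tempfile.mkdtemp(), "toy.ibd")
with open(path, "wb") as f:
    f.write(bytes(header_bytes))
    f.write(mz.tobytes())
    f.write(intensities.tobytes())

# Assumed offsets (a real reader would parse these from the metadata file).
mz_offset = header_bytes
intensity_offsets = [
    header_bytes + mz.nbytes + i * n_mz * intensities.itemsize
    for i in range(n_pixels)
]


def read_array(path, offset, count, dtype):
    """Read `count` values of `dtype` starting at byte `offset`, and nothing else."""
    return np.fromfile(path, dtype=dtype, count=count, offset=offset)


# Jump straight to the last pixel without reading the ones before it.
spectrum_2 = read_array(path, intensity_offsets[2], n_mz, "<f4")
shared_mz = read_array(path, mz_offset, n_mz, "<f8")
```

In a "processed"-style file each pixel would carry its own m/z array, so the offset table would hold one m/z offset and one intensity offset per pixel.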

FLEXIBLE ACCESS TO DATA ON DISK
any binary format, any file structure
• User-defined file structure
• Data can come from anywhere
  • Any part of a file
  • Any combination of files
• Representation in R can be different from on disk
• Access as ordinary R vector/matrix
• No need to worry about data size or memory management
[Figure: File 1 holds Metadata and Columns A, B, C, D; File 2 holds Metadata and Columns E, F, G, H; a matter matrix is assembled from Columns A, C, F, H]
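The idea in the figure, one matrix whose columns live in different parts of different files, can be sketched with memory-mapped arrays. This is a NumPy analogy under stated assumptions, not how matter is implemented: the class ColumnMatrix, the 8-byte metadata header, and the file names are all hypothetical.

```python
import os
import tempfile

import numpy as np

n = 5                # rows per column
meta_bytes = 8       # assumed size of each file's metadata header
tmp = tempfile.mkdtemp()


def write_file(name, columns):
    """Write a metadata header followed by the columns, stored back to back."""
    path = os.path.join(tmp, name)
    with open(path, "wb") as f:
        f.write(bytes(meta_bytes))
        for col in columns:
            f.write(np.asarray(col, dtype="<f8").tobytes())
    return path


# Columns A..H hold distinct values so they are easy to tell apart.
cols = {c: np.arange(n) + 10.0 * i for i, c in enumerate("ABCDEFGH")}
file1 = write_file("file1.bin", [cols[c] for c in "ABCD"])
file2 = write_file("file2.bin", [cols[c] for c in "EFGH"])


def col_offset(k):
    """Byte offset of the k-th column within a file."""
    return meta_bytes + k * n * 8


class ColumnMatrix:
    """Matrix-like view whose columns are (path, byte offset) pairs on disk."""

    def __init__(self, sources, nrow):
        self.nrow = nrow
        self.columns = [
            np.memmap(p, dtype="<f8", mode="r", offset=off, shape=(nrow,))
            for p, off in sources
        ]

    @property
    def shape(self):
        return (self.nrow, len(self.columns))

    def __getitem__(self, key):
        rows, j = key  # supports m[rows, j] with an integer column index j
        return np.array(self.columns[j][rows])


# Assemble Columns A, C (from file 1) and F, H (from file 2) into one matrix.
m = ColumnMatrix(
    [(file1, col_offset(0)), (file1, col_offset(2)),
     (file2, col_offset(1)), (file2, col_offset(3))],
    nrow=n,
)
```

Only the requested elements are read when indexing, for example m[:, 2] touches just Column F in the second file.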

EXAMPLE: LINEAR REGRESSION
with a 1.2 GB simulated dataset and biglm
• 1.2 GB dataset
  • N = 15,000,000 observations
  • P = 9 variables
• Linear regression
  • Using biglm package
  • Specifically for large datasets

Method                Memory Used   Memory Overhead   Time
R matrices + lm       7 GB          1.4 GB            33 sec
R matrices + biglm    2.7 GB        1.3 GB            158 sec
bigmemory + biglm     1.7 GB        397 MB            21 sec
matter + biglm        466 MB        319 MB            42 sec

[Figure: bar chart of memory used and memory overhead (MB) for each method, axis from 0 to 7000]
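Why a regression can run with far less memory than the dataset occupies can be seen from a chunked fit: only one block of rows is held at a time, and small P x P summaries are accumulated. The sketch below uses the normal equations for simplicity; that is a swapped-in technique chosen for brevity, not the algorithm biglm itself uses, and the data, chunk size, and function name chunked_ols are assumptions.

```python
import numpy as np

# Synthetic data with P = 9 predictors (far fewer rows than the slide's example).
rng = np.random.default_rng(1)
n, p = 10_000, 9
X = rng.normal(size=(n, p))
beta_true = np.arange(1, p + 1, dtype=float)
y = 2.0 + X @ beta_true + rng.normal(scale=0.1, size=n)


def chunked_ols(X, y, chunk_size=1_000):
    """Least squares with an intercept, reading X and y one chunk at a time."""
    k = X.shape[1] + 1                      # +1 for the intercept column
    xtx = np.zeros((k, k))                  # running sum of X'X
    xty = np.zeros(k)                       # running sum of X'y
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        block = np.asarray(X[start:stop], dtype=float)
        design = np.column_stack([np.ones(block.shape[0]), block])
        xtx += design.T @ design
        xty += design.T @ np.asarray(y[start:stop], dtype=float)
    return np.linalg.solve(xtx, xty)


# coef[0] is the intercept, coef[1:] are the slopes.
coef = chunked_ols(X, y)
```

Because X is only ever sliced by rows, the same function would work unchanged if X were a memory-mapped array on disk rather than an in-memory matrix. The normal equations are less numerically stable than a QR-based update when predictors are strongly collinear.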

EXAMPLE: PRINCIPAL COMPONENTS ANALYSIS
with a 1.2 GB simulated dataset and irlba
• 1.2 GB dataset
  • N = 15,000,000 observations
  • P = 10 variables
• PCA
  • Using irlba package
  • Not specifically for large datasets

Method                Memory Used   Memory Overhead   Time
R matrices + svd      3.6 GB        2.4 GB            62 sec
R matrices + irlba    2.3 GB        961 MB            9 sec
bigmemory + irlba     3.5 GB        962 MB            9 sec
matter + irlba        522 MB        427 MB            171 sec

[Figure: bar chart of memory used and memory overhead (MB) for each method, axis from 0 to 4000]
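When P is small, as in this example, principal components can be computed from row chunks by accumulating column sums and cross-products and then decomposing the P x P covariance matrix. This is a deliberately simple substitute for illustration, not the truncated SVD algorithm that irlba uses; the synthetic data and the function name chunked_pca are assumptions.

```python
import numpy as np

# Synthetic data with P = 10 variables whose scales decrease from 3.0 to 0.5.
rng = np.random.default_rng(2)
n, p = 5_000, 10
X = rng.normal(size=(n, p)) * np.linspace(3.0, 0.5, p)


def chunked_pca(X, n_components=3, chunk_size=500):
    """PCA from one pass over row chunks; returns mean, loadings, and sdev."""
    n_rows, n_cols = X.shape
    total = np.zeros(n_cols)                # running column sums
    cross = np.zeros((n_cols, n_cols))      # running sum of X'X
    for start in range(0, n_rows, chunk_size):
        chunk = np.asarray(X[start:start + chunk_size], dtype=float)
        total += chunk.sum(axis=0)
        cross += chunk.T @ chunk
    mean = total / n_rows
    # Covariance from the accumulated sums (subject to cancellation error
    # when the means are large relative to the spread).
    cov = (cross - n_rows * np.outer(mean, mean)) / (n_rows - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order], np.sqrt(eigvals[order])


mean, loadings, sdev = chunked_pca(X)
```

Scores for any block of rows can then be obtained as (block - mean) @ loadings, again without holding the full matrix in memory. When P is large, as in the imaging datasets on the following slides, forming a P x P matrix becomes expensive and an iterative method is the more natural choice.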

EXAMPLE: PRINCIPAL COMPONENTS ANALYSIS
with a 2.85 GB microbial time-course experiment
• 3D microbial time-course
  • 2.85 GB on disk
  • 17,672 pixels
  • 40,299 features
• Oetjen et al, Gigascience, 2015
• 234 MB to compute 3 PC
• 79 MB memory overhead
• 418 sec per PC
[Figure: PC1 loadings; m/z 262 and PC1 scores shown at t = 4, t = 8, and t = 11 (x, y, z axes)]

EXAMPLE: VISUALIZATION
of a 26.45 GB mouse pancreas experiment
• 3D mouse pancreas
  • 26.45 GB on disk
  • 497,225 pixels
  • 13,312 features
• Oetjen et al, Gigascience, 2015
• Cannot load at all without matter
• 1.25 GB used in-memory
• 223 MB to calculate mean spectrum
[Figure: mean spectrum and 3D ion images at m/z 5086, m/z 3121, and m/z 3922 (x, y, z axes)]
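A mean spectrum needs only a running sum per feature, which is why it can be computed with a small fraction of the memory the dataset occupies on disk. The sketch below shows this with a memory-mapped file in NumPy. It is an analogy, not matter's implementation, and the file, its flat pixels-by-features layout, and the name mean_spectrum are assumptions.

```python
import os
import tempfile

import numpy as np

# Write a small synthetic pixels x features intensity matrix to disk.
n_pixels, n_features = 200, 16
rng = np.random.default_rng(3)
data = rng.random((n_pixels, n_features)).astype("<f4")
path = os.path.join(tempfile.mkdtemp(), "intensities.bin")
data.tofile(path)

# Map the file instead of loading it; pages are read only when indexed.
on_disk = np.memmap(path, dtype="<f4", mode="r", shape=(n_pixels, n_features))


def mean_spectrum(matrix, chunk_size=32):
    """Mean over pixels for each feature, reading `chunk_size` pixels at a time."""
    total = np.zeros(matrix.shape[1], dtype=np.float64)
    for start in range(0, matrix.shape[0], chunk_size):
        chunk = np.asarray(matrix[start:start + chunk_size], dtype=np.float64)
        total += chunk.sum(axis=0)
    return total / matrix.shape[0]


mean_spec = mean_spectrum(on_disk)
```

Peak memory here is governed by chunk_size times the number of features, not by the number of pixels.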
