Extreme Spatio-Temporal Data Analysis in Biomedical Informatics • Joel Saltz MD, PhD • Director, Center for Comprehensive Informatics
Contributions • Computer Science: Methods and middleware for analysis and classification of very large datasets from low-dimensional spatio-temporal sensors; methods to carry out comparisons and change detection between sensor datasets • Biomedical: Mine whole slide image datasets to better predict outcome and response to treatments, generate basic insights into pathophysiology, and identify new treatment targets
Outline of Talk • Pathology: Analysis of Digitized Tissue for Research and Practice • Feature Clustering: Morphologic Tumor Subtypes in GBM Brain Tumors and Relationship to “omic” classifications • Whole Slide Image Analysis in Clinical Practice: Neuroblastoma • Tissue Flow: Multiplex Quantum Dot • HPC/BIGDATA Feature Pipeline • Pathology data analytic tools and techniques
Whole Slide Imaging: Scale
Pathology Computer Assisted Diagnosis • Shimada, Gurcan, Kong, Saltz
Computerized Classification System for Grading Neuroblastoma • Background Identification • Image Decomposition (Multi-resolution levels) • Image Segmentation (EMLDA) • Feature Construction (2nd order statistics, Tonal Features) • Feature Extraction (LDA) + Classification (Bayesian) • Multi-resolution Layer Controller (Confidence Region) [Figure: flowchart of the TRAINING and TESTING paths; testing initializes I = L and, while the classification is not within the confidence region and I > 1, sets I = I - 1 and repeats]
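The multi-resolution control loop on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `classify_tile`, `toy_classifier`, the per-level classifier interface, the class labels, and the 0.9 confidence threshold are all hypothetical. It only shows the control flow: start at the most down-sampled level (I = L) and move to a finer level only when the classification falls outside the confidence region.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: each level's classifier returns (label, confidence in [0, 1]).
Classifier = Callable[[Sequence[float]], Tuple[str, float]]


def classify_tile(
    pyramid: List[Sequence[float]],
    classifiers: List[Classifier],
    threshold: float = 0.9,  # assumed confidence cut-off, not taken from the slides
) -> Tuple[str, int]:
    """Classify one tile, refining resolution only when confidence is low.

    pyramid[0] is the full-resolution tile (level 1); pyramid[-1] is the
    most down-sampled version (level L). Returns (label, level_used).
    """
    level = len(pyramid)  # initialization: I = L
    while True:
        label, confidence = classifiers[level - 1](pyramid[level - 1])
        # Accept when inside the confidence region, or when no finer level is left.
        if confidence >= threshold or level == 1:
            return label, level
        level -= 1  # I = I - 1: fall back to the next finer resolution


def toy_classifier(confidence: float) -> Classifier:
    """Stand-in for a trained per-level classifier (illustrative only)."""
    def classify(tile: Sequence[float]) -> Tuple[str, float]:
        label = "class_a" if sum(tile) / len(tile) > 0.5 else "class_b"
        return label, confidence
    return classify


# Toy pyramid: level 1 (4 values), level 2 (2 values), level 3 (1 value).
pyramid = [[0.9, 0.8, 0.7, 0.6], [0.85, 0.65], [0.75]]
classifiers = [toy_classifier(0.99), toy_classifier(0.95), toy_classifier(0.60)]
print(classify_tile(pyramid, classifiers))
```

In this toy run the coarsest level is not confident enough, so the decision is made one level finer; most tiles in a real slide would ideally be resolved without ever touching full resolution.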
Direct Study of Relationship Between vs
In Silico Brain Tumor Center • Anaplastic Astrocytoma (WHO grade III) • Glioblastoma (WHO grade IV)
Morphological Tissue Classification • Whole Slide Imaging • Cellular Features • Nuclei Segmentation • Lee Cooper, Jun Kong
Nuclear Features Used to Classify GBMs • Consensus clustering of morphological signatures • Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients • Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering [Figure: silhouette area vs. number of clusters (2 to 7); silhouette value by cluster]
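The idea of quantifying co-clustering over repeated K-means runs can be sketched in a few lines. This is an illustrative toy, not the study's code: `kmeans`, `consensus_matrix`, the 80% subsampling fraction, the run count, and the 2-D toy points are assumptions, and a real analysis would operate on per-patient morphological signatures with a numerical library.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def consensus_matrix(points, k, runs=50, frac=0.8, seed=0):
    """Fraction of runs in which each pair of points lands in the same cluster.

    Each run clusters a random subsample (assumed resampling scheme); a pair
    only counts in runs where both of its points were sampled.
    """
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(runs):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        labels = kmeans([points[i] for i in idx], k, rng)
        for (a, la), (b, lb) in combinations(zip(idx, labels), 2):
            sampled[a][b] += 1
            sampled[b][a] += 1
            if la == lb:
                together[a][b] += 1
                together[b][a] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Two well-separated toy groups of four points each.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
consensus = consensus_matrix(toy_points, k=2)
```

Entries near 1 mark pairs that co-cluster in almost every run and entries near 0 mark pairs that almost never do; repeating this for each candidate number of clusters gives the stability comparison the slide describes.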
Clustering identifies three morphological groups • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides) • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB) • Prognostically significant (logrank p = 4.5e-4) [Figure: heatmap of feature indices by cluster (CC, CM, PB); survival vs. days for CC, CM, PB]
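Survival curves of the kind compared by a logrank test are commonly drawn with the Kaplan-Meier estimator. The slide does not say how its curves were computed, so the following is only a sketch under that assumption; `kaplan_meier` and the toy follow-up data are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient.
    events: 1 if death was observed at that time, 0 if the patient was censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # censored patients leave the risk set without a step
    return curve


# Toy cohort: four patients, the second one censored at day 2.
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Computing one such curve per morphological cluster gives the per-group curves that a logrank test then compares.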
Gene Expression Class Associations • Cox proportional hazards – Gene expression class not significant p = 0.58 – Morphology clustering p = 5.0e-3 [Figure: subtype percentage (%) by cluster (CC, CM, PB) for the Classical, Mesenchymal, Neural and Proneural subtypes]
Clustering Validation • Separate set of 84 GBMs from Henry Ford Hospital • ClusterRepro: CC p = 7.2e-3, CM p = 1.3e-2 [Figure: heatmap of feature indices by cluster (CC, Mixed, CM); survival vs. months for CC, Mixed, CM]
Associations
Novel Pathology Modalities • Genomics – Excellent Molecular Resolution – Limited Spatial Resolution – 1000’s of genes • Imaging – Excellent Spatial Resolution – Limited Molecular Resolution
Quantum Dots • Professor Robin Bostick
Imaging Pipeline – Feature Extraction
Example Application: Cancer Stem Cell Niche • Cancer stem cells – Rare(?), proliferative cells, regenerative – Do they prefer to live near blood vessels, or necrosis?
Extreme Spatio-Temporal Sensor Data Analytics • Leverage exascale data and computer resources to squeeze the most out of image, sensor or simulation data • Run lots of different algorithms to derive the same features • Run lots of algorithms to derive complementary features • Data models and data management infrastructure to manage data products, feature sets and results from classification and machine learning algorithms
Application Targets • Multi-dimensional spatial-temporal datasets – Microscopy image analyses – Biomass monitoring using satellite imagery – Weather prediction using satellite and ground sensor data – Large scale simulations • Can we analyze 100,000+ microscopy images per hour? • Correlative and cooperative analysis of data from multiple sensor modalities and sources • What-if scenarios and multiple design choices or initial conditions
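As a back-of-the-envelope reading of the 100,000 images per hour target, the arithmetic below converts it into a sustained rate and a per-image time budget. The function names are hypothetical, perfect scaling across nodes is assumed, and the 100-node figure is borrowed from the Keeneland scaling slide later in the talk purely as an example.

```python
def required_rate(images_per_hour: float) -> float:
    """Sustained throughput (images per second) needed to hit the target."""
    return images_per_hour / 3600.0


def per_image_budget_seconds(images_per_hour: float, nodes: int) -> float:
    """Node-seconds available per image, assuming perfect scaling (an idealization)."""
    return nodes * 3600.0 / images_per_hour


rate = required_rate(100_000)                     # roughly 27.8 images per second
budget = per_image_budget_seconds(100_000, 100)   # about 3.6 node-seconds per image
print(f"{rate:.1f} images/s, {budget:.1f} node-seconds per image on 100 nodes")
```

Any I/O, load imbalance, or scheduling overhead eats into that budget, which is why the later slides focus on data movement and CPU/GPU work scheduling.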
Biomass Monitoring (joint with ORNL) • Investigate changes in vegetation and land use • Integrate hierarchical, multi-resolution coarse- and fine-grained analytics into a unified framework • Changes identified using high temporal/low spatial resolution MODIS data • Segmentation and classification methods used to characterize changes using higher resolution data (e.g. multitemporal AWiFS data) • Segmentation and classification to identify man-made structures
Core Transformations • Data Cleaning and Low Level Transformations • Data Subsetting, Filtering, Subsampling • Spatio-temporal Mapping and Registration • Object Segmentation • Feature Extraction, Object Classification • Spatio-temporal Aggregation • Change Detection, Comparison, and Quantification
Extreme DataCutter • DataCutter – Pipeline of filters connected through logical streams – In transit processing – Flow control between filters and streams – Developed 1990s-2000s; led to IBM System S • Extreme DataCutter – Two level hierarchical pipeline framework – In transit processing – Coarse grained components coordinated by a Manager that coordinates work on pipeline stages between nodes – Fine grained pipeline operations managed at the node level – Both levels employ filter/stream paradigm
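The filter/stream paradigm with flow control can be illustrated with threads and bounded queues. This is a generic sketch, not the DataCutter API: `run_filter`, `run_pipeline`, and `stream_capacity` are hypothetical names, each filter is a plain Python function, and error handling is omitted (a filter that raises would stall this toy pipeline).

```python
import queue
import threading

_END = object()  # end-of-stream marker passed down the pipeline


def run_filter(func, inbox, outbox):
    """One filter: read buffers from its input stream, write results downstream."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)
            return
        outbox.put(func(item))


def run_pipeline(source, filters, stream_capacity=4):
    """Chain filters with bounded streams.

    A full stream blocks its producer, which gives simple flow control
    between neighbouring filters.
    """
    streams = [queue.Queue(maxsize=stream_capacity) for _ in range(len(filters) + 1)]
    workers = [
        threading.Thread(target=run_filter, args=(f, streams[i], streams[i + 1]))
        for i, f in enumerate(filters)
    ]

    def feed():
        for item in source:
            streams[0].put(item)
        streams[0].put(_END)

    workers.append(threading.Thread(target=feed))
    for w in workers:
        w.start()

    results = []
    while True:  # the caller acts as the final consumer
        item = streams[-1].get()
        if item is _END:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results


# Two toy filters standing in for pipeline stages.
pipeline_output = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2])
print(pipeline_output)
```

Because every stage runs concurrently, a buffer can be processed by the second filter while the first is already working on the next one, which is the in-transit processing the slide refers to. A two-level variant would nest one such pipeline per node under a coordinating manager.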
Extreme DataCutter – Two Level Model
Node Level Work Scheduling • Features of Node Level Architectures – Nodes contain CPUs, GPUs – Each CPU contains multiple cores – GPU has complex internal architecture – Data locality within node – Data paths between CPUs and GPUs [Figure: Keeneland Node]
Node Level Work Scheduling • Attempt to minimize data movement • Identify and assign operations that perform well on GPU • Balance load between CPUs and GPUs • Prefetch data • Identify and use high bandwidth CPU/GPU data paths • Schedule exclusive GPU access for components (e.g. morphological reconstruction) requiring fine grained parallelism
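Two of these goals, steering GPU-friendly operations to the GPU and balancing load across devices, can be sketched with a greedy scheduler. This is an assumption-laden toy rather than the scheduler used in the talk: `schedule_ops`, the operation names other than morphological reconstruction, and all timing estimates are hypothetical, and data movement and prefetching costs are ignored.

```python
def schedule_ops(ops, n_cpu_cores, n_gpus):
    """Greedy assignment of operations to CPU cores and GPUs.

    ops: list of (name, cpu_seconds, gpu_seconds) with estimated run times.
    Operations with the largest estimated GPU speedup are placed first, and
    each goes to the device on which it would finish earliest.
    Returns (assignment, makespan_seconds).
    """
    devices = [("cpu", i) for i in range(n_cpu_cores)] + [("gpu", i) for i in range(n_gpus)]
    free_at = {d: 0.0 for d in devices}
    assignment = {}

    def cost(device, cpu_s, gpu_s):
        return gpu_s if device[0] == "gpu" else cpu_s

    for name, cpu_s, gpu_s in sorted(ops, key=lambda o: o[1] / o[2], reverse=True):
        best = min(devices, key=lambda d: free_at[d] + cost(d, cpu_s, gpu_s))
        free_at[best] += cost(best, cpu_s, gpu_s)
        assignment[name] = best
    return assignment, max(free_at.values())


# Illustrative estimates only (seconds on one CPU core, seconds on one GPU).
toy_ops = [
    ("morphological_reconstruction", 15.0, 1.0),
    ("color_deconvolution", 4.0, 2.0),
    ("thresholding", 1.0, 1.0),
]
toy_assignment, toy_makespan = schedule_ops(toy_ops, n_cpu_cores=1, n_gpus=1)
print(toy_assignment, toy_makespan)
```

With these toy numbers the operation with the largest speedup claims the GPU first and the operation with no speedup stays on the CPU, so neither device sits idle while the other is overloaded.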
Node Level Work Scheduling
Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)
Control Structures for Handling Fine Grained/Runtime Dependent Parallelism in GPUs • Morphological Reconstruction: 8-15 fold speedup vs. one CPU core (Intel i7 2.66 GHz) on NVIDIA C2070 and GTX580 GPUs
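To show why morphological reconstruction involves runtime-dependent parallelism, here is a small sequential sketch of grayscale reconstruction by dilation using a work queue. It is a simplified CPU reference, not the GPU implementation benchmarked on the slide; the function name, the 4-connected neighbourhood, and the toy images are assumptions.

```python
from collections import deque


def morphological_reconstruction(marker, mask):
    """Grayscale reconstruction by dilation of `marker` under `mask`.

    Both inputs are equally sized 2-D lists. Values spread from each pixel to
    its 4-connected neighbours but are never allowed to exceed the mask.
    """
    rows, cols = len(mask), len(mask[0])
    out = [[min(marker[r][c], mask[r][c]) for c in range(cols)] for r in range(rows)]
    pending = deque((r, c) for r in range(rows) for c in range(cols))
    while pending:
        r, c = pending.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                candidate = min(out[r][c], mask[nr][nc])
                if candidate > out[nr][nc]:
                    out[nr][nc] = candidate
                    pending.append((nr, nc))  # new work discovered at run time
    return out


# A seed in the left region cannot cross the zero-valued column of the mask.
toy_mask = [[3, 3, 0, 4],
            [3, 3, 0, 4]]
toy_marker = [[3, 0, 0, 0],
              [0, 0, 0, 0]]
toy_result = morphological_reconstruction(toy_marker, toy_mask)
```

How many pixels get re-queued depends entirely on the image content, so the amount and location of work is only known while the computation runs; that is the irregularity a GPU control structure has to handle.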
Large Scale Data Management • Implemented with IBM DB2 for large scale pathology image metadata (~ million markups per slide) • Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc. • Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships • Highly optimized spatial query and analyses
Spatial Centric – Pathology Imaging “GIS” • Point query: human marked point inside a nucleus • Window query: return markups contained in a rectangle • Containment query: nuclear feature aggregation in tumor regions • Spatial join query: algorithm validation/comparison
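The four query types on this slide can be illustrated in memory with axis-aligned bounding boxes. This is only a conceptual sketch: the slides describe a spatial database, whereas the `Box` class, the function names, and the toy markups below are hypothetical, and real markups would be polygons served by a spatial index rather than scanned linearly.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box standing in for a markup or a region."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains_point(self, x: float, y: float) -> bool:
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax

    def contains(self, other: "Box") -> bool:
        return (self.xmin <= other.xmin and self.ymin <= other.ymin
                and other.xmax <= self.xmax and other.ymax <= self.ymax)

    def intersects(self, other: "Box") -> bool:
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


def point_query(markups: Dict[str, Box], x: float, y: float) -> List[str]:
    """Markups that contain a human-marked point."""
    return [name for name, box in markups.items() if box.contains_point(x, y)]


def window_query(markups: Dict[str, Box], window: Box) -> List[str]:
    """Markups that lie entirely inside a rectangle."""
    return [name for name, box in markups.items() if window.contains(box)]


def mean_feature_in_region(markups: Dict[str, Box], feature: Dict[str, float],
                           region: Box) -> Optional[float]:
    """Containment query: average a nuclear feature over markups inside a region."""
    inside = window_query(markups, region)
    return sum(feature[n] for n in inside) / len(inside) if inside else None


def spatial_join(a: Dict[str, Box], b: Dict[str, Box]) -> List[Tuple[str, str]]:
    """Pairs of overlapping markups from two result sets (e.g. two algorithms)."""
    return [(na, nb) for na, ba in a.items() for nb, bb in b.items() if ba.intersects(bb)]


# Toy markups from two hypothetical segmentation runs.
run_a = {"n1": Box(0, 0, 2, 2), "n2": Box(5, 5, 7, 7)}
run_b = {"a1": Box(1, 1, 3, 3), "a2": Box(10, 10, 12, 12)}
nuclear_area = {"n1": 10.0, "n2": 30.0}
```

The nested loop in `spatial_join` is quadratic in the number of markups, which is workable for a toy but not for roughly a million markups per slide; that gap is what optimized spatial query support is meant to close.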
PAIS (Pathology Analytical Imaging Standards) • Supported by caBIG, R01 and ACTSI • PAIS Logical Model – 62 UML classes – markups, annotations, imageReferences, provenance • PAIS Data Representation – XML (compressed) or HDF5 • PAIS Databases – loading, managing, querying and sharing data – Native XML DBMS or RDBMS + SDBMS