Extreme Spatio Temporal Data Analysis in Biomedical Informatics

  1. Extreme Spatio Temporal Data Analysis in Biomedical Informatics • Joel Saltz MD, PhD • Director, Center for Comprehensive Informatics

  2. Contributions • Computer Science: Methods and middleware for analysis and classification of very large datasets from low dimensional spatio-temporal sensors; methods to carry out comparisons and change detection between sensor datasets • Biomedical: Mine whole slide image datasets to better predict outcome and response to treatments, generate basic insights into pathophysiology and identify new treatment targets

  3. Outline of Talk • Pathology: Analysis of Digitized Tissue for Research and Practice • Feature Clustering: Morphologic Tumor Subtypes in GBM Brain Tumors and Relationship to “omic” classifications • Whole Slide Image Analysis in Clinical Practice: Neuroblastoma • Tissue Flow: Multiplex Quantum Dot • HPC/BIGDATA Feature Pipeline • Pathology data analytic tools and techniques

  4. Whole Slide Imaging: Scale

  5. Pathology Computer Assisted Diagnosis • Shimada, Gurcan, Kong, Saltz

  6. Computerized Classification System for Grading Neuroblastoma • Background Identification • Image Decomposition (Multi-resolution levels) • Image Segmentation (EMLDA) • Feature Construction (2nd order statistics, Tonal Features) • Feature Extraction (LDA) • Classification (Bayesian) • Multi-resolution Layer Controller (Confidence Region) [Flowchart: TRAINING (training tiles, down-sampling, segmentation, feature construction, feature extraction, classifier training) and TESTING (image tile, initialization I = L, background test, create image at level I, segmentation, feature construction, feature extraction, classification, confidence region test, I = I - 1 while I > 1, label)]
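
The testing loop on slide 6 can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: `classify_tile`, `downsample`, `mean_intensity`, the 0.9 background threshold and the 0.8 confidence cutoff are all hypothetical, and the `classifier` callable stands in for the segmentation, feature construction, LDA feature extraction and Bayesian classification stages. The sketch starts at the coarsest level I = L and moves to a finer level only when the decision falls outside the confidence region.

```python
from typing import Callable, List, Optional, Tuple

Tile = List[List[float]]  # grayscale tile, intensities in [0, 1]


def mean_intensity(tile: Tile) -> float:
    return sum(map(sum, tile)) / (len(tile) * len(tile[0]))


def downsample(tile: Tile, factor: int) -> Tile:
    """Average-pool the tile by `factor` in each direction."""
    rows, cols = len(tile) // factor, len(tile[0]) // factor
    return [
        [
            sum(tile[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / factor ** 2
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def classify_tile(
    tile: Tile,
    classifier: Callable[[Tile], Tuple[str, float]],
    levels: int = 3,                    # L: number of resolution levels (assumed)
    min_confidence: float = 0.8,        # stand-in for the confidence region
    background_threshold: float = 0.9,  # assumed: bright tiles are background
) -> Tuple[str, float, Optional[int]]:
    """Return (label, confidence, level at which the decision was made)."""
    if mean_intensity(tile) > background_threshold:
        return "background", 1.0, None
    level = levels                      # I = L, the coarsest level
    while True:
        label, confidence = classifier(downsample(tile, 2 ** (level - 1)))
        if confidence >= min_confidence or level == 1:
            return label, confidence, level
        level -= 1                      # I = I - 1, retry at finer resolution


def toy_classifier(image: Tile) -> Tuple[str, float]:
    """Hypothetical classifier whose confidence grows with resolution."""
    label = "dark" if mean_intensity(image) < 0.5 else "light"
    return label, min(1.0, len(image) / 8)
```

The point of the multi-resolution controller in this reading is cost: tiles that can be decided confidently at a coarse level never pay for full-resolution processing.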

  7. Center for Comprehensive Informatics

  8. Direct Study of Relationship Between [figure] vs [figure]

  9. In Silico Brain Tumor Center • Anaplastic Astrocytoma (WHO grade III) • Glioblastoma (WHO grade IV)

  10. Morphological Tissue Classification • Whole Slide Imaging • Cellular Features • Nuclei Segmentation • Lee Cooper, Jun Kong

  11. Nuclear Features Used to Classify GBMs • Consensus clustering of morphological signatures • Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients • Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering [Figure: silhouette area vs. number of clusters; silhouette values for clusters 1-3]
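
The co-clustering idea on slide 11 can be illustrated with a small sketch. It assumes that each K-means run differs only in its random initialization (the slide does not say whether the data were also resampled), and it counts how often each pair of samples lands in the same cluster. The names `kmeans` and `co_clustering` and the default of 200 runs are hypothetical; silhouette-based selection of the number of clusters is omitted.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int, rng: random.Random,
           iterations: int = 50) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = rng.sample(list(points), k)   # random initial centers
    labels: List[int] = []
    for _ in range(iterations):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(pt, centers[c])))
            for pt in points
        ]
        if new_labels == labels:            # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:                     # keep the old center if a cluster empties
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def co_clustering(points: Sequence[Point], k: int, runs: int = 200,
                  seed: int = 0) -> List[List[float]]:
    """Fraction of runs in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans(points, k, rng)
        for i in range(n):
            for j in range(n):
                counts[i][j] += labels[i] == labels[j]
    return [[c / runs for c in row] for row in counts]
```

Entries near 1 or 0 indicate pairs that are grouped consistently across runs; intermediate values flag samples whose assignment depends on the initialization, which is what a consensus procedure is meant to expose.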

  12. Clustering identifies three morphological groups • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides) • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB) • Prognostically significant (logrank p = 4.5e-4) [Figure: heat map of feature indices for CC, CM, PB; survival vs. days for CC, CM, PB]

  13. Gene Expression Class Associations • Cox proportional hazards – Gene expression class not significant p = 0.58 – Morphology clustering p = 5.0e-3 [Figure: subtype percentage (%) of Classical, Mesenchymal, Neural and Proneural classes within clusters CC, CM, PB]

  14. Clustering Validation • Separate set of 84 GBMs from Henry Ford Hospital • ClusterRepro: CC p = 7.2e-3, CM p = 1.3e-2 [Figure: heat map of feature indices for CC, Mixed, CM; survival vs. months for CC, Mixed, CM]

  15. Associations

  16. Novel Pathology Modalities • Genomics: Excellent Molecular Resolution, Limited Spatial Resolution, 1000’s of genes • Imaging: Excellent Spatial Resolution, Limited Molecular Resolution

  17. Quantum Dots • Professor Robin Bostick

  18. Imaging Pipeline – Feature Extraction

  19. Example Application: Cancer Stem Cell Niche • Cancer stem cells – Rare(?), proliferative cells, regenerative – Do they prefer to live near blood vessels, or necrosis?

  20. Extreme Spatio-Temporal Sensor Data Analytics • Leverage exascale data and computer resources to squeeze the most out of image, sensor or simulation data • Run lots of different algorithms to derive same features • Run lots of algorithms to derive complementary features • Data models and data management infrastructure to manage data products, feature sets and results from classification and machine learning algorithms

  21. Application Targets • Multi-dimensional spatial-temporal datasets – Microscopy image analyses – Biomass monitoring using satellite imagery – Weather prediction using satellite and ground sensor data – Large scale simulations • Can we analyze 100,000+ microscopy images per hour? • Correlative and cooperative analysis of data from multiple sensor modalities and sources • What-if scenarios and multiple design choices or initial conditions

  22. Biomass Monitoring (joint with ORNL) • Investigate changes in vegetation and land use • Hierarchical, multi-resolution coarse/fine-grained analytics combined into a unified framework • Changes identified using high temporal/low spatial resolution MODIS data • Segmentation and classification methods used to characterize changes using higher resolution data (e.g. multitemporal AWiFS data) • Segmentation and classification to identify man-made structures

  23. Center for Comprehensive Informatics

  24. Core Transformations • Data Cleaning and Low Level Transformations • Data Subsetting, Filtering, Subsampling • Spatio-temporal Mapping and Registration • Object Segmentation • Feature Extraction, Object Classification • Spatio-temporal Aggregation • Change Detection, Comparison, and Quantification

  25. Extreme DataCutter • DataCutter – Pipeline of filters connected through logical streams – In transit processing – Flow control between filters and streams – Developed 1990s-2000s; led to IBM System S • Extreme DataCutter – Two level hierarchical pipeline framework – In transit processing – Coarse grained components coordinated by a Manager that coordinates work on pipeline stages between nodes – Fine grained pipeline operations managed at the node level – Both levels employ filter/stream paradigm
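
The filter/stream paradigm on slide 25 can be illustrated with a toy sketch. This is not the DataCutter API: `run_filter`, `run_pipeline` and `buffer_size` are hypothetical names, and Python threads with bounded queues stand in for filters and logical streams. A bounded queue blocks a fast producer until the downstream filter catches up, which is one simple form of flow control.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_END = object()  # sentinel marking the end of a stream


def run_filter(func: Callable[[Any], Any],
               inbox: "queue.Queue", outbox: "queue.Queue") -> None:
    """One filter: read from its input stream, transform, write downstream."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)        # propagate end-of-stream
            return
        outbox.put(func(item))      # blocks while the downstream buffer is full


def run_pipeline(items: Iterable[Any],
                 filters: List[Callable[[Any], Any]],
                 buffer_size: int = 2) -> List[Any]:
    """Connect filters with bounded streams and collect the final output."""
    streams = [queue.Queue(maxsize=buffer_size) for _ in range(len(filters) + 1)]
    workers = [
        threading.Thread(target=run_filter, args=(f, streams[i], streams[i + 1]))
        for i, f in enumerate(filters)
    ]

    def feed() -> None:
        for item in items:
            streams[0].put(item)
        streams[0].put(_END)

    workers.append(threading.Thread(target=feed))
    for w in workers:
        w.start()

    results = []
    while True:
        item = streams[-1].get()
        if item is _END:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results
```

For example, `run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2])` pushes five items through two filters in order. In the two level model described on the slide, the same pattern would appear twice: once between nodes and once among operations inside a node.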

  26. Extreme DataCutter – Two Level Model

  27. Node Level Work Scheduling • Features of Node Level Architectures – Nodes contain CPUs, GPUs – Each CPU contains multiple cores – GPU has complex internal architecture – Data locality within node – Data paths between CPUs and GPUs [Figure: Keeneland node]

  28. Node Level Work Scheduling • Attempt to minimize data movement • Identify and assign operations that perform well on GPU • Balance load between CPUs and GPUs • Prefetch data • Identify and use high bandwidth CPU/GPU data paths • Schedule exclusive GPU access for components (e.g. morphological reconstruction) requiring fine grained parallelism
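
One way to read "assign operations that perform well on GPU" and "balance load between CPUs and GPUs" is a greedy policy driven by estimated speedups. The sketch below is a simulation under that assumption, not the scheduler used in the talk: `schedule`, the task tuples and the speedup estimates are hypothetical, and data movement and prefetching are ignored. Whenever a device becomes free, a GPU takes the pending task with the highest estimated GPU speedup and a CPU core takes the one with the lowest.

```python
import heapq
from collections import deque
from typing import List, Tuple

Task = Tuple[str, float, float]  # (name, time on one CPU core, estimated GPU speedup)


def schedule(tasks: List[Task], n_cpu: int, n_gpu: int):
    """Simulate greedy, speedup-aware assignment (needs at least one device).

    Returns a list of (task name, device kind, device id, start, end).
    """
    pending = deque(sorted(tasks, key=lambda t: t[2]))   # ascending GPU speedup
    devices = [(0.0, "cpu", i) for i in range(n_cpu)]
    devices += [(0.0, "gpu", i) for i in range(n_gpu)]
    heapq.heapify(devices)                               # ordered by time the device is free
    plan = []
    while pending:
        free_at, kind, dev = heapq.heappop(devices)      # next device to become idle
        if kind == "gpu":
            name, cpu_time, speedup = pending.pop()      # best GPU candidate
            duration = cpu_time / speedup
        else:
            name, cpu_time, speedup = pending.popleft()  # worst GPU candidate
            duration = cpu_time
        plan.append((name, kind, dev, free_at, free_at + duration))
        heapq.heappush(devices, (free_at + duration, kind, dev))
    return plan
```

Because devices pull work as they become idle, a GPU that finishes early simply takes the next best candidate, which keeps both device types busy without a fixed partition of operations.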

  29. Node Level Work Scheduling

  30. Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)

  31. Control Structures for Handling Fine Grained/Runtime Dependent Parallelism in GPUs • Morphological Reconstruction: 8-15 fold speedup vs. one CPU core (Intel i7 2.66 GHz) on NVIDIA C2070 and GTX580 GPUs
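
For readers unfamiliar with the operation, here is a sequential reference sketch of grayscale morphological reconstruction by dilation, written as a queue-driven wavefront. It is not the GPU implementation from the talk; the function name `reconstruct` and the choice of 4-connectivity are assumptions. The amount of work depends on how far values propagate through the image, which is the runtime dependent behavior the slide title refers to.

```python
from collections import deque
from typing import List

Image = List[List[int]]


def reconstruct(marker: Image, mask: Image) -> Image:
    """Grow `marker` under `mask` until no pixel can increase any further."""
    rows, cols = len(mask), len(mask[0])
    # The marker may never exceed the mask.
    out = [[min(m, k) for m, k in zip(mrow, krow)]
           for mrow, krow in zip(marker, mask)]
    wavefront = deque((r, c) for r in range(rows) for c in range(cols))
    while wavefront:
        r, c = wavefront.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):   # 4-connected neighbors
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                candidate = min(out[r][c], mask[nr][nc])
                if out[nr][nc] < candidate:
                    out[nr][nc] = candidate                 # value propagates
                    wavefront.append((nr, nc))              # neighbor joins the wavefront
    return out
```

Each update can enqueue further pixels, so the size of the wavefront is only known at run time. A parallel version would need control structures to manage many such queues concurrently.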

  32. Large Scale Data Management • Implemented with IBM DB2 for large scale pathology image metadata (~ million markups per slide) • Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc. • Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships • Highly optimized spatial query and analyses

  33. Spatial Centric – Pathology Imaging “GIS” • Point query: human marked point inside a nucleus • Window query: return markups contained in a rectangle • Containment query: nuclear feature aggregation in tumor regions • Spatial join query: algorithm validation/comparison
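
The four query types on slide 33 can be illustrated on toy data. The sketch below makes a strong simplifying assumption: every markup is reduced to an axis-aligned bounding box, whereas a real system would store polygons in a spatial database. `Box` and all function names are hypothetical, and intersection-over-union is used as one plausible overlap measure for comparing the outputs of two algorithms.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Box(NamedTuple):
    name: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def area(b: Box) -> float:
    return max(0.0, b.xmax - b.xmin) * max(0.0, b.ymax - b.ymin)


def point_query(markups: List[Box], x: float, y: float) -> List[str]:
    """Point query: which markups (e.g. nuclei) contain the marked point?"""
    return [m.name for m in markups
            if m.xmin <= x <= m.xmax and m.ymin <= y <= m.ymax]


def window_query(markups: List[Box], window: Box) -> List[str]:
    """Window query: markups lying entirely inside a rectangle."""
    return [m.name for m in markups
            if window.xmin <= m.xmin and m.xmax <= window.xmax
            and window.ymin <= m.ymin and m.ymax <= window.ymax]


def mean_feature_in_region(nuclei: List[Box], features: Dict[str, float],
                           region: Box) -> Optional[float]:
    """Containment query: aggregate a nuclear feature over one region."""
    values = [features[name] for name in window_query(nuclei, region)]
    return sum(values) / len(values) if values else None


def spatial_join(a: List[Box], b: List[Box]) -> List[Tuple[str, str, float]]:
    """Spatial join: overlapping pairs from two result sets, with their IoU."""
    pairs = []
    for x in a:
        for y in b:
            w = min(x.xmax, y.xmax) - max(x.xmin, y.xmin)
            h = min(x.ymax, y.ymax) - max(x.ymin, y.ymin)
            if w > 0 and h > 0:
                inter = w * h
                pairs.append((x.name, y.name, inter / (area(x) + area(y) - inter)))
    return pairs
```

The nested loop in `spatial_join` is quadratic in the number of markups. At roughly a million markups per slide, a spatial index would be needed to make such a join practical.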

  34. PAIS (Pathology Analytical Imaging Standards) • Supported by caBIG, R01 and ACTSI • PAIS Logical Model – 62 UML classes – markups, annotations, imageReferences, provenance • PAIS Data Representation – XML (compressed) or HDF5 • PAIS Databases – loading, managing, querying and sharing data – Native XML DBMS or RDBMS + SDBMS
