1. SWIFT: SCALABLE WEIGHTED ITERATIVE FLOW-CLUSTERING TECHNIQUE
Iftekhar Naim∗, Gaurav Sharma∗, Suprakash Datta†, James S. Cavenaugh∗, Jyh-Chiang E. Wang∗, Jonathan A. Rebhahn∗, Sally A. Quataert∗, and Tim R. Mosmann∗
∗University of Rochester, Rochester, NY; †York University, Toronto, ON
FlowCAP Summit, 2010

2. OUTLINE
1 INTRODUCTION: Flow cytometry (FC) data analysis; automated multivariate clustering of FC data
2 SWIFT METHOD FOR FC DATA ANALYSIS: SWIFT algorithm; weighted iterative sampling based EM; bimodality splitting; graph-based merging
3 DOES IT WORK?: Does it work? How do we know it works?
4 FLOWCAP CONTEST: Results on FlowCAP datasets; a few thoughts for FlowCAP II
5 CONCLUSION

4. FLOW CYTOMETRY (FC) OVERVIEW
◮ Rapid multivariate analysis of individual cells.
◮ High-throughput data generation (description of ∼1 million cells).
◮ High dimensionality (∼20 measurements per cell).
FIGURE: Flow cytometry system (fluorochrome, antibody, antigen, cell). Ref: http://probes.invitrogen.com

5. FC DATA ANALYSIS
◮ Traditionally, FC data are analyzed by manual gating:
  ◮ Subjective; scales poorly with increasing dimensions
  ◮ 1D/2D projections may not represent the full picture
  ◮ Inaccurate for overlapping clusters
FIGURE: Manual gating for overlapping clusters. (a) Two overlapping clusters; (b) combined view; (c) manual gating.
◮ Automated multivariate clustering is desirable for FC data analysis.
  ◮ Repeatable, nonsubjective, comprehends multivariate structure

7. CHALLENGES OF AUTOMATED CLUSTERING OF FC DATA
◮ Challenges of automated clustering:
  ◮ Large FC datasets (∼1 million events)
  ◮ High dimensionality (20 or more dimensions)
  ◮ Very small clusters that are important in immunological analysis (100–200 cells out of millions)
  ◮ Overlapping clusters and background noise
◮ Our goal: design an automated clustering method capable of addressing these challenges.

8. MANY DIFFERENT CLUSTERING METHODS
Partitional clustering:
◮ Soft: mixture model, fuzzy clustering
◮ Hard: K-means, grid based, spectral clustering, ...

12. MODEL BASED CLUSTERING FOR FC DATA
◮ Model based clustering offers several advantages:
  ◮ Soft clustering: comprehends overlapping clusters and background noise
  ◮ BUT, it is computationally expensive, and the choice of model imposes limitations
◮ Recent proposals for statistical model based FC clustering (Chan et al. [2008], Lo et al. [2008], Finak et al. [2009], Pyne et al. [2009]).
◮ We propose a computationally efficient model-based clustering method, SWIFT (Naim et al. [2010]), that offers two advantages:
  ◮ Scalability: faster computation + less memory usage
  ◮ Detection of small populations: ∼100 cells out of 1 million

17. SWIFT ALGORITHM FOR FC DATA CLUSTERING
SWIFT is a three stage algorithm:
1 Weighted iterative sampling based EM: Gaussian mixture model clustering + novel weighted iterative sampling
  ◮ Bayesian Information Criterion (BIC)
2 Bimodality splitting: split any cluster that is
  ◮ bimodal in any dimension or any principal component
  ◮ Useful for clustering high dimensional data
3 Graph-based merging: merge overlapping Gaussians (Hennig [2009], Finak et al. [2009], Baudry et al. [2010])
  ◮ Allows representation of non-Gaussian clusters
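The slides do not say which bimodality test stage 2 applies to each dimension or principal component. As an illustrative stand-in only (not the authors' implementation), Sarle's bimodality coefficient is one cheap per-dimension screen: values above ~5/9, the value for a uniform distribution, suggest the projection may be bimodal.

```python
def bimodality_coefficient(xs):
    """Sarle's bimodality coefficient from sample skewness and
    excess kurtosis; values above ~5/9 hint at bimodality.
    Illustrative stand-in for an unspecified bimodality test."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # central moments
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    correction = 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (skew ** 2 + 1.0) / (excess_kurt + correction)

import random
rng = random.Random(0)
bimodal = [0.0] * 500 + [10.0] * 500                             # two clear modes
unimodal = [sum(rng.random() for _ in range(12)) for _ in range(1000)]  # near-normal
```

Here `bimodality_coefficient(bimodal)` exceeds the 5/9 reference while the near-normal sample falls below it; a cluster flagged this way in any dimension would be a candidate for splitting.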

18. CLUSTERING STRATEGY: SWIFT
GMM clustering with sampling for k ∈ [K_min, K_max]
→ BIC to decide the number of Gaussians (K̂)
→ Split bimodal clusters until unimodal; results in K_split clusters
→ Graph-based merging using overlap/entropy criteria; results in K_entropy clusters
→ Soft clustering for the K_entropy clusters
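The BIC step above can be made concrete: for a k-component, d-dimensional Gaussian mixture with full covariances, BIC = −2 log L + p ln N, where p counts the free parameters, and K̂ is the k in [K_min, K_max] that minimizes BIC. A minimal sketch (the helper names are my own, not from the slides):

```python
import math

def gmm_param_count(k, d):
    """Free parameters of a k-component, d-dimensional Gaussian mixture:
    k means (d each), k full covariances (d(d+1)/2 each), k-1 weights."""
    return k * d + k * d * (d + 1) // 2 + (k - 1)

def bic(log_likelihood, k, d, n):
    """Bayesian Information Criterion; lower is better."""
    return -2.0 * log_likelihood + gmm_param_count(k, d) * math.log(n)
```

Given the fitted log-likelihood for each k, K̂ = argmin over k of `bic(...)`; the ln N penalty stops larger mixtures from winning on likelihood alone.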

21. STAGE 1: GAUSSIAN MIXTURE MODEL CLUSTERING
◮ Gaussian mixture model (GMM) clustering is chosen among the model based methods:
  ◮ Faster than other model based clustering methods
  ◮ Closed form solution
◮ Expectation Maximization (EM) algorithm for parameter estimation
  ◮ Computational complexity of each iteration: O(Nkd²)
  ◮ N = the number of data vectors in the dataset
  ◮ k = the number of Gaussian components
  ◮ d = the dimension of each data vector
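The closed-form EM updates can be shown in a minimal 1-D sketch (SWIFT itself fits multivariate Gaussians, where the covariance terms give the O(Nkd²) per-iteration cost; this toy version is O(Nk) and is not the authors' code):

```python
import math

def em_gmm_1d(xs, k, iters=30):
    """A few EM iterations for a 1-D Gaussian mixture (illustrative sketch)."""
    n = len(xs)
    srt = sorted(xs)
    w = [1.0 / k] * k                                       # mixing weights
    mu = ([srt[i * (n - 1) // (k - 1)] for i in range(k)]   # quantile init
          if k > 1 else [srt[n // 2]])
    m = sum(xs) / n
    var = [sum((x - m) ** 2 for x in xs) / n] * k           # overall variance
    for _ in range(iters):
        # E-step: responsibilities gamma[i][l] = P(component l | x_i)
        gamma = []
        for x in xs:
            p = [w[l] / math.sqrt(2 * math.pi * var[l])
                 * math.exp(-(x - mu[l]) ** 2 / (2 * var[l])) for l in range(k)]
            s = sum(p) or 1e-300
            gamma.append([pl / s for pl in p])
        # M-step: closed-form updates for weights, means, variances
        for l in range(k):
            nl = sum(g[l] for g in gamma) or 1e-300
            w[l] = nl / n
            mu[l] = sum(g[l] * x for g, x in zip(gamma, xs)) / nl
            var[l] = max(sum(g[l] * (x - mu[l]) ** 2
                             for g, x in zip(gamma, xs)) / nl, 1e-6)
    return w, mu, var

import random
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(300)] + [rng.gauss(10, 1) for _ in range(300)]
w, mu, var = em_gmm_1d(xs, 2)   # recovers means near 0 and 10
```

Every update is a weighted average, which is the "closed form solution" advantage the slide refers to: no inner numerical optimization is needed per iteration.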

23. STAGE 1: SAMPLING FOR SCALABILITY
◮ Operate on a smaller subsample of the dataset for better computational performance.
◮ Challenge: poor representation of smaller clusters.
FIGURE: (a) 4 Gaussians with 150K, 100K, 50K and 150 datapoints; (b) after 10% sampling.
◮ Solution: weighted iterative sampling
  ◮ Faster computation
  ◮ Better detection of small clusters
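The weighted resampling idea can be sketched as: draw the next subsample with probability proportional to the responsibility mass each point assigns to the not-yet-fixed clusters, Σ_{l∉F} γ_l^(i). Points already well explained by the fixed large clusters are rarely redrawn, so small clusters are enriched in later subsamples. The function name and signature below are hypothetical:

```python
import random

def weighted_resample(gamma, fixed, size, seed=0):
    """Resample point indices with probability proportional to the
    responsibility mass on clusters not yet fixed (not in `fixed`).
    gamma[i][l] = responsibility of cluster l for point i."""
    rng = random.Random(seed)
    weights = [sum(g[l] for l in range(len(g)) if l not in fixed)
               for g in gamma]
    return rng.choices(range(len(gamma)), weights=weights, k=size)

# Toy example: point 0 is fully explained by fixed cluster 0,
# so it never appears in the resample.
gamma = [[1.0, 0.0], [0.2, 0.8], [0.0, 1.0]]
sample = weighted_resample(gamma, fixed={0}, size=10)
```

This is what lets the small 150-point cluster in the figure survive: once the 150K-point cluster is fixed, its points stop crowding the subsample.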

26. STAGE 1: WEIGHTED ITERATIVE SAMPLING BASED EM
Flowchart (F = set of clusters whose parameters are fixed):
1 Start with the FCS dataset X; initially F = ∅.
2 Subsample S from X, selecting X^(i) with probability P(X^(i) is selected in S) = Σ_{l∉F} γ_l^(i).
3 GMM fitting to S using EM.
4 Fix the p largest clusters and add them to F.
5 If not all clusters are fixed, resample S from X and repeat from step 3.
6 Otherwise, perform a few EM iterations on the full dataset X and output the model parameters (θ).

27. STAGE 1: WEIGHTED ITERATIVE SAMPLING BASED EM
FIGURE: 4 Gaussian clusters with 150K, 100K, 50K and 150 datapoints.
