Automated High-dimensional Cytometric Data Analysis
Philip L. De Jager, M.D. Ph.D.
Director, Program in Translational NeuroPsychiatric Genomics, Brigham & Women’s Hospital
Assistant Professor of Neurology, Harvard Medical School
Challenges in cytometric analysis
• Large amount of high dimensional data
• Manual data processing (subjective, slow)
• Not suitable for high-throughput study
• Difficult to use in inferential analysis – “hypothesis limited”
• Sub-optimal usage of data dimensions
- Increasingly multi-parametric
- Restricted visualization
Solution: Automated & Multivariate Analysis
FLAME: FLow cytometry analysis with Automated Multivariate Estimation
• Clustering – parametric and multivariate mixture modeling of the populations in each flow sample
• Meta-clustering – match the corresponding populations from multiple samples to compare features of these matched populations
• Feature selection – identify features that distinguish populations between different classes (such as normal vs. disease, wt vs. mutant, longitudinal observations, etc.)
• Classification – predict class membership for new samples based on those distinctive features
FLAME summary
1. Clustering flow data (Sample 1, Sample 2, Sample 3)
2. Meta-clustering
3. Feature Selection (class1, class2, class3)
• Frequencies • Locations • Means • Modes • Variances • Scales • Orientations • Shapes
Downstream Analyses
• Visualization
• Class Discovery
• Class Prediction
• Etc.
FLAME Methodology
Concept: Finite Mixture Model
Finite mixture model: a weighted sum of g univariate or multivariate densities
[Figure: univariate Gaussian mixture and bivariate Gaussian mixture; labels: w1=0.5, w2=0.5; fitted curve; µ1, µ2, µ3; g=3; g=2; sum of 3 Gaussians]
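The weighted-sum definition can be sketched in a few lines of Python. This is only an illustration of the univariate Gaussian case: the equal weights follow the w1 = w2 = 0.5 label on the slide, while the component means, the standard deviations, and the function names are assumptions made for the example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """Weighted sum of g Gaussian component densities; weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# g = 2 mixture with equal weights.
weights = [0.5, 0.5]
mus = [-2.0, 2.0]    # illustrative component means (assumed, not from the talk)
sigmas = [1.0, 1.0]  # illustrative standard deviations (assumed, not from the talk)

density_at_zero = mixture_pdf(0.0, weights, mus, sigmas)
```

Because the weights sum to 1 and each component integrates to 1, the mixture is itself a valid density.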
Different distributions
[Figure panels: N, Skew N]
Model selection options in FLAME
[Figure panels: N, Skew N]
Step 1: Fitting a distribution
• Lymphoblastic cell line
Fitting skew t deals with asymmetry
[Figure: asymmetric data distribution, density plot; fits labeled Gaussian and Skew]
Step 2: Meta-clustering
1. Input: Individual samples clustered by mixture model
2. Take all samples and pool their cluster locations
3. Algorithm: Run Partitioning Around Medoids (PAM) to obtain k meta-clusters
4. Output: Matched features used for classification of samples
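The pooling and PAM steps can be sketched as follows. This is a minimal pure-Python PAM (greedy BUILD phase, then SWAP until no improvement) with Euclidean distance, not FLAME's own implementation. The three samples, their two-dimensional cluster locations, and all function names are hypothetical.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Partitioning Around Medoids: greedy BUILD, then SWAP until no improvement."""
    n = len(points)
    medoids = []
    # BUILD: add, one at a time, the point that lowers the total cost most.
    for _ in range(k):
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda i: total_cost(points, medoids + [i]))
        medoids.append(best)
    # SWAP: try replacing a medoid with a non-medoid; keep any swap that helps.
    improved = True
    while improved:
        improved = False
        current = total_cost(points, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(points, trial)
                if cost < current - 1e-12:
                    medoids, current, improved = trial, cost, True
    # Assign every pooled location to its nearest medoid (its meta-cluster).
    labels = [min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels

# Hypothetical cluster locations from three samples, two clusters each.
samples = {
    "sample1": [(1.0, 1.0), (5.0, 5.0)],
    "sample2": [(1.2, 0.9), (5.1, 4.8)],
    "sample3": [(0.9, 1.1), (4.9, 5.2)],
}
# Pool the locations, remembering which sample each one came from.
origin = [name for name, locs in samples.items() for _ in locs]
pooled = [loc for locs in samples.values() for loc in locs]

medoids, labels = pam(pooled, k=2)
matched = list(zip(origin, pooled, labels))  # (sample, location, meta-cluster)
```

Clusters from different samples that share a meta-cluster label are treated as the same population, so their features can be compared across samples.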
Example 2: Identifying discriminating features
• Experiment: examine ZAP70 and SLP76 phosphorylation events before and after T cell receptor activation in naïve and memory T cells
• Lymphocytes stained with four markers:
• CD4
• CD45RA
• ZAP70Y292
• SLP76Y128
• 60 samples: 30 subjects x two time points: pre- and post- anti-CD3 antibody stimulation
Registering populations across samples
[Figure panels: Pre-stimulation samples, Post-stimulation samples]
[Figure panels a–e, CD45RA axis: sample 121106A_0min (pre-stimulation) and sample 121106A_5min (post-stimulation)]
Step 3: Discriminating features
[Figure panels: Pre-stimulation (zero-minute), Post-stimulation (five-minute)]
Discriminating features

Feature name     Type         Cluster #  Dimension(s)  ∆ mean [five-min]  p-value
vars11.4         Variance     4          1             -0.156             1.65E-18
orientation 72   Orientation  5          3             -0.649             1.01E-14
orientation 56   Orientation  4          3             -0.609             1.13E-12
vars11.5         Variance     5          1             -0.082             4.00E-08
orientation 66   Orientation  5          1             -0.515             1.37E-05
shape 11         Shape        3          3             -0.175             2.62E-08
scale4           Scale        4          NA            -0.052             3.32E-06
orientation 19   Orientation  2          1             -0.632             1.34E-06
shape 8          Shape        2          2             -0.141             4.41E-09
shape 15         Shape        4          4             -0.178             5.17E-07
vars41.5         Variance     5          1,4           -0.024             2.63E-05
orientation 42   Orientation  3          3             -0.422             9.73E-04
shape 20         Shape        5          4             -0.060             7.93E-05
scale5           Scale        5          NA            -0.038             7.23E-04
vars43.3         Variance     3          3,4           -0.020             7.10E-04
vars31.4         Variance     4          1,3           -0.015             3.34E-03
vars11.3         Variance     3          1             0.314              6.22E-12
orientation 52   Orientation  4          1             0.552              1.87E-10
vars21.2         Variance     2          1,2           0.251              1.22E-10
vars21.3         Variance     3          1,2           0.259              1.14E-11
vars21.4         Variance     4          1,2           0.060              3.42E-08
orientation 20   Orientation  2          1             0.504              2.17E-11
shape 10         Shape        3          3             0.740              1.31E-09
shape 7          Shape        2          2             0.682              4.49E-16
shape 13         Shape        4          4             1.023              4.37E-09
mus1.4           Mean         4          1             1.761              6.13E-22
orientation 54   Orientation  4          2             0.534              1.26E-08
mus1.5           Mean         5          1             1.657              2.47E-21
vars22.2         Variance     2          2             0.282              5.45E-05
orientation 59   Orientation  4          3             0.548              1.51E-04
vars22.3         Variance     3          2             0.146              1.09E-05
orientation 47   Orientation  3          4             0.561              4.65E-05
orientation 43   Orientation  3          3             0.066              8.01E-05
orientation 70   Orientation  5          2             0.308              4.07E-03
scale3           Scale        3          NA            0.063              2.19E-04
mus1.2           Mean         2          1             1.571              1.52E-18
vars11.2         Variance     2          1             0.131              1.62E-04
vars22.5         Variance     5          2             0.023              2.65E-04
Example 3: Identifying a rare cell population
Regulatory T cells make up less than 0.5-1.0% of human peripheral blood mononuclear cells
[Figure axis: Foxp3-PE]
Baecher-Allan et al., JI, 2006
Stepwise detection of Tregs
[Figure panels: Step 1, Step 2]
Overview
[Diagram labels: Operator/QC, FLAME]
FLAME
• Automated analysis method
• Deconstructs the components of a mixture of cells
• Cross-registers cell clusters across samples
• Provides a specific record of analysis parameters, allowing exact replication of an analysis by a third party
• Operator Modes:
– Cell population discovery mode
– Clinical trial mode
Availability
• Free software
• Available through the GenePattern toolkit on the Broad Institute website
– http://www.broadinstitute.org/cancer/software/genepattern/index.html
• GenePattern – an environment with pipelining capabilities and a repertoire of downstream analysis tools
• Pyne et al. Proc Natl Acad Sci USA 2009; 106: 8519-8524.
Acknowledgements • De Jager lab • Jill Mesirov – Cristin Aubin – Saumyadipta Pyne – Aaron Brandes – Pablo Tamayo – Becky Briskin – Lori Chibnik • Geoff McLachlan – Portia Chipendo • Kui Wang – Xinli Hu – Linda Ottoboni • David Hafler – Nikolaos Patsopoulos – Clare Baecher-Allan – Joshua Shulman – Lisa Maier – Dong Tran – Irene Wood – Zongqi Xia
Funding Sources
• National MS Society
• NIH: NIA, NINDS
Illustrative Examples
Example 3: Feature selection – Phosphorylation of naïve & memory T cells pre- and post-stimulation
4-dimensional samples
Mixture modeling (ZAP70Y292 not shown)
[Figure axes: CD45RA, CD4]
Phosphorylation causes feature alterations in populations
[Figure panels: Pre-stimulation (0 min.), Post-stimulation (5 min.)]
Matching pre- and post-stimulation populations
[Figure panels: pre-stimulation, post-stimulation]
Matching pre- and post-stimulation populations across all samples
Feature Selection Heatmap
[Figure columns: Zero-minutes, Five-minutes]
Data QC/standardization
• Carefully selected panels
• Minimal cross-sample variation
FLow analysis with Automated Multivariate Estimation
10/01/2008
Low dimension t-mixture
Outliers?
Low dimension clustering is not good enough
Multivariate t-mixture is better
Symmetric density is often not good enough
Modeling with skewed distributions
Better fit with skew
[Figure: Skew N, the skew-normal distribution]
Photo courtesy: Azzalini J.M. et al. Statistical applications of the multivariate skew-normal distribution, 1999.
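To make the skew-normal idea concrete, here is a small sketch of the univariate density in its standard form, f(x) = (2/ω) φ(z) Φ(αz) with z = (x − ξ)/ω. The slide refers to the multivariate version; the one-dimensional case is shown only because it is the easiest to inspect. The parameter names `location`, `scale`, and `shape` are labels chosen for this example.

```python
import math

def std_normal_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    """Standard normal cumulative distribution, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def skew_normal_pdf(x, location=0.0, scale=1.0, shape=0.0):
    """Univariate skew-normal density; shape = 0 gives the ordinary Gaussian."""
    z = (x - location) / scale
    return (2.0 / scale) * std_normal_pdf(z) * std_normal_cdf(shape * z)

# A positive shape parameter pushes mass to the right of the location.
right_side = skew_normal_pdf(1.0, shape=4.0)
left_side = skew_normal_pdf(-1.0, shape=4.0)
```

Setting `shape` to zero recovers the symmetric Gaussian, so the skewed family contains the Gaussian as a special case and can only fit asymmetric populations at least as well.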
Parametric mixture modeling
• A biological population is assumed to follow a mathematical distribution, such as Gaussian
• Each population can be abstracted as a cluster, described by parameters such as mean, mode, standard deviation, and skew
• A mixture of populations can be abstracted as a mixture of distributions
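One standard way to estimate the parameters of such a mixture is the expectation-maximization (EM) algorithm. The slides do not say which fitting procedure FLAME uses, so the sketch below is an assumption-laden illustration for the univariate Gaussian case only. The data values, starting guesses, and function names are all made up for the example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_gaussian_mixture(data, mus, sigmas, weights, n_iter=50):
    """EM for a univariate Gaussian mixture, starting from the given guesses."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    g = len(mus)
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(g)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(g):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)  # floor avoids a collapsed component
    return weights, mus, sigmas

# Made-up one-dimensional measurements from two well-separated populations.
data = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
weights, mus, sigmas = fit_gaussian_mixture(
    data, mus=[1.0, 8.0], sigmas=[2.0, 2.0], weights=[0.5, 0.5]
)
```

Each fitted component is then a cluster summarized by its own weight, mean, and standard deviation, which is the abstraction the bullets above describe.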
Modeling with Gaussian
Gaussian may be too “skinny” to capture the entire population