FlowCAP-I: Results
Ryan Brinkman
Senior Scientist, Terry Fox Laboratory, BC Cancer Agency
Associate Professor, Medical Genetics, UBC
Sept 22, 2010
Ryan Brinkman – British Columbia Cancer Agency FlowCAP

Outline
  What it means to be better (F-measure, ranking)
  Challenge 1 results
  Challenge 2 results
  Challenge 3 results
  Challenge 4 results
  So, which method should you use?

What it means to be better - Part I
Some comparisons are easy to quantify and understand intuitively
  Does Raphael have more hair/cm² of skull than Richard?
Some aren’t
  Is Richard better looking than Raphael?
  In which case you can use a gold standard

Problems with gold standards
1. It is possible they are flawed
     You are unaware of intrinsic problems of your standard
     You start over-optimizing for some qualities of the standard
     Rogaine vs. steroids - remember this now
2. Can never be better (looking) than the standard

How to evaluate gating vs. the gold standard?
Several categories of clustering comparison metrics
  Pair counting
    Measures likelihood of grouping a pair of data points together
  Set-matching
    Measures overlap between gold standard “classes” and hypothesized “clusters”
  Entropy-based
    Measures how well clusters only contain data points from a single class (i.e., homogeneity & completeness)
Several examples within each category
  MCR, V-measure, VI, Rand Index, F-measure
F-measure has the minimum overall error for flow data
C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979

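Pair counting can be illustrated with the Rand index, one of the metrics listed above. This is a minimal sketch of the textbook definition (fraction of point pairs on which two labelings agree), not code from FlowCAP; the function name rand_index and the toy labels are assumptions made for illustration.

```python
from itertools import combinations


def rand_index(gold, pred):
    """Fraction of point pairs treated the same way by both labelings.

    A pair "agrees" when both labelings put the two points together,
    or both labelings keep them apart.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = pred[i] == pred[j]
        if same_gold == same_pred:
            agree += 1
        total += 1
    return agree / total


# Toy example (made-up labels): manual gates vs. an automated clustering.
manual = [0, 0, 1, 1]
automated = [0, 1, 1, 1]
print(rand_index(manual, automated))
```

The double loop over pairs is quadratic in the number of events, so for real flow cytometry files a contingency-table formulation would be needed; the loop is kept here only because it mirrors the definition directly.
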
F-measure
Everything you need to know about the F-measure*
  Misclassification rate is generally used for evaluating classifiers
  For FlowCAP, we performed cluster matching to label the clusters & calculate the misclassifications
  But it's very time consuming to find the best cluster matching
  F-measure uses a heuristic cluster matching algorithm
    Does not guarantee the best answer but is significantly faster
  Misclassification rate is then normalized by the size of the cluster
*Andrew Rosenberg and Julia Hirschberg. V-Measure: A conditional entropy-based external cluster evaluation measure.

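The heuristic matching described above can be sketched as follows. This uses the common set-matching form of the clustering F-measure: each gold-standard class is matched to whichever cluster gives it the highest F value, and the per-class values are weighted by class size. It is an assumption that this matches the exact variant used for FlowCAP; the function name clustering_f_measure and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(gold, pred):
    """Size-weighted best-match F-measure between classes and clusters."""
    n = len(gold)
    class_sizes = Counter(gold)          # events per manual gate
    cluster_sizes = Counter(pred)        # events per automated cluster
    overlap = Counter(zip(gold, pred))   # events shared by (class, cluster)

    total = 0.0
    for c, c_size in class_sizes.items():
        best = 0.0
        for k, k_size in cluster_sizes.items():
            common = overlap[(c, k)]
            if common == 0:
                continue
            precision = common / k_size
            recall = common / c_size
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
        # Greedy match: each class keeps its best-scoring cluster.
        total += (c_size / n) * best
    return total


# Toy example (made-up labels): two manual gates, two clusters.
manual = ["A"] * 4 + ["B"] * 4
automated = [0, 0, 0, 1, 1, 1, 1, 1]
print(round(clustering_f_measure(manual, automated), 3))
```

Because each class picks its best cluster independently, two classes may pick the same cluster. That is what makes the matching fast and also why it does not guarantee the best overall assignment.
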
What it means to be better - Part II
Some differences are easy to test for significance
  Null hypothesis: Raphael's hair thickness does not differ from Richard's
  Count # hairs in 30 random 1 cm² patches on Raphael’s head
  Count # hairs in 30 matched 1 cm² locations on Richard’s
  Do a paired t-test & check significance table
Some aren’t
  H0: flowMeans’ results do not differ from SamSPECTRAL's
  n = 5 (datasets) is too small
  Gold standard is manual gating
  Is an F-measure of .72 significantly different than .73?
  What does such a difference even mean?

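The "easy" case above can be sketched as a paired t-test computed by hand. The hair counts below are made up for illustration and use far fewer than the 30 matched patches mentioned on the slide; the function name paired_t_statistic is hypothetical. The function returns the t statistic and the degrees of freedom, which are the two values needed to look up a significance table.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample standard deviation
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1


# Made-up hair counts for four matched 1 cm^2 patches (illustration only).
raphael = [10, 12, 14, 16]
richard = [9, 10, 13, 13]
t, df = paired_t_statistic(raphael, richard)
print(f"t = {t:.3f}, df = {df}")
```

With only five datasets and a manual gold standard, the same calculation for two gating algorithms would have four degrees of freedom, which illustrates why the slide calls that comparison the hard case.
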
Scoring: fractional ranking and Borda count
Reducing complex data by evaluating it using certain criteria
  Evaluate match to human gating per sample using F-measure
  Rank F-measures high to low
  Score the “best” algorithm N points (N = # algorithms)
  Score the second-highest algorithm N-1 points
  Group algorithms with overlapping F-measure 95% CIs
  Give grouped algorithms the average score of the group
  Sum scores across datasets

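The scoring steps above can be sketched as follows. The slide does not spell out how overlapping CIs are grouped, so this sketch makes an assumption: algorithms are sorted by mean F-measure, and each group consists of a leader plus the consecutive lower-ranked algorithms whose upper CI bound reaches the leader's lower CI bound. The function name borda_scores is hypothetical. The input values are the GvHD means and 95% CIs reported in Table 1, and under this reading the sketch yields the 8.0 and 2.5 rank scores shown there.

```python
def borda_scores(results):
    """Map each algorithm to its fractional Borda score.

    results: dict of name -> (mean_f, ci_low, ci_high)
    """
    ranked = sorted(results, key=lambda name: results[name][0], reverse=True)
    n = len(ranked)
    scores = {}
    i = 0
    while i < n:
        leader_low = results[ranked[i]][1]
        j = i + 1
        # Assumed grouping rule: extend the group while the next
        # algorithm's upper CI bound reaches the leader's lower bound.
        while j < n and results[ranked[j]][2] >= leader_low:
            j += 1
        # Rank positions i..j-1 would earn n-i, ..., n-j+1 points; share the average.
        points = [n - k for k in range(i, j)]
        average = sum(points) / len(points)
        for name in ranked[i:j]:
            scores[name] = average
        i = j
    return scores


# Means and 95% CIs for challenge 1, GvHD (Table 1).
gvhd = {
    "FlowVB": (0.85, 0.78, 0.90),
    "FLOCK": (0.84, 0.77, 0.90),
    "flowMeans": (0.88, 0.82, 0.93),
    "FLAME": (0.85, 0.76, 0.92),
    "MM&PCA": (0.84, 0.74, 0.93),
    "MM": (0.83, 0.74, 0.91),
    "SamSPECTRAL": (0.87, 0.82, 0.93),
    "CDP": (0.52, 0.46, 0.57),
    "FEK": (0.64, 0.57, 0.71),
    "flowClust/Merge": (0.69, 0.56, 0.79),
    "SWIFT": (0.63, 0.56, 0.69),
}
for name, score in borda_scores(gvhd).items():
    print(f"{name:17s}{score}")
```

The grouping rule is only one plausible reading; it does not reproduce every table in the deck (the DLBCL scores in Table 2, for example, differ), so the actual procedure may have used per-sample values or a different overlap criterion.
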
Challenge 1: Automated algorithms
Unsupervised clustering
The “We really don’t know what we are looking for” challenge
  Given FCS files, markers (sometimes), general biology
  No tweaking of algorithms across datasets
  Compare to manual gates

F-Measure Distributions: Challenge 1: GvHD
Figure 1: Distributions of F-Measures for the GvHD dataset, challenge 1.

Example
Boxplots of F-measure values of different algorithms for Challenge 4: GvHD.
  There is general agreement between the algorithms and the manual analysis.
  Sample 2: a sharp change in the F-measure values; the algorithms don’t agree with the human expert.

F-Measure CIs: Challenge 1: GvHD
Figure 2: Confidence Intervals of F-Measures for the GvHD dataset, challenge 1.

Challenge 1: GvHD
Algorithm        F-measure (95% CI)  Rank score
FlowVB           0.85 (0.78, 0.90)   8.0
FLOCK            0.84 (0.77, 0.90)   8.0
flowMeans        0.88 (0.82, 0.93)   8.0
FLAME            0.85 (0.76, 0.92)   8.0
MM&PCA           0.84 (0.74, 0.93)   8.0
MM               0.83 (0.74, 0.91)   8.0
SamSPECTRAL      0.87 (0.82, 0.93)   8.0
CDP              0.52 (0.46, 0.57)   2.5
FEK              0.64 (0.57, 0.71)   2.5
flowClust/Merge  0.69 (0.56, 0.79)   2.5
SWIFT            0.63 (0.56, 0.69)   2.5
Table 1: Mean and 95 percent CIs for the F-Measures and rank scores for challenge 1 dataset GvHD

Challenge 1: DLBCL
Algorithm        F-measure (95% CI)  Rank score
FLOCK            0.88 (0.85, 0.91)   8.80
flowMeans        0.92 (0.90, 0.95)   8.80
FLAME            0.91 (0.88, 0.93)   8.80
MM               0.90 (0.86, 0.92)   8.80
SamSPECTRAL      0.86 (0.83, 0.90)   8.80
FlowVB           0.87 (0.85, 0.90)   4.75
CDP              0.85 (0.81, 0.88)   4.75
flowClust/Merge  0.84 (0.81, 0.86)   4.75
MM&PCA           0.85 (0.82, 0.88)   4.75
FEK              0.79 (0.74, 0.83)   2.00
SWIFT            0.67 (0.63, 0.71)   1.00
Table 2: Mean and 95 percent CIs for the F-Measures and rank scores for challenge 1 dataset DLBCL

Challenge 1: HSCT
Algorithm        F-measure (95% CI)  Rank score
flowMeans        0.92 (0.90, 0.94)   10
FLAME            0.94 (0.92, 0.95)   10
MM&PCA           0.91 (0.88, 0.94)   10
FLOCK            0.86 (0.83, 0.89)   7
flowClust/Merge  0.81 (0.77, 0.85)   7
SamSPECTRAL      0.85 (0.82, 0.88)   7
FlowVB           0.75 (0.70, 0.79)   4
FEK              0.70 (0.65, 0.74)   4
MM               0.73 (0.66, 0.80)   4
SWIFT            0.59 (0.55, 0.63)   2
CDP              0.50 (0.48, 0.52)   1
Table 3: Mean and 95 percent CIs for the F-Measures and rank scores for challenge 1 dataset HSCT

Challenge 1: WNV
Algorithm        F-measure (95% CI)  Rank score
FLOCK            0.83 (0.80, 0.86)   10.5
flowMeans        0.88 (0.86, 0.90)   10.5
FlowVB           0.81 (0.78, 0.83)   7.0
FEK              0.78 (0.75, 0.81)   7.0
flowClust/Merge  0.77 (0.74, 0.79)   7.0
FLAME            0.80 (0.76, 0.84)   7.0
SamSPECTRAL      0.75 (0.61, 0.85)   7.0
CDP              0.71 (0.67, 0.74)   2.5
MM&PCA           0.64 (0.52, 0.72)   2.5
MM               0.69 (0.60, 0.75)   2.5
SWIFT            0.69 (0.64, 0.74)   2.5
Table 4: Mean and 95 percent CIs for the F-Measures and rank scores for challenge 1 dataset WNV

Challenge 1: ND
Algorithm        F-measure (95% CI)  Rank score
SamSPECTRAL      0.92 (0.92, 0.93)   11.00
FLOCK            0.91 (0.89, 0.92)   8.33
flowMeans        0.85 (0.76, 0.92)   8.33
FLAME            0.90 (0.89, 0.91)   8.33
CDP              0.86 (0.81, 0.89)   7.50
SWIFT            0.87 (0.86, 0.88)   7.50
FEK              0.81 (0.80, 0.82)   4.00
FlowVB           0.85 (0.84, 0.86)   3.00
flowClust/Merge  0.73 (0.58, 0.85)   3.00
MM&PCA           0.76 (0.75, 0.77)   2.50
MM               0.75 (0.74, 0.76)   2.50
Table 5: Mean and 95 percent CIs for the F-Measures and rank scores for challenge 1 dataset ND

Challenge 1: Overall (lots of choice for automated analysis)
Algorithm        Rank score  Total runtime (dd:hh:mm:ss)
flowMeans        45.6        00:04:23:27
FLOCK            42.6        00:00:37:38
FLAME            42.1        00:05:31:12
SamSPECTRAL      41.8        00:07:21:44
MM&PCA           27.8        00:00:04:35
FlowVB           26.8        03:02:23:09
MM               25.8        00:00:13:00 (sorry)
flowClust/Merge  24.2        10:13:00:00
FEK              19.5        00:15:25:00
CDP              18.2        00:01:48:06
SWIFT            15.5        05:23:24:30
Table 6: Total runtimes (dd:hh:mm:ss) and rank scores for challenge 1

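The overall rank score appears to be the sum of the per-dataset rank scores from Tables 1 to 5. A quick check for three algorithms, using only the values reported in those tables (the variable names are chosen here for illustration):

```python
# Per-dataset rank scores in the order GvHD, DLBCL, HSCT, WNV, ND (Tables 1-5).
per_dataset_scores = {
    "flowMeans": [8.0, 8.80, 10, 10.5, 8.33],
    "SamSPECTRAL": [8.0, 8.80, 7, 7.0, 11.00],
    "SWIFT": [2.5, 1.00, 2, 2.5, 7.50],
}

# Sum across datasets and round to one decimal, as in Table 6.
overall = {name: round(sum(scores), 1) for name, scores in per_dataset_scores.items()}
for name, total in overall.items():
    print(f"{name:13s}{total}")
```

The sums round to 45.6, 41.8, and 15.5, matching the Rank Score column of Table 6 for these three algorithms.
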
Challenge 2: Tuned Algorithms (in the Absence of Example Human-Provided Gates)
Add in the number of clusters
Same as challenge 1, and ...
  You can tweak algorithm parameters to get a better “fit” to the data

Challenge 2: GvHD
Algorithm            F-measure (95% CI)  Rank score
NMF-curvHDR          0.76 (0.69, 0.82)   5.0
FLOCK                0.84 (0.76, 0.90)   5.0
FLAME                0.81 (0.75, 0.87)   5.0
SamSPECTRAL          0.87 (0.79, 0.93)   5.0
SamSPECTRAL-Fixed-K  0.87 (0.80, 0.93)   5.0
CDP                  0.59 (0.52, 0.64)   1.5
flowClust/Merge      0.69 (0.54, 0.79)   1.5
Table 7: Mean and 95 percent CIs for the F-Measures and rank scores for challenge 2 dataset GvHD

Challenge 2: DLBCL
Algorithm            F-measure (95% CI)  Rank score
FLOCK                0.88 (0.85, 0.91)   5.5
flowClust/Merge      0.87 (0.85, 0.90)   5.5
FLAME                0.87 (0.84, 0.90)   5.5
SamSPECTRAL          0.92 (0.89, 0.94)   5.5
NMF-curvHDR          0.84 (0.82, 0.86)   2.5
SamSPECTRAL-Fixed-K  0.85 (0.81, 0.89)   2.5
CDP                  0.75 (0.69, 0.81)   1.0
Table 8: Mean and 95 percent CIs for the F-Measures and rank scores for challenge 2 dataset DLBCL
