Learning algorithms and statistical software, with applications to bioinformatics
PhD defense of Toby Dylan Hocking
toby.hocking@inria.fr
http://cbio.ensmp.fr/~thocking/
20 November 2012
Summary of contributions
◮ Ch. 2: clusterpath for finding groups in data, ICML 2011.
◮ Ch. 3: breakpoint annotations for smoothing model training and evaluation, HAL-00663790.
◮ Ch. 4–5: penalties for breakpoint detection in simulated and real signals, under review.
◮ Statistical software contributions in R:
  ◮ Ch. 7: direct labels for readable statistical graphics, Best Student Poster at useR 2011.
  ◮ Ch. 8: documentation generation to convert comments into a package for distribution, accepted in JSS.
  ◮ Ch. 9: named capture regular expressions for extracting data from text files, talk at useR 2011, accepted into R-2.14.
Cancer cells show chromosomal copy number alterations
Spectral karyotypes show the number of copies of the sex chromosomes (X, Y) and autosomes (1–22). Source: Alberts et al. 2002.
Left: normal cell with 2 copies of each autosome. Right: cancer cell with many copy number alterations.
Copy number profiles of neuroblastoma tumors
Ch. 2: clusterpath finds groups in data
Ch. 3: breakpoint annotations for smoothing model selection
Ch. 4–5: penalties for breakpoint detection
The clusterpath relaxes a hard fusion penalty

min over α ∈ R^(n×p) of ||α − X||²_F
subject to Σ_{i<j} 1[α_i ≠ α_j] ≤ t.

The constraint bounds the number of distinct pairs of rows, which is combinatorial. Convex relaxation:

Σ_{i<j} ||α_i − α_j||_q w_ij ≤ t.

The clusterpath is the path of optimal α obtained by varying t.
[Figure: three points X_1, X_2, X_3 whose solutions α_1 and α_C = α_2 = α_3 illustrate a fusion.]
Related work: “fused lasso” Tibshirani and Saunders (2005), “convex clustering shrinkage” Pelckmans et al. (2005), “grouping pursuit” Shen and Huang (2010), “sum of norms” Lindsten et al. (2011).
Choice of norm and weights alters the clusterpath

Take X ∈ R^(10×2) and solve

min over α of ||α − X||²_F subject to Ω(α)/Ω(X) ≤ t,

for t = 0, 0.1, 0.2, …, 1. Penalty with ℓ_q norm, shown for q = 1, 2, ∞:

Ω(Y) = Σ_{i<j} ||Y_i − Y_j||_q w_ij

Weights: w_ij = exp(−γ ||X_i − X_j||²_2), shown for γ = 0 and γ = 1.
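The penalty Ω and the Gaussian weights above can be sketched as follows (a minimal Python illustration, not the R clusterpath implementation; the function names and use of numpy are assumptions):

```python
import numpy as np

def gaussian_weights(X, gamma):
    # w_ij = exp(-gamma * ||X_i - X_j||_2^2) for all pairs i < j
    n = X.shape[0]
    i, j = np.triu_indices(n, k=1)
    sq_dist = np.sum((X[i] - X[j]) ** 2, axis=1)
    return i, j, np.exp(-gamma * sq_dist)

def clusterpath_penalty(Y, X, q, gamma):
    # Omega(Y) = sum over i < j of ||Y_i - Y_j||_q * w_ij
    i, j, w = gaussian_weights(X, gamma)
    diffs = Y[i] - Y[j]
    if np.isinf(q):
        norms = np.max(np.abs(diffs), axis=1)   # l_infinity norm
    else:
        norms = np.sum(np.abs(diffs) ** q, axis=1) ** (1.0 / q)
    return float(np.sum(norms * w))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
# gamma = 0 gives uniform weights w_ij = 1; at the bound t = 1 the
# constraint Omega(alpha)/Omega(X) <= t is inactive and alpha = X.
print(clusterpath_penalty(X, X, q=2, gamma=0.0))
```

Note that Ω(Y) = 0 exactly when all rows of Y are fused into one cluster, which is why shrinking t toward 0 forces fusions.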
Clusterpath learns a tree, even for odd cluster shapes
Comparison with other methods for finding 2 clusters.
Caveat: does not recover overlapping clusters, e.g. iris data, Gaussian mixture.
Contributions in chapter 2, future work
Hocking et al. Clusterpath: an Algorithm for Clustering using Convex Fusion Penalties. ICML 2011.
◮ Theorem: there are no splits in the ℓ_1 clusterpath with identity weights w_ij = 1. What about other situations?
◮ Convex and hierarchical clustering algorithms:
  ◮ ℓ_1 homotopy method, O(pn log n).
  ◮ ℓ_2 active-set method, O(pn²).
  ◮ ℓ_∞ Frank-Wolfe algorithm.
◮ Implementation in R package clusterpath on R-Forge.
Ch. 2: clusterpath finds groups in data
Ch. 3: breakpoint annotations for smoothing model selection
Ch. 4–5: penalties for breakpoint detection
How to detect breakpoints in 23 × 575 = 13,225 signals?
Which model should we use?
◮ GLAD: adaptive weights smoothing (Hupé et al., 2004)
◮ DNAcopy: circular binary segmentation (Venkatraman and Olshen, 2007)
◮ cghFLasso: fused lasso signal approximator with heuristics (Tibshirani and Wang, 2007)
◮ HaarSeg: wavelet smoothing (Ben-Yaacov and Eldar, 2008)
◮ GADA: sparse Bayesian learning (Pique-Regi et al., 2008)
◮ flsa: fused lasso signal approximator path algorithm (Hoefling, 2009)
◮ cghseg: pruned dynamic programming (Rigaill, 2010)
◮ PELT: pruned exact linear time (Killick et al., 2011)
... and how to select the smoothing parameter in each model?
575 copy number profiles, each annotated in 6 regions
Not enough breakpoints
Too many breakpoints
Good agreement with annotated regions
Select the best model using the breakpoint annotations
Breakpoint detection training errors for 3 models (cghseg.k/pelt.n, flsa.norm, dnacopy.sd) of data(neuroblastoma, package="neuroblastoma").
[Figure: percent of incorrectly predicted annotations in the training set as a function of log10(smoothing parameter λ); smaller λ gives more breakpoints (false positives) and larger λ gives fewer breakpoints (false negatives). The minimum training errors of the three models are 2.2, 4.8 and 11.5 percent.]
Idea: for several smoothing parameters λ, calculate the annotation error function E(λ) (black line), then select the model with least error (black dot):
λ̂ = arg min over λ of E(λ).
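The selection rule λ̂ = arg min E(λ) can be sketched in a few lines of Python (a toy illustration with made-up region labels and predicted break counts; the function names are assumptions, not the package API):

```python
def annotation_error(pred_breaks, labels):
    # labels: "breakpoint" = at least one break expected, "normal" = none
    err = 0
    for k, label in zip(pred_breaks, labels):
        if label == "normal" and k > 0:
            err += 1  # false positive: break predicted in a normal region
        elif label == "breakpoint" and k == 0:
            err += 1  # false negative: annotated break not predicted
    return err

def select_lambda(lambdas, predictions, labels):
    # predictions[l] holds predicted break counts per region at lambdas[l];
    # return the lambda minimizing the annotation error E(lambda)
    errors = [annotation_error(p, labels) for p in predictions]
    return lambdas[errors.index(min(errors))], errors

labels = ["normal", "breakpoint", "normal", "breakpoint"]
lambdas = [0.1, 1.0, 10.0]
predictions = [[3, 2, 1, 1],  # small lambda: too many breakpoints
               [0, 1, 0, 1],  # matches all four annotations
               [0, 0, 0, 0]]  # large lambda: too few breakpoints
best_lambda, errors = select_lambda(lambdas, predictions, labels)
print(best_lambda, errors)  # 1.0 [2, 0, 2]
```

The same error count works for any of the smoothing models, since only the predicted breakpoint locations enter E(λ).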
PELT/cghseg show the best breakpoint detection
ROC curves for breakpoint detection training errors of each model, obtained by varying the smoothness parameter λ.
[Figure: three panels (optimization-based models, approximate optimization, glad) plotting the true positive rate = probability(predict breakpoint | breakpoint) against the false positive rate = probability(predict breakpoint | normal), for cghseg.mBIC, cghseg.k, pelt.n, pelt.default, flsa, flsa.norm, gada, dnacopy.alpha, dnacopy.sd, dnacopy.default, dnacopy.prune, glad.haarseg, glad.default, glad.MinBkpWeight and glad.lambdabreak.]
Open circle shows the smoothness λ selected using annotations.
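Each value of λ contributes one point on such a curve. A minimal Python sketch of how the two rates are computed from annotated regions (illustrative data, not the thesis benchmark):

```python
def roc_point(pred_breaks, labels):
    # pred_breaks[i]: predicted breakpoint count in annotated region i
    # labels[i]: "breakpoint" (>= 1 break expected) or "normal" (0 expected)
    pos = [k for k, l in zip(pred_breaks, labels) if l == "breakpoint"]
    neg = [k for k, l in zip(pred_breaks, labels) if l == "normal"]
    tpr = sum(k > 0 for k in pos) / len(pos)  # P(predict break | breakpoint)
    fpr = sum(k > 0 for k in neg) / len(neg)  # P(predict break | normal)
    return fpr, tpr

labels = ["breakpoint", "breakpoint", "normal", "normal"]
# Sweeping lambda from small (many breaks) to large (few) traces the curve
curve = [roc_point(p, labels)
         for p in ([2, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0])]
print(curve)  # [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
```

A good detector has some λ reaching the top-left corner (fpr = 0, tpr = 1), as in the middle point above.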
Few annotations required for a good breakpoint detector
[Figure: percent of correctly predicted annotations on test set profiles (from 80 to 100) versus the number of annotated profiles in the global model training set (from 1 to 30), for cghseg.k/pelt.n, flsa.norm, dnacopy.sd and glad.lambdabreak; cghseg.k and pelt.n attain the highest test accuracy.]
Interactive web site for annotation and model building
Takita J et al. Aberrations of NEGR1 on 1p31 and MYEOV on 11q13 in neuroblastoma. Cancer Sci. 2011 Sep;102(9):1645-50.