Including prior knowledge in machine learning for genomic data
Jean-Philippe Vert
Mines ParisTech / Curie Institute / Inserm
StatLearn workshop, Grenoble, March 17, 2011
J.P Vert (ParisTech) Prior knowledge in ML StatLearn 1 / 68
Outline
1. Motivations
2. Finding multiple change-points in a single profile
3. Finding multiple change-points shared by many signals
4. Supervised classification of genomic profiles
5. Learning molecular classifiers with network information
6. Conclusion
Motivations
Chromosomal aberrations in cancer
Comparative Genomic Hybridization (CGH)
Can we identify breakpoints and "smooth" each profile?
[Figure: a single profile; x-axis 0 to 1000, y-axis −0.2 to 1.4.]
Can we detect frequent breakpoints?
[Figure: four profiles; x-axis 0 to 2000, y-axis −1 to 1.]
A collection of bladder tumour copy number profiles.
Can we detect discriminative patterns?
[Figure: ten profiles in two columns of five; x-axis 0 to 2500.]
Aggressive (left) vs non-aggressive (right) melanoma.
DNA → RNA → protein
CGH shows the (static) DNA
Cancer cells also have abnormal (dynamic) gene expression (= transcription)
Tissue profiling with DNA chips
Data
Gene expression measures for more than 10,000 genes
Typically measured on fewer than 100 samples of two (or more) different classes (e.g., different tumors)
Can we identify the cancer subtype? (diagnosis)
Can we predict the future evolution? (prognosis)
Summary
[Figure: ten profiles in two columns of five; x-axis 0 to 2500.]
Many problems...
Data are high-dimensional, but "structured"
Classification accuracy is not all, interpretation is necessary (pattern discovery)
A general strategy
$$\min_{\beta} \; R(\beta) + \lambda \, \Omega(\beta)$$
Finding multiple change-points in a single profile
The problem
[Figure: a single profile; x-axis 0 to 1000, y-axis −0.2 to 1.4.]
Let $Y \in \mathbb{R}^p$ be the signal.
We want to find a piecewise constant approximation $\hat{U} \in \mathbb{R}^p$ with at most $k$ change-points.
An optimal solution?
[Figure: a single profile; x-axis 0 to 1000, y-axis −0.2 to 1.4.]
We can define an "optimal" piecewise constant approximation $\hat{U} \in \mathbb{R}^p$ as the solution of
$$\min_{U \in \mathbb{R}^p} \|Y - U\|^2 \quad \text{such that} \quad \sum_{i=1}^{p-1} \mathbf{1}(U_{i+1} \neq U_i) \le k$$
This is an optimization problem over the $\binom{p}{k}$ partitions...
Dynamic programming finds the solution in $O(p^2 k)$ time and $O(p^2)$ memory
But: does not scale to $p = 10^6 \sim 10^9$...
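The dynamic program mentioned on this slide can be sketched in a few lines. This is a minimal illustration, not the implementation behind the quoted timings: the function name `best_segmentation` is hypothetical, and the sketch uses prefix sums to evaluate segment costs, so it needs $O(pk)$ memory rather than the $O(p^2)$ of a precomputed cost table. The running time remains $O(p^2 k)$, which is why the approach does not scale to very long profiles.

```python
from itertools import accumulate


def best_segmentation(y, k):
    """Least-squares piecewise-constant fit of y with at most k change-points.

    Returns (cost, change_points); a change-point i means that y[i-1] and
    y[i] belong to different segments. Hypothetical helper for illustration.
    """
    p = len(y)
    # Prefix sums give the squared error of any segment in O(1).
    s1 = [0.0] + list(accumulate(y))
    s2 = [0.0] + list(accumulate(v * v for v in y))

    def sse(a, b):
        # Squared error of fitting y[a:b] by its mean.
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    inf = float("inf")
    # cost[m][j]: best error for y[:j] split into exactly m + 1 segments.
    cost = [[inf] * (p + 1) for _ in range(k + 1)]
    back = [[0] * (p + 1) for _ in range(k + 1)]
    for j in range(1, p + 1):
        cost[0][j] = sse(0, j)
    for m in range(1, k + 1):
        for j in range(m + 1, p + 1):
            for i in range(m, j):  # i is the start of the last segment
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j], back[m][j] = c, i
    # "At most k": keep the number of change-points with the lowest error.
    m = min(range(k + 1), key=lambda q: cost[q][p])
    best, cps, j = cost[m][p], [], p
    while m > 0:
        j = back[m][j]
        cps.append(j)
        m -= 1
    return best, sorted(cps)


cost, change_points = best_segmentation([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0], 2)
```

The three nested loops over `m`, `j` and `i` are where the $O(p^2 k)$ cost comes from.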
Promoting sparsity with the $\ell_1$ penalty
The $\ell_1$ penalty (Tibshirani, 1996; Chen et al., 1998)
If $R(\beta)$ is convex and "smooth", the solution of
$$\min_{\beta \in \mathbb{R}^p} R(\beta) + \lambda \sum_{i=1}^{p} |\beta_i|$$
is usually sparse.
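To see why the $\ell_1$ penalty produces exact zeros, consider the simplest case, which is an assumption made here for illustration and not a choice stated on the slide: $R(\beta) = \tfrac{1}{2}\|Y - \beta\|^2$. The problem then separates across coordinates and its solution is the soft-thresholding of $Y$. The function name `soft_threshold` is hypothetical.

```python
def soft_threshold(y, lam):
    """Minimizer of 0.5 * sum((y_i - b_i)**2) + lam * sum(|b_i|).

    Each coordinate is shrunk towards zero by lam, and set exactly to
    zero when its magnitude is at most lam.
    """
    out = []
    for v in y:
        if abs(v) <= lam:
            out.append(0.0)
        elif v > 0:
            out.append(v - lam)
        else:
            out.append(v + lam)
    return out


beta = soft_threshold([3.0, -0.2, 0.5, -2.0], lam=1.0)
```

Coordinates whose magnitude is below `lam` vanish, so a larger `lam` gives a sparser solution.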
Promoting piecewise constant profiles
The total variation / variable fusion penalty
If $R(\beta)$ is convex and "smooth", the solution of
$$\min_{\beta \in \mathbb{R}^p} R(\beta) + \lambda \sum_{i=1}^{p-1} |\beta_{i+1} - \beta_i|$$
is usually piecewise constant (Rudin et al., 1992; Land and Friedman, 1996).
Proof:
Change of variable $u_i = \beta_{i+1} - \beta_i$, $u_0 = \beta_1$
We obtain a Lasso problem in $u \in \mathbb{R}^{p-1}$
$u$ sparse means $\beta$ piecewise constant
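The change of variable in the proof can be checked numerically. The sketch below uses hypothetical helper names (`to_jumps`, `from_jumps`) and an arbitrary example profile. It maps $\beta$ to $u$ and back, and shows that the total variation of $\beta$ is the $\ell_1$ norm of the jumps $u_1, \dots, u_{p-1}$, with $u_0 = \beta_1$ left unpenalized.

```python
from itertools import accumulate


def to_jumps(beta):
    """u_0 = beta_1 and u_i = beta_{i+1} - beta_i for i = 1, ..., p - 1."""
    return [beta[0]] + [b - a for a, b in zip(beta, beta[1:])]


def from_jumps(u):
    """Inverse map: beta is the cumulative sum of u."""
    return list(accumulate(u))


beta = [1.0, 1.0, 1.0, 3.0, 3.0, 0.5, 0.5]
u = to_jumps(beta)

# The total variation of beta is the l1 norm of the jumps (u_0 excluded).
total_variation = sum(abs(v) for v in u[1:])

# The nonzero jumps are exactly the change-points of beta.
change_points = [i for i, v in enumerate(u) if i > 0 and v != 0.0]
```

Here `u` has only two nonzero jumps, which correspond to the two places where `beta` changes level.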
TV signal approximator
$$\min_{\beta \in \mathbb{R}^p} \|Y - \beta\|^2 \quad \text{such that} \quad \sum_{i=1}^{p-1} |\beta_{i+1} - \beta_i| \le \mu$$
Adding additional constraints does not change the change-points:
$\sum_{i=1}^{p} |\beta_i| \le \nu$ (Tibshirani et al., 2005; Tibshirani and Wang, 2008)
$\sum_{i=1}^{p} \beta_i^2 \le \nu$ (Mairal et al. 2010)
Solving TV signal approximator
$$\min_{\beta \in \mathbb{R}^p} \|Y - \beta\|^2 \quad \text{such that} \quad \sum_{i=1}^{p-1} |\beta_{i+1} - \beta_i| \le \mu$$
QP with sparse linear constraints in $O(p^2)$ -> 135 min for $p = 10^5$ (Tibshirani and Wang, 2008)
Coordinate descent-like method, $O(p)$? -> 3 s for $p = 10^5$ (Friedman et al., 2007)
For all $\mu$ with the LARS in $O(pK)$ (Harchaoui and Levy-Leduc, 2008)
For all $\mu$ in $O(p \ln p)$ (Hoefling, 2009)
For the first $K$ change-points in $O(p \ln K)$ (Bleakley and V., 2010)
Speed trial: 2 s for $K = 100$, $p = 10^7$
[Figure: "Speed for K=1, 10, 1e2, 1e3, 1e4, 1e5"; x-axis: signal length, 0 to 10 × 10^5; y-axis: seconds, 0 to 1.]
Summary
[Figure: a single profile; x-axis 0 to 1000, y-axis −0.2 to 1.4.]
A fast method for multiple change-point detection
An embedded method that boils down to a dichotomic wrapper method (very different from dynamic programming)
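For readers unfamiliar with dichotomic segmentation, the sketch below shows a generic greedy binary segmentation: at each step it adds the single split that most reduces the squared error. This is an assumption about what "dichotomic wrapper" refers to, offered only as an illustration. It is not the algorithm of Bleakley and Vert, it does not reach the $O(p \ln K)$ cost quoted earlier, and the function name `binary_segmentation` is hypothetical.

```python
from itertools import accumulate


def binary_segmentation(y, k):
    """Greedy top-down segmentation (generic sketch).

    Adds, up to k times, the single split that most reduces the total
    squared error. Returns the sorted list of change-points.
    """
    p = len(y)
    s1 = [0.0] + list(accumulate(y))

    def gain(a, i, b):
        # Error reduction from splitting y[a:b] into y[a:i] and y[i:b].
        left, right = s1[i] - s1[a], s1[b] - s1[i]
        total = left + right
        return (left * left / (i - a) + right * right / (b - i)
                - total * total / (b - a))

    bounds = [0, p]
    for _ in range(k):
        best = None
        for a, b in zip(bounds, bounds[1:]):
            for i in range(a + 1, b):
                g = gain(a, i, b)
                if best is None or g > best[0]:
                    best = (g, i)
        if best is None or best[0] <= 0:
            break  # no split improves the fit
        bounds = sorted(bounds + [best[1]])
    return bounds[1:-1]


change_points = binary_segmentation(
    [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0], 2
)
```

Unlike dynamic programming, each new split is chosen conditionally on the previous ones, so the result is not guaranteed to be the least-squares optimum for a given number of change-points.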