A segmentation-clustering problem for the analysis of array CGH data
F. Picard, S. Robin, E. Lebarbier, J-J. Daudin
UMR INA-PG / INRA, Paris
Bio-Info-Math Workshop, Tehran, April 2005
Microarray CGH technology

- Known effects of large chromosomal aberrations (e.g. trisomy). → experimental tool: karyotype (resolution ∼ chromosome).
- Change of scale: what are the effects of small-size DNA sequence deletions/amplifications? → experimental tool: "conventional" CGH (resolution ∼ 10 Mb).
- CGH = Comparative Genomic Hybridization: a method for the comparative measurement of relative DNA copy numbers between two samples (normal/disease, test/reference).
→ Application of microarray technology to CGH: 1997.
→ Latest generation of chips: resolution ∼ 100 kb.
The principle of microarray technology
Interpretation of a CGH profile

[Figure: a CGH profile along a chromosome, showing an amplified segment (log-ratio ≈ 2), a "normal" segment (log-ratio ≈ 0), and a deleted segment (log-ratio ≈ −2).]

A dot on the graph represents
$$\log_2\left(\frac{\#\text{ copies of BAC}(t)\text{ in the test genome}}{\#\text{ copies of BAC}(t)\text{ in the reference genome}}\right)$$
First step of the statistical analysis: break-point detection in a Gaussian signal

- Let $Y = (Y_1, \ldots, Y_n)$ be a random process such that $Y_t \sim \mathcal{N}(\mu_t, \sigma_t^2)$.
- Suppose that the parameters of the distribution of the $Y_t$ are affected by $K-1$ abrupt changes at unknown coordinates $T = (t_1, \ldots, t_{K-1})$.
- These break-points define a partition of the data into $K$ segments of size $n_k$: $I_k = \{t : t \in\, ]t_{k-1}, t_k]\}$, $Y^k = \{Y_t,\ t \in I_k\}$.
- Suppose that these parameters are constant between two changes: $\forall t \in I_k$, $Y_t \sim \mathcal{N}(\mu_k, \sigma_k^2)$.
- The parameters of this model are $T = (t_1, \ldots, t_{K-1})$ and $\Theta = (\theta_1, \ldots, \theta_K)$, with $\theta_k = (\mu_k, \sigma_k^2)$.
- Break-point detection aims at studying the spatial structure of the signal (a simulation sketch follows below).
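As a concrete illustration, here is a minimal simulation sketch of this model (not from the talk; all break-points and parameter values are illustrative): a Gaussian signal whose mean and variance change abruptly at known coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values: K = 4 segments, break-points T = (25, 40, 70)
breaks = [0, 25, 40, 70, 100]      # t_0 = 0 < t_1 < ... < t_{K-1} < t_K = n
mus    = [0.0, 1.5, -1.0, 0.5]     # mu_k : one mean per segment
sigmas = [0.3, 0.5, 0.3, 0.4]      # sigma_k : one std-dev per segment

# Y_t ~ N(mu_k, sigma_k^2) for t in I_k = ]t_{k-1}, t_k]
y = np.concatenate([
    rng.normal(mu, sd, t1 - t0)
    for mu, sd, t0, t1 in zip(mus, sigmas, breaks[:-1], breaks[1:])
])
```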
Estimating the parameters in a model of abrupt-change detection

Log-likelihood:
$$\mathcal{L}_K(T, \Theta) = \sum_{k=1}^{K} \log f(y^k; \theta_k) = \sum_{k=1}^{K} \sum_{t \in I_k} \log f(y_t; \theta_k)$$

Estimating the parameters with $K$ fixed, by maximum likelihood:
- Joint estimation of $T$ and $\Theta$ by dynamic programming.
- Necessary property of the likelihood: additivity in $K$ (sum of local likelihoods computed on each segment).

Model selection: choice of $K$
- Penalized likelihood: $\hat{K} = \mathrm{Argmax}_K \left( \hat{\mathcal{L}}_K - \beta \times \mathrm{pen}(K) \right)$.
- With $\mathrm{pen}(K) = 2K$.
- $\beta$ is adaptively estimated from the data (Lavielle (2003)).

A dynamic-programming sketch follows below.
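Below is an illustrative implementation of the dynamic-programming estimation of $T$, assuming Gaussian segments with their parameters at the MLE; it is a sketch, not the authors' code. The $O(Kn^2)$ recursion relies exactly on the additivity of the likelihood in $K$.

```python
import numpy as np

def segment_loglik(y):
    """Gaussian log-likelihood of one segment, evaluated at the MLE (mu_hat, s2_hat)."""
    n = len(y)
    s2 = max(np.var(y), 1e-8)              # guard against a zero empirical variance
    return -0.5 * n * (np.log(2 * np.pi * s2) + 1.0)

def best_segmentation(y, K):
    """Maximize the sum of local log-likelihoods over all partitions into K segments."""
    n = len(y)
    # cost[i, j] = log-likelihood of candidate segment y[i:j] (at least 2 points)
    cost = np.full((n + 1, n + 1), -np.inf)
    for i in range(n - 1):
        for j in range(i + 2, n + 1):
            cost[i, j] = segment_loglik(y[i:j])
    # L[k, j] = best score for y[:j] cut into k segments; back[k, j] = last break-point
    L = np.full((K + 1, n + 1), -np.inf)
    back = np.zeros((K + 1, n + 1), dtype=int)
    L[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(2 * k, n + 1):
            scores = L[k - 1, :j] + cost[:j, j]
            back[k, j] = int(np.argmax(scores))
            L[k, j] = scores[back[k, j]]
    # Trace the break-points back from the end
    bps, j = [], n
    for k in range(K, 0, -1):
        j = back[k, j]
        bps.append(j)
    return L[K, n], sorted(bps)[1:]        # drop the leading t_0 = 0

# Choice of K by penalized likelihood with pen(K) = 2K (here beta would be fixed;
# Lavielle (2003) estimates it adaptively from the data):
# K_hat = argmax_K ( best_segmentation(y, K)[0] - beta * 2 * K )
```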
Example of segmentation on array CGH data

[Figure: two segmented profiles, log2 ratio vs. genomic position. Left: BT474 chromosome 1, $\hat{K} = 5$. Right: BT474 chromosome 9, $\hat{K} = 4$.]
Considering the biologists' objective and the need for a new model

[Figure: two panels. Left, segmentation: spatial structure of the signal along $t$, with segment-specific parameters $\theta_k = (\mu_k, \sigma_k^2)$. Right, segmentation/clustering: an additional structure on the $y$ values, with population parameters $\theta_p = (m_p, s_p^2)$.]
A new model for segmentation-clustering purposes

- We suppose there exists a secondary, underlying structure of the segments into $P$ populations with weights $\pi_1, \ldots, \pi_P$ ($\sum_p \pi_p = 1$).
- We introduce hidden variables $Z_{kp}$, indicators of the population of origin of segment $k$.
- These variables are supposed independent, with multinomial distribution: $(Z_{k1}, \ldots, Z_{kP}) \sim \mathcal{M}(1; \pi_1, \ldots, \pi_P)$.
- Conditionally on the hidden variables, we know the distribution of $Y$: $Y^k \mid Z_{kp} = 1 \sim \mathcal{N}(\mathbb{1}_{n_k} m_p,\ s_p^2 I_{n_k})$.
- It is a model of segmentation/clustering (see the simulation sketch below).
- The parameters of this model are $T = (t_1, \ldots, t_{K-1})$ and $\Theta = (\pi_1, \ldots, \pi_P; \theta_1, \ldots, \theta_P)$, with $\theta_p = (m_p, s_p^2)$.
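A minimal sketch of simulating from this segmentation/clustering model (illustrative weights, means, and segment sizes): each segment first draws its hidden population label $Z_k$, then its values.

```python
import numpy as np

rng = np.random.default_rng(1)

pi = np.array([0.5, 0.3, 0.2])     # population weights pi_p, sum to 1
m  = np.array([0.0, 1.0, -1.0])    # population means m_p
s  = np.array([0.3, 0.3, 0.3])     # population std-devs s_p
seg_sizes = [20, 15, 30, 10, 25]   # n_k for K = 5 segments

# (Z_k1, ..., Z_kP) ~ M(1; pi_1, ..., pi_P): draw one label per segment
z = rng.choice(len(pi), size=len(seg_sizes), p=pi)

# Y^k | Z_kp = 1 ~ N(1_{n_k} m_p, s_p^2 I_{n_k})
segs = [rng.normal(m[p], s[p], n) for p, n in zip(z, seg_sizes)]
y = np.concatenate(segs)
```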
Likelihood and statistical units of the model

- Mixture model of segments:
  ⋆ the statistical units are the segments $Y^k$;
  ⋆ the density of $Y^k$ is a mixture density:
$$\log \mathcal{L}_{KP}(T, \Theta) = \sum_{k=1}^{K} \log f(y^k; \Theta) = \sum_{k=1}^{K} \log \sum_{p=1}^{P} \pi_p f(y^k; \theta_p)$$
  ⋆ if the $Y_t$ are independent, we have:
$$\log \mathcal{L}_{KP}(T, \Theta) = \sum_{k=1}^{K} \log \sum_{p=1}^{P} \pi_p \prod_{t \in I_k} f(y_t; \theta_p).$$
- Classical mixture model:
  ⋆ the statistical units are the $Y_t$:
$$\log \mathcal{L}_P(\Theta) = \sum_{k=1}^{K} \sum_{t \in I_k} \log \sum_{p=1}^{P} \pi_p f(y_t; \theta_p)$$
A hybrid algorithm for the optimization of the likelihood

Alternate parameter estimation with $K$ and $P$ known:

1. When $T$ is fixed, the EM algorithm estimates $\Theta$:
$$\hat{\Theta}^{(\ell+1)} = \mathrm{Argmax}_{\Theta}\ \log \mathcal{L}_{KP}\left(\Theta, \hat{T}^{(\ell)}\right), \qquad \log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell+1)}; \hat{T}^{(\ell)}) \geq \log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell)}; \hat{T}^{(\ell)}).$$

2. When $\Theta$ is fixed, dynamic programming estimates $T$:
$$\hat{T}^{(\ell+1)} = \mathrm{Argmax}_{T}\ \log \mathcal{L}_{KP}\left(\hat{\Theta}^{(\ell+1)}, T\right), \qquad \log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell+1)}; \hat{T}^{(\ell+1)}) \geq \log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell+1)}; \hat{T}^{(\ell)}).$$

This yields an increasing sequence of likelihoods:
$$\log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell+1)}; \hat{T}^{(\ell+1)}) \geq \log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell)}; \hat{T}^{(\ell)})$$

A skeleton of the alternation follows below.
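The alternation can be sketched as the loop below. The two steps are passed in as functions; `em_estimate` and `dp_segment` are hypothetical names (an EM iteration and the mixture-scored dynamic-programming score are sketched after the next slides).

```python
def hybrid(y, K, P, em_estimate, dp_segment, T0, n_iter=50):
    """Alternate EM (Theta | T fixed) and dynamic programming (T | Theta fixed).

    em_estimate, dp_segment, and T0 (e.g. a uniform initial split) are supplied
    by the caller; the names here are illustrative, not from the talk.
    """
    T, Theta = T0, None
    for _ in range(n_iter):
        Theta = em_estimate(y, T, P)      # step 1: Theta^(l+1) with T fixed
        T_new = dp_segment(y, Theta, K)   # step 2: T^(l+1) with Theta fixed
        if T_new == T:                    # fixed point: likelihood can no longer increase
            break
        T = T_new
    return T, Theta
```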
Mixture model when the segmentation is known

Mixture model parameter estimators:
$$\hat{\tau}_{kp} = \frac{\hat{\pi}_p f(y^k; \hat{\theta}_p)}{\sum_{\ell=1}^{P} \hat{\pi}_\ell f(y^k; \hat{\theta}_\ell)}.$$

- The estimator of the mixing proportions is $\hat{\pi}_p = \frac{\sum_k \hat{\tau}_{kp}}{K}$.
- In the Gaussian case, $\theta_p = (m_p, s_p^2)$:
$$\hat{m}_p = \frac{\sum_k \hat{\tau}_{kp} \sum_{t \in I_k} y_t}{\sum_k \hat{\tau}_{kp} n_k}, \qquad \hat{s}_p^2 = \frac{\sum_k \hat{\tau}_{kp} \sum_{t \in I_k} (y_t - \hat{m}_p)^2}{\sum_k \hat{\tau}_{kp} n_k}.$$
- Large segments have a bigger impact on the estimation of the parameters, via the term $\sum_k \hat{\tau}_{kp} n_k$ (see the EM sketch below).
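A direct transcription of these estimators into one EM iteration (segmentation fixed); a sketch assuming `segs` is the list of segment vectors $y^k$, working in log space for numerical stability.

```python
import numpy as np
from scipy.stats import norm

def em_iteration(segs, pi, m, s2):
    """One EM iteration for the mixture of segments, with T fixed."""
    K, P = len(segs), len(pi)
    # E-step: tau_kp proportional to pi_p * f(y^k; theta_p)
    log_tau = np.empty((K, P))
    for k, yk in enumerate(segs):
        for p in range(P):
            log_tau[k, p] = np.log(pi[p]) + norm.logpdf(yk, m[p], np.sqrt(s2[p])).sum()
    log_tau -= log_tau.max(axis=1, keepdims=True)   # stabilize before exponentiating
    tau = np.exp(log_tau)
    tau /= tau.sum(axis=1, keepdims=True)
    # M-step: pi weighted by tau only; (m, s2) also weighted by the sizes n_k
    n = np.array([len(yk) for yk in segs], dtype=float)
    sums = np.array([yk.sum() for yk in segs])
    pi_new = tau.sum(axis=0) / K
    m_new = (tau * sums[:, None]).sum(axis=0) / (tau * n[:, None]).sum(axis=0)
    s2_new = np.array([
        sum(tau[k, p] * ((segs[k] - m_new[p]) ** 2).sum() for k in range(K))
        / (tau[:, p] * n).sum()
        for p in range(P)
    ])
    return tau, pi_new, m_new, s2_new
```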
Influence of segment size on the assignment (MAP)

- The density of $Y^k$ can be written as follows:
$$f(y^k; \theta_p) = \exp\left\{ -\frac{n_k}{2} \left[ \log(2\pi s_p^2) + \frac{1}{s_p^2} \left( \left(\overline{y^2_k} - \bar{y}_k^2\right) + \left(\bar{y}_k - m_p\right)^2 \right) \right] \right\}$$
  ⋆ $(\bar{y}_k - m_p)^2$: distance of the mean of segment $k$ to population $p$;
  ⋆ $(\overline{y^2_k} - \bar{y}_k^2)$: intra-segment variability.
- Large segments are assigned with certainty to the closest population $p_0$:
$$\lim_{n_k \to \infty} \tau_{kp_0} = 1, \quad \lim_{n_k \to \infty} \tau_{kp} = 0, \qquad \lim_{n_k \to 0} \tau_{kp_0} = \pi_{p_0}, \quad \lim_{n_k \to 0} \tau_{kp} = \pi_p.$$

A numerical illustration follows below.
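A small numerical check of these limits (all values illustrative): the posterior of a segment hardens to 0/1 as $n_k$ grows, and tends toward the prior weights for small $n_k$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
pi, m, s = np.array([0.5, 0.5]), np.array([0.0, 1.0]), 1.0

for n_k in (1, 10, 100, 1000):
    yk = rng.normal(0.3, s, n_k)       # true mean closer to population 0
    logw = np.array([np.log(w) + norm.logpdf(yk, mu, s).sum()
                     for w, mu in zip(pi, m)])
    tau = np.exp(logw - np.logaddexp(logw[0], logw[1]))
    print(n_k, tau)                    # tau -> (1, 0) as n_k grows
```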
Segmentation with a fixed mixture: back to dynamic programming

- The incomplete mixture log-likelihood can be written as a sum of local log-likelihoods:
$$\log \mathcal{L}_{KP}(T, \Theta) = \sum_k \ell_{kP}(y^k; \Theta).$$
- The local log-likelihood of segment $k$ corresponds to the mixture log-density of vector $Y^k$:
$$\ell_{kP}(y^k; \Theta) = \log \sum_{p=1}^{P} \pi_p \prod_{t \in I_k} f(y_t; \theta_p).$$
- $\log \mathcal{L}_{KP}(T, \Theta)$ can thus be optimized in $T$ with $\Theta$ fixed, by dynamic programming (see the sketch below).
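Plugging the mixture into the dynamic programming only changes the local score. A sketch of $\ell_{kP}$, which can replace the per-segment Gaussian log-likelihood in the earlier DP example:

```python
import numpy as np
from scipy.stats import norm

def local_mixture_loglik(yk, pi, m, s2):
    """l_kP(y^k; Theta) = log sum_p pi_p prod_{t in I_k} f(y_t; theta_p), in log space."""
    logs = np.array([
        np.log(pi[p]) + norm.logpdf(yk, m[p], np.sqrt(s2[p])).sum()
        for p in range(len(pi))
    ])
    return np.logaddexp.reduce(logs)
```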
A decreasing log-likelihood?

[Figure: top, a simulated signal; bottom, the incomplete log-likelihood as a function of the number of segments, which eventually decreases.]

Evolution of the incomplete log-likelihood with respect to the number of segments, on data simulated from the mixture $0.5\,\mathcal{N}(0,1) + 0.5\,\mathcal{N}(5,1)$.
What is going on?

[Figure: the same simulated signal segmented with increasing values of $K$.]

When the true number of segments (6) is reached, segments are cut at their edges.
Explaining the behavior of the likelihood

Optimization of the incomplete likelihood with dynamic programming:
$$\log \mathcal{L}_{KP}(T; \Theta) = Q_{KP}(T; \Theta) - H_{KP}(T; \Theta)$$
$$Q_{KP}(T; \Theta) = \sum_k \sum_p \tau_{kp} \log(\pi_p) + \sum_k \sum_p \tau_{kp} \log f(y^k; \theta_p)$$
$$H_{KP}(T; \Theta) = \sum_k \sum_p \tau_{kp} \log \tau_{kp}$$

Hypotheses:
1. We suppose that the true number of segments is $K^*$ and that the partitions are nested for $K \geq K^*$.
   ⋆ Segment $Y^K$ is cut into $(Y^{K_1}, Y^{K_2})$: $f(Y^K; \theta_p) = f(Y^{K_1}; \theta_p) \times f(Y^{K_2}; \theta_p)$.
2. We suppose that if $Y^K \in p$ then $(Y^{K_1}, Y^{K_2}) \in p$: $\tau_p(Y^K) \simeq \tau_p(Y^{K_1}) \simeq \tau_p(Y^{K_2}) \simeq \tau_p$.
An intrinsic penalty

Under hypotheses 1-2:
$$\forall K \geq K^*, \quad \log \hat{\mathcal{L}}_{(K+1),P} - \log \hat{\mathcal{L}}_{(K),P} \simeq \sum_p \hat{\pi}_p \log(\hat{\pi}_p) - \sum_p \hat{\tau}_p \log(\hat{\tau}_p) \leq 0$$

The log-likelihood is decomposed into two terms:
- A term of fit that increases with $K$, and is constant from a certain $K^*$ on (nested partitions):
$$\sum_k \sum_p \hat{\tau}_{kp} \log f(y^k; \hat{\theta}_p).$$
- A term of differences of entropies that decreases with $K$ and plays the role of a penalty for the choice of $K$:
$$K \sum_p \hat{\pi}_p \log(\hat{\pi}_p) - \sum_k \sum_p \hat{\tau}_{kp} \log \hat{\tau}_{kp}.$$

Choosing the number of segments $K$ when $P$ is fixed can thus be done with a penalized likelihood (a sketch of the decomposition follows below).
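A sketch of computing the two terms from the posteriors, assuming `tau` is the $K \times P$ matrix of $\hat{\tau}_{kp}$ and `log_f[k, p]` $= \log f(y^k; \hat{\theta}_p)$. Since $\sum_k \hat{\tau}_{kp} = K \hat{\pi}_p$, the two terms sum to $\log \hat{\mathcal{L}}_{KP}$ once EM has converged.

```python
import numpy as np

def loglik_decomposition(tau, log_f):
    """Fit term and entropy-difference term of the incomplete log-likelihood."""
    K = tau.shape[0]
    pi = tau.sum(axis=0) / K                 # pi_hat_p = sum_k tau_kp / K
    fit = (tau * log_f).sum()                # increases with K, flat beyond K*
    tau_safe = np.clip(tau, 1e-12, None)     # avoid log(0) for hard posteriors
    entropy_diff = (K * (pi * np.log(pi)).sum()
                    - (tau_safe * np.log(tau_safe)).sum())
    return fit, entropy_diff                 # fit + entropy_diff ~ log L_KP_hat
```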
Incomplete likelihood behavior with respect to the number of segments

[Figure: $\log \hat{\mathcal{L}}_{KP}(\hat{T}; \hat{\Theta})$ as a function of $K$, for $P = 2, \ldots, 6$.]

The incomplete log-likelihood decreases from $K = 8$ on:
$$\log \hat{\mathcal{L}}_{KP}(\hat{T}; \hat{\Theta}) = \sum_k \log \left[ \sum_p \hat{\pi}_p f(y^k; \hat{\theta}_p) \right].$$
Decomposition of the log-likelihood

[Figure: the two terms as functions of $K$, for $P = 2, \ldots, 6$. Left, the term of fit: $\sum_k \sum_p \hat{\tau}_{kp} \log f(y^k; \hat{\theta}_p)$. Right, the differences of entropies: $K \sum_p \hat{\pi}_p \log(\hat{\pi}_p) - \sum_k \sum_p \hat{\tau}_{kp} \log \hat{\tau}_{kp}$.]