Curve Clustering and Functional Mixed Models: Modeling, Variable Selection and Application to Genomics
Franck Picard (LBBE, Lyon), Madison Giacofci (LJK, Grenoble), Sophie Lambert-Lacroix (TIMC, Grenoble), Guillemette Marot (Univ. Lille 2), Carlos Correa-Shokiche (LBBE, Lyon)
F. Picard (LBBE), JSF - June 2012
Outline
1. Introduction
2. Functional Clustering Model with random effects
3. Estimation and model selection
4. Applications
5. Dimension Reduction for FANOVA
6. Conclusions & Perspectives
Introduction: The Genomic Revolution(s)
- Genomics is the field that investigates biological processes at the scale of genomes.
- It started in the 70s-80s with the development of molecular biology techniques (sequencing, transcript quantification).
- Genomics (and post-genomics) exploded in the 90s-2000s thanks to the miniaturization and industrialization of quantification processes.
- Sequencing the human genome took ~10 years; it can now be done within a week.
Introduction: Towards Population-Based Genomic Studies
- Quantification mainly concerns copy number variations, messenger RNAs, and proteins, mostly using microarrays and mass spectrometry.
- For a long time the task has been to extract signal from noise for a single individual experiment (sometimes with replicates!).
- With decreasing prices, these technologies are now used at the population level: this is the rise of population genomics.
- The statistical tasks remain standard (differential analysis, clustering, discrimination), but the dimensionality of the data is overwhelming.
Introduction: Example with Mass Spectrometry data
- Aim: characterize the content of a mixture of peptides by mass spectrometry.
- One peak corresponds to one peptide (signature).
- Each spectrum contains 15154 ionised peptides defined by an m/z ratio.
- 253 ovarian cancer samples: 91 controls, 162 cases [10].
Figure: MALDI-TOF spectra (control vs. cancer). http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp
Introduction: Example with array CGH data
- Aim: characterize copy number variations between 2 genomes.
- Segments with positive mean correspond to regions that are amplified (negative mean: deleted).
- 55 aCGH profiles from breast cancer patients.
- Subgroup discovery: hierarchical clustering based on segmentation [11].
Figure: Breast Cancer CGH profiles [8] (log scale; "1q16q" group vs. others).
Introduction: Towards Functional Models
- Proteomic data: records are sampled on a very fine grid (m/z), and spectra have long been modeled using FDA.
- Genomic data are mapped on a reference genome and show a spatial (1D) structure.
- Functional models can account for this kind of structure, and working on curves should be more efficient than working on peaks or segments.
Introduction: Towards Functional Mixed Models
- Subject-specific fluctuations are known to be the largest source of variability in mass-spectrometry data [6].
- Inter-individual variability is the "curse" of biological data (technical / biological variabilities), and it is often under-estimated.
- Mixed linear models are well known in genetics to structure the variance according to experimental designs and pedigrees.
- We propose to analyze genomic data using functional mixed models.
Functional Clustering Model with random effects
Functional Clustering Model with random effects: Functional ANOVA Model
We observe N replicates of a noisy version of a function µ over a fine grid t = {t_1, ..., t_M}, t_j ∈ [0, 1], such that:

Y_i(t_m) = µ(t_m) + E_i(t_m),  E_i(t_m) ~ N(0, σ²),

with i = 1, ..., N and m = 1, ..., M = 2^J. In the following we use the notations Y_i(t) = [Y_i(t_1), ..., Y_i(t_M)] and µ(t) = [µ(t_1), ..., µ(t_M)].

We propose to use wavelets to analyse such data:
- modeling of curves with irregularities,
- computational efficiency (the DWT is O(M)),
- dimension reduction.
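As a toy illustration of this sampling model, the following numpy sketch (not the authors' code; the mean function and noise level are made-up values) simulates N noisy replicates of a curve with a jump, observed on a dyadic grid of M = 2^J points:

```python
import numpy as np

rng = np.random.default_rng(0)

J = 7
M = 2 ** J                    # dyadic grid size, M = 2^J
N = 20                        # number of replicated curves
t = np.arange(M) / M          # fine grid on [0, 1)

# Hypothetical mean curve with a jump at t = 0.5, to mimic an irregular signal
mu = np.sin(4 * np.pi * t) + (t > 0.5)
sigma = 0.3
Y = mu + sigma * rng.standard_normal((N, M))   # Y_i(t_m) = mu(t_m) + E_i(t_m)

print(Y.shape)
```

Averaging the N replicates recovers µ up to noise of order σ/√N, which is why replication alone is not enough when individual-specific effects are present.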
Functional Clustering Model with random effects: Definition of wavelets and wavelet coefficients
Wavelets provide an orthonormal basis of L²([0, 1]) with a scaling function φ and a mother wavelet ψ such that:

{φ_{j₀k}(t), k = 0, ..., 2^{j₀} − 1; ψ_{jk}(t), j ≥ j₀, k = 0, ..., 2^j − 1}.

Any function Y_i ∈ L²([0, 1]) is then expressed in the form:

Y_i(t) = Σ_{k=0}^{2^{j₀}−1} c*_{i,j₀k} φ_{j₀k}(t) + Σ_{j≥j₀} Σ_{k=0}^{2^j−1} d*_{i,jk} ψ_{jk}(t),

where c*_{i,j₀k} = ⟨Y_i, φ_{j₀k}⟩ and d*_{i,jk} = ⟨Y_i, ψ_{jk}⟩ are the theoretical scaling and wavelet coefficients.
Functional Clustering Model with random effects: The DWT and empirical wavelet coefficients
Denote by W the [M × M] orthogonal matrix of filters (wavelet specific). The Discrete Wavelet Transform is given by

(c_i, d_i)ᵀ = W Y_i(t),

where the [M × 1] vector (c_i, d_i) contains the empirical scaling and wavelet coefficients. Once the data are in the coefficient domain we retrieve a linear model (with α = [α_{j₀k}]_{k=0,...,2^{j₀}−1} and β = [β_{jk}]_{j=j₀,...,J−1; k=0,...,2^j−1}):

W Y_i(t) = W µ(t) + W E_i(t)
(c_i, d_i)ᵀ = (α, β)ᵀ + ε_i,  ε_i ~ N(0_M, σ²_ε I_M).
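To make the filter matrix W concrete, here is a minimal numpy-only sketch (the `haar_matrix` helper is a hypothetical illustration, not from the talk) that builds the orthonormal Haar transform matrix recursively and checks the orthogonality that makes the Gaussian noise model carry over to the coefficient domain:

```python
import numpy as np

def haar_matrix(M):
    """Orthonormal Haar DWT matrix for M = 2^J (illustrative, O(M^2) build)."""
    if M == 1:
        return np.array([[1.0]])
    H = haar_matrix(M // 2)
    # Coarse part: local averages, recursively transformed at the next level
    avg = np.kron(H, [1.0, 1.0]) / np.sqrt(2)
    # Detail part: local differences at the finest scale
    dif = np.kron(np.eye(M // 2), [1.0, -1.0]) / np.sqrt(2)
    return np.vstack([avg, dif])

M = 8
W = haar_matrix(M)
y = np.arange(M, dtype=float)

coeffs = W @ y   # stacked empirical scaling and wavelet coefficients

# W is orthogonal, so the transform is invertible and norm-preserving,
# and W E_i keeps the N(0, sigma^2 I) distribution of the noise.
assert np.allclose(W.T @ W, np.eye(M))
assert np.allclose(np.linalg.norm(coeffs), np.linalg.norm(y))
```

In practice one would use a fast O(M) pyramid algorithm rather than an explicit matrix, but the explicit W makes the linear-model rewriting on this slide transparent.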
Functional Clustering Model with random effects: Functional Clustering Model (FCM)
- The idea is to cluster individuals based on functional observations.
- We suppose that the cluster structure concerns the fixed effects of the model.
- Using a mixture model, we introduce the label variable ζ_i ~ M(1, π = (π_1, ..., π_L)) such that, given {ζ_{iℓ} = 1}:

Y_i(t_m) = µ_ℓ(t_m) + E_i(t_m).

In the coefficient domain, we retrieve a multivariate Gaussian mixture such that, given {ζ_{iℓ} = 1} [3]:

(c_i, d_i)ᵀ = (α_ℓ, β_ℓ)ᵀ + ε_i.
Functional Clustering Model with random effects: Functional Clustering Mixed Models
Functional mixed models are considered to introduce inter-individual functional variability such that, given {ζ_{iℓ} = 1}:

Y_i(t_m) = µ_ℓ(t_m) + U_i(t_m) + E_i(t_m),
U_i(t) | {ζ_{iℓ} = 1} ~ N(0, K_ℓ(t, t′)),  U_i(t) ⊥ E_i(t).

In the wavelet domain, and given {ζ_{iℓ} = 1}, the model reduces to

(c_i, d_i)ᵀ = (α_ℓ, β_ℓ)ᵀ + (ν_i, θ_i)ᵀ + ε_i,  ε_i ~ N(0_M, σ²_ε I_M),
(ν_i, θ_i)ᵀ ~ N(0_M, diag(G_ν, G_θ)),  (ν_i, θ_i)ᵀ ⊥ ε_i.
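A short simulation sketch of this coefficient-domain model (dimensions, cluster means, and variances are illustrative assumptions, not values from the talk): each individual draws a cluster label, then adds an individual random effect and measurement noise on top of the cluster mean.

```python
import numpy as np

rng = np.random.default_rng(2)

N, M, L = 30, 16, 2
pi = np.array([0.5, 0.5])
# Stacked (alpha_l, beta_l) vectors, one row per cluster (made-up means)
means = np.stack([np.zeros(M), np.linspace(2.0, 0.0, M)])
G = 0.5 * np.eye(M)          # diagonal random-effect covariance diag(G_nu, G_theta)
sigma2_eps = 0.1

labels = rng.choice(L, size=N, p=pi)                   # zeta_i
U = rng.multivariate_normal(np.zeros(M), G, size=N)    # (nu_i, theta_i)
E = np.sqrt(sigma2_eps) * rng.standard_normal((N, M))  # eps_i
coeffs = means[labels] + U + E                         # (c_i, d_i)

print(coeffs.shape)
```

Marginally, each cluster contributes a Gaussian component with covariance G + σ²_ε I, which is exactly the structured-variance mixture the EM algorithm of the next section works with.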
Functional Clustering Model with random effects: Specification of the covariance of random effects
- Suppose G_θ is diagonal, by the whitening property of wavelets [7].
- The fixed and random effects should lie in the same Besov space.
- Introduce a parameter η related to the regularity of the process U_i.

Theorem (Abramovich et al. [1]). Suppose µ(t) ∈ B^s_{p,q} and V(θ_{i,jk}) = 2^{−jη} γ²_θ. Then

U_i(t) ∈ B^s_{p,q}[0, 1] a.s.  ⟺  η = 2s + 1 if 1 ≤ p < ∞ and q = ∞, and η > 2s + 1 otherwise.

The structure of the random effects can also vary with position and scale (γ²_{θ,jk}), and/or with group membership (γ²_{θ,jkℓ}).
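The scale-dependent variance V(θ_jk) = 2^(−jη) γ²_θ can be sketched numerically (η, γ², and the scale range are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)

j0, J = 2, 7
eta, gamma2 = 2.0, 1.0

# Variance of the random wavelet coefficients decays geometrically with scale j
variances = {j: 2.0 ** (-j * eta) * gamma2 for j in range(j0, J)}
theta = {j: rng.normal(0.0, np.sqrt(v), size=2 ** j) for j, v in variances.items()}

for j, v in variances.items():
    print(j, v)
```

Finer scales carry geometrically less random-effect energy; this decay is what keeps the sampled process U_i in the same Besov regularity class as the fixed effect, per the theorem above.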
Estimation and model selection
Estimation and model selection: Using the EM algorithm
- In the coefficient domain, the model is a Gaussian mixture with structured variance.
- Both the label variables ζ and the random effects (ν, θ) are unobserved.
The complete-data log-likelihood can be written as:

log L(c, d, ν, θ, ζ; π, α, β, G, σ²_ε) = log L(c, d | ν, θ, ζ; α, β, σ²_ε) + log L(ν, θ | ζ; G) + log L(ζ; π).

This likelihood is easily computed thanks to the properties of mixed linear models:

(c_i, d_i)ᵀ | (ν_i, θ_i)ᵀ, {ζ_{iℓ} = 1} ~ N((α_ℓ + ν_i, β_ℓ + θ_i)ᵀ, σ²_ε I).
Estimation and model selection: Predictions of hidden variables
The EM algorithm provides posterior probabilities of membership:

τ_{iℓ}^{[h+1]} = π_ℓ^{[h]} f(c_i, d_i; α_ℓ^{[h]}, β_ℓ^{[h]}, G_ℓ^{[h]} + σ²_ε^{[h]} I) / Σ_p π_p^{[h]} f(c_i, d_i; α_p^{[h]}, β_p^{[h]}, G_p^{[h]} + σ²_ε^{[h]} I).

The E-step also provides the BLUP of the random effects:

ν̂_{iℓ}^{[h+1]} = (c_i − α_ℓ^{[h]}) / (1 + λ_ν^{[h]}),  λ_ν = σ²_ε / γ²_ν,
θ̂_{iℓ}^{[h+1]} = (d_i − β_ℓ^{[h]}) / (1 + 2^{jη} λ_θ^{[h]}),  λ_θ = σ²_ε / γ²_θ.
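A one-coefficient numerical sketch of these two E-step formulas (two clusters, a single scalar coefficient per individual, made-up parameter values; the scale factor 2^{jη} is taken equal to 1 for simplicity):

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

d = np.array([0.1, 2.9, 3.2, -0.2])   # one wavelet coefficient per individual
beta = np.array([0.0, 3.0])           # cluster means beta_l
pi = np.array([0.5, 0.5])
sigma2_eps, gamma2_theta = 0.25, 1.0
lam = sigma2_eps / gamma2_theta       # lambda_theta = sigma2_eps / gamma2_theta

# Posterior membership: marginally, d_i has variance gamma2_theta + sigma2_eps
marg_var = gamma2_theta + sigma2_eps
num = pi * normal_pdf(d[:, None], beta[None, :], marg_var)
tau = num / num.sum(axis=1, keepdims=True)

# BLUP: shrink the within-cluster residual d_i - beta_l by 1 / (1 + lambda)
theta_hat = (d[:, None] - beta[None, :]) / (1.0 + lam)

print(np.round(tau, 3))
print(np.round(theta_hat, 3))
```

The shrinkage factor 1/(1 + λ) interpolates between trusting the data (γ²_θ large, λ → 0) and forcing the random effect to zero (γ²_θ small, λ → ∞), the standard ridge-like behaviour of BLUPs.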