Biweight Correlation as a Measure of Distance between Genes on a Microarray
Aya Mitani, Pitzer College ’06
Advisor: Professor Johanna Hardin, Pomona College
April 29, 2006
About microarrays
• Small chip
• Contains thousands of probes
• Measures mRNA activity in a particular cell type
• Contains control and treatment samples
• Expression level is measured from light intensity
Problem with microarrays
• Noisy data
• Needs robust estimation of correlation
• Pearson correlation is often used
  - One outlier can greatly affect the correlation
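The effect of a single outlier on Pearson correlation can be seen in a small sketch. The data below are made up for illustration and are not from the microarray in this talk; `pearson` is a hypothetical helper written out by hand.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Six made-up points lying exactly on a decreasing line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
r_clean = pearson(x, y)                      # essentially -1

# Add one far-away point: the sign of the correlation flips.
r_outlier = pearson(x + [30.0], y + [30.0])  # strongly positive
```

One point out of seven is enough to turn a perfect negative correlation into a strongly positive one, which is the motivation for a robust alternative.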
Last summer
M-estimation: a weighted average in which points farther from the center are given less weight

\[ d_i = \sqrt{(x_i - \tilde{\mu})' \, \tilde{\Sigma}^{-1} (x_i - \tilde{\mu})} \tag{1} \]
\[ \tilde{\mu} = \frac{\sum_i w(d_i)\, x_i}{\sum_i w(d_i)} \tag{2} \]
\[ \tilde{\Sigma} = \frac{\sum_i w(d_i)\,(x_i - \tilde{\mu})(x_i - \tilde{\mu})'}{\sum_i w(d_i)} \tag{3} \]

Tukey’s biweight
\[ w(d_i) = \begin{cases} \left(1 - \left(\dfrac{d_i}{c}\right)^2\right)^2 & d_i \le c \\ 0 & d_i > c \end{cases} \]

Use the Minimum Covariance Determinant (MCD) for the initial estimates of $\mu$ and $\Sigma$
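The iteration defined by the distance, weighted-mean, and weighted-covariance equations can be sketched for a single pair of genes as follows. This is a minimal illustration, not the code used for the talk: the cutoff `c=5.0`, the tolerance, the made-up data, and the hard-coded starting values (standing in for the MCD estimates) are all assumptions.

```python
import math

def biweight_weight(d, c):
    """Tukey's biweight weight: (1 - (d/c)^2)^2 if d <= c, else 0."""
    if d > c:
        return 0.0
    return (1.0 - (d / c) ** 2) ** 2

def mahalanobis(p, mu, cov):
    """Distance d_i for a 2-D point; cov is the triple (s_xx, s_xy, s_yy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    return math.sqrt((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)

def m_step(points, mu, cov, c):
    """One pass: distances, weights, then weighted mean and covariance."""
    w = [biweight_weight(mahalanobis(p, mu, cov), c) for p in points]
    total = sum(w)
    mx = sum(wi * p[0] for wi, p in zip(w, points)) / total
    my = sum(wi * p[1] for wi, p in zip(w, points)) / total
    sxx = sum(wi * (p[0] - mx) ** 2 for wi, p in zip(w, points)) / total
    sxy = sum(wi * (p[0] - mx) * (p[1] - my) for wi, p in zip(w, points)) / total
    syy = sum(wi * (p[1] - my) ** 2 for wi, p in zip(w, points)) / total
    return (mx, my), (sxx, sxy, syy)

def biweight_estimate(points, mu0, cov0, c=5.0, max_steps=50, tol=1e-8):
    """Repeat m_step until the estimates stop changing or max_steps is hit."""
    mu, cov = tuple(mu0), tuple(cov0)
    steps = 0
    for steps in range(1, max_steps + 1):
        new_mu, new_cov = m_step(points, mu, cov, c)
        change = max(abs(a - b) for a, b in zip(new_mu + new_cov, mu + cov))
        mu, cov = new_mu, new_cov
        if change < tol:
            break
    return mu, cov, steps

# Made-up data: 11 points close to the line y = x, plus one outlier.
clean = [(float(i), i + (0.5 if i % 2 == 0 else -0.5)) for i in range(-5, 6)]
points = clean + [(6.0, -6.0)]

# Hypothetical starting values (the slides use MCD at this point).
mu, cov, steps = biweight_estimate(points, mu0=(0.5, -0.5), cov0=(9.0, 0.0, 9.0))
bwc = cov[1] / math.sqrt(cov[0] * cov[2])  # robust correlation of the pair
```

Setting `max_steps` to a small number gives a few-step version of the same estimator, trading some robustness for speed.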
Plot of Biweight weight function (w)
[Figure: weight (y-axis, 0.0 to 1.0) against distance (x-axis, 0 to 4)]
Biweight Correlation Coefficient

\[ \mathrm{bwc}_{jk} = \frac{\sigma_{jk}}{\sqrt{\sigma_{jj}\,\sigma_{kk}}} \]

where $\sigma_{jk}$ is the biweight estimate of the covariance of gene j and gene k, and $\sigma_{jj}$ is the biweight estimate of the variance of gene j

We want to measure the correlation (similarity or difference) between two genes
[Figure: scatter plot of Pearson correlation (y-axis) against biweight correlation (x-axis)]
[Figure: two scatter plots of gene pairs; axis labels: Gene 86, Gene 11, Gene 14, Gene 26]
Further work to be done
• Computational time
• Biweight correlation on clean data
This Spring
• Matrix correlation vs pair-by-pair correlation
• One-step M-estimation
• Median vs MCD
• Is biweight correlation good for clean data?
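Instead of computing pair-by-pair correlations, compute the whole correlation matrix from the biweight covariance matrix at once

\[ d_i = \sqrt{(x_i - \tilde{\mu})' \, \tilde{\Sigma}^{-1} (x_i - \tilde{\mu})} \tag{4} \]
\[ \tilde{\mu} = \frac{\sum_i w(d_i)\, x_i}{\sum_i w(d_i)} \tag{5} \]
\[ \tilde{\Sigma} = \frac{\sum_i w(d_i)\,(x_i - \tilde{\mu})(x_i - \tilde{\mu})'}{\sum_i w(d_i)} \tag{6} \]

\[
\begin{pmatrix}
\text{mat.bwc}_{11} & \cdots & \text{mat.bwc}_{1n} \\
\text{mat.bwc}_{21} & \cdots & \text{mat.bwc}_{2n} \\
\vdots & \ddots & \vdots \\
\text{mat.bwc}_{n1} & \cdots & \text{mat.bwc}_{nn}
\end{pmatrix}
=
\begin{pmatrix}
\sqrt{\sigma_{11}} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \sqrt{\sigma_{nn}}
\end{pmatrix}^{-1}
\begin{pmatrix}
\sigma_{11} & \cdots & \sigma_{1n} \\
\sigma_{21} & \cdots & \sigma_{2n} \\
\vdots & \ddots & \vdots \\
\sigma_{n1} & \cdots & \sigma_{nn}
\end{pmatrix}
\begin{pmatrix}
\sqrt{\sigma_{11}} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \sqrt{\sigma_{nn}}
\end{pmatrix}^{-1}
\]

Is $\text{mat.bwc}_{jk} = \mathrm{bwc}_{jk}$?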
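The rescaling from a covariance matrix to a correlation matrix can be sketched in a few lines. The function name `covariance_to_correlation` and the 3 x 3 matrix are made up for illustration; in practice the input would be the biweight covariance estimate.

```python
import math

def covariance_to_correlation(cov):
    """Divide entry (j, k) by sqrt(cov[j][j] * cov[k][k])."""
    n = len(cov)
    scale = [math.sqrt(cov[j][j]) for j in range(n)]
    return [[cov[j][k] / (scale[j] * scale[k]) for k in range(n)]
            for j in range(n)]

# Made-up symmetric covariance matrix for three genes.
cov = [[4.0, 2.0, 0.0],
       [2.0, 9.0, -3.0],
       [0.0, -3.0, 16.0]]
corr = covariance_to_correlation(cov)
```

The rescaling itself is the same entry by entry as the pairwise formula. Any difference between the matrix and pair-by-pair results would come from the covariance estimates, since the weights in the matrix version depend on distances computed over all genes at once rather than over two genes at a time.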
[Figure: 10 genes; pair-by-pair correlation (y-axis) against matrix correlation (x-axis)]
One-step M-estimation
[Figure: 20 genes; converged estimate (y-axis) against one-step estimate (x-axis)]
Converged M-estimation took 10-25 iterations on average (11 seconds to compute 190 pairs of genes)
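As a quick check on the pair count, 190 is the number of distinct pairs among 20 genes:

\[ \binom{20}{2} = \frac{20 \cdot 19}{2} = 190 \]

so 11 seconds corresponds to roughly 0.06 seconds per pair.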
Few-step
[Figure: 20 genes, two panels; converged estimate (y-axis) against 3-step estimate (x-axis, 3.5 seconds) and against 5-step estimate (x-axis, 5.5 seconds)]
[Figure: scatter plot of Gene 18 (y-axis) against Gene 11 (x-axis)]
10-step
[Figure: 20 genes; converged estimate (y-axis) against 10-step estimate (x-axis), 8 seconds]
Median instead of MCD
• Median for $\tilde{\mu}$
• Median absolute deviation (MAD) for $\tilde{\Sigma}$

\[ \mathrm{MAD}(X) = \operatorname{median}_i \left| x_i - \operatorname{median}_j(x_j) \right| \]

If converged → no difference
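A median/MAD starting point for one pair of genes might look like the sketch below. The slide does not say how the MAD values are arranged into a starting $\tilde{\Sigma}$; putting the squared MADs on the diagonal with zero covariance is an assumption made here, and the function names and data are made up.

```python
from statistics import median

def mad(values):
    """Median absolute deviation: median of |x_i - median(x)|."""
    center = median(values)
    return median(abs(v - center) for v in values)

def median_mad_start(points):
    """Starting (mu, cov) for 2-D points; cov is (s_xx, s_xy, s_yy).

    Assumption: squared MADs on the diagonal, zero off-diagonal.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mu0 = (median(xs), median(ys))
    cov0 = (mad(xs) ** 2, 0.0, mad(ys) ** 2)
    return mu0, cov0

# Made-up data: four well-behaved points and one gross outlier.
points = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0), (100.0, -500.0)]
mu0, cov0 = median_mad_start(points)
```

Both the center and the spread stay close to the four well-behaved points despite the outlier, and the computation needs only sorting, which is consistent with the shorter timings reported for the median start.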
[Figure: 20 genes; MCD converged (y-axis) against median converged (x-axis), 7 seconds]
Few-step median
[Figure: 20 genes, two panels; MCD converged (y-axis) against median 3-step (x-axis, 1.5 seconds) and against median 5-step (x-axis, 2.5 seconds)]
[Figure: scatter plot of Gene 7 (y-axis) against Gene 17 (x-axis)]
10-step median
[Figure: 20 genes; MCD converged (y-axis) against median 10-step (x-axis), 5 seconds]
10-step median and 5-step MCD
[Figure: 20 genes, two panels; left, "10-step median": MCD converged (y-axis) against median 10-step (x-axis), 5 seconds; right, "5-step MCD": converged (y-axis) against 5-step (x-axis), 5.5 seconds]
Biweight correlation on clean data
How biased/variable is it compared to Pearson correlation?
[Figure: two panels, "Pearson correlation" and "Biweight correlation", each on a vertical scale from 0.5 to 1.0; values printed with the figure: 0.7636 0.8482 0.7850 0.7523 0.8541 0.7945]
What makes the difference?
[Figure: multivariate normal data; Pearson correlation (y-axis) against biweight correlation (x-axis)]
[Figure: four scatter plots of pairs of rows, titled bw−pearson=0.1166, bw−pearson=0.1108, bw−pearson=0.0523, and bw−pearson=0.0003; axis labels: row11, row16, row2, row2 (top panels) and row17, row15, row6, row5 (bottom panels)]
Concluding remarks
• Biweight correlation is unbiased and about as variable as Pearson correlation
• Initializing $\tilde{\mu}$ and $\tilde{\Sigma}$ with the median and median absolute deviation is as robust as using the MCD estimators
• Initializing $\tilde{\mu}$ and $\tilde{\Sigma}$ with the median and median absolute deviation is faster than using the MCD estimators
• Depending on how robust the result needs to be, computational time can be shortened by limiting the number of iterations
  - Generally, 5 or more iterations are recommended