Does Sequence Similarity Predict Expression Similarity Kui Zhang Section on Statistical Genetics University of Alabama at Birmingham May 28, 2004
Motivation • Microarray technology allows us to monitor the expression of thousands of genes simultaneously • Estimates of individual gene effect sizes generally have very low precision because sample sizes are small • The completion of the Human Genome Project provides another type of information about genes • Ultimate goal: incorporate sequence information to improve the estimation of individual gene effect sizes in microarray data analysis • Does sequence similarity predict expression similarity?
Some Studies of the Correlation Between Expression Data and Sequence Data • Correlation between gene expression and gene location: – Kruglyak and Tang, 2000 – Fukuoka et al., 2004 • Correlation between the co-expression of genes and the presence of common sequence elements in their upstream regions: – Bussemaker et al., 2001 – Ge et al., 2001
Methods • Choose 4 Affymetrix HG-U133 microarray data sets • Define sequence similarity by the pair-wise E-value from a BLAST search • Define expression similarity by the pair-wise correlation coefficient • Investigate the relationship between sequence-similar pairs and expression-similar pairs
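The expression-similarity measure above can be sketched as follows. This is a minimal illustration, assuming each gene's expression profile is a plain list of values across arrays; the function names `pearson`, `ranks`, and `spearman` are hypothetical and not taken from the original analysis.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # Raises ZeroDivisionError if either profile is constant.
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Spearman is computed here as the Pearson correlation of the ranks, which handles ties without a separate formula.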
Affymetrix HG-U133 Microarray • Provides 18,400 transcripts and variants • Represents 22,283 genes, including 14,500 well-characterized genes • Contains more than 22,000 probe sets and 500,000 distinct oligonucleotide features • Has 8,645 genes with consensus sequences
Define Sequence Similarity • Use the 8,645 genes that have consensus sequences in the Affymetrix HG-U133 array • Translate each sequence in 6 reading frames • Run the program tblastx (without gaps) for all translated sequences against themselves • Record the bit score and E-value for similar sequences • Set the cut-off E-value to 10^-5 • Find 7,396 sequences (genes) that have at least one similar sequence other than themselves; only these genes are retained for further analysis
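The filtering step above might look like the sketch below. It assumes the BLAST results have already been parsed into (query, subject, E-value) tuples. Keeping hits with an E-value at or below the cut-off, and keeping the smallest E-value when a pair is reported more than once, are assumptions; the slides do not state either. The function names are hypothetical.

```python
def similar_pairs(hits, cutoff=1e-5):
    """Map each unordered gene pair to its smallest E-value.

    hits: iterable of (query_id, subject_id, e_value) tuples.
    Self-hits and hits above the cut-off are dropped.
    """
    best = {}
    for query, subject, e_value in hits:
        if query == subject or e_value > cutoff:
            continue
        pair = tuple(sorted((query, subject)))
        # Assumption: a pair seen in both directions keeps its smallest E-value.
        if pair not in best or e_value < best[pair]:
            best[pair] = e_value
    return best

def genes_with_partner(pairs):
    """Genes that have at least one similar sequence other than themselves."""
    return {gene for pair in pairs for gene in pair}
```

With this representation, the retained gene set is simply `genes_with_partner(similar_pairs(hits))`.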
Distribution of E-Value [Figure: two histograms. Histogram of Number of Similar Genes: x-axis Number of Similar Genes (0 to 1000), y-axis No. of Genes. Histogram of E-Value: x-axis Natural Log of E-Value (-500 to 0), y-axis No. of Genes.]
Microarray Data Sets • Cancer Study (A): 15 normal oral mucosal samples, 41 squamous cell carcinomas of the head and neck, and 5 adenocarcinomas of the head and neck, published in Cancer Res. 64:55-63 (2004) • Affymetrix HG-U133 Serial Dilution (B): 42 arrays, processed by RMA • Gene Therapy Study (C): 20 arrays, divided into 4 groups, each treated with a different viral vector • Breast Cancer Data (D): 10 breast tumors from women older than 49 years and 9 from women younger than 40 years
Initial Investigation • Calculate the Pearson and Spearman correlation coefficient for each pair of genes • Divide [-1,1] into 20 bins with equal length • Count the number of pairs with correlation coefficient falling in each bin • Count the number of sequence-similar pairs based upon BLAST search results in each bin • Calculate the percentage of sequence-similar pairs in each bin
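The binning steps above can be sketched as follows. This is an illustration only, assuming the pair-wise correlations are held in a dictionary keyed by gene pair and the sequence-similar pairs are held in a set with the same keys. The names are hypothetical, and placing r = 1 in the last bin is an assumption.

```python
def bin_index(r, n_bins=20):
    """Index of the equal-width bin of [-1, 1] that contains r."""
    index = int((r + 1.0) * n_bins / 2.0)
    # r = 1.0 would fall one past the end, so it is placed in the last bin.
    return max(0, min(index, n_bins - 1))

def similar_fraction_per_bin(correlations, similar, n_bins=20):
    """Per bin: (number of pairs, number of sequence-similar pairs, fraction).

    correlations: dict mapping gene pair -> correlation coefficient.
    similar: set of gene pairs judged sequence-similar.
    The fraction is None for an empty bin; multiply by 100 for a percentage.
    """
    total = [0] * n_bins
    hits = [0] * n_bins
    for pair, r in correlations.items():
        b = bin_index(r, n_bins)
        total[b] += 1
        if pair in similar:
            hits[b] += 1
    return [(t, h, h / t if t else None) for t, h in zip(total, hits)]
```

Holding every pair in a dictionary is only practical for small examples; with thousands of genes the pairs would need to be streamed or processed in blocks.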
Distribution of Pearson Correlation Coefficient [Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Bins (-1.0 to 1.0); y-axis: Number of Pairs]
Distribution of Spearman Correlation Coefficient [Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Bins (-1.0 to 1.0); y-axis: Number of Pairs]
Percentage of Sequence-similar Pairs in Each Bin - Pearson Correlation Coefficient [Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Bins (-1.0 to 1.0); y-axis: Percentage of Sequence-similar Pairs]
Percentage of Sequence-similar Pairs in Each Bin - Spearman Correlation Coefficient [Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Bins (-1.0 to 1.0); y-axis: Percentage of Sequence-similar Pairs]
Hierarchical Clustering of Sequence-Similar Pairs • Group the 7,396 genes using hierarchical clustering • Define the distance between each pair of genes as their E-value • Take the distance between two clusters as the geometric average of the pair-wise E-values between sequences in the two clusters • Cut the tree at 37 different distance values
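The clustering rule above can be sketched on the log scale: the natural log of a geometric mean of E-values equals the arithmetic mean of their logs, so the rule corresponds to average linkage on ln(E-value). The code below is a naive illustration, not the original implementation. The value assigned to pairs without a BLAST hit (`default_e`) and the floor applied to E-values of 0 (`floor_e`) are assumptions, since the slides do not say how such pairs were handled.

```python
import math
from itertools import combinations

def cluster_at_cutoff(genes, e_values, log_cutoff,
                      default_e=math.exp(10), floor_e=1e-300):
    """Average-linkage clustering on ln(E-value), cut at log_cutoff.

    e_values: dict mapping sorted gene pairs to E-values.
    default_e (pairs with no hit) and floor_e (E-values of 0) are assumptions.
    """
    def log_dist(a, b):
        e = e_values.get(tuple(sorted((a, b))), default_e)
        return math.log(max(e, floor_e))

    def cluster_dist(c1, c2):
        # Mean of ln(E) over cross pairs = ln of the geometric mean of E.
        total = sum(log_dist(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [[g] for g in genes]
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        (i, j), d = min(
            (((i, j), cluster_dist(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > log_cutoff:
            break  # no remaining pair is within the cut-off
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

Stopping the merges once the smallest inter-cluster distance exceeds the cut-off gives the same clusters as cutting the full average-linkage tree at that height. The loop recomputes every distance on each pass, so it is only suitable for small examples; thousands of genes would call for an optimized library implementation.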
The Distance Used for Cutting Trees
Level    Natural Log of Distance
1        -450
5        -250
10       -80
15       -30
20       -7
25       -2
30       3
35       8
Distribution of Number of Clusters and Number of Genes [Figure: Clustering Process; x-axis: Hierarchical Level (0 to 35); y-axis: Number]
Distribution of Cluster Size [Figure: histogram; x-axis: Cluster Size (0 to 70); y-axis: No. of Clusters]
Methods • Calculate the average correlation coefficient for all possible gene pairs at each hierarchical level • Compute the average correlation coefficient for gene pairs in the same cluster at each hierarchical level • At each hierarchical level, calculate the percentage of gene pairs in the same cluster among all gene pairs with correlation coefficient less than 0.30 • At each hierarchical level, calculate the percentage of gene pairs in the same cluster among all gene pairs with correlation coefficient greater than 0.60
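For a single hierarchical level, the four quantities above might be computed as in the sketch below. It assumes the correlations are stored in a dictionary keyed by gene pair and that each gene carries a cluster label for that level; `cluster_summary` and its result keys are hypothetical names.

```python
def cluster_summary(correlations, cluster_of, low=0.30, high=0.60):
    """Summaries for one hierarchical level.

    correlations: dict mapping (gene_a, gene_b) -> correlation coefficient.
    cluster_of: dict mapping gene -> cluster label at this level.
    Fractions are None when no pair meets the condition.
    """
    all_r, same_r = [], []
    low_all = low_same = high_all = high_same = 0
    for (a, b), r in correlations.items():
        same = cluster_of[a] == cluster_of[b]
        all_r.append(r)
        if same:
            same_r.append(r)
        if r < low:
            low_all += 1
            low_same += same
        if r > high:
            high_all += 1
            high_same += same

    def mean(values):
        return sum(values) / len(values) if values else None

    return {
        "mean_all": mean(all_r),
        "mean_same_cluster": mean(same_r),
        "frac_same_given_low": low_same / low_all if low_all else None,
        "frac_same_given_high": high_same / high_all if high_all else None,
    }
```

Calling this once per cut level, with the cluster labels for that level, would give one point per curve in a plot against hierarchical level.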
Average Pearson Correlation Coefficient in Same Cluster [Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Hierarchical Level (0 to 35); y-axis: Average Correlation Coefficient]