Mining and Pattern Analysis in Large Data Sets for Biological Information. David W. Mount Arizona Cancer Center • Analysis of gene expression microarray data sets with goal of preventing or curing cancer – Statistical analysis of data – Using biological information to interpret data • Future types of genetic analyses
My major objectives. • Develop hypotheses based on data analysis that can be tested in the laboratory or clinic • Use and develop new methods for data analysis: pattern analysis, clustering, data mining, biological models • Focus 1: early changes in colorectal and prostate cancer • Focus 2: drugs for pancreatic cancer • Major goal: to discover the unusual based on statistical and biological data analyses
Using data from two types of microarrays for measuring gene expression of ~35,000 human genes. [Figure: the two platforms side by side. Spotted arrays: a cDNA, EST, or oligo collection on the slide; control sample mRNA and test sample mRNA are converted to Cy3- and Cy5-labeled cDNAs, mixed, and hybridized to one slide; the readout is Cy5/Cy3 for each gene. Affymetrix arrays: oligo1, oligo2, oligo3, ... for each gene, made by synthesis on the slide as matched and mismatched oligos; control sample mRNA is biotin labeled and hybridized to one slide and test sample mRNA to another slide; the readout is slide1/slide2 for each gene.]
Using gene expression microarrays for predicting genetic variation in tissues. Michigan Prostate Study. [Figure: expression heat map. Green - down or Red - up, <2-4 fold. Tissue classes: NAP, normal adjacent; BPH, benign hyperplasia; PCA, localized; MET, metastatic. Underexpressed genes: predicting lost functions. Overexpressed genes: predicting new metabolism.]
Use data to find • An unusual gene product or gene expression value that indicates a good drug target • An early change that can help with early detection/diagnosis
Microarrays provide new drug targets - 1. Over-expressed genes in metastatic tissue: what genes, what pathways, what functions, where in the cell? Cancer cells need these additional proteins to support their abnormal metabolism. [Diagram: normal cell with A; cancer cell with AAA; inhibitor of A product.]
Microarrays provide new drug targets - 2. Cancer cells lose many gene functions by mutation (A-). They need backup functions to survive (B+). Target these backup functions. Geneticists call these overlapping gene functions synthetic lethals (A. Kamb). [Diagram: normal cell A+B+; cancer cell A-B+; inhibitor of B product.]
Careful Experimental Design and Statistical Analysis are Extremely Important. 1. Plan the experiment so as to identify sources of variation 2. Include biological replication 3. Perform data quality analysis 4. Find genes that are varying significantly, using the data model from step 1 5. Mine this gene list for biological information. Complications: genetic variability from person to person, cancer stage, and the fact that tissues are cell mixtures
Analysis of biological data with a variable genetic component is not new!
We are using R statistical computing/Bioconductor for data analysis combined with Perl/Bioperl for biological mining.
R has tools for looking at data quality, etc. Background varies from slide to slide. [Figure: background images of a bad spotted array and a good Affymetrix array.]
Antibody used for immunochemical stain reveals which cells are producing a protein (cytokeratin) Labeled cells Unlabeled cells
Example of Pancreatic Cancer • 1/200 people get pancreatic cancer; 1/4 if they have pancreatitis • It is a very painful and debilitating disease • Death usually occurs within 1-2 years of discovery • Few drugs are available - gemcitabine is hopeful but only helps a small percentage of people • There is very little currently being spent on research into pancreatic cancer compared to other cancers • I will describe early results: 4 cancer tissues vs. one normal tissue on Agilent spotted arrays (24K genes).
Boxplots of normalized data from the 4 tissues indicate that between-slide comparisons should be valid. The boxplots illustrate that the distribution of M values in each sample is similar. Bars are the 25% and 75% levels.
Normalization within arrays corrects for labeling and label detection variation. Red - tumor, Blue - normal. A = log to the base 2 of the geometric mean of the R and G values (square root of their product). M = log of the R to G ratio to the base 2. [Figure panels: MA plot with no normalization; MA plot with Loess normalization.] MA plot from the first cancer tissue sample vs. control. Each point is one of approx. 24,000 genes. The crowd of spots in the lower part of the graph, two of which are labeled R25, are the positive control with a deliberately reduced R/G ratio; two negative controls, which should not change, are on the center left near 0; and two values of VegF, of interest to project 1, and Fos, the most significantly over-expressed gene in these tissues, are also shown. Normalization restores M of most genes to approx. 0.
Top 100 genes that are statistically best supported are mostly down regulated. Red - tumor, Blue - normal. A1 = log to the base 2 of the geometric mean of the R and G values (square root of their product). M1 = log of the R to G ratio to the base 2.
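The M and A definitions above can be sketched in a few lines of Python. This is only an illustration, not the analysis used for these slides (that was done in R/Bioconductor). The function names and the example intensities are hypothetical, and the median-centering step is a deliberately simple stand-in for loess normalization, which fits an intensity-dependent curve rather than a single offset.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its positive red and green intensities.

    M = log2(R/G) is the log ratio.
    A = log2(sqrt(R*G)) is the average log intensity.
    """
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a


def median_center(m_values):
    """Shift M values so that their median is 0.

    Assumption: a simplified substitute for loess normalization.
    """
    ordered = sorted(m_values)
    n = len(ordered)
    mid = n // 2
    if n % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    return [m - median for m in m_values]


# Hypothetical intensities for three spots (not data from the study).
spots = [(1600.0, 400.0), (500.0, 500.0), (200.0, 800.0)]
ma = [ma_values(r, g) for r, g in spots]
centered = median_center([m for m, _ in ma])
```

A spot with R four times G gets M = 2, and a spot with equal R and G gets M = 0, which is where normalization is expected to place most genes.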
Volcano plot of fold change (x axis) against log odds that a gene is differentially expressed (y axis) for the 100 most significantly varying genes. Log odds of 5 means that the odds in favor of differential expression are e^5 = approx. 148 to 1, i.e. the odds that these genes are NOT varying significantly from M=0 are e^-5 = approx. 1/148. This gives a measure of how likely each call is to be a false discovery. This plot also shows that the most significantly varying genes in the pancreatic cancer tissues are down regulated, which probably means they are not functional. Some down regulated genes are also tumor suppressor genes and thus are candidates for project 2 drug screens in the Pancreatic PPG.
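The log-odds arithmetic can be checked with a short Python sketch. The function names are hypothetical; the only assumption is that the log odds are natural logarithms, as the use of e on this slide implies.

```python
import math


def log_odds_to_odds(b):
    """Convert a natural-log odds score B into odds in favor (odds : 1)."""
    return math.exp(b)


def log_odds_to_probability(b):
    """Convert a natural-log odds score B into a probability."""
    odds = math.exp(b)
    return odds / (1.0 + odds)


odds_b5 = log_odds_to_odds(5)          # about 148 to 1 in favor
prob_b5 = log_odds_to_probability(5)   # a little over 0.99
prob_b0 = log_odds_to_probability(0)   # even odds, probability 0.5
```

Odds of 148 to 1 correspond to a probability of 148/149, so the chance of no differential expression is slightly less than 1/148; at this scale the difference is negligible.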
Example of genes varying significantly between 4 pancreatic cancer tissues and a normal pancreatic tissue sample. TGen data, Agilent arrays.

Columns: M = log2(R/G); A = log2 of the square root of RG; t; p corr. for FDR; B.

Gb_accession  GeneName     Description                    M      A      t       p corr. for FDR  B
BC004490      FOS          V-fos transcr. factor           3.6   10.3    36.9   0.00090           7.41
NM_033194     NM_033194.1  Heat shock pr B9               -1.7    8.2   -30.0   0.00090           6.64
Y12661        VGF          VGF nerve growth factor        -2.5   13.7   -28.3   0.00090           6.41
AF488739      GABABL       family G protein coupled rec.  -2.0   10.2   -26.1   0.00090           6.06
……
NM_015711     GLTSCR1      Glioma tumor suppressor        -1.0   10.3   -15.3   0.00188           3.51
……
BC000311      COPEB        Kruppel-like transcr. factor    1.6   10.8    13.8   0.00250           2.99
NM_006999     POLS         DNA Poly. sigma                 0.8    9.0    13.8   0.00250           2.98
……
NM_001530     HIF1A        Hypoxia-ind factor 1 α          1.4    7.4     7.6   0.00865          -0.19
NM_001530     HIF1A        Hypoxia-ind factor 1 α          1.6    7.3     7.6   0.00867          -0.20
NM_001530     HIF1A        Hypoxia-ind factor 1 α          1.5    7.3     7.6   0.00872          -0.21
NM_001530     HIF1A        Hypoxia-ind factor 1 α          1.6    7.3     7.5   0.00888          -0.26

p corr. for FDR = p-value adjusted for false discovery rate (Benjamini and Hochberg) for multiple hypothesis testing. FDR is the expected percent of false predictions in a set of predictions, in this case the percent of genes that are incorrectly reported to change.
B = log-odds that the gene is differentially expressed, e.g. if B=1.5, odds are e^1.5 = 4.48, i.e. the odds of a correct prediction are 4.48/1. For B=0, odds = 1/1.
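The Benjamini and Hochberg adjustment named above can be sketched in plain Python. This is a minimal illustration of the standard step-up procedure, not the code used to produce the table; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Each raw p-value is multiplied by n / rank (rank 1 = smallest p), and
    a running minimum taken from the largest p-value downward keeps the
    adjusted values monotone and no greater than 1.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes (not data from the study).
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

Genes whose adjusted p-value falls below a chosen threshold form a list in which the expected fraction of false calls is at most that threshold.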
What do you do with a list of genes? • Influence on known metabolic and regulatory pathways (usually ~1/4 of genes) • Gene Ontology (GO) terms • Protein-protein and gene-gene interactions • Where located - genome amplification, rearrangements? • Agreement with models - biological and computational
Local genome databases are maintained at AZCC • Local databases of human, rat, mouse, and model organisms • Direct links to genetic, proteomic, and regulatory/pathway databases • Information on protein-protein and gene-gene interactions • http://www.biorag.org is the public access Web site
Pathway Miner • http://www.biorag.org/pathway.html • Pandey et al. 2004 Bioinformatics. 20:2156-8 • Builds genetic network displays based on regulatory and metabolic relationships • Produces lists of genes in Excel format
Genetic network analysis of pancreatic data with Pathway Miner - top 800 pancreatic genes - GenMAPP pathways. A Java interactive display that can be filtered in many ways. Click on gene names to retrieve all relevant information and on edges to view the pathways in common. Any list of genes can be uploaded for analysis.
Five genes in the top 800 are in MAPK, including FOS
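The kind of lookup behind a statement like this can be sketched as a set intersection between a gene list and pathway membership lists. This is not how Pathway Miner is implemented; the function name and all gene identifiers other than FOS are hypothetical placeholders, and the pathway contents are invented for illustration.

```python
def pathway_hits(gene_list, pathways):
    """Map each pathway name to the sorted genes it shares with gene_list.

    Pathways with no shared genes are left out of the result.
    """
    genes = set(gene_list)
    hits = {}
    for name, members in pathways.items():
        shared = sorted(genes & set(members))
        if shared:
            hits[name] = shared
    return hits


# Hypothetical inputs: only FOS and the pathway names appear in the slides.
top_genes = ["FOS", "GENE_A", "GENE_B", "GENE_C"]
pathways = {
    "MAPK": ["FOS", "GENE_B", "GENE_X"],
    "Wnt": ["GENE_Y", "GENE_Z"],
}
hits = pathway_hits(top_genes, pathways)
```

With real inputs, top_genes would be the 800 most significant genes and pathways would come from a pathway database such as GenMAPP.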
New target and drug strategy used in pancreatic cancer project. • Identify under-expressed tumor suppressor (TS) genes in pancreatic cancer tissues • Make isogenic pancreatic cell lines with combinations of these genes • Screen for differential sensitivity to a large NCI collection of drugs and chemicals/siRNA knockdowns [Diagram: cancer cell TS+ (A+) vs. cancer cell TS- (A-); find drug or siRNA specific for TS- cells.]
Pathway Miner used for siRNA analysis, TGen data. DPC4+/- cell lines. Gene knockdowns showing largest effects. Red - greater killing of DPC4- cells, Green - greater killing of DPC4+ cells. Conclusions: - may assist in choice of drug targets - knockdown of genes of closely related function can have quite opposite effects.
siRNA hits on Wnt pathway.
siRNA effects on nuclear receptors.