SLIDE 1

Mining and Pattern Analysis in Large Data Sets for Biological Information.

David W. Mount, Arizona Cancer Center

Analysis of gene expression microarray data sets with goal of preventing or curing cancer
– Statistical analysis of data
– Using biological information to interpret data

Future types of genetic analyses

SLIDE 2

My major objectives.

Develop hypotheses based on data analysis that can be tested in the laboratory or clinic
Use and develop new methods for data analysis - pattern analysis, clustering, data mining, biological models
Focus 1: early changes in colorectal and prostate cancer
Focus 2: drugs for pancreatic cancer
Major goal: to discover the unusual based on statistical and biological data analyses

SLIDE 3

Using data from two types of microarrays for measuring gene expression of ~35,000 human genes.

[Diagram comparing the two array types]

Spotted arrays: cDNA, EST collection; control sample mRNA and test sample mRNA converted to Cy3 labeled cDNA and Cy5 labeled cDNA; mix hybridized to one slide; Cy5/Cy3 for each gene

Affymetrix arrays: matched oligos and mismatched oligos (oligo 1, oligo 2, oligo 3, … for each gene); synthesis on slide; control sample mRNA to one slide, test sample mRNA to another slide; biotin labeled cDNAs hybridized to oligos; slide1/slide2 for each gene

SLIDE 4

Using gene expression microarrays for predicting genetic variation in tissues.

Michigan Prostate Study

Green - down or Red - up, <2-4 fold
NAP: normal adjacent
MET: metastatic
PCA: localized
BPH: benign hyperplasia

Underexpressed predicting lost functions
Overexpressed predicting new metabolism

SLIDE 5

Use data to find

An unusual gene product or gene expression value that indicates a good drug target
An early change that can help with early detection/diagnosis

SLIDE 6

Microarrays provide new drug targets - 1

Over-expressed genes in metastatic tissue. What genes, what pathways, what functions, where in cell?

Cancer cells need these additional proteins to support their abnormal metabolism.

[Diagram: Normal cell (A), Cancer cell (AAA), Inhibitor of A product]

SLIDE 7

Microarrays provide new drug targets - 2

Cancer cells lose many gene functions by mutation (A-). They need backup functions to survive (B+). Target these backup functions. Geneticists call these overlapping gene functions synthetic lethals (A. Kamb).

[Diagram: Normal cell (A+B+), Cancer cell (A-B+), Inhibitor of B product]

SLIDE 8

Careful Experimental Design and Statistical Analysis are Extremely Important

1. Plan experiment so as to identify sources of variation
2. Include biological replication
3. Perform data quality analysis
4. Find genes that are varying significantly using data model in 1.
5. Mine this gene list for biological information

Complications: genetic variability person to person, cancer stage, tissues are cell mixtures

SLIDE 9

Analysis of biological data with a variable genetic component is not new!

SLIDE 10

We are using R statistical computing/BioConductor for data analysis combined with Perl/Bioperl for biological mining.

SLIDE 11

R has tools for looking at data quality, etc.

Background varies slide to slide
[Images: bad spotted array, good Affy array]

SLIDE 12

Antibody used for immunochemical stain reveals which cells are producing a protein (cytokeratin)

[Image labels: labeled cells, unlabeled cells]

SLIDE 13

Example of Pancreatic Cancer

1/200 people get pancreatic cancer; 1/4 if they have pancreatitis
It is a very painful and debilitating disease
Death usually within 1-2 years of discovery
Few drugs available - gemcitabine hopeful but only helps small percentage of people
There is very little currently being spent on research into pancreatic cancer compared to other cancers
I will describe early results: 4 cancer tissues vs one normal tissue on Agilent spotted arrays (24K genes).

SLIDE 14

Boxplots of normalized data of 4 tissues reveal that between-slide comparisons should be valid.

Boxplots illustrate that the distribution of M values in each sample is similar. Bars are 25% and 75% levels.

SLIDE 15

Normalization within arrays corrects for labeling and label detection variation.

[Panels: MA plot with no normalization; MA plot with Loess normalization]

Red - tumor
Blue - normal
A = average of the log2 R and log2 G values (log2 of the square root of their product)
M = log of R to G ratio to the base 2.

MA plot from the first cancer tissue sample vs. control. Each point is one of approx. 24,000 genes. The crowd of spots in the lower part of the graph, two of which are labeled R25, are the +ve control with a deliberately reduced R/G ratio; two -ve controls, which should not change, are on the center left near 0; and two values of VegF, of interest to project 1, and Fos, the most significantly over-expressed gene in these tissues, are also shown. Normalization restores M of most genes to approx. 0.
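The M and A definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used for the slides: the function name `ma_values` and the intensities are hypothetical, and no background correction or Loess normalization is attempted.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    # M: log2 of the R to G ratio (0 means no change between channels).
    m = math.log2(red / green)
    # A: mean of the log2 intensities, i.e. log2 of sqrt(R * G).
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a

# Hypothetical intensities for a single spot (not taken from the slides).
m, a = ma_values(1024.0, 256.0)
print(m, a)
```

A spot with equal red and green intensities gives M = 0, which is where normalization is expected to place most genes.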

SLIDE 16

Top 100 genes that are statistically best supported are mostly down regulated.

Red - tumor
Blue - normal
A1 = average of the log2 R and log2 G values (log2 of the square root of their product)
M1 = log of R to G ratio to the base 2.

SLIDE 17

Volcano plot of fold change (x axis) against log odds that gene is differentially expressed (y axis) for 100 most significantly varying genes.

This plot also shows that the most significantly varying genes in the pancreatic cancer tissues are down regulated, which probably means they are not functional. Some down regulated genes are also tumor suppressor genes and thus are candidates for project 2 drug screens in the Pancreatic PPG. Log odds of 5 means that the odds that these genes are NOT varying significantly from M=0 are e^-5, about 1/148. This is a measure of the false discovery rate.
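The log-odds arithmetic can be checked with a short Python sketch. The helper names below are hypothetical, and the conversion to a probability is the standard odds / (1 + odds) identity rather than anything stated on the slide.

```python
import math

def log_odds_to_odds(b):
    """Convert a natural-log odds value B into odds (for : against)."""
    return math.exp(b)

def log_odds_to_probability(b):
    """Convert a natural-log odds value B into a probability."""
    odds = math.exp(b)
    return odds / (1.0 + odds)

odds_b5 = log_odds_to_odds(5.0)          # odds in favor of differential expression
odds_against_b5 = 1.0 / odds_b5          # odds that the gene is NOT changing
prob_b0 = log_odds_to_probability(0.0)   # B = 0 corresponds to even odds
print(round(odds_b5), odds_against_b5, prob_b0)
```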

SLIDE 18

Example of genes varying significantly between 4 pancr. cancer tissues and a normal pancr. tissue sample. -TGen data - Agilent arrays.

Gb_accession  GeneName     Description                    M = log2(R/G)  A = log2 sqrt(RG)  t     p corr. for FDR  B
BC004490      FOS          V-fos transcr. factor          3.6            10.3               36.9  0.00090          7.41
NM_033194     NM_033194.1  Heat shock pr B9               1.7            8.2                30.0  0.00090          6.64
Y12661        VGF          VGF nerve growth factor        2.5            13.7               28.3  0.00090          6.41
AF488739      GABABL       family G protein coupled rec.  2.0            10.2               26.1  0.00090          6.06
……
NM_015711     GLTSCR1      Glioma tumor suppressor        1.0            10.3               15.3  0.00188          3.51
……
BC000311      COPEB        Kruppel-like transcr. factor   1.6            10.8               13.8  0.00250          2.99
NM_006999     POLS         DNA Poly. sigma                0.8            9.0                13.8  0.00250          2.98
……
NM_001530     HIF1A        Hypoxia-ind factor 1           1.4            7.4                7.6   0.00865          0.19
NM_001530     HIF1A        Hypoxia-ind factor 1           1.6            7.3                7.6   0.00867          0.20
NM_001530     HIF1A        Hypoxia-ind factor 1           1.5            7.3                7.6   0.00872          0.21
NM_001530     HIF1A        Hypoxia-ind factor 1           1.6            7.3                7.5   0.00888          0.26

p corr. for FDR = p-value adjusted for false discovery rate (Benjamini and Hochberg) for multiple hypothesis testing. FDR is the expected percent of false predictions in a set of predictions, in this case the percent of genes that are incorrectly reported to change.

B = log-odds that gene is differentially expressed, e.g. if B = 1.5, odds are e^1.5 = 4.48, i.e., odds of correct prediction are 4.48 to 1. For B = 0, odds are 1 to 1.
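For readers unfamiliar with the Benjamini and Hochberg adjustment, the following is a minimal Python sketch of the step-up procedure. It is an illustration under stated assumptions, not the code behind the table: the function name and the p-values are hypothetical, and the table itself was presumably produced with R/BioConductor tools.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by n / rank and
    # enforcing that adjusted values never increase as rank decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four genes (not taken from the slides).
example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(example)
```

Genes whose adjusted p-value falls below a chosen level, such as 0.05, form a list in which roughly that fraction is expected to be false positives.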

SLIDE 19

What do you do with a list of genes?

Influence on known metabolic and regulatory pathways (usually ~1/4 of genes)
Gene Ontology (GO) terms
Protein-protein and gene-gene interactions
Where located - genome amplification, rearrangements?
Agreement with models - biological and computational

SLIDE 20

Local genome databases are maintained at AZCC

Local databases of human, rat, mouse, and model organisms
Direct links to genetic, proteomic, and regulatory/pathway databases
Information on protein-protein and gene-gene interactions
http://www.biorag.org is a public access Web site

SLIDE 21

Pathway Miner

http://www.biorag.org/pathway.html
Pandey et al. 2004 Bioinformatics. 20:2156-8
Builds genetic network displays based on regulatory and metabolic relationships
Produces lists of genes in Excel format

SLIDE 22

Genetic network analysis of pancreatic data with Pathway Miner

top 800 pancreatic genes - GenMAPP pathways

A Java interactive display that can be filtered in many ways. Click on gene names to retrieve all relevant information and on edges to view the pathways in common. Any list of genes can be uploaded for analysis.

SLIDE 23

Five genes in the top 800 are in MAPK, including FOS

SLIDE 24

New target and drug strategy used in pancreatic cancer project.

Identify under-expressed tumor suppressor (TS) genes in pancreatic cancer tissues
Make isogenic pancreatic cell lines with combinations of these genes
Screen for differential sensitivity to a large NCI collection of drugs and chemicals/siRNA knockdowns

[Diagram: Cancer cell TS+ (A+), Cancer cell TS- (A-); find drug or siRNA specific for TS- cells]

SLIDE 25

Pathway Miner used for siRNA analysis, TGen data. DPC4+/- cell lines.

Gene knockdowns showing largest effects.

Red - greater killing DPC4-
Green - greater killing DPC4+

Conclusions:
may assist in choice of drug targets
knockdown of genes of closely related function can have quite opposite effect.

SLIDE 26

siRNA hits on Wnt pathway.

SLIDE 27

siRNA effects on nuclear receptors.

SLIDE 28

About prostate cancer

Men screened for a serum antigen - PSA
If levels go up -> biopsy specimens examined for evidence of cancer (black box -> Gleason score)
Decision made about prostatectomy (undesirable - incontinence, sex dysfunction, etc.)
Survival about 2/3
Tissues collected from men used for gene expression analysis using Affymetrix arrays (about 12,500 genes)

SLIDE 29

Analysis of a large Affy prostate data set (Singh et al. 2002, Cancer Cell)

50 normal tissues
52 staged tissues
Perform BioConductor Linear Models (LIMMA) analysis
Trying advanced statistical modeling and clustering of genes, e.g. independent component analysis (fastICA, MLICA), mixture models (nlme)
Test models of penetration, altered metabolism, etc.

SLIDE 30

The data set: human Affy hgu95av2 chip - 1/3 of genome

50 normal prostate tissues
52 cancer tissues at different stages

– 29 negative capsule penetration/20 positive
– 13 positive resection surg. margin/37 negative
– 9 non re-occurring/5 re-occurring
– Gleason score available
– no apparent dissection
– no apparent pairing of N/C samples

Singh et al. Cancer Cell March 2002

SLIDE 31

Results from prostate data set

Can find about 600-1,000 genes changing N/C depending on acceptable level of FDR
No significant changes in capsular, margin, or recurrence data (agrees with paper)
What next?

SLIDE 32

Biocarta pathways

SLIDE 33

Metabolic changes in prostate cancer - cells deprived of oxygen depend on these changes.

Cancer cells in general learn to survive with reduced oxygen and they make a factor (vascular endothelial growth factor or VEGF) that induces growth of blood vessels. This is clearly observed in gene expression data.

SLIDE 34

Getting at the unknown gene relationships.

Try to identify sets of genes that are regulated independently of other sets
A new method is independent component analysis (vs. principal components analysis, etc.)
– Can superimpose regulatory models and build a more detailed model
Genes interact to different degrees
Problem: find sets of genes that are statistically most different across the tissue samples
R provides resources

SLIDE 35

Independent component analysis: suitable for building and testing regulatory models

[Diagram: Matrix X (genes by samples) = Matrix A (genes by components) x Matrix S (components by samples); sample columns ordered NN…CC; example component rows: 33331111, 31331311, 31311311]

Do any of these gene groups better separate the sample classes C and N?
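Written out as an equation, the decomposition in the diagram would look like the sketch below. The orientation (genes as rows of X, samples as columns) is read off the diagram labels and is an assumption, as are the symbols G, N, and K for the numbers of genes, samples, and components; ICA conventions differ between papers and packages.

```latex
X = A\,S, \qquad
X \in \mathbb{R}^{G \times N},\quad
A \in \mathbb{R}^{G \times K},\quad
S \in \mathbb{R}^{K \times N},
\qquad
X_{gs} = \sum_{k=1}^{K} A_{gk}\, S_{ks}
```

Under this reading, each column of A is a group of genes with its weights, and each row of S gives that component's level in every sample, so a row of S that differs between the N and C columns marks a component that separates the two classes.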

SLIDE 36

Use of ICA in analysis of endometrial cancer

Noise Good separation

SA Saidi et al. Oncogene 2004

SLIDE 37

Some samples of ICA: objective - can we find a set to discriminate gene and tissue classes in prostate ca.?

[Panels: Hierarchical clustering (complete) of 102 prostate tissue samples; Boxplots of 102 samples after ICA]

SLIDE 38

Another approach - use list of genes that are of biological interest during early stages of prostate Ca. and build model.

14-3-3 sigma
actinin
BP180
BP230
cadherin
catenin
CD151
CD44
CD63
CD81
CD9
connexin 32
desmocollin
desmoglein
desmoplakin
ehm2
EWI
Ezrin
fascin
fibulin
HD1
keratin
laminin
MTA3
Nanos homolog 1
PKC-delta
plakoglobin
Plectin
SNAI1
tenascin
vinculin
Zona Occludens 1
Zona Occludens 2

Genes related to cell adhesion to intracellular matrix

If these genes change, then expect cells to be able to penetrate the capsule and invade surrounding tissues.

SLIDE 39

The source of germline variability in humans

One of my pairs of chromosomes: Maternal, Paternal
What I pass on to our children
What my wife passes on to our children

Hundreds of thousands of differences in sequence, called SNPs - single nucleotide polymorphisms

Inheritance is through haplotype blocks of 10s to 100s of kbases

SLIDE 40

Genotype revealed in humans by haplotype structure of 5q31 (Daly et al. 2001)

SLIDE 41

Goal: relationship between genotype and expression. Pomp et al. 2004 Large scale expression analysis mapped against genotype.

SLIDE 42

Conclusions and future plans

Gene expression data are used to identify drug targets
Further analysis
– ICA analysis - Maximum likelihood method
– Examine all penetration related genes for possible variation

SLIDE 43

Acknowledgements

Colleagues at UMC/AZCC/SWEHSC

Ritu Pandey, Greg Thomas, Rob Klein, Raghavendra Guru
Dave Alberts
Anne Cress
Gene Gerner
Serrine Lau - SWEHSC
Clark Lantz
Ray Nagle
Garth Powis
George Tsaprailis and the proteomics core
Bernie Futscher and George Watts of the genomics core

Colleagues at TGen, Phoenix

Dan Von Hoff
Jeff Trent
Phillip Stafford
Haiyong Han
Spyro Mousses