Statistical LeaRning Lydia Müller &Katja Nowick Bioinformatics group, Markus Kreuz IMISE
Differential Gene Expression Goals of today: Find out which genes are differentially expressed between human and chimpanzee brains • Using microarrays (Affymetrix arrays) • Using RNA-Seq (Illumina sequencing)
Gene Expression Analysis • Measure RNA levels for all or at least thousands of genes at a time for each sample • “Snapshot” approach A piece of tissue typically consists of different cell types Each cell has a unique set of thousands of different RNAs
The biology of gene expression Gene expression: mRNA = messenger RNA protein ncRNAs = non-coding RNAs active as RNAs All these different RNA “species” are measured by RNA-Seq
The biology of gene expression Gene X consisting of several exons Form A mRNAs of Form B Gene X Form C The different mRNAs produced from gene X are called different transcripts or different isoforms
Dataset for the exercises 5 humans 5 chimpanzees Expression profiles using Affymetrix arrays: HG-U133Plus2 Designed for human samples ~ 54,000 probesets ~21,000 genes
Affymetrix Multiple (11-20) 25-base oligonucleotide probes represent one transcript PM MM
Affymetrix 5´ 3´ Published Gene Sequence Multiple (11-20) 25-base oligonucleotide probes Perfect Match Mismatch PM is exactly complementary to published sequence MM is changed on 13 th base
QC: 1. Affy chip image For Affy arrays, artifacts should be < 10% of image, otherwise summarized probeset values could be affected…
QC: 2. Batch effects • In good experimental design, you compare two groups that only differ in one factor. • Batch effect can occur when subsets of the replicates are handled separately at any stage of the process; handling group becomes in effect another factor. Avoid processing all or most of one factor level together if you can’t do all the samples at once.
QC: 3. RNA degradation • Probes are ordered by location from 5’ to 3’ of transcript • RNA degradation starts from 5’ end signal intensity should decrease from 3’ to 5’ Slope should be <5
Data Preprocessing 1. Background correction 2. Normalization a. Within array b. Between arrays
Background and probe-specific correction Methods • MAS 5: PM-MM, only if PM>MM – Chip is divided into 16 sections – 2% lowest probes = background – PM and MM probes are corrected against background in their section – If MM>PM, adjusted so that MM<PM • RMA: PM = S + Noise; – Predicts signal based on assumption that it is the sum of the real signal and noise – Uses observed PM probes only – Integrates signal of all probes into model • GCRMA: based on amount + location of GC content of the probes
Normalization • Most methods of normalization assume: – Most genes are not changing – Changes are not all in same direction • This applies to both within-array and between- array normalization
Between-Array Normalization • Necessary when comparing intensity values between arrays • Mas5 does no between array normalization • RMA uses quantile normalization • Process all (and only) arrays of an experiment together! • Each chip will get the same empirical distribution of signal intensities tends to underestimate fold change
Summarization PM MM • RMA: uses only PM “median polish”: linear model based on all arrays in the study Expression values on log 2 scale • MAS5: uses PM-MM “1 - step Tukey biweight”, only for individual arrays
MAS 5 Presence Calls • Affymetrix has algorithm to determine presence/absence of transcripts • Despite problems with MM probes, they are useful in determining presence • A data set where all the transcripts present were known indicated that Affy’s P/M/A calls were 85% accurate with only 10% FDR ( Choe et al. 2005, Genome Biology 6: R16) • It makes only sense to analyze expressed transcripts
A word on technical replication… Generally have limited amount of replicates due to cost of the experiments or availability of samples Technical replication is seen by many statisticians as a waste of time and resources because they do not substantially increase your power to detect differences... Biological replicates: include the technical variation and give information on natural/biological variation to determine what amplitude of difference might be biologically relevant
What is different? 7 7 7 6.5 6.5 6.5 6 6 6 5.5 5.5 5.5 5 5 5 4.5 4.5 4.5 FC=2.28 FC=1.14 FC=1.16 4 4 4 P=0.001 P=0.07 P=0.04 3.5 3.5 3.5 3 3 3 7 7 7 6.5 6.5 6.5 6 6 6 5.5 5.5 5.5 5 5 5 4.5 4.5 4.5 FC=1.43 FC=1.81 FC=2.28 4 4 4 P=0.25 P=0.32 P=0.17 3.5 3.5 3.5 3 3 3 Be aware of effect of outliers or biological variation on FC! T-test is a common way to identify differentially expressed genes
Multiple Testing Problem • Most common statistical tests were good in the times before large scale datasets • P=0.01 means if you were repeating the same test 100x you would expect the same outcome only ones (1 in 100) • P<0.01 means you would not expect the same outcome even if you repeated the test 100 x • Problem: With microarrays, you are testing thousands of genes at the same time • If you had 10,000 genes and none were different, you would expect 100 to have p<0.01 by chance alone
Multiple Testing Solutions (1) Adjust p-values to reflect multiple testing: • Bonferroni correction: – Multiply p-values by number of tests done – Very conservative; greatly increases false negative rate • Benjamini & Hochberg correction: – Less conservative, willing to accept some number of false positives in order to decrease false negative rate • Storey’s q-values – Controls false discovery rate; – Estiamtes height of uniform part of p-value distribution and corrects p-value based on this height Calculate False Discovery Rate: - Try multiple p-value cut-offs and use what you are willing to accept
Multiple Testing Solutions (2) Non-specific filtering to remove genes before analysis – Remove non-expressed genes (e.g. genes not expressed in all samples) – This typically reduces the number of tests by 50-75% genes
Multiple Testing Solutions (3) a priori specification of a set of genes that are likely to be differentially expressed – E.g. from results from other experiments or other prior knowledge – Instead of correcting for all genes, this reduces the multiple testing correction to just that number of a priori defined genes – But don’t define your expectation after analyzing your current dataset or change your expectation after your previous one didn’t work
Limitations of microarrays • Limited to genes for which probes are on the array no de-novo gene discovery, only limited isoform information • Cross-hybridization • Only available for some species (unless you make custom arrays – expensive!) You can use arrays of highly related species (e.g. human arrays for chimpanzee samples) But regions with sequence difference would not hybridize equally well: X X Probeset Probeset Solutions: U se/create custom CDF files to “mask” probes you don’t want to consider Determine statistically the probes with unequal behavior between samples (Dannemann et al. 2009) RNA-Seq
RNA-Seq Quantification of gene expression 1. Biochemical amplification of all RNAs 2. Fragmentation of all RNAs 3. Biochemical conversion of all RNAs into cDNA 4. Sequencing of all cDNAs Short reads (e.g. 100 nt long) 5. Mapping of reads to a reference genome / genes EXON EXON EXON EXON 6. Count how many reads map to the gene to determine its expression level Quantification can be done at the level of genes, transcripts, exons
Quantification of gene expression Long gene EXON EXON EXON EXON Short gene Has fewer counts EXON EXON EXON EXON EXON EXON Need to normalize for gene / transcript length!
RNA-Seq vs Microarray differential expression With the costs for NGS decreasing, RNA-Seq will replace microarrays Microarrays Continues numbers Transcript abundance: Is linearly related to the fluorescence level recovered after hybridization RNA-Seq Counts of reads (discrete numbers) Transcript abundance: Counts are considered to be linearly related to transcript abundance Calculate expression values for genes, transcripts (i.e. different isoforms), or exons? Differential gene, transcript or exon expression?
Recommend
More recommend