Part II Extraction of gene signal from microarrays Swiss Institute of Bioinformatics
Scientific Process: Biological question (e.g. differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → Pre-processing steps (Image analysis / Quality assessment, with a "failed" branch / Normalization) → Data analysis (Significance testing, Clustering, Discrimination) → Biological verification and interpretation
Steps in Image Processing • Addressing (or Gridding) – Assigning coordinates to each spot • Segmentation – Classification of pixels as either foreground (signal) or background • Information Extraction – Foreground fluorescence intensity pairs (R, G) – Background intensities – Quality measures
Addressing • This is the process of assigning coordinates to each of the spots • Automating this part of the procedure permits high-throughput analysis • Example layout (figure): 4 by 4 grids, 19 by 21 spots per grid
Problems in automatic addressing • Misregistration of the red and green channels • Rotation of the array in the image (figure: rotation)
Problems in automatic addressing • Skew in the array
Addressing • Basic structure of images known (determined by the arrayer) • Parameters to address spot positions – Separation between rows and columns of grids – Individual translation of grids – Separation between rows and columns of spots within each grid – Small individual translation of spots – Overall position of the array in the image
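The parameters listed above can be turned into nominal spot coordinates. The following is only a minimal sketch: the function name, argument names and numeric pitches are hypothetical, the 19 by 21 layout is assumed to mean 19 rows by 21 columns, and rotation, skew and the small per-grid and per-spot translations are deliberately ignored.

```python
def spot_centers(origin, grid_shape, spots_per_grid, grid_pitch, spot_pitch):
    """Nominal (x, y) centre of every spot on an ideal, unrotated array.

    origin         -- (x, y) of the first spot of the first grid
    grid_shape     -- (rows, cols) of grids on the array
    spots_per_grid -- (rows, cols) of spots within one grid
    grid_pitch     -- (row, col) separation between grids, in pixels
    spot_pitch     -- (row, col) separation between spots, in pixels
    """
    x0, y0 = origin
    centers = {}
    for gr in range(grid_shape[0]):
        for gc in range(grid_shape[1]):
            # top-left spot of this grid
            gx = x0 + gc * grid_pitch[1]
            gy = y0 + gr * grid_pitch[0]
            for sr in range(spots_per_grid[0]):
                for sc in range(spots_per_grid[1]):
                    centers[(gr, gc, sr, sc)] = (
                        gx + sc * spot_pitch[1],
                        gy + sr * spot_pitch[0],
                    )
    return centers


# Illustrative values only (pixel pitches are made up).
centers = spot_centers(
    origin=(100.0, 80.0),
    grid_shape=(4, 4),
    spots_per_grid=(19, 21),
    grid_pitch=(450.0, 500.0),
    spot_pitch=(20.0, 20.0),
)
```

A real addressing step would then refine these nominal positions against the image, which is where the individual translations and the overall array position come in.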
Segmentation Methods • Fixed circles • Adaptive circles • Adaptive shape – Edge detection – Seeded Region Growing (R. Adams and L. Bischof, 1994): regions grow outwards from seed points preferentially according to the difference between a pixel's value and the running mean of values in an adjoining region • Histogram methods
Fixed circle segmentation • Fits a circle with a constant diameter to all spots in the image • Easy to implement • Requires all spots to have the same shape and size
Adaptive circle segmentation • The circle diameter is estimated separately for each spot • Example: Dapple finds spots by detecting their edges (second derivative) • Problematic if spots are oval
Limitations of circular segmentation – Small spots – Non-circular spots (figure: result of Seeded Region Growing)
Adaptive shape segmentation • Specification of starting points or seeds – Bonus: already know geometry of array • Regions grow outwards from the seed points preferentially according to the difference between a pixel's value and the running mean of values in an adjoining region
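The growth rule described above can be illustrated on a toy image. This is a simplified sketch in the spirit of seeded region growing, not the published algorithm verbatim: candidate pixels are kept in a priority queue keyed by their difference to the region mean at the time they were queued, 4-connectivity is assumed, and the function name and seed labels are hypothetical.

```python
import heapq


def seeded_region_growing(image, seeds):
    """Label every pixel of `image` (list of rows) starting from `seeds`.

    seeds maps (row, col) -> region label (strings here).
    """
    n_rows, n_cols = len(image), len(image[0])
    labels = [[None] * n_cols for _ in range(n_rows)]
    totals, counts = {}, {}
    for (r, c), lab in seeds.items():
        labels[r][c] = lab
        totals[lab] = totals.get(lab, 0.0) + image[r][c]
        counts[lab] = counts.get(lab, 0) + 1

    heap = []

    def push_neighbours(r, c, lab):
        # Queue unlabelled 4-neighbours, keyed by distance to the running mean.
        mean = totals[lab] / counts[lab]
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n_rows and 0 <= cc < n_cols and labels[rr][cc] is None:
                heapq.heappush(heap, (abs(image[rr][cc] - mean), rr, cc, lab))

    for (r, c), lab in seeds.items():
        push_neighbours(r, c, lab)

    while heap:
        _, r, c, lab = heapq.heappop(heap)
        if labels[r][c] is not None:
            continue  # already claimed by a closer-matching region
        labels[r][c] = lab
        totals[lab] += image[r][c]
        counts[lab] += 1
        push_neighbours(r, c, lab)
    return labels


# Toy image: an irregular bright spot on a flat background.
image = [
    [10, 10, 10, 10, 10],
    [10, 90, 95, 10, 10],
    [10, 92, 100, 94, 10],
    [10, 10, 96, 10, 10],
    [10, 10, 10, 10, 10],
]
labels = seeded_region_growing(image, {(2, 2): "fg", (0, 0): "bg"})
foreground = {(r, c) for r in range(5) for c in range(5) if labels[r][c] == "fg"}
```

Because the array geometry is known, the foreground seed can be placed at the nominal spot centre and the background seed between spots; the resulting mask follows the spot's actual shape rather than a circle.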
Histogram segmentation • Choose a target mask larger than any spot • Foreground and background intensities are determined from the histogram of pixel values within the masked area • Example: QuantArray – Background: mean between 5th and 20th percentile – Foreground: mean between 80th and 95th percentile • May not work well when a large target mask is set to compensate for variation in spot size
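The percentile rule quoted above can be sketched as follows. This is an illustration only, not QuantArray's actual implementation: the helper name is hypothetical, and percentile boundaries are taken as simple index cut-offs in the sorted pixel list (other percentile conventions would give slightly different values).

```python
def percentile_band_mean(pixels, lo_pct, hi_pct):
    """Mean of the pixel values lying between two percentiles of the mask."""
    ordered = sorted(pixels)
    n = len(ordered)
    lo = n * lo_pct // 100
    hi = max(lo + 1, n * hi_pct // 100)  # keep at least one pixel in the band
    band = ordered[lo:hi]
    return sum(band) / len(band)


# Toy mask: 70 background pixels and 30 spot pixels.
mask_pixels = [100] * 70 + [1000] * 30
background = percentile_band_mean(mask_pixels, 5, 20)
foreground = percentile_band_mean(mask_pixels, 80, 95)
```

If the mask is much larger than the spot, fewer than 20% of its pixels may belong to the spot, so the 80th to 95th percentile band starts to include background pixels; this is the failure mode mentioned in the last bullet.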
Information Extraction • Spot intensities – mean of pixel intensities – median of pixel intensities – pixel variation (e.g. IQR) • Background values – None – Local Take the average – Constant (global) • Quality information
Spot 'foreground' intensity • The total amount of hybridization for a spot is proportional to the total fluorescence generated by the spot • Spot intensity = sum of pixel intensities within the spot mask • Since later calculations are based on ratios between Cy5 and Cy3, we compute the average* pixel value over the spot mask (* alternative: ratios of medians may be better than means if bright specks are present)
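A small made-up example shows why the footnote matters. The pixel values below are invented for illustration: one red-channel pixel is a bright speck, and the log ratio is computed once from channel means and once from channel medians.

```python
from math import log2
from statistics import mean, median

# Pixels inside one spot mask; the last red pixel is a bright speck (dust).
red = [1000, 1020, 980, 1010, 990, 60000]
green = [500, 510, 490, 505, 495, 500]

m_from_means = log2(mean(red) / mean(green))        # about 4.4, dominated by the speck
m_from_medians = log2(median(red) / median(green))  # about 1.0, close to the true 2-fold ratio
```

The median-based ratio stays near the two-fold difference carried by the other five pixels, whereas a single contaminated pixel shifts the mean-based log ratio by more than three units.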
Background intensity • The measured fluorescence intensity includes contributions from non-specific hybridization and from other chemicals on the glass • Fluorescence from regions not occupied by DNA should be different from regions occupied by DNA → one solution is to use local negative controls (spotted DNA that should not hybridize)
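BG: None • Do not consider the background – Can be better than some forms of local background determination with good quality arrays • With a loose mathematical notation (the σ terms stand for the noise on each measurement), the background-corrected log ratio

\[
M = \log_2 \frac{R_{fg} + \sigma_{R,fg} - R_{bg} + \sigma_{R,bg}}{G_{fg} + \sigma_{G,fg} - G_{bg} + \sigma_{G,bg}}
\approx \log_2 \frac{R_{fg} + (\sigma_{R,fg} + \sigma_{R,bg})}{G_{fg} + (\sigma_{G,fg} + \sigma_{G,bg})}
\]

is worse than the uncorrected one, because the foreground and background noise terms add up:

\[
M = \log_2 \frac{R_{fg} + \sigma_{R,fg}}{G_{fg} + \sigma_{G,fg}}
\]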
BG: Local • Focus on small regions surrounding the spot mask • Median of pixel values in this region • Most software implements such an approach • By ignoring pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure
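A local background estimate of this kind might look like the sketch below. The square ring geometry, the function name and the `inner`/`outer` parameters are assumptions made for illustration; actual image analysis packages each define their own background region.

```python
from statistics import median


def local_background(image, center, inner, outer):
    """Median of pixels in a square ring around `center`.

    A pixel belongs to the ring when its Chebyshev distance d to the centre
    satisfies inner < d <= outer, so pixels right next to the spot are skipped.
    """
    r0, c0 = center
    ring = []
    for r in range(max(0, r0 - outer), min(len(image), r0 + outer + 1)):
        for c in range(max(0, c0 - outer), min(len(image[0]), c0 + outer + 1)):
            if max(abs(r - r0), abs(c - c0)) > inner:
                ring.append(image[r][c])
    return median(ring)


# Toy 7x7 image: 3x3 spot (500), a halo of leaked signal (120), flat background (50).
size = 7
image = [[50] * size for _ in range(size)]
for r in range(size):
    for c in range(size):
        d = max(abs(r - 3), abs(c - 3))
        if d <= 1:
            image[r][c] = 500
        elif d == 2:
            image[r][c] = 120

bg_touching_spot = local_background(image, (3, 3), inner=1, outer=2)  # halo only
bg_with_gap = local_background(image, (3, 3), inner=2, outer=3)       # halo skipped
```

In this toy case the ring that touches the spot picks up the halo, while the ring that leaves a one-pixel gap recovers the flat background level.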
Background can matter (figure panels: without BG correction; with BG correction)
Summary • Image analysis is a crucial preprocessing step – Association of a "geographic" location (and corresponding annotation) with signal intensities – Several non-trivial technical choices (scanner, image analysis software, etc.) can affect the quality of the signal • Background correction is sometimes not desirable (low-background arrays)
Quality assessment overview • Visual inspection of images • Evaluation of MvA plots • Comparison of statistical summaries for the chips
Red/Green overlay images • Co-registration and overlay offers a quick visualization, revealing information on color balance, uniformity of hybridization, spot uniformity, background, and artifacts such as dust or scratches
Spatial plots: background from two slides
Practical Problems 1: Comet Tails • Likely cause: insufficiently rapid immersion of the slides in the succinic anhydride blocking solution
Practical Problems 2
Practical Problems 3: High Background • 2 likely causes: – Insufficient blocking – Precipitation of the labeled probe • Weak Signals
Practical Problems 4
Practical Problems 5
Artifacts in microarrays • We are interested in finding true, biologically meaningful differences between sample types • Because of other sources of systematic variation, there are usually artifactual differences as well • Sources of artifacts include: – print tips - differences in subarrays – plate effects – differences in rows within subarray – batch effects – hybridization artifacts
Sample boxplot (figure)
Boxplots of $\log_2 (R/G)$ (Example data associated with the limmaGUI package.)
Scientific Process: Biological question (e.g. differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → Pre-processing steps (Image analysis / Quality assessment, with a "failed" branch / Normalization) → Data analysis (Estimation, Testing, Clustering, Discrimination) → Biological verification and interpretation
Pin group (sub-array) effects (figure)
Boxplots, highlighting pin group effects • Clear example of spatial bias
Preprocessing: Normalization • Why? To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples • How do we know it is necessary? By examining self-self hybridizations, where no true differential expression is occurring. These reveal dye biases which vary with spot intensity, location on the array, plate origin, pins, scanning parameters, etc.
What is self-self hybridization? • In dual channel (2-color) microarrays, such as cDNA arrays, two samples are each labeled with a different fluorescent dye • In most studies, the samples are from different sources (e.g. cancer vs. normal) • However, it is also possible to co-hybridize two samples from the same source (but differently labeled)
Dual channel co-hybridizations • Control sample vs. treated sample • Control sample vs. control sample (self-self)
Self-self hybridizations • False color overlay • Boxplots within pin-groups • Scatter (MA-)plots
Normalization: global • Normalization based on a global adjustment: $\log_2 (R/G) \to \log_2 (R/G) - c = \log_2 (R/(kG))$ • Common choices for $k$, or equivalently $c = \log_2 k$: $c$ = median or mean of the log ratios for a particular gene set (e.g. all genes, control genes, or 'housekeeping' genes) • Another possibility is total intensity normalization, where $k = \sum_i R_i / \sum_i G_i$
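Both choices of constant can be written in a few lines. The sketch below is illustrative: the function names are hypothetical, the intensities are invented, and the median is taken over all spots (one of the gene sets mentioned above).

```python
from math import log2
from statistics import median


def global_median_normalize(red, green):
    """Subtract c = median log ratio from every log2(R/G); returns (M', c)."""
    m = [log2(r / g) for r, g in zip(red, green)]
    c = median(m)
    return [mi - c for mi in m], c


def total_intensity_k(red, green):
    """k such that the total red and total (k * green) intensities match."""
    return sum(red) / sum(green)


# Toy intensities for five spots.
red = [200, 400, 800, 1600, 100]
green = [100, 100, 100, 100, 100]

normalized, c = global_median_normalize(red, green)
k = total_intensity_k(red, green)
```

After the adjustment the median log ratio is zero by construction; total intensity normalization uses $c = \log_2 k$ instead, which generally gives a different shift.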
Effect of global normalization
Normalization: intensity-dependent • Here, run a curve through the middle of the MA plot, shifting the M value of each pair (A, M) by $c = c(A)$, i.e. $\log_2 (R/G) \to \log_2 (R/G) - c(A) = \log_2 (R/(k(A)\,G))$ • One estimate of $c(A)$ is made using the LOWESS (or loess) function of Cleveland (1979): LOcally WEighted Scatterplot Smoothing
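To make the idea concrete, here is a simplified stand-in for LOWESS: a single pass of locally weighted linear regression with tricube weights and no robustness iterations, so it is not Cleveland's full procedure. The function names and the `frac` parameter are assumptions, and A is taken as $\tfrac{1}{2}\log_2(RG)$, the usual MA-plot convention.

```python
from math import log2


def ma_values(red, green):
    """M = log2(R/G) and A = 0.5 * log2(R*G) for each spot."""
    m = [log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * log2(r * g) for r, g in zip(red, green)]
    return a, m


def local_linear_fit(a, m, frac=0.4):
    """Estimate c(A) at every spot by tricube-weighted local linear regression."""
    n = len(a)
    k = min(n, max(3, int(frac * n)))  # number of neighbours per local fit
    fitted = []
    for a0 in a:
        nearest = sorted(range(n), key=lambda j: abs(a[j] - a0))[:k]
        h = max(abs(a[j] - a0) for j in nearest) or 1.0  # local bandwidth
        w = [(1 - (abs(a[j] - a0) / h) ** 3) ** 3 for j in nearest]
        sw = sum(w)
        a_bar = sum(wi * a[j] for wi, j in zip(w, nearest)) / sw
        m_bar = sum(wi * m[j] for wi, j in zip(w, nearest)) / sw
        sxx = sum(wi * (a[j] - a_bar) ** 2 for wi, j in zip(w, nearest))
        sxy = sum(wi * (a[j] - a_bar) * (m[j] - m_bar) for wi, j in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(m_bar + slope * (a0 - a_bar))
    return fitted


def intensity_normalize(a, m, frac=0.4):
    """Return M - c(A) for every spot."""
    return [mi - ci for mi, ci in zip(m, local_linear_fit(a, m, frac))]


# Toy data: a smooth intensity-dependent dye bias and no true differential expression.
a = [6 + 0.5 * i for i in range(20)]
m = [0.05 * (x - 11) ** 2 for x in a]
normalized = intensity_normalize(a, m)
```

For real data an established implementation with robustness iterations is preferable, since outlying (truly differentially expressed) spots would otherwise pull the fitted curve.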
Effect of lowess normalization
Comparison between arrays • Different arrays often do not show identical distributions of M values – Various technical reasons (e.g. labeling efficiency, amount of labeled RNA, scanner settings, etc.) • Need to normalize the signal between chips – Multiple possibilities; one often used is "scale normalization"
Scale normalization: between slides • Idea: make the spread of M values identical across slides by multiplying them by a chip-specific constant • Figure: boxplots of log ratios from 3 replicate self-self hybridizations – Left panel: before normalization – Middle panel: after within print-tip group normalization – Right panel: after a further between-slide scale normalization
Taking scale into account • Assume: all slides have the same spread in M • True log ratio is $m_{ij}$, where $i$ indexes slides and $j$ indexes spots • Observed is $M_{ij} = a_i m_{ij}$ • A robust estimate of $a_i$ (up to a constant common to all slides) is $\mathrm{MAD}_i = \mathrm{median}_j \{ |M_{ij} - \mathrm{median}_j(M_{ij})| \}$ • Could instead make the same assumption for print-tip groups (rather than slides)
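A minimal sketch of MAD-based scale normalization follows. Dividing each slide's MAD by the geometric mean of all MADs, so that the factors multiply to one, is an assumption added here to fix the common constant; the function names are hypothetical and every slide is assumed to have a non-zero MAD.

```python
from math import exp, log
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def scale_normalize(slides):
    """Rescale each slide's M values so that all slides share the same MAD.

    slides -- list of lists of M values, one inner list per slide.
    Returns (rescaled slides, per-slide scale factors a_i).
    """
    mads = [mad(s) for s in slides]
    geo_mean = exp(sum(log(x) for x in mads) / len(mads))
    factors = [x / geo_mean for x in mads]
    rescaled = [[v / a for v in s] for s, a in zip(slides, factors)]
    return rescaled, factors


# Two toy slides with the same pattern but a two-fold difference in spread.
slides = [[-1, 0, 1, 2, -2], [-2, 0, 2, 4, -4]]
rescaled, factors = scale_normalize(slides)
```

The same function could be applied to print-tip groups within a slide by passing one list of M values per group instead of per slide.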
NCI 60 experiments
Same normalization on another data set (figure)
Normalization: Summary • Reduces systematic (not random) effects • Makes it possible to compare several arrays • Use log ratios (MvA plots) • Lowess normalization (dye bias) • Pin-group location normalization • Pin-group scale normalization • Between-slide scale normalization • Control spots • Normalization introduces more variability • Outliers (bad spots) handled with replication
cDNA gene expression data
Data on p genes for n samples (rows: genes; columns: samples):

Gene   sample1  sample2  sample3  sample4  sample5  …
1         0.46     0.30     0.80     1.51     0.90  ...
2        -0.10     0.49     0.24     0.06     0.46  ...
3         0.15     0.74     0.04     0.10     0.20  ...
4        -0.45    -1.03    -0.79    -0.56    -0.32  ...
5        -0.06     1.06     1.35     1.09    -1.09  ...
Software for Microarray Analysis • A very large number of commercial and free software packages exist (GeneSpring, PathwayAssist, …) • Several R packages for microarray analysis are available as part of the open source Bioconductor project http://www.bioconductor.org/ • Bioconductor software is often written by the authors of the methodology
Replicated experiments • Have n replicates • For each gene, have n values of $M = \log_2$ fold change, one from each array • Summarize $M_1, \ldots, M_n$ for each gene by – $\bar{M} = \mathrm{average}(M_1, \ldots, M_n)$ – $s = \mathrm{SD}(M_1, \ldots, M_n)$ • Rank genes in order of strength of evidence in favor of differential expression (DE) • How might we do this?
Which genes are DE? • Difficult to judge significance – massive multiple testing problem – genes dependent – don't know null distribution of M • Strategy – aim to rank genes – assume most genes are not DE (depending on type of experiment and array) – find genes separated from the majority
Ranking criteria • Genes $i = 1, \ldots, p$ • $\bar{M}_i$ = average $\log_2$ fold change for gene $i$ – Problem: genes with large variability are likely to be selected, even if not DE • Fix that by taking variability into account: use $t_i = \bar{M}_i / (s_i / \sqrt{n})$ – Problem: genes with extremely small variances give very large $t$ – When the number of replicates is small, the smallest $s_i$ are likely to be underestimates
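Both problems can be seen on a made-up example with three genes and four replicates. The gene names and values below are invented, and the helper assumes every gene has a non-zero standard deviation.

```python
from math import sqrt
from statistics import mean, stdev


def rank_genes(log_ratios):
    """Return (genes ranked by |t|, per-gene (M_bar, s, t)).

    log_ratios maps gene name -> list of replicate M values.
    """
    stats = {}
    for gene, ms in log_ratios.items():
        m_bar = mean(ms)
        s = stdev(ms)  # sample standard deviation (n - 1 denominator)
        t = m_bar / (s / sqrt(len(ms)))
        stats[gene] = (m_bar, s, t)
    ranked = sorted(stats, key=lambda g: abs(stats[g][2]), reverse=True)
    return ranked, stats


log_ratios = {
    "geneA": [2.0, 2.1, 1.9, 2.0],      # large, consistent change
    "geneB": [3.0, -2.5, 4.0, -1.0],    # sizeable average, very variable
    "geneC": [0.10, 0.11, 0.10, 0.09],  # tiny change, tiny variance
}

ranked_by_t, stats = rank_genes(log_ratios)
ranked_by_mean = sorted(stats, key=lambda g: abs(stats[g][0]), reverse=True)
```

Ranking by the average puts the noisy geneB second, while ranking by $t$ demotes it but promotes geneC, whose change is negligible and whose $t$ is large only because its estimated variance is very small.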
Summary • Image analysis is important to extract information from the array – Background may or may not be taken into account • Normalization procedures are always needed – To remove systematic (technical) effects – To allow comparisons between chips • Identification of differentially expressed genes is difficult – No absolute estimation of significance is possible – Ranking of genes by significance is possible
End of part II
Part III Higher-level analysis
Finding biological information • Once the matrix of gene expression vs. samples is available, statistical tools can be used to: – Find similarities (or differences) of expression pattern among differentially expressed genes – Find differentially expressed functional groups of genes (pathway analysis, gene ontology) – Find classes in the set of samples (unsupervised analysis) – Use differentially expressed genes as a means to classify samples into known categories (supervised analysis) – Find genes significantly related to survival in a pool of patients
Unsupervised analysis: Cluster analysis • data matrix (n,p) • distance matrix (n,n), similarity matrix (n,n) • cluster formation: – mutually exclusive clusters – hierarchical clusters • comparison of clusters, means and variances (figure: dendrogram)