Introduction to microarrays
Thierry Sengstag, PhD, Bioinformatics Core Facility

Part II Extraction of gene signal from microarrays Swiss Institute of Bioinformatics

Scientific Process
• Biological question (e.g. differentially expressed genes, sample class prediction, etc.)
• Experimental design
• Microarray experiment
• Pre-processing steps
  – Image analysis
  – Quality assessment (may fail)
  – Normalization
• Data analysis
  – Significance testing
  – Clustering
  – Discrimination
• Biological verification and interpretation

Steps in Image Processing
• Addressing (or gridding)
  – Assigning coordinates to each spot
• Segmentation
  – Classification of pixels as either foreground (signal) or background
• Information extraction
  – Foreground fluorescence intensity pairs (R, G)
  – Background intensities
  – Quality measures

Addressing
This is the process of assigning coordinates to each of the spots. Automating this part of the procedure permits high-throughput analysis.
Example layout: 4 by 4 grids, 19 by 21 spots per grid

Problems in automatic addressing
• Misregistration of the red and green channels
• Rotation of the array in the image
• Skew in the array

Addressing
• Basic structure of images known (determined by the arrayer)
• Parameters to address spot positions
  – Separation between rows and columns of grids
  – Individual translation of grids
  – Separation between rows and columns of spots within each grid
  – Small individual translation of spots
  – Overall position of the array in the image
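The parameters listed above can be turned into nominal spot coordinates. The following is a minimal sketch, not the method of any particular image analysis package: the function name, the pixel values for the origin and pitches, and the reading of "19 by 21" as 19 rows by 21 columns are all assumptions made for illustration. Real software would additionally fit the small per-grid and per-spot translations.

```python
def nominal_spot_centres(origin, grid_pitch, spot_pitch,
                         grids=(4, 4), spots=(19, 21)):
    """Map (grid_row, grid_col, spot_row, spot_col) to a nominal (x, y) centre.

    origin      -- (x, y) of the first spot of the first grid (overall array position)
    grid_pitch  -- (dx, dy) separation between columns and rows of grids
    spot_pitch  -- (dx, dy) separation between columns and rows of spots in a grid
    """
    ox, oy = origin
    grid_rows, grid_cols = grids
    spot_rows, spot_cols = spots
    centres = {}
    for gr in range(grid_rows):
        for gc in range(grid_cols):
            # top-left spot of this grid
            gx = ox + gc * grid_pitch[0]
            gy = oy + gr * grid_pitch[1]
            for sr in range(spot_rows):
                for sc in range(spot_cols):
                    centres[(gr, gc, sr, sc)] = (gx + sc * spot_pitch[0],
                                                 gy + sr * spot_pitch[1])
    return centres


# Hypothetical geometry, in pixels
centres = nominal_spot_centres(origin=(50.0, 40.0),
                               grid_pitch=(450.0, 420.0),
                               spot_pitch=(20.0, 20.0))
print(len(centres), centres[(1, 2, 3, 4)])
```

Rotation and skew of the array are not modelled here; they would enter as an affine transform applied to the nominal coordinates.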

Segmentation Methods
• Fixed circles
• Adaptive circles
• Adaptive shape
  – Edge detection
  – Seeded Region Growing (R. Adams and L. Bischof, 1994): regions grow outwards from seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region
• Histogram methods

Fixed circle segmentation
• Fits a circle with a constant diameter to all spots in the image
• Easy to implement
• Requires that all spots have the same shape and size
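A fixed circle mask is simple enough to sketch in a few lines. The example below is illustrative only: the helper names and the toy 5 by 5 image are assumptions, and the image is a plain list of rows rather than a real scanner file.

```python
def circle_mask(centre, radius, shape):
    """Set of (row, col) pixels lying within `radius` of `centre`."""
    cy, cx = centre
    rows, cols = shape
    return {(r, c) for r in range(rows) for c in range(cols)
            if (r - cy) ** 2 + (c - cx) ** 2 <= radius ** 2}


def mean_in_mask(image, mask):
    """Average pixel value over the mask."""
    return sum(image[r][c] for r, c in mask) / len(mask)


# Toy image: background of 10 with a small bright spot in the middle
image = [[10] * 5 for _ in range(5)]
for r, c in [(2, 2), (1, 2), (3, 2), (2, 1), (2, 3)]:
    image[r][c] = 200

mask = circle_mask(centre=(2, 2), radius=1, shape=(5, 5))
spot_mean = mean_in_mask(image, mask)
print(len(mask), spot_mean)
```

The same radius is reused for every spot, which is why the method degrades when spot sizes vary.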

Adaptive circle segmentation
• The circle diameter is estimated separately for each spot
  – Example: Dapple finds spots by detecting their edges (second derivative)
• Problematic if spots are oval

Limitation of circular segmentation
– Small spot
– Not circular
Figure: result of Seeded Region Growing

Adaptive shape segmentation
• Specification of starting points or seeds
  – Bonus: already know geometry of array
• Regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region
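The growth rule described above can be sketched with a priority queue. This is a simplified illustration in the spirit of seeded region growing, not a faithful reimplementation of the published algorithm: priorities are computed when a pixel is queued and are not refreshed as region means change, only 4-connected neighbours are used, and the function name, labels and toy image are assumptions.

```python
import heapq


def seeded_region_growing(image, seeds):
    """Label every pixel of `image` (list of rows) starting from `seeds`.

    seeds -- {label: [(row, col), ...]}
    Pixels are absorbed in order of increasing |pixel value - region mean|.
    """
    rows, cols = len(image), len(image[0])
    labels = [[None] * cols for _ in range(rows)]
    total, count = {}, {}      # running sum and size of each region
    heap = []
    counter = 0                # tie-breaker for equal priorities

    def push_neighbours(r, c, label):
        nonlocal counter
        mean = total[label] / count[label]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc] is None:
                delta = abs(image[rr][cc] - mean)
                heapq.heappush(heap, (delta, counter, rr, cc, label))
                counter += 1

    for label, pixels in seeds.items():
        total[label] = sum(image[r][c] for r, c in pixels)
        count[label] = len(pixels)
        for r, c in pixels:
            labels[r][c] = label
    for label, pixels in seeds.items():
        for r, c in pixels:
            push_neighbours(r, c, label)

    while heap:
        _, _, r, c, label = heapq.heappop(heap)
        if labels[r][c] is not None:
            continue           # already claimed by some region
        labels[r][c] = label
        total[label] += image[r][c]
        count[label] += 1
        push_neighbours(r, c, label)
    return labels


# Toy image with an irregular (non-circular) bright spot
image = [
    [10, 10, 10, 10, 10],
    [10, 90, 95, 10, 10],
    [10, 92, 99, 91, 10],
    [10, 10, 94, 10, 10],
    [10, 10, 10, 10, 10],
]
labels = seeded_region_growing(image, {"fg": [(2, 2)], "bg": [(0, 0)]})
foreground = {(r, c) for r in range(5) for c in range(5) if labels[r][c] == "fg"}
print(sorted(foreground))
```

Because the array geometry is known from addressing, one foreground seed per spot centre and background seeds between spots are natural choices.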

Histogram segmentation
• Choose target mask larger than any spot
• Fg and bg intensities determined from the histogram of pixel values for pixels within the masked area
• Example: QuantArray
  – Background: mean between 5th and 20th percentile
  – Foreground: mean between 80th and 95th percentile
• May not work well when a large target mask is set to compensate for variation in spot size
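The percentile rule quoted for QuantArray can be sketched as follows. The rank-based percentile cut-offs and the function names are assumptions made for this example; the exact percentile convention used by the software is not given in the slides.

```python
def percentile_band_mean(values, lo_pct, hi_pct):
    """Mean of the sorted values whose rank lies between two percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    lo = int(n * lo_pct / 100)
    hi = max(lo + 1, int(n * hi_pct / 100))   # keep at least one value
    band = ordered[lo:hi]
    return sum(band) / len(band)


def histogram_segmentation(mask_pixels):
    """Return (foreground, background) estimates for one target mask."""
    background = percentile_band_mean(mask_pixels, 5, 20)
    foreground = percentile_band_mean(mask_pixels, 80, 95)
    return foreground, background


# Toy mask: 60 background pixels and 40 spot pixels
mask_pixels = [100] * 60 + [1000] * 40
fg, bg = histogram_segmentation(mask_pixels)
print(fg, bg)
```

If the mask is much larger than the spot, fewer than 20% of its pixels may belong to the spot, and the 80th to 95th percentile band then mixes in background pixels.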

Information Extraction
• Spot intensities
  § Mean of pixel intensities
  § Median of pixel intensities
  § Pixel variation (e.g. IQR)
• Background values
  § None
  § Local (take the average)
  § Constant (global)
• Quality information

Spot ‘foreground’ intensity
• The total amount of hybridization for a spot is proportional to the total fluorescence generated by the spot
• Spot intensity = sum of pixel intensities within the spot mask
• Since later calculations are based on ratios between Cy5 and Cy3, we compute the average* pixel value over the spot mask
* Alternative: ratios of medians may be better than ratios of means if bright specks are present
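A small numerical illustration of the footnote, using made-up pixel values: a single saturated speck in the red channel inflates the ratio of means, while the ratio of medians is unaffected.

```python
import math
from statistics import mean, median

# Hypothetical pixel values inside one spot mask
red = [500, 510, 495, 505, 490, 500, 65535]    # last pixel is a bright speck
green = [250, 255, 245, 252, 248, 250, 251]

m_mean = math.log2(mean(red) / mean(green))      # log ratio of means
m_median = math.log2(median(red) / median(green))  # log ratio of medians
print(round(m_mean, 2), m_median)
```

Without the speck both summaries would give a log ratio close to 1 (a two-fold difference).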

Background intensity
• The measured fluorescence intensity includes a contribution of non-specific hybridization and other chemicals on the glass
• Fluorescence from regions not occupied by DNA can differ from that of regions occupied by DNA
  → one solution is to use local negative controls (spotted DNA that should not hybridize)

BG: None
• Do not consider the background
  – With good quality arrays, this can be better than some forms of local background determination
With a loose mathematical notation (each $\sigma$ denoting the noise on the corresponding measured intensity):

$$M = \log_2 \frac{R_{fg} + \sigma_{R,fg} - R_{bg} + \sigma_{R,bg}}{G_{fg} + \sigma_{G,fg} - G_{bg} + \sigma_{G,bg}} \approx \log_2 \frac{R_{fg} + (\sigma_{R,fg} + \sigma_{R,bg})}{G_{fg} + (\sigma_{G,fg} + \sigma_{G,bg})}$$

is worse than

$$M = \log_2 \frac{R_{fg} + \sigma_{R,fg}}{G_{fg} + \sigma_{G,fg}}$$
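The argument can be checked with a small simulation. This is a sketch under stated assumptions, not a model of any real scanner: noise is taken as independent and Gaussian, the true ratio is 1, and the intensity levels and standard deviations are invented for the example.

```python
import math
import random
import statistics


def simulate(rng, n=5000, fg=1000.0, bg=100.0, sd_fg=50.0, sd_bg=30.0):
    """Return (SD of M without bg subtraction, SD of M with bg subtraction)."""
    m_none, m_sub = [], []
    for _ in range(n):
        r_fg, g_fg = rng.gauss(fg, sd_fg), rng.gauss(fg, sd_fg)
        r_bg, g_bg = rng.gauss(bg, sd_bg), rng.gauss(bg, sd_bg)
        m_none.append(math.log2(r_fg / g_fg))
        m_sub.append(math.log2((r_fg - r_bg) / (g_fg - g_bg)))
    return statistics.stdev(m_none), statistics.stdev(m_sub)


sd_none, sd_sub = simulate(random.Random(0))
print(sd_none, sd_sub)
```

Subtracting a noisy background estimate adds its noise to each channel while shrinking the signal, so the log ratios scatter more. When the background is high or uneven, the bias removed by subtraction may outweigh this extra noise.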

BG: Local
• Focus on small regions surrounding the spot mask
• Median of pixel values in this region
• Most software implements such an approach
• By ignoring pixels immediately surrounding the spots, the bg estimate is less sensitive to the performance of the segmentation procedure
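One common way to realise a local background is a ring (annulus) around the spot, leaving a gap between the spot mask and the ring. The sketch below assumes that geometry; the function name, the radii and the toy image are illustrative choices, and actual packages use differently shaped regions.

```python
from statistics import median


def local_background(image, centre, r_inner, r_outer):
    """Median of the pixels whose distance to `centre` is in [r_inner, r_outer]."""
    cy, cx = centre
    ring = [value
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if r_inner ** 2 <= (r - cy) ** 2 + (c - cx) ** 2 <= r_outer ** 2]
    return median(ring)


# Toy 7 x 7 image: background 20, a small spot, and one dust speck
image = [[20] * 7 for _ in range(7)]
for r, c in [(3, 3), (2, 3), (4, 3), (3, 2), (3, 4)]:
    image[r][c] = 200          # spot pixels (radius 1)
image[0][3] = 5000             # dust speck falling inside the ring

bg = local_background(image, centre=(3, 3), r_inner=2, r_outer=3)
print(bg)
```

Choosing `r_inner` larger than the spot radius skips the pixels right next to the spot, and the median keeps the speck from distorting the estimate.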

Background can matter
Figure panels: without BG correction; with BG correction

Summary
• Image analysis is a crucial preprocessing step
  – Association of a "geographic" location (and corresponding annotation) with signal intensities
  – Several non-trivial technical choices (scanner, image analysis software, etc.) can affect the quality of the signal
• Bg correction is sometimes not desirable (low bg arrays)

Quality assessment overview
• Visual inspection of images
• Evaluation of M vs. A (MA) plots
• Comparison of statistical summaries for the chips

Red/Green overlay images
Co-registration and overlay offer a quick visualization, revealing information on color balance, uniformity of hybridization, spot uniformity, background, and artifacts such as dust or scratches

Spatial plots: background from two slides

Practical Problems 1
Comet tails
§ Likely cause: insufficiently rapid immersion of the slides in the succinic anhydride blocking solution

Practical Problems 2 [figure]

Practical Problems 3
High background
• 2 likely causes:
  – Insufficient blocking
  – Precipitation of the labeled probe
Weak signals

Practical Problems 4 [figure]

Practical Problems 5 [figure]

Artifacts in microarrays
• We are interested in finding true biologically meaningful differences between sample types
• Due to other sources of systematic variation, there are usually also artifactual differences
• Sources of artifacts include:
  – print tips - differences in subarrays
  – plate effects
  – differences in rows within subarray
  – batch effects
  – hybridization artifacts

Sample boxplot [figure]

Boxplots of $\log_2 R/G$
(Example data associated with the limmaGUI package.)

Scientific Process
• Biological question (e.g. differentially expressed genes, sample class prediction, etc.)
• Experimental design
• Microarray experiment
• Pre-processing steps
  – Image analysis
  – Quality assessment (may fail)
  – Normalization
• Data analysis
  – Estimation
  – Testing
  – Clustering
  – Discrimination
• Biological verification and interpretation

Pin group (sub-array) effects [figure]

Boxplots, highlighting pin group effects
Clear example of spatial bias

Preprocessing: Normalization
• Why? To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples
• How do we know it is necessary? By examining self-self hybridizations, where no true differential expression is occurring. There are dye biases which vary with spot intensity, location on the array, plate origin, pins, scanning parameters, etc.

What is self-self hybridization?
• In dual channel (2-color) microarrays, such as cDNA arrays, two samples are each labeled with a different fluorescent dye
• In most studies, the samples are from different sources (e.g. cancer vs. normal)
• However, it is also possible to co-hybridize two samples from the same source (but differently labeled)

Dual channel co-hybridizations
• Control sample + treated sample
• Control sample + control sample (self-self)

Self-self hybridizations
Figure panels: false color overlay; boxplots within pin-groups; scatter (MA-)plots

Normalization: global
• Normalization based on a global adjustment:
  $\log_2 R/G \to \log_2 R/G - c = \log_2 R/(kG)$
• Common choices for $k$ or $c = \log_2 k$ are $c$ = median or mean of the log ratios for a particular gene set (e.g. all genes, or control, or ‘housekeeping’ genes)
• Another possibility is total intensity normalization, where $k = \sum_i R_i / \sum_i G_i$
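Both choices of the global constant are easy to compute. The sketch below uses all genes as the reference set and invented intensities; the function names are assumptions for illustration.

```python
import math
from statistics import median


def global_median_normalize(red, green):
    """Subtract c = median log ratio from every log ratio; return (M, c)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    c = median(m)
    return [value - c for value in m], c


def total_intensity_factor(red, green):
    """k = sum(R_i) / sum(G_i), used as log2 R/(kG)."""
    return sum(red) / sum(green)


# Hypothetical spot intensities: red is systematically about twice as bright
red = [200.0, 400.0, 800.0, 1600.0, 400.0]
green = [100.0, 200.0, 400.0, 200.0, 400.0]

normalized, c = global_median_normalize(red, green)
k = total_intensity_factor(red, green)
print(normalized, c, round(k, 3))
```

Here the median-based constant corresponds to $k = 2^c = 2$, while the total intensity factor is pulled upwards by the one strongly red spot, which shows why the median is often preferred.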

Effect of global normalization

Normalization: intensity-dependent
• Here, run a line through the middle of the MA plot, shifting the M value of the pair (A, M) by $c = c(A)$, i.e.
  $\log_2 R/G \to \log_2 R/G - c(A) = \log_2 R/(k(A)\,G)$
• One estimate of $c(A)$ is made using the LOWESS (or loess) function of Cleveland (1979): LOcally WEighted Scatterplot Smoothing
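To make the idea concrete, here is a stripped-down local regression in the spirit of lowess. It is a sketch, not Cleveland's full procedure: it omits the robustness iterations, refits at every point rather than interpolating, and the span of 0.4 and all function names are assumptions. It uses the usual definitions $M = \log_2(R/G)$ and $A = \tfrac{1}{2}\log_2(RG)$.

```python
import math


def ma_values(red, green):
    """M = log2(R/G) and A = average log2 intensity, one pair per spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def local_fit(a, m, a0, span):
    """Tricube-weighted straight-line fit around a0, evaluated at a0."""
    k = max(3, math.ceil(span * len(a)))
    nearest = sorted(range(len(a)), key=lambda i: abs(a[i] - a0))[:k]
    h = max(abs(a[i] - a0) for i in nearest) or 1.0
    sw = swx = swy = swxx = swxy = 0.0
    for i in nearest:
        x = a[i] - a0                        # centre on a0 for stability
        w = (1.0 - (abs(x) / h) ** 3) ** 3   # tricube weight
        sw += w
        swx += w * x
        swy += w * m[i]
        swxx += w * x * x
        swxy += w * x * m[i]
    denom = sw * swxx - swx * swx
    if abs(denom) < 1e-12:                   # too few distinct A values
        return swy / sw
    slope = (sw * swxy - swx * swy) / denom
    return (swy - slope * swx) / sw          # intercept = fitted value at a0


def intensity_dependent_normalize(m, a, span=0.4):
    """Subtract the fitted curve c(A) from each M value."""
    return [mi - local_fit(a, m, ai, span) for mi, ai in zip(m, a)]


# Synthetic dye bias that depends linearly on intensity
a = [6.0 + 0.5 * i for i in range(20)]
m = [0.3 * x - 2.0 for x in a]
normalized = intensity_dependent_normalize(m, a)
print(max(abs(v) for v in normalized))
```

With this purely intensity-dependent bias the normalized values are zero up to rounding error. A global constant could only remove the average offset, not the trend.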

Effect of lowess normalization

Comparison between arrays
• Different arrays often do not show identical distributions of M values
  – Various technical reasons (e.g. labeling efficiency, amount of labeled RNA, scanner settings, etc.)
• Need to normalize the signal between chips
  – Multiple possibilities, one often used: "scale normalization"

Scale normalization: between slides
Idea: make the spread of M values identical across slides by multiplying them by a chip-specific constant
Figure: boxplots of log ratios from 3 replicate self-self hybridizations
• Left panel: before normalization
• Middle panel: after within print-tip group normalization
• Right panel: after a further between-slide scale normalization

Taking scale into account
Assume: all slides have the same spread in M
• True log ratio is $m_{ij}$, where $i$ represents different slides and $j$ represents different spots
• Observed is $M_{ij}$, where $M_{ij} = a_i m_{ij}$
• A robust estimate of $a_i$ (up to a constant common to all slides) is the median absolute deviation of the observed values, $\mathrm{MAD}_i = \mathrm{median}_j \{ | M_{ij} - \mathrm{median}_j(M_{ij}) | \}$
• Could instead make the same assumption for print tip groups (rather than slides)
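A minimal sketch of between-slide scale normalization based on the MAD. Dividing each MAD by the geometric mean of all MADs, so that the scale factors multiply to one, is a convention assumed here and is not stated in the slides; the function names and data are likewise illustrative.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def scale_normalize(slides):
    """slides: one list of M values per slide. Returns (rescaled slides, factors)."""
    mads = [mad(m) for m in slides]
    # Geometric mean of the MADs (assumed convention): factors multiply to 1
    geo_mean = math.exp(sum(math.log(x) for x in mads) / len(mads))
    factors = [x / geo_mean for x in mads]
    rescaled = [[v / a for v in m] for m, a in zip(slides, factors)]
    return rescaled, factors


# Two hypothetical slides; the second has twice the spread of the first
slides = [[-1.0, 0.0, 1.0, 2.0, -2.0],
          [-2.0, 0.0, 2.0, 4.0, -4.0]]
rescaled, factors = scale_normalize(slides)
print(factors, mad(rescaled[0]), mad(rescaled[1]))
```

The same function could be applied to print tip groups within a slide instead of whole slides.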

NCI 60 experiments

Same normalization on another data set

Normalization: Summary
• Reduces systematic (not random) effects
• Makes it possible to compare several arrays
• Use log ratios (MA plots)
• Lowess normalization (dye bias)
• Pin-group location normalization
• Pin-group scale normalization
• Between-slide scale normalization
• Control spots
• Normalization introduces more variability
• Outliers (bad spots) handled with replication

cDNA gene expression data
Data on p genes for n samples:

Gene  sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90  ...
2       -0.10     0.49     0.24     0.06     0.46  ...
3        0.15     0.74     0.04     0.10     0.20  ...
4       -0.45    -1.03    -0.79    -0.56    -0.32  ...
5       -0.06     1.06     1.35     1.09    -1.09  ...

Software for Microarray Analysis
• Very large number of commercial and free software packages (GeneSpring, PathwayAssist, …)
• There are several R packages for microarray analysis available as part of the open source Bioconductor project http://www.bioconductor.org/
• Bioconductor software is often created by the author of the methodology

Replicated experiments
• Have n replicates
• For each gene, have n values of $M = \log_2$ fold change, one from each array
• Summarize $M_1, \dots, M_n$ for each gene by
  – $\bar{M} = \mathrm{average}(M_1, \dots, M_n)$
  – $s = \mathrm{SD}(M_1, \dots, M_n)$
• Rank genes in order of strength of evidence in favor of differential expression (DE)
• How might we do this?

Which genes are DE?
• Difficult to judge significance
  – massive multiple testing problem
  – genes are not independent
  – don’t know null distribution of M
• Strategy
  – aim to rank genes
  – assume most genes are not DE (depending on type of experiment and array)
  – find genes separated from the majority

Ranking criteria
• Genes $i = 1, \dots, p$
• $\bar{M}_i$ = average $\log_2$ fold change for gene $i$
  – Problem: genes with large variability are likely to be selected, even if not DE
• Fix that by taking variability into account: use $t_i = \bar{M}_i / (s_i / \sqrt{n})$
  – Problem: genes with extremely small variances make very large $t$
  – When the number of replicates is small, the smallest $s_i$ are likely to be underestimates
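Both ranking criteria, and the problem each one has, show up on a toy data set. The gene names and log ratios below are invented for illustration, and the function name is an assumption.

```python
import math
from statistics import mean, stdev


def rank_by_t(log_ratios):
    """log_ratios: {gene: [M_1, ..., M_n]}.

    Returns (gene, mean, s, t) tuples sorted by decreasing |t|.
    """
    rows = []
    for gene, values in log_ratios.items():
        n = len(values)
        m_bar = mean(values)
        s = stdev(values)
        # A zero SD would make t infinite; keep the sign of the mean
        t = m_bar / (s / math.sqrt(n)) if s > 0 else math.copysign(math.inf, m_bar)
        rows.append((gene, m_bar, s, t))
    return sorted(rows, key=lambda row: abs(row[3]), reverse=True)


# Hypothetical data, n = 4 replicates per gene
log_ratios = {
    "geneA": [2.0, 2.1, 1.9, 2.0],      # large, consistent change
    "geneB": [3.0, -2.5, 2.8, -1.3],    # highly variable
    "geneC": [0.05, 0.04, 0.05, 0.06],  # tiny change, tiny variance
}

ranked = rank_by_t(log_ratios)
by_t = [row[0] for row in ranked]
by_mean = [row[0] for row in sorted(ranked, key=lambda row: abs(row[1]), reverse=True)]
print(by_t, by_mean)
```

Ranking by the average puts the noisy geneB second, while ranking by $t$ promotes geneC, whose fold change is negligible but whose variance happens to be very small.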

Summary
• Image analysis is important to extract information from the array
  – Background may or may not be taken into account
• Normalization procedures are always needed
  – To remove systematic (technical) effects
  – To allow comparisons between chips
• Identification of differentially expressed genes is difficult
  – No absolute estimation of significance is possible
  – Ranking of genes by significance is possible

End of part II

Part III Higher-level analysis

Finding biological information
Once the matrix of gene expression vs. samples is available, statistical tools can be used to:
• Find similarity (or difference) of expression pattern in differentially expressed genes
• Find differentially expressed functional groups of genes (pathway analysis, gene ontology)
• Find classes in the set of samples (unsupervised analysis)
• Use differentially expressed genes as a means to classify samples in known categories (supervised analysis)
• Find genes significantly related to survival in a pool of patients

Unsupervised analysis: Cluster analysis
• data matrix (n,p)
• distance matrix (n,n), similarity matrix (n,n)
• cluster formation:
  – mutually exclusive clusters
  – hierarchical clusters
• comparison of clusters, means and variances
Figure: dendrogram
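The path from data matrix to distance matrix to hierarchical clusters can be sketched as follows. Euclidean distance and single linkage are choices made for this example (the slides do not prescribe either), and the function names and toy profiles are assumptions.

```python
import math


def distance_matrix(rows):
    """(n, n) Euclidean distances between the n rows of an (n, p) matrix."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def single_linkage(dist):
    """Agglomerative clustering; returns the merges as (cluster, cluster, distance)."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges


# Toy (n, p) matrix: 5 objects described by 2 values each
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [20.0, 20.0]]
merges = single_linkage(distance_matrix(profiles))
for left, right, d in merges:
    print(left, right, round(d, 2))
```

The sequence of merges and their distances is exactly the information a dendrogram displays; cutting it at a chosen height yields mutually exclusive clusters.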
