UHTS: Raw data
Jacques Rougemont, Bioinformatics and Biostatistics Core Facility, EPFL
Objectives
Concentrate on Solexa/Illumina technology.
Describe the steps from imaging colonies to mapping/assembling sequence tags.
Understand the content and structure of the output files.
Study possible sources of systematic bias and find remedies to some of them.
Terminology
colony: set of identical sequences obtained on the flow-cell by amplification of a template.
(sequencing) cycle: attempt to incorporate the next nucleotide of every complementary strand.
(color) channel: 1 of 4 imaged colors, corresponding to the fluorophore associated with one base (e.g. A).
read: sequence output representing the colony.
base calling: algorithm constructing the reads from the measurements.
Images
Each sequencing cycle produces 4 images for each of the 100 tiles.
DNA colonies must be located, quantified, and tracked across image stacks (~100,000 colonies/image).
Each colony, at each cycle, generates a quadruplet of fluorescence intensities.
Naively: the highest of the 4 values determines the base.
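The naive rule can be sketched in a few lines. This is only an illustration: the channel order A, C, G, T, the function name `naive_base_call`, and the intensity values are assumptions made for the example.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order

def naive_base_call(intensities):
    """Call one base per cycle as the channel with the highest intensity.

    intensities: array-like of shape (n_cycles, 4) for a single colony.
    """
    intensities = np.asarray(intensities, dtype=float)
    return "".join(BASES[i] for i in intensities.argmax(axis=1))

# Made-up quadruplets for two cycles of one colony.
colony = [[10.0, 40.0, 2000.0, 1200.0],
          [1500.0, 900.0, -5.0, 15.0]]
read = naive_base_call(colony)
```

Note that the second-highest channel is far from zero in both cycles, which is why the naive rule alone gives no indication of how reliable each call is.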
Solexa/Illumina file structure
Quality scores
Summary of data Intensities Sequence Quality
There are ~10M such plots...
Global look
s_4_0001_int   #CH4:OBJ130954   #END CYCLE 1
Colonies (rows), four channel intensities (columns A C G T) at each cycle:

             cycle 1                            cycle 2
      A       C       G       T          A       C       G       T
    13.5    43.4  2021.8  1180.6       11.1    43.2  1875.0  1049.2
   -27.6   -51.1  2531.9  1699.1      -48.5   -56.7    63.0  1349.1
  -143.9   -43.0  2575.9  1133.8     -257.2    -5.8    59.8  1176.6
    -9.3  -262.8  2657.1  1639.4     -129.3   646.7   920.1   557.2
  -107.8   -27.3  1968.3  1320.1      964.1   540.7  1436.3  1015.0
   -20.5   -45.4  2312.2   862.1     1497.0   918.4    -6.6    13.5
   105.8   -38.9  1938.7   966.6       14.1    34.4  1751.6   903.7
    52.2   201.4  1934.9  1198.6     1337.1   772.6   199.4   893.7
    77.1    24.3  2467.7  1102.2      153.9   223.8    23.4   937.0
   637.6   198.8  2501.5  1500.8      313.7   579.3    41.0   663.9
   -15.2    18.6  2401.9  1053.9      688.1   347.3   655.9  1194.2
   ...
Each colony is a point in 4D intensity space at each cycle.
The naive interpretation was optimistic.
Bias 1: optical effects
False-color image built from the measured intensities as a function of x-y coordinates on the tile.
There are obvious boundary effects, stronger in some color channels.
We can correct this effect by fitting a position-dependent baseline.
There are other position-dependent issues, such as spot overlaps.
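One way to picture the baseline correction is the sketch below. The slide does not say which model is fitted, so the quadratic surface in (x, y), the ordinary least-squares fit, and the function names are all assumptions made for illustration.

```python
import numpy as np

def design_matrix(x, y):
    """Quadratic surface terms 1, x, y, x^2, xy, y^2 (an assumed model)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])

def fit_baseline(x, y, intensity):
    """Least-squares fit of one channel's intensity against tile position."""
    coeffs, *_ = np.linalg.lstsq(
        design_matrix(x, y), np.asarray(intensity, dtype=float), rcond=None
    )
    return coeffs

def subtract_baseline(x, y, intensity, coeffs):
    """Remove the fitted position-dependent baseline from the intensities."""
    return np.asarray(intensity, dtype=float) - design_matrix(x, y) @ coeffs

# Synthetic check: a pure position-dependent background is removed entirely.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = rng.uniform(0.0, 1.0, 200)
background = 5.0 + 2.0 * x - 3.0 * y + x * y
coeffs = fit_baseline(x, y, background)
residual = subtract_baseline(x, y, background, coeffs)
```

On real data the fit would presumably be restricted to colonies that are not emitting in the channel being corrected, so that true signal is not absorbed into the baseline.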
Bias 2: sticky fluorophores
T fluorophores stick to the surface of the flow cell.
Bias 3: color cross-talk and decay
Fluorophore spectra overlap.
Some intensity pairs are correlated.
We can use a basis transform in 4D space to reduce correlations.
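A minimal sketch of such a basis transform, assuming a linear mixing model in which the observed quadruplet equals a 4x4 cross-talk matrix applied to the true quadruplet. The matrix values below are hypothetical and would have to be estimated from data; signal decay is not handled here.

```python
import numpy as np

# Hypothetical cross-talk matrix (channel order A, C, G, T): column j is the
# response of the four channels to a pure signal from base j.
CROSSTALK = np.array([
    [1.0, 0.3, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.7, 1.0],
])

def remove_crosstalk(observed, crosstalk=CROSSTALK):
    """Undo the linear mixing: solve crosstalk @ true = observed per colony.

    observed: array-like of shape (n_colonies, 4).
    """
    observed = np.asarray(observed, dtype=float)
    return np.linalg.solve(crosstalk, observed.T).T

# A pure G signal leaks into the T channel and is recovered by the transform.
true = np.array([[0.0, 0.0, 2000.0, 0.0]])
observed = true @ CROSSTALK.T
recovered = remove_crosstalk(observed)
```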
Bias 4: dephasing
Suppose some strands in a colony failed to incorporate their nucleotides at a previous cycle.
They may successfully elongate at subsequent cycles.
These strands are therefore lagging behind in their synthesis and emit signal in a different channel.
Binomial law
There is a probability q < 1 of incorporating a nucleotide n at cycle C.
If q is independent of C and n, there is a simple way of correcting for dephasing: sum the contributions from previous n in the sequence, weighted by the probability of that many mis-incorporations.
I(n,C) are the measured intensities and J(n,k) are the dephasing-free intensities; 1(s_k = n) indicates that the k-th base of the sequence is n.
\[ \mathrm{Prob}(n, C) = \sum_{k=1}^{C} \binom{C}{k} q^{k} (1-q)^{C-k} \, \mathbf{1}(s_k = n) \]
\[ I(n, C) = \sum_{k=1}^{C} \binom{C}{k} q^{k} (1-q)^{C-k} J(n, k) \]
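The second relation is linear in J, so one possible way to apply it is to build the matrix of binomial weights and solve for J. The sketch below assumes q is known and constant; the function names are hypothetical, and this is not necessarily how the correction is implemented in practice.

```python
from math import comb

import numpy as np

def dephasing_matrix(n_cycles, q):
    """B[C-1, k-1] = binom(C, k) * q**k * (1 - q)**(C - k) for 1 <= k <= C."""
    B = np.zeros((n_cycles, n_cycles))
    for C in range(1, n_cycles + 1):
        for k in range(1, C + 1):
            B[C - 1, k - 1] = comb(C, k) * q**k * (1 - q) ** (C - k)
    return B

def correct_dephasing(I, q):
    """Recover dephasing-free intensities J from measured I by solving B J = I.

    I: array-like of shape (n_cycles, 4), one row per cycle.
    """
    I = np.asarray(I, dtype=float)
    return np.linalg.solve(dephasing_matrix(I.shape[0], q), I)

# Round trip on synthetic data: dephase known intensities, then correct them.
rng = np.random.default_rng(1)
J_true = rng.uniform(0.0, 2000.0, size=(8, 4))
I_measured = dephasing_matrix(8, 0.9) @ J_true
J_recovered = correct_dephasing(I_measured, 0.9)
```

The matrix is lower triangular with diagonal entries q^C, so it is invertible for any q > 0, although the diagonal shrinks as the number of cycles grows.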
Base probability
We would like to associate a probability with each base at each read position.
Solution 1: \( \mathrm{Prob}(n, C) = I(n, C) / \sum_{k} I(k, C) \)
Better solution: fit Gaussian distributions to the four data clouds.
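Solution 1 can be sketched as a simple normalization. Since corrected intensities can be negative, this example clips them at zero before normalizing and falls back to a uniform distribution when all four channels are zero; both choices are assumptions that the slide does not specify.

```python
import numpy as np

def base_probabilities(intensities):
    """Normalize each intensity quadruplet into a probability distribution.

    intensities: array-like whose last axis has length 4 (A, C, G, T).
    """
    I = np.clip(np.asarray(intensities, dtype=float), 0.0, None)  # assumption
    total = I.sum(axis=-1, keepdims=True)
    safe_total = np.where(total > 0, total, 1.0)  # avoid division by zero
    return np.where(total > 0, I / safe_total, 0.25)

probs = base_probabilities([[-20.0, 0.0, 300.0, 100.0],
                            [0.0, 0.0, 0.0, 0.0]])
```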
Entropy
The entropy H(p) measures how flat (or peaked) a probability distribution is.
\[ H(p) = -\sum_{k=1}^{10} p(k) \log_2 p(k) \]
\[ 0 \;(\text{peaked}) \le H \le \log_2(10) \;(\text{flat}) \]
H = log_2(ambiguity), where the ambiguity is the number of states compatible with the observation.
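As a quick numerical check of the two extremes, the entropy in bits can be computed as below. The function name is hypothetical, and zero-probability states are skipped under the usual convention that 0 log 0 = 0.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete probability distribution."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]  # convention: 0 * log2(0) contributes nothing
    return float(-np.sum(nonzero * np.log2(nonzero)))

peaked = entropy_bits([1.0] + [0.0] * 9)   # one certain state out of 10
flat = entropy_bits([0.1] * 10)            # 10 equally likely states
ambiguity = 2.0 ** flat                    # number of compatible states
```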
IUPAC codes
Fluorescence intensities (after bias correction and possibly normalization) provide a probability distribution over the four nucleotides.
We use entropy to convert this into a measure of ambiguity of the call using IUPAC's convention, e.g. M = A or C, H = A or C or T.
Entropy scale, H from 0 to 2:
  H from 0 to log_2(1.5):          A, C, G, T
  H from log_2(1.5) to log_2(2.5): M, R, W, S, Y, K
  H from log_2(2.5) to log_2(3.5): B, D, H, V
  H from log_2(3.5) to 2:          N
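The entropy scale can be turned into a call roughly as follows. The thresholds come from the slide; choosing the m most probable bases as the ambiguity set, and treating each threshold as inclusive on the upper band, are assumptions made for this sketch.

```python
import numpy as np

BASES = "ACGT"
# Standard IUPAC nucleotide codes, keyed by the sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V",
    "ACGT": "N",
}
THRESHOLDS = np.log2([1.5, 2.5, 3.5])

def iupac_call(p):
    """Map a probability distribution over A, C, G, T to one IUPAC code."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    h = -np.sum(nonzero * np.log2(nonzero))
    # Ambiguity level 1..4 from the entropy band the distribution falls in.
    m = 1 + int(np.sum(h >= THRESHOLDS))
    # Assumption: the ambiguity set is the m most probable bases.
    top = np.argsort(-p, kind="stable")[:m]
    return IUPAC["".join(sorted(BASES[i] for i in top))]

calls = [iupac_call(p) for p in ([0.97, 0.01, 0.01, 0.01],
                                 [0.5, 0.5, 0.0, 0.0],
                                 [0.34, 0.33, 0.0, 0.33],
                                 [0.25, 0.25, 0.25, 0.25])]
```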
Sequence mapping
Summary
Sequencing produces images that are then quantified into tab-delimited text files with four intensity values for each colony and each sequencing cycle.
These values can be represented in 4D space to show color cross-talk and decay.
They can be represented as tile pseudo-images to show optical effects.
Two major sources of bias are dephasing and changing baselines between colors and between cycles.
Simple signal transformations can decrease many of these biases.
Per-base quality scores are useful information at the mapping level.
References
Image analysis: ImageJ, http://rsb.info.nih.gov/ij/
Genome indexing: Iseli et al. Indexing strategies for rapid searches of short words in genome sequences. PLoS ONE (2007) vol. 2 (6) pp. e579
tagger: http://www.isrec.isb-sib.ch/tagger/
bowtie: http://bowtie-bio.sourceforge.net/
Base calling: Rougemont et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics (2008) vol. 9 (1) pp. 431
http://bbcf.epfl.ch/Software