Data Mining in Bioinformatics Day 6: Classification in Next Generation Sequencing Data Analysis Dominik Grimm February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls Universität Tübingen Dominik Grimm: Data Mining in Bioinformatics, Page 1
Overview
- Genome sequencing: A brief review
  - Classical sequencing methods
  - Paired-end sequencing
- Next Generation Sequencing (NGS): A brief introduction
  - Next Generation Sequencing approaches
  - Illumina Genome Analyzer II
- Genome reconstruction
- Detecting structural variations
- Accurate indel prediction using paired-end short reads (Grimm et al., 2013)
  - SVM approach to predict indels
Genome sequencing: A brief review
Brief historical review
- The first DNA sequences were obtained in the early 1970s (Min Jou et al., 1971, 1972)
- Laborious techniques were required to retrieve even small DNA pieces, e.g. in 1973 the lac operator (24 base pairs (bp)) was sequenced by Walter Gilbert and Allan Maxam (Gilbert and Maxam, 1973)
- In 1977 two rapid sequencing methods were developed almost simultaneously:
  - Maxam-Gilbert sequencing at Harvard University, USA
  - Sanger sequencing by Frederick Sanger at the University of Cambridge, UK
- Nobel Prize for Frederick Sanger, Walter Gilbert and Paul Berg in 1980
Genome sequencing: A brief review
Maxam-Gilbert sequencing (Maxam and Gilbert, 1977)
[Figure, three panels: (1) DNA sequences (ATTCGA) marked at the 5' end; (2) DNA sequences get modified at A, T, G or C and get split off from the DNA backbone; (3) gel electrophoresis with lanes A, T, G, C to reconstruct the sequence]
Rarely used, because:
- Sequences are marked at the 5' or 3' end with radioactive phosphorus (32P) or with non-radioactive biotin or fluorescein
- It is hard to automate
Genome sequencing: A brief review
Sanger sequencing (Sanger et al., 1977)
[Figure: a template strand (3' A T T C G A 5') with primer is copied in four separate polymerase reactions. Each reaction contains one chain terminator (ddATP, ddTTP, ddGTP or ddCTP) plus a large amount of dATP, dTTP, dGTP and dCTP. The resulting fragments of different lengths are separated in gel lanes A, T, G, C.]
Widely used, because:
- Fewer toxic and radioactive substances are needed
- It is more efficient due to automation
- The method works for short sequence strands from 100 bp up to 1.5 kbp
Genome sequencing: A brief review
Shotgun sequencing (Sanger et al., 1980, 1982)
[Figure, three steps: (1) several DNA copies; (2) mechanical shear forces break the DNA into pieces at random positions; (3) find overlapping pieces and reconstruct the original sequence]
- For the first time it was possible to sequence long sequences, even whole genomes
- Efficient bioinformatic techniques are essential to reconstruct the original sequence
- The Institute for Genomic Research (TIGR), led by Craig Venter, proposed a concept of highly parallel sequencing
Genome sequencing: A brief review
Shotgun sequencing
- Error-prone due to sequencing errors, repetitive nucleotide patterns or similar reads from distinct genomic positions
  → DNA molecules are copied and sequenced numerous times
- The fold-coverage c measures the average number of reads covering a given nucleotide:

\[
c(R, g) = \frac{1}{|g|} \sum_{i=1}^{|R|} |R_i| \qquad (1)
\]

  where g is a DNA sequence of length n and R is a set of DNA reads with an average length m.
- Example: A sequence of length 4000 bp is reconstructed using 20 reads, each with an average length of 600 bp
  → c = (20 · 600) / 4000 = 3x coverage (3-fold coverage)
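The fold-coverage formula (1) can be checked with a few lines of Python. This is only a minimal sketch: the function name `fold_coverage` and its arguments are chosen here for illustration and do not come from the lecture.

```python
def fold_coverage(read_lengths, genome_length):
    """Fold-coverage c(R, g): total number of read bases divided by |g|.

    read_lengths  -- iterable with the length |R_i| of every read (assumed input format)
    genome_length -- length |g| of the sequenced DNA in bp
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(read_lengths) / genome_length


# Slide example: 20 reads of 600 bp each on a 4000 bp sequence.
example_coverage = fold_coverage([600] * 20, 4000)
print(example_coverage)
```

Because only the sum of the read lengths enters the formula, reads of unequal length can be passed in directly; the average length m is not needed separately.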
Genome sequencing: A brief review
Paired-end sequencing (Edwards and Caskey, 1991)
- In paired-end (or mate pair) sequencing both ends of the same fragment are sequenced
- The distance between the two reads of a paired-end read is known
- Reads that are reassembled approximately the known distance apart from each other are called happy, otherwise they are called unhappy
[Figure: Read 1 and Read 2 placed on the DNA sequence; Read 2 is happy when it lies at the expected distance from Read 1 and unhappy otherwise]
- Paired-end reads and their distance information help to reconstruct the original sequence more reliably
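The happy/unhappy distinction can be illustrated with a small check. The slide only says "approximately the known distance", so the explicit `tolerance` parameter, the use of start coordinates, and the function name `is_happy` are assumptions made for this sketch.

```python
def is_happy(pos_read1, pos_read2, expected_distance, tolerance):
    """Return True if a read pair lies roughly the expected distance apart.

    pos_read1, pos_read2 -- placement coordinates of the two reads (assumed: start positions)
    expected_distance    -- known distance between the two reads of a pair
    tolerance            -- hypothetical allowed deviation; not specified on the slide
    """
    observed_distance = abs(pos_read2 - pos_read1)
    return abs(observed_distance - expected_distance) <= tolerance


# A pair placed 300 bp apart with an expected distance of 300 bp is happy,
# a pair placed 800 bp apart is unhappy.
print(is_happy(100, 400, expected_distance=300, tolerance=50))
print(is_happy(100, 900, expected_distance=300, tolerance=50))
```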
Next Generation Sequencing
Overview
- Over the last three decades, Sanger sequencing was the most widely used and most productive sequencing method
  → But classical Sanger sequencing is still expensive: sequencing whole genomes requires many scientists, huge sequencing centers and a lot of time
  → There is a high demand for low-cost sequencing techniques
Next Generation Sequencing
Cyclic-array sequencing technologies (Shendure and Ji, 2008)
- 454 pyrosequencing, used in the 454 Genome Sequencer, Roche Applied Science
- SOLiD platform, developed by Applied Biosystems
- Polonator, developed by George M. Church's group at Harvard Medical School
- HeliScope Single Molecule Sequencer by Helicos
- Solexa technology, used in the Illumina Genome Analyzer
→ "Cyclic-array based sequencing can be summarized as the sequencing of a dense array of DNA features by iterative cycles of enzymatic manipulation and imaging-based data collection" (Shendure and Ji, 2008)
Next Generation Sequencing
The Illumina Genome Analyzer II
1. A DNA sequence library has to be prepared
   (a) Create several copies of the DNA strand
   (b) Fragment the strand using nebulization or sonication
   (c) Amplify ends of fragments
   (d) Phosphorylate the 5' end and add an adenosine overhang to the 3' end
   (e) Ligate Illumina adapters
[Figure panels: 1. DNA end-repair; 2. Phosphorylation; 3. Adenosine addition; 4. Adapter ligation]
Next Generation Sequencing
The Illumina Genome Analyzer II
2. Flow cell preparation
[Figure panels: 5. 3' extension; 6. Denaturation; 7. Bridge formation; 8. 3' amplification; 9. Bridge denaturation; 10. Bridge formation; repeated 36 times]
Next Generation Sequencing
The Illumina Genome Analyzer II
3. Sequencing
[Figure panels: 11. Sequencing primers are hybridized; 12. Single base extension and laser-based imaging; 13. Fluorescent base cleavage and terminator gets unblocked; 14. Repeated more than 50 times to determine the sequence]
→ Now it is possible to generate gigabases of high-quality reads within one day using only one machine and less than 6 hours of hands-on time
Next Generation Sequencing
Sequencing costs (Wetterstrand, 2013)
Genome reconstruction
Reference guided mapping
- Millions of short reads (~30 up to 200 bp) are generated
  → It is a challenge to reconstruct the original sequence using assembly methods
- New approach: Align short reads against a known genome of the same species (reference genome)
  → also referred to as mapping (tools: SHORE (Ossowski et al., 2008), SSAHA2 (Ning et al., 2001))
[Figure: paired-end short reads aligned against the reference genome]
Genome reconstruction
Reference guided mapping with SHORE (Ossowski et al., 2008)
- SHORE uses the best-match strategy to map reads
  → Best matches are mapped first, then the number of allowed mismatches and gaps is increased iteratively
- Reads with 0 mismatches and gaps are mapped first, followed by alignments with Levenshtein Edit Distance (LED) = 1 (Levenshtein, 1965) and Hamming Distance (HD) = 1 (Hamming, 1950), then LED = 2 and HD = 2, LED = 3 and HD = 3, and finally LED = 4 and HD = 4
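The iterative idea behind a best-match strategy can be sketched as follows. This is not SHORE's implementation: it is a naive scan over every reference position, it considers mismatches only (no gaps), and the names `count_mismatches` and `best_match_positions` are hypothetical.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_match_positions(read, reference, max_mismatches=4):
    """Return (k, positions) for the smallest mismatch count k that yields a hit.

    The mismatch budget k is raised step by step (0, 1, ..., max_mismatches);
    the search stops at the first k for which the read can be placed.
    Returns (None, []) if the read cannot be placed within the budget.
    """
    m = len(read)
    for k in range(max_mismatches + 1):
        hits = [i for i in range(len(reference) - m + 1)
                if count_mismatches(read, reference[i:i + m]) <= k]
        if hits:
            return k, hits
    return None, []


reference = "ACGTACGTTAGC"
print(best_match_positions("ACGT", reference))  # exact matches are found at k = 0
print(best_match_positions("TAGG", reference))  # needs one mismatch
```

Real mappers use index structures instead of scanning every position; the loop over k is the only part meant to mirror the strategy described above.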
Genome reconstruction
Hamming Distance (HD) (Hamming, 1950)
The HD d_HD measures the number of differing positions in two strings s_1 and s_2 of equal length m:

\[
d_{HD}(s_1, s_2) = \sum_{\substack{i = 1, \dots, m \\ s_{1,i} \neq s_{2,i}}} 1 \qquad (2)
\]

Example: s_1 = "ATCCATGC" and s_2 = "ATGGATAC"
  s_1: "ATCCATGC"
  s_2: "ATGGATAC"
  → d_HD = 3 (positions 3, 4 and 7 differ)
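Equation (2) translates almost directly into code. A minimal sketch, with the function name `hamming_distance` chosen here for illustration:

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(1 for a, b in zip(s1, s2) if a != b)


# Slide example
print(hamming_distance("ATCCATGC", "ATGGATAC"))
```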
Genome reconstruction
Levenshtein Edit Distance (LED) (Levenshtein, 1965)
The LED d_LED is the minimum number of edit operations needed to transform a string s_1 of length n into a string s_2 of length m. Allowed edit operations are deletion, insertion or substitution of a single character. It can be computed in O(nm) using dynamic programming.

\[
\omega(a_i, b_j) =
\begin{cases}
0, & \text{if } a_i = b_j \\
1, & \text{if } a_i \neq b_j
\end{cases}
\]

\[
d_{LED}(i, j) = \min
\begin{cases}
d_{LED}(i-1, j-1) + \omega(s_{1,i}, s_{2,j}) & \text{Substitution} \\
d_{LED}(i, j-1) + 1 & \text{Insertion} \\
d_{LED}(i-1, j) + 1 & \text{Deletion}
\end{cases}
\qquad (3)
\]

Example: s_1 = "sole" and s_2 = "solid"
  s_1: "sole"
  s_2: "solid"
  → d_LED = 2, two operations: substitution (e → i), insertion (d at the end)
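The recurrence for d_LED can be filled in as a table in O(nm) time. The sketch below assumes the standard base cases d_LED(i, 0) = i and d_LED(0, j) = j, which the slide does not state explicitly; the function name `levenshtein_distance` is chosen for illustration.

```python
def levenshtein_distance(s1, s2):
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    n, m = len(s1), len(s2)
    # d[i][j] = edit distance between the first i characters of s1
    # and the first j characters of s2.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # assumed base case: delete all i characters
    for j in range(m + 1):
        d[0][j] = j  # assumed base case: insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            omega = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + omega,  # substitution (or match)
                d[i][j - 1] + 1,          # insertion
                d[i - 1][j] + 1,          # deletion
            )
    return d[n][m]


# Slide example
print(levenshtein_distance("sole", "solid"))
```

The full table uses O(nm) memory; keeping only the previous row would reduce this to O(m) without changing the result.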
Genome reconstruction
Reconstruction is not trivial
[Figure: reads aligned against the reference genome, with two kinds of left-over reads: pairs where both reads could not be mapped, and pairs where a single read could not be mapped]