Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, 2019 1
Next-Gen Sequencing Read Mapping Align reads to genome/transcriptome Input: Many reads, length l strings R 1 ,...,R m and approximate reference genome string G Output: positions in G where reads match Complications: ◦ Millions of reads to analyze ◦ Repetitive regions: non-unique mapping ◦ Mutations/errors: imperfect match, incorrect mapping ◦ Paired end reads ◦ Numbers: ~100M reads, |G| ~ 3x10 9 , read length: 100-200bp 2
Solution – Suffix Tree Build suffix tree for G For each read R i ◦ Find matches of R i to G by tree traversal Time complexity? 3
Solution – Suffix Tree Build suffix tree for G For each read R i ◦ Find matches of R i to G by tree traversal Time complexity: O( l m+|G|) for exact matching Can store human genome text in 750M bytes (6G bits) but, need ~64G bytes for the tree Large constants, hard to implement 4
Solution – Hashed Index Preprocessing: – Create hash-table H – For each position p=l,...,|G| Hash (key=G[p,...,p + l -1], value=p) in H For each R i report H(S i ) Time complexity? 5
Solution – Hashed Index Preprocessing: – Create hash-table H – For each position p=l,...,|G| Hash (key=G[p,...,p + l -1], value=p) in H For each R i report H(S i ) Time complexity: O( l( m+|G|)) Problems: ◦ Only exact matching, memory O(|G|) ◦ Implemented by PASS Campagna D, Albiero A, Bilardi A, Caniato E, Forcato C, Manavski S, Vitulo N, Valle G. PASS: a program to align short sequences. Bioinformatics. 2009 Apr 1;25(7):967-8 6
� Improvements � � Pack strings into bit vectors � TGAGTGTGAC 11100010111011100001 Do not hash every position 7
Solution – Hashed Reads Hash the reads, and scan using the genome. Implemented by the MAQ tool Memory usage: depends on number of reads Mapping short DNA sequencing reads and calling variants using mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. 2008 Nov;18(11):1851-8. 8
Handling mismatches les. If a read has one mismatch with respect to its correct position, then hash T 1 partitioning that read into two T 2 makes sure one part is still exactly � matched. � For two mismatches: create six T 3 T 4 templates, some of which are T 5 noncontiguous, so that each and T 6 template covers half the read and has a partner template that complements it to form the full read. Related to the idea of spaced seeds 9
Solution - Burrows-Wheeler Transform (BWT) BWT $acaacg aacg$ac acaacg$ gc$aaac acaacg$ acg$aca caacg$a cg$acaa g$acaac Burrows-Wheeler Matrix (BWM)
Solution - Burrows-Wheeler Transform (BWT) BWT $acaacg aacg$ac acaacg$ gc$aaac acaacg$ acg$aca caacg$a cg$acaa g$acaac BWT(G) is often highly compressible
Burrows-Wheeler Matrix $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac
Burrows-Wheeler Matrix $acaacg aacg$ac 3 1 acaacg$ 4 acg$aca See the suffix array? 2 caacg$a 5 cg$acaa g$acaac 6
Bowtie/BWA Bowtie/BWA: the first methods to use BWT for read mapping ◦ Optimized to run against mammalian sized genomes ◦ Genome is indexed using the Burrows-Wheeler algorithm to keep the memory footprint small The human genome can be stored in ~2Gb BWA: Bowtie: http://bio-bwa.sourceforge.net http://bowtie-bio.sourceforge.net/ (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Langmead et al, Genome Biology 2009, 10:R25 (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform, Li Heng and Richard Durbin, (2009) 25:1754–1760
Lots of Read Mappers! DNA: blue, RNA: in red, miRNA: green Bisulphite: purple. See: http://wwwdev.ebi.ac.uk/fg/hts_mappers/ and the corresponding paper Tools for mapping high-throughput sequencing data Nuno A. Fonseca, Johan Rung, Alvis Brazma and John C. Marioni 15
Phred Quality Scores Q s = -10 log 10 Pr(base is incorrect) 16
Using Phred Scores Mappers typically use Phred scores in one of two ways: ◦ Incorporate them into the score of a match ◦ Trim the 3’ end of a read when its quality is deemed too low (takes advantage of the fact that for the Illumina platform, sequencing accuracy decreases from 5’ to 3’) 17
Read mapping software General mappers (BLAT, SSAHA, Exonerate) were designed for aligning any sequences and the source of data is irrelevant. Some mappers are designed for data coming from a specific platform (e.g. Illumina, SOLiD). Some mappers exploit specific biases associated with a sequencing platform. Some mappers use Phred quality scores for scoring a match. 18
Ungapped vs. Spliced Alignment Unspliced reads (Black) ◦ Map within exactly one exon Spliced reads (Blue) ◦ Overlap two or more exons ◦ A more challenging problem than ungapped read mapping
Splice Sites Acceptor splice site (3’) Donor splice site (5’) Canonical Splice Site Consensus: GT---AG GT Occurs about 1 billion times in the 3 Billion nucleotide long human DNA Donor consensus ◦ A 64 G 73 G 100 T 100 A 62 A 68 G 84 T 63 Acceptor consensus ◦ C 65 A 100 G 100 20
PALMapper Strategy: ◦ Identify unspliced alignments and seed regions for spliced reads using a fast indexing tool. ◦ Infer spliced alignments from seed regions. De Bona, F. et al., Optimal Spliced Alignments of Short Sequence Reads, ECCB08/Bioinformatics, 24 (16):i174, 2008. Images from http://www.raetschlab.org/lectures/qpalma-ismb08-short-SIG.pdf
QPALMA:Optimal Spliced Alignments of Reads Based on PALMA ◦ Uses Computational Splice Site Prediction Q = Quality ◦ Uses a read’s quality information Information Used in QPALMA ◦ Read quality ◦ Splice site models ◦ Intron length m odels
Splice site prediction Splice site classifiers (support vector machine) ◦ Predictions of d onor sites ◦ Predictions of acceptor sites Training ◦ Known Acceptor & Donor Sites Obtained by aligning EST and cDNA sequences using BLAT (or any program for spliced alignments for long sequences) High quality alignments can be used as positive examples Other decoy sites used as negative examples Window of length 141 nt around the splice site
PALMapper uses Smith Waterman PALMapper is an “Intelligent” variant of SW that changes the scoring matrix in certain ways to incorporate ◦ Sequence Quality Scores ◦ Donor & Acceptor Splice Site Prediction Scores ◦ Intron Length Model To perform spliced alignments of short read sequences against a genome 24
Smith Waterman Source of Information Sequence matches Classical scoring f : Σ × Σ → R 25
Extensions of SW Penalizes the scoring if the classifier does not consider that there is a plausible intron. Source of Information Sequence matches Computational splice site predictions 26
Extensions of SW Penalizes the scoring if the intron length is not “expected” Source of Information Sequence matches Computational splice site predictions Intron length model 27
Incorporating Quality Scores Source of Information Sequence matches Computational splice site predictions Intron length model Read quality information Quality scoring f : ( Σ × R ) × Σ → R 28
Incorporating Quality Scores Q E (i): the quality score of the read at position i
The Extended SW g o = intron opening score g e = intron extension score f acc (i) = score of an acceptor site f don (i) = score of a donor site Intron scoring
Training 31
Performance of Various Extensions (% of correct alignments)
Error depends on where the read falls within an intron 33
Summary As read length becomes larger more reads require spliced alignment There are many tools for this task: v TopHat (uses Bowtie for ungapped mapping) v MapSplice (uses Bowtie) v RUM (uses Bowtie and BLAT) v PASS v ARYANA 34
A comparison of spliced alignment tools Figure from: Systematic evaluation of spliced alignment programs for RNA-seq data Nature Methods (2013) http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.2722.html 35
Recommend
More recommend