Massively parallel read mapping on graphics cards
Johannes Köster
May 15, 2014
Genome Informatics
Outline
1 Next-Generation-Sequencing of DNA
2 Read Mapping
3 Algorithm
4 Results
Next-Generation-Sequencing
1 Chop DNA/RNA into small fragments.
2 Ligate adapters to both ends.
3 Spread fragment solution across a flowcell with beads.
4 Amplify fragments into clusters (PCR).
5 Sequence fragments by adding fluorescent complementary bases → reads.
(Illumina, 2013)
Read Mapping
For each read, find its position in the known reference genome.
• A DNA sequence is a word over Σ = {A, C, G, T}.
• This is string matching, but with error tolerance.
Read Mapping
For each read, find the position(s) with optimal alignment(s) to either strand of the reference.
[Figure: example alignment; sequences shown: ACTGTGGACTATCAATGGAC, GGTACTGT, CTATCTATGGACCGTTAG]
→ Smith-Waterman algorithm
Too slow, therefore heuristics are used to find anchor positions:
• suffix array / Burrows-Wheeler transform (BWA, Bowtie 2)
• q-gram indices (RazerS 3)
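A minimal sketch of the exact but slow baseline mentioned above: the Smith-Waterman local alignment score computed by dynamic programming, tried on both strands of the reference. The scoring values (match +2, mismatch -1, gap -1) and all function names are illustrative assumptions, not taken from the slides.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(seq))


def smith_waterman(read, ref, match=2, mismatch=-1, gap=-1):
    """Return (best local alignment score, 1-based end position in ref).

    Runs in O(len(read) * len(ref)) time, which is why mappers use
    filtering heuristics before running an alignment.
    """
    prev = [0] * (len(ref) + 1)
    best, best_end = 0, 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            s = match if read[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best, best_end = curr[j], j
        prev = curr
    return best, best_end


def best_strand(read, ref):
    """Align the read and its reverse complement; keep the better score."""
    fwd = smith_waterman(read, ref)
    rev = smith_waterman(reverse_complement(read), ref)
    return ("-",) + rev if rev[0] > fwd[0] else ("+",) + fwd
```

For a read of length m and a reference of length n, every read costs m · n cell updates, which motivates the anchor heuristics listed on the slide.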
Read mapping on GPUs
Challenges:
• limited and slow memory (a problem for q-gram indices)
• branching interrupts parallelism (a problem for BWT-based approaches)
Idea:
• Use a special q-gram index with small memory footprint.
• Use parallelism to hide memory latency.
• Replace branching with bit-vector operations.
→ PEANUT: the ParallEl AligNment UTility
Algorithm
Main steps:
• Filtration: find potential hits between reads and reference using a special q-gram index
• Validation: validate hits using a bit-parallel alignment algorithm
Q-Gram Index
For a given DNA sequence T:
• consider q-grams (substrings of length q), e.g. in GGTACTGACGTTCTATGGACCGTTAG
• encode them as integers: ACGT = 11 10 01 00 = 228
• array P with the concatenation of q-gram positions
• array Q with the address in P for each q-gram (the positions of q-gram 228 are stored in P between addresses Q[228] and Q[229])
→ size $4^q + |T|$
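The encoding and the two arrays can be sketched as follows. This is a minimal illustration, assuming A=00, C=01, G=10, T=11 with the first character in the lowest bits (which reproduces ACGT = 228 from the slide). The function names and the extra sentinel entry at the end of Q are my own choices.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(qgram):
    """Pack a q-gram into an integer, 2 bits per base, first base lowest."""
    value = 0
    for i, c in enumerate(qgram):
        value |= CODE[c] << (2 * i)
    return value


def build_qgram_index(text, q):
    """Return (Q, P): Q[g] is the start address in P of q-gram code g."""
    codes = [encode(text[i:i + q]) for i in range(len(text) - q + 1)]
    # Count occurrences, shifted by one so the prefix sum yields start addresses.
    Q = [0] * (4 ** q + 1)
    for g in codes:
        Q[g + 1] += 1
    for g in range(1, len(Q)):
        Q[g] += Q[g - 1]
    # Fill P bucket by bucket, keeping text positions in increasing order.
    next_free = Q[:]
    P = [0] * len(codes)
    for pos, g in enumerate(codes):
        P[next_free[g]] = pos
        next_free[g] += 1
    return Q, P


def positions(Q, P, qgram):
    """Return all text positions of the given q-gram."""
    g = encode(qgram)
    return P[Q[g]:Q[g + 1]]
```

Because Q needs one entry per possible q-gram, its size grows as 4^q regardless of how many q-grams actually occur, which is the memory problem the q-group index addresses.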
Q-Group Index
• assign each q-gram g to a q-group $\lfloor g / w \rfloor$ (e.g. 228 / 32)
• store the occurrence of the q-gram in a bit-vector (e.g. at bit 228 % 32)
• two address arrays guide from the q-group to the positions of the found q-gram in the text
→ size $\frac{2}{w} \cdot 4^q + \min\{4^q, |T|\} + |T|$
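As a rough worked example, the two size formulas can be compared directly. The slides do not state a unit, so the sketch below assumes sizes are counted in array entries. The values q = 16, w = 32 and a text of 500 million bases are illustrative assumptions only.

```python
def qgram_index_size(q, text_len):
    """Size of the plain q-gram index: 4^q + |T| entries."""
    return 4 ** q + text_len


def qgroup_index_size(q, text_len, w=32):
    """Size of the q-group index: 2/w * 4^q + min(4^q, |T|) + |T| entries."""
    return 2 * 4 ** q // w + min(4 ** q, text_len) + text_len


# Hypothetical setting: q = 16, 32-bit words, 500 million indexed bases.
plain = qgram_index_size(16, 500_000_000)
grouped = qgroup_index_size(16, 500_000_000, w=32)
print(plain, grouped, round(plain / grouped, 2))
```

The saving comes from the 4^q term being divided by w/2. It is largest when the indexed text is much shorter than 4^q.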
Q-Group Index
Example (q-gram GAAA):
I:  0101 0000 0010
S:  0 2 2 3
S': 0 1 5 8
O:  15 22 11 17 308 3 52 31
Less memory, because we consider only
• q-groups at the top level
• occurring q-grams at the bottom
Calculate address ranges in parallel by
• population counts
• prefix sums
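The lookup through I, S, S' and O can be sketched sequentially as below. This is a plain Python illustration, not the GPU implementation. The function names, the sentinel entries at the end of S and S', and the reading of each bit-vector with its least significant bit on the right are assumptions on my part.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(qgram):
    """Pack a q-gram into an integer, 2 bits per base, first base lowest."""
    return sum(CODE[c] << (2 * i) for i, c in enumerate(qgram))


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def build_qgroup_index(text, q, w=32):
    """Return (I, S, S2, O) where S2 plays the role of S' on the slide."""
    occ = {}
    for pos in range(len(text) - q + 1):
        occ.setdefault(encode(text[pos:pos + q]), []).append(pos)
    n_groups = (4 ** q + w - 1) // w
    I = [0] * n_groups                      # one occurrence bit per q-gram
    for g in occ:
        I[g // w] |= 1 << (g % w)
    S = [0] * (n_groups + 1)                # prefix sums of population counts
    for j in range(n_groups):
        S[j + 1] = S[j] + popcount(I[j])
    S2, O = [0], []                         # address ranges of occurring q-grams
    for g in sorted(occ):
        O.extend(occ[g])
        S2.append(len(O))
    return I, S, S2, O


def lookup(index, g, w=32):
    """Return the positions stored for q-gram code g (empty if absent)."""
    I, S, S2, O = index
    j, k = divmod(g, w)
    if not (I[j] >> k) & 1:
        return []
    # Rank of g among the occurring q-grams: groups before j, then bits below k.
    rank = S[j] + popcount(I[j] & ((1 << k) - 1))
    return O[S2[rank]:S2[rank + 1]]
```

Under the bit-order assumption above and with w = 4, looking up GAAA (code 2) in the slide's example arrays selects the range O[1:5], i.e. the positions 22, 11, 17 and 308.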
Validation
[Figure: edit graph for the sequences TGTCTA and TGTA, annotated with distance deltas (0, +1) and the bit-vector operations +, ^, << and &]
Observations:
• calculating column j needs only column j − 1
• each transition changes the edit distance by at most 1
Myers bit-parallel algorithm [1]:
• process the graph column-wise
• maintain distance deltas in bit-vectors
[1] Myers, 1999. J. ACM 46.
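A sketch of the column-wise update in Python, following the commonly used reformulation of Myers' algorithm (attributed to Hyyrö) for searching a pattern inside a text. Variable names are my own, and Python's unbounded integers stand in for machine words, so every intermediate result is masked to the pattern length. How PEANUT arranges this on the GPU is not shown in the slides.

```python
def myers_search(pattern, text):
    """For each text position, return the minimum edit distance between
    the pattern and any substring of the text ending at that position."""
    m = len(pattern)
    if m == 0:
        return [0] * len(text)
    mask = (1 << m) - 1          # keeps results within m bits
    high = 1 << (m - 1)          # bit of the last pattern row
    peq = {}                     # per character: bit i set if pattern[i] == c
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv = mask, 0             # vertical deltas: +1 everywhere in column 0
    score = m                    # value of the bottom cell of column 0
    scores = []
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal delta +1
        mh = pv & xh                    # horizontal delta -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift without setting bit 0: the top row is 0 for every column,
        # so a match may start anywhere in the text.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        scores.append(score)
    return scores
```

Each text character costs a constant number of word operations instead of one update per pattern row, and the loop body contains only a single data-dependent branch.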
Workflow
• load reads into buffer
• build q-group index of reads
• filtration of hits
• validation of hits
• postprocessing
• writing
[Figure: timeline from start to stop showing how loading sequences, index building, filtration, validation, postprocessing and writing overlap; legend: IO, GPU, CPU]
Results
[Figure: occupancy (y-axis, 0.2 to 1.0) versus block size (x-axis, 0 to 600) for the kernels filter_reference, create_queries_index and validate_hits]
Sensitivity
• assessed with the Rabema [2] benchmark on the S. cerevisiae genome
• 100% for reads with error rate less than 7%
• 99.77% for error rates up to 10%
• 98.97% for error rates up to 20%
[2] Holtgrewe et al. 2011. BMC Bioinformatics
Performance
Output types:
• all: all alignments of a read
• best: one of the best alignments
• best-stratum: all best alignments

5 million simulated human reads:

mapper     type           time [min:sec]   sens. [%]
PEANUT     best-stratum     1:55           98.62
BWA-MEM    best             3:16           96.99
Bowtie 2   best             5:21           96.85
PEANUT     all             18:29           98.74
RazerS 3   all            199:55           98.83

Hardware: Intel Core i7, 16GB RAM; NVIDIA GeForce 780, 3GB RAM
Performance
5 million real human exome reads:

mapper     type           time [min:sec]
PEANUT     best-stratum     1:33
BWA-MEM    best             1:58
Bowtie 2   best             3:12
PEANUT     all             10:52
RazerS 3   all             89:38

Hardware: Intel Core i7, 16GB RAM; NVIDIA GeForce 780, 3GB RAM
Performance
10 million human exome paired-end reads:

mapper     type           time [min:sec]
PEANUT     best-stratum     3:08
BWA-MEM    best             4:44
Bowtie 2   best             8:18
PEANUT     all             21:54
RazerS 3   all            150:59

Hardware: Intel Core i7, 16GB RAM; NVIDIA GeForce 780, 3GB RAM
Summary
PEANUT is a GPU-based read mapper that outperforms other state-of-the-art mappers in terms of
• sensitivity
• speed
by introducing the q-group index with small memory footprint and exploiting
• bit-vector operations
• prefix sums
• population counts
http://peanut.readthedocs.org