Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs
Anas Abu-Doleh (1,2), Erik Saule (1), Kamer Kaya (1), and Ümit V. Çatalyürek (1,2)
(1) Department of Biomedical Informatics, (2) Department of Electrical and Computer Engineering, The Ohio State University
Outline
I. Introduction
• Motivation
• Contribution
• Related Work
II. Masher Workflow
• Index Construction
• Mapping
III. Experiments and Results
IV. Conclusion and Future Work
ACM-BCB13, A. Abu-Doleh, "Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs", 23 Sep 2013
Motivation
• The read length of next-generation sequencing (NGS) devices keeps increasing, so there is wide interest in efficient and accurate mapping of long(er) reads.
• GPUs offer powerful parallel capabilities that can be used to improve the mapping of NGS reads.
Related Work and Contributions
Contribution
• A novel hash-based indexing technique in which:
o For large genomes, the memory footprint is small enough for the index to be stored on a restricted-memory device such as a GPU.
o The index data structure is better suited to GPU parallelization.
Related Work
• Burrows-Wheeler Transform (BWT)
o Bowtie2
o CUSHAW2
o SOAP3-dp
• Hash indexing
o SeqAlto
o BFAST
Masher Workflow
Index Construction
Processing the genome file
• Convert base pairs to a 2-bit format.
• Replace each N with A.
Indexing
• Seed length: $L_S$
• Indexing step size: $\Delta_G$
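The preprocessing and seed sampling steps above can be sketched as follows. This is a minimal illustration, not Masher's code: the assignment of 2-bit values to bases (A=0, C=1, G=2, T=3) and the function names are assumptions, since the slides do not specify them.

```python
# Hypothetical 2-bit code assignment; the slides only say "2-bit format".
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_base(base):
    """Return the 2-bit code of a base, treating N as A as described above."""
    base = base.upper()
    if base == "N":
        base = "A"
    return CODE[base]


def seed_hash(seq, start, seed_len):
    """Pack seed_len bases starting at `start` into an integer in [0, 4**seed_len)."""
    value = 0
    for base in seq[start:start + seed_len]:
        value = (value << 2) | encode_base(base)
    return value


def indexed_seeds(genome, seed_len, step):
    """Yield (position, hash) for each seed sampled every `step` bases (the indexing step size)."""
    for pos in range(0, len(genome) - seed_len + 1, step):
        yield pos, seed_hash(genome, pos, seed_len)
```

With a seed length of 15, each hash fits in 30 bits, so it can be used directly as an index into an array with $4^{15}$ entries.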
Index Construction
Index arrays: Locations array
• Genome length: $N$
• Stores the indexed locations, in order, for each seed.
• Locations array size = $\log_2(N) \times N / \Delta_G$
• Size ≈ 2.9 GB for hg19 with $\Delta_G = 4$
Index Construction
Index arrays: Count array
• Stores the number of occurrences of each seed.
• Size = $4^{L_S} \times \log_2(N / \Delta_G)$
• At most 255 locations are stored per seed; for a seed that appears more than 255 times, a uniform selection of its locations is kept.
• Size = 1 GB for $L_S = 15$
Index Construction
Index arrays: Ptrs array
• Stores the starting index in the locs array for each group of seeds.
• Seed group size: $\delta$; group id = $\lfloor \text{seed} / \delta \rfloor$
• Size = $(4^{L_S} / \delta) \times \log_2(N / \Delta_G)$
• Size = 0.5 GB for $\delta = 8$, $\Delta_G = 4$
Index Construction
Index arrays
• $L_S = 15$, $\Delta_G = 4$, $\delta = 8$, hg19
• Total size of the index arrays = 2.9 + 1 + 0.5 = 4.4 GB
• Space-time tradeoff
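The quoted sizes can be reproduced with a short calculation. The sketch below rests on assumptions that are not stated on the slides: a genome length of about 3.1 billion bases for hg19, 4-byte entries for locations and pointers, 1-byte counts (enough for the 255 cap), and GB taken as $2^{30}$ bytes.

```python
GIB = 2 ** 30  # assumption: "GB" on the slides means 2**30 bytes


def index_sizes_gib(genome_len, seed_len, step, group_size,
                    loc_bytes=4, count_bytes=1, ptr_bytes=4):
    """Return the sizes (locs, counts, ptrs) of the three index arrays in GiB."""
    locs = (genome_len // step) * loc_bytes              # one entry per indexed position
    counts = (4 ** seed_len) * count_bytes               # one entry per possible seed
    ptrs = (4 ** seed_len // group_size) * ptr_bytes     # one entry per seed group
    return locs / GIB, counts / GIB, ptrs / GIB


# Assumed genome length for hg19 (approximate).
locs, counts, ptrs = index_sizes_gib(3_100_000_000, 15, 4, 8)
print(round(locs, 1), counts, ptrs, round(locs + counts + ptrs, 1))
```

Under these assumptions the three arrays come to roughly 2.9, 1.0, and 0.5 GiB, which matches the 4.4 GB total. Increasing the indexing step size or the seed group size shrinks the index at the cost of more work per lookup, which is one way to read the space-time tradeoff.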
Index Construction
Accessing the index
• Count array
o Assume seed = $i + 4$, where $i \bmod \delta = 0$ and $\delta = 8$.
o The seed belongs to seed group $(i, \ldots, i + \delta - 1)$.
o Seed index in group: $k = (i + 4) \bmod \delta = 4$
o $C_k = C_4 = \text{count}[i + 4]$
• Ptrs array
o $j = \lfloor \text{seed} / \delta \rfloor$
o Locs group index: $\text{Lgi} = \text{ptrs}[j]$
o Locs seed index: $\text{Lsi} = \text{Lgi} + \sum_{n=0}^{k-1} C_n$, where $C_n = \text{count}[i + n]$
• Locs array
o Extract the locations from $\text{Lsi}$ to $\text{Lsi} + C_k - 1$.
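The lookup above can be illustrated with a small CPU-side sketch of the three arrays. It is not Masher's implementation: the function names, the 2-bit code assignment, and the way over-represented seeds are subsampled (evenly spaced picks) are assumptions made for the example.

```python
def seed_hash(seed):
    """Pack a seed string into an integer, 2 bits per base (N treated as A)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 0}  # hypothetical code assignment
    value = 0
    for base in seed:
        value = (value << 2) | code[base]
    return value


def build_index(genome, seed_len, step, group_size, max_count=255):
    """Build the count, ptrs, and locs arrays for seeds sampled every `step` bases."""
    n_seeds = 4 ** seed_len
    buckets = {}
    for pos in range(0, len(genome) - seed_len + 1, step):
        buckets.setdefault(seed_hash(genome[pos:pos + seed_len]), []).append(pos)

    count = [0] * n_seeds
    ptrs = [0] * ((n_seeds + group_size - 1) // group_size)
    locs = []
    for seed in range(n_seeds):
        if seed % group_size == 0:
            ptrs[seed // group_size] = len(locs)  # start of this seed group in locs
        hits = buckets.get(seed, [])
        if len(hits) > max_count:
            # Assumption: "uniform selection" is modeled as evenly spaced picks.
            hits = [hits[i * len(hits) // max_count] for i in range(max_count)]
        count[seed] = len(hits)
        locs.extend(hits)
    return count, ptrs, locs


def lookup(seed, count, ptrs, locs, group_size):
    """Return the stored locations of `seed` using Lsi = ptrs[j] + sum of earlier counts."""
    j = seed // group_size                 # group id
    k = seed % group_size                  # seed index within its group
    first = j * group_size                 # first seed of the group
    lsi = ptrs[j] + sum(count[first:first + k])
    return locs[lsi:lsi + count[seed]]
```

For example, indexing `"ACGTACGTACGT"` with a seed length of 2, a step of 1, and a group size of 8, then looking up `seed_hash("GT")`, returns the positions 2, 6, and 10. Storing one pointer per group rather than per seed is what keeps the ptrs array small; the price is the short prefix sum over at most $\delta - 1$ counts on each access.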
Index Construction
[Figure: cumulative distribution of seed counts, with Pr(count <= x) on the vertical axis and the seed count x on the horizontal axis.]
Mapping
Seed & hash
• Read step size: $\Delta_R$
• Read length: $L_R$
• $N_{\text{seeds}} = \Delta_G \times (L_R - L_S) / \Delta_R$
Locate candidate alignment locations (CALs)
• Each thread is assigned to a specific seed.
Mapping
Merge CALs and weights
• When merging CALs, if two CALs are within a threshold distance of each other, the weight of the second is added to the weight of the first.
• For efficiency, Masher consists of two main loops.
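The merging rule can be sketched on the CPU as below. The function name, the `(location, weight)` tuple layout, and the choice to measure distance from the first CAL of each cluster are assumptions for illustration; the slides do not describe how the GPU version is organized.

```python
def merge_cals(cals, threshold):
    """Merge candidate alignment locations that lie within `threshold` of each other.

    cals: iterable of (location, weight) pairs.
    Returns a list of (location, weight) pairs sorted by location, where each
    kept CAL has absorbed the weights of the nearby CALs that follow it.
    """
    merged = []
    for loc, weight in sorted(cals):
        # Assumption: distance is measured from the first CAL of the cluster.
        if merged and loc - merged[-1][0] <= threshold:
            merged[-1][1] += weight   # add the second weight to the first
        else:
            merged.append([loc, weight])
    return [tuple(entry) for entry in merged]
```

For instance, `merge_cals([(100, 1), (103, 1), (500, 2), (98, 1)], 8)` yields `[(98, 3), (500, 2)]`: the three nearby hits collapse into one CAL of weight 3.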
Mapping
Sorting and batching CALs
• CALs are sorted and grouped into batches according to their weights.
• At this stage, CALs with low weight can be filtered out.
Bounded local alignment
• A parameterized variant of the Smith-Waterman (SW) algorithm supporting affine gap scoring.
• In bounded alignment, only the matrix cells $(i, j)$ with $|i - j| \le w$ are visited and scored.
• Masher does two passes, setting $w$ to 4 and 16, respectively.
• Each GPU block performs multiple SW alignments in parallel.
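A single-threaded sketch of bounded (banded) Smith-Waterman with affine gaps is given below to make the band condition concrete. It uses the standard Gotoh recurrences; the scoring values (match 2, mismatch -1, gap open -3, gap extend -1) and the function name are assumptions, not Masher's parameters, and the full matrices are kept only for readability.

```python
NEG = float("-inf")


def banded_sw(read, ref, w, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Return the best local alignment score using only cells (i, j) with |i - j| <= w.

    A gap of length L costs gap_open + (L - 1) * gap_extend (assumed convention).
    """
    n, m = len(read), len(ref)
    # H: best score of an alignment ending at (i, j).
    # E: best score ending at (i, j) with a gap that consumes a reference base.
    # F: best score ending at (i, j) with a gap that consumes a read base.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        # Visit only the columns inside the band around the main diagonal.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            score = match if read[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + score, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

Cells outside the band keep their initial value of 0 in `H`, which is the score of an empty local alignment, so reading them never improves a path. The band widths 4 and 16 mentioned above would be passed as `w`; a narrow band visits about $(2w + 1) \times L_R$ cells instead of the full matrix.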
Experiments and Results
Platform
• Intel Core i7-960 CPU clocked at 3.2 GHz, with 4 Hyper-Threading cores and 24 GB of DDR3 memory.
• Tesla K20c GPU with 4.8 GB of global memory.
• CUDA 5.0 and GCC 4.2.4.
Human genome and simulated reads
• Human genome hg19.
• Wgsim simulator: 100K reads of length 100, 300, 500, and 1000, with error rates of 2%, 4%, 6%, and 8%.
Experiments and Results
Metrics for comparison
• Sensitivity: the percentage of reads that are aligned.
• Accuracy: the percentage of aligned reads that are correctly aligned to the locations given by the simulator.
• Execution time: only alignment time was measured.
The lower bound for a valid alignment score is set to $\text{score}_{LB} = L_R \times (1.9 - 0.5 \times \text{ErrorRate})$.
Two modes of Masher
• Normal mode: $\Delta_R = 0.7\,L_R$
• Fast mode: $\Delta_R = L_R$
Comparison with
• Bowtie2 (sensitive and fast), 8 threads
• SOAP3-dp
• CUSHAW2-GPU
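The two quality metrics can be computed as in the sketch below. The input layout (dictionaries keyed by read id) and the `tolerance` parameter for deciding that a reported location matches the simulated one are assumptions; the slides do not state how close a location must be to count as correct.

```python
def sensitivity_and_accuracy(true_pos, mapped_pos, tolerance=0):
    """Return (sensitivity %, accuracy %) for one mapping run.

    true_pos:   dict read_id -> location given by the simulator (all reads).
    mapped_pos: dict read_id -> location reported by the mapper (aligned reads only).
    tolerance:  hypothetical maximum distance for a location to count as correct.
    """
    total = len(true_pos)
    aligned = [read for read in true_pos if read in mapped_pos]
    correct = sum(1 for read in aligned
                  if abs(mapped_pos[read] - true_pos[read]) <= tolerance)
    sensitivity = 100.0 * len(aligned) / total if total else 0.0
    accuracy = 100.0 * correct / len(aligned) if aligned else 0.0
    return sensitivity, accuracy
```

Note that accuracy is normalized by the number of aligned reads, not by the total, so a mapper that aligns fewer reads can still report high accuracy; this is why the two metrics are shown side by side.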
Experiments and Results
$L_R$ = 100 bps
[Figure: sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, SOAP3-dp, and CUSHAW2-GPU.]
Experiments and Results
$L_R$ = 500 bps
[Figure: sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, and SOAP3-dp.]
Experiments and Results
$L_R$ = 1000 bps
[Figure: sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, and SOAP3-dp.]
Experiments and Results
$L_R$ = 100 bps
[Figure: execution time (sec., log scale) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, SOAP3-dp, and CUSHAW2-GPU.]