
Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs



  1. Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs
     Anas Abu-Doleh (1,2), Erik Saule (1), Kamer Kaya (1), and Ümit V. Çatalyürek (1,2)
     (1) Department of Biomedical Informatics, (2) Department of Electrical and Computer Engineering, The Ohio State University

  2. Outline
     I. Introduction
        • Motivation
        • Contribution
        • Related Work
     II. Masher Workflow
        • Index Construction
        • Mapping
     III. Experiments and Results
     IV. Conclusion and Future Work
     ACM-BCB13, A. Abu-Doleh, "Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs", 23 Sep 2013

  3. Motivation
     • The read length of next-generation sequencing (NGS) devices is continuously increasing, so there is wide interest in efficient and accurate mapping of long(er) reads.
     • The powerful capabilities of GPUs can be used to improve the mapping of NGS reads.

  4. Related Work and Contributions
     Contribution
     • A novel hash-based indexing technique in which:
       o For large genomes, the memory footprint is small enough for the index to be stored on a restricted-memory device such as a GPU.
       o The index data structure is more suitable for GPU parallelization.
     Related Work
     • Burrows-Wheeler Transform (BWT)
       o Bowtie2
       o CUSHAW2
       o SOAP3-dp
     • Hash indexing
       o SeqAlto
       o BFAST

  5. Masher workflow

  6. Index Construction
     Processing the genome file
     • Base pairs are converted to a 2-bit format.
     • Each N is replaced with A.

  7. Index Construction
     Processing the genome file
     • Base pairs are converted to a 2-bit format.
     • Each N is replaced with A.
     Indexing
     • Seed length: $L_S$
     • Indexing step size: $\Delta_G$
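
The preprocessing and seed sampling on this slide can be sketched as follows. This is a minimal illustration, not Masher's implementation: the assignment of 2-bit codes to bases and the helper names `encode_base`, `seed_hash`, and `indexed_seeds` are assumptions, since the slides only state that bases are packed into 2 bits and that N is replaced with A.

```python
# Hypothetical 2-bit code; the slides do not say which base maps to which value.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_base(base: str) -> int:
    """Return the 2-bit code of a base, treating N as A."""
    base = base.upper()
    if base == "N":
        base = "A"
    return BASE_CODE[base]


def seed_hash(seq: str, start: int, seed_len: int) -> int:
    """Pack seed_len bases starting at `start` into one integer, 2 bits per base."""
    value = 0
    for base in seq[start:start + seed_len]:
        value = (value << 2) | encode_base(base)
    return value


def indexed_seeds(genome: str, seed_len: int, step: int):
    """Yield (position, hash) for seeds sampled every `step` bases of the genome."""
    for pos in range(0, len(genome) - seed_len + 1, step):
        yield pos, seed_hash(genome, pos, seed_len)
```

With a seed length of $L_S$, the hash takes one of $4^{L_S}$ values, so it can be used directly as an array index.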

  8. Index Construction
     Index arrays: locations array
     • Genome length: $N$
     • Stores the indexed locations, in order, for each seed.
     • Locations array size $= \log_2(N) \times N / \Delta_G$ bits
     • Size ≈ 2.9 GB for hg19 with $\Delta_G = 4$

  9. Index Construction
     Index arrays: count array
     • Stores the number of occurrences of each seed.
     • Size $= 4^{L_S} \times \log_2(N / \Delta_G)$ bits
     • At most 255 locations are stored per seed, so each count fits in 8 bits.
     • For a seed that appears more than 255 times, the stored locations are selected uniformly.
     • Size = 1 GB for $L_S = 15$

  10. Index Construction
     Index arrays: ptrs array
     • Stores the starting index in the locs array for each group of seeds.
     • Seed group size: $\delta$
     • Group id $= \lfloor \text{seed} / \delta \rfloor$
     • Size $= (4^{L_S} / \delta) \times \log_2(N / \Delta_G)$ bits
     • Size = 0.5 GB for $\delta = 8$, $\Delta_G = 4$

  11. Index Construction
     Index arrays
     • $L_S = 15$, $\Delta_G = 4$, $\delta = 8$, hg19
     • Total size of the index arrays = 2.9 + 1 + 0.5 = 4.4 GB
     • Space-time tradeoff
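
The sizes quoted on the index slides can be checked with a short calculation. The sketch below makes several assumptions that are not stated in the slides: a genome length of roughly 3.1 billion bases for hg19, entries rounded up to whole bytes, and "GB" read as $2^{30}$ bytes. The function name `index_sizes_gib` is hypothetical.

```python
import math


def bytes_needed(num_values: int) -> int:
    """Whole bytes needed to store any integer in [0, num_values)."""
    return math.ceil(math.ceil(math.log2(num_values)) / 8)


def index_sizes_gib(genome_len, seed_len, step, group_size, max_count=255):
    """Return (locs, count, ptrs) sizes in GiB under the assumptions above."""
    n_locs = genome_len // step                       # one entry per indexed position
    locs = n_locs * bytes_needed(genome_len)          # each entry is a genome location
    count = 4 ** seed_len * bytes_needed(max_count + 1)
    ptrs = (4 ** seed_len // group_size) * bytes_needed(n_locs)
    gib = 2 ** 30
    return locs / gib, count / gib, ptrs / gib


# Approximate hg19 length; an assumption, not a figure from the slides.
locs, count, ptrs = index_sizes_gib(3_100_000_000, seed_len=15, step=4, group_size=8)
print(f"locs={locs:.1f} count={count:.1f} ptrs={ptrs:.1f} total={locs + count + ptrs:.1f}")
```

Increasing $\Delta_G$ or $\delta$ shrinks the index at the cost of more work per lookup, which is one reading of the space-time tradeoff noted above.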

  12. Index Construction
     Accessing the index
     • Count array
       o Assume seed $= i + 4$, with $\delta = 8$ and $i \bmod \delta = 0$.
       o The seed belongs to seed group $(i, i + \delta - 1)$.
       o Seed index in group: $k = (i + 4) \bmod \delta = 4$
       o $C_{k=4} = \text{count}[i + 4]$
     • Ptrs array
       o $j = \lfloor \text{seed} / \delta \rfloor$
       o Locs group index: $\text{Lgi} = \text{ptrs}[j]$
       o Locs seed index: $\text{Lsi} = \text{Lgi} + \sum_{n=0}^{k-1} C_n$
     • Locs array
       o Extract locations from $(\text{Lsi}, \text{Lsi} + C_k - 1)$
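
The lookup described on this slide can be sketched on a toy index. The code below is an illustration under assumptions: `build_index` and `lookup` are hypothetical names, plain Python lists stand in for the GPU arrays, and the 255-location cap is omitted for brevity.

```python
def build_index(seed_hits, num_seeds, group_size):
    """Build (count, ptrs, locs) from (location, seed) pairs."""
    buckets = [[] for _ in range(num_seeds)]
    for loc, seed in seed_hits:
        buckets[seed].append(loc)
    count = [len(b) for b in buckets]
    locs = [loc for b in buckets for loc in b]   # locations stored in seed order
    ptrs, offset = [], 0
    for seed in range(num_seeds):
        if seed % group_size == 0:               # one pointer per group of seeds
            ptrs.append(offset)
        offset += count[seed]
    return count, ptrs, locs


def lookup(seed, count, ptrs, locs, group_size):
    """Return the stored locations of `seed`."""
    j = seed // group_size                       # group id
    k = seed % group_size                        # seed index within its group
    first = seed - k                             # first seed of the group
    lgi = ptrs[j]                                # locs group index
    lsi = lgi + sum(count[first:first + k])      # skip the k earlier seeds in the group
    return locs[lsi:lsi + count[seed]]


hits = [(0, 5), (4, 5), (8, 2), (12, 7), (16, 5)]
count, ptrs, locs = build_index(hits, num_seeds=16, group_size=4)
print(lookup(5, count, ptrs, locs, group_size=4))
```

Storing one pointer per group rather than per seed is what reduces the ptrs array by a factor of $\delta$; the price is the short prefix sum over at most $\delta - 1$ counts.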

  13. Index Construction
     [Figure: cumulative distribution of seed counts, Pr(count <= x), for seed counts x from 1 to 56; the probability axis runs from 0.5 to 1.]

  14. Mapping
     Seed and hash
     • Read step size: $\Delta_R$
     • Read length: $L_R$
     • $N_{\text{seeds}} = \Delta_G \times (L_R - L_S) / \Delta_R$
     Locate candidate alignment locations (CALs)
     • Each thread is assigned to a specific seed.
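
One way to read the seed-count formula is that the read is sampled every $\Delta_R$ bases and, at each sample point, $\Delta_G$ consecutive seeds are taken so that one of them lines up with an indexed genome position. The sketch below follows that reading, which is an assumption, as are the helper names, the dictionary used in place of the index arrays, and the choice of one unit of weight per seed hit. The loop is sequential here, whereas the slide assigns each seed to a GPU thread.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 0}  # N treated as A


def seed_hash(seq, start, seed_len):
    """Pack seed_len bases into an integer, 2 bits per base."""
    value = 0
    for base in seq[start:start + seed_len]:
        value = (value << 2) | CODE[base]
    return value


def read_seed_offsets(read_len, seed_len, read_step, genome_step):
    """Offsets of the seeds taken from a read: genome_step seeds per sample point."""
    offsets = []
    for block in range(0, read_len - seed_len, read_step):
        for shift in range(genome_step):
            if block + shift + seed_len <= read_len:
                offsets.append(block + shift)
    return offsets


def candidate_locations(read, index, seed_len, read_step, genome_step):
    """Return {CAL: weight}; a hit at genome position g from read offset r votes for g - r."""
    cals = {}
    for r in read_seed_offsets(len(read), seed_len, read_step, genome_step):
        for g in index.get(seed_hash(read, r, seed_len), []):
            cals[g - r] = cals.get(g - r, 0) + 1
    return cals
```

For example, a read of length 8 with $L_S = 4$, $\Delta_R = 4$, and $\Delta_G = 2$ yields $2 \times (8 - 4) / 4 = 2$ seeds, at offsets 0 and 1.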

  15. Mapping
     Merge CALs and weights
     • When merging CALs, if two CALs are within a threshold distance of each other, the weight of the second is added to the weight of the first.
     • For efficiency, Masher consists of two main loops.
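
The merging rule can be sketched as a single pass over CALs sorted by location. This is a sequential illustration, not Masher's GPU code; the function name `merge_cals` is hypothetical, and measuring the distance from the first CAL of each cluster is an assumption.

```python
def merge_cals(cals, threshold):
    """Merge (location, weight) pairs that lie within `threshold` of a cluster's first CAL."""
    merged = []
    for loc, weight in sorted(cals):
        if merged and loc - merged[-1][0] <= threshold:
            merged[-1][1] += weight      # fold the later CAL's weight into the first
        else:
            merged.append([loc, weight])
    return [tuple(m) for m in merged]


print(merge_cals([(100, 1), (103, 2), (500, 1), (98, 1)], threshold=5))
```

Merging nearby CALs lets seeds that are shifted by small insertions or deletions reinforce the same candidate rather than compete with each other.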

  16. Mapping
     Sorting and batching CALs
     • CALs are sorted and placed in batches according to their weights.
     • At this stage, a filter that discards low-weight CALs can be applied.

  17. Mapping
     Sorting and batching CALs
     • CALs are sorted and placed in batches according to their weights.
     • At this stage, a filter that discards low-weight CALs can be applied.
     Bounded local alignment
     • A parameterized variant of the Smith-Waterman (SW) algorithm that supports affine gap scoring.
     • In bounded alignment, only the matrix cells $(i, j)$ where $|i - j| \le w$ are visited and scored.
     • Masher does two passes, setting $w$ to 4 and 16 respectively.
     • Each GPU block performs multiple SW alignments in parallel.
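
A bounded Smith-Waterman with affine gaps can be sketched as below. This is a plain sequential version for illustration only: the scoring values are placeholders rather than Masher's parameters, `banded_sw` is a hypothetical name, and the full matrices are allocated for clarity even though only the band is filled.

```python
NEG = float("-inf")


def banded_sw(read, ref, w, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score, visiting only cells (i, j) with |i - j| <= w.

    gap_open is the score of the first base of a gap; gap_extend of each further base.
    """
    n, m = len(read), len(ref)
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best score ending at (i, j)
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # ending with a gap that consumes ref
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # ending with a gap that consumes read
    best = 0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            if abs(i - (j - 1)) <= w:            # left neighbour is inside the band
                E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            if abs((i - 1) - j) <= w:            # upper neighbour is inside the band
                F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

A narrow band visits roughly $(2w + 1) \times L_R$ cells instead of the full matrix, which suggests why a cheap first pass with $w = 4$ followed by a wider pass with $w = 16$ can save work when most reads align with few indels.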

  18. Experiments and Results
     Platform
     • Intel Core i7-960 CPU clocked at 3.2 GHz, with 4 Hyper-Threading cores and 24 GB of DDR3 memory.
     • Tesla K20c GPU with 4.8 GB of global memory.
     • CUDA 5.0 and GCC 4.2.4.
     Human genome and simulated reads
     • Human genome hg19.
     • Wgsim simulator: 100K reads of length 100, 300, 500, and 1000 with error rates of 2%, 4%, 6%, and 8%.

  19. Experiments and Results
     Metrics for comparison
     • Sensitivity: the percentage of reads that are aligned.
     • Accuracy: the percentage of aligned reads that are correctly aligned to the locations given by the simulator.
     • Execution time: only alignment time was measured.
     • The lower bound for a valid alignment score is set to $\text{score}_{LB} = L_R \times (1.9 - 0.5 \times \text{error rate})$.
     Two modes of Masher
     • Normal mode: $\Delta_R = 0.7 L_R$
     • Fast mode: $\Delta_R = L_R$
     Comparison with
     • Bowtie2 (sensitive and fast modes), 8 threads
     • SOAP3-dp
     • CUSHAW2-GPU
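
The two quality metrics can be computed as in the sketch below. The function name, the dictionary inputs, and the `tolerance` parameter (how far a reported position may be from the simulated one and still count as correct) are assumptions; the slides do not say how "correctly aligned" was decided.

```python
def sensitivity_and_accuracy(alignments, truth, tolerance=0):
    """Return (sensitivity %, accuracy %).

    alignments: {read_id: reported position, or None if the read was not aligned}
    truth:      {read_id: position given by the simulator}
    """
    aligned = {r: p for r, p in alignments.items() if p is not None}
    correct = sum(1 for r, p in aligned.items() if abs(p - truth[r]) <= tolerance)
    sensitivity = 100.0 * len(aligned) / len(alignments)
    accuracy = 100.0 * correct / len(aligned) if aligned else 0.0
    return sensitivity, accuracy
```

Note that accuracy is measured over aligned reads only, so a mapper can trade sensitivity for accuracy by declining to report uncertain alignments.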

  20. Experiments and Results
     $L_R = 100$ bps.
     [Figure: sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, SOAP3-dp, and CUSHAW2-GPU.]

  21. Experiments and Results
     $L_R = 500$ bps.
     [Figure: sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, and SOAP3-dp.]

  22. Experiments and Results
     $L_R = 1000$ bps.
     [Figure: sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, and SOAP3-dp.]

  23. Experiments and Results
     $L_R = 100$ bps.
     [Figure: execution time (sec.), in log scale, versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, SOAP3-dp, and CUSHAW2-GPU.]
