SWMapper: Scalable Read Mapper on Sunway TaihuLight
Kai Xu 1,2, Xiaohui Duan 1,2, Xiangxu Meng 1, Xin Li 1, Bertil Schmidt 3, Weiguo Liu 1,2
1 Shandong University  2 National Supercomputing Center in Wuxi  3 Johannes Gutenberg University
Outline Introduction and Background Implementation Performance Evaluation Conclusion and Future Work
Introduction
Related work
- trie-based: BWA, Bowtie, …
- k-mer-based: BWA-MEM, Bowtie2, GEM, FastHASH, mrsFAST, RazerS3, FEM, S-Aligner, BitMapper2, Hobbes3, …
- read mappers on compute clusters: BWA, pMAP, SEAL, BigBWA, SparkBWA, parSRA, CUSHAW3-UPC++, mer-Aligner, S-Aligner, …
Introduction
Seed-and-extend strategy
1. Index the k-mers of the reference into a hash table or other similar data structure
   Reference: ACGTACGTAGCATGCATCGATCGTACGCATCGAT
   ACGT -> 0xE4, CGTA -> 0x39, GTAC -> 0x4E, …
2. Extract and hash the k-mers ("seeds") of each read
   Read: ACGTCCGTAGCATGCT
   ACGT -> 0xE4, CCGT -> 0xE5, AGCA -> 0x18, …
3. Probe the hash table to find candidate locations
   0xE4: 1, 234, 456, 1246, 17983
   0xE5: 89, 284, 956, 2246, 27983
   0x18: 9, 423, 929, 3645, 47228
4. Compute an alignment ("extend") at the candidate locations
   Reference: ACGTACGTAGCATGCATCGATCGTACGCATCGAT
              ||| | ||||||||||
   Read:      ACGTCCGTAGCATGCT
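The four steps above can be sketched in a few lines of Python. This is illustrative only: the names `build_index` and `map_read` are our own, and the "extend" step here is a simple Hamming-distance check rather than the full alignment a real mapper performs.

```python
# Minimal sketch of the seed-and-extend strategy (illustrative only).
from collections import defaultdict

def build_index(reference, k):
    """Step 1: index every k-mer of the reference in a hash table."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_err):
    """Steps 2-4: hash seeds of the read, probe the index, extend."""
    hits = []
    for off in range(0, len(read) - k + 1, k):   # non-overlapping seeds
        seed = read[off:off + k]
        for loc in index.get(seed, []):          # step 3: candidate locations
            start = loc - off                    # implied read start on the reference
            if start < 0 or start + len(read) > len(reference):
                continue
            # step 4: "extend" = count mismatches over the full read
            err = sum(a != b for a, b in zip(read, reference[start:start + len(read)]))
            if err <= max_err:
                hits.append((start, err))
    return sorted(set(hits))

ref = "ACGTACGTAGCATGCATCGATCGTACGCATCGAT"
read = "ACGTCCGTAGCATGCT"
print(map_read(read, ref, build_index(ref, 4), k=4, max_err=2))  # -> [(0, 2)]
```

Note how the seed at read offset 0 and the seed at offset 8 both point to the same alignment start, which is why duplicate candidate locations must be removed (a later slide addresses exactly this).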
Outline Introduction and Background Implementation Performance Evaluation Conclusion and Future Work
Implementation
SW26010 Architecture
[Diagram: four core groups (CGs) connected by a network on chip (NoC); each CG has one MPE, one 8*8 CPE cluster, and its own memory controller (MC) and memory, accessed via DMA]
- Limited LDM: just 64 KB per CPE
- One MPI process can attach to one core group (CG)
- Latency from CPE to memory is hidden using the DMA transfer pattern
- Memory size of one CG: 8 GB
- Memory bandwidth of one CG: up to 34 GB/s
Implementation
MPE Workflow
- Build the hash index; malloc two buffers each for reads and results; init rds_id and res_id
- Load one batch of reads into read-buffer rds_id; call athread_spawn to start CPE alignment; update rds_id
- While there are still reads:
  - Load the next batch of reads into read-buffer rds_id
  - Call athread_join; update rds_id; call athread_spawn to start CPE alignment
  - Write result-buffer res_id to disk; update res_id
- Call athread_join; write result-buffer res_id to disk; update res_id
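The double-buffering loop above can be modeled with a worker thread standing in for the CPE cluster. This is a hedged sketch, not SWMapper's code: `align_batch` and `write_results` are hypothetical stand-ins, and a Python future replaces the athread_spawn/athread_join pair.

```python
# Hedged sketch of the MPE double-buffering loop; a worker thread
# models the CPE cluster, a future models athread_spawn/athread_join.
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(batches, align_batch, write_results):
    """While the 'CPEs' align batch i, the main thread writes the
    results of batch i-1 and loads batch i+1."""
    pool = ThreadPoolExecutor(max_workers=1)        # stands in for the CPE cluster
    pending, prev = None, None
    for batch in batches:                           # load one batch of reads
        if pending is not None:
            prev = pending.result()                 # "athread_join" for the previous batch
        pending = pool.submit(align_batch, batch)   # "athread_spawn" for this batch
        if prev is not None:
            write_results(prev)                     # write result buffer while CPEs run
            prev = None
    if pending is not None:
        write_results(pending.result())             # drain the last batch
    pool.shutdown()
```

The point of the two read buffers and two result buffers is exactly this overlap: disk I/O on the MPE proceeds concurrently with alignment on the CPEs.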
Implementation
Succinct hash index
[Diagram: reference …CACATCGTAGCAT… with sampled seeds CACAT, ACATC, CATCG, ATCGT, TCGTA, CGTAG at positions 42, 45, 48, 51; the read is split into long seeds, each matched against the stored seeds]
- Seed length: l; reference length: r
- Store one seed for every s positions, so the number of stored seeds = (r - l + 1) / s
- Long seed length = l + s - 1
- Divide the read into long seeds; divide each long seed into s seeds
- Use the seeds to find candidate positions
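The sampling trick can be sketched as follows (a simplified model under our own naming, not SWMapper's data layout): because the reference stores one seed every s positions, splitting a long seed of length l + s - 1 into its s overlapping seeds guarantees at least one of them lines up with a sampled reference seed.

```python
# Sketch of the succinct (sampled) hash index: store one length-l seed
# every s positions; query with the s overlapping seeds of a long seed.
def build_sampled_index(ref, l, s):
    index = {}
    for pos in range(0, len(ref) - l + 1, s):        # one seed per s positions
        index.setdefault(ref[pos:pos + l], []).append(pos)
    return index

def candidate_positions(read, index, l, s):
    long_len = l + s - 1                             # long seed length
    cands = set()
    for off in range(0, len(read) - long_len + 1, long_len):
        for shift in range(s):                       # the s seeds of one long seed
            seed = read[off + shift:off + shift + l]
            for pos in index.get(seed, []):
                cands.add(pos - off - shift)         # implied read start
    return sorted(cands)
```

The index is roughly s times smaller than a full k-mer index, at the cost of probing s seeds per long seed at query time.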
Implementation
Build hash index
[Diagram: processes 0 and 1 each read the reference from the file system and build a partial index of hash values, offsets, and location lists; the merged index is written back to the file system]
- Process 0 and process 1 build their own hash index
- The two processes merge them into a single hash index
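The merge step amounts to concatenating, per hash value, the location lists from each partial index. A minimal sketch (the dict-of-lists representation is our simplification of the offsets/locations layout in the figure):

```python
# Hedged sketch of merging per-process partial hash indexes
# (hash value -> location list) into one global index.
def merge_indexes(partials):
    merged = {}
    for part in partials:                  # one dict per MPI process
        for h, locs in part.items():
            merged.setdefault(h, []).extend(locs)
    for h in merged:
        merged[h].sort()                   # keep locations ordered per bucket
    return merged
```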
Implementation
CPE Workflow
- Get parameters from the MPE
- local_id = faaw(&global_id, 1)
- While local_id < read_number:
  - Get the read from memory; encode the read into bit-vectors
  - Divide the read into seeds; choose e + 1 long seeds to compute
  - Get the candidate locations; remove duplicate locations
  - Use the long seeds to filter candidate locations
  - Call banded Myers to verify the locations
  - Generate results and put the results to memory
  - local_id = faaw(&global_id, 1)
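The faaw (atomic fetch-and-add word) call is what load-balances the 64 CPEs: each core claims the next unprocessed read from a shared counter, so fast cores naturally take more work. A hedged sketch using Python threads in place of CPEs (the lock-based counter models the hardware atomic):

```python
# Sketch of dynamic work distribution via an atomic fetch-and-add
# counter, as the CPEs do with faaw(&global_id, 1).
import threading

class AtomicCounter:
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def faaw(self, inc=1):
        """Fetch-and-add: atomically add inc, return the old value."""
        with self._lock:
            old = self._value
            self._value += inc
            return old

def worker(counter, reads, process, out):
    """One 'CPE': repeatedly claim the next read until none remain."""
    local_id = counter.faaw()
    while local_id < len(reads):
        out[local_id] = process(reads[local_id])
        local_id = counter.faaw()
```

Compared to a static partition, this scheme avoids idle cores when reads take unequal time to verify.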
Implementation
Removing Duplicate Locations
[Diagram: per-seed location lists (s1–s4) streamed block-by-block through a locations buffer into a minimum heap]
- Load one block of locations for each seed
- Use one location of each seed to create a minimum heap
- Pop the smallest location
- Push the next location from the same buffer, discarding locations that already exist in the heap
- Load the next block when a block has no more locations
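This is a k-way merge with deduplication. A compact sketch with Python's `heapq` (it discards a duplicate when it is popped rather than when it is pushed, which yields the same unique, sorted output; block-wise buffer reloading is omitted):

```python
# Sketch of duplicate-location removal: k-way merge of per-seed
# location lists through a minimum heap, emitting each location once.
import heapq

def merge_unique(location_lists):
    heap = [(lst[0], i, 0) for i, lst in enumerate(location_lists) if lst]
    heapq.heapify(heap)                   # one entry per non-empty seed list
    out = []
    while heap:
        loc, i, j = heapq.heappop(heap)   # pop the smallest location
        if not out or out[-1] != loc:     # discard duplicates
            out.append(loc)
        if j + 1 < len(location_lists[i]):
            heapq.heappush(heap, (location_lists[i][j + 1], i, j + 1))
    return out
```

With k seed lists of n locations in total, this runs in O(n log k), and only one block per list has to reside in the tiny LDM at a time.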
Implementation
Seed Filtration
[Diagram: a short seed of the read matches the reference at position 720, but the surrounding bases of the read's long seed do not]
- The short seed matches at position 720 (green), but the long seed does not match the read; thus the position is discarded
- We store the bases before and after the short seeds along with the locations in the hash index
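Because the flanking bases are stored in the index entry itself, a candidate can be rejected without another random access into the reference. A hedged sketch (one flanking base on each side; the exact number of stored bases and the entry layout in SWMapper may differ):

```python
# Sketch of seed filtration using flanking bases stored with each
# index entry (assumed layout: one base before and one after the seed).
def passes_filter(read, off, l, before, after):
    """The short seed matched at read[off:off+l]; keep the candidate
    only if the stored reference bases around the seed also match."""
    ok_before = off == 0 or read[off - 1] == before
    ok_after = off + l >= len(read) or read[off + l] == after
    return ok_before and ok_after
```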
Implementation
Vectorization of the banded Myers Algorithm
[Diagram: a 4x4 block of bit-vector words for sub-references 1–4 before and after transposition]
- Transposition of the data layout
- Each row represents contiguous memory
- Data of the same color corresponds to the bit-vectors of a sub-reference at different candidate locations
Implementation
Vectorization of the banded Myers Algorithm

Algorithm: SIMD pseudocode
Require: bit-vectors of sub-reference ref_hi, ref_lo; bit-vectors of read read_hi, read_lo
Ensure: edit distance err
 1: // get match of read and reference
 2: t1 = simd_vxorw(read_hi, ref_hi);
 3: t2 = simd_vxorw(read_lo, ref_lo);
 4: matchv = simd_vorw(t1, t2);
 5: // x = match | vn
 6: xv = simd_vorw(matchv, vnv);
 7: // d0 = vp + (x & vp)
 8: d0v = simd_vandw(xv, vpv);
 9: d0v = simd_vaddw(d0v, vpv);
10: // d0 = (d0 ^ vp) | x
11: d0v = simd_log3x(d0v, vpv, xv, table[0]);
12: // hn = vp & d0
13: hnv = simd_vandw(vpv, d0v);
14: // hp = vn | ~(vp | d0)
15: hpv = simd_log3x(vnv, vpv, d0v, table[1]);
16: // x = d0 >> 1
17: xv = simd_vsrlw(d0v, 1);
18: // vn = x & hp
19: vnv = simd_vandw(xv, hpv);
20: // vp = hn | ~(x | hp)
21: vpv = simd_log3x(hnv, xv, hpv, table[2]);
22: // tmp_res = (d0 & 1) ^ 1
23: tmp_resv = simd_log3x(d0v, 1, 1, table[3]);
24: // err = err + tmp_res
25: errv = simd_vaddw(errv, tmp_resv);

Example: using the instruction log3x to compute e = (a & b) | c
Register a: 1 0 1 0 1 0 1 1 1 0 0 0
Register b: 0 0 1 1 1 1 0 0 1 1 0 1
Register c: 1 1 0 0 1 0 0 1 0 0 1 1
Register d: 1 1 1 0 1 0 1 0   (truth table)
Register e: 1 1 1 0 1 0 0 1 1 0 1 1   (result)
- The truth table d of size 8 = 2^3 is precomputed; e.g. the value of the first bit in d is computed by ((0 & 1) | 1)
- At each bit position, the bits of a, b, c form a 3-bit index that selects the corresponding bit of d, producing the final result stored in e
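For reference, the scalar recurrence that this SIMD code vectorizes can be written in a few lines of Python. This is the classic semi-global Myers bit-vector algorithm (free start in the text, tracking the score of the last pattern row), not SWMapper's banded variant, but it uses the same vp/vn/d0/hp/hn update.

```python
# Scalar sketch of Myers' bit-vector edit-distance recurrence
# (semi-global variant: pattern vs. any substring of text).
def myers_min_edit(pattern, text):
    m = len(pattern)
    mask = (1 << m) - 1
    peq = {}                                   # per-character match bit-vectors
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv, score = mask, 0, m                 # vp, vn, current last-row score
    best = m
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv                           # x = match | vn
        xh = ((((eq & pv) + pv) & mask) ^ pv) | eq   # d0
        ph = mv | ~(xh | pv) & mask            # hp = vn | ~(d0 | vp)
        mh = pv & xh                           # hn = vp & d0
        if ph & (1 << (m - 1)):                # score of the last row
            score += 1
        elif mh & (1 << (m - 1)):
            score -= 1
        ph = (ph << 1) & mask                  # shift horizontal deltas down
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask          # new vp
        mv = ph & xv                           # new vn
        best = min(best, score)
    return best
```

The SIMD version on the slide processes four such recurrences at once (one per sub-reference) and fuses each pair of logic operations into a single `log3x` table-lookup instruction.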
Outline Introduction and Background Implementation Performance Evaluation Conclusion and Future Work
Performance evaluation
Datasets
Table 1: Datasets used for performance evaluation.
Dataset   NCBI Acc. No.   Read length (bps)   Number of reads
D1        N/A             100                 100,000
R1        ERR013135       108                 20M

Intel machine configuration:
- CPU: Intel Xeon W-2123v3 (4 cores in total, operating at 3.6 GHz)
- Memory: 16 GB
- OS: Ubuntu 16.04 with Linux kernel 4.4.0-28-generic
Performance evaluation
Accuracy
Table 2: Results using different accuracy measures based on the Rabema benchmark.
Tools        Mapped Reads   All[%]   All-best[%]   Any-best[%]
RazerS3      99993          100.00   100.00        100.00
BitMapper2   99993          100.00   100.00        100.00
Hobbes3      99993          100.00   100.00        100.00
S-Aligner    99993          99.98    99.99         100.00
SWMapper     99993          100.00   100.00        100.00
Performance evaluation
Hash Index Construction Time
Table 3: Hash index construction times (in seconds) for the first chromosome of GRCh38.
Tools     BitMapper2   Hobbes3.0   S-Aligner   SWMapper
Time(s)   54           47          238         37

Table 4: Strong scaling test for index construction of SWMapper using a full human reference genome (GRCh38) with different numbers of MPI processes on Sunway TaihuLight.
Processes    1      2     4     8     16
Time(s)      1632   857   443   257   63
Efficiency   100%   95%   92%   79%   63%
Performance evaluation
[Figure: Runtimes (in seconds) on a single CG after incrementally applying different optimization steps.]
[Figure: Runtime comparison (in seconds) for mapping all reads of Dataset R1 to the first chromosome of GRCh38. Tools are executed on a Xeon CPU (green) or on a single SW26010 (red).]
Performance evaluation
Strong scaling
Table 5: Strong scaling test for mapping all reads of Dataset R1 to a whole human genome reference (GRCh38) using different numbers of MPI processes on Sunway TaihuLight.
Processes    1      2      4      8     16    32    64    128
Time(s)      6635   3368   1705   867   437   230   124   70
Efficiency   100%   99%    97%    96%   95%   90%   83%   74%
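The efficiency rows in Tables 4 and 5 follow the usual strong-scaling formula, efficiency(p) = T(1) / (p * T(p)):

```python
# Parallel efficiency as used in the strong-scaling tables:
# the speedup T(1)/T(p) divided by the process count p.
def efficiency(t1, p, tp):
    return t1 / (p * tp)

# e.g. Table 5 at 4 processes: 6635 / (4 * 1705) ~ 0.97, i.e. 97%
```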
Outline Introduction and Background Implementation Performance Evaluation Conclusion and Future Work