SWMapper: Scalable Read Mapper on Sunway TaihuLight
Kai Xu 1,2, Xiaohui Duan 1,2, Xiangxu Meng 1, Xin Li 1, Bertil Schmidt 3, Weiguo Liu 1,2
1 Shandong University  2 National Supercomputing Center in Wuxi  3 Johannes Gutenberg University
Outline Introduction and Background Implementation Performance Evaluation Conclusion and Future Work
Introduction
Related work
- trie-based: BWA, Bowtie, …
- k-mer-based: BWA-MEM, Bowtie2, GEM, FastHASH, mrsFAST, RazerS3, FEM, S-Aligner, BitMapper2, Hobbes3, …
- read mappers on compute clusters: BWA, pMAP, SEAL, BigBWA, SparkBWA, parSRA, CUSHAW3-UPC++, mer-Aligner, S-Aligner, …
Introduction
Seed-and-extend strategy
1. Index the k-mers of the reference into a hash table or other similar data structure
   Reference: ACGTACGTAGCATGCATCGATCGTACGCATCGAT
   ACGT -> 0xE4, CGTA -> 0x39, GTAC -> 0x4E, …
2. Extract and hash the k-mers ("seeds") of each read
   Read: ACGTCCGTAGCATGCT
   ACGT -> 0xE4, CCGT -> 0xE5, AGCA -> 0x18, …
3. Probe the hash table to find candidate locations
   0xE4: 1, 234, 456, 1246, 17983
   0xE5: 89, 284, 956, 2246, 27983
   0x18: 9, 423, 929, 3645, 47228
4. Compute an alignment ("extend") at the candidate locations
   Reference: ACGTACGTAGCATGCATCGATCGTACGCATCGAT
              ||| | ||||||||||
   Read:      ACGTCCGTAGCATGCT
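The four steps above can be sketched in a few lines of Python. This is illustrative only: the names `build_index` and `map_read` are our own, and the "extend" step here is a simple Hamming-distance check rather than the full alignment a real mapper performs.

```python
# Minimal sketch of the seed-and-extend strategy (illustrative only).
from collections import defaultdict

def build_index(reference, k):
    """Step 1: index every k-mer of the reference in a hash table."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_err):
    """Steps 2-4: hash seeds of the read, probe the index, extend."""
    hits = []
    for off in range(0, len(read) - k + 1, k):   # non-overlapping seeds
        seed = read[off:off + k]
        for loc in index.get(seed, []):          # step 3: candidate locations
            start = loc - off                    # implied read start on the reference
            if start < 0 or start + len(read) > len(reference):
                continue
            # step 4: "extend" = count mismatches over the full read
            err = sum(a != b for a, b in zip(read, reference[start:start + len(read)]))
            if err <= max_err:
                hits.append((start, err))
    return sorted(set(hits))

ref = "ACGTACGTAGCATGCATCGATCGTACGCATCGAT"
read = "ACGTCCGTAGCATGCT"
print(map_read(read, ref, build_index(ref, 4), k=4, max_err=2))  # -> [(0, 2)]
```

Note how the seed at read offset 0 and the seed at offset 8 both point to the same alignment start, which is why duplicate candidate locations must be removed (a later slide addresses exactly this).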
Outline Introduction and Background Implementation Performance Evaluation Conclusion and Future Work
Implementation
SW26010 Architecture
[Diagram: four core groups (CGs) connected by a network on chip (NoC); each CG has one MPE, one 8*8 CPE cluster, and its own memory controller (MC) and memory, accessed via DMA]
- Limited LDM: just 64 KB per CPE
- One MPI process can attach to one core group (CG)
- Latency from CPE to memory is hidden using the DMA transfer pattern
- Memory size of one CG: 8 GB
- Memory bandwidth of one CG: up to 34 GB/s
Implementation
MPE Workflow
- Build the hash index; malloc two buffers each for reads and results; init rds_id and res_id
- Load one batch of reads into read-buffer rds_id; call athread_spawn to start CPE alignment; update rds_id
- While there are still reads:
  - Load the next batch of reads into read-buffer rds_id
  - Call athread_join; update rds_id; call athread_spawn to start CPE alignment
  - Write result-buffer res_id to disk; update res_id
- Call athread_join; write result-buffer res_id to disk; update res_id
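The double-buffering loop above can be modeled with a worker thread standing in for the CPE cluster. This is a hedged sketch, not SWMapper's code: `align_batch` and `write_results` are hypothetical stand-ins, and a Python future replaces the athread_spawn/athread_join pair.

```python
# Hedged sketch of the MPE double-buffering loop; a worker thread
# models the CPE cluster, a future models athread_spawn/athread_join.
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(batches, align_batch, write_results):
    """While the 'CPEs' align batch i, the main thread writes the
    results of batch i-1 and loads batch i+1."""
    pool = ThreadPoolExecutor(max_workers=1)        # stands in for the CPE cluster
    pending, prev = None, None
    for batch in batches:                           # load one batch of reads
        if pending is not None:
            prev = pending.result()                 # "athread_join" for the previous batch
        pending = pool.submit(align_batch, batch)   # "athread_spawn" for this batch
        if prev is not None:
            write_results(prev)                     # write result buffer while CPEs run
            prev = None
    if pending is not None:
        write_results(pending.result())             # drain the last batch
    pool.shutdown()
```

The point of the two read buffers and two result buffers is exactly this overlap: disk I/O on the MPE proceeds concurrently with alignment on the CPEs.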
Implementation
Succinct hash index
[Diagram: reference …CACATCGTAGCAT… with sampled seeds CACAT, ACATC, CATCG, ATCGT, TCGTA, CGTAG at positions 42, 45, 48, 51; the read is split into long seeds, each matched against the stored seeds]
- Seed length: l; reference length: r
- Store one seed for every s positions, so the number of stored seeds = (r - l + 1) / s
- Long seed length = l + s - 1
- Divide the read into long seeds; divide each long seed into s seeds
- Use the seeds to find candidate positions
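The sampling trick can be sketched as follows (a simplified model under our own naming, not SWMapper's data layout): because the reference stores one seed every s positions, splitting a long seed of length l + s - 1 into its s overlapping seeds guarantees at least one of them lines up with a sampled reference seed.

```python
# Sketch of the succinct (sampled) hash index: store one length-l seed
# every s positions; query with the s overlapping seeds of a long seed.
def build_sampled_index(ref, l, s):
    index = {}
    for pos in range(0, len(ref) - l + 1, s):        # one seed per s positions
        index.setdefault(ref[pos:pos + l], []).append(pos)
    return index

def candidate_positions(read, index, l, s):
    long_len = l + s - 1                             # long seed length
    cands = set()
    for off in range(0, len(read) - long_len + 1, long_len):
        for shift in range(s):                       # the s seeds of one long seed
            seed = read[off + shift:off + shift + l]
            for pos in index.get(seed, []):
                cands.add(pos - off - shift)         # implied read start
    return sorted(cands)
```

The index is roughly s times smaller than a full k-mer index, at the cost of probing s seeds per long seed at query time.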
Implementation
Build hash index
[Diagram: processes 0 and 1 each read the reference from the file system and build a partial index of hash values, offsets, and location lists; the merged index is written back to the file system]
- Process 0 and process 1 build their own hash index
- The two processes merge them into a single hash index
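The merge step amounts to concatenating, per hash value, the location lists from each partial index. A minimal sketch (the dict-of-lists representation is our simplification of the offsets/locations layout in the figure):

```python
# Hedged sketch of merging per-process partial hash indexes
# (hash value -> location list) into one global index.
def merge_indexes(partials):
    merged = {}
    for part in partials:                  # one dict per MPI process
        for h, locs in part.items():
            merged.setdefault(h, []).extend(locs)
    for h in merged:
        merged[h].sort()                   # keep locations ordered per bucket
    return merged
```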
Implementation
CPE Workflow
- Get parameters from the MPE
- local_id = faaw(&global_id, 1)
- While local_id < read_number:
  - Get the read from memory; encode the read into bit-vectors
  - Divide the read into seeds; choose e + 1 long seeds to compute
  - Get the candidate locations; remove duplicate locations
  - Use the long seeds to filter candidate locations
  - Call banded Myers to verify the locations
  - Generate results and put the results to memory
  - local_id = faaw(&global_id, 1)
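The faaw (atomic fetch-and-add word) call is what load-balances the 64 CPEs: each core claims the next unprocessed read from a shared counter, so fast cores naturally take more work. A hedged sketch using Python threads in place of CPEs (the lock-based counter models the hardware atomic):

```python
# Sketch of dynamic work distribution via an atomic fetch-and-add
# counter, as the CPEs do with faaw(&global_id, 1).
import threading

class AtomicCounter:
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def faaw(self, inc=1):
        """Fetch-and-add: atomically add inc, return the old value."""
        with self._lock:
            old = self._value
            self._value += inc
            return old

def worker(counter, reads, process, out):
    """One 'CPE': repeatedly claim the next read until none remain."""
    local_id = counter.faaw()
    while local_id < len(reads):
        out[local_id] = process(reads[local_id])
        local_id = counter.faaw()
```

Compared to a static partition, this scheme avoids idle cores when reads take unequal time to verify.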
Implementation
Removing Duplicate Locations
[Diagram: per-seed location lists (s1–s4) streamed block-by-block through a locations buffer into a minimum heap]
- Load one block of locations for each seed
- Use one location of each seed to create a minimum heap
- Pop the smallest location
- Push the next location from the same buffer, discarding locations that already exist in the heap
- Load the next block when a block has no more locations
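This is a k-way merge with deduplication. A compact sketch with Python's `heapq` (it discards a duplicate when it is popped rather than when it is pushed, which yields the same unique, sorted output; block-wise buffer reloading is omitted):

```python
# Sketch of duplicate-location removal: k-way merge of per-seed
# location lists through a minimum heap, emitting each location once.
import heapq

def merge_unique(location_lists):
    heap = [(lst[0], i, 0) for i, lst in enumerate(location_lists) if lst]
    heapq.heapify(heap)                   # one entry per non-empty seed list
    out = []
    while heap:
        loc, i, j = heapq.heappop(heap)   # pop the smallest location
        if not out or out[-1] != loc:     # discard duplicates
            out.append(loc)
        if j + 1 < len(location_lists[i]):
            heapq.heappush(heap, (location_lists[i][j + 1], i, j + 1))
    return out
```

With k seed lists of n locations in total, this runs in O(n log k), and only one block per list has to reside in the tiny LDM at a time.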
Implementation
Seed Filtration
[Diagram: a short seed of the read matches the reference at position 720, but the surrounding bases of the read's long seed do not]
- The short seed matches at position 720 (green), but the long seed does not match the read; thus the position is discarded
- We store the bases before and after the short seeds along with the locations in the hash index
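Because the flanking bases are stored in the index entry itself, a candidate can be rejected without another random access into the reference. A hedged sketch (one flanking base on each side; the exact number of stored bases and the entry layout in SWMapper may differ):

```python
# Sketch of seed filtration using flanking bases stored with each
# index entry (assumed layout: one base before and one after the seed).
def passes_filter(read, off, l, before, after):
    """The short seed matched at read[off:off+l]; keep the candidate
    only if the stored reference bases around the seed also match."""
    ok_before = off == 0 or read[off - 1] == before
    ok_after = off + l >= len(read) or read[off + l] == after
    return ok_before and ok_after
```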
Implementation
Vectorization of the banded Myers Algorithm
[Diagram: a 4x4 block of bit-vector words for sub-references 1–4 before and after transposition]
- Transposition of the data layout
- Each row represents contiguous memory
- Data of the same color corresponds to the bit-vectors of a sub-reference at different candidate locations
Implementation
Vectorization of the banded Myers Algorithm

Algorithm: SIMD pseudocode
Require: bit-vectors of sub-reference ref_hi, ref_lo; bit-vectors of read read_hi, read_lo
Ensure: edit distance err
 1: // get match of read and reference
 2: t1 = simd_vxorw(read_hi, ref_hi);
 3: t2 = simd_vxorw(read_lo, ref_lo);
 4: matchv = simd_vorw(t1, t2);
 5: // x = match | vn
 6: xv = simd_vorw(matchv, vnv);
 7: // d0 = vp + (x & vp)
 8: d0v = simd_vandw(xv, vpv);
 9: d0v = simd_vaddw(d0v, vpv);
10: // d0 = (d0 ^ vp) | x
11: d0v = simd_log3x(d0v, vpv, xv, table[0]);
12: // hn = vp & d0
13: hnv = simd_vandw(vpv, d0v);
14: // hp = vn | ~(vp | d0)
15: hpv = simd_log3x(vnv, vpv, d0v, table[1]);
16: // x = d0 >> 1
17: xv = simd_vsrlw(d0v, 1);
18: // vn = x & hp
19: vnv = simd_vandw(xv, hpv);
20: // vp = hn | ~(x | hp)
21: vpv = simd_log3x(hnv, xv, hpv, table[2]);
22: // tmp_res = (d0 & 1) ^ 1
23: tmp_resv = simd_log3x(d0v, 1, 1, table[3]);
24: // err = err + tmp_res
25: errv = simd_vaddw(errv, tmp_resv);

Example: using the instruction log3x to compute e = (a & b) | c
Register a: 1 0 1 0 1 0 1 1 1 0 0 0
Register b: 0 0 1 1 1 1 0 0 1 1 0 1
Register c: 1 1 0 0 1 0 0 1 0 0 1 1
Register d: 1 1 1 0 1 0 1 0   (truth table)
Register e: 1 1 1 0 1 0 0 1 1 0 1 1   (result)
- The truth table d of size 8 = 2^3 is precomputed; e.g. the value of the first bit in d is computed by ((0 & 1) | 1)
- At each bit position, the bits of a, b, c form a 3-bit index that selects the corresponding bit of d, producing the final result stored in e
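For reference, the scalar recurrence that this SIMD code vectorizes can be written in a few lines of Python. This is the classic semi-global Myers bit-vector algorithm (free start in the text, tracking the score of the last pattern row), not SWMapper's banded variant, but it uses the same vp/vn/d0/hp/hn update.

```python
# Scalar sketch of Myers' bit-vector edit-distance recurrence
# (semi-global variant: pattern vs. any substring of text).
def myers_min_edit(pattern, text):
    m = len(pattern)
    mask = (1 << m) - 1
    peq = {}                                   # per-character match bit-vectors
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv, score = mask, 0, m                 # vp, vn, current last-row score
    best = m
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv                           # x = match | vn
        xh = ((((eq & pv) + pv) & mask) ^ pv) | eq   # d0
        ph = mv | ~(xh | pv) & mask            # hp = vn | ~(d0 | vp)
        mh = pv & xh                           # hn = vp & d0
        if ph & (1 << (m - 1)):                # score of the last row
            score += 1
        elif mh & (1 << (m - 1)):
            score -= 1
        ph = (ph << 1) & mask                  # shift horizontal deltas down
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask          # new vp
        mv = ph & xv                           # new vn
        best = min(best, score)
    return best
```

The SIMD version on the slide processes four such recurrences at once (one per sub-reference) and fuses each pair of logic operations into a single `log3x` table-lookup instruction.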
Outline Introduction and Background Implementation Performance Evaluation Conclusion and Future Work
Performance evaluation
Datasets
Table 1: Datasets used for performance evaluation.
Dataset   NCBI Acc. No.   Read length (bps)   Number of reads
D1        N/A             100                 100,000
R1        ERR013135       108                 20M

Intel machine configuration:
- CPU: Intel Xeon W-2123v3 (4 cores in total, operating at 3.6 GHz)
- Memory: 16 GB
- OS: Ubuntu 16.04 with Linux kernel 4.4.0-28-generic
Performance evaluation
Accuracy
Table 2: Results using different accuracy measures based on the Rabema benchmark.
Tools        Mapped Reads   All[%]   All-best[%]   Any-best[%]
RazerS3      99993          100.00   100.00        100.00
BitMapper2   99993          100.00   100.00        100.00
Hobbes3      99993          100.00   100.00        100.00
S-Aligner    99993          99.98    99.99         100.00
SWMapper     99993          100.00   100.00        100.00
Performance evaluation
Hash Index Construction Time
Table 3: Hash index construction times (in seconds) for the first chromosome of GRCh38.
Tools     BitMapper2   Hobbes3.0   S-Aligner   SWMapper
Time(s)   54           47          238         37

Table 4: Strong scaling test for index construction of SWMapper using a full human reference genome (GRCh38) with different numbers of MPI processes on Sunway TaihuLight.
Processes    1      2     4     8     16
Time(s)      1632   857   443   257   63
Efficiency   100%   95%   92%   79%   63%
Performance evaluation
[Figure: Runtimes (in seconds) on a single CG after incrementally applying different optimization steps.]
[Figure: Runtime comparison (in seconds) for mapping all reads of Dataset R1 to the first chromosome of GRCh38. Tools are executed on a Xeon CPU (green) or on a single SW26010 (red).]
Performance evaluation
Strong scaling
Table 5: Strong scaling test for mapping all reads of Dataset R1 to a whole human genome reference (GRCh38) using different numbers of MPI processes on Sunway TaihuLight.
Processes    1      2      4      8     16    32    64    128
Time(s)      6635   3368   1705   867   437   230   124   70
Efficiency   100%   99%    97%    96%   95%   90%   83%   74%
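The efficiency rows in Tables 4 and 5 follow the usual strong-scaling formula, efficiency(p) = T(1) / (p * T(p)):

```python
# Parallel efficiency as used in the strong-scaling tables:
# the speedup T(1)/T(p) divided by the process count p.
def efficiency(t1, p, tp):
    return t1 / (p * tp)

# e.g. Table 5 at 4 processes: 6635 / (4 * 1705) ~ 0.97, i.e. 97%
```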
Outline Introduction and Background Implementation Performance Evaluation Conclusion and Future Work