S6636 - GEM3: CPU-GPU Heterogeneous DNA Sequence Alignment for Scalable Read Sizes
Alejandro Chacón (UAB), Santiago Marco (CNAG-CRG), Juan Carlos Moure (UAB), Paolo Ribeca (CNAG-CRG), Antonio Espinosa (UAB) 1
Genomic Sequencing Applications
Direct applications:
• Personalized medicine
• Diagnosis and intervention
• Drug development and usage
• Cancer genomics
• Genome editing (CRISPR)
• Large-scale population analysis
• Phylogenetics
• In vitro meat
Example of benefits in diagnosis: detecting cancer with a blood test.
→ Earlier detection.
→ Non-invasive method.
→ (ctDNA + deep sequencing) 2
Genomic Sequencing: Mapping
Mapping is a process widely used in bioinformatics analysis.
GOAL: correct the sequencing errors and recover the individual mutations of the sample.
[Figure: the approximate mapping process aligns the queries (tera-sequences, each position sequenced x20 - x60) against the reference genome (Gbases); the repeated coverage separates sequencing errors from true mutations.] 3
Growing Sequencing Data
The falling sequencing cost → democratizes personalized medicine, and deep sequencing exposes a computationally demanding (HPC) problem. 4
Seed & Extend Mapping Strategy
Input (FASTQ file) → Output (SAM file).
Phase 1, SEEDING (text search): seeds extracted from the query are searched in the human genome to obtain candidate positions → reduces the computational cost.
Phase 2, EXTENDING (text comparison): each candidate genome region is compared against the query; OK → reported alignment, NO → filtered-out position → reduces the reported positions.
A minimal host-side sketch of the two phases follows below. 5
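The following is a hedged, minimal sketch of the seed & extend idea described above: a naive substring scan stands in for GEM3's index-based seeding, and a plain mismatch count stands in for the real gapped verification (Bit-Parallel Myers / Smith-Waterman). All function names are illustrative, not GEM3's API.

```cuda
#include <cstdint>
#include <string>
#include <vector>

// Phase 1 (SEEDING): locate exact occurrences of a seed in the genome.
// Stand-in: linear scan; the real mapper queries a precomputed index.
static std::vector<uint64_t> locate_seed(const std::string& genome, const std::string& seed) {
    std::vector<uint64_t> hits;
    for (size_t p = genome.find(seed); p != std::string::npos; p = genome.find(seed, p + 1))
        hits.push_back(p);
    return hits;
}

// Phase 2 (EXTENDING): verify the whole query at a candidate position.
// Stand-in: mismatch count only; GEM3 uses Bit-Parallel Myers / Smith-Waterman.
static bool verify(const std::string& genome, const std::string& query,
                   uint64_t pos, int max_errors) {
    if (pos + query.size() > genome.size()) return false;
    int errors = 0;
    for (size_t i = 0; i < query.size() && errors <= max_errors; ++i)
        errors += (genome[pos + i] != query[i]);
    return errors <= max_errors;
}

// Seed & extend: exact seed hits anchor candidate positions, which are then verified.
std::vector<uint64_t> map_query(const std::string& genome, const std::string& query,
                                int seed_len, int max_errors) {
    std::vector<uint64_t> accepted;
    for (size_t off = 0; off + seed_len <= query.size(); off += seed_len)
        for (uint64_t hit : locate_seed(genome, query.substr(off, seed_len)))
            if (hit >= off && verify(genome, query, hit - off, max_errors))
                accepted.push_back(hit - off);   // OK: reported alignment
    return accepted;                             // the rest were filtered out
}
```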
Traditional Mapper: Internal Workflow
SEEDING: Exact Search (index) → Decode Positions → candidates.
EXTENDING: Smith-Waterman on each candidate; OK → reported alignment, NO → filtered-out position (or NO MAP).
Mapping time (CNAG production workload, GEM): 340 h. 6
Introducing the new Mapper GEM3-GPU 7
Introducing GEM3: Internal Workflow
SEEDING: Exact Search (index), Approximate Search, Neighborhood Search → Decode Positions → candidates.
EXTENDING: K-mer Distance Filter, Bit-Parallel Myers, Smith-Waterman; OK → reported alignment, NO → filtered-out position (or NO MAP).
Mapping time (CNAG production workload): GEM 340 h → GEM3-GPU 40 min (~500x).
The queries present different sizes and error rates: the GEM3 mapper uses adaptive strategies to process them. A sketch of the k-mer filtering idea follows below. 8
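As an illustration of the K-mer Distance Filter stage, here is a hedged sketch of a generic k-mer counting filter (not GEM3's exact implementation): it counts how many k-mers of the query also occur in the candidate region and rejects candidates that, by the q-gram lemma, cannot align within the error budget. The function names, the 2-bit ACGT encoding and the dense counting table are assumptions of this sketch.

```cuda
#include <cstdint>
#include <string>
#include <vector>

// Cheap 2-bit ACGT code (A=0, C=1, T=2, G=3); other symbols collapse onto these.
static uint32_t kmer_code(const std::string& s, size_t pos, int k) {
    uint32_t code = 0;
    for (int i = 0; i < k; ++i)
        code = (code << 2) | ((s[pos + i] >> 1) & 0x3);
    return code;
}

// Returns false when the candidate cannot contain the query within max_errors,
// so it can be discarded before running Bit-Parallel Myers or Smith-Waterman.
// A small k (e.g. k <= 13) keeps the counting table modest.
bool kmer_filter_accepts(const std::string& query, const std::string& candidate,
                         int k, int max_errors) {
    if ((int)query.size() < k || (int)candidate.size() < k) return true;
    std::vector<uint16_t> counts(1u << (2 * k), 0);
    for (size_t i = 0; i + k <= candidate.size(); ++i)
        ++counts[kmer_code(candidate, i, k)];

    int shared = 0;                              // query k-mers seen in the candidate
    for (size_t i = 0; i + k <= query.size(); ++i)
        if (counts[kmer_code(query, i, k)] > 0) ++shared;

    const int query_kmers = (int)query.size() - k + 1;
    const int min_shared  = query_kmers - k * max_errors;   // q-gram lemma bound
    return shared >= min_shared;
}
```

Candidates rejected by this test never reach the more expensive alignment kernels, which is where most of the filtering benefit comes from.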
Sequencing Production Hybrid Nodes
Node: 2 x Intel Xeon E5-2640 CPUs (16 cores, 128 GB DRAM at 136 GB/s) + 2 x NVIDIA K80 GPUs (10K cores, 4 x 12 GB GDRAM at 960 GB/s), connected through 16 GB/s links.
Architectural characteristics (CPUs vs. GPUs):
• Performance: 2 TFLOPS vs. 18 TFLOPS (9x)
• Bandwidth: 136 GB/s vs. 960 GB/s (7x)
• Power efficiency: 8 GFLOPS/W vs. 30 GFLOPS/W (3.8x)
• Main memory: 128 GB vs. 4 x 12 GB (~10%)
• Threads: 48 HW threads vs. 106K HW threads
• Cache/thread: 1.25 MB vs. 60 Bytes
GPU algorithmic programming challenges:
• Explicit transfers
• Thread hierarchy (warps, blocks, ...)
• Limited space for data structures
• Extracting huge amounts of parallelism explicitly
• Severe memory constraints
9
GEM3: GPU Internal Workflow
GPU-accelerated stages: Exact Search (x15), Bit-Parallel Myers (x21) and Decode Positions (x15); the remaining stages run on the CPU.
GPU is dedicated to (1) the time-consuming stages and (2) the stages that map best onto the GPU architecture (specialized kernels for the common cases). 10
GEM3: GPU Algorithmic Challenges
1) Exposing massive parallelism (fine-grain parallelism & batch mode)
2) Algorithmic interactions for hybrid systems (managing latency- and throughput-oriented cores)
3) Reducing the memory requirements (specialized structures)
4) Regularizing the work (CPU-GPU collaboration & warp awareness)
5) Reducing the thread memory footprint (problem decomposition & thread-cooperative parallelization)
11
1) Expose Massive Parallelism (sequence life-cycle)
GEM stages: 1. Exact Search, 2. Approximate Search, 3. Neighborhood Search, 4. Decode, 5. K-mer, 6. BPM, 7. SW.
A 500nt query generates ~30 seeds, and each seed ~50 candidate occurrences. Parallelism available at stages 1 (Exact Search), 4 (Decode) and 6 (BPM):
• CPU: one query at a time.
• GPU, task = thread: 30 / 1,500 / 1,500 threads, not enough to saturate the GPU.
• GPU, batch mode + task = thread (1K queries): 30K / 1.5M / 1.5M threads, higher parallelism.
• GPU, batch mode + task = r threads (1K queries): 240K (30K x 8), 12M (1.5M x 8) and 6M (1.5M x 4) threads, much higher parallelism while keeping the same amount of memory.
A sketch of the batch-mode launch is shown below. 12
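To make the "batch mode + task = thread" idea concrete, here is a hedged CUDA sketch (not GEM3's actual kernel): a whole batch of queries is flattened into one task per pending seed occurrence, so a single launch exposes millions of independent threads. The Task layout and the decode placeholder are assumptions of this sketch.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

struct Task {                   // one pending seed occurrence to decode
    uint32_t query_id;
    uint32_t seed_id;
    uint64_t interval_pos;      // position inside the index interval
};

// Placeholder for the real decode step (sampled suffix-array lookup in GEM3);
// it just folds the task fields so that the sketch compiles and runs.
__device__ uint64_t decode_position(const Task& t) {
    return (uint64_t)t.query_id * 1000003u + t.seed_id + t.interval_pos;
}

// One thread per task: with 1K queries x ~30 seeds x ~50 occurrences this is
// roughly 1.5M threads per launch, enough to saturate the GPU.
__global__ void decode_batch(const Task* tasks, uint64_t* positions, uint32_t num_tasks) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_tasks) return;       // the last block may overshoot the task count
    positions[tid] = decode_position(tasks[tid]);
}

void launch_decode(const Task* d_tasks, uint64_t* d_positions, uint32_t num_tasks,
                   cudaStream_t stream) {
    const int block = 128;
    const int grid  = (int)((num_tasks + block - 1) / block);
    decode_batch<<<grid, block, 0, stream>>>(d_tasks, d_positions, num_tasks);
}
```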
2) GEM3: A Hybrid Processing System
[Diagram: a dynamic parallel I/O dispatcher reads input batches from disks 0..r; CPU threads 0..n (one per core) fill CPU buffers and move work through waiting, ready and transfer queues; the buffers are transferred to GPU buffers for kernel execution on GPU devices 0..m, and the results flow back through the CPU threads to the output.] 13
2) Algorithmic Interactions on Hybrid Systems
[Diagram: the pipeline runs from INPUT (FASTQ sequences) through stages 1-7 to OUTPUT (SAM alignments); stages 2, 4 and 6 execute on the GPU, each with its own input/output transfer buffers.]
Workflow dependences serialize the CPU work, the GPU work and the transfers. 14
2) Algorithmic Interactions on Hybrid Systems
[Diagram: the same pipeline with several in-flight buffers per GPU stage.]
Multi-buffering strategy: overlap the transfers in both directions with the CPU tasks and the GPU tasks (see the sketch below).
Breaking dependences: exploring the dependence buffers increases the parallelism. 15
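A hedged sketch of the multi-buffering idea, assuming pinned host buffers that rotate through fill, host-to-device copy, kernel, device-to-host copy and consume, with one CUDA stream per buffer. The buffer contents, the dummy kernel and the fill/consume stand-ins are illustrative, not GEM3's real stage interfaces.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>

constexpr int    NUM_BUFFERS = 4;           // several buffers keep all engines busy
constexpr size_t BUF_BYTES   = 32u << 20;   // 32 MB per buffer (illustrative)

__global__ void process(uint8_t* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] ^= 0x55;             // stand-in for a real mapping stage
}

// Stand-ins for the real I/O dispatcher: fill() fabricates 16 synthetic batches,
// consume() would write the finished alignments to the SAM output.
static int g_batches_left = 16;
static bool fill(uint8_t* host_buf, size_t* bytes) {
    if (g_batches_left-- <= 0) return false;
    *bytes = BUF_BYTES;
    memset(host_buf, 'A', *bytes);
    return true;
}
static void consume(const uint8_t*, size_t) { /* write SAM records here */ }

void run_pipeline() {
    uint8_t*     h[NUM_BUFFERS];
    uint8_t*     d[NUM_BUFFERS];
    size_t       len[NUM_BUFFERS] = {0};
    cudaStream_t s[NUM_BUFFERS];
    for (int b = 0; b < NUM_BUFFERS; ++b) {
        cudaMallocHost((void**)&h[b], BUF_BYTES);   // pinned memory enables async copies
        cudaMalloc((void**)&d[b], BUF_BYTES);
        cudaStreamCreate(&s[b]);
    }
    for (long batch = 0; ; ++batch) {
        int i = (int)(batch % NUM_BUFFERS);
        cudaStreamSynchronize(s[i]);                 // slot i finished its previous batch
        if (len[i]) consume(h[i], len[i]);           // drain the results it still holds
        if (!fill(h[i], &len[i])) { len[i] = 0; break; }   // no more input
        cudaMemcpyAsync(d[i], h[i], len[i], cudaMemcpyHostToDevice, s[i]);
        int grid = (int)((len[i] + 255) / 256);
        process<<<grid, 256, 0, s[i]>>>(d[i], len[i]);
        cudaMemcpyAsync(h[i], d[i], len[i], cudaMemcpyDeviceToHost, s[i]);
    }
    for (int b = 0; b < NUM_BUFFERS; ++b) {          // drain the batches still in flight
        cudaStreamSynchronize(s[b]);
        if (len[b]) consume(h[b], len[b]);
        cudaFreeHost(h[b]); cudaFree(d[b]); cudaStreamDestroy(s[b]);
    }
}
```

With several buffers in flight, both copy directions overlap with kernel execution and with the CPU-side fill/consume work, which is exactly the overlap described on this slide.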
2) Adapt the Application to Hybrid Systems
Static scheduler & system analyzer: collects the architecture specs and sets the memory allocation policies:
- Buffer tuning (size, number, ...)
- Buffer distribution between GPUs
- Enabling / disabling compute stages
- Partial allocation of data structures
Dynamic parallel I/O dispatcher: gets the next batch from disks 0..r, processes the batch of queries and stores the results.
Coupling GPUs with different memory restrictions: GPUs with limited memory space → flexible data allocation policy, where non-critical structures are either (1) accessed remotely from CPU memory or (2) their work is processed by the CPU. 16
3) Reducing the GPU Memory Requirements
CPU memory: 128 GB (10x more space) vs. GPU memory: 12 GB.
The preprocessed data structures are specialized for each system:
A) CPU & GPU collaboration:
   - GPU structures: highly optimized for the common cases (1), but they do not support all query operations.
   - CPU structures: support all query operations (2), but have large memory space requirements.
B) Special compression strategies to reduce the index size:
   - CPU: pays with more memory accesses (affordable thanks to its big caches).
   - GPU: pays with more compute (affordable thanks to its many compute resources).
Allocation policies + highly compacted indexes allow large-scale genomes on the GPU (a simple compaction example is sketched below).
(1) The GPU is better on regular workflow executions. (2) The CPU is better on latency-bound and divergent executions. 17
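As one deliberately simple example of index compaction, the sketch below packs the reference text at 2 bits per base, so a ~3.1 Gbase human genome takes roughly 0.78 GB instead of ~3.1 GB. This is a generic illustration under that assumption, not GEM3's actual on-device data structures.

```cuda
#include <cstdint>
#include <string>
#include <vector>

// 2-bit encode: A=0, C=1, T=2, G=3 (cheap ASCII trick); N is mapped onto A here
// and would be tracked separately in a real mapper.
static inline uint64_t base_code(char c) { return (c >> 1) & 0x3; }

// Host side: pack the genome into 32 bases per 64-bit word, upload once with
// cudaMemcpy, and reuse the packed buffer from every kernel.
std::vector<uint64_t> pack_reference(const std::string& genome) {
    std::vector<uint64_t> packed((genome.size() + 31) / 32, 0);
    for (size_t i = 0; i < genome.size(); ++i)
        packed[i / 32] |= base_code(genome[i]) << (2 * (i % 32));
    return packed;
}

// Matching device-side accessor: one 64-bit load yields 32 reference bases.
__device__ inline uint32_t get_base(const uint64_t* packed, uint64_t pos) {
    return (uint32_t)((packed[pos / 32] >> (2 * (pos % 32))) & 0x3);
}
```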
4) Regularize the Work
Work and parallelism are irregular along the pipeline (GPUs favor regular work).
[Diagram: batch mode + fine parallelization for 1K queries at stages 1 (Exact Search, 30K x 8 threads), 4 (Decode, 1.5M x 8 threads) and 6 (BPM, 1.5M x 4 threads); around each GPU stage the CPU divides the seeds (filtering Ns), expands the intervals (processing Ns), splits the BPM tasks (adapting the irregular work, breaking dependencies) and joins the BPM results (reconstructing the results).]
A) The CPU helps to regularize the GPU work (a split/join sketch follows below):
   · The GPU processes the common cases → the corner cases are relegated to the CPU.
   · Problem decomposition for the GPU → the CPU takes care of splitting the work into smaller problems.
B) Fine-grain parallelism (thread-cooperation strategies):
   · Several threads working on the same element → helps to regularize the work size. 18
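A hedged host-side sketch of the split/join pattern from point A): long, irregular candidate regions are cut into fixed-size overlapping tiles so that every GPU task has the same shape, and the per-tile scores are joined back afterwards. The tile size, overlap and data layout are assumptions of this sketch.

```cuda
#include <cstdint>
#include <vector>
#include <utility>
#include <algorithm>

struct Tile { uint32_t candidate_id; uint64_t text_begin; uint32_t text_len; };

constexpr uint32_t TILE_LEN = 1024;   // regular work unit sent to the GPU
constexpr uint32_t OVERLAP  = 100;    // >= query length, so no alignment is lost

// CPU side: split each (begin, end) candidate region into regular tiles.
std::vector<Tile> split_candidates(const std::vector<std::pair<uint64_t, uint64_t>>& cands) {
    std::vector<Tile> tiles;
    for (size_t id = 0; id < cands.size(); ++id) {
        uint64_t begin = cands[id].first, end = cands[id].second;
        for (uint64_t p = begin; p < end; p += TILE_LEN - OVERLAP)
            tiles.push_back({(uint32_t)id, p, (uint32_t)std::min<uint64_t>(TILE_LEN, end - p)});
    }
    return tiles;    // one GPU task per tile, all with (almost) the same size
}

// CPU side: the best score of a candidate is the minimum over its tiles.
std::vector<int> join_results(const std::vector<Tile>& tiles,
                              const std::vector<int>& tile_scores, size_t num_cands) {
    std::vector<int> best(num_cands, INT32_MAX);
    for (size_t t = 0; t < tiles.size(); ++t)
        best[tiles[t].candidate_id] = std::min(best[tiles[t].candidate_id], tile_scores[t]);
    return best;
}
```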
5) Reduce the Thread Memory Footprint
Example: a K80 running 27K threads offers, per thread: L2 cache 60 B, local memory 64 B, registers 256 B, main memory 470 KB.
If the thread memory footprint does not fit in the cache memories:
· Memory cache pressure issues: the GPU memory traffic increases.
· GPU resource allocation issues: the thread occupancy is reduced.
Memory constraint: kernels with large memory footprints (bioinformatic algorithms are an example) MUST be re-thought to scale on GPUs, trading working set (bytes/thread) against thread parallelism (# threads). 19
5) Reduce the Thread Memory Footprint
(A) BPM, task-parallel (1 thread - 1 task): each thread computes the Levenshtein distance with Myers' algorithm for the query against one candidate; the memory per thread grows with the query, from 202 Bytes at |q|=100 to 1580 Bytes at |q|=1000.
(B) BPM, thread-cooperative (r threads - 1 task): r threads cooperate on one candidate; the memory per thread stays small (164-202 Bytes), and a larger query simply uses more threads (all of it dynamic & flexible).
The thread-cooperative version requires complex register communication, special data layouts, data regularization and an explicit distribution of the work, but the query is read just once and ALL the local data fits in registers (avoiding memory re-accesses). The bit-parallel core that each thread runs is sketched below. 20
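For reference, here is a hedged sketch of the Myers bit-parallel (BPM) core behind these numbers: a single-thread version that keeps one 64-bit column entirely in registers and handles queries of up to 64 bases. GEM3's thread-cooperative kernel splits longer queries into several such columns and exchanges the boundary bits between the threads of a warp (e.g. with register shuffles), which is not shown here; the 2-bit ACGT encoding is also an assumption of this sketch.

```cuda
#include <cstdint>

// Semi-global edit distance (free text start/end): returns the minimum number
// of errors of the query against any substring of the candidate text.
__host__ __device__ int bpm_min_errors(const char* query, int m,
                                       const char* text, int n) {
    // Peq[c]: bit i is set when query[i] equals base c (2-bit ACGT code).
    uint64_t Peq[4] = {0, 0, 0, 0};
    for (int i = 0; i < m; ++i)
        Peq[(query[i] >> 1) & 0x3] |= 1ull << i;

    uint64_t Pv = ~0ull, Mv = 0ull;
    const uint64_t last = 1ull << (m - 1);
    int score = m, best = m;

    for (int j = 0; j < n; ++j) {
        uint64_t Eq = Peq[(text[j] >> 1) & 0x3];
        uint64_t Xv = Eq | Mv;
        uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
        uint64_t Ph = Mv | ~(Xh | Pv);
        uint64_t Mh = Pv & Xh;
        if (Ph & last) ++score;           // bottom cell of the column went up
        else if (Mh & last) --score;      // bottom cell of the column went down
        Ph <<= 1; Mh <<= 1;
        Pv = Mh | ~(Xv | Ph);
        Mv = Ph & Xv;
        if (score < best) best = score;   // best match ending at text[j]
    }
    return best;                          // compared against the error budget
}
```

Holding the whole column in a handful of 64-bit registers (Pv, Mv, Eq and temporaries) is what keeps the per-thread footprint on this slide so small.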
5) Kernel Performance Improvements (memory footprint reduction)
[Charts: performance of the Exact Search and Bit-Parallel Myers kernels, in giga query bases per second, as a function of the query size (m); annotated improvements of 16x and 3.3x.]
The thread-cooperative strategy allows larger problems to scale on the GPU.
→ The memory footprint reduction delivers 2.3x - 6.8x better performance (1).
(1) Compared to the traditional task-parallel strategy. 21