

  1. Parallel and Memory-efficient Preprocessing for Metagenome Assembly. Vasudevan Rengasamy, Paul Medvedev, Kamesh Madduri. School of EECS, The Pennsylvania State University. {vxr162, pashadag, madduri}@cse.psu.edu. HiCOMB 2017.

  2. Talk Outline: Motivation for our work; MetaPrep, a new metagenome preprocessing strategy; MetaPrep evaluation (parallel scaling, comparison to prior work, impact on metagenome assembly); Conclusions and future work.

  3. Metagenome assembly. What is metagenome assembly? ◮ Metagenome: the mixed genomes present in an environmental sample (soil, human gut, etc.). ◮ Assembly: reconstructing genome sequences from reads.

  4. Metagenome assembly. What is metagenome assembly? ◮ Metagenome: the mixed genomes present in an environmental sample (soil, human gut, etc.). ◮ Assembly: reconstructing genome sequences from reads. Why is metagenome assembly challenging? 1. Uneven coverage of genomes. 2. Repeated sequences across genomes. 3. Variable genome sizes. 4. Large dataset sizes (as the output of multiple sequencing runs may be merged). Metagenome assembly tools (MEGAHIT, MetaVelvet, metaSPAdes, etc.) attempt to overcome these challenges.

  5. MEGAHIT [Li2016] metagenome assembler. ◮ State-of-the-art metagenome assembler. ◮ Uses a highly compressed de Bruijn graph representation. ◮ Refines assembly quality by using multiple k-mer lengths. ◮ Supports single-node shared-memory parallelism (both CPUs and GPUs). ◮ 47 minutes to assemble a metagenome dataset containing 4.26 Gbp.

  6. A preprocessing strategy for metagenome assembly. ◮ Introduced by Howe et al. [Howe2014]. ◮ After filtering low-frequency k-mers, partition the de Bruijn graph into weakly connected components (WCCs). ◮ Assemble each large component independently.

  7. Recent work on metagenome partitioning [Flick2015]. ◮ Construct an undirected read graph instead of a de Bruijn graph. ◮ Find connected components in the read graph using a distributed-memory parallel approach based on the Shiloach-Vishkin algorithm. ◮ Read graph components correspond to de Bruijn graph WCCs. [Figure: example read graph over four reads, recovered from the slide as best as possible: R0: TAACGACC, R1: AACGACCT, R2: ACTCAAAT, R3: CTCAACGA.]
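
To make the read-graph idea concrete, here is a minimal sketch (not from the paper): two reads get an edge whenever they share a k-mer, and connected components of that graph correspond to weakly connected components of the de Bruijn graph. The value k = 4 and the read-to-ID mapping are assumptions made for illustration.

```cpp
// Minimal read-graph sketch: connect two reads whenever they share a k-mer.
// k = 4 and the read-to-ID mapping are illustrative assumptions.
#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

int main() {
    const int k = 4;
    std::vector<std::string> reads = {"TAACGACC", "AACGACCT", "ACTCAAAT", "CTCAACGA"};

    // Map each k-mer to the reads that contain it.
    std::map<std::string, std::vector<int>> occ;
    for (int r = 0; r < (int)reads.size(); ++r)
        for (size_t i = 0; i + k <= reads[r].size(); ++i)
            occ[reads[r].substr(i, k)].push_back(r);

    // Any two reads listed under the same k-mer get a read-graph edge.
    std::set<std::pair<int, int>> edges;
    for (const auto& [kmer, rs] : occ)
        for (size_t a = 0; a < rs.size(); ++a)
            for (size_t b = a + 1; b < rs.size(); ++b)
                edges.insert({std::min(rs[a], rs[b]), std::max(rs[a], rs[b])});

    for (const auto& [u, v] : edges)
        std::cout << "R" << u << " -- R" << v << "\n";
    return 0;
}
```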

  8. Our contributions. ◮ A novel multi-stage algorithm to find connected components of read graphs. ◮ End-to-end hybrid parallelism using MPI and OpenMP. ◮ Memory-aware implementation. ◮ Evaluation of the impact of preprocessing on metagenome assembly.

  9. MetaPrep. ◮ A new Metagenome Preprocessing tool. ◮ Main memory use is parameterized. ◮ Multipass approach: only a subset of k-mers is enumerated in each pass, e.g., 10 passes ⇒ 10× memory reduction. ◮ log(P) inter-node communication steps.

  10. Talk Outline: Motivation for our work; MetaPrep, a new metagenome preprocessing strategy; MetaPrep evaluation (parallel scaling, comparison to prior work, impact on metagenome assembly); Conclusions and future work.

  11. MetaPrep overview. Input: FASTQ files → Construct read graph → Find connected components → Output: FASTQ files.

  12. MetaPrep overview. [Flowchart: input FASTQ files, KmerHist, and FASTQPart feed IndexCreate and the multipass steps 1-4 below, producing output FASTQ files.]
  MetaPrep step: Function
  IndexCreate: Create index files for parallel runs.
  1. KmerGen: Enumerate ⟨k-mer, read_id⟩ tuples.
  2. KmerGen-Comm: Transfer ⟨k-mer, read_id⟩ tuples to owner tasks.
  3. LocalSort: Sort tuples by k-mer.
  4. LocalCC: Identify connected components (CCs).
  5. MergeCC: Merge components across tasks; create output FASTQ files with the reads from the largest CC and from the other CCs.
  Steps 1-4 run over multiple passes.

  13. A simple strategy for static work partitioning. ◮ Precompute an m-mer histogram (m ≪ k; defaults are k = 27, m = 10). ◮ Used to partition k-mers across MPI tasks and threads in a load-balanced manner. Example (recovered from the slide): Reads: R1: ACTAGG, R2: CTGTAA. k-mers (k = 5): ACTAG, CTAGG from R1; CTGTA, TGTAA from R2. m-mer histogram (m = 2): AC - 1, CT - 2, TG - 1.
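
A minimal sketch of how such an m-mer histogram can drive static partitioning, reusing the tiny example above (k = 5, m = 2). The greedy assignment of m-mer buckets to tasks is an illustrative stand-in, not necessarily the balancing rule MetaPrep uses.

```cpp
// Sketch: build an m-mer histogram over the k-mer prefixes, then assign
// m-mer buckets to tasks so that k-mer counts stay roughly balanced.
// The greedy rule below is illustrative, not necessarily MetaPrep's.
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    const int k = 5, m = 2, num_tasks = 2;
    std::vector<std::string> reads = {"ACTAGG", "CTGTAA"};

    // Histogram: how many k-mers start with each m-mer prefix.
    std::map<std::string, long> hist;
    for (const auto& r : reads)
        for (size_t i = 0; i + k <= r.size(); ++i)
            ++hist[r.substr(i, m)];

    // Greedily place each m-mer bucket on the currently least-loaded task.
    std::vector<long> load(num_tasks, 0);
    for (const auto& [mmer, count] : hist) {
        int owner = 0;
        for (int t = 1; t < num_tasks; ++t)
            if (load[t] < load[owner]) owner = t;
        load[owner] += count;
        std::cout << mmer << ": " << count << " k-mer(s) -> task " << owner << "\n";
    }
    return 0;
}
```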

  14. Notation. M: total number of k-mers enumerated. R: paired-end read count. S: number of I/O passes. P: number of MPI tasks. T: number of threads per task.

  15. k-mer enumeration. ◮ Generate ⟨k-mer, read_id⟩ tuples. ◮ Multiple threads write to a single array without synchronization; write offsets are precomputed. ◮ Output: a send buffer on each MPI task, divided into regions destined for MPI tasks 1..P, each with per-thread offsets. ◮ k-mers are partially sorted. ◮ Time: O(MS/(PT)), space ≈ 24M/(SP) bytes.
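
A small sketch of the synchronization-free write scheme described above: each thread first counts the tuples it will produce, an exclusive prefix sum turns the counts into starting offsets, and then every thread writes into its own disjoint slice of the shared send buffer. The thread counts and tuple contents are illustrative.

```cpp
// Sketch of lock-free parallel writes into a shared send buffer using
// precomputed per-thread offsets. Counts and tuple contents are illustrative.
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

struct Tuple { uint64_t kmer; uint32_t read_id; };

int main() {
    const int num_threads = 4;
    // Phase 1: each thread counts the tuples it will emit
    // (in MetaPrep this comes from scanning its share of the reads).
    std::vector<size_t> counts = {3, 1, 4, 2};

    // Phase 2: exclusive prefix sum of the counts gives write offsets.
    std::vector<size_t> offsets(num_threads, 0);
    std::partial_sum(counts.begin(), counts.end() - 1, offsets.begin() + 1);

    std::vector<Tuple> send_buf(offsets.back() + counts.back());

    // Phase 3: threads fill disjoint slices, so no locks or atomics are needed.
    #pragma omp parallel for
    for (int t = 0; t < num_threads; ++t)
        for (size_t i = 0; i < counts[t]; ++i)
            send_buf[offsets[t] + i] = Tuple{uint64_t(t * 100 + i), uint32_t(i)};

    std::cout << "send buffer holds " << send_buf.size() << " tuples\n";
    return 0;
}
```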

  16. Sort by k-mer. ◮ Sort tuples by k-mer value to identify reads sharing a common k-mer and create read graph edges. ◮ Radix sort implementation. ◮ The send buffer is reused ⇒ no additional memory. ◮ Tuples are partitioned into T disjoint ranges, which are sorted in parallel by T threads. ◮ Time: O(M/(PT)), space ≈ 24M/(SP) bytes.
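
The sketch below illustrates the range-partitioned sort: tuples are bucketed into disjoint k-mer ranges and each range is sorted independently, so T threads never touch the same data. std::sort stands in for MetaPrep's radix sort, and the buffer contents are made up.

```cpp
// Sketch of the range-partitioned sort: bucket tuples into disjoint k-mer
// ranges, then sort each range independently (one thread per range).
// std::sort stands in for the radix sort used by MetaPrep.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct Tuple { uint64_t kmer; uint32_t read_id; };

int main() {
    std::vector<Tuple> buf = {{9, 0}, {2, 1}, {14, 2}, {2, 3}, {7, 0}, {11, 3}};
    const uint64_t code_space = 16;   // toy k-mer code space (real: 4^k)
    const int num_ranges = 2;         // T = number of threads/ranges

    // Bucket by range; the real code does this in place in the send buffer.
    std::vector<std::vector<Tuple>> ranges(num_ranges);
    for (const auto& t : buf)
        ranges[t.kmer * num_ranges / code_space].push_back(t);

    // Ranges are disjoint in k-mer value, so they can be sorted in parallel.
    #pragma omp parallel for
    for (int r = 0; r < num_ranges; ++r)
        std::sort(ranges[r].begin(), ranges[r].end(),
                  [](const Tuple& a, const Tuple& b) { return a.kmer < b.kmer; });

    for (const auto& range : ranges)
        for (const auto& t : range)
            std::cout << "k-mer " << t.kmer << " -> read " << t.read_id << "\n";
    return 0;
}
```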

  17. Identify connected components. ◮ Find connected components using edges from local k-mers. ◮ Union-by-index and path splitting. [Figure: example forest showing Union(6,5) under union-by-index and Find(6) with path splitting.]

  18. Identify connected components. ◮ Find connected components using edges from local k-mers. ◮ Union-by-index and path splitting. ◮ No critical sections. ◮ Store the edges that merge components (similar to [Patwary2012]). ◮ Process edges again in case of lost updates. ◮ Time: O((M/(PT)) log* R), space ≈ 12M/(SP) + 4R bytes.
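
A compact sketch of the union-find structure named on these slides, with union-by-index and path splitting. The linking convention (lower root index becomes the parent) and the toy edge list are assumptions for illustration; the lock-free, retry-on-lost-update machinery of the parallel version is omitted.

```cpp
// Union-find with union-by-index and path splitting (sequential sketch).
// The lower-index-wins linking rule and the edge list are illustrative;
// the parallel lost-update handling from the slide is not shown.
#include <cstdint>
#include <iostream>
#include <numeric>
#include <utility>
#include <vector>

struct DisjointSets {
    std::vector<uint32_t> parent;
    explicit DisjointSets(uint32_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0u);
    }
    // Path splitting: every visited node is re-pointed at its grandparent.
    uint32_t find(uint32_t x) {
        while (parent[x] != x) {
            uint32_t next = parent[x];
            parent[x] = parent[next];
            x = next;
        }
        return x;
    }
    // Union-by-index: the root with the smaller index becomes the parent.
    void unite(uint32_t a, uint32_t b) {
        uint32_t ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (ra < rb) parent[rb] = ra; else parent[ra] = rb;
    }
};

int main() {
    DisjointSets ds(8);  // 8 reads, ids 0..7
    // Each edge joins two reads that share a k-mer (toy edge list).
    std::vector<std::pair<uint32_t, uint32_t>> edges = {{0, 3}, {3, 5}, {1, 2}, {6, 7}};
    for (const auto& [u, v] : edges) ds.unite(u, v);
    for (uint32_t r = 0; r < 8; ++r)
        std::cout << "read " << r << " -> component " << ds.find(r) << "\n";
    return 0;
}
```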

  19. Merge components. ◮ Merge the component forests of the MPI tasks in log P iterations. ◮ Time: O(R log P log* R), space ≈ 8R bytes. [Figure: four per-task forests (P0-P3) over reads R1-R4 merged pairwise over two rounds into a single forest at P0.]
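
The sketch below simulates the MergeCC pattern in a single process: P per-task parent arrays are folded together pairwise over log2(P) rounds, with task i absorbing the forest of task i + stride. In the actual tool the absorbed forest would arrive via MPI; the forests and the exact pairing scheme here are illustrative.

```cpp
// Single-process simulation of MergeCC: merge P per-task forests pairwise
// in log2(P) rounds. In MetaPrep the absorbed forest arrives via MPI.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

using Forest = std::vector<uint32_t>;  // parent[i] for reads 0..R-1

static uint32_t find_root(const Forest& f, uint32_t x) {
    while (f[x] != x) x = f[x];
    return x;
}

// Fold every edge (x, src[x]) of the incoming forest into dst.
static void merge_into(Forest& dst, const Forest& src) {
    for (uint32_t x = 0; x < src.size(); ++x) {
        uint32_t ra = find_root(dst, x), rb = find_root(dst, src[x]);
        if (ra != rb) dst[std::max(ra, rb)] = std::min(ra, rb);
    }
}

int main() {
    const uint32_t R = 6;                    // 6 reads
    std::vector<Forest> task(4, Forest(R));  // P = 4 per-task forests
    for (auto& f : task) std::iota(f.begin(), f.end(), 0u);
    task[0][1] = 0; task[1][2] = 1; task[2][4] = 3; task[3][5] = 4;  // toy unions

    // Pairwise merging: after log2(P) rounds, task 0 holds the full forest.
    for (size_t stride = 1; stride < task.size(); stride *= 2)
        for (size_t i = 0; i + stride < task.size(); i += 2 * stride)
            merge_into(task[i], task[i + stride]);

    for (uint32_t r = 0; r < R; ++r)
        std::cout << "read " << r << " -> component " << find_root(task[0], r) << "\n";
    return 0;
}
```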

  20. Talk Outline: Motivation for our work; MetaPrep, a new metagenome preprocessing strategy; MetaPrep evaluation (parallel scaling, comparison to prior work, impact on metagenome assembly); Conclusions and future work.

  21. Experiments and results. Description of datasets:
  ID | Dataset | Source | Read count R (×10^6) | Size (Gbp)
  HG | Human gut | NCBI (SRR341725) | 12.7 | 2.29
  LL | Lake Lanier | NCBI (SRR947737) | 21.3 | 4.26
  MM | Mock microbial community | NCBI (SRX200676) | 54.8 | 11.07
  IS | Iowa, continuous corn soil | JGI (402461) | 1132.8 | 223.26
  Machine configuration: the Edison supercomputer at NERSC; each node has 2 × 12-core Ivy Bridge processors and 64 GB memory.

  22. Overview: Motivation for our work; MetaPrep, a new metagenome preprocessing strategy; MetaPrep evaluation (parallel scaling, comparison to prior work, impact on metagenome assembly); Conclusions and future work.

  23. Single-node scaling for the Human Gut (HG) dataset. [Figure: execution time breakdown (KmerGen-I/O, KmerGen, LocalSort, LocalCC-Opt, CC-I/O) and relative speedup for 1 to 24 threads.]

  24. Multi-node scaling for the Human Gut (HG) dataset. [Figure: execution time breakdown (KmerGen-I/O, KmerGen, KmerGen-Comm, LocalSort, LocalCC-Opt, Merge-Comm, MergeCC, CC-I/O) and speedup for 1 to 16 nodes.]

  25. Multi-node scaling for the LL and MM datasets. [Figure: execution time breakdown and speedup for 1 to 16 nodes; LL run with S = 2, MM run with S = 4.]

  26. Multi-node scaling for the Iowa Continuous Corn Soil dataset. [Figure: execution time breakdown on 16 and 64 nodes; the 64-node run is 3.25× faster than the 16-node run.] For the 16-node run, S = 8; for the 64-node run, S = 2.

  27. Overview: Motivation for our work; MetaPrep, a new metagenome preprocessing strategy; MetaPrep evaluation (parallel scaling, comparison to prior work, impact on metagenome assembly); Conclusions and future work.

  28. KmerGen performance comparison with the KMC-2 k-mer counter [Deorowicz2015]. [Figure: KmerGen time for MetaPrep vs. KMC-2 on the HG, LL, and MM datasets (left panel), and MetaPrep16 vs. KMC-2 (right panel), with relative speedups annotated.] ◮ MetaPrep16: MetaPrep run using 16 nodes.

  29. Comparison with read graph connectivity [Flick2015]. Table 1: Execution time comparison with the metagenome partitioning work (AP_LB) using 16 nodes.
  Dataset | MetaPrep (s) | AP_LB (s) | MetaPrep speedup
  HG | 5.5 | 23.6 | 4.22×
  LL | 11.5 | 25.9 | 2.25×
  MM | 19.6 | 56.1 | 2.86×
  ◮ 21 iterations for AP_LB vs. 4 for MetaPrep on the MM dataset.

  30. Overview: Motivation for our work; MetaPrep, a new metagenome preprocessing strategy; MetaPrep evaluation (parallel scaling, comparison to prior work, impact on metagenome assembly); Conclusions and future work.

  31. Largest component size. ◮ The largest component size can be reduced by using filters: 1. k-mer size (k): longer k-mers occur in fewer components. 2. k-mer frequency (KF): filter erroneous (low-frequency) and repeat (high-frequency) k-mers.
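
A toy sketch of the k-mer frequency (KF) filter described above: k-mers seen fewer times than a low threshold (likely sequencing errors) or more times than a high threshold (likely repeats) are dropped before read-graph edges are formed. The thresholds, k, and reads are made-up values, not the paper's settings.

```cpp
// Toy k-mer frequency filter: drop k-mers below min_freq (likely errors)
// or above max_freq (likely repeats). Thresholds and data are made up.
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    const int k = 3, min_freq = 2, max_freq = 4;
    std::vector<std::string> reads = {"ACGTACG", "CGTACGT", "TTTTTTTT", "GGGACGTA"};

    // Count occurrences of every k-mer across all reads.
    std::map<std::string, int> freq;
    for (const auto& r : reads)
        for (size_t i = 0; i + k <= r.size(); ++i)
            ++freq[r.substr(i, k)];

    // Keep only k-mers inside the [min_freq, max_freq] band.
    for (const auto& [kmer, count] : freq) {
        bool keep = count >= min_freq && count <= max_freq;
        std::cout << kmer << " x" << count << (keep ? "  kept" : "  filtered out") << "\n";
    }
    return 0;
}
```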
