  1. HIGH-PERFORMANCE GENOME STUDIES. Lucas Beyer, Diego Fabregat-Traver, and Prof. Paolo Bientinesi, RWTH Aachen University. 19 June 2012, SIAM Conference on Applied Linear Algebra, Valencia, Spain. Thanks to the AICES HPAC group and DFG grant GSC111.

  2. OUTLINE • Problem description • CPU-only algorithm • Leveraging the GPU • Results and conclusion 2 of 30

  3. OUTLINE • Problem description • CPU-only algorithm • Leveraging the GPU • Results and conclusion 3 of 30

  4. GENOME-WIDE ASSOCIATION STUDIES: lots of GLS, $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$, because $i = 0 \ldots$ millions. Input: $y \in \mathbb{R}^n$ observations (phenotype), $X_i \in \mathbb{R}^{n \times p}$ genome measurements/covariates, $M \in \mathbb{R}^{n \times n}$ observation dependencies. Output: $r_i \in \mathbb{R}^p$ relations between phenotype and genome variations. 4 of 30
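
To make the GLS concrete: applied literally, every $r_i$ requires solving against the $n \times n$ matrix $M$ from scratch. A minimal NumPy sketch of a single solve (illustrative only; the function name gls_single and the in-memory arrays are assumptions, not the code behind the talk):

    import numpy as np

    def gls_single(X_i, M, y):
        """One GLS solve, taken literally: r_i = (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y."""
        Minv_X = np.linalg.solve(M, X_i)    # M^{-1} X_i,  shape (n, p)
        Minv_y = np.linalg.solve(M, y)      # M^{-1} y,    shape (n,)
        return np.linalg.solve(X_i.T @ Minv_X, X_i.T @ Minv_y)   # r_i, shape (p,)

Repeating this independently for millions of i is exactly the redundant work the following slides remove.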

  5. THE NUMBERS for $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$: # DNA fragments (nucleotides) $m \sim$ 48–250 million, # samples $n \sim$ 10 000, # covariates $p = 20$. Sizes: $y \in \mathbb{R}^n$: 80 MB, $M \in \mathbb{R}^{n \times n}$: 800 MB, $r \in \mathbb{R}^{p \times m}$: 7–40 GB, $X \in \mathbb{R}^{n \times p \times m}$: 72 TB–373 PB. 5 of 30

  6. OUTLINE • Problem description • CPU-only algorithm • Leveraging the GPU • Results and conclusion 6 of 30

  7. BASIC ALGORITHM: $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$. 7 of 30

  8. BASIC ALGORITHM: $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$. Cholesky once during initialization: $LL^T := M$. 8 of 30

  9. BASIC ALGORITHM: $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$. Cholesky once during initialization: $LL^T := M$, giving $r_i \leftarrow (X_i^T L^{-T} L^{-1} X_i)^{-1} X_i^T L^{-T} L^{-1} y$. 9 of 30

  10. BASIC ALGORITHM: $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$. Cholesky once during initialization: $LL^T := M$, so $r_i \leftarrow (X_i^T L^{-T} L^{-1} X_i)^{-1} X_i^T L^{-T} L^{-1} y$, i.e. $r_i \leftarrow ((L^{-1} X_i)^T (L^{-1} X_i))^{-1} (L^{-1} X_i)^T (L^{-1} y)$. 10 of 30

  11. BASIC ALGORITHM: $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$. Cholesky once during initialization: $LL^T := M$, so $r_i \leftarrow ((L^{-1} X_i)^T (L^{-1} X_i))^{-1} (L^{-1} X_i)^T (L^{-1} y)$. One trsm per iteration i: $\hat{X}_i := L^{-1} X_i$, then $r_i \leftarrow (\hat{X}_i^T \hat{X}_i)^{-1} \hat{X}_i^T (L^{-1} y)$. 11 of 30
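
The whole basic algorithm, as a minimal NumPy/SciPy sketch (illustrative only, assuming everything fits in memory; the name gwas_gls and the array layout are made up, not the presented implementation):

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def gwas_gls(y, M, Xs):
        """y: (n,), M: (n, n) SPD, Xs: (m, n, p). Returns r of shape (m, p)."""
        L = cholesky(M, lower=True)                # LL^T := M, once during initialization
        y_t = solve_triangular(L, y, lower=True)   # L^{-1} y, once
        m, n, p = Xs.shape
        r = np.empty((m, p))
        for i in range(m):
            Xh = solve_triangular(L, Xs[i], lower=True)    # one trsm per i: X_hat_i := L^{-1} X_i
            r[i] = np.linalg.solve(Xh.T @ Xh, Xh.T @ y_t)  # r_i := (X_hat_i^T X_hat_i)^{-1} X_hat_i^T (L^{-1} y)
        return r

The per-i tail is only a small p x p solve, so the trsm dominates the runtime, which is what the optimizations on the next slides target.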

  12. OPTIMIZATIONS • Blocking in i • many small trsms vs. one big trsm ... 12 of 30

  13. OPTIMIZATIONS • Blocking in i • many small trsms vs. one big trsm ... • Out-of-core algorithm • read block b+1 while computing block b • double-buffering technique necessary 13 of 30
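
A sketch of the out-of-core double buffering just described, with a background thread prefetching block b+1 while block b is processed (schematic only; the flat float64 file layout and the names read_block and process_block are assumptions for illustration):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def read_block(path, b, n, p, block_size):
        """Read block b of X (block_size matrices of shape n x p) from a flat float64 file."""
        count = n * p * block_size
        data = np.fromfile(path, dtype=np.float64, count=count, offset=b * count * 8)
        return data.reshape(block_size, n, p)

    def out_of_core_loop(path, num_blocks, n, p, block_size, process_block):
        """Compute on block b while block b+1 is being read (double buffering)."""
        with ThreadPoolExecutor(max_workers=1) as io:
            pending = io.submit(read_block, path, 0, n, p, block_size)
            for b in range(num_blocks):
                block = pending.result()                       # wait for block b
                if b + 1 < num_blocks:                         # prefetch block b+1 in the background
                    pending = io.submit(read_block, path, b + 1, n, p, block_size)
                process_block(b, block)                        # overlaps with the prefetch above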

  14. PERFORMANCE: from years/months down to weeks/days. [Log-scale plot comparing CLAK-Chol against FLMM, GWFGLS, and EMMAX; x-axis: m (nucleotide count, 1m–36m), y-axis: time from 100 s (minutes) up to 10 000 000 s (years).] 14 of 30

  15. OUTLINE • Problem description • CPU-only algorithm • Leveraging the GPU • Results and conclusion 15 of 30

  16. CAN GPUs HELP GO FURTHER? • trsm takes 90–95% of the time • compute it on the GPU • while the GPU computes: • CPU computations • CPU ⇄ GPU transfers • our cluster: nVidia Fermi ☞ 16 of 30

  17. MEMORY PYRAMID: GPU: 1–10 GB, RAM: 10–100 GB, HDD: terabytes. Need for streaming computation; need for two levels of double-buffering. 17 of 30

  18. 2-LEVEL TRIPLE-DOUBLE-BUFFERING (1): Retrieve the previous results from the GPU, start reading the second-next block from the HDD. [Diagram on slides 18–24: the data X and results r buffers on the GPU, in CPU/RAM, and on the HDD around the current block b.] 18 of 30

  19. 2-LEVEL TRIPLE-DOUBLE-BUFFERING (2): Buffer switch (no copying). 19 of 30

  20. 2-LEVEL TRIPLE-DOUBLE-BUFFERING: Buffers switched. 20 of 30

  21. 2-LEVEL TRIPLE-DOUBLE-BUFFERING (3): Send the next block to the GPU, start the CPU computation. 21 of 30

  22. 2-LEVEL TRIPLE-DOUBLE-BUFFERING (4): Write results to disk (fast, because they are small). 22 of 30

  23. 2-LEVEL TRIPLE-DOUBLE-BUFFERING (5): Buffer switch (no copying). 23 of 30

  24. 2-LEVEL TRIPLE-DOUBLE-BUFFERING: One full iteration. 24 of 30

  25. TIMELINE: parallelism on the vertical axis, heavy use of asynchronous dispatching. [Gantt-style diagram over time t with rows for GPU, CPU, and HDD: GPU trsm on blocks b and b+1; Send b+1, Send b+2; Recv b-1, CPU comp b-1, Recv b, CPU comp b; Read b+2, Read b+3; Write b-1. Legend: GPU computation, CPU computation, CPU ⇄ GPU transfer, HDD ⇄ CPU transfer, data dependencies.] 25 of 30
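
A schematic sketch of this overlapped schedule, using CPU threads and bounded queues of depth 2 in place of CUDA streams and asynchronous copies; NumPy stands in for the GPU trsm, L and y_t are the Cholesky factor and $L^{-1}y$ from the initialization, and all names are illustrative rather than the actual implementation:

    import numpy as np
    from queue import Queue
    from threading import Thread
    from scipy.linalg import solve_triangular

    def pipeline(L, y_t, blocks, write_result):
        """blocks yields (block_size, n, p) arrays; write_result(b, r) stores a (block_size, p) result."""
        to_gpu = Queue(maxsize=2)        # "RAM -> GPU" double buffer
        from_gpu = Queue(maxsize=2)      # "GPU -> RAM" double buffer

        def reader():                    # stands in for the HDD -> RAM stream
            for b, Xb in enumerate(blocks):
                to_gpu.put((b, Xb))
            to_gpu.put(None)

        def gpu_worker():                # stands in for the GPU doing the big trsm
            while (item := to_gpu.get()) is not None:
                b, Xb = item
                Xh = np.stack([solve_triangular(L, X, lower=True) for X in Xb])
                from_gpu.put((b, Xh))
            from_gpu.put(None)

        Thread(target=reader, daemon=True).start()
        Thread(target=gpu_worker, daemon=True).start()

        while (item := from_gpu.get()) is not None:   # CPU: small p x p solves, then write r
            b, Xh = item
            r = np.stack([np.linalg.solve(X.T @ X, X.T @ y_t) for X in Xh])
            write_result(b, r)

Because both queues are bounded, each stage blocks as soon as it runs two blocks ahead, which is the same back-pressure the double buffers provide in the real pipeline.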

  26. TIMELINE, TO SCALE. [Same Gantt-style diagram drawn to scale; legend: GPU computation, CPU computation, CPU ⇄ GPU transfer, HDD ⇄ CPU transfer.] Problem sizes: n = 10k, m = 100k, block = 10k. GPU: 2x nVidia Quadro 6000 (Fermi, 515 GFlops each, 6 GB memory) = $10,000. CPU: 2x Intel Xeon X5650 (6 cores, 128 GFlops, 24 GB memory) = $2,000. BLAS: Intel MKL 10.2. Compiler: icc 12.1. 26 of 30

  27. OUTLINE • Problem description • CPU-only algorithm • Leveraging the GPU • Results and conclusion 27 of 30

  28. PERFORMANCE: sustained in-core performance even when out-of-core; extrapolated: 13h/70h vs 2.5h/13h. [Plot: Time [s] (0–100) vs m (nucleotide count, 1k–90k), hybrid CPU+2GPU algorithm vs original CPU-only algorithm, with the ⟵ in-core / out-of-core ⟶ boundary marked; labeled times range from 4.3 s to 96.7 s.] 28 of 30

  29. PERFORMANCE: 5.2x speedup using 2 GPUs, 10x using 4 GPUs; from days to hours (when m is in the millions). [Same plot as on the previous slide: Time [s] vs m (nucleotide count, 1k–90k), hybrid CPU+2GPU vs original CPU-only.] 29 of 30

  30. CONCLUSION • Don’t replace the CPU with the GPU • combine them • Hide data-transfer latency by overlapping it with computation • double/triple-buffering • the GPU never stops computing • Are GPUs an order of magnitude faster? • V. Volkov (http://www.cs.berkeley.edu/~volkov/) • Victor W. Lee et al., «Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU», 2010 30 of 30

  31. QUESTIONS? beyer@aices.rwth-aachen.de [Background: the performance plot from slide 28.] 31 of 30

  32. FUTURE WORK • Solution for L too big for GPU memory • Apply similar technique to similar problems • Extension to multiple phenotypes (y) 32 of 30
