HIGH-PERFORMANCE GENOME STUDIES
Lucas Beyer, Diego Fabregat-Traver, Prof. Paolo Bientinesi
RWTH Aachen University
19 June 2012, SIAM Conference on Applied Linear Algebra, Valencia, Spain
Thanks to the AICES HPAC group and DFG grant GSC111
OUTLINE
• Problem description
• CPU-only algorithm
• Leveraging the GPU
• Results and conclusion
GENOME-WIDE ASSOCIATION STUDIES
rᵢ ← (Xᵢᵀ M⁻¹ Xᵢ)⁻¹ Xᵢᵀ M⁻¹ y
Lots of GLS problems, because i = 0..millions.
Input: y ∈ ℝⁿ observations (phenotype); Xᵢ ∈ ℝⁿˣᵖ genome measurements/covariates; M ∈ ℝⁿˣⁿ observation dependencies.
Output: rᵢ ∈ ℝᵖ relations between phenotype and genome variations.
THE NUMBERS
• y ∈ ℝⁿ: 80 MB
• M ∈ ℝⁿˣⁿ: 800 MB
• r ∈ ℝᵖˣᵐ: 7-40 GB
• X ∈ ℝⁿˣᵖˣᵐ: 72 TB - 373 PB
with # samples n ~ 10 000, # covariates p = 20, # DNA fragments (nucleotides) m ~ 48 - 250 000 000.
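As a minimal sketch of the GLS problem each iteration solves, in plain NumPy with made-up toy sizes (this is illustrative, not the tuned implementation from the talk):

```python
import numpy as np

def gls_solve(X, M, y):
    """One GLS solve: r = (X^T M^-1 X)^-1 X^T M^-1 y."""
    Minv_X = np.linalg.solve(M, X)   # M^-1 X
    Minv_y = np.linalg.solve(M, y)   # M^-1 y
    return np.linalg.solve(X.T @ Minv_X, X.T @ Minv_y)

# Tiny toy problem (n = 6 samples, p = 2 covariates).
rng = np.random.default_rng(0)
n, p = 6, 2
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)          # symmetric positive definite "covariance"
y = rng.standard_normal(n)
r = gls_solve(X, M, y)
print(r.shape)                       # (2,)
```

At real sizes M alone is 800 MB, which is why the talk factors it once instead of solving with it millions of times.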
BASIC ALGORITHM
rᵢ ← (Xᵢᵀ M⁻¹ Xᵢ)⁻¹ Xᵢᵀ M⁻¹ y
Cholesky once during initialization: L Lᵀ := M
rᵢ ← (Xᵢᵀ L⁻ᵀ L⁻¹ Xᵢ)⁻¹ Xᵢᵀ L⁻ᵀ L⁻¹ y
rᵢ ← ((L⁻¹Xᵢ)ᵀ L⁻¹Xᵢ)⁻¹ (L⁻¹Xᵢ)ᵀ L⁻¹y
One trsm per iteration step i: X̂ᵢ := L⁻¹Xᵢ, giving
rᵢ ← (X̂ᵢᵀ X̂ᵢ)⁻¹ X̂ᵢᵀ L⁻¹y
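A small sketch of this rewrite, using SciPy's `solve_triangular` as a stand-in for trsm (toy sizes, not the production code); it checks that the Cholesky-based formulation matches the direct GLS solve:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gls_direct(X, M, y):
    """Reference: r = (X^T M^-1 X)^-1 X^T M^-1 y."""
    Minv_X = np.linalg.solve(M, X)
    Minv_y = np.linalg.solve(M, y)
    return np.linalg.solve(X.T @ Minv_X, X.T @ Minv_y)

def gls_chol(Xs, M, y):
    """Factor M once, then one triangular solve (trsm) per X_i."""
    L = cholesky(M, lower=True)                   # L L^T = M, once
    Ly = solve_triangular(L, y, lower=True)       # L^-1 y, once
    rs = []
    for X in Xs:
        Xh = solve_triangular(L, X, lower=True)   # trsm: X^ = L^-1 X_i
        rs.append(np.linalg.solve(Xh.T @ Xh, Xh.T @ Ly))
    return rs

rng = np.random.default_rng(1)
n, p = 8, 3
A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)
y = rng.standard_normal(n)
Xs = [rng.standard_normal((n, p)) for _ in range(4)]
for X, r in zip(Xs, gls_chol(Xs, M, y)):
    assert np.allclose(r, gls_direct(X, M, y))
```

The payoff is that the O(n³) factorization is amortized over millions of i, leaving only triangular solves in the loop.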
OPTIMIZATIONS
• Blocking in i: many small trsms vs. one big trsm
• Out-of-core algorithm: read block b+1 while computing block b; a double-buffering technique is necessary
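The read-ahead idea can be sketched with a single I/O worker thread; `read_block` and `compute_block` here are made-up stand-ins for the real disk read and the per-block trsm work:

```python
from concurrent.futures import ThreadPoolExecutor

def read_block(b):
    return [b] * 4          # stand-in for reading block b of X from HDD

def compute_block(data):
    return sum(data)        # stand-in for the per-block computation

def out_of_core(num_blocks):
    """Double buffering: read block b+1 while computing block b."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_block, 0)               # prefetch block 0
        for b in range(num_blocks):
            data = pending.result()                      # wait for block b
            if b + 1 < num_blocks:
                pending = io.submit(read_block, b + 1)   # async prefetch b+1
            results.append(compute_block(data))          # overlaps with the read
    return results

print(out_of_core(3))  # [0, 4, 8]
```

As long as computing a block takes at least as long as reading the next one, the disk latency is fully hidden.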
PERFORMANCE
[Chart: runtime (100 s to 10 000 000 s, i.e. minutes to years, log scale) vs. m (nucleotide count, 1m-36m) for CLAK-Chol, FLMM, GWFGLS and EMMAX]
From years/months down to weeks/days.
CAN GPUs HELP GO FURTHER?
• trsm takes 90-95% of the time ⇒ compute it on the GPU
• while the GPU computes:
  • CPU computations
  • CPU ⇄ GPU transfers
• our cluster: NVIDIA Fermi
MEMORY PYRAMID
GPU: 1-10 GB · RAM: 10-100 GB · HDD: terabytes
⇒ need for streaming computation, and for two levels of double-buffering.
2-LEVEL TRIPLE-DOUBLE-BUFFERING
[Diagrams: blocks of data X stream HDD → CPU/RAM → GPU and results r stream back, with double buffers at each level]
(1) Retrieve the previous results (block b−1) from the GPU; start reading the second-next block (b+2) from the HDD
(2) Buffer switch (no copying)
(3) Send the next block (b+1) to the GPU; start the CPU computation
(4) Write the results to disk (fast, because the results are small)
(5) Buffer switch (no copying)
These five steps make up one full iteration.
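One way to sketch the two-level pipeline above (a sketch only: the "GPU" is just a second worker thread, and `hdd_read`, `gpu_trsm` and `cpu_post` are made-up stand-ins for the real transfers and kernels):

```python
from concurrent.futures import ThreadPoolExecutor

def hdd_read(b):      return b            # stand-in: stream block b from disk
def gpu_trsm(data):   return data * 10    # stand-in: asynchronous GPU trsm
def cpu_post(x):      return x + 1        # stand-in: CPU-side work on results

def pipeline(num_blocks):
    """Two overlapped levels: HDD→CPU prefetch and CPU→GPU compute."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io, \
         ThreadPoolExecutor(max_workers=1) as gpu:
        reading = io.submit(hdd_read, 0)
        computing = None
        for b in range(num_blocks):
            data = reading.result()                       # block b has arrived
            if b + 1 < num_blocks:
                reading = io.submit(hdd_read, b + 1)      # prefetch next block
            if computing is not None:                     # (1) retrieve b-1 results
                results.append(cpu_post(computing.result()))
            computing = gpu.submit(gpu_trsm, data)        # (3) send block b to GPU
        results.append(cpu_post(computing.result()))      # drain the pipeline
    return results

print(pipeline(3))  # [1, 11, 21]
```

In the real implementation the same structure is driven by asynchronous dispatch (CUDA streams and non-blocking I/O) rather than threads, but the hand-over of buffers is the same.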
TIMELINE
[Timeline diagram: the GPU runs trsm on blocks b and b+1 back-to-back; the CPU receives b−1, computes on it, receives b, and sends b+1/b+2; the HDD reads b+2/b+3 and writes b−1; arrows mark data dependencies. Legend: GPU computation, CPU computation, CPU ⇄ GPU transfer, HDD ⇄ CPU transfer]
Parallelism on the vertical axis; heavy use of asynchronous dispatching.
TIMELINE, TO SCALE
[The same timeline, drawn to scale]
Problem sizes: n = 10k, m = 100k, block = 10k
GPU: 2× NVIDIA Quadro 6000 (Fermi, 515 GFLOPS each, 6 GB memory) ≈ $10 000
CPU: 2× Intel Xeon X5650 (6 cores, 128 GFLOPS, 24 GB memory) ≈ $2 000
BLAS: Intel MKL 10.2; compiler: icc 12.1
PERFORMANCE
[Chart: time [s] vs. m (1k-90k) for the hybrid CPU+2GPU algorithm and the original CPU-only algorithm, with data labels from 4.3 s up to 96.7 s; the in-core/out-of-core boundary is marked]
Sustained in-core performance even when out-of-core. Extrapolated: 13h/70h vs 2.5h/13h.
PERFORMANCE
[Same chart as the previous slide]
5.2× speedup using 2 GPUs, 10× using 4 GPUs: from days to hours (when m is in the millions).
CONCLUSION
• Don't replace the CPU with the GPU; combine them
• Hide data-transfer latency by overlapping it with computation
  • double/triple-buffering
  • the GPU never stops computing
• Are GPUs an order of magnitude faster?
  • V. Volkov (http://www.cs.berkeley.edu/~volkov/)
  • Victor W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU", 2010
QUESTIONS?
beyer@aices.rwth-aachen.de
FUTURE WORK
• A solution for when L is too big for GPU memory
• Applying a similar technique to similar problems
• An extension to multiple phenotypes (y)