  1. HIGH-PERFORMANCE GENOME STUDIES Lucas Beyer, Diego Fabregat-Traver, and Prof. Paolo Bientinesi, RWTH Aachen University. 19 June 2012, SIAM Conference on Applied Linear Algebra, Valencia, Spain. Thanks to the AICES HPAC group and DFG grant GSC111.

  2. OUTLINE • Introduction and motivation • Problem description • CPU-only algorithm • Leveraging the GPU • Results and conclusion

  3. OUTLINE • Introduction and motivation • Problem description • CPU-only algorithm • Leveraging the GPU • Results and conclusion

  4. THE BIOLOGY Roughly, how an engineer sees it: [Hierarchy diagram: Organisms, Cells, Proteins, DNA, Genes, Nucleotides]

  5. SINGLE NUCLEOTIDE POLYMORPHISM A nucleotide whose allele differs between two individuals of a species. Link to traits and diseases? (Image source: Wikipedia)

  6. GWAS • Human Genome Project (2004) • Genome-wide association studies (GWAS) • Find correlations between SNPs and traits (diseases) • Case group vs. control group • Variance components & generalized linear mixed models

  7. GWAS STATS [Bar chart: number of GWAS carried out each year, 2005-2011, growing from single digits in 2005 to over 2,300 in 2011]

  8. GWAS STATS [Bar chart: GWAS sample sizes per year, 2005-2011; y-axis 0K-40K]

  9. GWAS STATS [Bar chart: number of SNPs passing QC per year, 2005-2011; y-axis 0M-4M]

  10. GWAS STATS [Bar chart: largest number of SNPs passing QC per year, 2005-2011, growing from 0.2M to 10.5M]

  11. HPC BASICS • Basic Linear Algebra Subprograms (BLAS): the LEGO-like building blocks of linear algebra, with vendor-optimized implementations; TRSM solves multiple triangular systems $TX = Y$ • Linear Algebra PACKage (LAPACK): higher-level linear algebra algorithms; POTRF computes the Cholesky factorization $LL^T = A$ of an SPD matrix
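As a concrete illustration, here is a minimal sketch of calling these two kernels through the standard LAPACKE/CBLAS interfaces (e.g. Intel MKL); names and sizes are illustrative, not the presenters' code:

```c
/* A minimal sketch of the two building blocks named above, via the
 * standard LAPACKE/CBLAS interfaces; names and sizes are illustrative. */
#include <mkl.h>   /* or <lapacke.h> + <cblas.h> with another BLAS */

void factor_and_solve(int n, int k, double *M, double *X)
{
    /* POTRF: Cholesky factorization M = L L^T of an SPD matrix;
     * the lower triangle of M is overwritten with L. */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, M, n);

    /* TRSM: solve the triangular systems L * Xhat = X for k
     * right-hand sides; X (n x k) is overwritten with L^{-1} X. */
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasNonUnit, n, k, 1.0, M, n, X, n);
}
```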

  12. OUTLINE • Introduction and motivation • Problem description • CPU-only algorithm • Leveraging the GPU • Results and conclusion

  13. GENOME-WIDE ASSOCIATION STUDIES Lots of GLS problems, because $i = 0 \dots m$ with $m$ in the millions: $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$ • input: $y \in \mathbb{R}^n$, the observations (phenotype); $X_i \in \mathbb{R}^{n \times p}$, genome measurements/covariates; $M \in \mathbb{R}^{n \times n}$, observation dependencies • output: $r_i \in \mathbb{R}^p$, relations between phenotype and genome variations
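For context, this is the standard generalized least-squares (GLS) estimator: $r_i$ is the minimizer of the $M^{-1}$-weighted residual,

$$r_i = \arg\min_{r \in \mathbb{R}^p} \; (y - X_i r)^T M^{-1} (y - X_i r).$$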

  14. THE NUMBERS $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$ • # samples: $n \sim 10\,000$ • # covariates: $p = 20$ • # DNA fragments (nucleotides): $m \sim 48\,000\,000$-$250\,000\,000$ • sizes: $y \in \mathbb{R}^n$: 80 KB; $M \in \mathbb{R}^{n \times n}$: 800 MB; $r \in \mathbb{R}^{p \times m}$: 7-40 GB; $X \in \mathbb{R}^{n \times p \times m}$: 72 TB-373 TB
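A quick sanity check on the size of $X$ (assuming 8-byte double precision):

$$\text{size}(X) = m \cdot n \cdot p \cdot 8\,\text{B} = m \cdot (10\,000 \cdot 20 \cdot 8\,\text{B}) = m \cdot 1.6\,\text{MB},$$

so already $m \approx 45 \cdot 10^6$ SNPs amount to roughly 72 TB; the full dataset can never fit in RAM, let alone on the GPU.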

  15. OUTLINE • Introduction and motivation • Problem description • CPU-only algorithm • Leveraging the GPU • Results and conclusion

  16. BASIC ALGORITHM $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$

  17. BASIC ALGORITHM $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$ Cholesky once during initialization: $LL^T := M$

  18. BASIC ALGORITHM $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$ Cholesky once during initialization: $LL^T := M$, hence $r_i \leftarrow (X_i^T L^{-T} L^{-1} X_i)^{-1} X_i^T L^{-T} L^{-1} y$

  19. BASIC ALGORITHM $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$ Cholesky once during initialization: $LL^T := M$, hence $r_i \leftarrow (X_i^T L^{-T} L^{-1} X_i)^{-1} X_i^T L^{-T} L^{-1} y$, i.e. $r_i \leftarrow ((L^{-1} X_i)^T (L^{-1} X_i))^{-1} (L^{-1} X_i)^T (L^{-1} y)$

  20. BASIC ALGORITHM $r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$ Cholesky once during initialization: $LL^T := M$, hence $r_i \leftarrow (X_i^T L^{-T} L^{-1} X_i)^{-1} X_i^T L^{-T} L^{-1} y$, i.e. $r_i \leftarrow ((L^{-1} X_i)^T (L^{-1} X_i))^{-1} (L^{-1} X_i)^T (L^{-1} y)$. One trsm per iteration step $i$: $\hat{X}_i := L^{-1} X_i$, then $r_i \leftarrow (\hat{X}_i^T \hat{X}_i)^{-1} \hat{X}_i^T (L^{-1} y)$
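Put together, one iteration maps onto standard kernels roughly as below; a minimal sketch assuming column-major storage, $p \le 20$, and $\hat{y} = L^{-1} y$ precomputed once, with illustrative names rather than the actual CLAK-Chol code:

```c
/* A minimal sketch of one iteration of the basic algorithm; L is the
 * Cholesky factor of M, yhat = L^{-1} y was computed once up front. */
#include <mkl.h>

void gls_one_snp(int n, int p, const double *L, const double *yhat,
                 double *Xi,   /* n x p; overwritten with Xhat_i */
                 double *ri)   /* p; the output r_i */
{
    double S[20 * 20];  /* Xhat_i^T Xhat_i; assumes p <= 20 (slide 14) */

    /* Xhat_i := L^{-1} X_i : the one trsm per iteration step. */
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasNonUnit, n, p, 1.0, L, n, Xi, n);

    /* S := Xhat_i^T Xhat_i (p x p, SPD). */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasTrans,
                p, n, 1.0, Xi, n, 0.0, S, p);

    /* ri := Xhat_i^T yhat. */
    cblas_dgemv(CblasColMajor, CblasTrans, n, p, 1.0, Xi, n,
                yhat, 1, 0.0, ri, 1);

    /* Solve S r_i = Xhat_i^T yhat via a second, tiny Cholesky. */
    LAPACKE_dposv(LAPACK_COL_MAJOR, 'L', p, 1, S, p, ri, p);
}
```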

  21. OPTIMIZATIONS • Blocking in $i$: many small trsms vs. one big trsm ...

  22. OPTIMIZATIONS • Blocking in $i$: many small trsms vs. one big trsm • Out-of-core algorithm: read block $b+1$ while computing block $b$; a double-buffering technique is necessary (see the sketch below)
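A minimal sketch of that double-buffering idea, with hypothetical helpers read_block() and compute_block() standing in for the HDD I/O and the blocked trsm work; real code might use asynchronous I/O instead, but a plain reader thread already shows the overlap:

```c
/* A minimal sketch of the out-of-core double-buffering loop. */
#include <pthread.h>

void read_block(const char *path, int b, double *buf);  /* hypothetical */
void compute_block(double *buf);                        /* hypothetical */

typedef struct { const char *path; int b; double *buf; } ReadJob;

static void *reader(void *arg)     /* runs concurrently with compute */
{
    ReadJob *job = (ReadJob *)arg;
    read_block(job->path, job->b, job->buf);
    return NULL;
}

void out_of_core_sweep(const char *path, int nblocks,
                       double *bufA, double *bufB)
{
    read_block(path, 0, bufA);              /* prime the pipeline */
    for (int b = 0; b < nblocks; ++b) {
        pthread_t io;
        ReadJob next = { path, b + 1, bufB };
        if (b + 1 < nblocks)                /* read block b+1 ...          */
            pthread_create(&io, NULL, reader, &next);
        compute_block(bufA);                /* ... while computing block b */
        if (b + 1 < nblocks)
            pthread_join(&io, NULL);
        double *tmp = bufA; bufA = bufB; bufB = tmp;  /* swap buffers */
    }
}
```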

  23. PERFORMANCE [Runtime plot: EMMAX, GWFGLS, FLMM, and CLAK-Chol for m = 1M-100M SNPs; log time axis from minutes (100 s) to years (10,000,000 s)] From years/months down to weeks/days

  24. OUTLINE • Introduction and motivation • Problem description • CPU-only algorithm • Leveraging the GPU • Results and conclusion

  25. CAN GPUs HELP GO FURTHER? • trsm takes 90-95% of the time ⇒ compute it on the GPU • while the GPU computes: CPU computations and CPU ⇄ GPU transfers • our cluster: NVIDIA Fermi
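Offloading the dominant kernel is a single cuBLAS call; a minimal sketch (handle/stream creation and error checks omitted, names illustrative):

```c
/* A minimal sketch of offloading the dominant trsm to the GPU. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void trsm_on_gpu(cublasHandle_t handle, cudaStream_t stream,
                 int n, int k, const double *dL, double *dX)
{
    const double one = 1.0;
    cublasSetStream(handle, stream);  /* asynchronous launch, so ...   */
    cublasDtrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                n, k, &one, dL, n, dX, n);
    /* ... the CPU is free to compute and to schedule CPU <-> GPU
     * transfers while the GPU works on this block. */
}
```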

  26. MEMORY PYRAMID GPU: 1-10 GB, RAM: 10-100 GB, HDD: terabytes ⇒ need for streaming computation, and for two levels of double-buffering
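A minimal sketch of the buffer setup behind those two levels, with an illustrative block size; pinned host memory is what allows CPU ⇄ GPU copies to run asynchronously:

```c
/* A minimal sketch of allocating the staging buffers for both levels. */
#include <cuda_runtime.h>

#define BLOCK_BYTES ((size_t)10000 * 20 * 8 * 500)  /* illustrative */

void alloc_pipeline_buffers(double **hA, double **hB,
                            double **dA, double **dB, double **dC)
{
    /* level 1 (HDD <-> RAM): two pinned host staging buffers,
     * required for truly asynchronous cudaMemcpyAsync */
    cudaMallocHost((void **)hA, BLOCK_BYTES);
    cudaMallocHost((void **)hB, BLOCK_BYTES);
    /* level 2 (RAM <-> GPU): three device-side slots A/B/C,
     * as in the pipeline figures that follow */
    cudaMalloc((void **)dA, BLOCK_BYTES);
    cudaMalloc((void **)dB, BLOCK_BYTES);
    cudaMalloc((void **)dC, BLOCK_BYTES);
}
```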

  27. 2-LEVEL TRIPLE-DOUBLE-BUFFERING (1) [Diagram: trsm on block b running on the GPU (slots α/β); blocks b-1, b+1, b+2 staged in CPU/RAM (slots A/B/C); data X and results r on the HDD] Retrieve the previous results from the GPU, start reading the second-next block from the HDD

  28. 2-LEVEL TRIPLE-DOUBLE-BUFFERING (2) Send the next block to the GPU, start the CPU computation

  29. 2-LEVEL TRIPLE-DOUBLE-BUFFERING (3) Write the results to disk (fast, because they are small)

  30. 2-LEVEL TRIPLE-DOUBLE-BUFFERING (4) Rotate the buffers, iterate, smile

  31. 2-LEVEL TRIPLE-DOUBLE-BUFFERING One full iteration, combining steps (1)-(4); a CUDA sketch follows
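In CUDA terms, one such iteration could be dispatched as below; a minimal sketch assuming pinned host buffers and three streams (upload, compute, download), with illustrative names and bookkeeping rather than the presenters' actual code:

```c
/* A minimal CUDA sketch of one pipeline iteration: upload, trsm, and
 * download all overlap because they run on different streams. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void pipeline_iteration(cublasHandle_t handle,
                        cudaStream_t up, cudaStream_t comp, cudaStream_t down,
                        const double *hNext, double *dNext, /* block b+1 */
                        double *dCur,                       /* block b   */
                        const double *dPrev, double *hPrev, /* block b-1 */
                        const double *dL, int n, int k, size_t bytes)
{
    const double one = 1.0;

    /* (1) retrieve the previous block's results from the GPU ...      */
    cudaMemcpyAsync(hPrev, dPrev, bytes, cudaMemcpyDeviceToHost, down);

    /* (2) ... send the next block of X to the GPU ...                 */
    cudaMemcpyAsync(dNext, hNext, bytes, cudaMemcpyHostToDevice, up);

    /* ... all while the GPU solves L * Xhat = X for the current block
     * (k = p times the number of SNPs packed into the block).         */
    cublasSetStream(handle, comp);
    cublasDtrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                n, k, &one, dL, n, dCur, n);

    /* (3)+(4): once the download has finished, the caller writes hPrev
     * to disk, rotates the A/B/C slots, and starts reading block b+2. */
    cudaStreamSynchronize(down);
}
```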

  32. TIMELINE [Timeline: GPU, CPU, and HDD lanes; the GPU runs trsm on block b and then b+1 while the CPU receives b-1, computes on b-1, receives b, reads b+2 and b+3 from the HDD, and writes b-1] Parallelism on the vertical axis; heavy use of asynchronous dispatching. Legend: CPU ⇄ GPU transfer, GPU computation, HDD ⇄ CPU transfer, CPU computation, data dependencies, asynchronous dispatch

  33. TIMELINE, TO SCALE Problem sizes: n = 10k, m = 100k, block size = 10k. GPUs: 2x NVIDIA Quadro 6000 (Fermi, 515 GFlops each, 6 GB memory), ≈ $10,000. CPUs: 2x Intel Xeon X5650 (6 cores, 128 GFlops, 24 GB memory), ≈ $2,000. BLAS: Intel MKL 10.2. Compiler: icc 12.1

  34. OUTLINE • Introduction and motivation • Problem description • CPU-only algorithm • Leveraging the GPU • Results and conclusion

  35. PERFORMANCE [Bar chart: hybrid CPU + 2-GPU algorithm vs. the original CPU-only algorithm for m = 1k-90k nucleotides, spanning the in-core to out-of-core transition; CPU-only runtimes grow to 96.7 s, the hybrid stays far below] 5.2x speedup using 2 GPUs; in-core performance is sustained even when running out-of-core

  36. PERFORMANCE [Runtime plot: EMMAX, GWFGLS, FLMM, CLAK-Chol, and extrapolated CLAK-Chol GPU for m = 1M-100M SNPs; log time axis from minutes to years] From years/months down to hours

  37. SCALABILITY [Plot: runtime vs. number of GPUs (1-4), measured against perfect scalability; from 40.7 s on 1 GPU down to roughly 11 s on 4 GPUs] Doubling the number of GPUs ⇒ time x0.54: almost perfect
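A minimal sketch of one way such multi-GPU scaling can be organized (the deck does not spell out the work distribution): blocks dealt round-robin to g GPUs, each with its own cuBLAS handle and stream; enqueue_block() is a hypothetical helper wrapping the upload/trsm/download pipeline sketched above:

```c
/* A minimal sketch of a round-robin multi-GPU dispatch loop. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void enqueue_block(cublasHandle_t h, cudaStream_t s, int b); /* hypothetical */

void dispatch_blocks(int g, cublasHandle_t *handles,
                     cudaStream_t *streams, int nblocks)
{
    for (int b = 0; b < nblocks; ++b) {
        int dev = b % g;                      /* round-robin over GPUs */
        cudaSetDevice(dev);
        enqueue_block(handles[dev], streams[dev], b);
    }
    for (int dev = 0; dev < g; ++dev) {       /* drain all pipelines */
        cudaSetDevice(dev);
        cudaStreamSynchronize(streams[dev]);
    }
}
```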

  38. CONCLUSION • Don't replace the CPU with the GPU: combine them • Hide data-transfer latency by overlapping transfers with computation • Double/triple-buffering: the GPU never stops computing • Are GPUs an order of magnitude faster? See V. Volkov (http://www.cs.berkeley.edu/~volkov/) and Victor W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU", 2010

  39. QUESTIONS? beyer@aices.rwth-aachen.de [Background: the runtime plot from slide 36]

  40. [Diagram: viewer's eyes, projection surface, 3D scene]

  41. [Diagram: viewer's eyes, 3D scene, projection surface]

  42. VISUALIZATION

  43. FUTURE WORK • A solution for when L is too big for GPU memory • Apply similar techniques to similar problems • Extension to multiple phenotypes (y)
