HIGH-PERFORMANCE GENOME STUDIES
Lucas Beyer, Diego Fabregat-Traver, and Prof. Paolo Bientinesi
RWTH Aachen University
19 June 2012, SIAM Conference on Applied Linear Algebra, Valencia, Spain
Thanks to the AICES HPAC group and DFG grant GSC111
OUTLINE
• Introduction and motivation
• Problem description
• CPU-only algorithm
• Leveraging the GPU
• Results and conclusion
THE BIOLOGY: roughly, how an engineer sees it
Organisms → Cells → Proteins → DNA → Genes → Nucleotides
SINGLE NUCLEOTIDE POLYMORPHISM (SNP)
A nucleotide whose allele differs between two individuals of a species.
Link to traits, diseases?
Image source: Wikipedia
GWAS
• Human Genome Project (2004)
• Genome-wide association studies (GWAS)
• Find correlations between SNPs and traits (diseases)
• Case group vs. control group
• Variance components & generalized linear mixed models
GWAS STATS
[Bar chart: number of GWAS carried out each year; 2005: 2, 2006: 13, 2007: 453, 2008: 999, 2009: 1257, 2010: 2304, 2011: 2333]
GWAS STATS
[Chart: GWAS sample size per year, 2005-2011, on a 0K-40K axis]
GWAS STATS
[Chart: number of SNPs passing QC per year, 2005-2011, on a 0M-4M axis]
GWAS STATS
[Bar chart: largest number of SNPs passing QC per year, 2005-2011, growing from 0.2M to over 10M]
HPC BASICS
• Basic Linear Algebra Subprograms (BLAS)
  • LEGO-like building blocks of linear algebra
  • Vendor-optimized implementations
  • TRSM: solution of multiple triangular systems $TX = Y$
• Linear Algebra PACKage (LAPACK)
  • Higher-level linear algebra algorithms
  • POTRF: Cholesky factorization of an SPD matrix, $LL^T = A$
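As a concrete illustration (not part of the original slides), a minimal C sketch of these two building blocks through the LAPACKE and CBLAS interfaces; the matrix contents are placeholder data:

```c
#include <stdio.h>
#include <cblas.h>
#include <lapacke.h>

int main(void) {
    const int n = 4, p = 2;
    /* Placeholder SPD matrix M (diagonally dominant) and right-hand
       sides X, both column-major. */
    double M[16] = { 4,1,0,0,  1,4,1,0,  0,1,4,1,  0,0,1,4 };
    double X[8]  = { 1,2,3,4,  5,6,7,8 };

    /* POTRF: Cholesky factorization M = L * L^T; L overwrites the
       lower triangle of M. */
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, M, n);
    if (info != 0) { fprintf(stderr, "dpotrf failed: %d\n", info); return 1; }

    /* TRSM: solve the triangular systems L * Xhat = X; Xhat
       overwrites X. */
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasNonUnit, n, p, 1.0, M, n, X, n);

    printf("Xhat(0,0) = %g\n", X[0]);
    return 0;
}
```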
OUTLINE
• Introduction and motivation
• Problem description
• CPU-only algorithm
• Leveraging the GPU
• Results and conclusion
GENOME-WIDE ASSOCIATION STUDIES
$r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$, with $X_i$ of size $n \times p$ and $M$ of size $n \times n$
Lots of GLS solves, because $i = 0 \ldots$ millions
Input:
• $y \in \mathbb{R}^n$: observations (phenotype)
• $X_i \in \mathbb{R}^{n \times p}$: genome measurements / covariates
• $M \in \mathbb{R}^{n \times n}$: observation dependencies
Output:
• $r_i \in \mathbb{R}^p$: relations between phenotype and genome variations
THE NUMBERS
$r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$
• # DNA fragments (nucleotides): $m \approx$ 48-250 million
• # samples: $n \approx$ 10 000
• # covariates: $p = 20$
Storage:
• $y \in \mathbb{R}^n$: 80 MB
• $M \in \mathbb{R}^{n \times n}$: 800 MB
• $r \in \mathbb{R}^{p \times m}$: 7-40 GB
• $X \in \mathbb{R}^{n \times p \times m}$: 72-373 TB
OUTLINE
• Introduction and motivation
• Problem description
• CPU-only algorithm
• Leveraging the GPU
• Results and conclusion
BASIC ALGORITHM
$r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$

Cholesky once during initialization: $LL^T := M$
$r_i \leftarrow (X_i^T L^{-T} L^{-1} X_i)^{-1} X_i^T L^{-T} L^{-1} y$
$r_i \leftarrow ((L^{-1} X_i)^T (L^{-1} X_i))^{-1} (L^{-1} X_i)^T (L^{-1} y)$

One TRSM per iteration step $i$: $\hat{X}_i := L^{-1} X_i$
$r_i \leftarrow (\hat{X}_i^T \hat{X}_i)^{-1} \hat{X}_i^T (L^{-1} y)$
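A minimal C sketch of this loop (not the authors' implementation), assuming LAPACKE/CBLAS and an in-memory X; the function name gls_all and the fixed bound p <= 20 are illustration-only assumptions:

```c
#include <cblas.h>
#include <lapacke.h>

/* Sketch of the GLS loop: Cholesky of M once, then one TRSM and one
   small p x p solve per SNP i. Arrays are column-major; X holds all
   X_i back to back; the caller provides M (SPD), X, y and gets r. */
void gls_all(int n, int p, long m, double *M, double *X,
             double *y, double *r) {
    /* M = L L^T; L overwrites the lower triangle of M. */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, M, n);

    /* ytilde = L^{-1} y, computed once (overwrites y in place). */
    cblas_dtrsv(CblasColMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                n, M, n, y, 1);

    for (long i = 0; i < m; ++i) {
        double *Xi = X + i * (long)n * p;   /* Xhat_i = L^{-1} X_i */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasNonUnit, n, p, 1.0, M, n, Xi, n);

        double S[20 * 20];                  /* p x p; p <= 20 assumed */
        double *b = r + i * p;              /* b = Xhat_i^T ytilde    */
        cblas_dsyrk(CblasColMajor, CblasLower, CblasTrans,
                    p, n, 1.0, Xi, n, 0.0, S, p);
        cblas_dgemv(CblasColMajor, CblasTrans, n, p, 1.0, Xi, n,
                    y, 1, 0.0, b, 1);

        /* Solve (Xhat_i^T Xhat_i) r_i = Xhat_i^T ytilde in place. */
        LAPACKE_dposv(LAPACK_COL_MAJOR, 'L', p, 1, S, p, b, p);
    }
}
```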
OPTIMIZATIONS
• Blocking in $i$
  • many small TRSMs vs. one big TRSM
• Out-of-core algorithm
  • read block $b+1$ while computing block $b$
  • double-buffering technique necessary (a sketch follows below)
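A minimal sketch of the double-buffering idea with a POSIX thread prefetching the next block; load_block and compute_block are hypothetical stand-ins for the real HDD read and the TRSM-based computation, and the sizes are placeholders:

```c
#include <pthread.h>
#include <stdlib.h>

enum { NBLOCKS = 8, BLOCK_DOUBLES = 1 << 20 };  /* placeholder sizes */

/* Hypothetical stand-ins for the real HDD read and the TRSM-based
   computation on one block of X. */
static void load_block(int b, double *buf)    { (void)b; (void)buf; }
static void compute_block(int b, double *buf) { (void)b; (void)buf; }

struct job { int b; double *buf; };

static void *prefetch(void *arg) {
    struct job *j = arg;
    load_block(j->b, j->buf);   /* runs while the main thread computes */
    return NULL;
}

int main(void) {
    double *buf[2];
    buf[0] = malloc(sizeof(double) * BLOCK_DOUBLES);
    buf[1] = malloc(sizeof(double) * BLOCK_DOUBLES);

    load_block(0, buf[0]);      /* prime the pipeline */
    for (int b = 0; b < NBLOCKS; ++b) {
        pthread_t t;
        struct job j = { b + 1, buf[(b + 1) % 2] };
        int started = (b + 1 < NBLOCKS) &&
                      pthread_create(&t, NULL, prefetch, &j) == 0;

        compute_block(b, buf[b % 2]);        /* compute block b ...   */
        if (started) pthread_join(t, NULL);  /* ...while b+1 is read  */
    }
    free(buf[0]);
    free(buf[1]);
    return 0;
}
```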
PERFORMANCE
From years/months down to weeks/days
[Log-log plot: runtime (100 s to 10,000,000 s, i.e. minutes to years) vs. m (SNP count, 1M to 100M) for EMMAX, GWFGLS, FLMM, and CLAK-Chol]
OUTLINE
• Introduction and motivation
• Problem description
• CPU-only algorithm
• Leveraging the GPU
• Results and conclusion
CAN GPUs HELP GO FURTHER?
• TRSM takes 90-95% of the time
  • compute it on the GPU
• while the GPU computes:
  • CPU computations
  • CPU ⇄ GPU transfers
• our cluster: NVIDIA Fermi
MEMORY PYRAMID
• GPU: 1-10 GB
• RAM: 10-100 GB
• HDD: terabytes
Need for streaming computation
Need for two levels of double-buffering
2-LEVEL TRIPLE-DOUBLE-BUFFERING
[Diagram: the GPU holds two TRSM buffers (α, β); CPU/RAM holds three buffers (A, B, C); the HDD streams blocks b-3 ... b+3 of the data X and of the results r]
(1) Retrieve the previous results from the GPU; start reading the second-next block from the HDD
(2) Send the next block to the GPU; start the CPU computation
(3) Write the results to disk (fast, because they are small)
(4) Rotate the buffers, iterate, smile
Together, steps (1)-(4) make up one full iteration.
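A minimal sketch of the GPU level of this scheme, assuming cuBLAS and two CUDA streams; gpu_trsm_pipeline, the buffer layout, and the pinned host blocks are illustration-only assumptions, not the authors' code:

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Sketch: double-buffered TRSM on the GPU. While stream s[cur] runs
   TRSM on block b, stream s[nxt] uploads block b+1; downloads go on
   the same stream as the TRSM they depend on. Host blocks hX[] must
   be pinned (cudaMallocHost) so the async copies overlap compute. */
void gpu_trsm_pipeline(cublasHandle_t h, int n, int bp, int nblocks,
                       const double *dL, /* n x n factor L, on GPU */
                       double **hX,      /* nblocks pinned host blocks */
                       double **dX)      /* two n x bp device buffers  */
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    const double one = 1.0;
    size_t bytes = sizeof(double) * (size_t)n * bp;

    /* Prime the pipeline: upload block 0. */
    cudaMemcpyAsync(dX[0], hX[0], bytes, cudaMemcpyHostToDevice, s[0]);

    for (int b = 0; b < nblocks; ++b) {
        int cur = b % 2, nxt = 1 - cur;

        /* Upload the next block on the other stream. */
        if (b + 1 < nblocks)
            cudaMemcpyAsync(dX[nxt], hX[b + 1], bytes,
                            cudaMemcpyHostToDevice, s[nxt]);

        /* Xhat = L^{-1} X on the current stream. */
        cublasSetStream(h, s[cur]);
        cublasDtrsm(h, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                    CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                    n, bp, &one, dL, n, dX[cur], n);

        /* Download the finished block back into its host buffer. */
        cudaMemcpyAsync(hX[b], dX[cur], bytes,
                        cudaMemcpyDeviceToHost, s[cur]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```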
TIMELINE
[Diagram: parallel timelines for GPU, CPU, and HDD. GPU: trsm on block b, then b+1, with sends of b+1 and b+2. CPU: receive results for b-1, compute on b-1, receive b. HDD: read b+2, read b+3, write b-1]
Parallelism on the vertical axis
Heavy use of asynchronous dispatching
[Legend: CPU ⇄ GPU transfer, GPU computation, HDD ⇄ CPU transfer, CPU computation, data dependencies, asynchronous dispatch]
TIMELINE, TO SCALE
Problem sizes: n = 10k, m = 100k, block = 10k
GPU: 2x NVIDIA Quadro 6000 (Fermi, 515 GFlops each, 6 GB memory) ≈ $10,000
CPU: 2x Intel Xeon X5650 (6 cores, 128 GFlops, 24 GB memory) ≈ $2,000
BLAS: Intel MKL 10.2. Compiler: icc 12.1
[Legend: CPU ⇄ GPU transfer, GPU computation, HDD ⇄ CPU transfer, CPU computation]
OUTLINE
• Introduction and motivation
• Problem description
• CPU-only algorithm
• Leveraging the GPU
• Results and conclusion
PERFORMANCE
5.2x speedup using 2 GPUs
Sustained in-core performance even when out-of-core
[Bar chart: runtime vs. m (nucleotide count, 1k to 90k) for the hybrid CPU+2GPU algorithm vs. the original CPU-only algorithm, with the in-core/out-of-core boundary marked; CPU-only runtimes grow to 96.7 s while the hybrid stays far lower]
PERFORMANCE
From years/months down to hours
[Log-log plot: runtime (100 s to 10,000,000 s, i.e. minutes to years) vs. m (SNP count, 1M to 100M) for EMMAX, GWFGLS, FLMM, CLAK-Chol, and CLAK-Chol GPU (extrapolated)]
SCALABILITY
Doubling the number of GPUs ⇒ runtime × 0.54: almost perfect
[Chart: runtime vs. number of GPUs, against perfect scalability: 40.7 s (1 GPU), 21.6 s (2), 16.2 s (3), 11.7 s (4)]
CONCLUSION • Don’t replace the CPU by GPU • Combine them • Hide data transfer latency by overlapping with computation • Double/triple-buffering • GPU never stops computing • GPUs order of magnitude faster? • V. Volkov (http://www.cs.berkeley.edu/~volkov/) • Victor W. Lee et al. ( «Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU», 2010 ) 38 of 38
QUESTIONS?
beyer@aices.rwth-aachen.de
[Background: the runtime vs. m (SNP count) performance plot repeated]
VISUALIZATION
[Two diagrams showing a viewer's eyes, a projection surface, and a 3D scene]
FUTURE WORK
• Solution for when $L$ is too big for GPU memory
• Apply similar techniques to similar problems
• Extension to multiple phenotypes ($y$)