www.bsc.es

Simulating the Behavior of the Human Brain on NVIDIA GPUs (Human Brain Project)
Pedro Valero-Lara, Iván Martínez-Pérez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, and Jesús Labarta
Munich, 09-11-2018
Human Brain Project (HBP)
● H2020 FET Flagship Project
● Accelerate the fields of neuroscience, computing, and brain-related medicine
● 8 different sub-projects
  ➔ Sub-Project 7: High Performance Analytics and Computing
● WP 7.5: Providing support for the migration of simulation codes to hybrid and/or accelerator-enabled architectures
● 86×10⁹ (86 billion) neurons
  ➔ ~80,000 Volta GPUs
● Steps:
  ➔ Neuron generation → once at the very beginning
  ➔ Solving voltage capacitance
  ➔ Synapses (spiking) → communication
Solving Voltage Capacitance – Hines Method
● Ax = b, where A is a Hines matrix (3 vectors)
  ➔ Similar to a tridiagonal system (Thomas method)
  ➔ 8×N operations
  ➔ Vector p → branches

void hines_solver(double *a, double *b, double *d, double *rhs,
                  int *p, int cell_size)
{
    double factor;
    // backward sweep
    for (int i = cell_size - 1; i > 0; --i) {
        factor = a[i] / d[i];
        d[p[i]]   -= factor * b[i];
        rhs[p[i]] -= factor * rhs[i];
    }
    rhs[0] /= d[0];
    // forward sweep
    for (int i = 1; i < cell_size; ++i) {
        rhs[i] -= b[i] * rhs[p[i]];
        rhs[i] /= d[i];
    }
}
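To make explicit which linear system the two sweeps solve, here is a small host-side sketch. The helper `hines_matvec` and the 4-compartment example below are illustrative assumptions, not from the slides: they rebuild y = Ax directly from the (a, b, d, p) storage that the solver consumes.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the slides): compute y = A x for a Hines
 * matrix stored as (a, b, d, p).  In the solver's convention, row i couples
 * to its parent with b[i] (A[i][p[i]] = b[i]), the parent row couples back
 * with a[i] (A[p[i]][i] = a[i]), and d holds the diagonal.
 */
static void hines_matvec(const double *a, const double *b, const double *d,
                         const int *p, const double *x, double *y, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = d[i] * x[i];
    for (int i = 1; i < n; ++i) {
        y[i]    += b[i] * x[p[i]];   /* compartment i -> its parent */
        y[p[i]] += a[i] * x[i];      /* parent row -> compartment i */
    }
}
```

On a 4-compartment cell where compartments 2 and 3 both branch off compartment 1 (p = {0, 0, 1, 1}, unit couplings, diagonal 4), x = (1, 1, 1, 1) gives y = (5, 7, 5, 5); feeding that y back through the solver above recovers x. This is exactly how the vector p encodes branching.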
Implementation of cuHinesBatch
● Saturate the GPU with a high number of neurons
  ➔ 1 thread per neuron
  ➔ No synchronizations
  ➔ No atomic operations
● Data layouts
  ➔ Flat → no coalescing
  ➔ Full-Interleaved → coalescing, but big jumps in memory
  ➔ Block-Interleaved → coalescing, small jumps in memory
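The index arithmetic behind the two main layouts can be sketched as follows (function names are illustrative assumptions; batch = number of neurons, n = elements per neuron). With one thread per neuron, Flat puts neighboring threads n elements apart in memory, while Full-Interleaved puts them one element apart, which is what enables coalescing:

```c
#include <assert.h>

/* Flat layout: each neuron's elements are contiguous. */
static int flat_idx(int neuron, int elem, int n)
{
    return neuron * n + elem;     /* neighboring threads are n apart */
}

/* Full-Interleaved layout: element e of every neuron is contiguous. */
static int interleaved_idx(int neuron, int elem, int batch)
{
    return elem * batch + neuron; /* neighboring threads are 1 apart */
}
```

Block-Interleaved applies the same interleaving within blocks of neurons, trading perfect adjacency for smaller jumps between consecutive elements of one neuron.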
Performance of cuHinesBatch: Flat
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 300 elements and 2 branches
  ➔ Double precision
  ➔ Batch size: 512; 5,120; 51,200; 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch: Full-Interleaved
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 300 elements and 2 branches
  ➔ Double precision
  ➔ Batch size: 512; 5,120; 51,200; 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch: Block-Interleaved
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 300 elements and 2 branches
  ➔ Double precision
  ➔ Batch size = 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch on Real Neurons
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 6 different morphologies (http://www.neuromorpho.org/)
    • Small, medium, and big
    • Low (10%) and high (50%) #branches
  ➔ Batch size: 256; 2,560; 25,600; 256,000
● Setting
  ➔ Block-Interleaved
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch on Pascal
● P100 NVIDIA GPU (Pascal)
  ➔ 3584 CUDA cores
  ➔ 16 GB HBM2
● Input (Hines matrices)
  ➔ Medium (size)
  ➔ Low (% #branches)
  ➔ Batch size = 256,000
● Setting
  ➔ Full-Interleaved
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_62
● NVPROF
  ➔ High occupancy (99.5%)
  ➔ High bandwidth (500 GB/s)
  ➔ No memory issues
Performance of cuHinesBatch: cuThomasBatch
● 1 logical GPU (K40) of a K80 NVIDIA GPU (Kepler)
  ➔ 2496 CUDA cores
  ➔ 12 GB GDDR5
● Input (Hines matrices)
  ➔ System size: 64; 128; 256; 512; 1,024; 2,048; 4,096; 8,192
  ➔ Batch size: 256; 2,560; 25,600; 256,000 and 20; 200; 2,000; 20,000
● Setting
  ➔ cusparseDgtsvStridedBatch
  ➔ cuThomasBatch
● Results
  ➔ 1.2–2.8x faster
  ➔ 4x more precise
  ➔ 2x less memory occupancy
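For context, the per-system kernel behind cuThomasBatch is the classic Thomas algorithm; a sequential, in-place C sketch (names are illustrative; a batched GPU version would run one such solve per thread):

```c
#include <assert.h>
#include <math.h>

/*
 * Thomas algorithm sketch (illustrative, not the cuThomasBatch source):
 * a = sub-diagonal (a[0] unused), d = diagonal, b = super-diagonal
 * (b[n-1] unused).  d and rhs are overwritten; rhs holds the solution.
 */
static void thomas_solver(const double *a, const double *b,
                          double *d, double *rhs, int n)
{
    // forward elimination
    for (int i = 1; i < n; ++i) {
        double factor = a[i] / d[i - 1];
        d[i]   -= factor * b[i - 1];
        rhs[i] -= factor * rhs[i - 1];
    }
    // back substitution
    rhs[n - 1] /= d[n - 1];
    for (int i = n - 2; i >= 0; --i)
        rhs[i] = (rhs[i] - b[i] * rhs[i + 1]) / d[i];
}
```

Note that this is the p[i] = i − 1 special case of the Hines solver: a chain with no branches.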
Performance of cuHinesBatch: Multi-Morphology
● 1 logical GPU (K40) of a K80 NVIDIA GPU (Kepler)
  ➔ 2496 CUDA cores
  ➔ 12 GB GDDR5
● Input (Hines matrices)
  ➔ Different morphologies
    • Mono-morphology: same size
    • Multi-morphology: different sizes (1,024; 2,048; 4,096; 8,192)
  ➔ Batch size = 25,600
  ➔ 10% and 50% of #branches
● Setting
  ➔ Full-Interleaved
  ➔ Padding
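With Full-Interleaved storage plus padding, every system is padded up to the size of the largest one so that the element-major indexing still holds across a mixed batch. A hypothetical host-side packing routine (the function name and the zero fill are assumptions, not the repository's code) illustrates the idea:

```c
#include <assert.h>

/*
 * Hypothetical packing (illustrative): copy `batch` systems of possibly
 * different sizes into one Full-Interleaved buffer, padding each system
 * with zeros up to max_n elements.  out must hold max_n * batch doubles.
 */
static void pack_padded_interleaved(const double *const *in, const int *sizes,
                                    int batch, int max_n, double *out)
{
    for (int e = 0; e < max_n; ++e)
        for (int i = 0; i < batch; ++i)
            out[e * batch + i] = (e < sizes[i]) ? in[i][e] : 0.0;
}
```

The padding keeps accesses coalesced, but the wasted elements are one reason the multi-morphology case loses performance relative to mono-morphology.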
Conclusions & Future Work
● cuHinesBatch
  ➔ High performance (50x faster than sequential CPU)
  ➔ Good scaling even when using a very high number of neurons
    • 1 thread per neuron (Hines system)
    • Full-Interleaved data layout
    • Faster than using one CUDA block per system
● cuThomasBatch
  ➔ Data layout transformation (from flat to full-interleaved)
    • Once, at the very beginning of the simulation
● Drop in performance for multi-morphology
  ➔ 2 approaches: cuThomasBatch per segment, cusparseDgtsvStridedBatch per segment
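The one-time layout transformation mentioned above amounts to a transpose of the batch; a minimal C sketch (the function name is an assumption for illustration):

```c
#include <assert.h>

/* One-time transform at simulation start: flat (neuron-major) to
 * full-interleaved (element-major).  in holds batch systems of n elements
 * each; out must have room for batch * n doubles. */
static void flat_to_interleaved(const double *in, double *out,
                                int batch, int n)
{
    for (int neuron = 0; neuron < batch; ++neuron)
        for (int e = 0; e < n; ++e)
            out[e * batch + neuron] = in[neuron * n + e];
}
```

Because it runs once at startup while the solver runs every time step, the transform's cost is amortized over the whole simulation.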
References & Acknowledgments
● Pedro Valero-Lara, Iván Martínez-Pérez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, Jesús Labarta: cuHinesBatch: Solving Multiple Hines Systems on GPUs, Human Brain Project. ICCS 2017: 566-575
● Pedro Valero-Lara, Iván Martínez-Pérez, Raül Sirvent, Xavier Martorell, Antonio J. Peña: NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems: Implementation of cuThomasBatch. PPAM 2017
● cuHinesBatch repository: https://pm.bsc.es/gitlab/imartin1/cuHinesBatch

Acknowledgements: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 720270 (HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), and from the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051). We thank the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence. Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266.
Thank you! For further information, please contact pedro.valero@bsc.es