NVIDIA Application Lab at Jülich
Dirk Pleiter | Jülich Supercomputing Centre (JSC)
Member of the Helmholtz Association

Forschungszentrum Jülich at a Glance (status 2010)
  Budget: 450 million euros
  Staff: 4,800 (of whom 1,630 are scientists)
  Visiting scientists: 900 per year
  Trainees: 90
  Publications: 1,800
  Protective rights and licences: 14,800
  Research fields: health, energy and environment, and information technology; key technologies for tomorrow
14.11.2012 Dirk Pleiter | NVIDIA Application Lab at Jülich

Jülich Supercomputing Centre
  Supercomputer operation for:
    Centre: FZJ
    Regional: JARA
    Helmholtz and national: NIC, GCS
    Europe: PRACE, EU projects
  Application support
    User support; coordination with SimLabs
    Scientific visualization
    Peer review support and coordination
  R&D work
    Algorithms, performance analysis and tools
    Community data management service
    Computer architectures, Exascale Laboratories: EIC, ECL, NVIDIA
  Education and training

Supercomputer Systems: Dual Track Approach
  [Timeline figure with two tracks, general-purpose and highly scalable]
  2004: IBM Power 4+, JUMP, 9 TFlop/s (general-purpose)
  2006-8: IBM Power 6, JUMP, 9 TFlop/s (general-purpose); IBM Blue Gene/L, JUBL, 45 TFlop/s (highly scalable)
  2009: JUROPA, 200 TFlop/s, and HPC-FF, 100 TFlop/s (general-purpose); IBM Blue Gene/P, JUGENE, 1 PFlop/s (highly scalable)
  2012: IBM Blue Gene/Q, JUQUEEN, 5.7 PFlop/s (target) (highly scalable)
  2014: JUROPA++, cluster with 1-2 PFlop/s, plus Booster (general-purpose)
  Also shown: JUDGE, 240 TFlop/s; file server (GPFS, Lustre)

JUDGE Cluster System
  206 IBM iDataPlex nodes
  2 Tesla M2050 or M2070 GPUs per node
  InfiniBand QDR network
  Peak performance: 239 TFlop/s
  Users
    Institute for Advanced Simulations: molecular dynamics and mechanics, micro-magnetism simulations, medical image reconstruction
    JuBrain partition
    Milky Way partition

NVIDIA Application Lab at Jülich
  Collaboration between JSC and NVIDIA since July 2012
    Enable scientific applications for GPU-based architectures
    Provide support for their optimization
    Investigate performance and scaling
  Work focus
    Application requirements analysis
    Kepler and CUDA feature analysis
    Parallelization on many GPUs
    Collaboration with performance tools developers
    Training

Pilot Application: JuBrain
  Application developed at the Institute of Neuroscience and Medicine (INM-1) at Forschungszentrum Jülich: Katrin Amunts, Markus Axer, Marcel Huysegoms
  Research goal: an accurate, highly detailed computer model of the human brain

Brain Section Images
  Blockface pictures: created while cutting the brain into sections
  Histological images
  Polarized light images
  Low resolution vs. high resolution
    Pixel size: 100 μm → 3 μm
    Data volume: 30 MBytes → 40 GBytes (exceeds GPU memory capacity)
  Challenge: 3D reconstruction

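The jump in data volume follows from the pixel size. As a rough check, assuming the same field of view and the same number of bytes per pixel at both resolutions (neither is stated on the slide), the pixel count grows with the inverse square of the pixel size:

```latex
\[
\left(\frac{100\,\mu\mathrm{m}}{3\,\mu\mathrm{m}}\right)^{2} \approx 1.1 \times 10^{3},
\qquad
30\,\mathrm{MB} \times 1.1 \times 10^{3} \approx 33\,\mathrm{GB}
\]
```

This lands in the same range as the 40 GBytes quoted above. For comparison, the Tesla M2050 and M2070 boards in JUDGE carry 3 GB and 6 GB of memory respectively, so a single high-resolution image cannot be held on one GPU.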
3D Reconstruction
  [Figure: registration pipeline with the components fixed image, moving image, metric, optimizer, interpolator, transformation]
  Registration algorithms
    Rigid registration → 3 parameters
    Affine registration → 6 parameters
    Elastic registration → O(100) parameters
  O(30) speedup on GPU

Fluid dynamics on Fermi and Kepler
  Lattice Boltzmann method, D2Q37 model
  Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven
  Reproduces the dynamics of a fluid by simulating virtual particles which collide and propagate
  Simulation of large systems requires double precision computation on many GPUs

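To make the propagate step concrete, here is a minimal host-side C sketch. It is not the application's code: it swaps in the much smaller D2Q9 velocity set for D2Q37, uses periodic boundaries, and assumes the site index is idx = x*NY + y. Only the population-major layout f[i*NX*NY + idx] is taken from the kernel fragment shown in these slides. The lattice size and the name propagate are illustrative.

```c
#define NX 8     /* lattice sites in x (illustrative size) */
#define NY 8     /* lattice sites in y (illustrative size) */
#define NPOP 9   /* D2Q9 stand-in; the application uses D2Q37 */

/* Discrete velocities of the standard D2Q9 model. */
static const int cx[NPOP] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
static const int cy[NPOP] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };

/* Propagate: population i at site (x, y) moves to (x + cx[i], y + cy[i]).
   Sites wrap around periodically. src and dst must not overlap. */
void propagate(const double *src, double *dst) {
    for (int i = 0; i < NPOP; i++) {
        for (int x = 0; x < NX; x++) {
            for (int y = 0; y < NY; y++) {
                int xd = (x + cx[i] + NX) % NX;
                int yd = (y + cy[i] + NY) % NY;
                dst[i * NX * NY + xd * NY + yd] = src[i * NX * NY + x * NY + y];
            }
        }
    }
}
```

The step only moves data and performs no floating-point arithmetic, so its cost is set by memory traffic. The collide step, by contrast, updates the populations at each site from local values only and is dominated by arithmetic.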
Collide kernel on Fermi
  Kernel dominated by arithmetic operations
  [Figure: floating-point performance [GFlop/s] as a function of the number of threads per block]
  Excellent performance on Fermi
  Implementation: F. Schifano (U Ferrara/INFN)

Kepler Performance Tuning
  Performance analysis observations
    Significant increase of L1 cache misses: 17% (Tesla M2090) → 67% (Tesla K20)
    SM performance increased, but L1 cache capacity remained unchanged
  Original loop:

    for (i = 0; i < NPOP-1; i++) {
        lPop = p_prv[i*NX*NY + idx];
        u = u + param_cx[i] * lPop;
        v = v + param_cy[i] * lPop;
    }

  Problem mitigation by simple code change
    Enforce loop unrolling to eliminate indirect memory accesses

    #pragma unroll
    for (i = 0; i < NPOP-1; i++) {
        lPop = p_prv[i*NX*NY + idx];
        u = u + param_cx[i] * lPop;
        v = v + param_cy[i] * lPop;
    }

  J. Kraus (NVIDIA Lab)

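The loop above can be placed in a small, self-contained host-side C function to show why unrolling helps. This is a sketch under assumptions, not the kernel itself: it uses D2Q9 velocities as a stand-in for D2Q37, sums over all stand-in populations rather than the NPOP-1 bound of the slide, and the function name velocity_moments is hypothetical. Because `#pragma unroll` is understood by nvcc and clang but not by every host compiler, the sketch guards it.

```c
#define NX 4     /* illustrative lattice size */
#define NY 4
#define NPOP 9   /* D2Q9 stand-in; the slide's kernel uses D2Q37 */

/* Velocity components per population (stand-in values, D2Q9). */
static const double param_cx[NPOP] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
static const double param_cy[NPOP] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };

/* Accumulate the velocity moments u and v at lattice site idx.
   p_prv uses the population-major layout p_prv[i*NX*NY + idx]. */
void velocity_moments(const double *p_prv, int idx, double *u, double *v) {
    double su = 0.0, sv = 0.0;
    /* The trip count is a compile-time constant, so the loop can be fully
       unrolled. After unrolling, each param_cx[i] and param_cy[i] is a
       known constant instead of an indexed load. */
#if defined(__CUDACC__) || defined(__clang__)
#pragma unroll
#endif
    for (int i = 0; i < NPOP; i++) {
        double lPop = p_prv[i * NX * NY + idx];
        su += param_cx[i] * lPop;
        sv += param_cy[i] * lPop;
    }
    *u = su;
    *v = sv;
}
```

Once the loop is unrolled, the only remaining memory accesses are the loads of the populations themselves, which is consistent with the reduced pressure on the L1 cache described above.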
Collide kernel on Kepler GK110
  Comparison Fermi vs. Kepler
  Grid size considered here: 252 x 16384
  [Figure: floating-point performance as a function of the number of threads per block]
  Performance improvement: 1.7x

Propagate kernel
  Kernel dominated by memory access
  Grid size considered here: 252 x 16384
  [Figure: memory bandwidth [GByte/s] as a function of the number of threads per block]
  Performance improvement: 1.4x

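A bandwidth figure for a kernel like this can be derived from its run time. The C helper below is a sketch under a stated assumption that the slides do not confirm: each population is read exactly once and written exactly once per step, in double precision. The function name propagate_bandwidth_gbs is hypothetical, and the run time must come from a measurement.

```c
/* Effective memory bandwidth in GByte/s for one propagate step.
   Assumption: every population is read once and written once as a double.
   nx, ny:   lattice dimensions
   npop:     populations per site (37 for D2Q37)
   seconds:  measured run time of one step, must be positive */
double propagate_bandwidth_gbs(long nx, long ny, int npop, double seconds) {
    double bytes_per_site = 2.0 * (double)npop * (double)sizeof(double);
    double total_bytes = bytes_per_site * (double)nx * (double)ny;
    return total_bytes / seconds / 1.0e9;
}
```

For the 252 x 16384 grid and 37 populations this assumption gives 592 bytes per site, or about 2.44 GByte of traffic per step.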
Summary
  NVIDIA Application Lab at Jülich
    New and fruitful model for collaboration
    We are just at the beginning ...
  Application requirements analysis
    JuBrain: project aiming for a realistic model of the human brain
  Kepler feature analysis
    Initial performance results for Lattice Boltzmann application on GK110
    Very high performance level reached on Fermi can be sustained