
NVIDIA Application Lab at Jülich, Dirk Pleiter | Jülich Supercomputing Centre


  1. NVIDIA Application Lab at Jülich
     Dirk Pleiter | Jülich Supercomputing Centre (JSC)
     Member of the Helmholtz Association

  2. Forschungszentrum Jülich at a Glance (status 2010)
     - Budget: 450 million euros
     - Staff: 4,800 (of whom 1,630 are scientists)
     - Visiting scientists: 900 per year
     - Trainees: 90
     - Publications: 1,800
     - Protective rights and licences: 14,800
     - Research fields: health, energy and environment, and information technology; key technologies for tomorrow
     14.11.2012 Dirk Pleiter | NVIDIA Application Lab at Jülich

  3. Jülich Supercomputing Centre
     Supercomputer operation for:
     - Centre – FZJ
     - Regional – JARA
     - Helmholtz & National – NIC, GCS
     - Europe – PRACE, EU projects
     Application support
     - User support; coordination with SimLabs
     - Scientific visualization
     - Peer review support and coordination
     R&D work
     - Algorithms, performance analysis and tools
     - Community data management service
     - Computer architectures, Exascale Laboratories: EIC, ECL, NVIDIA
     Education and training

  4. Supercomputer Systems: Dual Track Approach
     General-Purpose track
     - 2004: IBM Power 4+, JUMP, 9 TFlop/s
     - 2006-8: IBM Power 6, JUMP, 9 TFlop/s
     - 2009: JUROPA, 200 TFlop/s; HPC-FF, 100 TFlop/s
     - JUDGE, 240 TFlop/s
     - 2014: JUROPA++, Cluster, 1-2 PFlop/s + Booster
     Highly-Scalable track
     - 2006-8: IBM Blue Gene/L, JUBL, 45 TFlop/s
     - 2009: IBM Blue Gene/P, JUGENE, 1 PFlop/s
     - 2012: IBM Blue Gene/Q, JUQUEEN, 5.7 PFlop/s (target)
     File server: GPFS, Lustre

  5. JUDGE
     Cluster system
     - 206 IBM iDataPlex nodes
     - 2 Tesla M2050 or M2070 per node
     - InfiniBand QDR network
     - Peak performance: 239 TFlop/s
     Users
     - Institute for Advanced Simulations
     - Molecular dynamics and mechanics, micro-magnetism simulations, medical image reconstruction
     - JuBrain partition
     - Milky Way partition
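As a rough plausibility check of the quoted peak figure, the GPU share can be estimated from the node count. The sketch below is only an illustration: the function and parameter names are made up here, and the per-GPU value of 515 GFlop/s (the vendor's published double-precision peak for the Tesla M2050/M2070) does not appear on the slide.

```c
#include <assert.h>

/* Aggregate GPU peak in TFlop/s for a homogeneous cluster.
 * gflops_per_gpu is an assumption supplied by the caller; the slide
 * only quotes the total system peak. */
double gpu_peak_tflops(int nodes, int gpus_per_node, double gflops_per_gpu)
{
    assert(nodes >= 0 && gpus_per_node >= 0 && gflops_per_gpu >= 0.0);
    return nodes * gpus_per_node * gflops_per_gpu / 1000.0;
}
```

Called as `gpu_peak_tflops(206, 2, 515.0)` this gives roughly 212 TFlop/s. Under that assumption the rest of the quoted 239 TFlop/s would come from the host CPUs, whose type the slide does not state.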

  6. NVIDIA Application Lab at Jülich
     Collaboration between JSC and NVIDIA since July 2012
     - Enable scientific applications for GPU-based architectures
     - Provide support for their optimization
     - Investigate performance and scaling
     Work focus
     - Application requirements analysis
     - Kepler and CUDA feature analysis
     - Parallelization on many GPUs
     - Collaboration with performance tools developers
     - Training

  7. Pilot Application: JuBrain
     Application developed at the Institute of Neuroscience and Medicine (INM-1) at Forschungszentrum Jülich: Katrin Amunts, Markus Axer, Marcel Huysegoms
     Research goal: an accurate, highly detailed computer model of the human brain

  8. Brain Section Images
     Blockface pictures
     - Created while cutting the brain into sections
     Histological images
     - Polarized light images
     - Low resolution vs. high resolution
     - 100 μm → 3 μm pixel size
     - 30 MBytes → 40 GBytes of data (exceeds GPU memory capacity)
     Challenge: 3D reconstruction
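The jump in data volume follows from the pixel size: refining a 2D image from 100 μm to 3 μm pixels multiplies the pixel count by (100/3)², about 1,100. A minimal sketch of that arithmetic, assuming the bytes per pixel stay the same (the slide does not say so) and using an illustrative function name:

```c
#include <assert.h>

/* Size of a 2D image after refining the pixel size from coarse_um to
 * fine_um. The pixel count grows with the square of the linear factor;
 * bytes per pixel are assumed unchanged. */
double scaled_image_bytes(double bytes_coarse, double coarse_um, double fine_um)
{
    assert(coarse_um > 0.0 && fine_um > 0.0);
    double linear = coarse_um / fine_um;
    return bytes_coarse * linear * linear;
}
```

For 30 MBytes at 100 μm this yields roughly 33 GBytes at 3 μm, the same order of magnitude as the 40 GBytes quoted on the slide.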

  9. 3D Reconstruction
     [Diagram: registration loop with fixed image, moving image, metric, optimizer, interpolator and transformation]
     Registration algorithms
     - Rigid registration → 3 parameters
     - Affine registration → 6 parameters
     - Elastic registration → O(100) parameters
     O(30) speedup on GPU
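The parameter counts can be made concrete for 2D section images: an affine map has a 2x2 matrix plus a translation (6 parameters), and a rigid map is the special case where the matrix is a pure rotation (one angle plus a translation, 3 parameters). The sketch below is not the JuBrain code. It assumes a mean-squared-difference metric and nearest-neighbour interpolation purely for brevity, and all type and function names are hypothetical.

```c
#include <stddef.h>

/* Six parameters of a 2D affine transformation:
 * x' = a11*x + a12*y + tx,  y' = a21*x + a22*y + ty. */
typedef struct { double a11, a12, a21, a22, tx, ty; } Affine2D;

static void affine_apply(const Affine2D *t, double x, double y,
                         double *xo, double *yo)
{
    *xo = t->a11 * x + t->a12 * y + t->tx;
    *yo = t->a21 * x + t->a22 * y + t->ty;
}

/* Metric: mean squared difference between the fixed image and the moving
 * image sampled at the transformed coordinates (nearest neighbour).
 * Pixels that map outside the moving image are skipped. */
double mean_sq_diff(const float *fixed, const float *moving,
                    int nx, int ny, const Affine2D *t)
{
    double sum = 0.0;
    long count = 0;
    for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
            double xm, ym;
            affine_apply(t, (double)x, (double)y, &xm, &ym);
            /* Shift by 0.5 and reject out-of-range values before truncating,
             * which rounds to the nearest pixel centre. */
            double xr = xm + 0.5, yr = ym + 0.5;
            if (xr < 0.0 || yr < 0.0 || xr >= nx || yr >= ny)
                continue;
            int ix = (int)xr, iy = (int)yr;
            double d = (double)fixed[(size_t)y * nx + x]
                     - (double)moving[(size_t)iy * nx + ix];
            sum += d * d;
            count++;
        }
    }
    return count > 0 ? sum / (double)count : 0.0;
}
```

An optimizer would vary the six fields of `Affine2D` to minimize this metric. Every pixel contributes independently to the sum, which is what makes the metric evaluation a natural candidate for a GPU.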

  10. Fluid dynamics on Fermi and Kepler
      Lattice Boltzmann method
      - D2Q37 model
      - Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven
      - Reproduces the dynamics of a fluid by simulating virtual particles which collide and propagate
      - Simulation of large systems requires double-precision computation on many GPUs
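The propagate half of the collide-and-propagate cycle can be sketched on the host in plain C. This is an illustration under assumptions, not the application's kernel: it uses periodic boundaries, a pull scheme, and a structure-of-arrays layout `p[i*nx*ny + idx]` with `idx = x*ny + y` (the slides only show the `i*NX*NY + idx` part). The velocity set is passed in by the caller, so the sketch does not depend on the exact D2Q37 vectors.

```c
#include <assert.h>
#include <stddef.h>

/* Wrap a coordinate into [0, n) for periodic boundaries. */
static int wrap(int a, int n) { return ((a % n) + n) % n; }

/* Pull-scheme propagation: population i at site (x, y) is copied from the
 * site it streams from, (x - cx[i], y - cy[i]).
 * Layout: p[i*nx*ny + idx] with idx = x*ny + y (the idx mapping is an
 * assumption). prv and nxt must not overlap. */
void propagate(const double *prv, double *nxt, int nx, int ny,
               int npop, const int *cx, const int *cy)
{
    assert(nx > 0 && ny > 0 && npop > 0);
    size_t sites = (size_t)nx * (size_t)ny;
    for (int i = 0; i < npop; i++) {
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                size_t dst = (size_t)x * (size_t)ny + (size_t)y;
                size_t src = (size_t)wrap(x - cx[i], nx) * (size_t)ny
                           + (size_t)wrap(y - cy[i], ny);
                nxt[(size_t)i * sites + dst] = prv[(size_t)i * sites + src];
            }
        }
    }
}
```

The step performs no arithmetic on the populations, only loads and stores, so its performance is determined by memory access rather than by floating-point throughput.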

  11. Collide kernel on Fermi
      - Kernel dominated by arithmetic operations
      - [Plot: floating-point performance in GFlop/s as a function of the number of threads/block]
      Excellent performance on Fermi
      Implementation: F. Schifano (U Ferrara/INFN)

  12. Kepler Performance Tuning
      Original loop:
          for (i = 0; i < NPOP-1; i++) {
            lPop = p_prv[i*NX*NY + idx];
            u = u + param_cx[i] * lPop;
            v = v + param_cy[i] * lPop;
          }
      Performance analysis observations
      - Significant increase of L1 cache misses: 17% (Tesla M2090) → 67% (Tesla K20)
      - SM performance increased, but L1 cache capacity remained unchanged
      Problem mitigation by simple code change:
          #pragma unroll
          for (i = 0; i < NPOP-1; i++) {
            lPop = p_prv[i*NX*NY + idx];
            u = u + param_cx[i] * lPop;
            v = v + param_cy[i] * lPop;
          }
      - Enforce loop unrolling to eliminate indirect memory accesses
      J. Kraus (NVIDIA Lab)
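One way to read the fix is that, once the loop is fully unrolled, every index into `param_cx` and `param_cy` is a compile-time constant instead of a run-time loop counter. The host-side C sketch below shows that equivalence by hand. `NPOP = 5` and the constant tables are hypothetical stand-ins (the D2Q37 model has 37 populations), and the loop bound `NPOP-1` is kept as on the slide.

```c
#include <stddef.h>

enum { NPOP = 5 };  /* hypothetical small value for illustration */

/* Hypothetical constants standing in for the model's velocity components. */
static const double param_cx[NPOP] = { 0.0, 1.0, 0.0, -1.0, 0.0 };
static const double param_cy[NPOP] = { 0.0, 0.0, 1.0, 0.0, -1.0 };

/* Rolled form, as on the slide: param_cx[i] is indexed by the loop counter.
 * Under nvcc, "#pragma unroll" would be placed directly before the loop. */
void moments_rolled(const double *p_prv, size_t sites, size_t idx,
                    double *u, double *v)
{
    double lu = 0.0, lv = 0.0;
    for (int i = 0; i < NPOP - 1; i++) {
        double lPop = p_prv[(size_t)i * sites + idx];
        lu = lu + param_cx[i] * lPop;
        lv = lv + param_cy[i] * lPop;
    }
    *u = lu;
    *v = lv;
}

/* What full unrolling amounts to: the same operations in the same order,
 * with every table index written as a constant. */
void moments_unrolled(const double *p_prv, size_t sites, size_t idx,
                      double *u, double *v)
{
    double lu = 0.0, lv = 0.0, lPop;
    lPop = p_prv[0 * sites + idx]; lu += param_cx[0] * lPop; lv += param_cy[0] * lPop;
    lPop = p_prv[1 * sites + idx]; lu += param_cx[1] * lPop; lv += param_cy[1] * lPop;
    lPop = p_prv[2 * sites + idx]; lu += param_cx[2] * lPop; lv += param_cy[2] * lPop;
    lPop = p_prv[3 * sites + idx]; lu += param_cx[3] * lPop; lv += param_cy[3] * lPop;
    *u = lu;
    *v = lv;
}
```

Both functions return identical results; only the form of the table accesses differs, which is the property the slide's mitigation relies on.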

  13. Collide kernel on Kepler GK110
      Comparison Fermi vs. Kepler
      - Grid size considered here: 252 x 16384
      - [Plot: floating-point performance as a function of the number of threads/block]
      Performance improvement: 1.7x
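When the number of threads per block is varied for a fixed lattice, the number of blocks changes accordingly. A minimal sketch of that bookkeeping, assuming one thread per lattice site (the slides do not state the thread-to-site mapping) and using an illustrative function name:

```c
#include <assert.h>

/* Number of thread blocks needed to cover `sites` lattice sites with one
 * thread per site (an assumption), rounding up. */
long blocks_for(long sites, int threads_per_block)
{
    assert(sites >= 0 && threads_per_block > 0);
    return (sites + threads_per_block - 1) / threads_per_block;
}
```

The 252 x 16384 grid has 4,128,768 sites, so 256 threads per block would correspond to 16,128 blocks under this mapping.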

  14. Propagate kernel
      Kernel dominated by memory access
      - Grid size considered here: 252 x 16384
      - [Plot: memory bandwidth in GByte/s as a function of the number of threads/block]
      Performance improvement: 1.4x
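An effective bandwidth figure of this kind is typically derived from the bytes moved per step divided by the measured kernel time. The sketch below assumes that each population is read once and written once per site in double precision, and that 1 GByte is 10^9 bytes. The slides confirm neither convention, and the function name is illustrative.

```c
#include <assert.h>

/* Effective bandwidth in GByte/s (1e9 bytes per GByte), assuming each of
 * the npop populations is read once and written once per lattice site. */
double propagate_bandwidth_gbytes(long nx, long ny, int npop, double seconds)
{
    assert(nx > 0 && ny > 0 && npop > 0 && seconds > 0.0);
    double bytes = 2.0 * (double)npop * (double)nx * (double)ny
                 * (double)sizeof(double);
    return bytes / seconds / 1.0e9;
}
```

Under these assumptions, one step on the 252 x 16384 grid with 37 populations moves about 2.44 GByte.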

  15. Summary
      NVIDIA Application Lab at Jülich
      - New and fruitful model for collaboration
      - We are just at the beginning ...
      Application requirements analysis
      - JuBrain: project aiming for a realistic model of the human brain
      Kepler feature analysis
      - Initial performance results for Lattice Boltzmann application on GK110
      - Very high performance level reached on Fermi can be sustained
