
GPU Accelerated Visualization and Analysis in VMD (John Stone)


  1. GPU Accelerated Visualization and Analysis in VMD
     John Stone, Theoretical and Computational Biophysics Group, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign
     http://www.ks.uiuc.edu/Research/vmd/
     Center for Molecular Modeling, University of Pennsylvania, June 9, 2009
     NIH Resource for Macromolecular Modeling and Bioinformatics, Beckman Institute, UIUC, http://www.ks.uiuc.edu/

  2. VMD – “Visual Molecular Dynamics”
     • Visualization and analysis of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry simulations, particle systems, …
     • User extensible with scripting and plugins
     • http://www.ks.uiuc.edu/Research/vmd/

  3. Range of VMD Usage Scenarios
     • Users run VMD on a diverse range of hardware: laptops, desktops, clusters, and supercomputers
     • Typically used as a desktop application for interactive 3D molecular graphics and analysis
     • Can also be run in pure text mode for numerically intensive analysis tasks, batch-mode movie rendering, etc.
     • GPU acceleration provides an opportunity to run some slow or batch calculations interactively or on demand

  4. Need for Multi-GPU Acceleration in VMD
     • Ongoing increases in supercomputing resources at NSF centers such as NCSA enable increased simulation complexity, fidelity, and longer time scales
     • Drives need for more visualization and analysis capability at the desktop and on clusters running batch analysis jobs
     • Desktop use is the most compute-resource-limited scenario, where GPUs can make a big impact

  5. Programmable Graphics Hardware
     Groundbreaking research systems:
     • AT&T Pixel Machine (1989): 82 x DSP32 processors
     • UNC PixelFlow (1992-98): 64 x (PA-8000 + 8,192 bit-serial SIMD)
     • SGI RealityEngine (1990s): up to 12 i860-XP processors perform vertex operations (microcode), fixed-function fragment hardware
     All mainstream GPUs now incorporate fully programmable processors
     [Figures: UNC PixelFlow rack; SGI RealityEngine i860 vertex processors]

  6. GLSL Sphere Fragment Shader
     • Written in OpenGL Shading Language
     • High-level C-like language with vector types and operations
     • Compiled dynamically by the graphics driver at runtime
     • Compiled machine code executes on GPU

  7. GPU Computing
     • Commodity devices, omnipresent in modern computers (over a million sold per week)
     • Massively parallel hardware, hundreds of processing units, throughput-oriented architecture
     • Standard integer and floating point types supported
     • Programming tools allow software to be written in dialects of familiar C/C++ and integrated into legacy software
     • GPU algorithms are often multicore friendly due to attention paid to data locality and data-parallel work decomposition

  8. What Speedups Can GPUs Achieve?
     • Single-GPU speedups of 10x to 30x vs. one CPU core are common
     • Best speedups can reach 100x or more, attained on codes dominated by floating point arithmetic, especially native GPU machine instructions, e.g. expf(), rsqrtf(), …
     • Amdahl’s Law can prevent legacy codes from achieving peak speedups with shallow GPU acceleration efforts
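
The Amdahl's Law point on slide 8 can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the example fractions are assumptions, not figures from the talk. It shows that if only 90% of a program's runtime is moved to a kernel that runs 100x faster, the whole program speeds up by less than 10x.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when only part of the runtime is accelerated.

    accelerated_fraction: share of the original runtime that is offloaded (0..1).
    kernel_speedup: how much faster the offloaded part runs (hypothetical value).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / kernel_speedup)


# Hypothetical case: 90% of the runtime offloaded to a 100x faster kernel.
print(f"{amdahl_speedup(0.90, 100.0):.2f}x")  # prints "9.17x"
```

The remaining 10% of unaccelerated work dominates, which is why a shallow port of a legacy code rarely approaches the kernel-level speedup.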

  9. Comparison of CPU and GPU Hardware Architecture
     • CPU: cache heavy, focused on individual thread performance
     • GPU: ALU heavy, massively parallel, throughput oriented

  10. NVIDIA Streaming Processor Array (GT200)
     [Block diagram; labels recovered from the figure:]
     • Streaming Processor Array: 10 Texture Processor Clusters (TPC)
     • Constant cache: 64kB, read-only
     • Texture Processor Cluster: Streaming Multiprocessors (SM) plus a texture unit
     • Texture unit: read-only, 8kB spatial cache, 1/2/3-D interpolation
     • Streaming Multiprocessor: instruction L1, data L1, instruction fetch/dispatch, shared memory, Streaming Processors (SP), Special Function Units (SFU), FP64 unit (double precision)
     • Streaming Processor (SP): ADD, SUB, MAD, etc.
     • Special Function Unit (SFU): SIN, EXP, RSQRT, etc.

  11. GPU Peak Single-Precision Performance: Exponential Trend

  12. GPU Peak Memory Bandwidth: Linear Trend
     [Chart; GT200 labeled]

  13. CUDA Acceleration in VMD
     • Electrostatic field calculation, ion placement
     • Molecular orbital calculation and display
     • Imaging of gas migration pathways in proteins with implicit ligand sampling

  14. Electrostatic Potential Maps
     • Electrostatic potentials evaluated on 3-D lattice
     • Applications include:
       – Ion placement for structure building
       – Time-averaged potentials for simulation
       – Visualization and analysis
     [Figure: Isoleucine tRNA synthetase]

  15. Direct Coulomb Summation
     • Each lattice point accumulates the electrostatic potential contribution from all atoms:
       potential[j] += charge[i] / r_ij
       where r_ij is the distance from lattice[j] to atom[i]
     [Figure: lattice point j being evaluated, and atom[i]]
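
The accumulation rule on slide 15 can be written as a plain CPU reference loop. This is a minimal sketch, not the VMD implementation: the function name, the (x, y, z, charge) tuple layout, the lattice origin at (0, 0, 0), and the omission of physical constants are all assumptions made for illustration. It also assumes no atom sits exactly on a lattice point, since that would divide by zero.

```python
import math


def direct_coulomb_summation(atoms, dims, spacing):
    """Reference direct Coulomb summation on a regular 3-D lattice.

    atoms:   list of (x, y, z, charge) tuples (hypothetical layout).
    dims:    (nx, ny, nz) lattice point counts; the lattice origin is (0, 0, 0).
    spacing: distance between neighbouring lattice points.
    Returns a flat list indexed as (k * ny + j) * nx + i.
    Physical constants (Coulomb constant, dielectric) are left out.
    """
    nx, ny, nz = dims
    potential = [0.0] * (nx * ny * nz)
    for k in range(nz):
        z = k * spacing
        for j in range(ny):
            y = j * spacing
            for i in range(nx):
                x = i * spacing
                total = 0.0
                # Every lattice point visits every atom: O(points * atoms) work.
                for ax, ay, az, charge in atoms:
                    r = math.sqrt((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2)
                    total += charge / r
                potential[(k * ny + j) * nx + i] = total
    return potential


# One unit charge two lattice spacings below the origin, on a 2 x 1 x 1 lattice.
print(direct_coulomb_summation([(0.0, 0.0, -2.0, 1.0)], (2, 1, 1), 1.0))
```

Each lattice point's sum is independent of every other point's, which is what makes the calculation straightforward to spread across many GPU threads.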

  16. Direct Coulomb Summation on the GPU
     • GPU outruns a CPU core by 44x
     • Work is decomposed into tens of thousands of independent threads, multiplexed onto hundreds of GPU processing units
     • Single-precision FP arithmetic is adequate for intended application
     • Numerical accuracy can be improved by compensated summation, spatially ordered summation groupings, or accumulation of potential in double-precision
     • Starting point for more sophisticated linear-time algorithms like multilevel summation
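
Slide 16 names compensated summation as one way to improve accuracy. The sketch below shows Kahan summation, one common form of compensated summation; the slides do not say which variant VMD uses, so treat this as an illustration of the idea only. Python floats are double precision, whereas the slide concerns single-precision GPU arithmetic, but the mechanism is the same: a second variable carries the rounding error lost by each addition.

```python
import math


def kahan_sum(values):
    """Sum values while tracking the rounding error lost at each step."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for value in values:
        corrected = value - compensation
        new_total = total + corrected
        # (new_total - total) is what was actually added; subtracting the
        # intended addend leaves the rounding error of this step.
        compensation = (new_total - total) - corrected
        total = new_total
    return total


values = [0.1] * 1000

naive = 0.0
for value in values:  # explicit loop: a plain left-to-right accumulation
    naive += value

reference = math.fsum(values)  # correctly rounded sum, used as ground truth
print("naive error:", abs(naive - reference))
print("kahan error:", abs(kahan_sum(values) - reference))
```

The compensation costs a few extra floating point operations per term, which is usually cheap on hardware that is arithmetic-rich and bandwidth-limited.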

  17. DCS CUDA Block/Grid Decomposition (unrolled, coalesced)
     • Unrolling increases computational tile size
     • Grid of thread blocks: (0,0), (0,1), …, (1,0), (1,1), …
     • Thread blocks: 64-256 threads
     • Threads compute up to 8 potentials, skipping by half-warps
     • Padding waste
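
The trade-off on slide 17 between larger tiles and padding waste can be estimated with simple integer arithmetic. The sketch below is an assumption-laden illustration: the 16 x 16 block shape, the unroll factor of 8 along x, and the function and key names are hypothetical choices that merely fall within the slide's "64-256 threads" and "up to 8 potentials" ranges. It computes how many thread blocks cover one lattice plane and how many padded lattice points are computed but discarded.

```python
def dcs_launch_geometry(nx, ny, block_x=16, block_y=16, unroll_x=8):
    """Grid size and padding for one 2-D lattice plane (illustrative only).

    Each thread is assumed to compute unroll_x potentials along x, so one
    thread block covers a tile of (block_x * unroll_x) by block_y points.
    """
    if nx <= 0 or ny <= 0:
        raise ValueError("lattice dimensions must be positive")
    tile_x = block_x * unroll_x
    tile_y = block_y
    grid_x = -(-nx // tile_x)  # ceiling division
    grid_y = -(-ny // tile_y)
    padded_nx = grid_x * tile_x
    padded_ny = grid_y * tile_y
    return {
        "grid": (grid_x, grid_y),
        "threads_per_block": block_x * block_y,
        "padded": (padded_nx, padded_ny),
        "wasted_points": padded_nx * padded_ny - nx * ny,
    }


# Hypothetical 300 x 200 plane: 3 x 13 blocks, padded to 384 x 208.
print(dcs_launch_geometry(300, 200))
```

Padding the lattice to a whole number of tiles lets every thread run the same unrolled code path without bounds checks, at the cost of the wasted points reported above.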

  18. Direct Coulomb Summation on the GPU
     [Data-flow diagram; labels recovered from the figure:]
     • Host: atomic coordinates and charges
     • GPU constant memory
     • Grid of thread blocks, with lattice padding
     • Thread blocks: 64-256 threads
     • Threads compute up to 8 potentials, skipping by half-warps
     • Parallel data caches and texture units
     • Global memory

  19. Direct Coulomb Summation Runtime
     [Chart; lower is better. Annotations:]
     • GPU underutilized
     • GPU fully utilized, ~40x faster than CPU
     • Cold start GPU initialization time: ~110ms
     Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28:2618-2640, 2007.

  20. Direct Coulomb Summation Performance
     • Number of thread blocks modulo number of SMs results in significant performance variation for small workloads
     • CUDA-Unroll8clx: fastest GPU kernel, 44x faster than CPU, 291 GFLOPS on GeForce 8800GTX
     • CUDA-Simple: 14.8x faster than CPU, 33% of fastest GPU kernel
     GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 96:879-899, 2008.
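
The first observation on slide 20, that performance for small workloads depends on the number of thread blocks modulo the number of SMs, can be illustrated with a simplified occupancy model. The sketch below assumes every block takes the same time and that blocks are issued in full "waves" across the SMs; real GPU scheduling is more dynamic, and the function name and `blocks_per_sm` parameter are hypothetical. The 16-SM figure matches the GeForce 8800 GTX.

```python
import math


def sm_wave_efficiency(num_blocks, num_sms, blocks_per_sm=1):
    """Fraction of SM block slots doing useful work under a wave model.

    Assumes equal-duration blocks scheduled in waves of
    num_sms * blocks_per_sm; a partially filled last wave leaves SMs idle.
    """
    if num_blocks <= 0 or num_sms <= 0 or blocks_per_sm <= 0:
        raise ValueError("all arguments must be positive")
    slots_per_wave = num_sms * blocks_per_sm
    waves = math.ceil(num_blocks / slots_per_wave)
    return num_blocks / (waves * slots_per_wave)


# 16 SMs: one extra block beyond a full wave costs much more when the
# workload is small than when it is large.
for blocks in (16, 17, 160, 161):
    print(blocks, round(sm_wave_efficiency(blocks, 16), 3))
```

Under this model the penalty of a ragged final wave shrinks as the number of waves grows, consistent with the slide's note that the variation matters mainly for small workloads.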

  21. Multi-GPU Direct Coulomb Summation
     NCSA GPU Cluster: http://www.ncsa.uiuc.edu/Projects/GPUcluster/

     Configuration                                 Evals/sec     TFLOPS   Speedup*
     4-GPU (2 Quadroplex) Opteron node at NCSA     157 billion   1.16     176
     4-GPU GTX 280 (GT200)                         241 billion   1.78     271

     * Speedups relative to Intel QX6700 CPU core w/ SSE
     [Figure: GPU 1 … GPU N]
