REAL TIME CONTROL FOR ADAPTIVE OPTICS WORKSHOP (3RD EDITION) | 27th January 2016
François Courteille | Senior Solutions Architect, NVIDIA | fcourteille@nvidia.com
THE WORLD LEADER IN VISUAL COMPUTING
Gaming | Pro Visualization | Enterprise | Data Center | Auto
TESLA ACCELERATED COMPUTING PLATFORM
Focused on co-design from top to bottom: application, middleware, system software, processor.
• Productive: accessibility, fast GPU programming model & tools
• Expert: co-design engineered for high throughput and large systems
• Fast GPU + strong CPU
[Chart: peak TFLOPS, 2008-2014, NVIDIA GPU (M1060, M2090, K20, K40, K80) vs x86 CPU]
PERFORMANCE LEAD CONTINUES TO GROW
[Charts: peak double-precision GFLOPS and peak memory bandwidth (GB/s), 2008-2014, NVIDIA GPU (M1060, M2090, K20, K40, K80) vs x86 CPU (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]
GPU ARCHITECTURE ROADMAP
[Chart: SGEMM/W by architecture generation, 2008-2018: Tesla, Fermi, Kepler, Maxwell, Pascal]
Pascal adds mixed precision, 3D memory and NVLink.
KEPLER SM (SMX)
• Scheduler not tied to cores
• Double issue for maximum utilization
[Diagram: instruction cache, 4 warp schedulers, register file, 192 CUDA cores (SP, DP, SFU, LD/ST units), shared memory / L1 cache, on-chip network]
MAXWELL SM (SMM)
• Simplified design: power-of-two, quadrant-based; scheduler tied to cores
• Better utilization: single issue sufficient, lower instruction latency
• Efficiency: <10% performance difference from SMX at ~50% of the SMX chip area
[Diagram: instruction cache, texture/L1 cache, shared memory; each quadrant has its own instruction buffer, register file, warp scheduler, 32 SP CUDA cores, SFU and LD/ST units]
HISTOGRAM: PERFORMANCE PER SM
[Chart: bandwidth per SM (GiB/s) vs elements per thread (1 to 128) for Fermi M2070, Kepler K20X and Maxwell GTX 750 Ti; Maxwell is about 5.5x faster per SM]
Higher performance is expected on larger GPUs (more SMs).
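The benchmark above is a shared-memory histogram with a varying number of elements processed per thread. Below is a minimal CUDA sketch of that pattern; it is not the benchmark code from the slide, and the 256-bin size, block/grid sizes and kernel name are illustrative assumptions.

    // Minimal sketch of a shared-memory histogram (not the slide's benchmark code).
    // Each block builds a private histogram in shared memory, then merges it into
    // the global result; a grid-stride loop controls the elements-per-thread ratio.
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define NUM_BINS 256

    __global__ void histogram256(const unsigned char *in, int n, unsigned int *bins)
    {
        __shared__ unsigned int smem[NUM_BINS];
        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            smem[i] = 0;                        // clear the per-block histogram
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&smem[in[i]], 1u);        // shared-memory atomics (fast on Maxwell)
        __syncthreads();

        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            atomicAdd(&bins[i], smem[i]);       // merge into the global histogram
    }

    int main(void)
    {
        const int n = 1 << 24;
        unsigned char *d_in;
        unsigned int *d_bins;
        cudaMalloc((void **)&d_in, n);
        cudaMemset(d_in, 7, n);                 // trivial input: every value is 7
        cudaMalloc((void **)&d_bins, NUM_BINS * sizeof(unsigned int));
        cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

        // Fewer blocks means more elements per thread (the chart's x-axis).
        histogram256<<<64, 256>>>(d_in, n, d_bins);

        unsigned int h_bins[NUM_BINS];
        cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
        printf("bin[7] = %u (expected %d)\n", h_bins[7], n);

        cudaFree(d_in);
        cudaFree(d_bins);
        return 0;
    }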
TESLA GPU ACCELERATORS 2015-2016*
• KEPLER K80: 2 GPUs, 2.9 TF DP / 8.7 TF SP peak (boost clock), 4.4 TF SGEMM / 1.59 TF DGEMM, 24 GB, ~480 GB/s, 300 W, PCIe passive
• KEPLER K40: 1 GPU, 1.43 TF DP / 4.3 TF SP peak, 3.3 TF SGEMM / 1.22 TF DGEMM, 12 GB, 288 GB/s, 235 W, PCIe active/passive
• MAXWELL M40: 1 GPU, 7 TF SP peak, 12 GB, 288 GB/s, 250 W, PCIe passive
• MAXWELL M60 (GRID enabled): 2 GPUs, 7.4 TF SP peak, ~6 TF SGEMM, 16 GB, 320 GB/s, 300 W, PCIe active/passive
• MAXWELL M4: 1 GPU, 2.2 TF SP peak, 4 GB, 88 GB/s, 50-75 W, PCIe low profile
• MAXWELL M6 (GRID enabled): 1 GPU, TBD TF SP peak, 8 GB, 160 GB/s, 75-100 W, MXM
*For end-customer deployments. (Roadmap status legend: In Definition / POR.)
TESLA PLATFORM PRODUCT STACK
• HPC: Accelerated Computing Toolkit, Tesla K80
• Enterprise virtualization: GRID 2.0, Tesla M60 / M6
• Hyperscale: Hyperscale Suite; Tesla M40 (deep learning training), Tesla M4 (web services)
• System tools & services: Data Center GPU Manager, Mesos, Docker, enterprise services
NODE DESIGN FLEXIBILITY: NVLINK HIGH-SPEED GPU INTERCONNECT
• 2014, Kepler GPU: GPUs attach over PCIe to x86, ARM64 or POWER CPUs
• 2016, Pascal GPU: NVLink between GPUs and to POWER CPUs; PCIe to x86 and ARM64 CPUs
UNIFIED MEMORY: SIMPLER & FASTER WITH NVLINK
• Traditional developer view: separate system memory and GPU memory
• Developer view with Unified Memory: a single unified memory space
• With Pascal & NVLink: share data structures at CPU memory speeds, not PCIe speeds, and oversubscribe GPU memory
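A minimal CUDA sketch of the Unified Memory model described above: one managed allocation is visible to both the CPU and the GPU, so no explicit cudaMemcpy is needed and the driver migrates pages on demand. The kernel, array size and scale factor are illustrative assumptions, not code from the slide.

    // Minimal Unified Memory sketch: the same pointer is used on host and device.
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *x, int n, float a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged((void **)&x, n * sizeof(float));  // one allocation for CPU and GPU

        for (int i = 0; i < n; ++i) x[i] = 1.0f;            // touched directly on the host

        scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);        // touched on the device
        cudaDeviceSynchronize();                            // wait, then read back on the host

        printf("x[0] = %f\n", x[0]);
        cudaFree(x);
        return 0;
    }

On Pascal with NVLink the same code can oversubscribe GPU memory and share data structures at CPU-memory speeds, as the slide notes; the allocation API itself does not change.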
MOVE DATA WHERE IT IS NEEDED FAST: ACCELERATED COMMUNICATION
• GPUDirect P2P: multi-GPU scaling, fast GPU-to-GPU communication, eliminates CPU latency
• GPUDirect RDMA: fast access to other nodes, fast GPU memory access, eliminates the CPU bottleneck
• NVLink: up to 2x application performance, 5x faster than PCIe, fast access to system memory
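A minimal CUDA sketch of GPUDirect P2P between two GPUs in one node: once peer access is enabled, a copy moves directly from GPU 0 to GPU 1 over PCIe or NVLink without staging through system memory. The device IDs, buffer size and missing error checks are simplifications assumed for illustration.

    // Minimal GPUDirect P2P sketch: direct GPU-to-GPU copy, no host bounce buffer.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 address device 1?
        if (!canAccess) {
            printf("P2P not supported between GPUs 0 and 1\n");
            return 0;
        }

        const size_t bytes = (size_t)64 << 20;       // 64 MiB test buffer
        float *buf0, *buf1;

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);            // map device 1 into device 0's address space
        cudaMalloc((void **)&buf0, bytes);

        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        cudaMalloc((void **)&buf1, bytes);

        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);     // direct GPU 0 -> GPU 1 transfer
        cudaDeviceSynchronize();

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
        return 0;
    }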
NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED
• Summit & Sierra (U.S. Dept. of Energy): pre-exascale supercomputers for science
• NOAA: new supercomputer for next-generation weather forecasting
• IBM Watson: breakthrough natural language processing for cognitive computing
U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS POWERED BY THE TESLA PLATFORM
• 100-300 PFLOPS peak
• 10x gain in scientific application performance
• IBM POWER9 CPU + NVIDIA Volta GPU, NVLink high-speed interconnect
• 40 TFLOPS per node, >3,400 nodes, arriving 2017
• A major step forward on the path to exascale
ACCELERATORS SURGE IN WORLD'S TOP SUPERCOMPUTERS
[Chart: number of accelerated systems in the Top500, 2013-2015]
• 100+ accelerated systems now on the Top500 list
• 1/3 of total Top500 FLOPS powered by accelerators
• NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
• Tesla supercomputers growing at 50% CAGR over the past five years
TESLA PLATFORM FOR HPC
TESLA ACCELERATED COMPUTING PLATFORM
Data center infrastructure:
• System solutions: GPU accelerators
• Communication / interconnect solutions: GPUDirect, NVLink, ...
• Infrastructure management / GPU system management: GPU Boost, NVML, ...
Development:
• Programming languages / compiler solutions: LLVM, CUDA, ...
• Development tools / profile and debug: debugging API, ...
• Software solutions / libraries: cuBLAS, ...
"In 2014, NVIDIA enjoyed a dominant market share with 85% of the accelerator market."
"Accelerators will be installed in more than half of new systems."
Source: Top 6 predictions for HPC in 2015
370 GPU-Accelerated Applications: www.nvidia.com/appscatalog
70% OF TOP HPC APPS ACCELERATED
Intersect360 survey of top applications: 90% of the top 10 HPC apps and 70% of the top 50 HPC apps are accelerated.
Top 25 apps in the survey: GROMACS, LAMMPS, SIMULIA Abaqus, NWChem, NAMD, LS-DYNA, AMBER, Schrodinger, ANSYS Mechanical, Exelis IDL, Gaussian, MSC NASTRAN, GAMESS, ANSYS Fluent, ANSYS CFX, WRF, Star-CD, VASP, CCSM, OpenFOAM, COMSOL, CHARMM, Star-CCM+, Quantum Espresso, BLAST
(Survey legend: all popular functions accelerated / some popular functions accelerated / in development / not supported)
Source: Intersect360, "HPC Application Support for GPU Computing", Nov 2015
TESLA FOR HYPERSCALE
http://devblogs.nvidia.com/parallelforall/accelerating-hyperscale-datacenter-applications-tesla-gpus/
Hyperscale Suite: Deep Learning Toolkit, GPU-accelerated FFmpeg, Image Compute Engine, GPU REST Engine, GPU support in Mesos
• Tesla M40 (powerful): fastest deep learning performance
• Tesla M4 (low power): highest hyperscale throughput
TESLA PLATFORM FOR DEVELOPERS
TESLA FOR SIMULATION
Libraries | Directives | Languages
Accelerated Computing Toolkit | Tesla Accelerated Computing
DROP-IN ACCELERATION WITH GPU LIBRARIES
Domains: BLAS, LAPACK, sparse, FFT, math, deep learning, image processing
Libraries: cuBLAS, cuSPARSE, cuFFT, cuRAND, NPP, AmgX, math, ...
• 5x-10x speedups out of the box
• Automatically scale with multi-GPU libraries (cuBLAS-XT, cuFFT-XT, AmgX, ...)
• 75% of developers use GPU libraries to accelerate their applications
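As a concrete example of the drop-in approach, the sketch below replaces a CPU SGEMM with a single cuBLAS call. The matrix size and initialization are illustrative assumptions, and error checking is omitted for brevity.

    // Minimal cuBLAS sketch: C = A * B computed on the GPU with one library call.
    // Build (assumed): nvcc sgemm_example.cu -lcublas
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 1024;
        const size_t bytes = (size_t)n * n * sizeof(float);
        float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
        for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 1.0f; hC[i] = 0.0f; }

        float *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);    // host -> device
        cublasSetMatrix(n, n, sizeof(float), hB, n, dB, n);

        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);       // GPU SGEMM
        cublasGetMatrix(n, n, sizeof(float), dC, n, hC, n);    // device -> host

        printf("C[0] = %f (expected %d)\n", hC[0], n);
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }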
"DROP-IN" ACCELERATION: NVBLAS
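NVBLAS takes the drop-in idea further: preloading libnvblas.so routes standard Level-3 BLAS calls from an otherwise unmodified host binary to the GPU, with a CPU BLAS configured as fallback in nvblas.conf. The sketch below is plain CPU-style BLAS code that such a preload would accelerate; the hand-declared dgemm_ prototype, the matrix size and the exact run command are assumptions for illustration.

    /* Plain host BLAS code; no CUDA calls anywhere. Run (assumed):
     *   LD_PRELOAD=libnvblas.so ./dgemm_example
     * so that NVBLAS intercepts the Level-3 call and executes it on the GPU. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Fortran-style BLAS symbol, declared by hand for this sketch. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    int main(void)
    {
        int n = 2048;
        double alpha = 1.0, beta = 0.0;
        double *a = (double *)malloc(sizeof(double) * n * n);
        double *b = (double *)malloc(sizeof(double) * n * n);
        double *c = (double *)malloc(sizeof(double) * n * n);
        for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 1.0; c[i] = 0.0; }

        /* A completely standard BLAS call: NVBLAS can route it to the GPU. */
        dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);

        printf("c[0] = %f (expected %d)\n", c[0], n);
        free(a); free(b); free(c);
        return 0;
    }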
OPENACC: SIMPLE | POWERFUL | PORTABLE

    main()
    {
        <serial code>
        #pragma acc kernels  // automatically runs on GPU
        {
            <parallel code>
        }
    }

• University of Illinois, PowerGrid (MRI reconstruction): 70x speed-up, 2 days of effort
• RIKEN (Japan), NICAM (climate modeling): 7-8x speed-up, 5% of code modified
• 8000+ developers using OpenACC, fueling the next wave of scientific discoveries in HPC

http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
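For reference, here is a concrete, minimal version of the pattern in the snippet above: a SAXPY-style loop offloaded with one OpenACC directive. It is not the PowerGrid or NICAM code, and the compiler invocation in the comment is an assumption.

    /* Minimal OpenACC sketch: one directive turns a serial loop into a GPU kernel.
     * Build (assumed): pgcc -acc saxpy.c   (or any OpenACC-capable compiler) */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 1 << 20;
        float *x = (float *)malloc(n * sizeof(float));
        float *y = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        /* The compiler generates a GPU kernel for this loop and handles the
         * data movement for x and y; the CPU code path is otherwise unchanged. */
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = 2.0f * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);
        free(x);
        free(y);
        return 0;
    }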
LS-DALTON: MINIMAL EFFORT, BIG PERFORMANCE
LS-DALTON is a large-scale application for calculating high-accuracy molecular energies.
• Lines of code modified: <100
• Time required: 1 week
• Codes to maintain: 1 source
[Chart: speedup vs CPU of the LS-DALTON CCSD(T) module benchmarked on the Titan supercomputer (AMD CPU vs Tesla K20X) for Alanine-1 (13 atoms), Alanine-2 (23 atoms) and Alanine-3 (33 atoms)]
"OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation."
Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University
OPENACC DELIVERS TRUE PERFORMANCE PORTABILITY
Paving the path forward: single code for all HPC processors.
[Chart: application speedup vs a single CPU core for 359.miniGhost (Mantevo), NEMO (climate & ocean) and CloverLeaf (physics), comparing CPU-only MPI + OpenMP, CPU-only MPI + OpenACC, and CPU + GPU MPI + OpenACC; the CPU-only OpenACC bars track the OpenMP bars closely, and adding the GPU lifts the best case to 30.3x]
Benchmark systems: 359.miniGhost: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80, single GPU. NEMO: Intel Xeon E5-2698 v3 per socket, 16 cores; GPU: Tesla K80, both GPUs. CloverLeaf: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs.