Lattice QCD, Programming Models and Porting LQCD Codes to Exascale
Bálint Joó, Jefferson Lab
HPC Roundtable, Feb 19, 2020
Thomas Jefferson National Accelerator Facility
LQCD as an application
• Replace spacetime with a 4-dimensional lattice of spacing a
• Quark fields live on the lattice sites: spinors (either complex 3-vectors, or 4x3 "vectors")
• Strong-force gauge fields live on the links: 3x3 complex matrices
• Interactions are typically local
  - closed loops (3x3-matrix x 3x3-matrix products)
  - covariant stencils (3x3-matrix x 3-vector products)
• Also lattice-wide summations: global sums, inner products, etc.
• Extremely well suited to data-parallel approaches
  - complex numbers and factors of 3 are often unfriendly to automatic vectorization; we usually need to build that in
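To make the data-parallel structure concrete, here is a minimal sketch (not taken from any of the production codes; the type and function names such as SU3, Color and stencil_term are made up for illustration) of one covariant-stencil term: multiplying the 3x3 complex link matrix on each site by the complex 3-vector on a neighbouring site.

#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<float>;
struct SU3   { cplx m[3][3]; };   // gauge link: 3x3 complex matrix
struct Color { cplx c[3];    };   // one colour component of a quark field: complex 3-vector

// y[i] += U[i] * x[neighbour[i]] over all lattice sites. Every site is
// independent, so the outer loop is trivially data parallel.
void stencil_term(std::vector<Color>& y,
                  const std::vector<SU3>& U,
                  const std::vector<Color>& x,
                  const std::vector<std::size_t>& neighbour)
{
    for (std::size_t i = 0; i < y.size(); ++i) {          // parallelize over sites
        const Color& xn = x[neighbour[i]];
        for (int a = 0; a < 3; ++a)
            for (int b = 0; b < 3; ++b)
                y[i].c[a] += U[i].m[a][b] * xn.c[b];      // 3x3 matrix times 3-vector
    }
}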
Typical LQCD Workflow
(Figure: example correlation-function fits, D. J. Wilson et al., PRD 91, 054008 (2015))
• Configuration Generation
  - Hybrid Molecular Dynamics Monte Carlo
  - Linear solves for fermion forces
  - Data-parallel code for the non-solver parts
  - Strong-scaling limited
  - 'Large', long-running jobs
• Propagators, eigenvectors, etc.
  - Linear solves for quark propagators on sources, e.g. O(1M) solves/config for spectroscopy
  - Solver: same matrix, many right-hand sides
  - Throughput limited
  - Ensemble: many small jobs
• Graph Contractions (graph nodes & edges)
  - O(10K)-O(100K) diagrams; sub-diagram reuse challenge
  - Main operation is batched ZGEMM
  - Potential large-scale I/O challenge
  - Ensemble: many single-node jobs
• Correlation Functions, Fitting and Analysis
  - Workstations
General Software Organization
• Level structure worked out over the last 4 iterations of the SciDAC program
• Layered stack (top to bottom):
  - Apps: Chroma, CPS, MILC
  - Libraries: QUDA, QPhiX, MGProto
  - Data Parallel: QDP++ / QDP-JIT / QDP-C
  - Comms: QMP over MPI/other comms
• The Data Parallel layer (QDP) sits over a communications abstraction layer and presents the programmer with a 'virtual grid machine'
• Applications are written on top of the Data Parallel layer, calling out to highly optimized libraries as needed
• Grid is a newer code that also provides a data parallel layer, with similar layering internally (but not broken out into separate packages)
• Key goals for the Exascale port: port the Data Parallel layer, port the libraries, and aim for performance portability
Exascale & Pre-Exascale Systems
• Perlmutter (formerly NERSC-9): AMD CPUs + NVIDIA next-generation GPUs, Slingshot fabric from Cray
• Aurora: Intel Xeon CPUs + Intel Xe accelerators, Slingshot fabric from Cray
• Frontier: AMD CPUs + AMD Radeon GPUs, Slingshot fabric from Cray
• MPI + X programming model
• The horsepower in all of these systems will come from accelerators
• But the accelerators differ between the three systems
Node Programming Model Options
(Slide shows a support matrix of programming models - OpenMP Offload, Kokkos/Raja, DPC++/SYCL, HIP, C++ pSTL, CUDA, Fortran - against targets: NVIDIA GPU, AMD GPU, Intel Xe, CPUs, FPGAs. Legend: supported via product; supported via 3rd party extension; can be made to work or hack; in development or aspirational; not supported.)
Comments per model:
• OpenMP Offload: compilers maturing, some C++ issues
• Kokkos/Raja: DPC++ and HIP back ends in development
• DPC++/SYCL: NVIDIA via POCL or the Codeplay back end, AMD via hipSYCL for now; well supported for Intel GPUs
• HIP: Fortran via cross-calling; well supported for AMD GPUs
• C++ pSTL: the way of the future? Parallelism in the base language; tech previews just now
• CUDA: Fortran via PGI CUDA Fortran; well supported for NVIDIA GPUs
Disclaimer: this is my current view; products and support levels can change, and this picture may become out of date very soon.
OpenMP Offload
• Offloaded axpy in OpenMP:

#pragma omp target teams distribute parallel for simd map(to:a,x[:N],y[:N]) map(from:z[:N])
for(int i=0; i < N; i++) // N is large
{
  z[i] = a*x[i] + y[i];
}

• The combined directive composes:
  - omp target: target the accelerator
  - omp teams: create a league of teams
  - omp distribute: distribute the work amongst the teams
  - omp parallel for simd: perform a SIMD-ized parallel for
  - map clauses: map a, x and y to the accelerator and map the resulting z back out (data movement)
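For completeness, a self-contained version of the same loop might look like the sketch below (the function name offload_axpy is an illustrative assumption; the compiler flags needed to enable offload vary by vendor).

#include <cstddef>

// Minimal sketch: the offloaded axpy above wrapped in a callable function.
// Build with an OpenMP-offload-capable compiler and the vendor's target flags.
void offload_axpy(float* z, const float* x, const float* y, float a, std::size_t N)
{
    // Map a, x, y to the device; map the result z back to the host.
    #pragma omp target teams distribute parallel for simd \
        map(to: a, x[0:N], y[0:N]) map(from: z[0:N])
    for (std::size_t i = 0; i < N; ++i) {
        z[i] = a * x[i] + y[i];
    }
}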
HIP
• HIP is AMD's "C++ Heterogeneous-Compute Interface for Portability"
• Take your CUDA API code and replace 'cuda' with 'hip':
  - cudaMemcpy() -> hipMemcpy()
  - kernel<<<blocks,threads>>>(…) -> hipLaunchKernelGGL(kernel,…)
  - and other slight changes
  - the hipify tool can do a first pass of the conversion automatically
• Open source
• Portability between NVIDIA and AMD GPUs only
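As a rough illustration of what converted code looks like (a hedged sketch, not from the slides; names such as axpy_kernel and axpy_on_device are made up), the axpy example written against the HIP API might be:

#include <hip/hip_runtime.h>

__global__ void axpy_kernel(float* z, const float* x, const float* y,
                            float a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) z[i] = a * x[i] + y[i];
}

void axpy_on_device(float* h_z, const float* h_x, const float* h_y,
                    float a, int N)
{
    float *d_x, *d_y, *d_z;
    size_t bytes = N * sizeof(float);
    hipMalloc(&d_x, bytes);  hipMalloc(&d_y, bytes);  hipMalloc(&d_z, bytes);

    hipMemcpy(d_x, h_x, bytes, hipMemcpyHostToDevice);   // cudaMemcpy -> hipMemcpy
    hipMemcpy(d_y, h_y, bytes, hipMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    // kernel<<<blocks,threads>>>(...) -> hipLaunchKernelGGL(kernel, grid, block, shmem, stream, ...)
    hipLaunchKernelGGL(axpy_kernel, dim3(blocks), dim3(threads), 0, 0,
                       d_z, d_x, d_y, a, N);

    hipMemcpy(h_z, d_z, bytes, hipMemcpyDeviceToHost);
    hipFree(d_x);  hipFree(d_y);  hipFree(d_z);
}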
Kokkos

Kokkos::View<float[N], Kokkos::LayoutLeft, Kokkos::CudaSpace> x("x"); // N is large
Kokkos::View<float[N], Kokkos::LayoutLeft, Kokkos::CudaSpace> y("y");
Kokkos::View<float[N], Kokkos::LayoutLeft, Kokkos::CudaSpace> z("z");

float a = 0.5;
Kokkos::parallel_for("zaxpy", N, KOKKOS_LAMBDA (const int& i) {
  z(i) = a*x(i) + y(i); // View provides indexing operator()
});

• View: multi-dimensional array; index order specified by the Layout, location by the MemorySpace policy. The Layout allows appropriate memory access patterns for CPU/GPU
• parallel_for dispatches a C++ lambda
• Kokkos developers sit on the C++ standards committee and work to fold these features into standard C++
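For orientation, a minimal sketch of how such a kernel is typically embedded in a program (assumptions not on the slide: runtime-sized View<float*> in the default memory space, and an arbitrary problem size N):

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[])
{
    Kokkos::initialize(argc, argv);
    {
        const int N = 1 << 20;                                  // illustrative size
        // Runtime-sized Views in the default memory space of the enabled back end.
        Kokkos::View<float*> x("x", N), y("y", N), z("z", N);
        const float a = 0.5f;

        Kokkos::parallel_for("zaxpy", N, KOKKOS_LAMBDA(const int i) {
            z(i) = a * x(i) + y(i);
        });
        Kokkos::fence();   // wait for the kernel to complete
    }                      // Views must be destroyed before finalize()
    Kokkos::finalize();
    return 0;
}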
Portability via Kokkos
• Kokkos provides portability via back ends, e.g. OpenMP, CUDA, …
  - OpenMP and CUDA back ends: stable and production ready
  - OpenMP target, HIP and SYCL/DPC++ back ends: in development
• Most abstractions are provided in a C++ header library: parallel_for, reductions, scans
• Kokkos provides the Kokkos View data type
  - the user can customize the index order
  - explicit memory movement only
  - the memory space is selected via policy
• Execution is bound to an Execution Space - the back end is selected via policy
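A sketch of what "selecting the space via policy" looks like in practice (the aliases ExecSpace, MemSpace and Vector are illustrative, not Kokkos names): instead of hard-coding CudaSpace as on the previous slide, the default spaces of whichever back end is enabled at compile time can be used.

#include <Kokkos_Core.hpp>

// Resolve the execution and memory spaces at compile time from the enabled
// back end: device memory for a CUDA build, host memory for an OpenMP build.
using ExecSpace = Kokkos::DefaultExecutionSpace;
using MemSpace  = ExecSpace::memory_space;
using Vector    = Kokkos::View<float*, Kokkos::LayoutLeft, MemSpace>;

void zaxpy(Vector z, Vector x, Vector y, float a)
{
    const int N = static_cast<int>(z.extent(0));
    Kokkos::parallel_for("zaxpy",
        Kokkos::RangePolicy<ExecSpace>(0, N),                   // bind to the execution space
        KOKKOS_LAMBDA(const int i) { z(i) = a * x(i) + y(i); });
}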
SYCL
• The SYCL runtime manages data in buffers
• Buffer data is accessed only via accessors, in command-group (cgh) scope or through host accessors
• The runtime can track accessor use and build a data dependency graph to automate data movement
• Kernels must have a unique name in C++ (class zaxpy below)
• Open question: what does this mean for non-SYCL libraries that work with raw pointers (e.g. MPI)?

sycl::queue myQueue;
sycl::buffer<float,1> x_buf(sycl::range<1>(LARGE_N));
sycl::buffer<float,1> y_buf(sycl::range<1>(LARGE_N));
sycl::buffer<float,1> z_buf(sycl::range<1>(LARGE_N));

// … fill buffers somehow …

float a = 0.5;
{
  myQueue.submit([&](sycl::handler& cgh) {
    auto x = x_buf.get_access<sycl::access::mode::read>(cgh);
    auto y = y_buf.get_access<sycl::access::mode::read>(cgh);
    auto z = z_buf.get_access<sycl::access::mode::write>(cgh);
    cgh.parallel_for<class zaxpy>(sycl::range<1>(LARGE_N), [=](sycl::id<1> id) {
      auto i = id[0];
      z[i] = a*x[i] + y[i];
    });
  });
}
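One step not shown on the slide is reading the result back on the host; with buffers this is typically done through a host accessor, which the runtime also uses to synchronize. A sketch in the same SYCL 1.2.1 style as the code above (the exact spelling differs slightly in SYCL 2020):

// After the submit above: creating a host accessor blocks until the kernel
// that writes z_buf has completed and the data is available on the host.
auto z_host = z_buf.get_access<sycl::access::mode::read>();
float z0 = z_host[0];   // use the result on the host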
Intel OneAPI DPC++ extensions
• The USM (Unified Shared Memory) extension gives host/device pointers and allows management of arrays via pointers (more CUDA-like)
• Memcpy operations move data between host and device (not shown here; see the sketch below)
• Reductions!
• The unnamed-lambda extension obviates the need for a class name in parallel_for
• Libraries (e.g. MPI) can do intelligent things with USM pointers (e.g. direct device access)
• The subgroup extension allows more explicit SIMD-ization

sycl::queue myQueue;
sycl::device dev = myQueue.get_device();
sycl::context con = myQueue.get_context();

float* x = sycl::malloc_device<float>(LARGE_N, dev, con);
float* y = sycl::malloc_device<float>(LARGE_N, dev, con);
float* z = sycl::malloc_device<float>(LARGE_N, dev, con);

// … fill arrays somehow …

float a = 0.5;
{
  myQueue.submit([&](sycl::handler& cgh) {
    // unnamed lambda extension: no kernel name template argument needed
    cgh.parallel_for(sycl::range<1>(LARGE_N), [=](sycl::id<1> id) {
      auto i = id[0];
      z[i] = a*x[i] + y[i];
    });
  });
}

// free the pointers when done
sycl::free(x, con); sycl::free(y, con); sycl::free(z, con);
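The memcpy operations mentioned (but not shown) above look roughly like this (a sketch; the host arrays h_x, h_y, h_z are assumptions not on the slide):

// Copy inputs host -> device. The waits keep the sketch simple; with an
// in-order queue or explicit event dependencies they can be relaxed.
myQueue.memcpy(x, h_x, LARGE_N * sizeof(float)).wait();
myQueue.memcpy(y, h_y, LARGE_N * sizeof(float)).wait();

// … submit the parallel_for as above …

// Copy the result device -> host and wait for completion.
myQueue.memcpy(h_z, z, LARGE_N * sizeof(float)).wait();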