On the Importance of Faster Atomics (S.D. Hammond, C.R. Trott and H.C. Edwards)

Unclassified Unlimited Release (SAND2017-9008C)
On the Importance of Faster Atomics
S.D. Hammond, C.R. Trott and H.C. Edwards, Center for Scientific Computing, Sandia National Laboratories/NM
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Outline
§ Motivations and Background
§ Exposing Atomic Operations in Kokkos
§ Performance
§ Conclusions
More Information: http://github.com/kokkos

Motivations
§ Sandia is heavily focused on making sure that our production application codes will run well on current and future NNSA Advanced Technology Systems (ATS)
  § ATS-1 – Trinity (~9,500 dual-socket Haswell, ~9,500 single-socket KNL)
  § ATS-2 – Sierra (~4,000 POWER9/Volta (2018))
  § ATS-3 – Crossroads ? (2020)
§ For all of these platforms we need to have performance portable algorithms and source code
  § Kokkos for C++ applications
  § OpenMP for Fortran

Motivations
§ Enabling performance portable, on-node parallel algorithms can be extremely challenging:
  § Correctness (developer dependent, some tools to help)
  § Portability (Kokkos helps, but developer work still required)
  § Performance (heavily developer dependent)
§ In order to meet our objective of having applications running on these machines as quickly as possible:
  § Need to keep changes to code to a relative minimum
  § Keep initial algorithms similar to prevent significant re-development/re-coding efforts

Atomic Operations
§ Atomic operations are in many ways an application enabler:
  § Keep roughly serial algorithms but provide atomic updates to (limited) regions of memory which threads may share
  § Keep code changes to a relative minimum
  § Isolate expensive memory updates to where they need to be
§ Disadvantages in applications:
  § Floating point rounding differences (floating point ops are not associative)
  § Variation in runtimes if contention rates/effects change between runs
  § Can be expensive
§ Required for lock-free shared data structures
  § Queues, hash-maps, ...
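The "keep the serial loop, make only the shared update atomic" idea can be sketched with standard C++11 threads. This is a minimal illustration, not code from the slides: the function name `atomic_histogram`, the strided work split and the use of `std::thread` are all assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the slides): count how often each bin index
// occurs in `keys`, with all threads updating one shared array of bins.
// Every key is assumed to lie in [0, num_bins).
std::vector<long> atomic_histogram(const std::vector<int>& keys,
                                   std::size_t num_bins,
                                   unsigned num_threads) {
  assert(num_threads > 0);
  // std::atomic is neither copyable nor movable, so the vector is sized once.
  std::vector<std::atomic<long>> bins(num_bins);
  for (auto& b : bins) b.store(0, std::memory_order_relaxed);

  auto worker = [&](unsigned tid) {
    // Same loop body as the serial code; only the update itself is atomic.
    for (std::size_t i = tid; i < keys.size(); i += num_threads) {
      bins[static_cast<std::size_t>(keys[i])].fetch_add(
          1, std::memory_order_relaxed);
    }
  };

  std::vector<std::thread> pool;
  for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker, t);
  for (auto& th : pool) th.join();

  // Copy the counts out into a plain vector for the caller.
  std::vector<long> result(num_bins);
  for (std::size_t b = 0; b < num_bins; ++b) result[b] = bins[b].load();
  return result;
}
```

Integer counts come out identical on every run. With floating point values the order in which threads reach a bin changes the rounding, which is the reproducibility concern listed above.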

Alternatives to use of Atomic Operations
§ Requires new algorithms (e.g. coloring/data replication) to be implemented:
  § Expensive in application developer time
  § Don’t always have enough parallelism to support coloring schemes
  § Significant code churn
  § Consumes a vast amount of memory if the thread count is high (data replication)
§ Advantages of alternatives are:
  § Potentially higher performance (if we have enough parallelism)
  § Less performance variation between runs because there are very few shared resources
  § Strong reproducibility of results
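For comparison, the data replication alternative might look like the sketch below. It is an assumption-laden illustration rather than anything shown in the slides: `replicated_histogram` is a made-up name, and the per-thread copies are merged serially at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the slides): each thread fills a private copy
// of the bins, and the copies are summed afterwards. No atomics are needed.
// Every key is assumed to lie in [0, num_bins).
std::vector<long> replicated_histogram(const std::vector<int>& keys,
                                       std::size_t num_bins,
                                       unsigned num_threads) {
  assert(num_threads > 0);
  // Memory cost grows as num_threads * num_bins.
  std::vector<std::vector<long>> local(num_threads,
                                       std::vector<long>(num_bins, 0));

  auto worker = [&](unsigned tid) {
    std::vector<long>& mine = local[tid];
    for (std::size_t i = tid; i < keys.size(); i += num_threads) {
      ++mine[static_cast<std::size_t>(keys[i])];  // plain, non-atomic update
    }
  };

  std::vector<std::thread> pool;
  for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker, t);
  for (auto& th : pool) th.join();

  // Merge in a fixed thread order, so the summation order is the same on
  // every run with the same thread count.
  std::vector<long> result(num_bins, 0);
  for (unsigned t = 0; t < num_threads; ++t) {
    for (std::size_t b = 0; b < num_bins; ++b) result[b] += local[t][b];
  }
  return result;
}
```

The fixed merge order is what gives the reproducibility advantage, and the `num_threads * num_bins` allocation is the memory cost noted for high thread counts.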

Exposing Atomics in C++
§ C++11 introduced atomic memory updates into the standard
§ But ... std::atomic is fairly clunky, requires specific allocations, etc.

    std::atomic<int> data;
    void updateMe() { data.fetch_add(1, std::memory_order_relaxed); }

§ We really want something simpler and easier to use
§ A fix has been proposed for C++20
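One concrete case of this clunkiness: before C++20, `std::atomic<double>` has no `fetch_add` member, so accumulating doubles needs a compare-exchange loop. The sketch below assumes only the C++11 standard library; `fetch_add_double` and `parallel_sum_of_ones` are hypothetical names introduced for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: atomically add `value` to `target` and return the
// value that was stored before the addition.
double fetch_add_double(std::atomic<double>& target, double value) {
  double expected = target.load(std::memory_order_relaxed);
  // On failure compare_exchange_weak writes the current value into
  // `expected`, so the sum is recomputed on every retry.
  while (!target.compare_exchange_weak(expected, expected + value,
                                       std::memory_order_relaxed)) {
  }
  return expected;
}

// Hypothetical driver: every thread adds 1.0 to a shared total.
double parallel_sum_of_ones(unsigned num_threads, int adds_per_thread) {
  assert(num_threads > 0);
  std::atomic<double> total(0.0);
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < num_threads; ++t) {
    pool.emplace_back([&total, adds_per_thread] {
      for (int i = 0; i < adds_per_thread; ++i) fetch_add_double(total, 1.0);
    });
  }
  for (auto& th : pool) th.join();
  return total.load();
}
```

Note that the accumulator itself has to be declared as `std::atomic<double>`, so the atomic type spreads into every data structure that holds such a value.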

Exposing Atomics in Kokkos
§ Don’t require “atomic” types (operate over any type, including non-POD)
§ Implement a lightweight locking system based on pointer address for types not supported by hardware atomics/CAS

    int data;
    void updateMe() { Kokkos::atomic_fetch_add(&data, 1); }

§ Much simpler to use, can atomically update any value and does not propagate through the type system
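A locking scheme keyed on the pointer address could be sketched as follows. This is not the Kokkos implementation: the `sketch` namespace, the table size, the hash and the use of `std::mutex` are all assumptions chosen to keep the example short.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

namespace sketch {

constexpr std::size_t kNumLocks = 256;  // assumed table size (power of two)

// Map an address to one lock in a fixed, shared table.
inline std::mutex& lock_for(const void* address) {
  static std::array<std::mutex, kNumLocks> table;
  const auto bits = reinterpret_cast<std::uintptr_t>(address);
  // Drop low bits (usually zero because of alignment) before masking.
  return table[(bits >> 4) & (kNumLocks - 1)];
}

// Works for any type with operator+, with no atomic wrapper type required.
// Returns the value stored before the update.
template <typename T>
T locked_fetch_add(T* dest, const T& value) {
  assert(dest != nullptr);
  std::lock_guard<std::mutex> guard(lock_for(dest));
  T previous = *dest;
  *dest = previous + value;
  return previous;
}

// Example of a type with no hardware atomic support.
struct Vec3 { double x, y, z; };
inline Vec3 operator+(const Vec3& a, const Vec3& b) {
  return Vec3{a.x + b.x, a.y + b.y, a.z + b.z};
}

// Hypothetical driver: all threads accumulate into one shared Vec3.
inline Vec3 accumulate_in_parallel(unsigned num_threads, int adds_per_thread) {
  Vec3 total{0.0, 0.0, 0.0};
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < num_threads; ++t) {
    pool.emplace_back([&total, adds_per_thread] {
      for (int i = 0; i < adds_per_thread; ++i) {
        locked_fetch_add(&total, Vec3{1.0, 2.0, 3.0});
      }
    });
  }
  for (auto& th : pool) th.join();
  return total;
}

}  // namespace sketch
```

Two unrelated addresses can hash to the same lock, which costs some concurrency but not correctness, provided every concurrent access to the object goes through the same locked path.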

Performance of Atomic Operations
§ We have developed three rough “categories” of atomic-issue rate and contention levels from some of our initial application ports:
  § Histogram (count values in a bin in parallel and update, integers)
  § MD (LAMMPS-like use of atomic updates to reduce duplicate work, double)
  § Matrix Assembly (accumulate values into a matrix from an unstructured mesh, double)
§ Run on our current systems:
  § GigaUpdates per second
  § Ratio of using atomics to standard memory operations (i.e. atomic overhead)
  § Run in the “best configuration” (fastest use of OpenMP/processes, single socket for CPU systems)
  § Ratio to non-atomic is performance against not using atomics (incorrect answers)

Performance of Atomic Operations
[Figure: two bar charts covering Histogram, Histo-Padded, MD and Assembly on P100, K80, KNL (HBM), KNL (DDR), Haswell and POWER8. Left panel: Atomics Performance, GigaUpdates/Sec (log10 scale). Right panel: Ratio to Non-Atomic Updates.]
Note – Histogram has a higher contention rate
Histo-Padded provides padding for cache lines to prevent conflicts (uses more memory)
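The Histo-Padded idea can be sketched by giving each bin its own cache-line-sized slot. The 64-byte line size, the `PaddedBin` layout and the function name `padded_histogram` are assumptions for illustration; the benchmark code behind the charts is not shown in the slides.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLineBytes = 64;  // assumed; differs on some CPUs

// One counter per 64-byte slot, so neighbouring bins never share a line.
struct PaddedBin {
  std::atomic<long> count{0};
  char pad[kCacheLineBytes - sizeof(std::atomic<long>)];
};
static_assert(sizeof(PaddedBin) == kCacheLineBytes,
              "each bin should occupy exactly one assumed cache line");

// Hypothetical helper: same histogram as before, but with padded bins.
// Every key is assumed to lie in [0, num_bins).
std::vector<long> padded_histogram(const std::vector<int>& keys,
                                   std::size_t num_bins,
                                   unsigned num_threads) {
  assert(num_threads > 0);
  std::vector<PaddedBin> bins(num_bins);  // counts start at zero

  auto worker = [&](unsigned tid) {
    for (std::size_t i = tid; i < keys.size(); i += num_threads) {
      bins[static_cast<std::size_t>(keys[i])].count.fetch_add(
          1, std::memory_order_relaxed);
    }
  };

  std::vector<std::thread> pool;
  for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker, t);
  for (auto& th : pool) th.join();

  std::vector<long> result(num_bins);
  for (std::size_t b = 0; b < num_bins; ++b) result[b] = bins[b].count.load();
  return result;
}
```

Padding removes false sharing between different bins but does nothing for threads that hit the same bin, and the bin array grows to `kCacheLineBytes` per counter, which is the extra memory use noted above.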

Discussion
§ Atomics are clearly very fast on the latest generation of NVIDIA Pascal (P100) GPUs due to hardware enablement at the cache (“fire and forget”)
§ CPUs have historically struggled with fast atomic updates because they add a significant number of additional operations into the instruction stream
  § and ... cache line sharing, inability of the compiler to easily optimize around them
§ Faster atomics on these platforms and easier ways to program atomics would make algorithm development for next-generation platforms easier, reduce programmer burden and improve compiler information for analysis

Discussion
§ Most algorithms have relatively low (but non-zero) contention rates
§ Atomics are really used to enable correctness for the very limited cases where there is a shared data conflict
§ But ... the overhead is high for the operations where no contention occurs

Conclusions and Position
§ Atomic memory operations are potentially a lightweight programming choice to introduce thread safety and parallelism to existing code
  § Use atomics to update memory locations you know may have conflicts
§ C++11 introduced atomics to the language standard but the method of use is less than ideal for minimizing code changes
  § Fix has been proposed for C++20
  § Kokkos provides a lightweight, use-anywhere implementation for C++ codes
§ Need better hardware support to reduce the overheads in our applications
