Half Precision Benchmarking for HPC (S7676)
Piotr Luszczek
May 11, 2017, GPU Technology Conference, San Jose, CA, USA
Major Floating-Point Formats from IEEE 754 (2008)

Precision  | Width (bits) | Exponent bits | Mantissa bits | Epsilon   | Max
Quadruple  | 128          | 15            | 112           | O(10^-34) | 1.2x10^4932
Extended   | 80           | 15            | 64            | O(10^-19) |
Double     | 64           | 11            | 52            | O(10^-16) | 1.8x10^308
Single     | 32           | 8             | 23            | O(10^-7)  | 3.4x10^38
Half*      | 16           | 5             | 10            | O(10^-3)  | 65504

*Only the storage format is specified.
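The epsilon and maximum values for the three formats that NumPy exposes natively (binary16/32/64) can be checked directly; a small sketch added here for reference, not part of the original slides:

```python
import numpy as np

# Print width, machine epsilon, and largest finite value for each format.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{info.dtype}: {info.bits} bits, eps = {info.eps:.3e}, max = {info.max:.3e}")

# float16: 16 bits, eps = 9.766e-04, max = 6.550e+04
# float32: 32 bits, eps = 1.192e-07, max = 3.403e+38
# float64: 64 bits, eps = 2.220e-16, max = 1.798e+308
```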
Programming Data Types: C/C++ and Fortran
● real*16 / long double: 128 bits / 80 or 128 bits
● real*8 / double: 64 bits / 64 bits
● real*4 / float: 32 bits / 32 bits
● real*2 / __half (short float): 16 bits / 16 bits
● __half2 (cuda_fp16.h): 2x16 bits
FP16 Hardware (Current and Future)
● AMD
  – MI5, MI8, MI25
● ARM
  – NEON VFP FP16 in V8.2-A
● Intel
  – Xeon CPUs (vectorized conversions)
● NVIDIA
  – Pascal: P100, TX1, TX2, ...
  – Volta: Tensor Core
● Supercomputers
  – TSUBAME 3.0, Tokyo Tech
  – ...
● Cloud
  – Google with P100 (coming soon)
  – Azure: Pascal debut in 2017
Applications Using FP16
● Machine Learning
  – Deep Neural Networks
  – Visualization and image processing (OpenVZ)
● Linear Algebra
  – Eigen
  – University of Tennessee libraries and projects
● Molecular dynamics
  – Gromacs
Iterative Refinement
● In exact arithmetic:
  – x_1 ← x_0 + A^-1 (b − A·x_0)
● In finite precision, A^-1 is not available exactly, due to:
  – Round-off error
  – Lower-precision LU factors
● In practice, Richardson iteration is often used:
  – x_{k+1} ← x_k + A^-1 (b − A·x_k)
  – Convergence depends on the spectrum
  – Textbook result w.r.t. I − A^-1·A
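A minimal NumPy sketch of the refinement loop above; `richardson_refine` and `solve_approx` are illustrative names (not from the talk), where `solve_approx` stands in for applying the approximate inverse, for example a solve that uses lower-precision LU factors:

```python
import numpy as np

def richardson_refine(A, b, solve_approx, iters=30):
    """Iterative refinement x_{k+1} <- x_k + A~^{-1} (b - A x_k),
    where solve_approx(r) applies the approximate inverse A~^{-1} to r."""
    x = solve_approx(b)          # initial guess x_0 from the approximate solver
    for _ in range(iters):
        r = b - A @ x            # residual computed with the exact A
        x = x + solve_approx(r)  # correction from the approximate inverse
    return x
```

Convergence hinges on the spectral radius of I − A~^-1·A being below one, which is the textbook condition this slide refers to.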
Classic Iterative Refinement Implementation
● A linear system Ax = b may be solved through LU factorization:
  – L, U, P ← lu_factor(A)
  – y ← L \ P·b
  – x ← U \ y
  – r ← b − A·x   (use higher precision to accumulate)
  – z ← U \ (L \ P·r)
  – x_final ← x + z   (use higher precision)
● All operations are performed in the same floating-point precision
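The steps above map directly onto SciPy's LU routines; a sketch assuming double precision as the working precision and np.longdouble (extended precision where the platform provides it) for the residual accumulation, which is this slide's "higher precision" step:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def classic_refinement(A, b, iters=3):
    """Classic iterative refinement: factor once, then refine the solution."""
    lu, piv = lu_factor(A)                    # L, U, P <- lu_factor(A)
    x = lu_solve((lu, piv), b)                # x <- U \ (L \ P b)
    for _ in range(iters):
        # r <- b - A x, accumulated in higher precision where available
        r = (b.astype(np.longdouble) - A.astype(np.longdouble) @ x).astype(A.dtype)
        z = lu_solve((lu, piv), r)            # z <- U \ (L \ P r)
        x = x + z                             # x_final <- x + z
    return x
```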
Mixed-Precision Iterative Refinement
● A linear system posed in 64-bit precision, Ax = b, may be solved through LU factorization in 16-bit precision:
  – L, U, P ← lu(A)        O(n^3)  (16 bits)
  – y ← L \ P·b            O(n^2)  (16 bits)
  – x ← U \ y              O(n^2)  (16 bits)
  – r ← b − A·x            O(n^2)  (64 bits)
  – z ← U \ (L \ P·r)      O(n^2)  (16 bits)
  – x_final ← x + z        O(n)    (64 bits)
● Requirement: the matrix A must be well conditioned in 16 bits:
  – κ(A) < 10^5
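NumPy and SciPy cannot factorize in float16 (as slide 11 notes), so the sketch below only simulates the scheme: the matrix and its LU factors are rounded through float16 storage, the solves run in float32, and the residual and update are kept in float64. The function name, tolerances, and the diagonally dominant test matrix are all assumptions for illustration:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_refinement(A, b, iters=30, tol=1e-12):
    """Simulated FP16/FP64 refinement: factors rounded through float16,
    solves run in float32, residual and update kept in float64."""
    A16 = A.astype(np.float16).astype(np.float32)      # round A to FP16 values
    lu, piv = lu_factor(A16)                           # "low-precision" LU
    lu = lu.astype(np.float16).astype(np.float32)      # keep the factors at FP16 accuracy

    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                  # residual in FP64
        if np.linalg.norm(r, np.inf) < tol * np.linalg.norm(b, np.inf):
            break
        z = lu_solve((lu, piv), r.astype(np.float32))  # correction from FP16-accurate factors
        x = x + z.astype(np.float64)                   # update in FP64
    return x

# Illustrative use on a well-conditioned (diagonally dominant) test matrix.
rng = np.random.default_rng(0)
N = 500
A = rng.standard_normal((N, N)) + N * np.eye(N)
b = rng.standard_normal(N)
x = mixed_precision_refinement(A, b)
print(np.linalg.norm(b - A @ x, np.inf))
```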
Early Error Analysis Results
● Standard backward stability:
  – (A + E) x = b, where ‖E‖ ≤ φ(n)·ε·‖A‖
● Need to generalize to two machine precisions, ε_16 and ε_64:
  – lim_{k→∞} ‖x_0 − x_k‖ = ForwardError(ε_16, ε_64)
  – lim_{k→∞} ‖b − A·x_k‖ / (‖A‖·‖x_k‖) = BackwardError(ε_16, ε_64)
● Details: see paper and tech report
● Summary: it works if the matrix cooperates
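The normwise backward error in the second limit is easy to monitor inside any of the refinement loops sketched earlier; a small helper with illustrative names, not taken from the talk:

```python
import numpy as np

def normwise_backward_error(A, x, b):
    """||b - A x||_inf / (||A||_inf * ||x||_inf), the quantity in the limit above."""
    r = b - A @ x
    return np.linalg.norm(r, np.inf) / (np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf))
```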
Hardware/Software Support: Assembly and Intrinsics
● x86
  – CVTSH_SS, CVTSS_SH
  – emmintrin.h: _cvtss_sh(), _cvtsh_ss()
  – f16cintrin.h → x86intrin.h: _mm_cvtph_ps(), _mm_cvtps_ph(), _mm256_cvtph_ps(), _mm256_cvtps_ph()
● PTX
  – cvt.f16.*
  – fma.f16x2
● ARM
  – vld1_f16, vst1_f16, vcvt_f16_f32
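The conversion forms listed above all perform the same FP32↔FP16 rounding and widening; the round-trip behavior can be previewed in NumPy without touching the intrinsics (a sketch for illustration only, not a use of the instructions themselves):

```python
import numpy as np

x = np.float32(1.0) / np.float32(3.0)   # 0.33333334 in FP32
h = x.astype(np.float16)                # round to FP16 (what _cvtss_sh does in hardware)
back = h.astype(np.float32)             # widen back to FP32 (what _cvtsh_ss does)
print(x, h, back, abs(x - back))        # rounding error on the order of eps16 * |x|
```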
In High-Level Programming Environments
● Julia
  – A = zeros(Float16, N, N); b = zeros(Float16, N, 1);
  – A[:, :] = randn(N, N); b[:, :] = randn(N, 1);
  – x = A \ b;  # works OK
● Python
  – numpy.float16
  – linalg.solve(randn(N, N).astype(float16), randn(N, 1).astype(float16))
  – TypeError: array type float16 is unsupported in linalg
● MATLAB
  – Must use MEX files
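One workaround, assumed here rather than taken from the slides, is to keep the arrays in float16 for storage and cast up to float32 only for the solve that NumPy refuses to do in half precision:

```python
import numpy as np

N = 1000
A = np.random.randn(N, N).astype(np.float16)   # FP16 storage
b = np.random.randn(N, 1).astype(np.float16)

# np.linalg.solve rejects float16 inputs, so cast up just for the computation.
x = np.linalg.solve(A.astype(np.float32), b.astype(np.float32))
```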
Autotuning with FP16, FP32, and FP64
[Plot: autotuned xGETRF()* performance on P100, N = 35000, up to 12 Tflop/s]
Autotuning with FP16, FP32, FP64 (color)
[Plot: autotuning results for xGETRF()* on P100, N = 35000, up to 12 Tflop/s (color version)]
Best Performers for FP16, FP32, FP64
[Plot: best-performing xGETRF()* variants on P100 for FP16, FP32, and FP64, N = 35000, up to 12 Tflop/s]
Convergence Results: All Precisions
[Plot: ‖b − A·x‖_inf over 30 iterations; residuals reach about 10^-4 (FP16), 10^-8 (FP32), and 10^-16 (FP64)]
Example Convergence: FP64 to FP32 / FP16
[Plot: ‖b − A·x‖_inf over 30 iterations]
Example of Slow Convergence: FP64 → FP16
[Plot: ‖b − A·x‖_inf over 100 iterations]
Future Work
● Test on new hardware
  – IBM/NVIDIA Minsky
  – ARM/Cavium
  – Tegra/Jetson
● New algorithm approaches
  – New iterative schemes
  – New precision tweaks to increase accuracy
● Verification
  – Up-casting
  – Down-casting
  – Convergence
● Performance
  – Improve 16-bit kernels
  – Use 32-bit kernels on non-supporting hardware