MRG8 – Random Number Generation for the Exascale Era
Yusuke Nagasaka†, Ken-ichi Miura‡, John Shalf‡, Akira Nukada†, Satoshi Matsuoka§†
† Tokyo Institute of Technology   ‡ Lawrence Berkeley National Laboratory   § RIKEN Center for Computational Science
Pseudo-Random Number Generator (PRNG)
■ PRNG is a crucial component of numerous algorithms and applications
– Quantum chemistry, molecular dynamics
– Broader classes of Monte Carlo algorithms
– The machine learning field
■ Shuffling of training data
■ Initializing the weights of a neural network
■ cf. NumPy employs Mersenne Twister
■ Pseudo- vs. real (true) random numbers
■ What are the requirements for a "good PRNG"?
• Long recurrence length
• Good statistical quality
• Deterministic jump-ahead for parallelism
• Performance (throughput)
Recurrence Length
■ PRNGs will eventually repeat themselves
– e.g., the LCG in the C standard library repeats in as few as 2.15 × 10^9 steps (too short)
– Erasing the effect of auto-correlation costs much extra work
■ Greatly reduces the effective performance of the algorithm
– Minimum requirement: no repetition over an entire year of execution at full speed on a supercomputer

           MT19937       MRG32k3a   Philox   MRG8
  Period   2^19937 - 1   2^191      2^130    (2^31 - 1)^8 - 1
Statistical Quality
■ The sequence must show no statistical bias
– Otherwise the PRNG affects the outcome of a simulation
■ TestU01, developed by L'Ecuyer
– Benchmark set for empirical statistical testing of random number generators
– Three pre-defined batteries
■ SmallCrush: 15 tests
■ Crush: 186 tests, consuming roughly 2^35 random numbers
■ BigCrush: 234 tests, consuming roughly 2^38 random numbers
Jump-ahead for Parallelism
■ Two primary approaches to parallelizing a PRNG
– Multistream
■ Each worker uses a different random "seed" to produce a different random number sequence
■ The overhead of setting the start point is low
■ The chance of correlated number sequences is not so low – cf. the birthday paradox
[Figure: Thread0–Thread3 start from independent seeds along the sequence; two streams may collide, producing correlated number sequences]
Jump-ahead for Parallelism
■ Two primary approaches to parallelizing a PRNG
– Substream (jump-ahead)
■ Each worker gets a sub-sequence that is guaranteed to be non-overlapping with its peers
– Parallelization does not break the statistical quality of the PRNG
■ The cost of jump-ahead may hurt parallel scalability
[Figure: the period N split at 0, N/4, 2N/4, 3N/4, one quarter per thread (Thread0–Thread3)]
MRG8
■ 8th-order full primitive polynomial
– One of the multiple recursive generators (MRGs)
– The next random number is generated from the previous eight by the recurrence
■ x_n = (a_1 x_{n-1} + a_2 x_{n-2} + a_3 x_{n-3} + a_4 x_{n-4} + a_5 x_{n-5} + a_6 x_{n-6} + a_7 x_{n-7} + a_8 x_{n-8}) mod (2^31 - 1)
■ The modulo operation can be executed with only "bit shift", "bit and", and "add" operations (see the sketch below)
■ Long period
– (2^31 - 1)^8 - 1 ≈ 4.5 × 10^74
■ Good statistical quality
– Passes BigCrush of TestU01
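Below is a minimal sketch of this divide-free reduction and one step of the recurrence. The coefficient array a[] is a placeholder here; the published MRG8 constants are not shown on this slide.

    #include <cstdint>

    // p = 2^31 - 1 is a Mersenne prime: 2^31 ≡ 1 (mod p), so a 64-bit
    // value folds down to the range [0, p) with shift/and/add only.
    constexpr uint64_t P = 0x7fffffffULL;

    uint64_t mod_p(uint64_t z) {        // valid for any z < 2^62
        z = (z & P) + (z >> 31);        // fold the high bits down once
        z = (z & P) + (z >> 31);        // a second fold absorbs the carry
        return (z >= P) ? z - P : z;    // final conditional subtract
    }

    // One step of the recurrence; a[] holds a_1..a_8 (placeholders, not the
    // real MRG8 constants) and x[] holds x_{n-1}..x_{n-8}.
    uint64_t mrg8_step(const uint64_t a[8], uint64_t x[8]) {
        uint64_t acc = 0;
        for (int i = 0; i < 8; ++i)
            acc = mod_p(acc + mod_p(a[i] * x[i]));   // products fit in 64 bits
        for (int i = 7; i > 0; --i) x[i] = x[i - 1]; // slide the state window
        x[0] = acc;
        return acc;
    }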
Contribution
■ We reformulate MRG8 for Intel's KNL and NVIDIA's GPUs
– Utilize the wide 512-bit registers
– Exploit the parallelism of many-core processors
■ Large performance benefit over existing libraries
– MRG8-AVX512 achieves a substantial 69% improvement
– MRG8-GPU shows up to a 3.36x speedup
■ Preserve the statistical quality and long period of the original MRG8
Reformulating to a Matrix-Vector Operation
■ Compute multiple next random numbers in one matrix-vector operation
– With the state vector y_n = (x_n, x_{n-1}, ..., x_{n-7})^T, the recurrence becomes y_n = A y_{n-1} mod p, where

        | a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8 |
        |  1   0   0   0   0   0   0   0  |
        |  0   1   0   0   0   0   0   0  |
    A = |  0   0   1   0   0   0   0   0  |
        |  0   0   0   1   0   0   0   0  |
        |  0   0   0   0   1   0   0   0  |
        |  0   0   0   0   0   1   0   0  |
        |  0   0   0   0   0   0   1   0  |

– Hence y_{n+8} = A^8 y_n mod p: one mat-vec with A^8 produces 8 new random numbers, and stacking

    [ y_{n+8} ; y_{n+16} ; y_{n+24} ; y_{n+32} ] = [ A^8 ; A^16 ; A^24 ; A^32 ] y_n mod p

  produces 32 at once; A^8, A^16, A^24, and A^32 can be precomputed
– Vector/parallel processing applies easily to a mat-vec operation (a scalar sketch follows)
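A scalar sketch of this formulation, reusing mod_p from the previous sketch; M stands for the precomputed A^8 mod p, whose entries (like y's) are below 2^31, so every product fits in 64 bits.

    #include <cstdint>

    uint64_t mod_p(uint64_t z);   // the shift/and/add reduction from the sketch above

    // y <- M y mod p. With M = A^8, one call yields 8 new random numbers;
    // the vectorized kernels described later parallelize exactly this loop nest.
    void matvec_mod(const uint64_t M[8][8], uint64_t y[8]) {
        uint64_t out[8] = {0};
        for (int i = 0; i < 8; ++i)
            for (int j = 0; j < 8; ++j)
                out[i] = mod_p(out[i] + mod_p(M[i][j] * y[j]));
        for (int i = 0; i < 8; ++i) y[i] = out[i];
    }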
Jump-ahead of the Random Sequence in MRG8
■ Jump-ahead to an arbitrary point
– To jump to the i-th point, compute A^i y mod p
– Implementation: matrix-vector multiplication
■ Precompute A^(2^j) (j = 0, 1, 2, ..., 246)
■ Compute A^i y mod p via the binary expansion i = Σ_j b_j 2^j, with b_j ∈ {0, 1}:
– A^i = (A^(2^0))^(b_0) · (A^(2^1))^(b_1) · (A^(2^2))^(b_2) · ... · (A^(2^246))^(b_246)
■ In the implementation this is executed as mat-vec, not mat-mat

    Jump-Ahead(A, y, i):
      for j = 0 to 246 do
        if (i & 0x1) == 1 then
          y = A^(2^j) y mod (2^31 - 1)
        i = i >> 1
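A C++ sketch of the loop above, assuming Apow[j] stores the precomputed A^(2^j) mod p and reusing matvec_mod from the previous sketch; for brevity the skip distance here is a 64-bit integer, whereas the real sequence index spans 247 bits.

    #include <cstdint>

    void matvec_mod(const uint64_t M[8][8], uint64_t y[8]);  // see previous sketch

    void jump_ahead(const uint64_t Apow[][8][8], uint64_t y[8], uint64_t skip) {
        for (int j = 0; skip != 0; ++j, skip >>= 1)  // binary digits of skip
            if (skip & 1)
                matvec_mod(Apow[j], y);              // y = A^(2^j) y mod p
    }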
MRG8-AVX512: Optimization for KNL
■ Efficiently compute y_{n+8} = A^8 y_n mod p with the wide 512-bit vector registers
– Generate 8 double-precision elements in parallel
– Executed as an outer product (see the sketch below)
■ Low cost of the jump-ahead function
– Exploit high parallelism (up to 272 threads)
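A sketch of how one 512-bit register can cover all 8 output lanes at once, sweeping the columns of A^8 in outer-product fashion. The layout A8cols[j] (column j of A^8 widened to 64-bit lanes) is our assumption; the actual MRG8-AVX512 kernel additionally converts the results to doubles.

    #include <immintrin.h>
    #include <cstdint>

    // Vector version of the shift/and/add reduction mod p = 2^31 - 1.
    static __m512i mod_p_vec(__m512i z) {
        const __m512i mask = _mm512_set1_epi64(0x7fffffff);
        z = _mm512_add_epi64(_mm512_and_epi64(z, mask), _mm512_srli_epi64(z, 31));
        z = _mm512_add_epi64(_mm512_and_epi64(z, mask), _mm512_srli_epi64(z, 31));
        __mmask8 ge = _mm512_cmpge_epu64_mask(z, mask);
        return _mm512_mask_sub_epi64(z, ge, z, mask);
    }

    // y <- A^8 y mod p with all 8 output lanes computed in one register.
    void next8_avx512(const uint64_t A8cols[8][8], uint64_t y[8]) {
        __m512i acc = _mm512_setzero_si512();
        for (int j = 0; j < 8; ++j) {
            __m512i col  = _mm512_loadu_si512(A8cols[j]);  // column j of A^8
            __m512i yj   = _mm512_set1_epi64(y[j]);        // broadcast y[j]
            // entries are < 2^31, so the low-32 x low-32 multiply is exact
            __m512i prod = _mm512_mul_epu32(col, yj);
            acc = mod_p_vec(_mm512_add_epi64(acc, mod_p_vec(prod)));
        }
        _mm512_storeu_si512(y, acc);                       // y_{n+8}
    }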
MRG8-GPU: Optimization for GPU
■ Efficiently compute the 32 × 8 matrix-vector operation [A^8; A^16; A^24; A^32] y_n mod p
– Computed as an outer product
■ One thread computes one random number
– __umulhi() instruction
■ Multiplies two 32-bit unsigned integers and returns the upper 32 bits of the result
■ Reduces expensive mixed-precision integer multiplications (see the sketch below)
■ Too many threads require many "jump-ahead" procedures
– Carefully select the best total number of threads while keeping GPU occupancy high
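A device-function sketch of the __umulhi trick for multiplication mod 2^31 - 1 (our illustration, assuming both operands are already reduced below 2^31; the paper's full kernel decomposition is not reproduced here).

    // CUDA C++: multiply mod p = 2^31 - 1 using the 32-bit high-half multiply.
    __device__ unsigned int mulmod_p(unsigned int a, unsigned int b) {
        unsigned int hi = __umulhi(a, b);        // upper 32 bits of a*b
        unsigned int lo = a * b;                 // lower 32 bits of a*b
        // full product = hi*2^32 + lo, and 2^32 ≡ 2 (mod 2^31 - 1)
        unsigned long long z = 2ull * hi + lo;
        z = (z & 0x7fffffffULL) + (z >> 31);     // one fold suffices here
        if (z >= 0x7fffffffULL) z -= 0x7fffffffULL;
        return (unsigned int)z;
    }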
API of MRG8-AVX512/-GPU
■ Single generation: double rand();
– Each function call returns a single random number
– Follows the C and C++ standard API
– Low throughput due to the overhead of a function call
■ Array generation: void rand(double *ran, int n);
– The user provides a pointer to an array along with the array size
– The array is filled with random numbers
– The style adopted by Intel MKL and cuRAND
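A hypothetical usage example of the two call styles; the class name and constructor are our assumptions for illustration, and only the two signatures above come from the slide.

    #include <vector>

    mrg8 gen(12345);                        // hypothetical generator object, seeded
    double r = gen.rand();                  // single generation: one number per call
    std::vector<double> buf(1 << 24);
    gen.rand(buf.data(), (int)buf.size());  // array generation: fill the buffer at once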
Model for Performance Upper Bound -1-
■ Performance upper bound for array generation
– Determined as min(p_m, p_c): memory-bound vs. compute-bound use case
– Memory-bound case (p_m)
■ Restricted by storing the generated random numbers to memory
■ Upper bound estimated from the memory bandwidth of the STREAM benchmark
– Compute-bound case (p_c)
■ Count the number of instructions
■ Only consider the kernel part, excluding the jump-ahead overhead
Model for Performance Upper Bound -2-
■ Intel KNL (MRG8-AVX512)
– Memory bandwidth is 166.6 GB/sec => p_m = 22.4 billion RNG/sec
– Compute-bound: p_c = 34.6 billion RNG/sec
■ 44 instructions per 8 random numbers generated
■ 136 vector units (2 units/core) at 1.4 GHz in the Intel Xeon Phi 7250
– Up to 54% better performance when the array fits entirely into the L1 cache
■ NVIDIA P100 GPU (MRG8-GPU)
– Memory bandwidth is 570.5 GB/sec => p_m = 76.6 billion RNG/sec
– Compute-bound: p_c = 49.7 billion RNG/sec
■ 101 instructions per random number generated
■ 3584 CUDA cores at 1.4 GHz in the NVIDIA P100 GPU
– MRG8-GPU is a compute-bound kernel in all cases
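As a sanity check, the compute-bound figures follow directly from the clock rates and instruction counts above:

    p_c(KNL)  = 136 VPUs  × 1.4 GHz × (8 RNGs / 44 instructions)  ≈ 34.6 billion RNG/sec
    p_c(P100) = 3584 cores × 1.4 GHz × (1 RNG / 101 instructions) ≈ 49.7 billion RNG/sec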
Performance Evaluation
Evaluation Environment
■ Cori Phase 2 @NERSC
– Intel Xeon Phi 7250
■ Knights Landing (KNL)
■ 68 cores, 1.4 GHz
■ 96GB DDR4 and 16GB MCDRAM
■ Quadrant/Cache mode
– Compiler: Intel C++ Compiler ver. 18.0.0
– OS: SuSE Linux Enterprise Server
■ TSUBAME-3.0 @TokyoTech
– NVIDIA Tesla P100
■ #SM: 56
■ Memory: 16GB
– Compiler: NVCC ver. 8.0.61
– OS: SUSE Linux Enterprise Server 12 SP2
Evaluation Methodology
■ Generate 64-bit floating-point random numbers
■ Generation sizes
– Single generation
■ 2^24 random numbers
– Array generation
■ Large: 2^x (x = 24..30)
– Fits into MCDRAM and GPU global memory, but not into cache
■ Small: 32, 64, 128 (Intel KNL only)
– The more practical case
– Repeated 1000 times by each thread on KNL
– Fits into the L1 cache
Evaluation Methodology: PRNG Libraries
■ Single generation
– C++11 standard library
■ MT19937
■ Array generation
– Intel MKL
■ MT19937, MT2203, SFMT19937, MRG32K3A, PHILOX
– NVIDIA cuRAND
■ MT19937, SFMT19937, XORWOW, MRG32K3A, PHILOX
Performance on KNL: Single Generation
■ MRG8 shows good performance and scalability
– C++11 does not support jump-ahead
Performance on KNL: Array Generation, Large Sizes
■ MRG8 shows performance comparable to Philox
– Both are close to the upper bound set by memory bandwidth
Performance on KNL: Array Generation, Small Sizes
■ MRG8 overcomes the upper bound set by memory bandwidth
– 1.69x faster than the other random number generators
Performance on KNL: Scalability
■ Performance drops beyond 64 threads for MT19937 and SFMT
– Large jump-ahead cost
■ MRG8 shows good scalability