Single processor tuning (1/2) Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.14] Thursday, February 21, 2008
Today’s sources
CS 267 (Yelick @ UCB; Spring 2007)
“A survey of out-of-core algorithms in numerical linear algebra,” by Toledo (1999)
“A family of high-performance matrix multiplication algorithms,” by Gunnels, et al. (2006)
“On reducing TLB misses in matrix multiplication,” by Goto and van de Geijn (2002)
“Is search really necessary to generate high-performance BLAS?” by Yotov, et al. (2005)
Review: Accuracy, stability, and parallelism
The impact of parallelism on numerical algorithms
Larger problems magnify errors: round-off, ill-conditioning, instabilities
Reproducibility: a + (b + c) ≠ (a + b) + c (sketch below)
A fast parallel algorithm may be much less stable than a fast serial algorithm
Flops are cheaper than communication
Speeds at different precisions may vary significantly [e.g., SSEk, Cell]
Perils of arithmetic heterogeneity, e.g., CPU vs. GPU support of IEEE
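A minimal C sketch of the reproducibility point: the same three summands, added in two different orders (as can happen when a reduction is split across processors), give different single-precision results. The constants are chosen only for illustration.

```c
/* Demo that floating-point addition is not associative, so different
 * reduction orders can give different answers. */
#include <stdio.h>

int main(void) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;   /* exact cancellation first: result is 1 */
    float right = a + (b + c);   /* b + c rounds back to -1.0e8f: result is 0 */
    printf("(a + b) + c = %g\n", left);
    printf("a + (b + c) = %g\n", right);
    return 0;
}
```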
Mixed (forward-backward) stability
The computed answer ŷ is “near” the exact solution of a nearby problem: there exist small Δx, Δy with ŷ + Δy = f(x + Δx), where y = f(x) is the exact answer.
[Diagram: x ↦ y = f(x); x + Δx ↦ f(x + Δx); the computed ŷ differs from f(x + Δx) by Δy.]
Conditioning: Relating forward and backward error
  |ŷ − y| / |y|  ≲  |x·f′(x) / f(x)| · |Δx / x|
Define the (relative) condition number:
  c(x) = |x·f′(x) / f(x)|
Roughly: (forward error) ≤ (condition number) × (backward error)
Mixed-precision iterative refinement
Inner loop of the mixed-precision iterative refinement algorithm:
  Single, O(n^3):  x̂ = estimated solution to A·x = b
  Double, O(n^2):  r̂ ← b − A·x̂
  Single, O(n^2):  solve A·d̂ = r̂
  Double, O(n):    x̂(improved) ← x̂ + d̂
Theorem: Repeated iterative refinement converges by a factor η at each stage, where
  x(t) ≡ estimate at iteration t, in precision ε
  r(t) ≡ residual, in precision ε^2
  η ≡ ε · ‖ |A^−1| · |L̂| · |Û| ‖_∞ < 1
and ‖x(t) − x‖_∞ / ‖x‖_∞ → O(ε) ⇐ independent of κ(A)!
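A minimal C sketch of this loop, assuming a small dense system, LU without pivoting, and a fixed iteration count; the matrix and right-hand side are made up for illustration (a real solver would pivot and test convergence of the residual).

```c
/* Sketch of mixed-precision iterative refinement:
 * factor/solve in single precision, residual and update in double. */
#include <stdio.h>

#define N 3

/* Factor A = LU in single precision (no pivoting), overwriting a[][]. */
static void lu_factor_single(float a[N][N]) {
    for (int k = 0; k < N; k++)
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];
            for (int j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];
        }
}

/* Solve L U x = b in single precision (forward + back substitution). */
static void lu_solve_single(float a[N][N], const float b[N], float x[N]) {
    float y[N];
    for (int i = 0; i < N; i++) {          /* L y = b (unit lower triangle) */
        y[i] = b[i];
        for (int j = 0; j < i; j++) y[i] -= a[i][j] * y[j];
    }
    for (int i = N - 1; i >= 0; i--) {     /* U x = y */
        x[i] = y[i];
        for (int j = i + 1; j < N; j++) x[i] -= a[i][j] * x[j];
        x[i] /= a[i][i];
    }
}

int main(void) {
    double A[N][N] = {{4, 1, 0}, {1, 4, 1}, {0, 1, 4}};  /* sample data */
    double b[N]    = {1, 2, 3};

    /* O(n^3) factorization in single precision. */
    float Af[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) Af[i][j] = (float)A[i][j];
    lu_factor_single(Af);

    /* Initial solve in single precision. */
    float bf[N], xf[N];
    for (int i = 0; i < N; i++) bf[i] = (float)b[i];
    lu_solve_single(Af, bf, xf);
    double x[N];
    for (int i = 0; i < N; i++) x[i] = xf[i];

    /* Refinement: O(n^2) residual in double, O(n^2) correction solve in
     * single (reusing the LU factors), O(n) update in double. */
    for (int iter = 0; iter < 5; iter++) {
        double r[N];
        for (int i = 0; i < N; i++) {
            r[i] = b[i];
            for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
        }
        float rf[N], df[N];
        for (int i = 0; i < N; i++) rf[i] = (float)r[i];
        lu_solve_single(Af, rf, df);
        for (int i = 0; i < N; i++) x[i] += (double)df[i];
    }

    for (int i = 0; i < N; i++) printf("x[%d] = %.15g\n", i, x[i]);
    return 0;
}
```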
Obstacles to fast and stable parallel numerical algorithms
Algorithms that work on small problems may fail at large sizes: round-off accumulates, the condition number increases, and the probability of “random instability” increases
A fast (parallel) algorithm may be less stable ⇒ trade-off
Floating-point numbers: y = ±m × 2^(e−t), with m ≥ 2^(t−1) when y ≠ 0 (“normalized”)
Single precision: 0 ≤ m < 2^24 ≈ 16 million, and e_min = −125 ≤ e ≤ e_max = 128
IEEE formats: y = ±m × 2^(e−t), 0 ≤ m < 2^t, e_min ≤ e ≤ e_max

Format            Total bits   t − 1   ε            (e_min, e_max)    Exp. bits   Fortran / C
Single            32           23      6 × 10^−8    (−125, 128)       8           REAL*4 / float
Double            64           52      10^−16       (−1021, 1024)     11          REAL*8 / double
Extended (Intel)  80           64      5 × 10^−20   (−16381, 16384)   15          REAL*10 / long double
Reasoning about memory hierarchies
Recall: memory hierarchies. The cost of accessing data depends on where the data lives.
[Figure: processor (control, datapath, registers, on-chip cache) backed by a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape).]

Level                          Cost     Size
Registers / on-chip cache      1 ns     B
Second-level cache (SRAM)      10 ns    KB
Main memory (DRAM)             100 ns   MB
Secondary storage (disk)       10 ms    GB
Tertiary storage (disk/tape)   10 s     TB
Memory hierarchies reflect the growing processor-memory speed gap.
Dealing with high memory latency
Use caches as fast memory:
  store data that will be reused many times (temporal locality);
  save chunks of contiguous data (spatial locality);
  exploit the fact that bandwidth improves faster than latency (prefetch).
Modern processors automate cache management:
  all loads are cached automatically (LRU) and loaded in chunks (the cache line size);
  a hardware prefetcher that detects simple access patterns is typical.
A simple model of memory
  m ≡ no. of words moved from slow to fast memory
  f ≡ no. of flops
  α ≡ time per slow memory op
  τ ≡ time per flop
  q ≡ f / m = flop-to-mop ratio ⇐ computational intensity
  T = f·τ + m·α = f·τ·(1 + (α/τ)·(1/q)), where α/τ is the machine balance
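A tiny calculator for this model; all of the numbers below (f, m, α, τ) are made-up sample values, not measurements, and exist only to show how q and the slowdown factor are computed.

```c
/* Evaluate the simple memory model: T = f*tau + m*alpha and
 * T / (f*tau) = 1 + (alpha/tau)*(1/q) with q = f/m. */
#include <stdio.h>

int main(void) {
    double f = 2.0e6;      /* flops                 (assumed sample value) */
    double m = 1.0e6;      /* words moved           (assumed sample value) */
    double tau = 0.5e-9;   /* seconds per flop      (assumed sample value) */
    double alpha = 50e-9;  /* seconds per memory op (assumed sample value) */

    double q = f / m;                          /* computational intensity */
    double T = f * tau + m * alpha;            /* total time              */
    double slowdown = 1.0 + (alpha / tau) / q; /* T / (f * tau)           */

    printf("q = %.1f, T = %.3g s, slowdown vs. peak = %.1fx\n", q, T, slowdown);
    return 0;
}
```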
Example: Matrix-vector multiply
// Implements y ← y + A·x
for i ← 1 to n do
  for j ← 1 to n do
    y_i ← y_i + a_ij · x_j
Example: Matrix-vector multiply (annotated with memory traffic)
// Read x (into fast memory)
// Read y
for i ← 1 to n do
  // Read row a_{i,*}
  for j ← 1 to n do
    y_i ← y_i + a_ij · x_j
// Write y to slow memory
Example: Matrix-vector multiply (counting flops and words)
// Read x (into fast memory)
// Read y
for i ← 1 to n do
  // Read row a_{i,*}
  for j ← 1 to n do
    y_i ← y_i + a_ij · x_j
// Write y to slow memory

f = 2n^2, m = 3n + n^2  ⇒  q ≈ 2  ⇒  T/(f·τ) ≈ 1 + (α/τ)·(1/2)
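A plain C version of this kernel, assuming row-major storage; it is shown only to make the counting above concrete, not as a tuned implementation.

```c
/* y <- y + A*x for an n x n row-major A: f = 2n^2 flops on roughly
 * n^2 + 3n words, so q ~ 2 regardless of cache size. */
#include <stddef.h>

void matvec(size_t n, const double *A, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {
        double yi = y[i];                 /* read y[i] once */
        for (size_t j = 0; j < n; j++)
            yi += A[i * n + j] * x[j];    /* stream row a_{i,*} */
        y[i] = yi;                        /* write y[i] back */
    }
}
```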
Machine balance, α/τ [See my thesis]
[Bar chart: empirically derived sustainable machine balance α/τ, roughly 5–40, for Ultra 2i, Ultra 3, Pentium 3, Pentium 3-M, Power3, Power4, and Itanium 1.]
Simplifying assumptions
Ignored flop/mop parallelism within the processor → drop the arithmetic term
Assumed fast memory is large enough to hold the vectors
Assumed no-cost fast memory access
Assumed memory latency is constant, charged per word
Ignored cache lines / block transfers
Ignored bandwidth
Predictive accuracy of this model
[Bar chart: modeled vs. measured matrix-vector multiply performance, 0–1,500 Mflop/s, on Ultra 2i, Ultra 3, P3, P3M, Power3, Power4, Itanium 1, and Itanium 2.]
Naive matrix-matrix multiply
// Implements C ← C + A·B
for i ← 1 to n do
  for j ← 1 to n do
    for k ← 1 to n do
      c_ij ← c_ij + a_ik · b_kj

f = 2n^3; best case m ≥ 4n^2  ⇒  T/(f·τ) ≥ 1 + (α/τ)·(2/n)
Naive matrix-matrix multiply (counting memory traffic)
// Implements C ← C + A·B
for i ← 1 to n do
  // Read row a_{i,*}
  for j ← 1 to n do
    // Read column b_{*,j}
    // Read c_ij
    for k ← 1 to n do
      c_ij ← c_ij + a_ik · b_kj
    // Write c_ij to slow memory

f = 2n^3, m = n^3 + 3n^2  ⇒  T/(f·τ) ≈ 1 + (α/τ)·(1/2)
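The corresponding C translation, again row-major and untuned, just to pin down the loop order being analyzed.

```c
/* Naive triple loop: C <- C + A*B, all matrices n x n, row-major.
 * For large n the column of B is evicted between uses, so q stays near 2. */
#include <stddef.h>

void matmul_naive(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double cij = C[i * n + j];
            for (size_t k = 0; k < n; k++)
                cij += A[i * n + k] * B[k * n + j];
            C[i * n + j] = cij;
        }
}
```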
Blocked (tiled) matrix multiply
// Let I, J, K = blocks of b indices
for I ← 1 to n/b index blocks do
  for J ← 1 to n/b index blocks do
    // Read block C_IJ
    for K ← 1 to n/b index blocks do
      // Read block A_IK
      // Read block B_KJ
      C_IJ ← C_IJ + A_IK · B_KJ
    // Write C_IJ to slow memory
Blocked (tiled) matrix multiply: m ≈ n^3/b  ⇒  q ≈ 2b  ⇒  T/(f·τ) = 1 + (α/τ)·(1/(2b))
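A sketch of one-level blocking in C, mirroring the pseudocode above: the block size b is a tunable parameter, ragged edges are handled with a min(), and no further tuning (register blocking, copy optimization) is attempted.

```c
/* One-level blocked C <- C + A*B, row-major, block size b.
 * Keeping roughly 3*b*b doubles in cache makes q grow with b. */
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

void matmul_blocked(size_t n, size_t b,
                    const double *A, const double *B, double *C) {
    for (size_t I = 0; I < n; I += b)
        for (size_t J = 0; J < n; J += b)
            for (size_t K = 0; K < n; K += b)
                /* C_IJ <- C_IJ + A_IK * B_KJ on one block triple */
                for (size_t i = I; i < min_sz(I + b, n); i++)
                    for (size_t j = J; j < min_sz(J + b, n); j++) {
                        double cij = C[i * n + j];
                        for (size_t k = K; k < min_sz(K + b, n); k++)
                            cij += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = cij;
                    }
}
```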
Architectural implications
M ≡ size of fast memory. Need 3b^2 ≤ M, and q ≈ 2b  ⇒  M ≥ (3/4)·q^2.
To run near peak, require T/(f·τ) = 1 + (α/τ)·(1/q) < 1.1  ⇒  M ≥ 75·(α/τ)^2.

Arch.       ≈ α/τ   KBytes
Ultra 2i    25      5.7
Ultra 3     14      1.8
Pentium 3   6.3     0.36
P-3M        10      0.92
Power3      8.8     0.71
Power4      15      2.1
Itanium 1   36      12
Itanium 2   5.5     0.28
Can we do better?
b = O(√M)  ⇒  m = O(n^3/b) = O(n^3/√M)
Bounding the amount of I/O possible
Consider a schedule in phases of exactly M transfers each (except the last).
Definition: c(i,j) is live during phase p if, for some k, we compute a(i,k) · b(k,j) during the phase, and some partial sum of c(i,j) is either in cache or moved to main memory.
At most 2M elements of C are live in phase p.
At most 2M distinct elements of A are in cache during phase p (either in cache at the beginning or moved into cache during the phase); the same holds for B.
Let A_p be the set of elements of A in cache during phase p.
How many multiplies in phase p?
Let S_{p,+} = set of rows of A with √M or more elements in A_p, and S_{p,-} = set of rows with fewer. Since |A_p| ≤ 2M, we have |S_{p,+}| ≤ 2M/√M = 2√M.
Rows in S_{p,+}: the operation a(i,:) × B touches each element of B only once, and at most 2M elements of B are available in the phase, so the number of scalar multiplies is ≤ |S_{p,+}| · 2M = 4·M^(3/2).
Rows in S_{p,-}: each c(i,j) is a partial row × column product of length < √M, so (# multiplies) ≤ (# live c(i,j)) × (max row length) ≤ 2M · √M = 2·M^(3/2).
Final bound on multiplies
Total no. of multiplies = n^3
No. of multiplies per phase ≤ 6·M^(3/2)
No. of phases ≥ n^3 / (6·M^(3/2))
Total no. of words transferred ≥ M · (n^3 / (6·M^(3/2)) − 1) = n^3 / (6√M) − M
Can we do better? No.
Theorem [Hong and Kung (1981)]: Any schedule of conventional matrix multiplication must transfer Ω(n^3/√M) words between slow and fast memory, where M < n^2/6. (The preceding proof follows Toledo (1999).)
So cache-blocked matrix multiply, with b = O(√M) ⇒ m = O(n^3/b) = O(n^3/√M), is asymptotically optimal.
What happens in practice? Experiment: One-level cache-blocked matrix multiply Block size chosen as square, by exhaustive search over sizes up to 64
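A rough sketch of that experiment, assuming the matmul_blocked routine from the earlier sketch is compiled and linked in; the problem size, single-trial timing, and lack of warm-up or result verification are simplifications for illustration only.

```c
/* Exhaustive search over square block sizes 1..64 for a one-level
 * blocked matrix multiply; reports the fastest block size found.
 * Uses POSIX clock_gettime for wall-clock timing. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Provided by the blocked-multiply sketch shown earlier. */
void matmul_blocked(size_t n, size_t b,
                    const double *A, const double *B, double *C);

static double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    size_t n = 512;                       /* assumed problem size */
    double *A = malloc(n * n * sizeof *A);
    double *B = malloc(n * n * sizeof *B);
    double *C = malloc(n * n * sizeof *C);
    for (size_t i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    size_t best_b = 1;
    double best_gflops = 0.0;
    for (size_t b = 1; b <= 64; b++) {
        /* C accumulates across trials; that is fine since only time is measured. */
        double t0 = wall_seconds();
        matmul_blocked(n, b, A, B, C);
        double dt = wall_seconds() - t0;
        double gflops = 2.0 * n * n * n / dt * 1e-9;
        if (gflops > best_gflops) { best_gflops = gflops; best_b = b; }
    }
    printf("best block size b = %zu (%.2f Gflop/s)\n", best_b, best_gflops);
    free(A); free(B); free(C);
    return 0;
}
```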
[Plot] Tiled MM on AMD Opteron, 2.2 GHz (4.4 Gflop/s peak), 1 MB L2 cache. We evidently still have a lot of work to do...
Administrivia
Two joint classes with CS 8803 SC
Tues 2/19: Floating-point issues in parallel computing, by me
Tues 2/26: GPGPUs, by Prof. Hyesoon Kim
Scribe?
Both classes meet in Klaus 1116E
Homework 1: Parallel conjugate gradients
Extension: due Wednesday 2/27 @ 8:30 am
Implement a parallel solver for Ax = b (serial C version provided)
Evaluate on three matrices: a 27-pt stencil and two application matrices
“Simplified”: no preconditioning
Performance models to understand the scalability of your implementation: make measurements, build predictive models
Collaboration encouraged: compare programming models or platforms