
COMP 633 - Parallel Computing, Lecture 6 (September 1, 2020): SMM (1) - Memory Hierarchies and Shared Memory



  1. COMP 633 - Parallel Computing, Lecture 6 (September 1, 2020). SMM (1): Memory Hierarchies and Shared Memory

  2. PRAM example 1
     • Evaluate a polynomial for a given value of x and coefficients c_1 … c_n:
       P(x) = c_1·x^(n-1) + c_2·x^(n-2) + ⋯ + c_(n-1)·x + c_n
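One natural data-parallel way to do this (a sketch in C/OpenMP, not taken from the slides; the function name and the per-term use of pow() are illustrative) is to compute the n terms independently and sum them with a reduction:

    #include <math.h>

    /* Sketch: evaluate P(x) = c[0]*x^(n-1) + ... + c[n-2]*x + c[n-1].
     * Each term depends only on i, so all n terms can be computed in parallel;
     * the final sum is a parallel reduction. */
    double poly_eval(const double *c, int n, double x) {
        double p = 0.0;
        #pragma omp parallel for reduction(+:p)
        for (int i = 0; i < n; i++) {
            p += c[i] * pow(x, n - 1 - i);   /* term i, independent of all others */
        }
        return p;
    }

On a PRAM the powers of x would themselves be computed with a parallel prefix of multiplications rather than a pow() call per term, giving O(lg n) time on n processors; the sketch above only conveys the independence of the terms.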

  3. PRAM example 2 - bitonic merge
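The slide carries only the title; for reference, here is a minimal sketch of the textbook bitonic merge network in the same C/OpenMP style (an illustration, not necessarily the formulation used in lecture). Given a bitonic sequence of length n (a power of two), lg n stages of n/2 independent compare-exchanges produce a sorted sequence, which is what makes the merge an O(lg n)-time PRAM algorithm on n/2 processors:

    /* Sketch: merge a bitonic sequence a[0..n-1] (n a power of two) into
     * ascending order.  Each stage's compare-exchanges touch disjoint pairs,
     * so they can all run in parallel. */
    void bitonic_merge(double *a, int n) {
        for (int s = n / 2; s >= 1; s /= 2) {          /* lg n stages */
            #pragma omp parallel for
            for (int i = 0; i < n; i++) {
                if ((i & s) == 0) {                    /* i is the low index of its pair */
                    int j = i + s;
                    if (a[i] > a[j]) { double t = a[i]; a[i] = a[j]; a[j] = t; }
                }
            }
        }
    }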

  4. Topics
     • PRAM algorithm examples
     • Memory systems
       – organization
       – caches and the memory hierarchy
       – influence of the memory hierarchy on algorithms
     • Shared memory systems
       – taxonomy of actual shared memory systems: UMA, NUMA, cc-NUMA

  5. Recall the PRAM shared memory system
     • PRAM model
       – assumes access latency is constant, regardless of the value of p or the size of memory
       – simultaneous reads permitted under the CR model and simultaneous writes permitted under the CW model
     • Physically impossible to realize
       – processors and memory occupy physical space
         • speed-of-light limit: L(p, m) = Ω((p + m)^(1/3))
       – CR / CW must be reduced to ER / EW
         • requires Θ(lg p) time in the general case
     [Figure: processors 1 … p connected to a single shared memory]

  6. Anatomy of a processor + memory system
     • Performance parameters of Random Access Memory (RAM)
       – latency L
         • elapsed time from presentation of a memory address to arrival of the data: address transit time + memory access time t_mem + data transit time
       – bandwidth W
         • number of values (e.g. 64-bit words) delivered to the processor per unit time
         • simple implementation: W ~ 1/L
     [Figure: Processor connected to Memory]

  7. Processor vs. memory performance
     • The memory "wall"
       – processors compute faster than memory delivers data
         • increasing imbalance: t_arith ≪ t_mem

  8. Improving memory system performance (1)
     • Decrease latency L to memory
       – speed of light is a limiting factor
         • bring memory closer to the processor
       – decrease memory access time by decreasing memory size s
         • access time ∝ s^(1/2) (VLSI)
       – use faster memory technology
         • DRAM (Dynamic RAM), 1 transistor per stored bit: high density, low power, long access time, low cost
         • SRAM (Static RAM), 6 transistors per stored bit: low density, high power, short access time, high cost

  9. Improving memory system performance (1)
     • Decrease latency using cache memory
       – low-latency access to frequently used values, high latency for the remaining values
     [Figure: Processor - Cache - Memory]
       – Example
         • 90% of references are to the cache, with latency L1
         • 10% of references are to memory, with latency L2
         • average latency is 0.9·L1 + 0.1·L2
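As a worked instance of that formula (borrowing the RS6000 numbers from the memory-hierarchy slide later in this deck): with L1 = 2 cycles and L2 = 60 cycles, the average latency is 0.9·2 + 0.1·60 = 1.8 + 6 = 7.8 cycles, roughly 8x better than the 60 cycles paid when every reference goes to memory.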

  10. Improving memory system performance (2)
     • Increase bandwidth W
       – multiport (parallel access) memory
         • multiple reads and multiple exclusive writes per memory cycle
         • high cost, very limited scalability
         [Figure: multiported register file feeding the processor]
       – "blocked" memory
         • memory supplies a block of size b containing the requested word
         • supports spatial locality in cache access
         [Figure: Processor - Cache - Memory with block transfers of size b]

  11. Improving memory system performance (2)
     • Increase bandwidth W (contd)
       – pipeline memory requests
         • requires independent memory references
       – interleave memory
         • problem: memory access is limited by t_mem
         • use m separate memories (memory banks)
         • W ~ m / L if references distribute over the memory banks
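A minimal sketch of one common interleaving scheme (word interleaving; the mapping below is an illustrative assumption, not something specified on the slide): consecutive words are assigned to consecutive banks, so a unit-stride stream cycles through all m banks and can sustain roughly m/L words per unit time.

    #include <stdint.h>

    /* Word-interleaved bank mapping for m banks (m assumed to be a power of two):
     * consecutive 8-byte words land in consecutive banks, so a unit-stride
     * stream keeps all m banks busy.  A stride that is a multiple of m hits a
     * single bank and bandwidth falls back to ~1/L. */
    static inline unsigned bank_of(uint64_t byte_addr, unsigned m) {
        uint64_t word = byte_addr >> 3;        /* 8-byte word index */
        return (unsigned)(word & (m - 1));     /* low lg(m) bits select the bank */
    }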

  12. Latency hiding
     • Amortize latency using a pipelined, interleaved memory system
       – k independent references in Θ(L + k·t_proc) time
         • O(L/k) amortized (expected) latency per reference
     • Where do we get independent references? (contrast sketched below)
       – out-of-order execution of independent load/store operations
         • found in most modern performance-oriented processors
         • partial latency hiding: k ~ 2-10 references outstanding
       – vector load/store operations
         • small vector units (AVX-512): vector length 2-8 words (Intel Xeon); partial latency hiding
         • high-performance vector units (NEC SX-9, SX-Aurora): vector length k = L / t_proc (128-256 words); crossbar network to highly interleaved memory (~16,000 banks); full latency hiding, i.e. amortized memory access at processor speed
       – multithreaded operation
         • independent execution threads with individual hardware contexts
         • partial latency hiding: 2-way hyperthreading (Intel)
         • full latency hiding: 128-way threading with high-performance memory (Cray MTA)
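To make "independent references" concrete, here is a small illustrative contrast (my example, not from the slides). In the array sum the address of every load is known without waiting for earlier loads, so out-of-order or vector hardware can keep many misses in flight; in the linked-list sum each load's address comes from the previous load, so references serialize and every element pays the full latency L.

    #include <stddef.h>

    struct node { double val; struct node *next; };

    /* Independent memory references: a[i] does not depend on a[i-1],
     * so the hardware can overlap many outstanding loads. */
    double array_sum(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }

    /* Dependent memory references: the next address is the result of the
     * previous load, so latency cannot be hidden. */
    double list_sum(const struct node *p) {
        double s = 0.0;
        for (; p != NULL; p = p->next) s += p->val;
        return s;
    }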

  13. Implementing the PRAM
     • How close can we come to O(1)-latency PRAM memory in practice?
     [Figure: memory modules M_1 M_2 M_3 … M_(m-1) M_m connected through a Network to processors P_1 P_2 … P_p]
       – requires a processor-to-memory network
         • latency L = sum of twice the network latency, the memory cycle time, and the serialization time for CR, CW
         • L increases with m, p
       – L too large with current technology
       – examples
         • NYU Ultracomputer (1987), IBM RP3 (1991), SBPRAM (1999)
           » logarithmic-depth combining network eliminates memory contention time for CR, CW
           » Θ(lg p) latency in the network is prohibitive

  14. Implementing PRAM - a compromise
     • Use latency hiding with a high-performance memory system
       – implements a p·k-processor EREW PRAM slowed down by a factor of k
         • use m ≥ p·(t_mem / t_proc) memory banks to match the memory reference rate of p processors
         • total latency 2L for k = L / t_proc independent random references at each processor
         • O(t_proc) amortized latency per reference at each processor
       – unit latency degrades in the presence of concurrent reads/writes
     [Figure: memory banks M_1 … M_m connected through a Network to processors P_1 … P_p]
       – Bottom line: doable, but very expensive and with only limited scaling in p

  15. Memory systems summary
     • Memory performance
       – latency is limited by physics
       – bandwidth is limited by cost
     • Cache memory: low-latency access to some values
       – caching frequently used values rewards temporal locality of reference
       – caching consecutive values rewards spatial locality of reference
       – decreases average latency
         • 90 fast references + 10 slow references: effective latency = 0.9·L1 + 0.1·L2
     • Parallel memories
       – 100 independent references ≈ 100 fast references
       – relatively expensive
       – require parallel processing

  16. Simple uniprocessor memory hierarchy
     • Each component is characterized by
       – capacity
       – block size
       – (associativity)
     • Traffic between components is characterized by
       – access latency
       – transfer rate (bandwidth)
     • Example: IBM RS6000/320H (ca. 1991)
         Storage component   Latency (cycles)   Transfer rate (words [8B] / cycle)
         Disk                1,000,000          0.001
         Main memory         60                 0.1
         Cache               2                  1
         Registers           0                  3
     [Figure: hierarchy Disk - Main Memory - Cache - Registers - ALU]

  17. Cache operation
     • ABC cache parameters
       – associativity
       – block size
       – capacity
     • CCC performance model: cache misses can be
       – compulsory
       – capacity
       – conflict
     [Figure: cache annotated with associativity, block size, and capacity]
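A quick illustration of the three miss types (my example, not from the slide): the first touch of any block is a compulsory miss; streaming twice through an array larger than the cache produces capacity misses on the second pass because earlier blocks have already been evicted; and two arrays whose blocks map to the same set can evict each other even when the rest of the cache is idle, producing conflict misses that higher associativity would remove.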

  18. Cache operation: read
     [Figure: read access to a 256-way set-associative cache with 64-byte (512-bit) blocks. The 40-bit address is split into a 26-bit tag, an 8-bit index, and a 6-bit block offset; the index selects a set, each line in the set holds a valid bit <1>, a tag <26>, and data <512>, the stored tags are compared (=) against the address tag, and a MUX returns the requested 1, 2, 4, or 8 bytes to the processor.]
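A small sketch of the address split shown in the figure; the 26/8/6 field widths come from the slide, while the function itself is just an illustration:

    #include <stdint.h>

    /* Decompose a 40-bit address using the slide's parameters:
     *   64-byte blocks -> 6 offset bits, 8 index bits (256 sets), 26 tag bits. */
    typedef struct { uint32_t tag; uint32_t index; uint32_t offset; } cache_fields;

    static cache_fields split_address(uint64_t addr) {
        cache_fields f;
        f.offset = (uint32_t)( addr        & 0x3F);        /* bits  5..0  */
        f.index  = (uint32_t)((addr >>  6) & 0xFF);        /* bits 13..6  */
        f.tag    = (uint32_t)((addr >> 14) & 0x3FFFFFF);   /* bits 39..14 */
        return f;
    }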

  19. The changing memory hierarchy
     • IBM RS6000 320H, 25 MHz (1991)
         Storage component   Latency (cycles)   Transfer rate (words [8B] / cycle)
         Disk                1,000,000          0.001
         Main memory         60                 0.1
         Cache               2                  1
         Registers           1                  3
     • Intel Xeon 61xx, per core @ 3 GHz (2017)
         Storage component   Latency (cycles)   Transfer rate (words [8B] / cycle)
         HDD                 18,000,000         0.00007
         SSD                 300,000            0.02
         Main memory         250                0.2
         L3 cache            48                 0.5
         L2 cache            12                 1
         L1 cache            4                  2
         Registers           1                  6
     [Figure: hierarchy Disk - Main Memory - Cache - Registers - ALU]

  20. Computational Intensity: a key metric limiting performance
     • Computational intensity of a problem:
         I = (total # of arithmetic operations required, in flops) / (size of input + size of result, in 64-bit words)
     • BLAS - Basic Linear Algebra Subroutines
       – asymptotic performance is limited by computational intensity
       – A, B, C ∈ ℝ^(n×n);  x, y ∈ ℝ^n;  a ∈ ℝ
           Level    defn            flops     refs      I      name
           BLAS 1   y = ax          n         2n        0.5    scale
           BLAS 1   y = ax + y      2n        3n        0.67   triad
           BLAS 1   x · y           2n        2n        1      dot product
           BLAS 2   y = y + Ax      2n²+n     n²+3n     ~2     matrix-vector
           BLAS 2   A = A + xyᵀ     2n²       2n²+2n    ~1     rank-1 update
           BLAS 3   C = C + AB      2n³       4n²       n/2    matrix product
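To show how a row of the table is counted (a sketch built around the triad entry; the loop is my illustration, the counts are those in the table):

    /* BLAS 1 "triad": y = a*x + y
     * per iteration: 1 multiply + 1 add               -> 2n flops
     * per iteration: read x[i], read y[i], write y[i] -> 3n words
     * so I = 2n / 3n ≈ 0.67 flops per word moved, which is why BLAS 1 (and
     * BLAS 2) kernels are memory-bound, while BLAS 3 (I = n/2) can approach
     * processor speed for large n. */
    void triad(double *y, const double *x, double a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }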
