COMP 633 - Parallel Computing
Lecture 6, September 1, 2020
SMM (1): Memory Hierarchies and Shared Memory
PRAM example 1
• Evaluate a polynomial for a given value of y and coefficients a_1 … a_n (see the sketch below)
    p(y) = a_1·y^(n−1) + ⋯ + a_(n−1)·y + a_n
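A minimal sketch, not taken from the slides: evaluate the polynomial the PRAM way. Every term a_i·y^(n−i) is independent, and the final sum is a reduction; on a CREW PRAM the powers of y (formed by a prefix-product scan) and the summation tree each take O(lg n) time with n processors. The OpenMP pragma is optional and only approximates that machine model.

    #include <stdio.h>

    double poly_eval(const double a[], int n, double y)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)   /* parallel summation tree */
        for (int i = 0; i < n; i++) {
            double yp = 1.0;                        /* y^(n-1-i), formed here by    */
            for (int k = 0; k < n - 1 - i; k++)     /* repeated multiplication; a   */
                yp *= y;                            /* PRAM would use a prefix scan */
            sum += a[i] * yp;
        }
        return sum;
    }

    int main(void)
    {
        double a[] = {1.0, -2.0, 3.0};               /* p(y) = y^2 - 2y + 3 */
        printf("p(2) = %g\n", poly_eval(a, 3, 2.0)); /* prints p(2) = 3     */
        return 0;
    }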
PRAM example 2 – bitonic merge (sketch below)
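The slide itself is a figure; the following is a hedged sketch of the standard bitonic merge it illustrates. Given a bitonic sequence a[0..n−1] with n a power of two, each of the lg n stages performs n/2 independent compare-exchanges, so on an EREW PRAM the merge runs in O(lg n) time with n/2 processors; the OpenMP loop stands in for those processors.

    #include <stdio.h>

    static void cmpxchg(double *x, double *y)
    {
        if (*x > *y) { double t = *x; *x = *y; *y = t; }
    }

    void bitonic_merge(double a[], int n)
    {
        for (int half = n / 2; half >= 1; half /= 2) {
            /* all compare-exchanges within a stage are independent */
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                if ((i & half) == 0)            /* partner is i + half */
                    cmpxchg(&a[i], &a[i + half]);
        }
    }

    int main(void)
    {
        /* an ascending run followed by a descending run is bitonic */
        double a[] = {1, 4, 6, 9, 8, 5, 3, 2};
        bitonic_merge(a, 8);
        for (int i = 0; i < 8; i++) printf("%g ", a[i]);
        printf("\n");                           /* 1 2 3 4 5 6 8 9 */
        return 0;
    }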
Topics
• PRAM algorithm examples
• Memory systems
  – organization
  – caches and the memory hierarchy
  – influence of the memory hierarchy on algorithms
• Shared memory systems
  – taxonomy of actual shared memory systems
    • UMA, NUMA, cc-NUMA
Recall the PRAM shared memory system
• PRAM model
  – assumes access latency is constant, regardless of the value of p or the size of memory
  – simultaneous reads permitted under the CR model and simultaneous writes permitted under the CW model
• Physically impossible to realize
  – processors and memory occupy physical space
    • speed-of-light limitation:  L = Ω((p + m)^(1/3))  (see the note below)
  – CR / CW must be reduced to ER / EW
    • requires Θ(lg p) time in the general case
[Figure: p processors (1, 2, …, p) connected to a shared memory]
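Why the cube-root bound is plausible (a sketch, not spelled out on the slide): p processors and m memory cells occupy physical volume proportional to p + m, so the machine's diameter, and hence a signal's worst-case transit time at the speed of light, grows at least as fast as the cube root of that volume:

    \text{volume} \propto p + m
    \quad\Rightarrow\quad
    \text{diameter} \propto (p + m)^{1/3}
    \quad\Rightarrow\quad
    L(p, m) = \Omega\!\left((p + m)^{1/3}\right)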
Anatomy of a processor + memory system
• Performance parameters of Random Access Memory (RAM)
  – latency L
    • elapsed time from presentation of the memory address to arrival of the data
      = address transit time + memory access time t_mem + data transit time
  – bandwidth W
    • number of values (e.g. 64-bit words) delivered to the processor per unit time
    • simple implementation: W ~ 1/L
[Figure: Processor connected to Memory]
Processor vs. memory performance
• The memory "wall"
  – processors compute faster than memory delivers data
    • increasing imbalance: t_arith ≪ t_mem
Improving memory system performance (1)
• Decrease latency L to memory
  – speed of light is a limiting factor
    • bring memory closer to the processor
  – decrease memory access time by decreasing memory size s
    • access time ∝ s^½ (VLSI)
  – use faster memory technology
    • DRAM (Dynamic RAM): 1 transistor per stored bit
      – high density, low power, long access time, low cost
    • SRAM (Static RAM): 6 transistors per stored bit
      – low density, high power, short access time, high cost
Improving memory system performance (1)
• Decrease latency using cache memory
  – low-latency access to frequently used values, high latency for the remaining values
  [Figure: Processor – Cache – Memory]
  – Example (worked instance below)
    • 90% of references are to cache with latency L_1
    • 10% of references are to memory with latency L_2
    • average latency is 0.9·L_1 + 0.1·L_2
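A worked instance of this average, using the cache and main-memory latencies of the IBM RS6000/320H from the hierarchy table later in this deck (L_1 = 2 cycles, L_2 = 60 cycles):

    0.9 \cdot 2 + 0.1 \cdot 60 \;=\; 1.8 + 6.0 \;=\; 7.8 \ \text{cycles}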
Improving memory system performance (2)
• Increase bandwidth W
  – multiport (parallel access) memory
    • multiple reads, multiple exclusive writes per memory cycle
    • high cost, very limited scalability
    [Figure: Processor with a multiported register file]
  – "blocked" memory
    • memory supplies a block of size b containing the requested word
    • supports spatial locality in cache access
    [Figure: Processor – Cache – Memory with block transfers of size b]
Improving memory system performance (2)
• Increase bandwidth W (cont'd)
  – pipeline memory requests
    • requires independent memory references
  – interleave memory (see the sketch below)
    • problem: memory access is limited by t_mem
    • use m separate memories (or memory banks)
    • W ~ m / L if references distribute over the memory banks
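A minimal sketch, not from the slides, of why the distribution of references matters: low-order interleaving assigns word w to bank w mod m, so a unit-stride stream spreads evenly over the m banks (giving W ~ m/L), while a stride that is a multiple of m keeps returning to one bank and gains no bandwidth.

    #include <stdio.h>

    #define M 8                        /* number of banks (example value) */

    int main(void)
    {
        int hits_unit[M] = {0}, hits_strided[M] = {0};

        for (int i = 0; i < 64; i++) {
            hits_unit[i % M]++;            /* word addresses 0, 1, 2, ...  */
            hits_strided[(i * M) % M]++;   /* word addresses 0, M, 2M, ... */
        }
        for (int b = 0; b < M; b++)
            printf("bank %d: unit-stride %d refs, stride-%d %d refs\n",
                   b, hits_unit[b], M, hits_strided[b]);
        return 0;   /* unit stride: 8 refs per bank; stride M: all 64 refs to bank 0 */
    }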
Latency hiding
• Amortize latency using a pipelined interleaved memory system
  – k independent references in Θ(L + k·t_proc) time (see the derivation below)
    • O(L/k) amortized (expected) latency per reference
• Where do we get independent references?
  – out-of-order execution of independent load/store operations
    • found in most modern performance-oriented processors
    • partial latency hiding: k ~ 2–10 references outstanding
  – vector load/store operations
    • small vector units (AVX512)
      – vector length 2–8 words (Intel Xeon)
      – partial latency hiding
    • high-performance vector units (NEC SX-9, SX-Aurora)
      – vector length k = L / t_proc (128–256 words)
      – crossbar network to highly interleaved memory (~16,000 banks)
      – full latency hiding: amortized memory access at processor speed
  – multithreaded operation
    • independent execution threads with individual hardware contexts
      – partial latency hiding: 2-way hyperthreading (Intel)
      – full latency hiding: 128-way threading with high-performance memory (Cray MTA)
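A short derivation of the amortized cost behind the first bullet (a sketch; choosing k = L / t_proc is exactly what the high-performance vector and full-multithreading entries above rely on):

    \frac{L + k\,t_{proc}}{k} \;=\; \frac{L}{k} + t_{proc}
    \qquad\xrightarrow{\;k \,=\, L/t_{proc}\;}\qquad
    2\,t_{proc}\ \text{per reference (full latency hiding)}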
Implementing the PRAM
• How close can we come to O(1)-latency PRAM memory in practice?
  [Figure: memory banks M_1, M_2, …, M_m connected through a network to processors P_1, P_2, …, P_p]
  – requires a processor-to-memory network
    • latency L = sum of
      – twice the network latency
      – memory cycle time
      – serialization time for CR, CW
    • L increases with m, p
  – L too large with current technology
  – examples
    • NYU Ultracomputer (1987), IBM RP3 (1991), SBPRAM (1999)
      – logarithmic-depth combining network eliminates memory contention time for CR, CW
        » Θ(lg p) latency in the network is prohibitive
Implementing the PRAM – a compromise
• Use latency hiding with a high-performance memory system
  – implements a p·k-processor EREW PRAM slowed down by a factor of k
    • use m ≥ p·(t_mem / t_proc) memory banks to match the memory reference rate of p processors (worked example below)
    • total latency 2L for k = L / t_proc independent random references at each processor
    • O(t_proc) amortized latency per reference at each processor
  – unit latency degrades in the presence of concurrent reads/writes
  [Figure: memory banks M_1, M_2, …, M_m connected through a network to processors P_1, P_2, …, P_p]
  – Bottom line: doable but very expensive, and only limited scaling in p
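A worked instance of the bank-count rule; the timing values here are illustrative assumptions, not taken from the slides:

    t_{mem} = 50\,\text{ns},\quad t_{proc} = 0.5\,\text{ns},\quad p = 64
    \;\Rightarrow\;
    m \,\ge\, p\,\frac{t_{mem}}{t_{proc}} = 64 \cdot 100 = 6400\ \text{banks},
    \qquad k = \frac{L}{t_{proc}}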
Memory systems summary
• Memory performance
  – latency is limited by physics
  – bandwidth is limited by cost
• Cache memory: low-latency access to some values
  – caching frequently used values
    • rewards temporal locality of reference
  – caching consecutive values
    • rewards spatial locality of reference
  – decreases average latency
    • 90 fast references, 10 slow references: effective latency = 0.9·L_1 + 0.1·L_2
• Parallel memories
  – 100 independent references → 100 fast references
  – relatively expensive
  – requires parallel processing
Simple uniprocessor memory hierarchy
[Figure: hierarchy of Disk – Main Memory – Cache – Registers – ALU]
• Each component is characterized by
  – capacity
  – block size
  – (associativity)
• Traffic between components is characterized by
  – access latency
  – transfer rate (bandwidth)
• Example: IBM RS6000/320H (ca. 1991)

  Storage component | Latency (cycles) | Transfer rate (words [8B] / cycle)
  Disk              | 1,000,000        | 0.001
  Main memory       | 60               | 0.1
  Cache             | 2                | 1
  Registers         | 0                | 3
Cache operation
• ABC cache parameters (see the relation below)
  – associativity
  – block size
  – capacity
  [Figure: cache organization labeled with associativity, capacity, and block size]
• CCC performance model
  – cache misses can be
    • compulsory
    • capacity
    • conflict
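The three ABC parameters fix the geometry of the cache; a small worked relation (the 32 KB / 8-way / 64 B figures are illustrative, not from the slides):

    \#\text{sets} \;=\; \frac{\text{capacity}}{\text{associativity} \times \text{block size}}
    \;=\; \frac{32\,\text{KB}}{8 \times 64\,\text{B}} \;=\; 64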
Cache operation: read
[Figure: read datapath for a 256-way associative cache with 64-byte (512-bit) blocks.
 A 40-bit address is split into <26> tag, <8> index, and <6> block-offset fields;
 the index selects a set, each way holds <1> valid, <26> tag, and <512> data bits,
 the tag comparators and a MUX select the matching block, and 1, 2, 4, or 8 bytes
 are returned as data. See the address-breakdown sketch below.]
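A minimal sketch of the address breakdown used in the figure, with the field widths taken from the slide (<26> tag, <8> index, <6> block offset of a 40-bit address); the example address is arbitrary.

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 6              /* 64-byte block      */
    #define INDEX_BITS  8              /* 256 sets           */
    #define TAG_BITS    26             /* 40 - 8 - 6         */

    int main(void)
    {
        uint64_t addr = 0x12345678ABULL & ((1ULL << 40) - 1);   /* 40-bit address */

        uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);   /* byte within block  */
        uint64_t index  = (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1); /* set # */
        uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);   /* compared per way   */

        printf("tag=%#llx index=%#llx offset=%#llx\n",
               (unsigned long long)tag, (unsigned long long)index,
               (unsigned long long)offset);
        return 0;
    }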
The changing memory hierarchy
[Figure: hierarchy of Disk – Main Memory – Cache – Registers – ALU]
• IBM RS6000/320H – 25 MHz (1991)

  Storage component | Latency (cycles) | Transfer rate (words [8B] / cycle)
  Disk              | 1,000,000        | 0.001
  Main memory       | 60               | 0.1
  Cache             | 2                | 1
  Registers         | 1                | 3

• Intel Xeon 61xx [per core @ 3 GHz] (2017)

  Storage component | Latency (cycles) | Transfer rate (words [8B] / cycle)
  HDD               | 18,000,000       | 0.00007
  SSD               | 300,000          | 0.02
  Main memory       | 250              | 0.2
  L3 cache          | 48               | 0.5
  L2 cache          | 12               | 1
  L1 cache          | 4                | 2
  Registers         | 1                | 6
Computational intensity: a key metric limiting performance
• Computational intensity of a problem
    I = (total # of arithmetic operations required, in flops) / (size of input + size of result, in 64-bit words)
• BLAS – Basic Linear Algebra Subroutines
  – asymptotic performance limited by computational intensity (worked out below for the matrix product)
  – A, B, C ∈ ℝ^(n×n);  x, y ∈ ℝ^n;  a ∈ ℝ

         | defn         | flops   | refs    | I    | name
  BLAS 1 | y = ax       | n       | 2n      | 0.5  | scale
  BLAS 1 | y = ax + y   | 2n      | 3n      | 0.67 | triad
  BLAS 1 | x • y        | 2n      | 2n      | 1    | dot product
  BLAS 2 | y = y + Ax   | 2n²+n   | n²+3n   | ~2   | matrix-vector product
  BLAS 2 | A = A + xyᵀ  | 2n²     | 2n²+2n  | ~1   | rank-1 update
  BLAS 3 | C = C + AB   | 2n³     | 4n²     | n/2  | matrix product
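A worked instance of the last table row, showing where I = n/2 comes from: reading A, B, and C accounts for 3n² words and writing C back for another n², matching the 4n² refs in the table.

    I \;=\; \frac{2n^3}{3n^2 + n^2} \;=\; \frac{2n^3}{4n^2} \;=\; \frac{n}{2}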