Caching, Parallelism, Fault Tolerance
Marco Serafini
COMPSCI 532, Lectures 2-3
Memory Hierarchy
Multi-Core Processors
Figure: several processor chips (one per socket on the motherboard), each containing multiple cores; all sockets share the main memory.
Questions
• Q1: Bottlenecks in the previous figure?
• Q2: Which is larger?
• The CPU data processing speed?
• The memory bus speed?
• Q3: Solutions?
General Pattern
• Fast memory (small, expensive) sits in front of slow memory (large, cheap)
• Cache hit: data is served from fast memory; cache miss: data is fetched from slow memory and cached
• Recently read data is fetched and cached; inactive data is evicted
Memory Hierarchy in a Processor
• CPU registers (L0)
• L1 cache (~64 KB x 2, SRAM), per core, split: instructions | data
• L2 cache (~512 KB, SRAM), per core, unified: instructions + data
• L3 cache, aka Last Level Cache (~4 MB, SRAM), shared across cores
• RAM (~4 GB to 1+ TB, DRAM), shared across processors
Architectural Variants
• Non-Uniform Memory Access (NUMA)
• Popular in modern multi-socket architectures
• Each socket has local RAM
• Other sockets can access it (typically via a point-to-point bus)
• Remote memory access is slower than local
• Fundamental principles remain
• Hierarchy
• Locality
Latency Numbers (2012, approximate!)
Analogy (1 ns = 1 hour)
• L1 cache access: 0.5 h (watching an episode of a TV series)
• L2 cache access: 7 h (almost a working day)
• Main memory reference: 4.17 days (a long camping trip)
• Disk seek: 1141 years (roughly the time passed since Charlemagne was crowned Emperor)
Why Caching?
• Temporal locality
• Recently accessed data is likely to be used again soon
• Spatial locality
• Data close to recently accessed data is likely to be used soon
Example
• Top-k integers in an unsorted array
• Maintain a min-heap to store the top-k elements
• Scan the array to update the heap: check the heap's min; if the scanned element is larger, delete the min and insert the element
• Q: Does locality help in this application?
Figure: an array being scanned (e.g., 111, …, 15, 212, 307, 556, 343, 212, 111) feeding a min-heap whose current min is 15.
Some Answers
• Temporal locality
• Helps for heap management
• Does not help for scanning the array
• Estimating access latency
• Consider the Top-100 example
• Array elements are 4-byte integers
• Question: What is the expected latency to fetch a heap element?
• Spatial locality?
• Assume a cache line is 64 bytes
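A minimal C++ sketch of this top-k scan (the use of std::priority_queue and the function names are illustrative assumptions, not from the slides), with comments on where locality shows up: a heap of 100 4-byte integers occupies roughly 400 bytes, so once warm it stays in L1 and fetching a heap element is essentially an L1 hit, while the sequential array scan touches 16 integers per 64-byte cache line, so at most about one access in sixteen misses.

#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Top-k over an unsorted array using a min-heap of the current best k elements.
// Locality notes:
//  - The heap holds k=100 4-byte ints (~400 bytes), so it fits in a few L1 cache
//    lines and stays hot (temporal locality): fetching its min is ~an L1 hit.
//  - The array is scanned sequentially: a 64-byte cache line holds 16 ints, so at
//    most ~1 out of 16 accesses misses the cache (spatial locality + prefetching).
std::vector<int32_t> topK(const std::vector<int32_t>& data, std::size_t k) {
    // Min-heap: the smallest of the current top-k sits at the top.
    std::priority_queue<int32_t, std::vector<int32_t>, std::greater<int32_t>> heap;
    for (int32_t elem : data) {                 // sequential scan of the array
        if (heap.size() < k) {
            heap.push(elem);
        } else if (elem > heap.top()) {         // check min; replace if larger
            heap.pop();                         // delete min
            heap.push(elem);                    // insert elem
        }
    }
    std::vector<int32_t> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;                              // top-k, in ascending order
}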
Cache Coherency (i.e., Consistency)
• Caches may hold different replicas of the same data
• Replication always creates consistency issues
• Programs assume that they access a single shared memory
• Keeping caches coherent is expensive!
Figure: two caches fetch the same address A from main memory; a hardware coherency protocol keeps their copies consistent.
MESI Protocol
• A cache line can be in four states
• Modified: not shared, dirty (i.e., inconsistent with main memory)
• Exclusive: not shared, clean (i.e., consistent with main memory)
• Shared: shared, clean
• Invalid: cannot be used
• Only clean data is shared
• When a cache line transitions to Modified, all its other copies become Invalid
• Invalid data needs to be fetched again
• Writes are detected by hardware snooping on the bus
• Q: Implications for programmers?
Write Back vs. Write Through
• How to react when the cache holds dirty data
• Write through: update lower-level caches & main memory immediately
• Write back: delay that update (propagate when the line is evicted)
False Sharing
• A core updates variable x and never reads y
• Another core reads y and never reads x
• Q: Can cache coherence kick in?
• A: Yes, if x and y are stored in the same cache line (see the sketch below)

struct foo {
  int x;
  int y;
};
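A hedged C++ sketch of this scenario (the PaddedFoo type, the alignas(64) padding, and the thread bodies are illustrative assumptions, not from the slides): with the plain struct, every write to x invalidates the cache line that the reader of y holds, so the coherency protocol kicks in even though the threads never touch the same variable; giving each field its own cache line removes the false sharing.

#include <atomic>
#include <cstdio>
#include <thread>

struct Foo {                           // x and y share one 64-byte cache line:
    std::atomic<int> x{0};             // the writer of x keeps invalidating the line...
    std::atomic<int> y{0};             // ...that the reader of y has cached (false sharing)
};

struct PaddedFoo {                     // each field gets its own cache line: no false sharing
    alignas(64) std::atomic<int> x{0};
    alignas(64) std::atomic<int> y{0};
};

int main() {
    PaddedFoo f;                       // swap in Foo to observe the slowdown
    std::thread writer([&] {           // updates x, never touches y
        for (int i = 0; i < 1000000; ++i) f.x.store(i, std::memory_order_relaxed);
    });
    std::thread reader([&] {           // reads y, never touches x
        int sum = 0;
        for (int i = 0; i < 1000000; ++i) sum += f.y.load(std::memory_order_relaxed);
        std::printf("sum = %d\n", sum);
    });
    writer.join();
    reader.join();
    return 0;
}

The same trick, padding or aligning per-thread counters to cache-line boundaries, is common in concurrent data structures.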
Keeping the CPU busy
Modern CPU Architectures
• Many optimizations to deal with stagnant clock speeds
• Pipelining
• Execute multiple instructions in a pipeline, not one at a time
• Pre-load and enqueue the instructions to be executed next
• Out-of-order execution
• Execute instructions whose input data is available, out of order
• Q: When do these not work?
• A: Branches
• Speculation
• The processor predicts which branch is taken and pipelines it
• Speculative work is thrown away in case of a branch misprediction (see the sketch below)
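To make the cost of branch misprediction concrete, a small hedged C++ sketch (not from the slides; function names are illustrative): on random data the branchy loop's condition is unpredictable, so roughly half of the speculatively executed iterations are thrown away, whereas the branchless version gives the predictor nothing to guess.

#include <cstdint>
#include <vector>

// Branchy: the CPU must guess the outcome of (v >= 128) for every element.
// On random data the guess is wrong ~50% of the time, and each misprediction
// discards the speculatively executed work in the pipeline.
int64_t sumLargeBranchy(const std::vector<uint8_t>& data) {
    int64_t sum = 0;
    for (uint8_t v : data) {
        if (v >= 128) sum += v;
    }
    return sum;
}

// Branchless: the condition is turned into arithmetic, so there is no branch
// to predict (compilers typically emit a conditional move or a mask/multiply).
int64_t sumLargeBranchless(const std::vector<uint8_t>& data) {
    int64_t sum = 0;
    for (uint8_t v : data) {
        sum += (v >= 128) * static_cast<int64_t>(v);
    }
    return sum;
}

Compilers sometimes perform this transformation themselves; the point is that removing unpredictable branches removes misprediction stalls.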
Example (Again)
• Top-k integers in an unsorted array
• Maintain a min-heap to store the top-k elements
• Scan the array to update the heap
• Q: Can we leverage speculation and prefetching?
Figure: the same array-scan and min-heap diagram as before.
Micro-Architectural Analysis

             cycles   IPC   instr.   L1 miss   LLC miss   branch miss
  Q1 Typer     34     2.0     68       0.6       0.57        0.01
  Q3 Typer     25     0.8     21       0.5       0.16        0.27
  Q3 TW        24     1.8     42       0.9       0.16        0.08

• CPU counters (perf tool), normalized per row scanned
• IPC: instructions per cycle
• L1, LLC: cache misses
• Branch miss: branch mispredictions
• Q: Which system performs better?
Figure: bar chart comparing Typer and TW on q3 (cycles).
Source: T. Kersten et al., "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask", VLDB 2018
Memory Stalls
Figure: memory stall cycles vs. other cycles (cycles per tuple) for Typer and Tectorwise on q3, across TPC-H scale factors 1 to 100.
• CPU cycles wasted waiting for data
• Speculation, prefetching = lower cost for cache misses
External I/O
Direct Memory Access (DMA)
• Hardware subsystem that transfers data to/from memory
• The CPU is offloaded
• The CPU sends a request and does something else
• It gets a notification when the transfer is done
• Used for
• Disk, network, GPU I/O
• Memory copying
• RDMA: Remote DMA
• Fast data transfer across nodes in a distributed system
Disk I/O
• Write/read blocks of bits
• Sequential writes and reads are more efficient
• There is a fixed cost for each I/O call
• Calls to operating system functions are expensive
• Q: How to amortize this cost?
• A: Batching: read/write larger blocks (see the sketch below)
• The same concept applies to network I/O and request processing
• Latency vs. throughput tradeoff
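A hedged C++ sketch of batching writes (the 1 MB buffer size, the file-descriptor parameter, and the function names are assumptions for illustration): issuing one write() system call per 4-byte record pays the fixed per-call cost once per record, while accumulating records in a user-space buffer and flushing it in large blocks amortizes that cost.

#include <cstdint>
#include <unistd.h>
#include <vector>

// Unbatched: one write() system call per 4-byte record (high fixed cost per call).
// Return values are ignored for brevity.
void writeUnbatched(int fd, const std::vector<int32_t>& records) {
    for (int32_t r : records) {
        write(fd, &r, sizeof(r));
    }
}

// Batched: accumulate records in a large in-memory buffer and flush it in
// block-sized chunks, amortizing the per-call cost over many records.
void writeBatched(int fd, const std::vector<int32_t>& records) {
    constexpr std::size_t kBufSize = 1 << 20;       // 1 MB buffer (arbitrary choice)
    std::vector<char> buf;
    buf.reserve(kBufSize);
    for (int32_t r : records) {
        const char* p = reinterpret_cast<const char*>(&r);
        buf.insert(buf.end(), p, p + sizeof(r));
        if (buf.size() >= kBufSize) {               // flush when the buffer is full
            write(fd, buf.data(), buf.size());
            buf.clear();
        }
    }
    if (!buf.empty()) write(fd, buf.data(), buf.size());  // flush the tail
}

The same buffer-and-flush pattern applies to network sends and request batching: larger batches improve throughput, but each record waits longer before it is actually written, which is the latency vs. throughput tradeoff mentioned above.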
Processes vs. Threads
Processes & Threads
• We have discussed that multicore processors are the future
• How to make use of parallelism?
• OS/PL support for parallel programming
• Processes
• Threads
Processes vs. Threads
• Process: separate memory space
• Thread: shared memory space (except the stack)

                            Processes     Threads
  Heap                      not shared    shared
  Global variables          not shared    shared
  Local variables (stack)   not shared    not shared
  Code                      shared        shared
  File handles              not shared    shared
Parallel Programming
• Shared memory
• Threads
• Access the same memory locations (in heap & global variables)
• Message-Passing
• Processes
• Explicit communication: message-passing
Threads/Shared Memory
Shared Memory Example
This is "pseudo-Java" (in C++: pthread_create, pthread_join)

void main() {
  x = 12;              // assume that x is a global variable
  t = new ThreadX();
  t.start();           // starts thread t
  y = 12 / x;
  System.out.println(y);
  t.join();            // wait until t completes
}

class ThreadX extends Thread {
  void run() {
    x = 0;
  }
}

• Question: What is printed as output?
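Since the slide points at pthread_create/pthread_join for C++, here is a hedged C++ equivalent of the pseudo-Java above (using std::thread instead of raw pthreads; nothing beyond the slide's logic is implied). Whether it prints 1 or dies with a division by zero depends on whether the spawned thread sets x to 0 before or after the division, which is exactly the non-determinism the question is after.

#include <iostream>
#include <thread>

int x = 0;   // global variable shared by both threads

int main() {
    x = 12;
    std::thread t([] { x = 0; });   // plays the role of ThreadX.run()
    int y = 12 / x;                 // race: x may still be 12, or already 0
    std::cout << y << std::endl;
    t.join();                       // wait until t completes
    return 0;
}
// Note: the unsynchronized accesses to x are a data race (undefined behavior in
// C++); the example reproduces the slide's bug on purpose.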
Desired: Atomicity and Isolation
• Atomic: all or nothing
• Isolation: run as if there was no concurrency

Threads a and b both call foo():

void foo() {
  x = 0;
  x = 1;
  y = 1 / x;
}

POSSIBLE (interleaved execution):
  a: x = 0
  a: x = 1
  b: x = 0
  a: y = 1/0   (division by zero!)
  b: x = 1
  b: y = 1

DESIRED (happens-before: a's changes become visible before b starts):
  a: x = 0
  a: x = 1
  a: y = 1
  b: x = 0
  b: x = 1
  b: y = 1
Race Conditions
• Non-deterministic access to shared variables
• Correctness requires a specific sequence of accesses
• But we cannot rely on it because of non-determinism!
• Solutions
• Enforce a specific order using synchronization
• Enforce a sequence of happens-before relationships
• Locks, mutexes, semaphores: threads block each other
• Lock-free algorithms: threads do not wait for each other
• Hard to implement correctly! The typical programmer uses locks
• Java offers optimized thread-safe data structures, e.g., ConcurrentHashMap
Locks
We use a lock variable l and use it to synchronize:

Thread a:          Thread b:
  l.lock();          l.lock();
  foo();             foo();
  l.unlock();        l.unlock();

void foo() {
  x = 0;
  x++;
  y = 1/x;
}

Equivalent: declare the method as void synchronized foo()

Possible: thread a acquires l and runs foo(); thread b's l.lock() waits; after a's l.unlock(), b acquires l and runs foo().
Impossible now: the interleaved execution from the previous slide.
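A hedged C++ rendering of the same idea (std::mutex with std::lock_guard in place of the Java-style l.lock()/l.unlock(); variable names are illustrative): the guard holds the lock for the whole critical section, so the two threads can no longer interleave inside foo.

#include <iostream>
#include <mutex>
#include <thread>

int x = 0;
int y = 0;
std::mutex l;   // the lock variable "l" from the slide

void foo() {
    std::lock_guard<std::mutex> guard(l);   // lock() here, unlock() when guard leaves scope
    x = 0;
    x++;
    y = 1 / x;                              // x is guaranteed to be 1 here
}

int main() {
    std::thread a(foo);
    std::thread b(foo);
    a.join();
    b.join();
    std::cout << "y = " << y << std::endl;  // always prints 1
    return 0;
}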
Inter-Thread Communication

Thread a:                   Thread b:
  synchronized(o) {           synchronized(o) {
    o.wait();                   foo();
    foo();                      o.notify();
  }                           }

• notify() on an object sends a signal that activates other threads waiting on that object
• Useful for controlling the order of actions: thread b executes foo before thread a
• Timeline: thread a calls o.wait() and waits; thread b runs foo() and calls o.notify(); thread a wakes up and runs foo()
• Example: producer/consumer pairs; the consumer can avoid busy waiting
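For the producer/consumer remark, a hedged C++ sketch using std::condition_variable, the C++ counterpart of wait/notify (the run of 10 items and the variable names are illustrative assumptions): the consumer sleeps inside wait() instead of busy waiting, and the producer notifies it whenever new items, or the end of the stream, are available.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> items;
std::mutex m;
std::condition_variable cv;
bool done = false;

void producer() {
    for (int i = 0; i < 10; ++i) {
        { std::lock_guard<std::mutex> g(m); items.push(i); }
        cv.notify_one();                       // wake the consumer (like o.notify())
    }
    { std::lock_guard<std::mutex> g(m); done = true; }
    cv.notify_one();
}

void consumer() {
    std::unique_lock<std::mutex> lk(m);
    while (true) {
        cv.wait(lk, [] { return !items.empty() || done; });  // sleep, no busy waiting
        while (!items.empty()) {
            std::cout << "consumed " << items.front() << std::endl;
            items.pop();
        }
        if (done) break;
    }
}

int main() {
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
    return 0;
}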