Understanding CPU Caches Ulrich Drepper
Introduction Discrepancy between CPU and main memory speed ● Intel lists for the Pentium M nowadays: – ~240 cycles to access main memory ● The gap is widening ● Faster memory is too expensive
The Solution for Now CPU caches: additional set(s) of memory inserted between CPU and main memory ● Designed not to change the programs' semantics ● Controlled by the CPU/chipset ● Can have multiple levels with different speed (i.e., cost) and size
What Does It Look Like? [Diagram: Execution Unit ↔ 1st Level Data Cache and 1st Level Instruction Cache (≤1 cycle) ↔ 2nd Level Cache (~3 cycles) ↔ 3rd Level Cache (~14 cycles) ↔ System Bus ↔ Main Memory (~240 cycles)]
Cache Usage Factors Numerous factors decide cache performance: ● Cache size ● Cacheline handling – associativity ● Replacement strategy ● Automatic prefetching
Cache Addressing [Diagram: an address (32/64 bits) is split into a tag, H bits, and M bits. The low M bits are the offset into a cacheline of size 2^M; the H bits select one of the hash buckets (sets), each of which is N-way, i.e., holds N cachelines; the remaining tag bits distinguish addresses that hash to the same bucket. Addresses sharing the same H bits alias into the same bucket! A worked sketch follows.]
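As an illustration of this split (with hypothetical sizes, not taken from the slides): for a cache with 64-byte lines (M=6) and 512 buckets (H=9), an address decomposes as follows.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 6    /* M: 64-byte cacheline */
    #define SET_BITS  9    /* H: 512 hash buckets (sets) */

    int main(void)
    {
        uintptr_t addr   = (uintptr_t)0xdeadbeefUL;
        uintptr_t offset = addr & ((1u << LINE_BITS) - 1);              /* byte within the line */
        uintptr_t bucket = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
        uintptr_t tag    = addr >> (LINE_BITS + SET_BITS);              /* distinguishes aliases */
        printf("offset=%#lx bucket=%#lx tag=%#lx\n",
               (unsigned long)offset, (unsigned long)bucket, (unsigned long)tag);
        return 0;
    }

Two addresses whose bucket bits match compete for the same N ways, which is why large power-of-two strides can evict each other even in a mostly empty cache.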
Observing the Effects Test program to see the effects (see the sketch below): ● Walks a singly linked list – Sequential in memory – Randomly distributed ● Writes to list elements struct l { struct l *n; long pad[NPAD]; };
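A minimal sketch of such a pointer-chasing test (not Drepper's actual benchmark), assuming x86 and GCC/Clang: elements are allocated contiguously and linked sequentially; NPAD=7 pads each element to 64 bytes on a 64-bit machine, and __rdtsc() gives rough cycle counts.

    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>                          /* __rdtsc(), x86 only */

    #define NPAD 7                                  /* pads the element to 64 bytes on 64-bit */

    struct l {
        struct l *n;
        long pad[NPAD];
    };

    int main(void)
    {
        size_t n = (1u << 20) / sizeof(struct l);   /* 1 MiB working set */
        struct l *a = malloc(n * sizeof(*a));

        for (size_t i = 0; i + 1 < n; ++i)          /* link elements sequentially */
            a[i].n = &a[i + 1];
        a[n - 1].n = &a[0];                         /* close the cycle */

        size_t iters = 100 * n;
        struct l *p = &a[0];
        unsigned long long t0 = __rdtsc();
        for (size_t i = 0; i < iters; ++i)
            p = p->n;                               /* one dependent load per element */
        unsigned long long dt = __rdtsc() - t0;

        __asm__ volatile("" :: "r"(p));             /* keep the loop from being optimized away */
        printf("%.1f cycles / list element\n", (double)dt / iters);
        free(a);
        return 0;
    }

Shuffling the links instead of chaining them in order gives the "randomly distributed" variant; writing to pad[0] inside the loop gives the write test.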
Sequential Access (NPAD=0) [Plot: Cycles / List Element (roughly 4 to 9.5) vs. Working Set Size in Bytes, 2^10 to 2^28]
Sequential List Access [Plot: Cycles / List Element vs. Working Set Size (Bytes), 2^10 to 2^28; one curve per element size: Size=8, Size=64, Size=128, Size=256]
Sequential vs Random Access (NPAD=0) [Plot: Cycles / List Element vs. Working Set Size (Bytes), 2^10 to 2^28; curves: Sequential, Random]
Sequential Access (NPAD=1) [Plot: Cycles / List Element vs. Working Set Size (Bytes), 2^10 to 2^28; curves: Follow, Inc, Addnext0]
Optimizing for Caches I ● Use memory sequentially – For data, use arrays instead of lists – For instructions, avoid indirect calls ● Choose data structures that are as small as possible ● Prefetch memory (see the sketch below)
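One way to prefetch explicitly is the GCC/Clang builtin __builtin_prefetch; a minimal sketch, where the prefetch distance of 8 elements is a made-up tuning parameter, not a value from the slides:

    #include <stddef.h>

    long sum_with_prefetch(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; ++i) {
            if (i + 8 < n)                            /* start loading 8 elements ahead */
                __builtin_prefetch(&a[i + 8], 0, 3);  /* read access, high temporal locality */
            s += a[i];
        }
        return s;
    }

Hardware prefetchers already handle a simple stride like this one; explicit prefetching pays off mainly for access patterns the hardware cannot predict, such as the linked-list walk above.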
Sequential Access w/ vs w/out L3 [Plot: Cycles / List Element vs. Working Set Size (Bytes), 2^10 to 2^28; curves: P4/64/16k/1M-128b, P4/64/16k/1M-256b, P4/32/?/512k/2M-128b, P4/32/?/512k/2M-256b]
More Fun: Multithreading [Diagram: four CPU cores, each with its own L1; cores #1/#2 share one L2, cores #3/#4 share a second L2; both L2s connect to main memory] 1. CPU cores #1 and #3 read from a memory location; the relevant L1 caches contain the data 2. CPU core #2 writes to the memory location a) Notify core #1's L1 that its content is obsolete b) Notify the second processor's L2 and L1 that their content is obsolete
More Fun: Multithreading [Diagram: same four-core, two-L2 hierarchy as on the previous slide] 3. Core #4 writes to the memory location a) Wait for core #2's modified cache content to land in main memory b) Notify core #2's L1 and L2 that their content is obsolete
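The invalidation traffic in steps 1-3 can be provoked deliberately; a minimal sketch (not from the slides), assuming POSIX threads: two threads write to distinct fields that share one cacheline, so every write forces the other core's copy to be invalidated ("cacheline ping-pong", addressed on the next slide).

    #include <pthread.h>

    static struct {
        long a;                /* written only by thread 1 ... */
        long b;                /* ... and b only by thread 2, yet both share a cacheline */
    } s;

    static void *bump_a(void *arg) { for (long i = 0; i < 100000000; ++i) s.a++; return arg; }
    static void *bump_b(void *arg) { for (long i = 0; i < 100000000; ++i) s.b++; return arg; }

    int main(void)             /* compile with -pthread */
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }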
Sequential Increment, 128-Byte Elements [Plot, logarithmic y-axis: Cycles / List Element (1 to 1000) vs. Working Set Size (Bytes), 2^10 to 2^28; curves: Nthreads=1, Nthreads=2, Nthreads=4]
Optimizing for Caches II Cacheline ping-pong is deadly for performance ● If possible, always write from the same CPU ● Use per-CPU memory; pin each thread to a specific CPU ● Avoid placing data that is often independently read and written in the same cacheline (see the sketch below)
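A sketch of the last point, assuming 64-byte cachelines (on Linux the real value can be queried with sysconf(_SC_LEVEL1_DCACHE_LINESIZE)): aligning each per-thread counter to its own cacheline removes the ping-pong from the previous example.

    #include <stdalign.h>

    struct padded_counter {
        alignas(64) long value;              /* each counter gets a cacheline of its own */
    };

    static struct padded_counter counters[4];   /* one per thread/CPU */

With this layout, counters[i].value++ from thread i no longer invalidates the cachelines holding the other threads' counters.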
Questions?