  1. Understanding CPU Caches Ulrich Drepper

  2. Introduction
     Discrepancy between CPU and main memory speed
     ● Intel lists for the Pentium M nowadays:
       – ~240 cycles to access main memory
     ● The gap is widening
     ● Faster memory is too expensive

  3. The Solution for Now
     CPU caches: additional set(s) of memory added between the CPU and main memory
     ● Designed not to change the programs' semantics
     ● Controlled by the CPU/chipset
     ● Can have multiple layers with different speeds (i.e., costs) and sizes

  4. What Does It Look Like?
     [Diagram: the memory hierarchy. The execution unit (≤1 cycle) is served by
     the 1st level data and 1st level instruction caches (~3 cycles), backed by
     the 2nd level cache (~14 cycles), the 3rd level cache, and, over the
     system bus, main memory (~240 cycles).]

  5. Cache Usage Factors
     Numerous factors decide cache performance:
     ● Cache size
     ● Cacheline handling
       – associativity
     ● Replacement strategy
     ● Automatic prefetching

  6. Cache Addressing
     [Diagram: an address (32/64 bits) is split into three fields. The low bits
     select the byte within the cacheline (cacheline size), the next H bits
     select the hash bucket, and the remaining M bits form the tag stored with
     the line. Each bucket holds N ways; addresses that differ only in the tag
     land in the same bucket: aliases!]
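     To make the split concrete, the snippet below decodes a 64-bit address
     assuming a hypothetical geometry of 64-byte cachelines and 8192 sets;
     these values are illustrative, not from the slides, and vary by CPU.

     /* Hypothetical address decoding: 64-byte lines, 8192 buckets. */
     #include <stdint.h>
     #include <stdio.h>

     #define LINE_BITS 6    /* 2^6  = 64-byte cacheline        */
     #define SET_BITS 13    /* 2^13 = 8192 hash buckets (sets) */

     int main(void)
     {
         uint64_t addr   = 0x7ffe12345678;
         uint64_t offset = addr & ((1u << LINE_BITS) - 1);  /* byte in line */
         uint64_t bucket = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
         uint64_t tag    = addr >> (LINE_BITS + SET_BITS);  /* the M bits   */
         printf("offset=%llu bucket=%llu tag=%#llx\n",
                (unsigned long long)offset, (unsigned long long)bucket,
                (unsigned long long)tag);
         return 0;
     }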

  7. Observing the Effects
     Test program to see the effects:
     ● Walks a singly linked list
       – Sequential in memory
       – Randomly distributed
     ● Writes to the list elements (a fuller sketch follows below)
     struct l {
       struct l *n;
       long pad[NPAD];
     };
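     A minimal, self-contained sketch of such a pointer-chasing benchmark.
     The struct is the one from the slide; the NPAD value, element count, and
     timing method are assumptions, since the slide does not give them.

     #include <stddef.h>
     #include <stdio.h>
     #include <time.h>

     #define NPAD   7                /* assumed: 64-byte elements (8 + 7*8) */
     #define NELEMS (1 << 16)        /* assumed: 4 MiB working set          */

     struct l {
         struct l *n;
         long pad[NPAD];             /* NPAD=0 needs GCC's zero-length-array
                                        extension                           */
     };

     /* Link arr[] in the order given by idx[] and close the cycle. */
     static struct l *build(struct l *arr, const size_t *idx, size_t n)
     {
         for (size_t i = 0; i < n - 1; ++i)
             arr[idx[i]].n = &arr[idx[i + 1]];
         arr[idx[n - 1]].n = &arr[idx[0]];
         return &arr[idx[0]];
     }

     int main(void)
     {
         static struct l arr[NELEMS];
         static size_t idx[NELEMS];

         for (size_t i = 0; i < NELEMS; ++i)
             idx[i] = i;             /* sequential layout; shuffle idx[] for
                                        the "randomly distributed" case     */

         struct l *p = build(arr, idx, NELEMS);

         long rounds = 100L * NELEMS;
         clock_t start = clock();
         for (long i = 0; i < rounds; ++i)
             p = p->n;               /* the pointer chase being measured    */
         double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

         printf("%.1f ns/element (%p)\n",
                secs * 1e9 / (double)rounds,
                (void *)p);          /* print p so the chase isn't elided   */
         return 0;
     }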

  8. Sequential Access (NPAD=0)
     [Plot: Cycles / List Element (about 4 to 9.5) vs. Working Set Size in
     bytes (2^10 to 2^28)]

  9. Sequential List Access
     [Plot: Cycles / List Element (0 to 325) vs. Working Set Size in bytes
     (2^10 to 2^28), for element sizes of 8, 64, 128, and 256 bytes]

  10. Sequential vs Random Access (NPAD=0)
      [Plot: Cycles / List Element (0 to 500) vs. Working Set Size in bytes
      (2^10 to 2^28), sequential vs. random traversal]

  11. Sequential Access (NPAD=1)
      [Plot: Cycles / List Element (about 3 to 30) vs. Working Set Size in
      bytes (2^10 to 2^28); series: Follow, Inc, Addnext0]

  12. Optimizing for Caches I
      ● Use memory sequentially
        – For data, use arrays instead of lists
        – For instructions, avoid indirect calls
      ● Choose data structures that are as small as possible
      ● Prefetch memory (see the sketch below)
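      A sketch of software prefetching using the GCC/Clang builtin
      __builtin_prefetch. The distance of 8 elements ahead is an assumed
      value; the right distance depends on memory latency and on how much
      work is done per element.

      #include <stddef.h>

      long sum_with_prefetch(const long *a, size_t n)
      {
          long s = 0;
          for (size_t i = 0; i < n; ++i) {
              if (i + 8 < n)   /* request the line we'll need shortly */
                  __builtin_prefetch(&a[i + 8], /* rw */ 0, /* locality */ 3);
              s += a[i];
          }
          return s;
      }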

  13. Sequential Access w/ vs w/out L3
      [Plot: Cycles / List Element (0 to 500) vs. Working Set Size in bytes
      (2^10 to 2^28); series: P4/64/16k/1M-128b, P4/64/16k/1M-256b,
      P4/32/?/512k/2M-128b, P4/32/?/512k/2M-256b]

  14. More Fun: Multithreading
      [Diagram: four CPU cores (#1 to #4), each with its own L1 cache; the
      cores share L2 caches in pairs, and both L2s connect to main memory]
      1. CPU cores #1 and #3 read from a memory location; the relevant L1 and
         L2 caches contain the data
      2. CPU core #2 writes to the memory location
         a) Notify the L1 of core #1 that its content is obsolete
         b) Notify the L2 and L1 of the second processor that their content
            is obsolete

  15. More Fun: Multithreading
      [Diagram: the same four-core topology as on the previous slide]
      3. Core #4 writes to the memory location
         a) Wait for core #2's cache content to land in main memory
         b) Notify core #2's L1 and L2 that their content is obsolete
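      The invalidation traffic of steps 2 and 3 can be provoked directly.
      This minimal sketch (an illustration, not from the slides) has two
      threads hammer the same word, so the cacheline bounces between their
      caches on every store; compare its runtime against a version where
      each thread writes its own location.

      /* Compile with: gcc -O2 -pthread pingpong.c */
      #include <pthread.h>
      #include <stdio.h>

      #define ITERATIONS 10000000L

      static long shared;              /* one word, shared by both writers */

      static void *writer(void *arg)
      {
          volatile long *loc = arg;
          for (long i = 0; i < ITERATIONS; ++i)
              *loc = i;                /* each store invalidates the copy
                                          cached by the other core         */
          return NULL;
      }

      int main(void)
      {
          pthread_t t1, t2;
          pthread_create(&t1, NULL, writer, &shared);
          pthread_create(&t2, NULL, writer, &shared);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          printf("final value: %ld\n", shared);
          return 0;
      }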

  16. Sequential Increment, 128-Byte Elements
      [Plot, logarithmic y-axis: Cycles / List Element (1 to 1000) vs.
      Working Set Size in bytes (2^10 to 2^28); series: Nthreads=1,
      Nthreads=2, Nthreads=4]

  17. Optimizing for Caches II
      Cacheline ping-pong is deadly for performance:
      ● If possible, always write from the same CPU
      ● Use per-CPU memory; lock each thread to a specific CPU
      ● Avoid placing data that is often independently read and written in
        the same cacheline (see the sketch below)
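      A minimal sketch of the per-thread memory idea: pad each thread's
      counter to its own cacheline so that independent writes never share a
      line. The 64-byte line size and the GCC alignment attribute are
      assumptions; CPU pinning is omitted for brevity.

      /* Compile with: gcc -O2 -pthread percpu.c */
      #include <pthread.h>
      #include <stdio.h>

      #define NTHREADS  4
      #define LINE_SIZE 64                      /* assumed cacheline size */

      struct counter {
          long value;
          char pad[LINE_SIZE - sizeof(long)];   /* keep neighbors in
                                                   separate cachelines    */
      };

      static struct counter counters[NTHREADS]
              __attribute__((aligned(LINE_SIZE)));

      static void *work(void *arg)
      {
          struct counter *c = arg;
          for (long i = 0; i < 10000000L; ++i)
              c->value++;                       /* private line: no
                                                   ping-pong              */
          return NULL;
      }

      int main(void)
      {
          pthread_t t[NTHREADS];
          long total = 0;
          for (int i = 0; i < NTHREADS; ++i)
              pthread_create(&t[i], NULL, work, &counters[i]);
          for (int i = 0; i < NTHREADS; ++i) {
              pthread_join(t[i], NULL);
              total += counters[i].value;       /* aggregate at the end   */
          }
          printf("total = %ld\n", total);
          return 0;
      }

      Without the padding, the four counters would sit in one or two
      cachelines and every increment would invalidate the other cores'
      copies, exactly the ping-pong the slide warns about.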

  18. Questions?
