Cache-Oblivious Algorithms and Data Structures
Gerth Stølting Brodal
University of Aarhus
Outline
• Motivation
  – A typical workstation
  – A trivial program
• Memory models
  – I/O model
  – Ideal cache model
• Basic cache-oblivious algorithms
  – Matrix multiplication
  – Search trees
  – Sorting
• Some experimental results
• Conclusion
A Typical Workstation
Customizing a Dell 650
Processor speed       2.4 – 3.2 GHz
L1 cache size         16 KB
L1 cache line size    64 Bytes
L2 cache size         256 – 512 KB
L2 cache line size    128 Bytes
L3 cache size         0.5 – 2 MB
Memory                1/4 – 4 GB
Hard disk             36 GB – 146 GB, 7,200 – 15,000 RPM
CD/DVD                8 – 48x
Sources: www.dell.dk, www.intel.com
Do we want to know?
Hierarchical Memory Basics
[Figure: CPU, L1, L2, L3, RAM, Disk in a chain, with blocks of sizes B1, B2, B3, B4 transferred between adjacent levels; access time and space increase away from the CPU]
• Data is moved between adjacent memory levels in blocks
A Trivial Program
for (i=0; i+d<n; i+=d) A[i]=i+d;
A[i]=0;
for (i=0, j=0; j<8*1024*1024; j++) i=A[i];
[Figure: array A of n cells, linked into a chain with stride d]
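The slide's fragment can be wrapped into two small functions to make its behaviour explicit. This is a minimal sketch, not the author's benchmark harness: the function names `build_chain` and `chase` are hypothetical, timing code is omitted, and the element type is assumed to be a 4-byte `int` (which matches the deck's annotations such as 2^25 elements ≡ 128 MB).

```c
#include <assert.h>
#include <stddef.h>

/* Link every d-th cell of A[0..n-1] into a single cycle that starts at
   index 0, exactly as the first loop on the slide does.
   Returns the number of cells on the cycle. (Hypothetical helper name.) */
size_t build_chain(int *A, size_t n, size_t d)
{
    assert(A != NULL && n > 0 && d > 0);
    size_t i, len = 1;
    for (i = 0; i + d < n; i += d) {
        A[i] = (int)(i + d);   /* cell i points to cell i + d */
        len++;
    }
    A[i] = 0;                  /* last cell points back to the start */
    return len;
}

/* Follow the chain for `steps` hops starting at index 0 and return the
   final index. Every load depends on the previous one, so the running
   time is dominated by memory latency. (Hypothetical helper name.) */
int chase(const int *A, size_t steps)
{
    int i = 0;
    for (size_t j = 0; j < steps; j++)
        i = A[i];
    return i;
}
```

To reproduce the experiment one would allocate `A` with `malloc`, call `build_chain(A, n, d)`, and time `chase(A, 8 * 1024 * 1024)` for varying `n` and `d`; returning the final index keeps the compiler from optimizing the loop away.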
A Trivial Program (cont.)
[Plot: running time in seconds (0 to 200) against log n (0 to 25), for d = 1]
RAM: $n \approx 2^{25}$ ≡ 128 MB
A Trivial Program (cont.)
[Plot: running time in seconds (0 to 3) against log n (2 to 20), for d = 1]
L1: $n \approx 2^{12}$ ≡ 16 KB
L2: $n \approx 2^{16}$ ≡ 256 KB
A Trivial Program (cont.)
[Plot: running time in seconds (0 to 2) against log d (0 to 25), for $n = 2^{24}$]
Cache line: $d = 2^{3}$ ≡ 32 Bytes
Do we want to know?
A Trivial Program (cont.): If you want to know...
Experiments were performed on a DELL 8000, Pentium III, 850 MHz, 128 MB RAM, running Linux 2.4.2, and using gcc version 2.96 with optimization -O3
L1 instruction and data caches
• 4-way set associative, 32-byte line size
• 16 KB instruction cache and 16 KB write-back data cache
L2 level cache
• 8-way set associative, 32-byte line size
• 256 KB
Source: www.intel.com
Algorithmic Problem
• Memory hierarchy has become a fact of life
• Accessing non-local storage may take a very long time
• Good locality is important for achieving high performance
Level       Latency    Relative to CPU
Register    0.5 ns     1
L1 cache    0.5 ns     1-2
L2 cache    3 ns       2-7
DRAM        150 ns     80-200
TLB         500+ ns    200-2000
Disk        10 ms      $10^{7}$
(Latency increases down the table.)
Algorithmic Problem
• Modern hardware is not uniform: many different parameters
  – Number of memory levels
  – Cache sizes
  – Cache line/disk block sizes
  – Cache associativity
  – Cache replacement strategy
  – CPU/BUS/memory speed
• Programs should ideally run for many different parameters
  – by knowing many of the parameters at runtime
  – by knowing few essential parameters
  – ignoring the memory hierarchies (practice)
• Programs are executed on unpredictable configurations
  – Generic portable and scalable software libraries
  – Code downloaded from the Internet, e.g. Java applets
  – Dynamic environments, e.g. multiple processes
Hierarchical Memory Models: many parameters
[Figure: CPU, L1, L2, L3, RAM, Disk in a chain; increasing access time and space]
• Limited success since the model is too complicated
I/O Model: two parameters
Aggarwal and Vitter 1988
[Figure: CPU attached to a cache of size M, which exchanges blocks of size B with memory; each block transfer is one I/O]
• Measure number of block transfers between two memory levels
• Bottleneck in many computations
• Very successful (simplicity)
Limitations
• Parameters B and M must be known
• Does not handle multiple memory levels
• Does not handle dynamic M
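To make the cost measure concrete, the following sketch counts block transfers for the strided access pattern of the trivial program (touching elements 0, d, 2d, ... below n). It is an illustration under stated assumptions rather than part of the original talk: the function name `strided_ios` is hypothetical, the array is assumed to start on a block boundary, B is measured in elements, and the cache is assumed to hold at least one block.

```c
#include <assert.h>
#include <stddef.h>

/* Count the block transfers needed to touch elements 0, d, 2d, ... (< n)
   once each, in increasing order, when memory is divided into blocks of
   B consecutive elements. Because the indices only increase, a block is
   never revisited after the scan has left it, so every change of block
   index costs exactly one transfer. (Hypothetical helper name.) */
size_t strided_ios(size_t n, size_t d, size_t B)
{
    assert(d > 0 && B > 0);
    size_t ios = 0;
    size_t current = 0;     /* index of the block currently in cache */
    int loaded = 0;         /* 0 until the first block has been fetched */
    for (size_t i = 0; i < n; i += d) {
        size_t block = i / B;
        if (!loaded || block != current) {
            ios++;          /* one I/O: fetch the block containing A[i] */
            current = block;
            loaded = 1;
        }
    }
    return ios;
}
```

For d ≤ B the count is ⌈n/B⌉ regardless of d, while for d ≥ B every access costs its own transfer, giving ⌈n/d⌉. This is consistent with the cache-line annotation on the earlier plot of running time against log d.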
Ideal Cache Model: no parameters!?
Frigo, Leiserson, Prokop, Ramachandran 1999
[Figure: CPU attached to a cache of size M, which exchanges blocks of size B with memory; each block transfer is one I/O]
• Program with only one memory
• Analyze in the I/O model for arbitrary B and M
• Optimal off-line cache replacement strategy
Advantages
• Optimal on arbitrary level ⇒ optimal on all levels
• Portability, B and M not hard-wired into algorithm
• Dynamically changing parameters
Justification of the Ideal-Cache Model
Frigo, Leiserson, Prokop, Ramachandran 1999
Optimal replacement
• LRU + 2 × cache size ⇒ at most 2 × cache misses (Sleator and Tarjan, 1985)
• Corollary: $T_{M,B}(N) = O(T_{2M,B}(N))$ ⇒ #cache misses using LRU is $O(T_{M,B}(N))$
Two memory levels
• Optimal cache-oblivious algorithm satisfying $T_{M,B}(N) = O(T_{2M,B}(N))$ ⇒ optimal #cache misses on each level of a multilevel LRU cache
Fully associative cache
• Simulation of LRU
  – Direct mapped cache
  – Explicit memory management
  – Dictionary (2-universal hash functions) of cache lines in memory
  – Expected $O(1)$ access time to a cache line in memory
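The LRU-versus-optimal comparison can be tried out on small traces with a simulator. The sketch below is an illustration only, not code from the talk: the name `count_misses`, the `MAX_LINES` bound, and the quadratic-time implementation are all assumptions made for brevity. It counts misses of a fully associative cache under either LRU or the optimal off-line rule (evict the block whose next use lies furthest in the future).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINES 64   /* hypothetical upper bound on the cache size in lines */

/* Count the misses of a fully associative cache with `lines` cache lines
   on the block trace t[0..n-1].
   use_opt == 0: LRU replacement.
   use_opt != 0: optimal off-line replacement (furthest next use). */
size_t count_misses(const int *t, size_t n, size_t lines, int use_opt)
{
    int    block[MAX_LINES];  /* block held by each cache line */
    size_t key[MAX_LINES];    /* LRU: time of last use; OPT: time of next use */
    size_t used = 0, misses = 0;
    assert(lines > 0 && lines <= MAX_LINES);

    for (size_t i = 0; i < n; i++) {
        /* Position of the next reference to t[i], or n if there is none. */
        size_t next = n;
        if (use_opt)
            for (size_t j = i + 1; j < n; j++)
                if (t[j] == t[i]) { next = j; break; }

        /* Look the block up in the cache. */
        size_t slot = used;
        for (size_t s = 0; s < used; s++)
            if (block[s] == t[i]) { slot = s; break; }

        if (slot == used) {               /* miss */
            misses++;
            if (used < lines) {
                used++;                   /* fill an empty line */
            } else {
                /* Evict: LRU picks the oldest last use,
                   OPT picks the furthest next use. */
                slot = 0;
                for (size_t s = 1; s < lines; s++)
                    if (use_opt ? key[s] > key[slot] : key[s] < key[slot])
                        slot = s;
            }
            block[slot] = t[i];
        }
        key[slot] = use_opt ? next : i;
    }
    return misses;
}
```

On the cyclic trace 0, 1, 2, 0, 1, 2, 0, 1, 2, LRU with 2 lines misses on every access, the optimal strategy with 2 lines misses 6 times, and LRU with 4 lines misses only 3 times. A single trace like this only illustrates the factor-2 statement; it does not prove it.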
Warm-up: Scanning
sum = 0
for i = 1 to N do sum = sum + A[i]
$O(N/B)$ I/Os
[Figure: array A of N elements divided into blocks of size B]
Corollary: Cache-oblivious selection requires $O(N/B)$ I/Os (Hoare 1961 / Blum et al. 1973)
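The corollary rests on the fact that selection can be built from scans. As a rough sketch in the spirit of Hoare's FIND, the function below selects the k-th smallest element using only two-sided partitioning passes, each of which reads its subarray sequentially from both ends. The name `quickselect` and the middle-element pivot are assumptions made for illustration. With this pivot rule the linear bound holds only when the pivots split well, and the worst-case guarantee comes from the median-of-medians pivot of Blum et al., which is not implemented here.

```c
#include <assert.h>
#include <stddef.h>

/* Return the k-th smallest element (k = 0 is the minimum) of a[0..n-1].
   The array is reordered in place. Each partitioning pass is two scans
   moving towards each other, so a pass over len elements touches
   O(len/B + 1) blocks for any block size B. (Hypothetical helper name.) */
int quickselect(int *a, size_t n, size_t k)
{
    assert(a != NULL && k < n);
    long lo = 0, hi = (long)n - 1, target = (long)k;

    while (lo < hi) {
        int pivot = a[lo + (hi - lo) / 2];
        long i = lo, j = hi;
        while (i <= j) {
            while (a[i] < pivot) i++;      /* scan from the left */
            while (a[j] > pivot) j--;      /* scan from the right */
            if (i <= j) {
                int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++;
                j--;
            }
        }
        /* Now a[lo..j] <= pivot <= a[i..hi] and j < i. */
        if (target <= j)
            hi = j;                        /* answer lies in the left part */
        else if (target >= i)
            lo = i;                        /* answer lies in the right part */
        else
            return a[target];              /* j < target < i: equals pivot */
    }
    return a[target];
}
```

Because the code never refers to B or M, the same routine runs unchanged on every level of the hierarchy, which is the sense in which it is cache-oblivious.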