Cache Oblivious Sorting, Gerth Stølting Brodal, University of Aarhus - PowerPoint PPT Presentation

Cache Oblivious Sorting. Gerth Stølting Brodal, University of Aarhus. Algorithms and Data Structures, Bertinoro, Forlì, Italy, June 22-28, 2003.


  1. Cache Oblivious Sorting. Gerth Stølting Brodal, University of Aarhus. Algorithms and Data Structures, Bertinoro, Forlì, Italy, June 22-28, 2003

  3. Outline of Talk • Cache oblivious model • Sorting problem • Binary and multiway merge-sort • Funnel-sort • Lower bound (tall cache assumption) • Experimental results • Conclusions Gerth S. Brodal: Cache Oblivious Sorting

  5. Cache Oblivious Model (Frigo, Leiserson, Prokop, Ramachandran, FOCS'99)
     • Program in the RAM model
     • Analyze in the I/O model for arbitrary B and M
     [Figure: CPU, cache of size M, and memory; each I/O moves a block of size B between cache and memory.]
     Advantages:
     • Optimal on arbitrary level ⇒ optimal on all levels
     • Portability
     [Figure: memory hierarchy CPU, L1, L2, RAM, Disk, with increasing access time and space.]
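As a rough illustration of what "analyze in the I/O model" means, the sketch below counts block transfers for a sequence of memory addresses, given a cache of M elements organized in blocks of B elements. The function name `count_ios` is hypothetical, and LRU replacement is an assumption made here for simplicity (the slide does not state a replacement policy).

```python
from collections import OrderedDict


def count_ios(addresses, M, B):
    """Count block transfers (I/Os) for an address trace.

    Assumptions for this sketch: the cache holds M // B blocks of B
    consecutive elements each and evicts the least recently used block.
    """
    capacity = M // B
    cache = OrderedDict()  # block id -> True, ordered by recency of use
    ios = 0
    for addr in addresses:
        block = addr // B
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            ios += 1                       # miss: one I/O brings the block in
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return ios


# A sequential scan touches each block once; a stride-B scan misses every time.
scan_ios = count_ios(range(1024), M=64, B=8)
stride_ios = count_ios([i * 8 for i in range(1024)], M=64, B=8)
```

The algorithm being measured never sees M or B; only the analysis does, which is the point of the cache oblivious model.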

  6. Sorting Problem
     • Input: array containing $x_1, \dots, x_N$
     • Output: array with $x_1, \dots, x_N$ in sorted order
     • Elements can be compared and copied
     Example: 3 4 8 2 8 4 0 4 4 6 ⇓ 0 2 3 4 4 4 4 6 8 8

  8. Binary Merge-Sort
     [Figure: merge tree, read from output down to input, with a merging step between consecutive levels. Output: 0 2 3 4 4 4 4 6 8 8. Next level: 2 3 4 8 8 | 0 4 4 4 6. Next level: 3 4 2 8 8 0 4 4 4 6. Next level: 2 8 0 4. Input: 3 4 8 2 8 4 0 4 4 6.]
     • Recursive; two arrays; size O(M) internally in cache
     • $O(N \log N)$ comparisons
     • $O\left(\frac{N}{B}\log_2\frac{N}{M}\right)$ I/Os
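A minimal sketch of the recursive, two-array binary merge-sort described on this slide. The names `merge_sort` and `_sort` are illustrative, not taken from the talk; the example input is the one shown in the figure.

```python
def merge_sort(a):
    """Return a sorted copy of `a` using recursive binary merge-sort."""
    src = list(a)
    buf = [None] * len(src)  # the second array, used as the merge target
    _sort(src, buf, 0, len(src))
    return src


def _sort(src, buf, lo, hi):
    """Sort src[lo:hi] in place, using buf[lo:hi] as scratch space."""
    if hi - lo <= 1:
        return
    mid = (lo + hi) // 2
    _sort(src, buf, lo, mid)
    _sort(src, buf, mid, hi)
    i, j, k = lo, mid, lo
    # Merge the two sorted halves into buf; ties go to the left half (stable).
    while i < mid and j < hi:
        if src[j] < src[i]:
            buf[k] = src[j]
            j += 1
        else:
            buf[k] = src[i]
            i += 1
        k += 1
    while i < mid:
        buf[k] = src[i]
        i += 1
        k += 1
    while j < hi:
        buf[k] = src[j]
        j += 1
        k += 1
    src[lo:hi] = buf[lo:hi]  # copy the merged run back


result = merge_sort([3, 4, 8, 2, 8, 4, 0, 4, 4, 6])
```

Each merge is a pair of sequential scans, which is why the I/O cost per level is O(N/B) once subproblems no longer fit in cache.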

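  9. Merge-Sort
     | Degree | I/Os |
     |---|---|
     | 2 | $O\left(\frac{N}{B}\log_2\frac{N}{M}\right)$ |
     | $d$ (with $d \le \frac{M}{B} - 1$) | $O\left(\frac{N}{B}\log_d\frac{N}{M}\right)$ |
     | $\Theta\left(\frac{M}{B}\right)$ | $O\left(\frac{N}{B}\log_{M/B}\frac{N}{M}\right) = O(\mathrm{Sort}_{M,B}(N))$ (Aggarwal and Vitter 1988) |
     | Funnel-Sort | $O\left(\frac{1}{\varepsilon}\,\mathrm{Sort}_{M,B}(N)\right)$ for $M \ge B^{1+\varepsilon}$ (Frigo, Leiserson, Prokop and Ramachandran 1999; Brodal and Fagerberg 2002) |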
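To get a feel for how much the merge degree matters, the following sketch evaluates the expression (N/B) · log_d(N/M) with all hidden constants dropped. The function name `merge_sort_ios` and the sample values of N, M and B are assumptions chosen for illustration only.

```python
import math


def merge_sort_ios(N, M, B, d):
    """Evaluate (N/B) * log_d(N/M), the d-way merge-sort bound without constants."""
    return (N / B) * (math.log2(N / M) / math.log2(d))


# Hypothetical machine: 2^30 elements, cache of 2^20 elements, blocks of 2^10.
N, M, B = 2**30, 2**20, 2**10
binary_ios = merge_sort_ios(N, M, B, d=2)         # binary merge-sort
multiway_ios = merge_sort_ios(N, M, B, d=M // B)  # degree Theta(M/B)
```

With these numbers the binary bound is ten times the multiway one, since log_2(N/M) = 10 while log_{M/B}(N/M) = 1.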

  10. Outline of Talk • Cache oblivious model • Sorting problem • Binary and multiway merge-sort • Funnel-sort • Lower bound (tall cache assumption) • Experimental results • Conclusions

  11. Funnel-Sort

  14. k-merger (Frigo et al., FOCS'99)
      • Merges k sorted input streams into one sorted output stream.
      • Recursive definition: a top $k^{1/2}$-merger $M_0$ sits above $\sqrt{k}$ bottom $k^{1/2}$-mergers $M_1, \dots, M_{\sqrt{k}}$; the bottom mergers feed $M_0$ through buffers $B_1, \dots, B_{\sqrt{k}}$, each of size $k^{3/2}$.
      • Recursive layout in memory: $M_0 \; B_1 \; M_1 \; B_2 \; M_2 \; \cdots \; B_{\sqrt{k}} \; M_{\sqrt{k}}$

  17. Lazy k-merger (Brodal and Fagerberg 2002)
      [Figure: merge tree with top merger $M_0$, buffers $B_1, \dots, B_{\sqrt{k}}$, and bottom mergers $M_1, \dots, M_{\sqrt{k}}$.]
      Procedure Fill(v)
        while out-buffer not full
          if left in-buffer empty
            Fill(left child)
          if right in-buffer empty
            Fill(right child)
          perform one merge step
      Lemma: If $M \ge B^2$ and the output buffer has size $k^3$, then $O\left(\frac{k^3}{B}\log_M(k^3) + k\right)$ I/Os are done during an invocation of Fill(root).
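The Fill procedure can be sketched in runnable form as below. This is a simplified illustration, not the implementation from the talk: every internal buffer gets the same capacity (an assumption; the real merger uses buffers sized by recursion level), nodes are ordinary Python objects rather than being laid out recursively in memory, and an explicit `exhausted` flag is added so that Fill terminates when the input runs out. The names `Leaf`, `Node`, `build` and `merge_all` are hypothetical.

```python
from collections import deque


class Leaf:
    """One sorted input stream; its buffer holds the whole stream."""

    def __init__(self, items):
        self.buf = deque(items)
        self.exhausted = True  # nothing further can ever be fetched

    def fill(self):
        pass


class Node:
    """Binary merger with a bounded output buffer."""

    def __init__(self, left, right, capacity):
        self.left, self.right = left, right
        self.capacity = capacity
        self.buf = deque()
        self.exhausted = False

    def fill(self):
        """Fill(v): merge until the out-buffer is full or the input runs out."""
        left, right = self.left, self.right
        while len(self.buf) < self.capacity:
            if not left.buf and not left.exhausted:
                left.fill()   # left in-buffer empty: refill it lazily
            if not right.buf and not right.exhausted:
                right.fill()  # right in-buffer empty: refill it lazily
            if not left.buf and not right.buf:
                self.exhausted = True
                return
            # One merge step: move the smaller head element to the out-buffer.
            if not right.buf or (left.buf and left.buf[0] <= right.buf[0]):
                self.buf.append(left.buf.popleft())
            else:
                self.buf.append(right.buf.popleft())


def build(streams, capacity=4):
    """Build a balanced binary merge tree over a non-empty list of sorted streams."""
    if len(streams) == 1:
        return Leaf(streams[0])
    mid = len(streams) // 2
    return Node(build(streams[:mid], capacity),
                build(streams[mid:], capacity), capacity)


def merge_all(root):
    """Repeatedly invoke Fill(root) and drain the root's out-buffer."""
    out = []
    while True:
        root.fill()
        out.extend(root.buf)
        root.buf.clear()
        if root.exhausted:
            return out


merged = merge_all(build([[0, 4], [2, 8], [3, 4], [4, 4, 6, 8]], capacity=2))
```

A child is only asked for more elements when its buffer has run empty, which is what makes the merger "lazy".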

  19. Funnel-Sort (Brodal and Fagerberg 2002; Frigo, Leiserson, Prokop and Ramachandran 1999)
      • Divide the input into $N^{1/3}$ segments of size $N^{2/3}$
      • Recursively sort each segment
      • Merge the sorted segments with an $N^{1/3}$-merger
      Merger sizes k over the recursion levels: $N^{1/3}, N^{2/9}, N^{4/27}, \dots, 2$
      Theorem: Funnel-Sort performs $O(\mathrm{Sort}_{M,B}(N))$ I/Os for $M \ge B^2$.
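The recursion structure of Funnel-Sort can be sketched as follows. Note the substitution: `heapq.merge` from the Python standard library stands in for the N^{1/3}-merger, so this shows only the divide, recurse and merge pattern and has none of the cache behaviour of a real k-merger. The name `funnel_sort_sketch` and the cutoff `BASE_CASE` are assumptions.

```python
import heapq

BASE_CASE = 8  # hypothetical cutoff below which the built-in sort is used


def funnel_sort_sketch(a):
    """Sort `a` following the Funnel-Sort recursion pattern."""
    n = len(a)
    if n <= BASE_CASE:
        return sorted(a)
    seg = max(1, round(n ** (2 / 3)))  # segment size about N^(2/3)
    # About N^(1/3) segments, each sorted recursively.
    runs = [funnel_sort_sketch(a[i:i + seg]) for i in range(0, n, seg)]
    # Stand-in for the N^(1/3)-merger: a heap-based multiway merge.
    return list(heapq.merge(*runs))


data = [(i * 7919) % 1000 for i in range(1000)]
sorted_data = funnel_sort_sketch(data)
```

For N = 1000 the top level splits into 10 segments of 100 elements, and each of those into segments of about 22 elements, mirroring the shrinking merger sizes listed on the slide.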

  20. Outline of Talk • Cache oblivious model • Sorting problem • Binary and multiway merge-sort • Funnel-sort • Lower bound (tall cache assumption) • Experimental results • Conclusions

  21. Lower Bound (Brodal and Fagerberg 2003)
      | | Block size | Memory | I/Os |
      |---|---|---|---|
      | Machine 1 | $B_1$ | $M$ | $t_1$ |
      | Machine 2 | $B_2$ | $M$ | $t_2$ |
      One algorithm, two machines, $B_1 \le B_2$.
      Trade-off: $8 t_1 B_1 + 3 t_1 B_1 \log\frac{8 M t_2}{t_1 B_1} \ge N \log\frac{N}{M} - 1.45 N$

  22. Lower Bound
      | Algorithm | Assumption | I/Os |
      |---|---|---|
      | Lazy Funnel-sort | $B \le M^{1-\varepsilon}$ | (a) $B_2 = M^{1-\varepsilon}$: $\mathrm{Sort}_{B_2,M}(N)$; (b) $B_1 = 1$: $\mathrm{Sort}_{B_1,M}(N) \cdot \frac{1}{\varepsilon}$ |
      | Binary Merge-sort | $B \le M/2$ | (a) $B_2 = M/2$: $\mathrm{Sort}_{B_2,M}(N)$; (b) $B_1 = 1$: $\mathrm{Sort}_{B_1,M}(N) \cdot \log M$ |
      Corollary: (a) ⇒ (b)

  24. Fake Proof
      Goal: $8 t_1 B_1 + 3 t_1 B_1 \log\frac{8 M t_2}{t_1 B_1} \ge N \log\frac{N}{M} - 1.45 N$
      • Merging sorted lists X and Y takes $\approx |X| \log\frac{|Y|}{|X|}$ comparisons.
      • In total $t_1 B_1$ elements are touched ⇒ $t_1 B_1 / t_2$ elements are touched on average per $B_2$-I/O ⇒ the effective $B_2$ is $t_1 B_1 / t_2$.
      • Comparisons gained per $B_2$-I/O (a block of effective size $B_2$ merged against memory of size $M$): $\frac{t_1 B_1}{t_2} \cdot \log\frac{M}{t_1 B_1 / t_2}$
      • Hence: $t_1 B_1 \cdot \log\frac{M t_2}{t_1 B_1} \ge N \log N - 1.45 N$
      One problem: online choice.
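The step from the per-I/O gain to the "Hence" line can be written out explicitly: the gain per $B_2$-I/O is multiplied by the number $t_2$ of such I/Os, and the total must cover the comparison lower bound for sorting. The sketch below assumes base-2 logarithms (which the constant $1.45 \approx \log_2 e$ suggests) and uses the standard Stirling-type bound $\ln N! \ge N \ln N - N$.

```latex
\begin{align*}
t_2 \cdot \frac{t_1 B_1}{t_2}\,\log\frac{M}{t_1 B_1 / t_2}
  &= t_1 B_1 \,\log\frac{M t_2}{t_1 B_1}, \\
\log_2 N! \;\ge\; N \log_2 N - N \log_2 e
  &\ge N \log_2 N - 1.45\,N .
\end{align*}
```

Requiring the total number of comparisons gained to be at least $\log_2 N!$ then gives the heuristic inequality on the slide.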

  25. Ideas from Real Proof
      [Figure: a decision tree $T$ for the algorithm. Its nodes are I/Os on the two machines (I/O$_1$[s, t], ..., I/O$_2$[s, t], ...), comparisons $A[i] \le A[j]$, and assignments $A[i] \leftarrow A[j]$; the leaves hold the answers.]
      $8 t_1 B_1 + 3 t_1 B_1 \log\frac{8 M t_2}{t_1 B_1} \ge \text{height} \ge N \log\frac{N}{M} - 1.45 N$

  26. Outline of Talk • Cache oblivious model • Sorting problem • Binary and multiway merge-sort • Funnel-sort • Lower bound (tall cache assumption) • Experimental results • Conclusions

  27. Hardware
      | Processor type | Pentium 4 | Pentium 3 | MIPS 10000 |
      |---|---|---|---|
      | Workstation | Dell PC | Delta PC | SGI Octane |
      | Operating system | GNU/Linux, kernel version 2.4.18 | GNU/Linux, kernel version 2.4.18 | IRIX version 6.5 |
      | Clock rate | 2400 MHz | 800 MHz | 175 MHz |
      | Address space | 32 bit | 32 bit | 64 bit |
      | Integer pipeline stages | 20 | 12 | 6 |
      | L1 data cache size | 8 KB | 16 KB | 32 KB |
      | L1 line size | 128 bytes | 32 bytes | 32 bytes |
      | L1 associativity | 4 way | 4 way | 2 way |
      | L2 cache size | 512 KB | 256 KB | 1024 KB |
      | L2 line size | 128 bytes | 32 bytes | 32 bytes |
      | L2 associativity | 8 way | 4 way | 2 way |
      | TLB entries | 128 | 64 | 64 |
      | TLB associativity | Full | 4 way | 64 way |
      | TLB miss handler | Hardware | Hardware | Software |
      | Main memory | 512 MB | 256 MB | 128 MB |

  28. Wall Clock [Plot, Pentium 4, 512/512: wall clock time per element (log scale, 0.1µs to 100.0µs) against number of elements (1,000,000 to 1,000,000,000) for ffunnelsort, funnelsort, lowscosa, stdsort, ami_sort, msort-c and msort-m.] Kristoffer Vinther 2003

  29. Page Faults [Plot, Pentium 4, 512/512: page faults per block of elements (0.0 to 30.0) against number of elements (1,000,000 to 1,000,000,000) for ffunnelsort, funnelsort, lowscosa, stdsort, msort-c and msort-m.] Kristoffer Vinther 2003

  30. Cache Misses [Plot, MIPS 10000, 1024/128: L2 cache misses per line of elements (0.0 to 30.0) against number of elements (100,000 to 1,000,000,000) for ffunnelsort, funnelsort, lowscosa, stdsort, msort-c and msort-m.] Kristoffer Vinther 2003

  31. TLB Misses [Plot, MIPS 10000, 1024/128: TLB misses per block of elements (log scale, 1.0 to 10.0) against number of elements (100,000 to 1,000,000,000) for ffunnelsort, funnelsort, lowscosa, stdsort, msort-c and msort-m.] Kristoffer Vinther 2003

  32. Outline of Talk • Cache oblivious model • Sorting problem • Binary and multiway merge-sort • Funnel-sort • Lower bound (tall cache assumption) • Experimental results • Conclusions
