Software Engineering Seminar Stephan Semmler A Dynamically Tuned Sorting Library Xiaoming Li, María Jesús Garzarán, and David Padua 2004
The Sorting Library [Diagram labels: Installation, Runtime, Hardware, Input data, Sorting Algorithms, Empirical Search, Machine Learning, Optimized Algorithms, Fastest Algorithm]
Overview • Sorting Algorithms • Optimizing the Algorithms • Factors of Performance • The Library • Results • Final Words
Merge Sort • Example: 4 1 3 2 → 1 4 | 2 3 → 1 2 3 4 • Divide and conquer • Runtime O(n·log(n)) • Needs additional memory to merge
Multiway Merge Sort • Partition data into p subsets • Sort each subset • Merge the p sorted subsets using a heap [Diagram: a heap with entries 1, 2, …, p above p sorted subsets]
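The three steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the function name `multiway_merge_sort` is made up for this example, and Python's built-in `sorted` stands in for whatever algorithm sorts each subset.

```python
import heapq

def multiway_merge_sort(keys, p):
    """Sort keys by splitting them into p subsets and merging them with a heap."""
    if p < 1:
        raise ValueError("p must be at least 1")
    # Partition the input into at most p contiguous subsets of nearly equal size.
    size = max(1, -(-len(keys) // p))  # ceiling division
    # Sort each subset (built-in sort used here as a stand-in).
    subsets = [sorted(keys[i:i + size]) for i in range(0, len(keys), size)]
    # Seed the heap with the smallest element of every sorted subset.
    heap = [(s[0], idx, 0) for idx, s in enumerate(subsets)]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, idx, pos = heapq.heappop(heap)
        merged.append(value)
        # Refill the heap from the subset that just supplied an element.
        if pos + 1 < len(subsets[idx]):
            heapq.heappush(heap, (subsets[idx][pos + 1], idx, pos + 1))
    return merged
```

The heap never holds more than p entries, so each output key costs O(log p) heap work.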
Quicksort [Figure: successive states of the keys 0 to 9 during sorting] • Average runtime O(n·log(n)) • In-place • Worst-case runtime O(n²)
Radix Sort (example with buckets 0 to 4) • Input: 322 13 44 142 431 34 1 23 • After pass 1 (ones digit): 431 1 322 142 13 23 44 34 • After pass 2 (tens digit): 1 13 322 23 431 34 142 44 • After pass 3 (hundreds digit): 1 13 23 34 44 142 322 431
Radix Sort • Non-comparative sorting algorithm • Needs integer keys • Linear time complexity O(n) for keys with a fixed number of digits • Performance highly dependent on key distribution
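A least-significant-digit radix sort of the kind walked through in the preceding slides can be sketched as follows. The function name `radix_sort` and the default `base=10` are choices made for this illustration; the sketch only handles non-negative integers.

```python
def radix_sort(keys, base=10):
    """LSD radix sort for non-negative integers; returns a new sorted list."""
    if any(k < 0 for k in keys):
        raise ValueError("this sketch only handles non-negative integers")
    result = list(keys)
    largest = max(result, default=0)
    place = 1  # 1 = ones digit, base = next digit, and so on
    while place <= largest:
        # Distribute the keys into one bucket per digit value.
        buckets = [[] for _ in range(base)]
        for key in result:
            buckets[(key // place) % base].append(key)
        # Collect the buckets in order; this keeps each pass stable.
        result = [key for bucket in buckets for key in bucket]
        place *= base
    return result

print(radix_sort([322, 13, 44, 142, 431, 34, 1, 23]))
# prints [1, 13, 23, 34, 44, 142, 322, 431]
```

No two keys are ever compared with each other. Each pass is linear in the number of keys, and the number of passes equals the number of digits in the largest key.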
Insertion Sort [Figure: step-by-step example on a small array] • Average case O(n²) • Best case O(n) for sorted data • Good for small partitions
Sorting Networks [Figure: 8-input network with wires 0 to 7, unsorted input and sorted output] • Fixed sequence of compare-exchange operations, as if hardwired • Only appropriate for very small amounts of data
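To make the "hardwired" idea concrete, here is a sketch of a 4-input sorting network. It is smaller than the 8-input network in the figure and was chosen only for brevity. The names `NETWORK_4` and `sort4` are invented for this example.

```python
# Comparator list for a 4-input sorting network (5 compare-exchange steps).
# The sequence is fixed in advance and does not depend on the data.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def sort4(values):
    """Sort exactly four values with a fixed comparator sequence."""
    if len(values) != 4:
        raise ValueError("this network is wired for exactly 4 inputs")
    v = list(values)
    for i, j in NETWORK_4:
        # Compare-exchange: after this step v[i] <= v[j].
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v
```

Because the comparisons are known in advance, a compiled implementation can unroll them completely. The number of comparators grows quickly with the input size, which is why the technique suits only very small inputs.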
Optimizing Algorithms • For a given architecture – Cache size – Registers – Cache line size • Which parameters to tune? [Diagram labels: Hardware, Sorting Algorithms, Empirical Search, Optimized Algorithms, Parameters]
Tuning Quicksort • Small partitions – Insertion sort – Sorting networks • Threshold for small partitions • Apply immediately to each small partition or once at the end
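The tuning knobs on this slide can be sketched as a quicksort that hands any partition at or below a threshold to insertion sort. This illustrates the "apply immediately" variant only. The default `threshold=16` and the median-of-three pivot are assumptions made for this example, not values from the slides; in the library the threshold would come from the empirical search.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive) in place."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def quicksort(a, threshold=16, lo=0, hi=None):
    """In-place quicksort that hands small partitions to insertion sort."""
    if hi is None:
        hi = len(a) - 1
    while hi - lo + 1 > threshold:
        # Median-of-three pivot choice (an assumption, not from the slides).
        mid = (lo + hi) // 2
        pivot = sorted((a[lo], a[mid], a[hi]))[1]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Recurse into the left part, keep looping on the right part.
        quicksort(a, threshold, lo, j)
        lo = i
    # The remaining partition is small: finish it immediately.
    insertion_sort(a, lo, hi)
```

The "at the end" variant would instead leave small partitions unsorted and run a single insertion sort over the whole array once quicksort has finished.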
Tuning Radix Sort (CC-radix sort) • Create sub-buckets if the data is too large for the cache • Apply radix sort to the sub-buckets • Insertion sort / sorting networks for small partitions
Tuning Multiway Merge Sort • Number of subsets p • Operation on heap: find smallest child – Adapt fanout such that the children fit into a cache line [Diagram: p subsets, with the children of a heap node mapped to one cache line]
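The fanout idea can be sketched as follows. The 64-byte cache line and 4-byte keys are illustrative assumptions, and the names `heap_fanout` and `smallest_child` are invented for this example. The sketch also ignores memory alignment, which a real implementation would need to handle so that each group of children actually starts on a cache line boundary.

```python
def heap_fanout(cache_line_bytes=64, key_bytes=4):
    """Number of children per heap node so that siblings fill one cache line."""
    return max(2, cache_line_bytes // key_bytes)

def smallest_child(heap, parent, fanout):
    """Index of the smallest child of `parent` in an array-based d-ary heap.

    Returns None when `parent` is a leaf.
    """
    first = parent * fanout + 1
    children = range(first, min(first + fanout, len(heap)))
    return min(children, key=heap.__getitem__, default=None)
```

With a larger fanout the heap is shallower, and scanning all children of a node touches contiguous memory. That is the motivation for matching the fanout to the cache line size.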
[Diagram labels: Hardware, Sorting Algorithms, Empirical Search, Optimized Algorithms, Parameters]
Comparison of Sorting Algorithms [Plot: execution time (G = 2^30 cycles) against number of keys (M = 2^20, from 0 to 20) on Intel PIII Xeon; series: Quicksort, Radix sort, Merge sort]
Varied Standard Deviations [Plots: execution time (cycles) against standard deviation (1.E+02 to 1.E+08) on Intel PIII Xeon, one panel for 2M keys and one for 12M keys; series: Quicksort, Radix sort, Merge sort]
Impact of Input Data • Number of keys does not affect the relative performance • Standard deviation matters! – Distribution among buckets in radix sort – Fewer operations on the heap in multiway merge sort • Problems with standard deviation – Only related to the distribution of digits in the keys – Expensive to compute – Use entropy instead
Entropy • Expected value of the information: $E = -\sum_j Q_j \cdot \log_2 Q_j$ • Example keys: 124, 135, 136 • Entropy vector (one entry per digit position): 0, 0.9, 1.58
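The entropy vector can be reproduced with a short sketch. It reads the slide's example as the three keys 124, 135 and 136 (an interpretation of the figure) and computes one entropy value per decimal digit position. The function name `entropy_vector` and its parameters are invented for this example.

```python
from collections import Counter
from math import log2

def entropy_vector(keys, digits, base=10):
    """Return one entropy value per digit position, most significant first."""
    total = len(keys)
    vector = []
    for position in range(digits - 1, -1, -1):
        # Count how often each digit value occurs at this position.
        counts = Counter((key // base ** position) % base for key in keys)
        # sum of Q_j * log2(1 / Q_j), which equals -sum of Q_j * log2(Q_j)
        entropy = sum((c / total) * log2(total / c) for c in counts.values())
        vector.append(entropy)
    return vector

print([round(e, 2) for e in entropy_vector([124, 135, 136], digits=3)])
# prints [0.0, 0.92, 1.58]
```

The first digit is identical in all keys, so it carries no information (entropy 0). The last digit takes three different values, giving log2(3) ≈ 1.58.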
Building the Library [Diagram labels: Input sizes, Input data sets, Entropy, Sorting Algorithms (Quicksort, Radix sort, Merge sort), Empirical Search, Machine Learning, Parameters, Optimized Algorithms, Prediction function]
Runtime Procedure [Diagram labels: Input data, Input size, Entropy, Prediction function, Select Best Algorithm, Sorting Algorithm, Sorted data]
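The runtime flow (measure the input, ask the prediction function, run the chosen algorithm) might look roughly like the sketch below. Everything in it is hypothetical: `toy_prediction` uses made-up thresholds in place of the function learned at installation, and Python's built-in `sorted` stands in for the three tuned algorithms.

```python
from collections import Counter
from math import log2

def digit_entropy(keys, position, base=10):
    """Entropy of the digit at `position` (0 = least significant)."""
    n = len(keys)
    counts = Counter((k // base ** position) % base for k in keys)
    return sum((c / n) * log2(n / c) for c in counts.values())

def toy_prediction(size, entropies):
    """Placeholder rule with made-up thresholds, NOT the learned function."""
    if size < 32:
        return "quicksort"
    return "radix" if min(entropies) > 1.0 else "merge"

def library_sort(keys, predict=toy_prediction, digits=3):
    """Select an algorithm from input size and entropy, then sort."""
    entropies = [digit_entropy(keys, p) for p in range(digits)]
    choice = predict(len(keys), entropies)
    # Stand-ins: a real library would dispatch to its tuned implementations.
    algorithms = {"quicksort": sorted, "radix": sorted, "merge": sorted}
    return choice, algorithms[choice](keys)
```

Measuring input size and entropy costs time on every call, which accounts for the selection overhead.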
Results [Plot: execution time (cycles per key) against standard deviation (1.E+02 to 1.E+07) on AMD Athlon; series: Quicksort, Radix sort, Merge sort, Result] • Library chooses best algorithm • Overhead of 5% • On average 44% better than worst algorithm
Final Words • Optimizes sorting algorithms • Works well for unknown input • Overhead for known data • Unclear degree of optimization • Hard-coded decisions • Further work: see next presentation