fast scalable parallel comparison sort fast scalable
play

Fast Scalable Parallel Comparison Sort Fast, Scalable Parallel - PowerPoint PPT Presentation

Fast Scalable Parallel Comparison Sort Fast, Scalable Parallel Comparison Sort On Hybrid Multicore Architectures Dip Sankar Dip Sankar Banerjee Dip Sankar Dip Sankar Banerjee Banerjee Parikshit Sakurikar Banerjee, Parikshit Sakurikar and


  1. Fast Scalable Parallel Comparison Sort Fast, Scalable Parallel Comparison Sort On Hybrid Multicore Architectures Dip Sankar Dip Sankar Banerjee Dip Sankar Dip Sankar Banerjee Banerjee Parikshit Sakurikar Banerjee, Parikshit Sakurikar and Kishore Kothapalli Center For Security Theory and Algorithmic Research Center For Security ,Theory, and Algorithmic Research International Institute of Information Technology Hyderabad, INDIA y , AsHES 2013 CSTAR, IIIT Hyderabad 20 May, 2013

  2. GPGPU • General purpose computation on GPUs. General purpose computation on GPUs (GPGPU) is very common and widely practiced. • Provides the lowest cost to FLOPS ratio. P id th l t t t FLOPS ti • A many-core device which consists of: • Symmetric multi-processors. • Low power cores in each SM • Low power cores in each SM. • SIMD programmability. • Shared and global memory. Sh d d l b l • High use in general purpose computations and g g remarkable results in widely used primitives. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  3. Accelerators in Computing p g • Typical usage model Typical usage model • Transfer input from CPU to the accelerator the accelerator • Transfer program to the accelerator accelerator • Execute on the accelerator • Transfer results back to the Transfer results back to the CPU. • Above model necessitated partly because the Abo e model necessitated partl beca se the accelerators do not have I/O capability. • Truly auxiliary devices. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  4. Accelerators in Computing p g • The big issue: Utilizing multiple multicore The big issue: Utilizing multiple multicore devices for computation. • CPU Utilization for solving C f generic problems: • CPUs have high compute power cores. compute power cores. • Computational power of CPUs is also on the rise CPUs is also on the rise. • Hybrid Multi-core Computing • Use all resources available(in a single platform). Use all reso rces a ailable(in a single platform) • Provides a higher level of parallelism and efficiency. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  5. Outline • General Hybrid Computing Platform General Hybrid Computing Platform • Problem Statement • Our Solution • Implementation Details p • Results • Conclusion Conclusion AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  6. Hybrid Multicore Platforms y • Target of our research is to validate the • Target of our research is to validate the implementation of algorithms on both high-end systems as well as commodity low-end system. systems as well as commodity low end system. • A high end system will have a high throughput GPU connected to a multi-core CPU. connected to a multi core CPU. • An Intel i7 980 coupled with an NVidia GTX 580 GPU GPU • A low-end system is typically found on commodity systems such as laptops and desktops systems such as laptops and desktops. • An Intel Core 2 Duo E7400 CPU coupled with an NVidia GT520 GPU NVidia GT520 GPU. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  7. Our Results • In this work we implemented comparison sort p p on a hybrid multicore platform. • We used hybrid sample sorting on the platform We used hybrid sample sorting on the platform using different data sets. • Our sorting implementation is 20% better than • Our sorting implementation is 20% better than the current best known parallel comparison sorting due to Davidson et al at InPAR 2012 sorting, due to Davidson et. al. at InPAR 2012. • Our results are on an average 40% better than the GPU Sample Sorting algorithm published at th GPU S l S ti l ith bli h d t IPDPS 2010. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  8. Problem Definition • Sorting is a fundamental algorithm which finds massive application in scientific computations, f databases, searching, ranking etc. • The problem is to arrange a certain set of input according in a particular order. according in a particular order. 6 10 2 7 11 9 5 2 3 4 6 7 1 5 5 5 6 6 7 7 9 9 10 10 11 11 2 2 2 3 4 6 7 1 5 •Sorting is an irregular operation and is not Sorting is an irreg lar operation and is not entirely suited for GPU or parallel architectures. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  9. Parallel Sorting • Effective use of all available processors by • Effective use of all available processors by creating independent sub-problems. • Quick sort is a popular sorting technique where Q i k t i l ti t h i h sub-problems are created and solved in a recursive fashion. • Sample sort is a generalization of the quick- p g q sorting algorithm that chooses many pivots and hence creates higher number of sub-problems hence creates higher number of sub problems. • Each of these sub-problems can be efficiently allocated to either a CPU or a GPU for sorting allocated to either a CPU or a GPU for sorting. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  10. Algorithm Overview g • Phase I • Create sqrt(n) bins where n in no. of input elements. • Efficiently bin elements using a BST. Effi i tl bi l t i BST • Phase II • Compute histograms of the bins allocated to each Compute histograms of the bins allocated to each CPU and GPU. • Phase III • Phase III • Scatter elements across all SMs on GPU and cores in CPU in an synchronous manner cores in CPU in an synchronous manner. • Phase IV • Recurse Phases I-III and until bin sizes are Recurse Phases I III and until bin sizes are reduced to a certain threshold. • Sort the small bins. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  11. Algorithm Overview g PHASE 1 : BINNING PHASE 2: HISTOGRAM H H Y PHASE 3 B SCATT R I D PHASE 4 : RECURSION 11December 2011 11December 2011 HiPC 2011 HiPC 2011 AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  12. PHASE I : BINNING PHASE 1 • Select splitters at uniform intervals of sqrt(n) • Form a Binary Search Tree using the splitters F Bi S h T i th litt • Now set a threshold for separation of the labels between the CPU and the GPU. • Transfer GPU labels to the device a s e G U abe s to t e de ce • Use BST on both CPU and GPU to bin the elements elements. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  13. PHASE II : Histograms g PHASE 2 • Compute histograms in an overlapped fashion on both the CPU and the GPU. • Store histogram H c of CPU for LEN/BLOCK size of elements. of elements. • Store H g of GPU on LEN/BLOCK size of elements elements. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  14. PHASE III : Histograms g PHASE 3 • Perform scan on the GPU and CPU histograms to compute the block-wise offsets. t th bl k i ff t • Scatter elements in an hybrid fashion to all bins • GPU: Perform local scattering in each BLOCK BLOCK • CPU: Perform global scattering across the single BLOCK single BLOCK. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  15. PHASE IV : Recurse and Sort • Recurse from phases I to III until the size for each block comes down to a size where we can do a normal quick sort on each thread. • Separate the bins among the CPU and GPU and apply the sorting on each of the bins until a and apply the sorting on each of the bins until a final sorted sequence is obtained. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  16. Memory Access Optimization y p AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  17. Memory Access Optimization y p • Available memory is a vital resource. y • Reuse of data-structures is vital for synchronization and consolidation. synchronization and consolidation. • We reuse our histogram store in the scattering step where we do not write all the entries for step where we do not write all the entries for all the labels together. • Instead writing all the entries in one space, we I t d iti ll th t i i write it in the order in which it will be read back. • This facilitates a higher coalescing of reads as This facilitates a higher coalescing of reads as well as the re-use of a data-structure. AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  18. Results on Key ‐ Value Pairs y 180 Hybrid Key-value Hybrid Key value Merge Sort Key-value Sample Sort Key-value Satish Key-value 160 /sec on Pairs/ 140 140 Millio 120 100 80 15 16 17 18 19 20 21 22 23 24 No No. of Elements (power of 2) of Elements (power of 2) AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  19. Results on 32 bit integers 3 g 300 Hybrid Key-only Hybrid Key only Merge Sort Key-only Sample Sort Key-only Satish Key-only 250 c MKeys/sec 200 M 150 100 15 16 17 18 19 20 21 22 23 24 No. of Elements (power of 2) AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  20. Results of Key ‐ Value Pairs on Low ‐ End Platform y 55 H b id K Hybrid Key-value l Merge Sort Key-value Sample Sort Key-value Satish Key-value 50 50 /sec on Pairs/ 45 Millio 40 35 30 5 10 15 20 25 30 No. of Elements (Million) AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

  21. Variation of Threshold 30 Threhshold High-End Threhshold Low-End 25 (%) esholed 20 Thre 15 5 10 15 16 17 18 19 20 No. of Elements (power of 2) AsHES, 2013 CSTAR, IIIT Hyderabad 20 May 2013

More recommend