Highly Scalable Parallel Sorting
Edgar Solomonik
University of Illinois at Urbana-Champaign
April 29, 2010
Outline
● Parallel sorting background
● Histogram Sort overview
● Histogram Sort optimizations
● Charm++ implementation
● Results
● Limitations of work
● Contributions
● Future work
Parallel Sorting
● Input
  – There are n unsorted keys, distributed evenly over p processors
  – The distribution of keys in the range is unknown and possibly skewed
● Goal
  – Sort the data globally according to keys
  – Ensure no processor has more than (n/p) + threshold keys
Scaling Challenges
● Load balance
  – Main objective of most parallel sorting algorithms
  – Each processor needs a contiguous chunk of data
● Data exchange communication
  – Can require a complete communication graph
  – All-to-all moves n elements in p² messages
Parallel Sorting Algorithms

Type                       Data movement
● Merge-based
  – Bitonic Sort           ½·n·log²(p)
  – Cole's Merge Sort      O(n·log(p))
● Splitter-based
  – Sample Sort            n
  – Histogram Sort         n
● Other
  – Parallel Quicksort     O(n·log(p))
  – Radix Sort             O(n) ~ 4n
Splitter-Based Parallel Sorting
● A splitter is a key that partitions the global data at a desired location
● p-1 global splitters are needed to subdivide the data into p contiguous chunks
● Each processor can send out its local data based on the splitters
  – Data moves only once
● Each processor merges the data chunks as it receives them

[Figure: Splitting of Initial Data – number of keys per processor (Proc 1, Proc 2, Proc 3) along the key range, with Splitter 1 and Splitter 2 marking the chunk boundaries]
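To make the send step concrete, here is a minimal C++ sketch (not the thesis code; the function name and the plain std::vector interface are illustrative) that applies the p-1 converged splitters to a processor's locally sorted 64-bit keys and computes how many keys go to each of the p destinations before the all-to-all:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// counts[i] = number of local keys destined for processor i, where chunk i
// holds the keys in (splitters[i-1], splitters[i]].
std::vector<std::size_t> sendCounts(const std::vector<std::uint64_t>& sortedKeys,
                                    const std::vector<std::uint64_t>& splitters) {
  std::vector<std::size_t> counts(splitters.size() + 1);
  std::size_t prev = 0;
  for (std::size_t i = 0; i < splitters.size(); ++i) {
    // First position whose key exceeds splitter i ends chunk i.
    std::size_t end = std::upper_bound(sortedKeys.begin(), sortedKeys.end(),
                                       splitters[i]) - sortedKeys.begin();
    counts[i] = end - prev;
    prev = end;
  }
  counts[splitters.size()] = sortedKeys.size() - prev;  // keys above the last splitter
  return counts;
}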
[Figure: Splitter on Key Density Function – the number of keys smaller than x, plotted over the key range from key_min to key_max; splitter k is the key at which the cumulative count reaches k·(n/p)]
Sample Sort

[Flow diagram: each of the p processors sorts its local data and extracts a local sample; the samples are concatenated into a combined sample, which is sorted; splitters are extracted from the sorted combined sample and broadcast; each processor applies the splitters to its data, followed by an all-to-all exchange]
Sample Sort
● The sample is typically regularly spaced in the local sorted data, with s = p-1 samples per processor
  – Worst-case final load imbalance is 2·(n/p) keys
  – In practice, load imbalance is typically very small
● The combined sample becomes a bottleneck, since s·p ~ p²
  – With 64-bit keys, if p = 8192, the sample is 16 GB!
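For illustration, a small C++ sketch of the sampling step under the regular-spacing assumption above (names are hypothetical; gathering and sorting the combined sample would follow):

#include <cstddef>
#include <cstdint>
#include <vector>

// Picks s = p-1 evenly spaced keys from the locally sorted data.
std::vector<std::uint64_t> localSample(const std::vector<std::uint64_t>& sortedKeys,
                                       int p) {
  std::vector<std::uint64_t> sample;
  const std::size_t nLocal = sortedKeys.size();
  for (int i = 1; i < p; ++i)
    sample.push_back(sortedKeys[(static_cast<std::size_t>(i) * nLocal) / p]);
  return sample;
}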
Basic Histogram Sort
● Splitter-based
● Uses iterative guessing to find splitters
  – O(p) probe rather than O(p²) combined sample
  – Probe refinement based on global histogram
● Histogram calculated by applying splitters to data
● Kale and Krishnan, ICPP 1993
● Basis for this work
Basic Histogram Sort

[Flow diagram: a test probe of splitter guesses is broadcast to processors 1 through p; each processor computes a local histogram of its sorted data against the probe; the histograms are summed and the global histogram is analyzed; if the probe has not converged, it is refined and re-broadcast; once converged, the splitters are applied to the data, followed by an all-to-all exchange]
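The probe-refinement step can be pictured as a per-splitter bisection on the key range, driven by the summed histogram. The sketch below is illustrative C++ under assumed names (SplitterGuess, refineProbe), and it assumes globalHist[i] already holds the global count of keys less than or equal to guess i:

#include <cstddef>
#include <cstdint>
#include <vector>

struct SplitterGuess {
  std::uint64_t lo, hi;   // current key-range bounds for this splitter
  std::uint64_t guess;    // current guess (midpoint of [lo, hi])
  bool converged = false;
};

void refineProbe(std::vector<SplitterGuess>& probe,
                 const std::vector<std::uint64_t>& globalHist,
                 std::uint64_t n, int p, std::uint64_t threshold) {
  for (std::size_t i = 0; i < probe.size(); ++i) {
    if (probe[i].converged) continue;
    const std::uint64_t target = (i + 1) * (n / p);   // ideal cumulative count
    const std::uint64_t achieved = globalHist[i];
    if (achieved <= target + threshold && achieved + threshold >= target) {
      probe[i].converged = true;                      // within tolerance: keep this guess
      continue;
    }
    if (achieved < target)
      probe[i].lo = probe[i].guess + 1;               // too few keys below the guess
    else
      probe[i].hi = probe[i].guess - 1;               // too many keys below the guess
    probe[i].guess = probe[i].lo + (probe[i].hi - probe[i].lo) / 2;
  }
}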
Basic Histogram Sort
● Positives
  – Splitter-based: single all-to-all data transpose
  – Can achieve an arbitrarily small threshold
  – Probing technique is scalable compared to sample sort: O(p) vs O(p²)
  – Allows good overlap between communication and computation (to be shown)
● Negatives
  – Harder to implement
  – Running time dependent on the data distribution
Sorting and Histogramming Overlap
● Don't actually need to sort the local data first
● Splice the data instead
  – Use splitter guesses as Quicksort pivots
  – Each splice determines the location of a guess and partitions the data
● Sort chunks of data while histogramming happens
Histogramming by Splicing Data

[Figure: the unsorted local data is spliced with the current probe and the resulting chunks are sorted; when a new probe arrives, splicing is only needed in the still-unsorted regions ("splice here"), while sorted chunks can simply be searched ("search here")]
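A single splice can be written as a partition around the guess, as in this illustrative C++ sketch (not the thesis code); std::partition stands in for the Quicksort-pivot step, and the returned index doubles as the local histogram value when everything before `first` is already known to be at most the guess:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Partitions data[first, last) in place so keys <= guess come before the
// rest, and returns the split index (a global position in `data`).
std::size_t splice(std::vector<std::uint64_t>& data,
                   std::size_t first, std::size_t last, std::uint64_t guess) {
  auto lo = data.begin() + static_cast<std::ptrdiff_t>(first);
  auto hi = data.begin() + static_cast<std::ptrdiff_t>(last);
  auto mid = std::partition(lo, hi,
                            [guess](std::uint64_t k) { return k <= guess; });
  return static_cast<std::size_t>(mid - data.begin());
}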
Histogram Overlap Analysis
● Probe generation work should be offloaded to one processor
  – Reduces the critical path
● Splicing is somewhat expensive
  – O((n/p)·log(p)) for the first iteration
    ● log(p) approaches log(n/p) in weak scaling
  – Small theoretical overhead (limited pivot selection)
  – Slight implementation overhead (library sorts are faster)
  – Some optimizations/code necessary
Sorting and All-to-All Overlap
● Histogram and local sort overlap is good, but the all-to-all is the worst scaling bottleneck
● Fortunately, much all-to-all overlap is available
● The all-to-all can initially overlap with local sorting
  – Some splitters converge every histogram iteration
    ● This is also prior to completion of local sorting
  – Can begin sending to any defined ranges
Eager Data Movement

[Figure: when a message with resolved ranges is received, the corresponding chunk is extracted from the still partly unsorted local data, sorted, and immediately sent to its destination processor]
All-to-All and Merge Overlap
● The k-way merge done when the data arrives should be implemented as a tree merge
  – A k-way heap merge requires all k arrays
  – A tree merge can start with just two arrays
● Some data arrives much earlier than the rest
  – A tree merge allows overlap
Tree k-way Merging

[Figure: the first chunk to arrive is placed in Buffer 1; as further chunks arrive they are merged with the previously merged data, ping-ponging between Buffer 1 and Buffer 2, until the final merged data is produced]
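A simplified two-buffer variant of the tree merge above can be sketched in C++ as follows (illustrative names, not the thesis code); each chunk is merged with the data merged so far as soon as it arrives, rather than waiting for all k chunks as a k-way heap merge would:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

class IncrementalMerger {
 public:
  // Called whenever a sorted chunk arrives from some source processor.
  void addChunk(const std::vector<std::uint64_t>& chunk) {
    scratch_.resize(merged_.size() + chunk.size());
    std::merge(merged_.begin(), merged_.end(),
               chunk.begin(), chunk.end(), scratch_.begin());
    std::swap(merged_, scratch_);   // ping-pong between the two buffers
  }
  const std::vector<std::uint64_t>& result() const { return merged_; }

 private:
  std::vector<std::uint64_t> merged_;    // Buffer 1: everything merged so far
  std::vector<std::uint64_t> scratch_;   // Buffer 2: target of the next merge
};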
Charm++ Implementation
● Why?
  – Sort is compatible with Charm++ applications
  – Division between histogramming analysis work and data containers
    ● More natural
    ● Flexible
  – Charm++ scheduler used to automatically overlap executing stages and push probes through
● MPI implementation possible, but more difficult
Overlap Benefit (Weak Scaling)

[Plots: weak-scaling comparison of sorting time with and without the overlap optimizations. Tests done on Intrepid (BG/P) and Jaguar (XT4) with 8 million 64-bit keys per core.]
Effect of All-to-All Overlap

[Processor-utilization timelines: with no overlap, histogramming, sorting all the data, sending, and merging run as separate phases with idle time; with overlap, the data is spliced and sorted by chunks while sending and merging proceed concurrently. Tests done on 4096 cores of Intrepid (BG/P) with 8 million 64-bit keys per core.]
All-to-All Spread and Staging
● Personalized all-to-all collective communication strategies are important
  – The all-to-all eventually dominates execution time
● Some basic optimizations are easily applied
  – Varying the order of sends
    ● Minimizes network contention
  – Only a subset of processors should send data to one destination at a time
    ● Prevents network overload
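The send-order variation can be as simple as starting each processor's destination sequence at its own rank, as in this illustrative C++ sketch (sendChunk is a placeholder, not a real Charm++ or MPI call):

// Each processor walks the destinations in a different order, so at any
// moment only a few processors target the same destination.
void sendChunk(int /*dest*/) { /* hypothetical transport call */ }

void staggeredAllToAll(int myRank, int p) {
  for (int step = 0; step < p; ++step) {
    int dest = (myRank + step) % p;
    sendChunk(dest);   // enqueue/send this processor's chunk for dest
  }
}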
Communication Spread

[Timeline plot showing the spread of the data splicing, sorting, sending, and merging phases across processors. Tests done on 4096 cores of Intrepid (BG/P) with 8 million 64-bit keys per core.]
Algorithm Scaling Comparison

[Plot comparing the scaling of sorting algorithms; an "out of memory" annotation marks where one of the compared algorithms could not run. Tests done on Intrepid (BG/P) with 8 million 64-bit keys per core.]
Histogram Sort Parallel Efficiency

[Plot: parallel efficiency of Histogram Sort. Tests done on Intrepid (BG/P) and Jaguar (XT4) with 8 million 64-bit keys per core.]
Some Limitations of this Work
● Benchmarking done with 64-bit keys rather than key-value pairs
● Optimizations presented are only beneficial for certain parallel sorting problems
  – Generally, we assumed n > p²
    ● Splicing is useless unless n/p > p
    ● Different all-to-all optimizations are required if n/p is small (combine messages)
  – Communication is usually cheap until p > 512
● Complex implementation is another issue
Future/Ongoing Work
● Write a further optimized library implementation of Histogram Sort
  – Sort key-value pairs
  – Almost completed, code to be released
● To scale past 32k cores, histogramming needs to be better optimized
  – As p → n/p, probe creation cost matches the cost of local sorting and merging
  – One promising solution is to parallelize probing
    ● Can use early-determined splitters to divide probing
Contributions
● Improvements on the original Histogram Sort algorithm
  – Overlap between computation and communication
  – Interleaved algorithm stages
● Efficient and well-optimized implementation
● Scalability up to tens of thousands of cores
● Groundwork for further parallel scaling of sorting algorithms
Acknowledgements
● Everyone in PPL for various and generous help
● IPDPS reviewers for excellent feedback
● Funding and Machine Grants
  – DOE Grant DEFG05-08OR23332 through ORNL LCF
  – Blue Gene/P at Argonne National Laboratory, which is supported by DOE under contract DE-AC02-06CH11357
  – Jaguar at Oak Ridge National Laboratory, which is supported by the DOE under contract DE-AC05-00OR22725
  – Accounts on Jaguar were made available via the Performance Evaluation and Analysis Consortium End Station, a DOE INCITE project.