High-Throughput Sorting by Dynamically Merging Multiple Hardware Sequential Sorters Wei Song 03/04/2014 Advanced Processor Technologies Group The School of Computer Science
Motivation
• Hardware sorters are important.
• Parallel sorters have a size limit.
  – Sorting N numbers needs a network of size $N \log^2(N)$.
• Sequential sorters have a throughput limit.
  – Sorting throughput is limited to 1 number per cycle.
• Is there a way to sort N (N > 1M) numbers with a throughput larger than 1 number per cycle?
Content
• Review of existing sorters
  – Parallel sorters
  – Sequential sorters
• Parallel merge-tree sorter
  – Key ideas
  – Hardware structure
  – Performance
Parallel Sorters (Bitonic Sorting Network)
[Figure: an 8-input bitonic sorting network (inputs I0 to I7, outputs O0 to O7, stages S0 to S5) built from bitonic networks BN(2), BN(4), BN(8) and bitonic mergers BM(4), BM(8). Each comparator takes A and B and outputs max{A, B} and min{A, B}. The example values 12, 89, 53, 9, 30, 79, 62, 17 leave the network as 9, 12, 17, 30, 53, 62, 79, 89.]
Parallel Sorters (Bitonic Sorting Network)
[Figure: the same 8-input network (stages S0 to S5), showing how a bitonic network (BN) is built from smaller BNs and bitonic mergers (BM): BM(8) consists of two BM(4) blocks, each of which consists of two BM(2) blocks.]
• Data set size: $P$
• Throughput: $P$ numbers per cycle
• Size (comparators): $P \log^2(P)$
• Delay: $\log^2(P)$
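The structure of the bitonic sorting network can be sketched in software. The following is a minimal Python model, not the hardware design from the slides; the function names `bitonic_network` and `run_network` are illustrative assumptions. It uses the standard bitonic construction, for which the exact comparator count is (P/4) log2(P) (log2(P) + 1) and the depth is log2(P) (log2(P) + 1) / 2, both in line with the order of growth given on the slide.

```python
def bitonic_network(p):
    """Build the comparator stages of a standard bitonic sorting network.

    p must be a power of two. Each stage is a list of (i, j, ascending)
    comparators that could all operate in parallel in hardware.
    """
    stages = []
    k = 2
    while k <= p:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # one merger stage per halving of the distance
            stage = []
            for i in range(p):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    stage.append((i, partner, ascending))
            stages.append(stage)
            j //= 2
        k *= 2
    return stages


def run_network(stages, values):
    """Apply the comparator stages to a list of values and return the result."""
    data = list(values)
    for stage in stages:
        for i, j, ascending in stage:
            # Swap when the pair is out of order for this comparator's direction.
            if (data[i] > data[j]) == ascending:
                data[i], data[j] = data[j], data[i]
    return data


stages = bitonic_network(8)
result = run_network(stages, [12, 89, 53, 9, 30, 79, 62, 17])
comparators = sum(len(stage) for stage in stages)
```

For P = 8 this model has six stages, matching S0 to S5 in the figure, with four comparators per stage.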
Sequential Sorters (Insertion Sorter)
[Figure: a chain of N comparator cells (Cell 0, Cell 1, ..., Cell N-1) between data_in and data_out. Example: inserting 3, 12, 7, 1, 9, 20 one number per cycle leaves the cells holding 1 3 7 9 12 20.]
• Data set size: $N$
• Throughput: 1 number per cycle
• Size (cells): $N$
• Delay: $N$
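The behaviour of the insertion sorter can be illustrated with a short software model. This is a sketch under the assumption that each incoming number is placed at its sorted position among the cells within one cycle; the function name `insertion_sorter` and the recorded `history` are illustrative and not part of the slides.

```python
def insertion_sorter(stream):
    """Model an insertion sorter that accepts one number per cycle.

    Returns the final cell contents and the cell contents after each cycle.
    In hardware every cell would compare in parallel; here a linear scan
    finds the same insertion position.
    """
    cells = []
    history = []
    for value in stream:
        pos = 0
        while pos < len(cells) and cells[pos] <= value:
            pos += 1
        cells.insert(pos, value)   # larger cells shift by one position
        history.append(list(cells))
    return cells, history


cells, history = insertion_sorter([3, 12, 7, 1, 9, 20])
```

Because only one number enters per cycle, N numbers need N cycles and N cells, which is the throughput and size limit listed on the slide.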
Sequential Sorter (FIFO-merge)
[Figure: a chain of merge stages S2, S1, S0 with FIFOs of size N/8, N/4 and N/2; each stage merges its two inputs I0 and I1 into one output O. An example shows two sorted FIFOs being merged step by step.]
• Data set size: $N$
• Throughput: 1 number per cycle
• Size (memory): $2N$
• Delay: $N$
D. Koch and J. Torresen, "FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting," in Proc. of FPGA, February 2011, pp. 45–54.
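A FIFO-merge sorter can be approximated in software as a chain of merge stages, each merging pairs of sorted runs into runs twice as long. The sketch below assumes that reading of the figure; it models only the data flow, not the FIFO timing, and the names `merge_stage` and `fifo_merge_sort` are illustrative.

```python
from heapq import merge


def merge_stage(values, run_length):
    """Merge adjacent sorted runs of run_length into runs of 2 * run_length."""
    out = []
    for start in range(0, len(values), 2 * run_length):
        left = values[start:start + run_length]
        right = values[start + run_length:start + 2 * run_length]
        out.extend(merge(left, right))   # emits one number at a time
    return out


def fifo_merge_sort(values):
    """Pass the data through successive merge stages until one run remains."""
    data = list(values)
    run_length = 1
    while run_length < len(data):
        data = merge_stage(data, run_length)
        run_length *= 2
    return data


sorted_values = fifo_merge_sort([16, 4, 20, 6, 10, 12, 18, 2])
```

Each stage emits a single number per step, which corresponds to the throughput of 1 number per cycle listed on the slide.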
Summary of Existing Sorters
• Parallel sorters
  – High throughput
  – Area increases significantly with the quantity of data
  – Suited to sorting a small quantity of numbers
• Sequential sorters
  – Linear area overhead
  – Feasible for large data sets
  – Low throughput
Can we dynamically merge multiple sequential sorters?
Parallel Merging
Merge multiple sequential sorters using a bitonic network?
YES, when the numbers are distributed evenly. The network sorts one number from each sorter per cycle:
Sorter outputs              After the bitonic network
3 5 15 22 28 34             1 5 10 17 24 29
1 9 10 20 24 30             3 6 13 20 26 30
4 7 15 17 26 29             4 7 15 22 28 34
5 6 13 24 28 37             5 9 15 24 28 37
Parallel Merging
Merge multiple sequential sorters using a bitonic network?
NO, not in general. Numbers may not be distributed evenly among the sequences:
Sorter outputs              After the bitonic network
5 9 10 22 30 34             1 3 7 13 24 28
1 3 15 20 24 28             4 6 10 20 26 28
5 6 7 13 26 28              5 9 15 22 29 34
4 15 17 24 29 37            5 15 17 24 30 37
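The difference between the two cases can be checked with a small model. This sketch assumes that the bitonic network sorts one number from each sequential sorter per cycle (one column of the figure at a time); the function name `columnwise_merge` is illustrative.

```python
def columnwise_merge(sequences):
    """Sort one number from each sequence per cycle and concatenate the results."""
    merged = []
    for column in zip(*sequences):
        merged.extend(sorted(column))   # stands in for the bitonic network
    return merged


even = [
    [3, 5, 15, 22, 28, 34],
    [1, 9, 10, 20, 24, 30],
    [4, 7, 15, 17, 26, 29],
    [5, 6, 13, 24, 28, 37],
]
uneven = [
    [5, 9, 10, 22, 30, 34],
    [1, 3, 15, 20, 24, 28],
    [5, 6, 7, 13, 26, 28],
    [4, 15, 17, 24, 29, 37],
]

even_ok = columnwise_merge(even) == sorted(sum(even, []))
uneven_ok = columnwise_merge(uneven) == sorted(sum(uneven, []))
```

In the second data set the first cycle outputs 1, 4, 5, 5 while the smaller number 3 is still waiting in the second sequence, so the merged stream is out of order.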
Parallel Merging
• Increase the comparing window.
[Figure: two sequential sorters (5 9 10 22 30 34 and 1 3 15 20 24 28) feeding a comparing window that takes more than one number from each sorter.]
Parallel Merging
• Increase the comparing window.
• Return unselected numbers.
[Figure: animation of two sequential sorters (5 9 10 22 30 34 and 1 3 15 20 24 28) feeding a 4-input comparing window. In each step two numbers are selected for output and the two unselected numbers are returned to the sorters.]
YES!
Parallel Merging
• Increase the comparing window.
• Return unselected numbers.
Requirement: to merge S pre-sorted sequences at a speed of S numbers per cycle,
1. Increase the comparing window to S x S;
2. Use an S x S-input bitonic sorting network; [Area overhead]
3. Return the S x (S - 1) unselected numbers; [Control overhead]
4. Return the unselected numbers in one cycle; [Slow clock]
5. Support a maximal shifting rate of S numbers per cycle per sequence. [Speed mismatch]
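The requirements above can be illustrated with a behavioural model. This is a minimal sketch, not the hardware design: it assumes ascending output, uses Python's built-in sort in place of the S x S-input bitonic sorting network, and models "returning" unselected numbers by simply not advancing the read position of their sequence. The name `window_merge` is illustrative.

```python
def window_merge(sequences):
    """Merge S pre-sorted sequences, emitting up to S numbers per cycle."""
    s = len(sequences)
    heads = [0] * s                      # read position in each sequence
    total = sum(len(seq) for seq in sequences)
    merged = []
    cycles = 0
    while len(merged) < total:
        # Comparing window: up to S leading numbers from each of the S sequences.
        window = []
        for src, seq in enumerate(sequences):
            for pos in range(heads[src], min(heads[src] + s, len(seq))):
                window.append((seq[pos], src, pos))
        window.sort()                    # stands in for the S x S sorting network
        # Select the S smallest; the others stay in (are returned to) their sequences.
        for value, src, _pos in window[:s]:
            merged.append(value)
            heads[src] += 1
        cycles += 1
    return merged, cycles


uneven = [
    [5, 9, 10, 22, 30, 34],
    [1, 3, 15, 20, 24, 28],
    [5, 6, 7, 13, 26, 28],
    [4, 15, 17, 24, 29, 37],
]
merged, cycles = window_merge(uneven)
```

The window of S numbers per sequence is sufficient because the S smallest remaining numbers can include at most S numbers from any one sequence, and those are always its leading numbers. In the worst case a single sequence supplies all S selected numbers in one cycle, which is the shifting-rate requirement in item 5.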
Parallel Merging
[Figure: four sequential sorters (5 9 10 22 30 34; 1 3 15 20 24 28; 5 6 7 13 26 28; 4 15 17 24 29 37) feeding a single 16-input comparing window.]
Optimising the Parallel Merging
• Using a tree structure reduces the number of comparators by > 50%.
[Figure: the single 16-input network over four sequential sorters (top) compared with a two-level tree of smaller networks over the same sorters (bottom).]
Optimising the Parallel Merging
• Replace the bitonic sorting networks with bitonic mergers because the sequences are pre-sorted.
  1. Reduce the comparing window.
  2. Reduce the size of the sorting networks.
  3. Reduce the quantity of numbers being returned.
[Figure: the merge tree over four sequential sorters before and after replacing the sorting networks with bitonic mergers.]
Bitonic Partial Merger
[Figure: an 8-input bitonic merger with outputs O0 to O7 (left) next to an 8-input bitonic partial merger that produces only the outputs O0 to O3 (right).]
• A. Farmahini-Farahani, H. J. Duwe III, M. J. Schulte, and K. Compton, "Modular design of high-throughput, low-latency sorting units," IEEE Transactions on Computers, vol. 62, no. 7, pp. 1389–1402, July 2013.
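The idea of producing only half of the merged outputs can be sketched as follows. This is an illustration based on the standard bitonic merger, not the exact circuit from the cited paper: it assumes two ascending input sequences of equal power-of-two length, and the names `bitonic_merge` and `partial_merge` are illustrative. The first comparator stage pairs a[i] with b[n-1-i]; the minima are the n smallest numbers and form a bitonic sequence, so only a half-size merger is needed to sort them.

```python
def bitonic_merge(seq):
    """Sort a bitonic sequence (power-of-two length) into ascending order."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    low = [min(seq[i], seq[i + half]) for i in range(half)]
    high = [max(seq[i], seq[i + half]) for i in range(half)]
    return bitonic_merge(low) + bitonic_merge(high)


def partial_merge(a, b):
    """Return the n smallest numbers of two sorted n-number inputs, in order,
    together with the n unselected numbers (left unsorted)."""
    n = len(a)
    selected = [min(a[i], b[n - 1 - i]) for i in range(n)]
    unselected = [max(a[i], b[n - 1 - i]) for i in range(n)]
    return bitonic_merge(selected), unselected


selected, unselected = partial_merge([1, 3, 15, 20], [5, 9, 10, 22])
```

Only the selected half passes through the remaining merger stages, which is where the saving in comparators comes from in this sketch.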
Optimising the Parallel Merging
• Single clock data return.
[Figure: the merge tree over four sequential sorters with a control block attached to each merger.]
Optimising the Parallel Merging
• Single clock data return.
[Figure: the merge tree with control blocks and labelled data rates in numbers per cycle (N/cyc): 1 N/cyc at the sequential sorters, 2 N/cyc between the tree levels, 4 N/cyc at the output.]
The last issue: speed mismatch between inputs and outputs.
Speed Mismatch: Use FIFOs and Allow Stalls
[Figure: the merge tree over four sequential sorters with FIFOs between the sorters and the mergers, each merger with its control block.]
How Stalls Occur
• An even distribution causes 0 stalls.
[Figure: two sequential sorters feeding one merger with its control block.]
Original sequences:
16 4 20 6 10 12 18 2 14 8
11 3 9 5 13 1 19 7 17 15
Pre-sorted:
2 4 6 8 10 12 14 16 18 20
1 3 5 7 9 11 13 15 17 19
0 stalls, R = 0%, α = 0
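A toy model can show why an even distribution avoids stalls. The model below is an assumption made for illustration and is not the control logic from the slides: each sequential sorter delivers at most one number per cycle into a small FIFO (pre-filled before merging starts), the merger tries to emit two numbers per cycle, and a cycle counts as a stall when fewer than two numbers can be emitted while data remains. The name `merge_with_stalls` and the parameter `fifo_depth` are hypothetical, and the model does not compute R or α.

```python
from collections import deque


def merge_with_stalls(seq_a, seq_b, rate=2, fifo_depth=2):
    """Merge two sequences through FIFOs and count stalled cycles."""
    sources = [deque(sorted(seq_a)), deque(sorted(seq_b))]
    fifos = [deque(), deque()]
    for src, fifo in zip(sources, fifos):      # pre-fill the FIFOs
        while src and len(fifo) < fifo_depth:
            fifo.append(src.popleft())
    total = len(seq_a) + len(seq_b)
    merged, stalls = [], 0
    while len(merged) < total:
        # Each sorter supplies at most one number per cycle.
        for src, fifo in zip(sources, fifos):
            if src and len(fifo) < fifo_depth:
                fifo.append(src.popleft())
        emitted = 0
        while emitted < rate:
            # The merger must see a head from every input that still has data.
            if any(not fifo and src for src, fifo in zip(sources, fifos)):
                break
            ready = [fifo for fifo in fifos if fifo]
            if not ready:
                break
            merged.append(min(ready, key=lambda f: f[0]).popleft())
            emitted += 1
        if emitted < rate and len(merged) < total:
            stalls += 1
    return merged, stalls


even_result, even_stalls = merge_with_stalls(
    [16, 4, 20, 6, 10, 12, 18, 2, 14, 8],
    [11, 3, 9, 5, 13, 1, 19, 7, 17, 15],
)
uneven_result, uneven_stalls = merge_with_stalls(range(1, 11), range(11, 21))
```

With the evenly distributed sequences from the slide, the merger takes one number from each FIFO per cycle and never waits. When all the small numbers sit in one sequence, that sorter cannot supply two numbers per cycle and the merger stalls.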