FLiMS: Fast Lightweight Merge Sorter
2018 International Conference on Field-Programmable Technology (FPT)
Philippos Papaphilippou, Dept. of Computing, Imperial College London, London, United Kingdom, pp616@imperial.ac.uk
Chris Brooks, Science Innovation, dunnhumby, London, United Kingdom, Chris.Brooks@dunnhumby.com
Wayne Luk, Dept. of Computing, Imperial College London, London, United Kingdom, w.luk@imperial.ac.uk
12/12/2018 Philippos Papaphilippou
Novel merger design
● Task
– Merge 2 sorted sequences in parallel
● Contributions
– Highly-efficient parallel merger
  ● Half hardware resources of the state-of-the-art
  ● Half the latency
– Open source
– Evaluation
  ● FPGA
  ● CPU with SIMD registers
[Figure: two sorted sequences being merged into one]
Introduction: Bitonic sorter
● Bitonic sort [K. E. Batcher, 1968]
– A parallel sorting algorithm
– N/2 comparisons per step
– $O(\log_2^2 N)$ steps: $\log_2(N) \cdot (\log_2(N) + 1) / 2$
– Pipelineable → FPGAs
● Compare and swap (CAS)
– if (a<b) swap(a, b)
– Sorter of 2 elements
Introduction: Bitonic sorter
● Bitonic sort is based on mergesort
– Hierarchical merge module
[Figure: a sorter (P=8) composed of sorter (4) blocks and a merger (8)]
Introduction: Bitonic sorter (P=64)
[Figure: bitonic sorting network for P=64]
Bitonic sort example
Input:  8 3 5 2 1 7 9 0
Step 1: 8 3 5 2 7 1 9 0
Step 2: 8 5 3 2 7 9 1 0
Step 3: 8 5 3 2 9 7 1 0
Step 4: 8 5 7 9 2 3 1 0
Step 5: 8 9 7 5 2 3 1 0
Step 6: 9 8 7 5 3 2 1 0
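The steps of the example can be replayed in software. The following is a minimal Python sketch, assuming the network variant in which every comparator places the larger value at the lower index (which is what the intermediate values on the slides suggest). The function names `cas` and `bitonic_sort_steps` are illustrative and not taken from the FLiMS sources.

```python
def cas(v, i, j):
    """Compare-and-swap: keep the larger value at the lower index."""
    if v[i] < v[j]:
        v[i], v[j] = v[j], v[i]


def bitonic_sort_steps(values):
    """Yield the list after each parallel step of the sorting network.

    Assumes len(values) is a power of two; sorts in descending order.
    """
    v = list(values)
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    block = 2
    while block <= n:
        # First step of each merger: mirrored comparators inside each block.
        for start in range(0, n, block):
            for k in range(block // 2):
                cas(v, start + k, start + block - 1 - k)
        yield list(v)
        # Remaining steps: comparators at halving distances.
        dist = block // 4
        while dist >= 1:
            for start in range(0, n, 2 * dist):
                for k in range(dist):
                    cas(v, start + k, start + k + dist)
            yield list(v)
            dist //= 2
        block *= 2


for step, state in enumerate(bitonic_sort_steps([8, 3, 5, 2, 1, 7, 9, 0]), 1):
    print(step, state)
```

For N=8 the generator yields 6 states, in line with log2(N)·(log2(N)+1)/2 = 6 steps. All comparators within one step touch disjoint positions, which is what makes each step parallel and the whole network pipelineable.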
Merging
● Bitonic sort:
– The merger module can be used to merge 2 lists of P elements → 1 list of 2P elements
● Problem: How about sorting bigger lists with limited hardware/logic?
– Merge 2 lists of arbitrary length
– The data can be streamed and queued
● Design an efficient parallel merger for arbitrarily long lists
● Simple (unrealistic) FPGA example:
– CPU mergesort, but
– Parallel merging in FPGA
Basic merger for longer lists
● Based on bitonic merger
[Figure: Queue A and Queue B each feed 4 elements per cycle into a bitonic merger; the upper 4 results go to the output]
● Lower 4 need to be fed back
● Not pipelined! (works fine for CPUs [Chhugani et al., 2008])
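The feedback loop of this basic merger can be sketched in Python as follows. This is an illustration only: `merge_2p` uses Python's `sorted` as a stand-in for the 2P-input bitonic merger, the function names are hypothetical, and both inputs are assumed to be sorted in descending order with lengths that are non-zero multiples of P.

```python
def merge_2p(x, y):
    """Stand-in for the 2P-input bitonic merger (descending output)."""
    return sorted(x + y, reverse=True)


def feedback_merge(a, b, p):
    """Merge two descending lists P elements at a time, with feedback."""
    assert a and b and len(a) % p == 0 and len(b) % p == 0
    ia, ib = p, p                      # next unread position in each queue
    merged = merge_2p(a[:p], b[:p])
    out = merged[:p]                   # upper P go to the output
    fed_back = merged[p:]              # lower P are fed back
    while ia < len(a) or ib < len(b):
        # Read the next batch from the queue whose head is larger.
        take_a = ib >= len(b) or (ia < len(a) and a[ia] >= b[ib])
        if take_a:
            batch, ia = a[ia:ia + p], ia + p
        else:
            batch, ib = b[ib:ib + p], ib + p
        merged = merge_2p(fed_back, batch)
        out += merged[:p]
        fed_back = merged[p:]
    return out + fed_back              # flush the last fed-back elements


print(feedback_merge([9, 7, 3, 1], [8, 6, 5, 4], 2))
```

Each iteration needs `fed_back` from the previous one before it can start, so the whole merger sits inside the feedback path. That dependency is what prevents pipelining in hardware.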
Structures overview
[Figure: overview of the merger structures]
Optimisation for FPGAs
● Based on 2P-to-P bitonic partial merger, such as in [Song et al., 2016]
[Figure: Queue A and Queue B feed a 2P-to-P partial merger; only the upper 4 results are produced at the output]
● The 'unsorted' lower 4 are not needed
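A 2P-to-P bitonic partial merger keeps only the "max" outputs of the first comparator stage and sorts those P values with the rest of the network. The Python sketch below illustrates the idea under stated assumptions: inputs are sorted in descending order, P is a power of two, and the function names and example values are illustrative.

```python
def sort_bitonic_desc(v):
    """Sort a bitonic sequence (power-of-two length) in descending order
    using comparator stages at halving distances."""
    v = list(v)
    dist = len(v) // 2
    while dist >= 1:
        for start in range(0, len(v), 2 * dist):
            for k in range(dist):
                i, j = start + k, start + k + dist
                if v[i] < v[j]:
                    v[i], v[j] = v[j], v[i]
        dist //= 2
    return v


def partial_merge_2p_to_p(x, y):
    """Return the P largest of two descending P-element lists, sorted."""
    p = len(x)
    assert len(y) == p and p & (p - 1) == 0
    # First stage: compare x[i] with the mirrored element of y, keep the max.
    # The discarded minima are the lower P, which are not needed here.
    top = [max(x[i], y[p - 1 - i]) for i in range(p)]
    return sort_bitonic_desc(top)


print(partial_merge_2p_to_p([13, 11, 9, 7], [12, 10, 8, 5]))
```

Dropping the lower half removes roughly half of the comparators of a full 2P merger, since only a P-input network follows the first stage.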
Related work examples
● Merging 2 already-sorted sequences
– Related work [Song et al., 2016] [Saitoh et al., 2018]
– Problem: Trade-off between feedback (low frequency) and resources
Contributions
● New parallel merger design
– Merge algorithm
– Feedback-less
– Half the hardware resources of the state-of-the-art MMS [Saitoh et al., 2018]
  ● Lookup-table and flip-flop utilisation
  ● Also half the latency (pipeline length)
● Proof & Evaluation
● Open source implementation
– Verilog generator for AXI peripherals
  ● FLiMS & MMS merger
– SIMD version in C
  ● AVX2 & AVX-512
Merge sorter – solution
● Just one 2P-to-P bitonic partial merger
– Modified 1st pipeline stage
● No need for barrel shifters

    int i;                      // i is the entity tag
    int cA_i, cB_i, in_i;       // 32-bit registers
    while forever do
        receive(positive clock edge);
        if cA_i > cB_i then
            in_i ← cA_i;
            cA_i ← dequeue(a_i);
        else
            in_i ← cB_i;
            cB_i ← dequeue(b_{P−i});
        end
    end

Algorithm A: Distributed algorithm pseudocode
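Algorithm A can be simulated in software to see how the P selector entities and the rest of the partial merger fit together. The Python sketch below is an illustration under several assumptions that are not stated on the slide: each input list is striped round-robin over P banks, indices are 0-based (so entity i reads bank `p - 1 - i` of B), exhausted banks return a `-inf` sentinel, and the names `flims_merge` and `sort_bitonic_desc` are hypothetical.

```python
from collections import deque

NEG_INF = float("-inf")  # assumed sentinel for an exhausted bank


def sort_bitonic_desc(v):
    """Remaining stages of the partial merger (descending output)."""
    v = list(v)
    dist = len(v) // 2
    while dist >= 1:
        for start in range(0, len(v), 2 * dist):
            for k in range(dist):
                i, j = start + k, start + k + dist
                if v[i] < v[j]:
                    v[i], v[j] = v[j], v[i]
        dist //= 2
    return v


def flims_merge(a, b, p):
    """Merge two descending lists, P elements per simulated clock cycle."""
    assert p > 0 and p & (p - 1) == 0, "P must be a power of two"
    banks_a = [deque(a[i::p]) for i in range(p)]   # round-robin striping
    banks_b = [deque(b[i::p]) for i in range(p)]

    def dequeue(bank):
        return bank.popleft() if bank else NEG_INF

    # Registers cA_i and cB_i of each entity; B is read in mirrored order.
    c_a = [dequeue(banks_a[i]) for i in range(p)]
    c_b = [dequeue(banks_b[p - 1 - i]) for i in range(p)]

    out, total = [], len(a) + len(b)
    while len(out) < total:
        selected = []
        for i in range(p):                 # all entities act in parallel
            if c_a[i] > c_b[i]:
                selected.append(c_a[i])
                c_a[i] = dequeue(banks_a[i])
            else:
                selected.append(c_b[i])
                c_b[i] = dequeue(banks_b[p - 1 - i])
        out.extend(sort_bitonic_desc(selected))
    return out[:total]                     # drop sentinel padding


a = [29, 26, 26, 17, 16, 11, 5, 4, 4, 3, 3]
b = [22, 21, 19, 18, 15, 12, 9, 8, 7, 0]
print(flims_merge(a, b, 4))
```

With these inputs and P=4, the first two output groups are 29 26 26 22 and 21 19 18 17. Each entity only updates its own two registers per cycle, so no merged data has to travel back through the network, and the selected values never need to be rotated before entering the remaining stages.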
Brief proof overview
● Main principles
– Top P works for equally rotated input → No need to rotate input
– Top P is a rotated bitonic sequence → also bitonic → rest of the sorting network works
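The second principle can be spot-checked by brute force. The Python sketch below is a sanity check rather than the proof from the paper: it enumerates every 0/1 sequence of length P that is a rotation of a bitonic sequence (at most two value changes when read cyclically) and confirms that the comparator stages at halving distances sort it. The function names are illustrative.

```python
from itertools import product


def sort_bitonic_desc(v):
    """Comparator stages at halving distances, descending output."""
    v = list(v)
    dist = len(v) // 2
    while dist >= 1:
        for start in range(0, len(v), 2 * dist):
            for k in range(dist):
                i, j = start + k, start + k + dist
                if v[i] < v[j]:
                    v[i], v[j] = v[j], v[i]
        dist //= 2
    return v


def is_rotated_bitonic(bits):
    """A 0/1 sequence is a rotated bitonic sequence if it changes value
    at most twice when read cyclically."""
    n = len(bits)
    return sum(bits[i] != bits[(i + 1) % n] for i in range(n)) <= 2


def check_rotated_bitonic(p):
    """Return how many rotated bitonic 0/1 inputs of length p were checked."""
    checked = 0
    for bits in product((0, 1), repeat=p):
        if is_rotated_bitonic(bits):
            assert sort_bitonic_desc(bits) == sorted(bits, reverse=True)
            checked += 1
    return checked


print(check_rotated_bitonic(8))
```

Restricting the check to 0/1 inputs follows the usual zero-one argument for comparator networks; it is used here only as a quick plausibility test of the slide's claim.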
Comparison with FLiMS

                           [Song et al., 2016]            [Saitoh et al., 2018]    FLiMS
Feedback datapath length   $\log_2(P) + 1$                1                        1
Latency                    $\log_2(P) + \log_2(2P)$       $2 \times \log_2(2P)$    $\log_2(2P)$
H/W modules                1 × b.p.m,                     2 × b.p.m,               1 × b.p.m
                           2 crossbars (barrel shifters)  shift registers

(b.p.m: bitonic partial merger)
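To make the latency row concrete, the formulas can be evaluated for a few values of P. The Python sketch below assumes P is a power of two and only restates the expressions from the comparison; the function name and dictionary keys are illustrative.

```python
def latencies(p):
    """Pipeline latency (in stages) of each design for parallelism p."""
    assert p > 0 and p & (p - 1) == 0, "P must be a power of two"
    lg_p = p.bit_length() - 1      # log2(P)
    lg_2p = lg_p + 1               # log2(2P)
    return {
        "Song et al., 2016": lg_p + lg_2p,
        "Saitoh et al., 2018": 2 * lg_2p,
        "FLiMS": lg_2p,
    }


for p in (4, 8, 16, 32, 64, 128):
    print(p, latencies(p))
```

For P=8 this gives 7, 8 and 4 stages respectively, so the FLiMS latency is half that of [Saitoh et al., 2018] for every P.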
FLiMS example run (P=4)
[Figure: animation of the pipeline, a Max (selector) stage followed by two CAS stages]
● Queue A banks (next element to leave is rightmost): 4 16 29 | 3 11 26 | 3 5 26 | 4 17
● Queue B banks (next element to leave is rightmost): 7 15 22 | 0 12 21 | 9 19 | 8 18
● Output, one group of up to 4 per cycle: 29 26 26 22 → 21 19 18 17 → 16 15 12 11 → 9 8 7 5 → 4 4 3 3 → 0
Merge sorter – results
● Board: MYIR Z-turn (Xilinx Zynq 7020)
[Figures: LUT and FF utilisation of the proposal vs. MMS; LUT and FF utilisation improvement (observations and fitting); operating frequency (MHz) of the proposal vs. MMS. All are plotted against P (integers/cycle) for P = 4 to 128.]