FLiMS: Fast Lightweight Merge Sorter


  1. FLiMS: Fast Lightweight Merge Sorter
     2018 International Conference on Field-Programmable Technology (FPT)
     Philippos Papaphilippou, Dept. of Computing, Imperial College London, London, United Kingdom, pp616@imperial.ac.uk
     Chris Brooks, Science Innovation, dunnhumby, London, United Kingdom, Chris.Brooks@dunnhumby.com
     Wayne Luk, Dept. of Computing, Imperial College London, London, United Kingdom, w.luk@imperial.ac.uk
     12/12/2018, Philippos Papaphilippou

  2. Novel merger design
     ● Task
       – Merge 2 sorted sequences in parallel
     ● Contributions
       – Highly-efficient parallel merger
         ● Half the hardware resources of the state-of-the-art
         ● Half the latency
       – Open source
       – Evaluation
         ● FPGA
         ● CPU with SIMD registers
     (Figure: two sorted sequences being merged.)

  3. Introduction: Bitonic sorter
     ● Bitonic sort [K. E. Batcher, 1968]
       – A parallel sorting algorithm
       – $N/2$ comparisons per step
       – $O(\log^2 N)$ steps: $\log_2(N) \cdot (\log_2(N) + 1) / 2$
       – Pipelineable → FPGAs
     ● Compare and swap (CAS)
       – if (a<b) swap(a, b)
       – Sorter of 2 elements
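
To make the numbers on this slide concrete, here is a minimal Python sketch of the compare-and-swap element and of the step count. The function names are illustrative and do not come from the slides, and N is assumed to be a power of two.

```python
def compare_and_swap(a, b):
    """CAS as on the slide: if (a < b) swap(a, b), so the larger value comes first."""
    if a < b:
        a, b = b, a
    return a, b


def bitonic_steps(n):
    """Number of steps of a bitonic sorter for n inputs (n assumed a power of two)."""
    k = n.bit_length() - 1          # k = log2(n) for a power of two
    return k * (k + 1) // 2


def bitonic_comparisons(n):
    """Total CAS elements: n/2 comparisons in each step."""
    return (n // 2) * bitonic_steps(n)


if __name__ == "__main__":
    for n in (2, 8, 64):
        print(n, bitonic_steps(n), bitonic_comparisons(n))
```

For N = 8 this gives 6 steps, which is also the number of steps in the bitonic sort example of this deck.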

  4. Introduction: Bitonic sorter
     ● Bitonic sort is based on mergesort
       – Hierarchical merge module
     (Figure: a sorter (P=8) built from sorter (4) blocks followed by a merger (8).)

  5. Introduction: Bitonic sorter (P=64)

  6. Bitonic sort example. Input: 8 3 5 2 1 7 9 0

  7. Bitonic sort example. Step 1: 8 3 5 2 1 7 9 0 → 8 3 5 2 7 1 9 0

  8. Bitonic sort example. Step 2: 8 3 5 2 7 1 9 0 → 8 5 3 2 7 9 1 0

  9. Bitonic sort example. Step 3: 8 5 3 2 7 9 1 0 → 8 5 3 2 9 7 1 0

  10. Bitonic sort example. Step 4: 8 5 3 2 9 7 1 0 → 8 5 7 9 2 3 1 0

  11. Bitonic sort example. Step 5: 8 5 7 9 2 3 1 0 → 8 9 7 5 2 3 1 0

  12. Bitonic sort example. Step 6: 8 9 7 5 2 3 1 0 → 9 8 7 5 3 2 1 0
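
The example on slides 6 to 12 can be reproduced with a short Python sketch. It assumes the network variant in which every comparator points the same way (larger value to the lower index) and each merge starts by comparing an element with its mirror image in the block; that wiring is inferred from the example values, and the function names are illustrative.

```python
def cas(v, i, j):
    """Compare-and-swap: keep the larger value at the lower index."""
    if v[i] < v[j]:
        v[i], v[j] = v[j], v[i]


def bitonic_sort_trace(data):
    """Sort in descending order; return the list state after every step."""
    v = list(data)
    n = len(v)                       # assumed to be a power of two
    trace = []
    size = 2
    while size <= n:
        # First step of a merge: compare element k with its mirror in the block.
        for start in range(0, n, size):
            for k in range(size // 2):
                cas(v, start + k, start + size - 1 - k)
        trace.append(list(v))
        # Remaining steps: comparators at strides size/4, size/8, ..., 1.
        stride = size // 4
        while stride >= 1:
            for i in range(n):
                if (i & stride) == 0:
                    cas(v, i, i + stride)
            trace.append(list(v))
            stride //= 2
        size *= 2
    return trace


if __name__ == "__main__":
    for step, state in enumerate(bitonic_sort_trace([8, 3, 5, 2, 1, 7, 9, 0]), 1):
        print(step, state)
```

For 8 inputs the trace has 6 entries, one per pipeline step.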

  13. Merging
     ● Bitonic sort:
       – The merger module can be used to merge 2 lists of P elements → 1 list of 2P elements
     ● Problem: How about sorting bigger lists with limited hardware/logic?
       – Merge 2 lists of arbitrary length
       – The data can be streamed and queued
     ● Design an efficient parallel merger for arbitrarily long lists
     ● Simple (unrealistic) FPGA example:
       – CPU mergesort, but
       – Parallel merging in FPGA

  14. Basic merger for longer lists
     ● Based on bitonic merger
     (Figure: Queue A holds 13, 11, 9, 7 followed by 6, 3, 1, 0; Queue B holds 12, 10, 8, 5 followed by 4, 2, 1, 0. The merger takes 4 elements from each queue; the upper 4 results, 13, 12, 11, 10, go to Output, and the lower 4, 9, 8, 7, 5, need to be fed back.)

  15. Basic merger for longer lists
     ● Based on bitonic merger
     (Figure: next step. The fed-back 9, 8, 7, 5 are merged with 6, 3, 1, 0 from Queue A; 9, 8, 7, 6 go to Output, and the lower 4, 5, 3, 1, 0, need to be fed back.)
     ● Not pipelined! (Works fine for CPUs [Chhugani et al., 2008].)
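
The feedback scheme of slides 14 and 15 can be sketched in Python as follows. This is only an illustration under stated assumptions: both inputs are sorted in descending order, each length is a non-zero multiple of P, the next batch is taken from the queue with the larger head, and Python's `sorted` stands in for the 2P-input bitonic merger. The function name is hypothetical.

```python
from collections import deque


def feedback_merge(a, b, p):
    """Merge two descending lists in batches of p, feeding the lower p back."""
    qa, qb = deque(a), deque(b)

    def take(q):
        return [q.popleft() for _ in range(p)]

    # First pass: p elements from each queue (sorted() stands in for the merger).
    merged = sorted(take(qa) + take(qb), reverse=True)
    out, feedback = merged[:p], merged[p:]

    while qa or qb:
        # Next batch comes from the queue whose head is larger.
        if not qb or (qa and qa[0] >= qb[0]):
            batch = take(qa)
        else:
            batch = take(qb)
        # The fed-back lower p re-enter the merger: this loop is the feedback path.
        merged = sorted(batch + feedback, reverse=True)
        out += merged[:p]
        feedback = merged[p:]

    return out + feedback


if __name__ == "__main__":
    print(feedback_merge([13, 11, 9, 7, 6, 3, 1, 0],
                         [12, 10, 8, 5, 4, 2, 1, 0], 4))
```

Each iteration depends on the feedback from the previous one, which is why this structure cannot be pipelined directly.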

  16. Structures overview

  17. Optimisation for FPGAs
     ● Based on 2P-to-P bitonic partial merger, such as in [Song et al., 2016]
     (Figure: Queue A and Queue B each feed 4 elements into the partial merger; only the upper 4 results, 13, 12, 11, 10, are produced at Output. The 'unsorted' lower 4 are not needed.)
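
A 2P-to-P bitonic partial merger can be sketched in Python as below. The sketch assumes the usual structure of one Max stage followed by log2(P) compare-and-swap stages, which matches the "Max CAS CAS" column headers of the P=4 example run later in the deck; the function name is illustrative and P is assumed to be a power of two.

```python
def partial_merge_2p_to_p(a, b):
    """Return the P largest of two descending P-element lists, in descending order."""
    p = len(a)
    # Max stage: lane i keeps the larger of a[i] and b[P-1-i].
    # The P winners are the top P of the 2P inputs and form a bitonic sequence.
    v = [max(a[i], b[p - 1 - i]) for i in range(p)]
    # log2(P) CAS stages with strides P/2, P/4, ..., 1 sort the bitonic sequence.
    stride = p // 2
    while stride >= 1:
        for i in range(p):
            if (i & stride) == 0 and v[i] < v[i + stride]:
                v[i], v[i + stride] = v[i + stride], v[i]
        stride //= 2
    return v


if __name__ == "__main__":
    print(partial_merge_2p_to_p([13, 11, 9, 7], [12, 10, 8, 5]))  # [13, 12, 11, 10]
```

The lower P values are never computed, which is where the saving over a full 2P-to-2P merger comes from.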

  18. Related work examples
     ● Merging 2 already-sorted sequences
       – Related work [Song et al., 2016] [Saitoh et al., 2018]
     ● Problem: Trade-off between feedback (low frequency) and resources

  19. Contributions
     ● New parallel merger design
       – Merge algorithm
       – Feedback-less
       – Half the hardware resources of the state-of-the-art MMS [Saitoh et al., 2018]
         ● Lookup-table and flip-flop utilisation
         ● Also half the latency (pipeline length)
     ● Proof & Evaluation
     ● Open source implementation
       – Verilog generator for AXI peripherals
         ● FLiMS & MMS merger
       – SIMD version in C
         ● AVX2 & AVX-512

  20. Merge sorter – solution
     ● Just one 2P-to-P bitonic partial merger
       – Modified 1st pipeline stage
     ● No need for barrel shifters

     Algorithm A: Distributed algorithm pseudocode

         int i;                  // i is the entity tag
         int cA_i, cB_i, in_i;   // 32-bit registers
         while forever do
             receive(positive clock edge);
             if cA_i > cB_i then
                 in_i ← cA_i;
                 cA_i ← dequeue(a_i);
             else
                 in_i ← cB_i;
                 cB_i ← dequeue(b_{P−i});
             end
         end
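
Algorithm A can be simulated in software with the Python sketch below. Several details are assumptions made for illustration and are not stated on the slides: the inputs are descending lists dealt round-robin into P banks, lanes are indexed from 0 so that lane i is paired with bank P-1-i of input B, an exhausted bank returns a minus-infinity sentinel, and the stages after the first are modelled as compare-and-swap stages with strides P/2 down to 1. All names are hypothetical.

```python
from collections import deque

NEG_INF = float("-inf")   # sentinel returned by an exhausted bank (assumption)


def sort_bitonic_desc(v):
    """CAS stages with strides P/2, ..., 1; sorts a (rotated) bitonic list."""
    v = list(v)
    stride = len(v) // 2
    while stride >= 1:
        for i in range(len(v)):
            if (i & stride) == 0 and v[i] < v[i + stride]:
                v[i], v[i + stride] = v[i + stride], v[i]
        stride //= 2
    return v


def flims_merge(a, b, p):
    """Merge two descending lists, producing p elements per simulated cycle."""
    def dequeue(bank):
        return bank.popleft() if bank else NEG_INF

    banks_a = [deque(a[i::p]) for i in range(p)]   # element k goes to bank k mod p
    banks_b = [deque(b[i::p]) for i in range(p)]
    c_a = [dequeue(q) for q in banks_a]            # per-lane head registers
    c_b = [dequeue(q) for q in banks_b]

    out = []
    cycles = (len(a) + len(b) + p - 1) // p
    for _ in range(cycles):
        lane_in = [NEG_INF] * p
        for i in range(p):                         # lanes are independent
            j = p - 1 - i                          # B bank paired with lane i
            if c_a[i] > c_b[j]:
                lane_in[i] = c_a[i]
                c_a[i] = dequeue(banks_a[i])
            else:
                lane_in[i] = c_b[j]
                c_b[j] = dequeue(banks_b[j])
        out.extend(x for x in sort_bitonic_desc(lane_in) if x != NEG_INF)
    return out


if __name__ == "__main__":
    a = [29, 26, 26, 17, 16, 11, 5, 4, 4, 3, 3]
    b = [22, 21, 19, 18, 15, 12, 9, 8, 7, 0]
    print(flims_merge(a, b, 4))
```

Each lane only reads and writes its own pair of registers, so no value computed in one cycle is needed by the selection logic of another lane, and no input rotation is performed.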

  21. Brief proof overview
     ● Main principles
       – Top P works for equally rotated input → No need to rotate input
       – Top P is a rotated bitonic sequence → also bitonic → rest of the sorting network works
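
The two principles can be checked by brute force for small cases. The Python sketch below is not the proof from the paper; it only enumerates all descending P-element inputs over a small value range (an arbitrary choice), rotates both inputs by the same amount, and confirms that the lane-wise maxima are the top P and that the compare-and-swap stages sort them. The function names are illustrative.

```python
import itertools


def rotate(seq, r):
    """Rotate a list left by r positions."""
    r %= len(seq)
    return seq[r:] + seq[:r]


def sort_bitonic_desc(v):
    """CAS stages with strides P/2, ..., 1 (larger value to the lower index)."""
    v = list(v)
    stride = len(v) // 2
    while stride >= 1:
        for i in range(len(v)):
            if (i & stride) == 0 and v[i] < v[i + stride]:
                v[i], v[i + stride] = v[i + stride], v[i]
        stride //= 2
    return v


def check_principles(p=4, values=range(5)):
    """Exhaustively test both principles for all descending p-element inputs."""
    descending = [sorted(c, reverse=True)
                  for c in itertools.combinations_with_replacement(values, p)]
    for x, y in itertools.product(descending, repeat=2):
        top = sorted(x + y, reverse=True)[:p]
        for r in range(p):
            xr = rotate(x, r)            # input A, rotated by r lanes
            yr = rotate(y[::-1], r)      # mirrored input B, rotated equally
            selected = [max(s, t) for s, t in zip(xr, yr)]
            # Principle 1: the lane-wise maxima are exactly the top P.
            if sorted(selected, reverse=True) != top:
                return False
            # Principle 2: the rotated bitonic selection is sorted by the network.
            if sort_bitonic_desc(selected) != top:
                return False
    return True


if __name__ == "__main__":
    print(check_principles())
```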

  23. Comparison with FLiMS

                                | [Song et al., 2016]                       | [Saitoh et al., 2018]       | FLiMS
     Feedback datapath length   | $\log_2(P) + 1$                           | 1                           | 1
     Latency                    | $\log_2(P) + \log_2(2P)$                  | $2 \times \log_2(2P)$       | $\log_2(2P)$
     H/W modules                | 1 × b.p.m, 2 crossbars (barrel shifters)  | 2 × b.p.m, shift registers  | 1 × b.p.m

     (b.p.m = bitonic partial merger)
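
As a worked instance of the latency row, the Python sketch below evaluates the three expressions for a given P. It assumes the column assignment read from this slide (first [Song et al., 2016], then [Saitoh et al., 2018], then FLiMS) and that P is a power of two; the function name and dictionary keys are illustrative.

```python
def pipeline_latency(p):
    """Latency in pipeline stages for each design, per the comparison table."""
    lg_p = p.bit_length() - 1        # log2(P) for a power of two
    lg_2p = lg_p + 1                 # log2(2P)
    return {
        "Song et al., 2016": lg_p + lg_2p,
        "Saitoh et al., 2018": 2 * lg_2p,
        "FLiMS": lg_2p,
    }


if __name__ == "__main__":
    for p in (4, 8, 16, 32, 64, 128):
        print(p, pipeline_latency(p))
```

For every P the FLiMS figure is half of the [Saitoh et al., 2018] figure, in line with the "half the latency" claim on the contributions slide.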

  24. FLiMS example run (P=4), animated over slides 24 to 33
     (Figure: inputs A and B are each stored in 4 banks and feed a Max stage followed by two CAS stages, which produce the Output.)
     ● A banks (head first): 29 16 4; 26 11 3; 26 5 3; 17 4
     ● B banks (head first): 22 15 7; 21 12 0; 19 9; 18 8
     ● Output per cycle: 29 26 26 22; 21 19 18 17; 16 15 12 11; 9 8 7 5; 4 4 3 3; 0

  34. Merge sorter – results
     ● Board: MYIR Z-turn (Xilinx Zynq 7020)
     (Figures: LUT and FF utilisation on the 7z020 for the proposal and MMS, plotted against P (integers/cycle) for P = 4 to 128; operating frequency (MHz) for the proposal and MMS against P; LUT and FF utilisation improvement against P, showing observations and a fitting.)
