46 LEVEL #1 – SORTING NETWORKS Abstract model for sorting keys. → Always has fixed wiring “paths” for lists with the same number of elements. → Efficient to execute on modern CPUs because of limited data dependencies. [Figure: example sorting network; keys enter on the Input side and leave in sorted order on the Output side.]
47–56 LEVEL #1 – SORTING NETWORKS
Input (Instructions: → 4 LOAD)
  12 21  4 13
   9  8  6  7
   1 14  3  0
   5 11 15 10
Sort Across Registers (Instructions: → 10 MIN/MAX)
   1  8  3  0
   5 11  4  7
   9 14  6 10
  12 21 15 13
Transpose Registers (Instructions: → 8 SHUFFLE → 4 STORE)
   1  5  9 12
   8 11 14 21
   3  4  6 15
   0  7 10 13
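The register example above can be emulated in scalar code. The following is a minimal sketch, not the SIMD implementation itself: each Python list stands in for one register, and `lane_min`, `lane_max`, `COMPARATORS`, `sort_across_registers`, and `transpose` are hypothetical names chosen for illustration. It assumes the standard five-comparator network for four inputs, which is consistent with the 10 MIN/MAX count on the slide (one MIN and one MAX per comparator).

```python
# Scalar emulation of the 4x4 example; each row stands in for one SIMD register.
rows = [
    [12, 21, 4, 13],
    [9, 8, 6, 7],
    [1, 14, 3, 0],
    [5, 11, 15, 10],
]


def lane_min(a, b):
    """Element-wise minimum of two 'registers' (stand-in for one SIMD MIN)."""
    return [min(x, y) for x, y in zip(a, b)]


def lane_max(a, b):
    """Element-wise maximum of two 'registers' (stand-in for one SIMD MAX)."""
    return [max(x, y) for x, y in zip(a, b)]


# Fixed wiring for four inputs: the same comparator sequence runs for any data.
COMPARATORS = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort_across_registers(regs):
    """Sort each column by applying the comparator network to whole registers."""
    regs = [list(r) for r in regs]
    for i, j in COMPARATORS:
        lo, hi = lane_min(regs[i], regs[j]), lane_max(regs[i], regs[j])
        regs[i], regs[j] = lo, hi
    return regs


def transpose(regs):
    """Turn sorted columns into sorted rows (done with SHUFFLEs on real hardware)."""
    return [list(col) for col in zip(*regs)]


sorted_columns = sort_across_registers(rows)
sorted_runs = transpose(sorted_columns)
```

After the transpose, every row of `sorted_runs` is a sorted run of four keys, which is the input the bitonic merge level expects.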
57 LEVEL #2 – BITONIC MERGE NETWORK Like a Sorting Network but it can merge two locally-sorted lists into a globally-sorted list. Can expand network to merge progressively larger lists (½ cache size). Intel’s Measurements → 2.25–3.5x speed-up over SISD implementation. EFFICIENT IMPLEMENTATION OF SORTING ON MULTI-CORE, VLDB 2008
58–62 LEVEL #2 – BITONIC MERGE NETWORK [Figure: bitonic merge network. Input: a1 a2 a3 a4 (Sorted Run) followed by b4 b3 b2 b1 (Reverse Sorted Run). The keys pass through three min/max stages with SHUFFLE steps between them. Output: one Sorted Run.]
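The network in the figure can be sketched in scalar form as follows. This is an illustrative sketch under stated assumptions, not the SIMD code from the cited paper: `bitonic_merge` is a hypothetical name, both runs are assumed to have the same power-of-two length, and the index arithmetic in the inner loops plays the role that the SHUFFLE steps play on real hardware.

```python
def bitonic_merge(a, b):
    """Merge two ascending runs of equal power-of-two length into one ascending run."""
    n_run = len(a)
    assert n_run == len(b) and n_run > 0 and n_run & (n_run - 1) == 0

    # Ascending run followed by the reversed second run forms a bitonic sequence.
    seq = list(a) + list(reversed(b))
    n = len(seq)

    half = n // 2
    while half >= 1:
        # One min/max stage: compare elements that are `half` positions apart.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                lo = min(seq[i], seq[i + half])
                hi = max(seq[i], seq[i + half])
                seq[i], seq[i + half] = lo, hi
        half //= 2
    return seq


merged = bitonic_merge([1, 5, 9, 12], [3, 4, 6, 15])
```

For two runs of four keys the loop executes three stages (distances 4, 2, 1), matching the three min/max stages in the figure. The comparisons in each stage are independent of the data, which is what makes the pattern a good fit for SIMD.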
63 LEVEL #3 – MULTI-WAY MERGING Use the Bitonic Merge Networks but split the process up into tasks. → Still one worker thread per core. → Link together tasks with a cache-sized FIFO queue. A task blocks when either its input queue is empty or its output queue is full. Requires more CPU instructions, but brings bandwidth and compute into balance.
64–66 LEVEL #3 – MULTI-WAY MERGING [Figure: Sorted Runs feed a tree of MERGE tasks; each pair of connected tasks is linked by a Cache-Sized Queue.]
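A rough sketch of the task tree, assuming Python's `queue.Queue` with a `maxsize` as a stand-in for the cache-sized FIFO queue. The names `feed`, `merge_task`, `multiway_merge`, `DONE`, and the `capacity` parameter are hypothetical. For simplicity the sketch runs one thread per task and uses a plain two-way merge inside each task, whereas the slides describe one worker thread per core and bitonic merge networks.

```python
import queue
import threading

DONE = object()  # sentinel that marks the end of a stream


def feed(run, out_q):
    """Leaf task: push one sorted run into its output queue."""
    for key in run:
        out_q.put(key)  # blocks while the output queue is full
    out_q.put(DONE)


def merge_task(left_q, right_q, out_q):
    """Inner task: merge two input queues into one output queue."""
    a, b = left_q.get(), right_q.get()  # get() blocks while the queue is empty
    while a is not DONE or b is not DONE:
        if b is DONE or (a is not DONE and a <= b):
            out_q.put(a)
            a = left_q.get()
        else:
            out_q.put(b)
            b = right_q.get()
    out_q.put(DONE)


def multiway_merge(runs, capacity=4):
    """Merge sorted runs through a binary tree of merge tasks."""
    if not runs:
        return []
    threads, level = [], []
    for run in runs:
        q = queue.Queue(maxsize=capacity)
        threads.append(threading.Thread(target=feed, args=(run, q)))
        level.append(q)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            out_q = queue.Queue(maxsize=capacity)
            args = (level[i], level[i + 1], out_q)
            threads.append(threading.Thread(target=merge_task, args=args))
            next_level.append(out_q)
        if len(level) % 2:
            next_level.append(level[-1])  # odd queue moves up unchanged
        level = next_level
    for t in threads:
        t.start()
    result = []
    while True:
        item = level[0].get()
        if item is DONE:
            break
        result.append(item)
    for t in threads:
        t.join()
    return result


merged_runs = multiway_merge(
    [[1, 5, 9, 12], [8, 11, 14, 21], [3, 4, 6, 15], [0, 7, 10, 13]]
)
```

The bounded queues give the blocking behaviour described above: a task stalls on `get()` when its input is empty and on `put()` when its output is full, so no task runs far ahead of its consumer.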
67 MERGE PHASE Iterate through the outer table and inner table in lockstep and compare join keys. May need to backtrack if there are duplicates. Can be done in parallel at the different cores without synchronization if there are separate output buffers.
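The lockstep scan with backtracking can be sketched as below. This is a minimal single-threaded illustration; `merge_join` and the `(key, payload)` tuple layout are assumptions made for the example, not something the slides prescribe.

```python
def merge_join(outer, inner):
    """Join two lists of (key, payload) tuples that are both sorted by key."""
    out = []
    i = j = 0
    while i < len(outer) and j < len(inner):
        if outer[i][0] < inner[j][0]:
            i += 1
        elif outer[i][0] > inner[j][0]:
            j += 1
        else:
            key = outer[i][0]
            mark = j  # start of this key's group in the inner table
            while i < len(outer) and outer[i][0] == key:
                j = mark  # backtrack: rescan the inner group for each outer duplicate
                while j < len(inner) and inner[j][0] == key:
                    out.append((key, outer[i][1], inner[j][1]))
                    j += 1
                i += 1
    return out


outer = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
inner = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = merge_join(outer, inner)
```

In a parallel setting, each core would run this loop on its own pair of chunks and append to its own `out` list, which is why separate output buffers remove the need for synchronization.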
68 SORT-MERGE JOIN VARIANTS Multi-Way Sort-Merge ( M-WAY ) Multi-Pass Sort-Merge ( M-PASS ) Massively Parallel Sort-Merge ( MPSM )
69 MULTI-WAY SORT-MERGE Outer Table → Each core sorts in parallel on local data (levels #1/#2). → Redistribute sorted runs across cores using the multi-way merge (level #3). Inner Table → Same as outer table. Merge phase is between matching pairs of chunks of outer/inner tables at each core. MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED, VLDB 2013
70–84 MULTI-WAY SORT-MERGE [Figure: the outer table goes through Local-NUMA Partitioning, Sort, and Multi-Way Merge; the inner table goes through the same steps as the outer table; each core then runs a Local Merge Join (⨝) on its matching chunks.]
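The overall flow in the figure can be approximated in a few lines. This is a single-threaded sketch under several assumptions: tables are plain lists of integer keys, `sorted` stands in for the level #1/#2 sort, `heapq.merge` stands in for the level #3 multi-way merge, and the split keys in `bounds` are supplied by the caller. The names `chunk`, `sort_and_redistribute`, `join_sorted`, and `m_way_join` are hypothetical.

```python
import heapq
from bisect import bisect_left


def chunk(table, n_cores):
    """Split a table into contiguous chunks, one per simulated core."""
    size = max(1, -(-len(table) // n_cores))  # ceiling division
    return [table[k:k + size] for k in range(0, len(table), size)]


def sort_and_redistribute(table, n_cores, bounds):
    """Sort each chunk locally, merge the runs, and cut the result by key range."""
    runs = [sorted(c) for c in chunk(table, n_cores)]  # stand-in for levels #1/#2
    merged = list(heapq.merge(*runs))                  # stand-in for level #3
    cuts = [0] + [bisect_left(merged, b) for b in bounds] + [len(merged)]
    return [merged[lo:hi] for lo, hi in zip(cuts, cuts[1:])]


def join_sorted(r, s):
    """Merge-join two sorted key lists; emits one key per matching pair."""
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i] < s[j]:
            i += 1
        elif r[i] > s[j]:
            j += 1
        else:
            key, i_end, j_end = r[i], i, j
            while i_end < len(r) and r[i_end] == key:
                i_end += 1
            while j_end < len(s) and s[j_end] == key:
                j_end += 1
            out.extend([key] * ((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out


def m_way_join(outer, inner, n_cores, bounds):
    """Join matching range partitions of the two tables, one pair per core."""
    outer_parts = sort_and_redistribute(outer, n_cores, bounds)
    inner_parts = sort_and_redistribute(inner, n_cores, bounds)
    result = []
    for r, s in zip(outer_parts, inner_parts):  # each pair is one core's local join
        result.extend(join_sorted(r, s))
    return result


outer_keys = [12, 21, 4, 13, 9, 8, 6, 7, 1, 14, 3, 0, 5, 11, 15, 10]
inner_keys = [0, 3, 3, 8, 21, 22, 7]
matches = m_way_join(outer_keys, inner_keys, n_cores=4, bounds=[5, 10, 15])
```

Because both tables are cut at the same split keys, partition k of the outer table can only match partition k of the inner table, so each core joins its pair without looking at any other core's data.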
85 MULTI-PASS SORT-MERGE Outer Table → Same level #1/#2 sorting as Multi-Way. → But instead of redistributing, it uses a multi-pass naïve merge on sorted runs. Inner Table → Same as outer table. Merge phase is between matching pairs of chunks of outer table and inner table. MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED, VLDB 2013
86 MASSIVELY PARALLEL SORT-MERGE Outer Table → Range-partition outer table and redistribute to cores. → Each core sorts its partition in parallel. Inner Table → Not redistributed like outer table. → Each core sorts its local data. Merge phase is between entire sorted run of outer table and a segment of inner table. MASSIVELY PARALLEL SORT-MERGE JOINS IN MAIN MEMORY MULTI-CORE DATABASE SYSTEMS, VLDB 2012
87–99 MASSIVELY PARALLEL SORT-MERGE [Figure: Cross-NUMA Partitioning, then Sort at each core, then a Cross-Partition Merge Join (⨝) performed at each core.]
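The MPSM flow can be sketched in the same style. This is a single-threaded illustration under stated assumptions: tables are lists of integer keys, the split keys in `bounds` are supplied by the caller, and the relevant segment of each inner run is located with binary search. The names `join_sorted` and `mpsm_join` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def join_sorted(r, s):
    """Merge-join two sorted key lists; emits one key per matching pair."""
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i] < s[j]:
            i += 1
        elif r[i] > s[j]:
            j += 1
        else:
            key, i_end, j_end = r[i], i, j
            while i_end < len(r) and r[i_end] == key:
                i_end += 1
            while j_end < len(s) and s[j_end] == key:
                j_end += 1
            out.extend([key] * ((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out


def mpsm_join(outer, inner, bounds):
    """Range-partition and sort the outer table; sort the inner table in place."""
    n_cores = len(bounds) + 1

    # Outer table: range-partition, redistribute, then sort each partition.
    parts = [[] for _ in range(n_cores)]
    for key in outer:
        parts[bisect_right(bounds, key)].append(key)
    outer_runs = [sorted(p) for p in parts]

    # Inner table: no redistribution; each core sorts its local chunk.
    size = max(1, -(-len(inner) // n_cores))  # ceiling division
    inner_runs = [sorted(inner[k:k + size]) for k in range(0, len(inner), size)]

    result = []
    for run in outer_runs:  # each iteration models one core's work
        if not run:
            continue
        for s in inner_runs:
            # Only the segment of the inner run that overlaps this key range.
            lo, hi = bisect_left(s, run[0]), bisect_right(s, run[-1])
            result.extend(join_sorted(run, s[lo:hi]))
    return result


outer_keys = [12, 21, 4, 13, 9, 8, 6, 7, 1, 14, 3, 0, 5, 11, 15, 10]
inner_keys = [0, 3, 3, 8, 21, 22, 7]
mpsm_matches = mpsm_join(outer_keys, inner_keys, bounds=[5, 10, 15])
```

Each core reads its whole outer run but only a segment of every inner run, and it reads those segments sequentially. The result is not globally sorted, since each core emits matches in the order it visits the inner runs.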
100 HYPER’s RULES FOR PARALLELIZATION Rule #1: No random writes to non-local memory → Chunk the data, redistribute, and then each core sorts/works on local data. Rule #2: Only perform sequential reads on non-local memory → This allows the hardware prefetcher to hide remote access latency. Rule #3: No core should ever wait for another → Avoid fine-grained latching or sync barriers. Source: Martina-Cezara Albutiu