
15-721 ADVANCED DATABASE SYSTEMS
Lecture #19 – Parallel Join Algorithms (Sorting)
Andy Pavlo // Carnegie Mellon University // Spring 2016
@Andy_Pavlo // Carnegie Mellon University // Spring 2017

2 TODAY’S AGENDA
Background
SIMD
Parallel Sort-Merge Join
Evaluation
SPOILER: This doesn’t work on current Xeon CPUs.
CMU 15-721 (Spring 2017)

3 SINGLE INSTRUCTION, MULTIPLE DATA
A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously.
Both current AMD and Intel CPUs have ISA and microarchitecture support for SIMD operations.
→ MMX, 3DNow!, SSE, SSE2, SSE3, SSE4, AVX

4 SIMD EXAMPLE
X + Y = Z, where z_i = x_i + y_i for i = 1..n
for (i=0; i<n; i++) { Z[i] = X[i] + Y[i]; }
[Figure: X = (8, 7, 6, 5, 4, 3, 2, 1) and Y = (1, 1, 1, 1, 1, 1, 1, 1). SISD adds one pair of elements per instruction. SIMD loads four elements of X and four elements of Y into 128-bit SIMD registers and adds all four pairs with one instruction, producing Z = (9, 8, 7, 6, 5, 4, 3, 2) in two steps.]
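The loop on this slide can be sketched with SSE2 intrinsics. This is a minimal illustration, not code from the lecture: the function name `add_simd` is hypothetical, the keys are assumed to be 32-bit integers (four per 128-bit register), and unaligned loads are used so the arrays need no special alignment. When SSE2 is not available the function falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Z[i] = X[i] + Y[i] for 32-bit integers. add_simd is a hypothetical name. */
void add_simd(const int32_t *X, const int32_t *Y, int32_t *Z, size_t n) {
    size_t i = 0;
#if defined(__SSE2__)
    /* Four 32-bit lanes fit in one 128-bit register. */
    for (; i + 4 <= n; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)(X + i));
        __m128i y = _mm_loadu_si128((const __m128i *)(Y + i));
        _mm_storeu_si128((__m128i *)(Z + i), _mm_add_epi32(x, y));
    }
#endif
    /* Scalar tail; also the full fallback when SSE2 is unavailable. */
    for (; i < n; i++) {
        Z[i] = X[i] + Y[i];
    }
}
```

With the slide's input of eight elements, the vector loop runs twice and the scalar tail does nothing.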

5 SIMD TRADE-OFFS
Advantages:
→ Significant performance gains and better resource utilization if an algorithm can be vectorized.
Disadvantages:
→ Implementing an algorithm using SIMD is still mostly a manual process.
→ SIMD may have restrictions on data alignment.
→ Gathering data into SIMD registers and scattering it to the correct locations is tricky and/or inefficient.

6 WHY NOT GPUS?
Moving data back and forth between DRAM and the GPU over the PCI-E bus is slow.
There are some newer GPU-enabled DBMSs.
→ Examples: MapD, SQream, Kinetica
Emerging co-processors that can share the CPU’s memory may change this.
→ Examples: AMD’s APU, Intel’s Knights Landing

7 SORT-MERGE JOIN (R ⨝ S)
Phase #1: Sort
→ Sort the tuples of R and S based on the join key.
Phase #2: Merge
→ Scan the sorted relations and compare tuples.
→ The outer relation R only needs to be scanned once.

8 SORT-MERGE JOIN (R ⨝ S)
[Figure: Relation R and Relation S are each sorted (SORT!), then the two sorted relations are merged (MERGE!) to produce R ⨝ S.]
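The merge phase can be sketched as a two-cursor scan. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the relations are reduced to arrays of 32-bit join keys that are already sorted ascending, and `match_t` and `merge_join` are hypothetical names. The cursor over R only moves forward, so R is scanned once; a run of equal keys in S is revisited only when R contains duplicates of that key.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output record: positions of one matching pair. */
typedef struct { size_t r_idx; size_t s_idx; } match_t;

/* Merge phase: r and s must already be sorted ascending on the join key.
 * Writes at most cap matches into out and returns the total number found. */
size_t merge_join(const int32_t *r, size_t nr, const int32_t *s, size_t ns,
                  match_t *out, size_t cap) {
    size_t i = 0, j = 0, found = 0;
    while (i < nr && j < ns) {
        if (r[i] < s[j]) {
            i++;
        } else if (r[i] > s[j]) {
            j++;
        } else {
            /* Equal keys: pair r[i] with the whole run of equal keys in s. */
            for (size_t k = j; k < ns && s[k] == r[i]; k++) {
                if (found < cap) {
                    out[found].r_idx = i;
                    out[found].s_idx = k;
                }
                found++;
            }
            i++;  /* j stays at the start of the run in case r repeats the key */
        }
    }
    return found;
}
```

Returning the total count even when `out` is too small lets a caller size the buffer and call again; that choice is a convenience of this sketch only.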

9 PARALLEL SORT-MERGE JOINS
Sorting is always the most expensive part.
Take advantage of new hardware to speed things up as much as possible.
→ Utilize as many CPU cores as possible.
→ Be mindful of NUMA boundaries.
MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED (VLDB 2013)

10 PARALLEL SORT-MERGE JOIN (R ⨝ S)
Phase #1: Partitioning (optional)
→ Partition R and assign the partitions to workers / cores.
Phase #2: Sort
→ Sort the tuples of R and S based on the join key.
Phase #3: Merge
→ Scan the sorted relations and compare tuples.
→ The outer relation R only needs to be scanned once.

11 PARTITIONING PHASE
Divide the relations into chunks and assign them to cores.
→ Explicit vs. Implicit
Explicit: Divide only the outer relation and redistribute among the different CPU cores.
→ Can use the same radix partitioning approach we talked about last time.
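A single radix-partitioning pass can be sketched as a histogram, a prefix sum, and a scatter. Everything specific here is an assumption for illustration: the name `radix_partition`, the use of the two low-order key bits (four partitions), and the reduction of tuples to bare 32-bit keys. Which bits to partition on, and how partitions map to cores, is left open.

```c
#include <stddef.h>
#include <stdint.h>

#define RADIX_BITS 2                    /* assumption: 4 partitions */
#define NUM_PARTS  (1u << RADIX_BITS)

/* Scatter keys into out so that each partition is contiguous.
 * Partition p occupies out[start[p]] .. out[start[p + 1] - 1];
 * start must have NUM_PARTS + 1 entries. */
void radix_partition(const uint32_t *keys, size_t n,
                     uint32_t *out, size_t *start) {
    size_t hist[NUM_PARTS] = {0};
    size_t pos[NUM_PARTS];

    /* Pass 1: count how many keys fall into each partition. */
    for (size_t i = 0; i < n; i++) {
        hist[keys[i] & (NUM_PARTS - 1)]++;
    }
    /* Prefix sum turns the counts into starting offsets. */
    start[0] = 0;
    for (unsigned p = 0; p < NUM_PARTS; p++) {
        start[p + 1] = start[p] + hist[p];
        pos[p] = start[p];
    }
    /* Pass 2: copy each key to the next free slot of its partition. */
    for (size_t i = 0; i < n; i++) {
        unsigned p = keys[i] & (NUM_PARTS - 1);
        out[pos[p]++] = keys[i];
    }
}
```

Each contiguous range of `out` could then be handed to one worker.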

12 SORT PHASE
Create runs of sorted chunks of tuples for both input relations.
It used to be that Quicksort was good enough.
But NUMA and parallel architectures require us to be more careful…

13 CACHE-CONSCIOUS SORTING
Level #1: In-Register Sorting
→ Sort runs that fit into CPU registers.
Level #2: In-Cache Sorting
→ Merge the output of Level #1 into runs that fit into CPU caches.
→ Repeat until sorted runs are ½ cache size.
Level #3: Out-of-Cache Sorting
→ Used when the runs of Level #2 exceed the size of caches.

14 CACHE-CONSCIOUS SORTING
[Figure: UNSORTED → Level #1 → Level #2 → Level #3 → SORTED]
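The overall shape of this pipeline, small sorted runs first and then repeated merging of adjacent runs, can be sketched as a bottom-up merge sort. This is a simplified sketch under several assumptions: the names are hypothetical, runs of four keys stand in for register-sized runs, insertion sort stands in for the Level #1 sorting network, and cache sizes are not modeled at all. In a cache-conscious implementation the merge passes would use different strategies depending on whether the run width is still below half the cache size (Level #2) or above it (Level #3).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RUN_LEN 4   /* assumption: Level #1 yields sorted runs of 4 keys */

/* Level #1 stand-in: insertion sort of one small run. */
static void sort_small_run(int32_t *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        int32_t v = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > v) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = v;
    }
}

/* Merge sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge_runs(const int32_t *src, int32_t *dst,
                       size_t lo, size_t mid, size_t hi) {
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    }
    while (i < mid) dst[k++] = src[i++];
    while (j < hi) dst[k++] = src[j++];
}

/* Sort a[0..n) ascending, using tmp[0..n) as scratch space. */
void run_merge_sort(int32_t *a, int32_t *tmp, size_t n) {
    /* Create the initial sorted runs. */
    for (size_t lo = 0; lo < n; lo += RUN_LEN) {
        sort_small_run(a + lo, (n - lo < RUN_LEN) ? n - lo : RUN_LEN);
    }
    /* Merge adjacent runs, doubling the run width each pass. */
    for (size_t width = RUN_LEN; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_runs(a, tmp, lo, mid, hi);
        }
        memcpy(a, tmp, n * sizeof *a);
    }
}
```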

15 LEVEL #1 – SORTING NETWORKS
Abstract model for sorting keys.
→ Always has fixed wiring “paths” for lists with the same number of elements.
→ Efficient to execute on modern CPUs because of limited data dependencies and no branches.
[Figure: a four-input sorting network. Input (9, 5, 3, 6) passes through the fixed compare-and-swap wires and emerges as Output (3, 5, 6, 9).]
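A four-input sorting network can be sketched as five fixed compare-and-swap steps. The names `cmp_swap` and `sort4` are hypothetical, and the comparator order below is the standard five-comparator network for four keys, assumed here to match the wiring in the figure. The same sequence runs regardless of the input values, which is why there are no data-dependent branches to mispredict.

```c
#include <stdint.h>

/* Compare-and-swap on one pair of wires: afterwards *lo <= *hi.
 * Written as min/max selects so a compiler can emit branch-free code. */
static inline void cmp_swap(int32_t *lo, int32_t *hi) {
    int32_t a = *lo, b = *hi;
    *lo = (a < b) ? a : b;
    *hi = (a < b) ? b : a;
}

/* Five-comparator sorting network for exactly four keys. */
void sort4(int32_t v[4]) {
    cmp_swap(&v[0], &v[1]);
    cmp_swap(&v[2], &v[3]);
    cmp_swap(&v[0], &v[2]);
    cmp_swap(&v[1], &v[3]);
    cmp_swap(&v[1], &v[2]);
}
```

Calling `sort4` on the slide's input (9, 5, 3, 6) leaves the array as (3, 5, 6, 9).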

16 LEVEL #1 – SORTING NETWORKS
Input (four registers, four keys each):
12 21  4 13
 9  8  6  7
 1 14  3  0
 5 11 15 10
Sort Across Registers:
 1  8  3  0
 5 11  4  7
 9 14  6 10
12 21 15 13
Transpose Registers:
 1  5  9 12
 8 11 14 21
 3  4  6 15
 0  7 10 13
Instructions:
→ 4 LOAD
→ 10 MIN/MAX
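The two steps on this slide can be emulated in plain C to check the numbers. This is a scalar stand-in, not SIMD code: `reg4` is a hypothetical struct modeling one 128-bit register with four 32-bit lanes, and each call to `reg_cmp_swap` corresponds to one lane-wise MIN plus one lane-wise MAX instruction in a vectorized version. Five comparators therefore account for the 10 MIN/MAX instructions listed above.

```c
#include <stdint.h>

/* Stand-in for one 128-bit register holding four 32-bit keys. */
typedef struct { int32_t lane[4]; } reg4;

/* Lane-wise compare-and-swap: a receives the minima, b the maxima. */
static void reg_cmp_swap(reg4 *a, reg4 *b) {
    for (int i = 0; i < 4; i++) {
        int32_t x = a->lane[i], y = b->lane[i];
        a->lane[i] = (x < y) ? x : y;
        b->lane[i] = (x < y) ? y : x;
    }
}

/* Sort each column across r[0..3], then transpose so that every
 * register ends up holding one sorted run of four keys. */
void sort4x4(reg4 r[4]) {
    /* Same five-comparator network, applied to all four lanes at once. */
    reg_cmp_swap(&r[0], &r[1]);
    reg_cmp_swap(&r[2], &r[3]);
    reg_cmp_swap(&r[0], &r[2]);
    reg_cmp_swap(&r[1], &r[3]);
    reg_cmp_swap(&r[1], &r[2]);

    /* In-place 4x4 transpose. */
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            int32_t t = r[i].lane[j];
            r[i].lane[j] = r[j].lane[i];
            r[j].lane[i] = t;
        }
    }
}
```

Run on the slide's input, the four registers end as (1, 5, 9, 12), (8, 11, 14, 21), (3, 4, 6, 15), and (0, 7, 10, 13), matching the transposed matrix.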
