Spinning Relations: High-Speed Networks for Distributed Join Processing
Philip Frey, Romulo Goncalves, Martin Kersten, Jens Teubner
Problem Statement

We address a core database problem, but for large problem sizes: process a join R ⋈_θ S with an arbitrary join predicate θ, where R and S are large (many gigabytes, even terabytes).

Traditional approach:
◮ Use a big machine and/or suffer the severe disk I/O bottleneck of block nested loops join.
◮ Distributed evaluation works only for certain θ or certain data distributions (otherwise it suffers high network I/O cost).

Today:
◮ Assume a cluster of commodity machines only.
◮ Leverage modern high-speed networks (10 Gb/s and beyond).

Jens Teubner · Spinning Relations: High-Speed Networks for Distributed Join Processing 2 / 11
Modern Networks: High Speed?

It is actually very hard to saturate modern (e.g., 10 Gb/s) networks.

[Figure: System 1 and System 2, each with CPU, RAM, and NIC, connected by an underutilized network.]

High CPU demand
◮ Rule of thumb: 1 GHz CPU per 1 Gb/s network throughput (!)

Memory bus contention
◮ Data typically has to cross the memory bus three times → ≈ 3 GB/s bus capacity needed for 10 Gb/s network
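The two rules of thumb on this slide can be checked with a back-of-the-envelope calculation. The sketch below is only illustrative: the function name and its parameters are hypothetical, and it assumes the link runs at its full nominal rate. At exactly 10 Gb/s, three bus crossings come to 3.75 GB/s, the same order as the ≈ 3 GB/s quoted above.

```python
def network_cost_estimate(link_gbps, bus_crossings=3, ghz_per_gbps=1.0):
    """Estimate the CPU and memory-bus budget needed to saturate a link.

    link_gbps:     nominal link speed in Gb/s (gigabits per second)
    bus_crossings: how often each byte crosses the memory bus (slide: three times)
    ghz_per_gbps:  rule of thumb from the slide, 1 GHz of CPU per 1 Gb/s
    """
    cpu_ghz = link_gbps * ghz_per_gbps
    link_gbytes_per_s = link_gbps / 8.0          # 8 bits per byte
    bus_gbytes_per_s = link_gbytes_per_s * bus_crossings
    return cpu_ghz, bus_gbytes_per_s


cpu_ghz, bus_gbytes_per_s = network_cost_estimate(10.0)
print(f"CPU budget: {cpu_ghz:.1f} GHz, memory bus: {bus_gbytes_per_s:.2f} GB/s")
```

Setting `bus_crossings` to a smaller value gives a rough feel for what direct data placement saves on the bus.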
RDMA: Remote Direct Memory Access

RDMA-capable network cards (RNICs) can saturate the link using
◮ direct data placement (avoid unnecessary bus transfers),
◮ OS bypassing (avoid context switches), and
◮ TCP offloading (avoid CPU load).

[Figure: System 1 and System 2, each with CPU, RAM, and RNIC, connected by a fully utilized network.]

Data is read/written on both ends using intra-host DMA. The transfer runs asynchronously once the CPU has issued a work request.
Cyclo-Join Idea

1. Distribute input S.
2. Join locally.
3. Rotate R.

[Figure: six hosts H0 to H5 arranged in a ring and connected by RDMA links. Each host holds one fragment of S (S0 to S5); the fragments R0 to R5 of input R travel around the ring via RDMA (join and rotate).]
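The three steps can be simulated in a single process. This is a minimal sketch under stated assumptions, not the prototype's code: the names `local_join` and `cyclo_join` are hypothetical, the local join is a plain nested loop, and the rotation is an ordinary list shuffle, whereas the real system ships blocks over RDMA and overlaps transfer with computation.

```python
def local_join(r_block, s_block, theta):
    """Join two in-memory blocks with an arbitrary predicate (nested loops)."""
    return [(r, s) for r in r_block for s in s_block if theta(r, s)]


def cyclo_join(r_blocks, s_blocks, theta):
    """Simulate the ring: S blocks stay put, R blocks rotate host to host."""
    n = len(s_blocks)
    if len(r_blocks) != n:
        raise ValueError("this sketch assumes one R block and one S block per host")
    results = [[] for _ in range(n)]   # the output stays distributed, one list per host
    resident = list(r_blocks)          # resident[h] is the R block currently at host h
    for _ in range(n):                 # after n steps every R block has visited every host
        for h in range(n):
            results[h].extend(local_join(resident[h], s_blocks[h], theta))
        # Rotate: host h receives the block previously held by its predecessor.
        resident = [resident[(h - 1) % n] for h in range(n)]
    return results


# Hypothetical sample data: (key, payload) tuples and a non-equi predicate.
r_blocks = [[(1, "a"), (4, "b")], [(2, "c"), (6, "d")], [(3, "e")]]
s_blocks = [[(2, "x")], [(5, "y")], [(3, "z"), (1, "w")]]
theta = lambda r, s: r[0] < s[0]

per_host = cyclo_join(r_blocks, s_blocks, theta)
flat = sorted(pair for host in per_host for pair in host)
reference = sorted(
    (r, s)
    for rb in r_blocks for r in rb
    for sb in s_blocks for s in sb
    if theta(r, s)
)
print(len(flat), flat == reference)
```

The comparison against `reference` checks that the ring produces the same pairs as a single-host join over the whole input.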
Analysis

Cyclo-join has similarities to block nested loops join:
◮ Cut input data into blocks R_i and S_j.
◮ Join all combinations R_i ⋈ S_j in memory.

As such, cyclo-join
◮ can be paired with any in-memory join algorithm,
◮ can be used to distribute the processing of any join predicate.

Cyclo-join fits into a “cloud-style” environment:
◮ additional nodes can be hooked in as needed,
◮ arbitrary assignment host ↔ task,
◮ cyclo-join consumes and produces distributed tables → n-way joins.
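The block decomposition behind this analysis can be written out explicitly. The notation below is an assumption for illustration: n is the number of hosts, blocks are indexed from 0, and in rotation step k host H_j is taken to hold block R_{(j-k) mod n}; the slides do not fix the rotation direction.

```latex
\[
R = \bigcup_{i=0}^{n-1} R_i, \qquad
S = \bigcup_{j=0}^{n-1} S_j
\quad\Longrightarrow\quad
R \bowtie_\theta S \;=\; \bigcup_{i=0}^{n-1} \, \bigcup_{j=0}^{n-1}
  \left( R_i \bowtie_\theta S_j \right)
\]
\[
\text{For every host } j:\quad
\{\, (j - k) \bmod n \;:\; k = 0, \dots, n-1 \,\} \;=\; \{\, 0, \dots, n-1 \,\}
\]
```

The first identity holds for any predicate θ because each result pair (r, s) lies in exactly one block pair when the blocks are disjoint. The second says that n rotation steps bring every R block past every host, so no block pair is missed.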
Cyclo-Join Put Into Practice

We implemented a prototype of cyclo-join:

four processing nodes
◮ Intel Xeon quad-core 2.33 GHz
◮ 6 GB RAM per node; memory bandwidth: 3.4 GB/s (measured)

10 Gb/s Ethernet
◮ Chelsio T3 RDMA-enabled network cards
◮ Nortel 10 Gb/s Ethernet switch

in-memory hash join
◮ hash phase physically re-organizes data (on each node) → better cache efficiency during join phase
◮ I/O complexity: O(|R| + |S|)
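A two-phase in-memory hash join of the kind listed here might look as follows. This is a hedged sketch, not the prototype's implementation: it assumes an equi-join, it models the "re-organization" as hash partitioning of both inputs, and the names `partition`, `hash_join`, and `num_partitions` are hypothetical. Each input tuple is touched a constant number of times, which matches the O(|R| + |S|) bound (plus the size of the output).

```python
from collections import defaultdict


def partition(rows, key, num_partitions):
    """Hash phase: physically regroup rows so equal keys share a partition."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(key(row)) % num_partitions].append(row)
    return parts


def hash_join(r_rows, s_rows, r_key, s_key, num_partitions=4):
    """Join phase: build and probe one small hash table per partition pair."""
    r_parts = partition(r_rows, r_key, num_partitions)
    s_parts = partition(s_rows, s_key, num_partitions)
    out = []
    for r_part, s_part in zip(r_parts, s_parts):
        table = defaultdict(list)
        for s in s_part:                      # build on the S partition
            table[s_key(s)].append(s)
        for r in r_part:                      # probe with the R partition
            for s in table.get(r_key(r), ()):
                out.append((r, s))
    return out


# Hypothetical sample data: (key, payload) tuples joined on the key.
r_rows = [(1, "a"), (2, "b"), (2, "c")]
s_rows = [(2, "x"), (3, "y"), (2, "z")]
joined = sorted(hash_join(r_rows, s_rows, lambda r: r[0], lambda s: s[0]))
print(joined)
```

Smaller partitions keep each hash table small, which is the intuition behind the cache-efficiency remark on the slide.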
Experiments

Experiment 1: Distribute evaluation of a join where |R| = |S| = 1.8 GB.

[Figure: stacked bar chart of wall-clock time [s] (axis 0 to 80), split into hash buildup, synchronization, and join execution, for 1, 2, 3, and 4 hosts, each joining 1.8 ⋈ 1.8 (sizes of S ⋈ R in GB). MonetDB (single-host) is shown for reference.]

Main benefit: reduced hash buildup time.
Experiments

Experiment 2: Scale up and join larger S (hash buildup ignored here).

[Figure: stacked bar chart of wall-clock time [s] (axis 0 to 4), split into synchronization and join execution, by number of hosts and sizes of S ⋈ R in GB. Join execution labels: 1.35 s (1 host, 1.8 ⋈ 1.8), 2.08 s (2 hosts, 3.6 ⋈ 1.8), 2.83 s (3 hosts, 5.4 ⋈ 1.8), 3.54 s (4 hosts, 7.2 ⋈ 1.8). Synchronization segments are labeled 0.26, 0.58, and 0.80 s.]

System scales like a machine with large RAM would.
CPUs have to wait for network transfers (“synchronization”).
Memory Transfers

Need to wait for network: Does that mean RDMA doesn’t work at all?

1.8 GB / 10 Gb/s = 1.44 s

[Figure: memory bandwidth [GB/s] over time for the 3-host run (5.4 ⋈ 1.8), showing the RDMA transfer against the bus bandwidth. The join R_i ⋈ S_j takes 2.83 s, followed by 0.58 s.]

The culprit is the local memory bus! If RDMA hadn’t saved us some bus transfers, this would be worse.
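The arithmetic on this slide can be spelled out as a short calculation. The sketch assumes decimal units (1 GB = 8 Gb) and takes the block size, link speed, and timings directly from the slide; the function name `transfer_seconds` is hypothetical.

```python
def transfer_seconds(size_gbytes, link_gbps):
    """Time to move size_gbytes over a link of link_gbps at full line rate."""
    return size_gbytes * 8.0 / link_gbps   # gigabytes -> gigabits, then divide by Gb/s


block_gbytes = 1.8     # size of one R block (slide)
link_gbps = 10.0       # nominal link speed (slide)
join_seconds = 2.83    # local join time in the 3-host run (slide)

wire_seconds = transfer_seconds(block_gbytes, link_gbps)
link_is_fast_enough = wire_seconds < join_seconds

print(f"wire time at line rate: {wire_seconds:.2f} s")
print(f"transfer could hide behind the join: {link_is_fast_enough}")
```

At line rate the transfer would finish well within the join time, so the observed wait cannot be blamed on the link itself. That is consistent with the slide's conclusion that the local memory bus is the limiting resource.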
Conclusions

I demonstrated cyclo-join:
◮ ring topology to process large joins,
◮ use distributed memory to process arbitrary joins,
◮ hardware acceleration via RDMA is crucial:
   ◮ reduce CPU load and memory bus contention.

Cyclo-join is part of the Data Cyclotron project:
◮ support for more local join algorithms,
◮ process full queries in a merry-go-round setup.