The Case for Heterogeneous HTAP
Raja Appuswamy, Manos Karpathiotakis, Danica Porobic, and Anastasia Ailamaki
Data-Intensive Applications and Systems Lab, EPFL
HTAP – the contract with the hardware
Hybrid OLTP & OLAP Processing: high-throughput OLTP plus low-latency OLAP on fresh data.
HTAP on multicores rests on three hardware guarantees:
• Massive parallelism => high concurrency
• Global shared memory => data sharing
• System-wide coherence => synchronization
[Diagram: a multisocket multicore (per-socket cores, LLC, DRAM) running an HTAP DBMS over a single database holding fresh data]
Necessary for current systems.
Shifting hardware landscape (1): Specialization of CPUs
• Multisocket multicores: a single coherence domain
• Intel SCC, ARM v8, Cell SPE: multiple coherence domains
[Diagram: a cache-coherent multisocket multicore (cores, LLC, DRAM) next to a machine whose cores are split into multiple coherence domains linked over PCIe]
CPUs: general-purpose → customizable features
Shifting hardware landscape (2): Generalization of GPUs
[Chart: normalized SGEMM/Watt, 2008–2016, rising across GPU generations (Tesla, Fermi, Kepler, Maxwell, Pascal); each generation adds programmability features: UVA, Dynamic Parallelism, Unified Memory, Paging Unified Memory; interfaces evolve from PCIe 3.0 (16 GB/s) to NVLink (80–200 GB/s)]
GPUs: niche accelerators → general-purpose processors
Emerging hardware: Revisiting the contract
• Parallelism: homogeneous today, heterogeneous on emerging hardware (task-parallel CPUs, data-parallel GPUs). Current HTAP software cannot exploit heterogeneity; HTAP must run across processors.
• Coherence: system-wide today, relaxed on emerging hardware. Shared-everything OLTP no longer applies: no synchronization without coherence. OSes (FOS), file systems (Hare), and runtimes (Cosh) respond by treating the server as a distributed system.
• Memory: global shared memory remains, exposed as a unified address space. The distributed-system approach fails to exploit shared memory.
A clean-slate redesign is in order.
Heterogeneous HTAP (H2TAP): Caldera
• Store data in shared memory
• Run OLTP workloads on a task-parallel archipelago
• Run OLAP workloads on a data-parallel archipelago
[Diagram: a task-parallel archipelago of CPU cores (OLTP) and a data-parallel archipelago of GPUs (OLAP), both over a shared in-memory data store]
Loose job-to-core assignment exploits heterogeneity.
H2TAP Challenges
• Store data in shared memory: choose the optimal data layout
• OLTP on the task-parallel archipelago: make up for the (lack of) cache coherence
• OLAP on the data-parallel archipelago: share transactionally-consistent snapshots across processors
[Diagram: same architecture as the previous slide]
Data layout
• Need to minimize PCIe data transfers to the GPU
• Data accesses on the GPU should be sequential to enable "coalescing"
• Caldera implements NSM, DSM, and PAX (see the kernel sketch below)
[Diagram: an NSM page storing whole records (C1 C2 C3 C4) contiguously, a DSM page storing one column per page, and a PAX page grouping each column into its own minipage within the page]
PAX fits GPUs best (PCIe & coalesced accesses).
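To make the coalescing point concrete, here is a minimal CUDA sketch (our illustration, not Caldera's code; RecordNSM, sum_nsm, and sum_col are invented names) of the same column-sum scan over an NSM layout and over a column-contiguous layout such as a PAX minipage. In the NSM kernel, consecutive threads read addresses a full record apart; in the columnar kernel they read consecutive words, which the GPU coalesces into a few wide memory transactions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 1 << 20;

// Hypothetical 4-attribute record; NSM stores whole records contiguously.
struct RecordNSM { int c1, c2, c3, c4; };

// NSM scan of column c1: consecutive threads read addresses 16 bytes
// apart, so the hardware cannot coalesce them into wide transactions.
__global__ void sum_nsm(const RecordNSM* recs, int n, unsigned long long* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, (unsigned long long)recs[i].c1);
}

// PAX/DSM-style scan: c1 is stored contiguously (as in a PAX minipage),
// so consecutive threads read consecutive ints -> fully coalesced.
__global__ void sum_col(const int* c1, int n, unsigned long long* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, (unsigned long long)c1[i]);
}

int main() {
    RecordNSM* recs; int* c1; unsigned long long* out;
    cudaMallocManaged(&recs, N * sizeof(RecordNSM));
    cudaMallocManaged(&c1, N * sizeof(int));
    cudaMallocManaged(&out, sizeof(unsigned long long));
    for (int i = 0; i < N; i++) { recs[i] = {1, 2, 3, 4}; c1[i] = 1; }

    *out = 0;
    sum_nsm<<<(N + 255) / 256, 256>>>(recs, N, out);  // strided accesses
    cudaDeviceSynchronize();
    std::printf("NSM sum(c1) = %llu\n", *out);

    *out = 0;
    sum_col<<<(N + 255) / 256, 256>>>(c1, N, out);    // coalesced accesses
    cudaDeviceSynchronize();
    std::printf("PAX sum(c1) = %llu\n", *out);
    return 0;
}
```

A production scan would replace the per-element atomicAdd with a block-level reduction; the layout contrast is the point here.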
OLTP without cache coherence
• Use Data-Oriented Transaction Execution principles
• Thread-to-data assignment leads to partitioned data and metadata (2PL lock tables, indexes)
[Diagram: threads A and B, each owning one data partition and its metadata]
OLTP without cache coherence
• Use explicit messaging instead of implicit latching
• Exploit shared memory by exchanging pointers instead of data
Protocol between thread A and partition-owner thread B (sketched in code below):
1. Thd A: Msg(lookup, k)
2. Thd B: Reply(&k)
3. Thd A: access *k through shared memory
4. Thd A: Release(k)
Enforce coherence in software.
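A minimal host-side sketch of the pointer-exchange protocol (an illustration of the technique, not Caldera's implementation; Msg, Channel, and the map-based partition are invented). The requesting thread never latches the owner's index: it sends a lookup message, receives a pointer, reads the record directly through shared memory, and releases it with a second message.

```cpp
#include <condition_variable>
#include <cstdio>
#include <map>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical message: a lookup request or a pointer reply.
struct Msg { int key; int* ptr; };

// Simple blocking channel; in a coherence-free design this would be a
// queue in a memory region that both coherence islands can access.
struct Channel {
    std::queue<Msg> q;
    std::mutex m;
    std::condition_variable cv;
    void send(Msg msg) { { std::lock_guard<std::mutex> g(m); q.push(msg); } cv.notify_one(); }
    Msg recv() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !q.empty(); });
        Msg msg = q.front(); q.pop(); return msg;
    }
};

int main() {
    std::map<int, int> partition{{42, 7}};    // data + index owned by Thd B only
    Channel req, rep;

    std::thread owner([&] {                   // Thd B: sole owner of the partition
        Msg m = req.recv();                   // 1. Msg(lookup, k) arrives
        rep.send({m.key, &partition[m.key]}); // 2. Reply(&k): a pointer, not the data
        req.recv();                           // 4. wait for Release(k)
    });

    req.send({42, nullptr});                  // Thd A: 1. Msg(lookup, 42)
    Msg r = rep.recv();                       //        2. pointer arrives
    std::printf("value = %d\n", *r.ptr);      //        3. access *k via shared memory
    req.send({42, nullptr});                  //        4. Release(42)
    owner.join();
    return 0;
}
```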
Transactionally-consistent data sharing
• Data sharing across workloads
  • Use Unified Virtual Addressing (UVA) for CPU–GPU sharing
• Consistent data sharing via hardware snapshotting (e.g., HyPer's fork-based copy-on-write)
  • The CUDA runtime restricts its use in the H2TAP context
• Caldera instead supports lightweight software snapshotting (see the sketch below)
  • OLAP queries run on an immutable snapshot
  • Copy-on-write is performed by update transactions
Snapshots work across GPU–CPU archipelagos.
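A minimal sketch of page-grained software snapshotting with copy-on-write, assuming a hypothetical Table made of fixed-size pages (invented types; Caldera's actual scheme may differ). Taking a snapshot copies only the page directory; an update transaction copies a page before its first write, so an OLAP reader holding the snapshot keeps seeing immutable data.

```cpp
#include <cstdio>
#include <memory>
#include <vector>

constexpr int PAGE_ROWS = 4;
struct Page { int rows[PAGE_ROWS]; };

// Hypothetical table: a directory of shared pages. Single-threaded sketch;
// a real engine would serialize snapshot/update per partition.
struct Table {
    std::vector<std::shared_ptr<Page>> dir;

    // Snapshot = copy of the page directory; the pages themselves are shared.
    std::vector<std::shared_ptr<Page>> snapshot() const { return dir; }

    // Copy-on-write: if a page is still shared with some snapshot
    // (use_count > 1), copy it before the first write touches it.
    void update(int row, int value) {
        auto& page = dir[row / PAGE_ROWS];
        if (page.use_count() > 1) page = std::make_shared<Page>(*page);
        page->rows[row % PAGE_ROWS] = value;
    }
};

int main() {
    Table t;
    t.dir.push_back(std::make_shared<Page>(Page{{1, 2, 3, 4}}));

    auto snap = t.snapshot();  // OLAP query pins an immutable snapshot
    t.update(0, 99);           // OLTP write triggers copy-on-write

    std::printf("snapshot sees %d, live table sees %d\n",
                snap[0]->rows[0], t.dir[0]->rows[0]);  // prints 1 vs 99
    return 0;
}
```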
Caldera blueprint
[Diagram: the query parser & optimizer determine the ideal processor for each query; the query compiler compiles the query to x86 or PTX code; the query runtime's scheduler dispatches work to the task-parallel archipelago (CPU cores, running OLTP without cache coherence) or the data-parallel archipelago (GPUs, running OLAP on a database snapshot), with elastic core-to-workload assignment over the shared in-memory data store]
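As a toy illustration of the "determine ideal processor for query" step, here is a hedged sketch of one possible routing rule (all names and the 256 MB cutoff are invented; the slide does not describe the optimizer's actual costing):

```cpp
#include <cstddef>
#include <cstdio>

enum class Archipelago { TaskParallel, DataParallel };

struct QueryPlan {
    bool scan_heavy;       // dominated by sequential scans?
    size_t bytes_touched;  // estimated data volume
};

// Toy cost rule: large scan-heavy (OLAP) plans go to the GPU archipelago;
// everything else (OLTP, small queries) stays on the CPU cores.
Archipelago choose_processor(const QueryPlan& p) {
    const size_t kGpuCutoff = 256u << 20;  // assumed 256 MB threshold
    return (p.scan_heavy && p.bytes_touched > kGpuCutoff)
               ? Archipelago::DataParallel
               : Archipelago::TaskParallel;
}

int main() {
    QueryPlan oltp{false, 4u << 10}, olap{true, 8ull << 30};
    std::printf("OLTP -> %s\n",
                choose_processor(oltp) == Archipelago::TaskParallel ? "CPU" : "GPU");
    std::printf("OLAP -> %s\n",
                choose_processor(olap) == Archipelago::DataParallel ? "GPU" : "CPU");
    return 0;
}
```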
Experiments
Setup
• Two 12-core Intel Xeon E5-2650L v3 CPUs, 256 GB RAM
• GeForce GTX 980 GPU (PCIe 3.0) with 4 GB memory
• TPC-C, TPC-H, YCSB at various scale factors
• Baseline systems: Silo, MonetDB, DBMS-C
Goals
• Measure the overhead of message passing and software snapshotting
• Compare PAX against NSM and DSM on GPUs
• Compare Caldera against the state of the art
OLTP throughput
[Chart: throughput (MTps) for Caldera vs. Silo on 1–24 cores running TPC-C NewOrder (1 warehouse/core)]
The message passing-based design scales well: better code & data locality (partitioning), no synchronization overhead.
OLAP response time (incl. data movement)
[Chart: execution time (sec) for Caldera, DBMS-C, and MonetDB on TPC-H SF 300, Query 6; Caldera exploits GPU parallelism and saturates PCIe bandwidth]
Bounded by PCIe bandwidth (12 GB/s); emerging interconnects (NVLink) offer 80–200 GB/s.
Impact of snapshotting
[Charts: OLTP throughput (KTps) and OLAP response time (secs, for q1 and q1–q10) vs. % of records touched by OLTP (1–100%), each plotted against the ideal; gaps from ideal of up to 9x, 3.5x, and 2x as the touched fraction grows]
Limitation: software shadow copying imposes a high overhead.
Possible fixes: data classification, snapshot sharing, hardware acceleration.
Impact of data layout
• Table: table(i1 integer, i2 integer, ..., i16 integer)
• Query: SELECT SUM(colA + colB) FROM table
[Charts: execution time for DSM, PAX, NSM. With data (1 GB) in GPU memory (msec): NSM is only 2x worse, as GPUs have reduced the access "tax", while PAX exploits GPU memory bandwidth. With data (16 GB) in host memory (sec): NSM is 14x worse due to non-coalesced accesses, while PAX and DSM saturate PCIe]
Hybrid layouts like PAX are a good fit for H2TAP.
Conclusion
• Hardware architecture is changing
  • New opportunities: massive parallelism, fast interconnects
  • New challenges: heterogeneity, relaxed coherence
• Databases can and should exploit hardware trends
  • Exploit hardware heterogeneity in their core architecture design
  • Decouple system-wide coherence from shared memory
• Time to move from HTAP to H2TAP
  • H2TAP architecture: revisit the age-old hardware/software contract
  • Caldera: a preliminary prototype proving that H2TAP is possible