
The Glass Half Full: Using Programmable Hardware Accelerators in Analytical Databases



  1. The Glass Half Full: Using Programmable Hardware Accelerators in Analytical Databases. Zsolt István, IMDEA Software Institute

  2. IMDEA Software Institute • 16 Faculty in the areas of: • Program Analysis and Verification • Languages and Compilers • Security and Privacy • Theoretical Computer Science • Distributed Systems and Databases • ~10 Post-docs, ~25 PhD Students, ~10 Interns • Located in UPM Montegancedo Campus, Madrid • We are hiring! https://software.imdea.org/

  3. Context: Analytical Databases ▪ OLAP – Online Analytical Processing ▪ Large datasets – up to TBs ▪ Ad-hoc querying to extract insight, recurring reporting – Possibly complex operations ▪ Read-mostly workloads, updates in batches ▪ OLTP – Online Transaction Processing ▪ Smaller datasets ▪ Queries known, relate to business actions ▪ Makes heavy use of indexes ▪ Reads and updates intermixed

  4. Databases were a $25 billion market in 2018… Could we specialize machines for them? Source: https://www.statista.com/statistics/810188/worldwide-commercial-database-market-size/

  5. Database Computer – ’70s “The first goal is to design it with the capability of handling a very large on-line database of 10^10 bytes or beyond since special-purpose machines are not likely to be cost-effective for small databases.” ▪ Fully custom machine for databases ▪ Processors – special ISA microprocessors ▪ Memory – magnetic bubbles and CCDs ▪ Semiconductor technology and general purpose CPUs took over. Jayanta Banerjee, David K. Hsiao, Krishnamurthi Kannan: DBC - A Database Computer for Very Large Databases. IEEE Trans. Computers 28(6): 414-429 (1979)

  6. Gamma Machine – ’80s ▪ Based on VAX multi-processor system ▪ By the time the software and hardware were developed, CPUs had become much faster ▪ Couldn’t keep up with Moore’s law. David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, M. Muralikrishna: GAMMA - A High Performance Dataflow Database Machine. VLDB 1986: 228-237

  7. Specialized Hardware Revival ▪ Data/Compute Gap ▪ CPU Scaling ▪ Commodity in Cloud

  8. Renewed interest in Specialized Hardware: CPUs, FPGAs, ASICs

  9. Re-programmable Specialized Hardware: Field Programmable Gate Array (FPGA) ▪ Free choice of architecture ▪ Fine-grained pipelining, communication, distributed memory ▪ Tradeoff: all “code” occupies chip space ▪ Evolving platform: larger chips, more heterogeneity [Figure: operators Op 1, Op 2, Op 3 laid out on the chip]

  10. Integration Options: 1) On the side 2) In data-path 3) Co-processor [Figure: placement of the accelerator relative to the data for each option]

  11. In the Cloud Today ▪ Accelerator ▪ Amazon F1 ▪ In data path ▪ Microsoft Catapult ▪ Co-processor ▪ Intel Xeon+FPGA [Figure: CPU and FPGA socket layouts of Intel Xeon+FPGA Gen. 1 and Gen. 2]

  12. The Glass Half Empty…

  13. The Glass Half Empty… ▪ 1) On the side acceleration introduces overhead ▪ Much related work offers no real speedup once we factor in data movement, transformation, software overhead… [Figure: bar chart of query execution time, Software vs. With Acceleration, split into Compute and Data Movement, annotated “2x”]
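A small calculation may help show why a faster operator does not automatically give a faster query. This is only a sketch: the function name and all numbers are hypothetical and are not read off the slide's chart.

```python
def end_to_end_speedup(compute_s, accel_factor, movement_s):
    """Speedup of an offloaded query over the software-only baseline.

    compute_s:    software-only compute time in seconds
    accel_factor: how many times faster the accelerator runs the compute part
    movement_s:   extra time spent moving/transforming data for the accelerator
    """
    accelerated = compute_s / accel_factor + movement_s
    return compute_s / accelerated


# Hypothetical numbers: 100 s of compute, accelerator computes 2x faster.
no_overhead = end_to_end_speedup(100.0, 2.0, 0.0)
with_overhead = end_to_end_speedup(100.0, 2.0, 50.0)

print(no_overhead)    # 2.0
print(with_overhead)  # 1.0, the gain is cancelled by data movement
```

Under these assumed numbers, 50 s of added data movement is enough to erase a 2x compute speedup entirely.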

  14. The Glass Half Empty… ▪ 2) “All or nothing” behavior makes query planning difficult ▪ Example: fixed capacity hash table on FPGA ▪ Constant time access for reads and writes ▪ What happens if data doesn’t fit? ▪ Can’t always know the number of keys a priori

  15. The Glass Half Empty… ▪ 3) Analytical databases becoming more optimized / not much compute in core SQL ▪ X100 [CIDR05] showed that <10% of compute time is spent on SQL operators (+, -, *, SUM, AVG) in analytical queries ▪ Columnar stores often memory bound (10s of GB/s)

  16. The Glass Half Empty… ▪ On the side acceleration introduces overhead ▪ “All or nothing” behavior makes query planning difficult ▪ Analytical databases becoming more optimized / not much compute in core SQL

  17. The Glass Half Full… ▪ On the side acceleration introduces overhead ✓ Reduce data movement bottlenecks

  18. Processing in data path: Smart Flash ▪ IBEX: Database storage engine with processing offload ▪ Filter and pre-aggregate for analytic workloads → Larger bandwidth, more IOPS (Samsung YourSQL, MIT BlueDBM) ▪ Opportunity to extend SSDs/Flash with complex offload [Figure: SSD, IBEX and database server; Samsung “smart” SSD] IBEX – An Intelligent Storage Engine with Support for Advanced SQL Off-loading. L. Woods, Z. Istvan and G. Alonso, VLDB’14

  19. Processing in data path: Distributed Processing. Caribou: Distributed storage with processing • Specialized HW nodes • 10Gbps access • 25W power consumption [Figure: Workers (Compute) layer and Storage layer, labeled “+ Provisioning” and “+ Scalability”] Zsolt István, David Sidler, Gustavo Alonso: Caribou: Intelligent Distributed Storage. PVLDB 10(11), 2017.

  20. Smart Storage in Databases: Filter push-down SELECT … FROM customer WHERE age<35 AND purchases>2 AND address LIKE “%PO. Box 123%” ▪ Challenge: guarantee that filtering never slows down retrieval ▪ Algorithms can be re-imagined to become bandwidth-bound instead of compute-bound ▪ Extend the state of the art: parameterization without re-programming [FCCM16] ▪ Many options: Regular expressions, comparisons, decompression, … [Figure: comparison with the Intel Hyperscan library (Xeon E5-2680 v2), annotated “2.8x”] [FCCM16] Runtime Parameterizable Regular Expression Operators for Databases. Zs. Istvan, D. Sidler, G. Alonso. FCCM’16
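The idea of filter push-down can be sketched in software as follows. This is a minimal model, not the hardware design from the talk: the `Customer` record, `storage_scan` and `pushed_down` are hypothetical names, and the sample rows are made up. The predicate mirrors the WHERE clause on the slide, with `LIKE “%…%”` treated as a substring match.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    age: int
    purchases: int
    address: str


def storage_scan(rows, predicate):
    """Hypothetical smart-storage scan: filter rows before they leave storage."""
    for row in rows:
        if predicate(row):
            yield row


def pushed_down(row):
    # age<35 AND purchases>2 AND address LIKE '%PO. Box 123%'
    return row.age < 35 and row.purchases > 2 and "PO. Box 123" in row.address


table = [
    Customer("a", 30, 5, "PO. Box 123, Madrid"),
    Customer("b", 40, 5, "PO. Box 123, Madrid"),
    Customer("c", 30, 1, "PO. Box 123, Madrid"),
    Customer("d", 30, 5, "Main Street 1"),
]

# Only matching rows are shipped to the database server.
shipped = list(storage_scan(table, pushed_down))
print([c.name for c in shipped])  # ['a']
```

In this toy table one of four rows crosses the storage boundary; the benefit in a real system depends on the selectivity of the pushed-down predicate.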

  21. The Glass Half Full… ✓ Reduce data movement bottlenecks ▪ “All or nothing” behavior makes query planning difficult ✓ Hybrid processing

  22. IBEX’s Hybrid Group-by ▪ Group-by: Compute aggregate function over categories ▪ select avg(salary) from employees group by department [Figure: Ibex with SW-only Group-by: input table, selection and projection in Ibex, filtered data sent to the CPU, group-by on the CPU, final groups]

  23. IBEX’s Hybrid Group-by ▪ Group-by: Compute aggregate function over categories ▪ select avg(salary) from employees group by department [Figure: Ibex with HW-only Group-by: input table, selection, projection and group-by in Ibex, final groups sent to the CPU]

  24. IBEX’s Hybrid Group-by ▪ Group-by: Compute aggregate function over categories ▪ select avg(salary) from employees group by department ▪ What if the number of groups does not fit on the FPGA? ▪ Send partial aggregates – finalize in SW ▪ Worst case: same as no acceleration ▪ Best case: All in HW! ▪ Challenge: How to split across accelerator and software? [Figure: Ibex with Hybrid Group-by (partial groups finalized by a group-by on the CPU) next to Ibex with HW-only Group-by (final groups produced in Ibex)]
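The "partial aggregates, finalized in software" idea can be sketched with a fixed-capacity table standing in for the FPGA. This is an assumption-laden model: the flush-when-full policy, the function names `hw_group_by` and `sw_finalize`, and the sample rows are all hypothetical, since the slide does not describe how IBEX handles overflow internally.

```python
from collections import defaultdict


def hw_group_by(rows, capacity):
    """Model of a fixed-capacity aggregation table (stand-in for the FPGA).

    Returns partial aggregates as (key, sum, count). When the table is full
    and a new key arrives, the table is flushed downstream and reused, so one
    key may appear in several partial aggregates. (Assumed policy.)
    """
    table, out = {}, []
    for key, value in rows:
        if key not in table and len(table) == capacity:
            out.extend((k, s, c) for k, (s, c) in table.items())
            table.clear()
        s, c = table.get(key, (0, 0))
        table[key] = (s + value, c + 1)
    out.extend((k, s, c) for k, (s, c) in table.items())
    return out


def sw_finalize(partials):
    """Software side: merge partial aggregates and compute AVG per group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, s, c in partials:
        sums[key] += s
        counts[key] += c
    return {k: sums[k] / counts[k] for k in sums}


# (department, salary) rows for: select avg(salary) ... group by department
rows = [("eng", 100), ("ops", 50), ("eng", 200), ("hr", 70), ("ops", 150)]
partials = hw_group_by(rows, capacity=2)   # three groups do not fit in two slots
averages = sw_finalize(partials)
print(averages)  # {'eng': 150.0, 'ops': 100.0, 'hr': 70.0}
```

With `capacity=3` every group fits and the table emits exactly one aggregate per group, which corresponds to the "all in HW" best case; with a very small capacity the software side ends up doing most of the merging.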

  25. The Glass Half Full ✓ Reduce data movement bottlenecks ✓ Hybrid Processing ▪ Analytical databases becoming more optimized / not much compute in core SQL ✓ Emerging compute-intensive workloads

  26. The Rise of Machine Learning ▪ Databases adopting new ways of analyzing the data ▪ SAP Hana, Oracle, SQL Server, etc. ▪ Specialized hardware can help both with model building [Kara18] and inference [Owaida18] ▪ Benefits for “classical” algorithms as well. [Kara18] Kara et al: ColumnML: Column-Store Machine Learning with On-The-Fly Data Transformation. PVLDB 12(4): 348-361 (2018). [Owaida18] Owaida et al: Application Partitioning on FPGA Clusters: Inference over Decision Tree Ensembles. FPL 2018: 295-300

  27. doppioDB: a hybrid database engine ▪ Goal: extend the capabilities of analytical databases ▪ FPGA works on the same data as software (cache-coherent access) ▪ Can combine SW and HW operators inside the same query ▪ Challenge: ensure high utilization of FPGA, use in many queries [Figure: CPU running the database engine (MonetDB) with software operators and an FPGA co-processor with hardware operators, both accessing DRAM (DB tables); no data copy, transformation, partitioning, etc.]

  28. K-means – Algorithm ◼ Goal: partition unlabeled data into several clusters, where the number of clusters is the “k” in k-means. ◼ Two steps in each iteration: ◼ Assignment: assign each data point to the closest centroid according to a distance metric ◼ Centroid update: the centroids are re-calculated by averaging all the data points within each cluster ◼ Long process if the data set and number of iterations are large
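The two steps can be written down as a short, plain-Python sketch. It assumes squared Euclidean distance, a fixed number of iterations, and that an empty cluster keeps its old centroid; these choices and the sample points are illustrative assumptions, not details from the talk.

```python
def assign(points, centroids):
    """Assignment step: index of the closest centroid (squared Euclidean)."""
    def dist2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))
            for p in points]


def update(points, labels, centroids):
    """Centroid update step: mean of the points assigned to each cluster."""
    new = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:  # empty cluster: keep the old centroid (assumption)
            new.append(old)
            continue
        new.append(tuple(sum(p[d] for p in members) / len(members)
                         for d in range(len(old))))
    return new


def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        labels = assign(points, centroids)
        centroids = update(points, labels, centroids)
    return centroids, assign(points, centroids)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # [(0.0, 0.5), (10.0, 10.5)]
print(labels)     # [0, 0, 1, 1]
```

Every iteration touches every point once per centroid, which is why the run time grows with the data set size, the number of clusters and the number of iterations.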

  29. Design – Execution Walk-Through: 1) Receives K-Means parameters 2) Fetches the initial centroids and the data 3) Calculates the distance between a data point and all the centroids and assigns it to the closest centroid 4) Accumulates data points per cluster and counts how many data points are assigned to each cluster 5) Collects partial results from each pipeline 6) Division for updating new centroid 7) Writes back the final results [Figure: FPGA pipelines labeled with steps 1–7, connected to DRAM (DB Tables)] Zhenhao He, David Sidler, Zsolt István, Gustavo Alonso: A Flexible K-Means Operator for Hybrid Databases. FPL 2018
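Steps 3 to 6 of the walk-through can be mimicked in software as one iteration over several independent "pipelines" whose partial sums and counts are merged before the division. This is a sequential sketch only: the round-robin split of points across pipelines, the function names and the sample data are assumptions, not details of the FPGA design.

```python
def pipeline(chunk, centroids):
    """Steps 3-4 for one pipeline: assign points, accumulate sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def one_iteration(points, centroids, n_pipelines=2):
    # Assumed round-robin distribution of data points over the pipelines.
    chunks = [points[i::n_pipelines] for i in range(n_pipelines)]
    partials = [pipeline(c, centroids) for c in chunks]        # steps 3-4
    k, dim = len(centroids), len(centroids[0])
    new = []
    for j in range(k):
        count = sum(cnt[j] for _, cnt in partials)             # step 5
        if count == 0:  # empty cluster: keep the old centroid (assumption)
            new.append(tuple(centroids[j]))
            continue
        total = [sum(s[j][d] for s, _ in partials) for d in range(dim)]
        new.append(tuple(t / count for t in total))            # step 6
    return new


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
new_centroids = one_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
print(new_centroids)  # [(0.0, 0.5), (10.0, 10.5)]
```

Because sums and counts are additive, the result does not depend on how the points are distributed over the pipelines, which is what makes this step easy to parallelize.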

  30. Uses of Parallelism ▪ K-Means algorithm ▪ FPGA outperforms several cores of the CPU ▪ Can use parallelism in two ways – cover more queries [Figure: two cases, “Need to determine K (Elbow method)” and “K is known / Centroids known”]
