  1. From HyPer to Hyper: Integrating an academic DBMS into a leading analytics and business intelligence platform. Tobias Muehlbauer (tmuehlbauer@tableau.com), Jan Finis (jfinis@tableau.com)

  2. The Story of Hyper

  3. The story of Hyper:
     • 2008: HyPer starts as a research project at the Technical University of Munich
     • Academic success: more than 50 peer-reviewed publications and several awards
     • Commercial spin-off
     • March 2016: Tableau acquires HyPer (HyPer ⇨ Hyper); Tableau Europe R&D center in Munich with over 30 full-time employees
     • Early 2018: Hyper replaces the Tableau Data Engine in existing on-prem and SaaS products
     • Mid 2018: Tableau Prep launches with Hyper as its processing engine
     • Mid 2019: The Hyper API exposes full Hyper SQL capabilities to partners and customers
     • 2020: Hyper as a general-purpose database service in the Tableau platform

  4. Context: One Size Fits All? One size does not fit all, but what about data lag and disconnect?

  5. Context: A Changing Hardware Landscape In order to leverage modern hardware, databases need to change.

  6. The Idea Behind Hyper

  7. Inside Hyper
     [Architecture diagram: a main-memory column store with JIT-compiled tuple-at-a-time query pipelines and interpreted vectorized scans; scans run over compressed Data Blocks (with PSMAs and SARGable predicates evaluated on compressed data) and over uncompressed chunks of e.g. 8192 records; optimized for the storage hierarchy from registers and caches through DRAM and NVRAM down to local SSD/HDD and network; fast MVCC keeps version information in an additional hidden column of the base relation and physical before-image deltas in undo buffers (indexes only store rowID references to base relations); morsel-driven parallelization with a dispatcher handing morsels to worker threads; advanced query optimization.]
     Key points: Optimized for Storage Hierarchy · Query Compilation & Vectorized Scans · Advanced Query Optimization · Fast MVCC · Morsel-Driven Parallelization
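
The morsel-driven parallelization on this slide can be illustrated with a small sketch (Python here for brevity; Hyper itself is C++): worker threads pull fixed-size morsels of the input from a shared queue, so faster threads naturally take on more work and local results are merged at the end. Function names and sizes below are illustrative assumptions, not Hyper's actual interface.

```python
import threading
import queue

def parallel_sum(data, num_workers=4, morsel_size=1024):
    """Sum `data` by splitting it into morsels that idle workers pull
    from a shared queue: a toy model of morsel-driven parallelism."""
    morsels = queue.Queue()
    for start in range(0, len(data), morsel_size):
        morsels.put(data[start:start + morsel_size])

    partials = []
    lock = threading.Lock()

    def worker():
        local = 0
        while True:
            try:
                morsel = morsels.get_nowait()   # pull the next morsel
            except queue.Empty:
                break                           # no work left
            local += sum(morsel)                # process one morsel
        with lock:
            partials.append(local)              # merge thread-local result

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)
```

Because workers pull work instead of being assigned fixed ranges, a thread that gets scheduled on a slow core simply processes fewer morsels, which is the point of the scheme.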

  8. Hyper in Tableau (Desktop, Online, Public, Prep, Server): Federation, Prep, Extract, Hyper API

  9. From HyPer to Hyper: Challenges
     • Support: limited support provided up to 30 months after a major product version release; compare performance across releases and database engines; semantic differences
     • Infrastructure: Windows, Linux, and macOS; small laptops to large-scale servers and Cloud deployments
     • Workload: long tail of query complexity generated by Tableau; wide variety of data set characteristics

  10. Tableau Workloads

  11. What Tableau Workloads Look Like
      • Most queries are “small”: only 0.5% are larger than 5KB of SQL text
      • But: huge outliers. Largest query in our data set: 6.7MB; largest query we have seen so far: 27MB
      • And that’s not all due to constant strings…

  12. Need a query plan visualizer? https://github.com/tableau/query-graphs/

  13. What Tableau Workloads Look Like
      [Chart: number of relational operators per query (Table Scan, Group By, Sort, Percentiles, Inner Join, Outer Join, Group Join); most queries use few operators, with a long tail reaching toward 1000 operators.]
      Vogelsgesang et al.: Get Real: How Benchmarks Fail to Represent the Real World. DBTest’18

  14. Replacing Tableau’s Old Data Engine

  15. Replacing Tableau’s Old Data Engine
      Tableau’s old data engine (TDE): a vector-based engine inspired by MonetDB/X100
      First step: replace TDE as the backend of all Tableau products
      Goals:
      • Deliver performance at scale
      • Seamless transition for customers

  16. TDE: The Gold Standard Having a gold standard is great! We simply ran a large set (60k) of workbooks from Tableau Public and compared results. Simple, measurable goal: get the results to match for all of them and be fast; then we’re done ☺

  17. Challenge: Bug Compatibility? Is it really worth it to show the same result for all queries?
      • Non-deterministic behavior (parallelization!)
      • Bugs in TDE
      Our attitude changed over time:
      1. 100% same results at all cost; customers don’t want their viz to change! This is non-negotiable!
      2. Well, but what if it changes to be correct?
      3. Who said that visualizations can’t change in the first place?
      4. Let’s do the right thing and fix things once and for all!

  18. Compatibility Curiosity: String-to-Date Cast
      Is “5/7/2020” April 7 or July 5? TDE’s strategy: be aggressive in finding a valid date. Sounds great! But horrible in the relational model!
      Input       | TDE            | Hyper
      5/7/2020    | April 7, 2020  | April 7, 2020
      15/7/2020   | July 15, 2020  | NULL
      Silent failure: a sales workbook showed more sales in the first 12 days of each month!
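
The two casting strategies on this slide can be sketched as follows. This is a hypothetical Python illustration; the function names and the format list are assumptions, not TDE's or Hyper's actual code.

```python
from datetime import datetime

def cast_strict(s):
    """Hyper-style sketch: accept one well-defined format; otherwise
    return NULL (None) so the failure is visible in the relational model."""
    try:
        return datetime.strptime(s, "%m/%d/%Y").date()
    except ValueError:
        return None

def cast_aggressive(s):
    """TDE-style sketch: keep trying formats until one yields a valid
    date. Convenient, but ambiguous inputs are resolved silently."""
    for fmt in ("%m/%d/%Y", "%d/%m/%Y"):   # hypothetical format order
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue
    return None
```

With these definitions, `cast_aggressive("15/7/2020")` silently falls through to day-first parsing and produces a date, while `cast_strict` returns None, so the bad input surfaces instead of skewing aggregates.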

  19. Improvement over TDE
      [Chart: query runtimes of Hyper vs. TDE across successive releases]
      Continuous performance improvements past launch

  20. Testing Hyper

  21. SQL Level Testing
      We started with the SQLite test suite
      • Added our own test cases for features
      • Added regression tests for defects
      • Added fine-grained expectations (e.g., constant folding)
      Example test file:
          query N expectConstantResult
          SELECT DATE '2001-09-28' + INTEGER '7'
          ----
          2001-10-05

          # DATE + INTEGER -> DATE with overflow
          statement error 22003
          SELECT DATE '4713-01-01 BC' - 1
      How to execute the tests?
      • First: our own Hyper front end that parses the file and executes the queries. Problem: server / protocol code is not tested
      • Second: a client (based on libpq) that parses the file and sends the queries to a server. Problem: harder to debug; the test driver is not the same process
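
A minimal sketch of such a test driver (the first approach: parse the file, execute the queries yourself) might look like this in Python, run here against SQLite instead of Hyper. It handles a simplified subset of the test syntax: a `query` header, the SQL, a `----` separator, then the expected rows.

```python
import sqlite3

def run_test_file(text, conn):
    """Execute a simplified SQLite-style test file against `conn`.
    Returns a list of (sql, passed) pairs, one per `query` block."""
    # Drop blank lines and comments; the real format has more directives.
    lines = [l for l in text.splitlines() if l.strip() and not l.startswith("#")]
    results, i = [], 0
    while i < len(lines):
        assert lines[i].startswith("query"), "expected a `query` header"
        sql, i = [], i + 1
        while lines[i] != "----":          # collect the SQL statement
            sql.append(lines[i])
            i += 1
        i += 1                             # skip the ---- separator
        expected = []
        while i < len(lines) and not lines[i].startswith("query"):
            expected.append(lines[i])      # collect expected result rows
            i += 1
        rows = conn.execute("\n".join(sql)).fetchall()
        actual = [" ".join(str(v) for v in r) for r in rows]
        results.append(("\n".join(sql), actual == expected))
    return results
```

Usage: `run_test_file("query I\nSELECT 1+1\n----\n2\n", sqlite3.connect(":memory:"))` returns `[('SELECT 1+1', True)]`. As the slide notes, a driver like this never exercises the server/protocol path, which is why a second, client-based driver was added.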

  22. Testing Even More with SQL Level Tests
      • Use the EXPLAIN statement to test the optimizer
      • Introduce a function to scan our own log, to test for log messages
      • Introduce trace settings that allow printing specific internals to the log
      • Special test functions with side effects to test further internals, e.g., a function that allocates thread-local memory, or SELECT suicide()
      Example: testing deduplication for simple domain queries
          query S
          EXPLAIN SELECT * FROM
            (SELECT a FROM t GROUP BY 1) t1,
            (SELECT a FROM t GROUP BY 1) t2
          ----
          executiontarget(1)
          join(2) bnl
          explicitscan(3)
          groupby(4)
          tablescan(5) t
          explicitscan(6)->(4)
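
The idea of asserting on EXPLAIN output works with any engine that exposes its plans. As a stand-in for Hyper's EXPLAIN, here is a sketch against SQLite's EXPLAIN QUERY PLAN; the helper name is an invention for this example.

```python
import sqlite3

def plan_lines(conn, sql):
    """Return the textual plan rows for `sql`, so a test can assert on
    the plan's shape rather than on query results."""
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(a INTEGER)")
conn.execute("CREATE INDEX t_a ON t(a)")

# Optimizer-level expectation: the point lookup should use the index,
# not a full table scan.
plan = plan_lines(conn, "SELECT a FROM t WHERE a = 42")
assert any("INDEX" in line.upper() for line in plan)
```

This is exactly the pattern the slide describes: the expected plan text becomes the test expectation, so an optimizer regression (e.g., losing the index access) fails the test even if results stay correct.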

  23. SQL vs. C++ Unit Tests (big controversy!)
      SQL:
      • Easy to write, usually very succinct
      • No recompilation needed
      • In vivo: can run code in specific query contexts
      • Can update expected test results automatically
      • Resembles customer usage of the system
      C++:
      • Only tests the code in question, not the whole SQL layer
      • Good test failure reporting
      • Runs faster

  24. Beyond SQL Testing
      SQLite tests are great, but they can’t simulate load from multiple connections.
      Solution: a loadtest DSL
      • Open connections
      • Embed SQLite test statements
      • Execute code blocks in parallel
      • Loops
      Example:
          exec CREATE DATABASE mytestdb;
          connection mytestdb user=bob {
            repeat 100 {
              exec CREATE TABLE foo AS (SELECT x FROM generate_series(1,1e6) x);
              parallel 10 {
                exec UPDATE foo SET x = x+10 WHERE x % 2 = 0;
              } and 2 {
                test {
                  query N SELECT SUM(x) FROM foo WHERE x % 2 = 1;
                  ----
                  1234567
                }
              }
            }
          }
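
The core of such a DSL, running several blocks concurrently with each block repeated a number of times, can be modeled in a few lines. A toy Python analogue follows; the `parallel` helper and the counter example are inventions for illustration, not the actual DSL implementation.

```python
import threading

def parallel(*branches):
    """Toy analogue of the DSL's `parallel N { ... } and M { ... }`:
    each branch is a (repetitions, fn) pair; all branches run
    concurrently and the call returns once every branch has finished."""
    def run(reps, fn):
        for _ in range(reps):
            fn()
    threads = [threading.Thread(target=run, args=b) for b in branches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Example: two concurrent branches hammering a shared counter,
# standing in for concurrent UPDATE and SELECT blocks.
counter, lock = {"n": 0}, threading.Lock()
def bump():
    with lock:
        counter["n"] += 1
parallel((100, bump), (50, bump))
```

In the real DSL each branch would hold its own database connection and issue SQL, which is precisely what single-connection SQLite-style tests cannot express.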

  25. A/B Testing: MaxPerf / QueryRunner
      1. Goal: check compatibility with TDE
         I. Correctness and performance
         II. A/B test on the 60k Public workbooks (improvement of A over B)
      2. How it works
         I. Starts up Hyper
         II. Loads a workbook (thus sending queries to Hyper)
         III. Checks the number of marks / mark values and records times
         IV. A/B test between the old and new branch

  26. Automatic Regression Testing
      1. Measure performance on every commit
      2. If performance regresses, file a defect
      3. Make sure it’s not just noise
      4. If performance improves, make it the new expectation
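
Step 3 ("make sure it's not just noise") can be as simple as comparing new timings against the spread of the baseline. A sketch with invented names and thresholds; real systems would use more measurements and a proper statistical test:

```python
import statistics

def is_regression(baseline_ms, current_ms, threshold=3.0):
    """Flag a performance regression only if the new timings lie more
    than `threshold` standard deviations above the baseline mean,
    a simple guard against filing defects for mere measurement noise."""
    mean = statistics.mean(baseline_ms)
    stdev = statistics.stdev(baseline_ms)
    # max(..., 1e-9) avoids a zero-width band for perfectly stable baselines.
    return statistics.mean(current_ms) > mean + threshold * max(stdev, 1e-9)
```

Step 4 is then the inverse direction: when `current_ms` is significantly *below* the baseline, the baseline itself is replaced so the improvement becomes the new expectation.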

  27. Fuzzing
      1. Use a fuzzer (e.g., AFL)
      2. Feed it the SQL grammar
      3. Let it run for a long time
      Found several vulnerabilities and defects with fuzzing!
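
Feeding a fuzzer the SQL grammar amounts to generating random sentences from that grammar and throwing them at the engine. A self-contained sketch with a deliberately tiny, hypothetical grammar; a real setup would use a fuzzer like AFL with the full grammar and run for days:

```python
import random

# A hypothetical miniature SQL grammar: each nonterminal maps to a list
# of alternative expansions; plain strings are terminals.
GRAMMAR = {
    "query": [["SELECT ", "expr", " FROM t", "where"]],
    "where": [[""], [" WHERE ", "expr", " = ", "expr"]],
    "expr":  [["a"], ["b"], ["(", "expr", " + ", "expr", ")"]],
}

def generate(symbol="query", depth=0, rng=random):
    """Expand a grammar symbol into a random SQL string."""
    if symbol not in GRAMMAR:
        return symbol                      # terminal: emit literal text
    options = GRAMMAR[symbol]
    # Force the first (non-recursive) alternative once we go deep,
    # so generation always terminates.
    choice = options[0] if depth > 8 else rng.choice(options)
    return "".join(generate(s, depth + 1, rng) for s in choice)
```

Each call to `generate()` yields a syntactically valid query such as `SELECT (a + b) FROM t WHERE a = b`; the fuzzer's job is then to mutate and prioritize inputs that reach new code paths in the parser, optimizer, and executor.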
