
Closing the Loop on Data Analysis

10/30/17. Closing the Loop on Data Analysis. eugenewu.net, assistant professor, Columbia University, Data Science Institute.
Agenda: Smoke (Fast Lineage + Interactions); Precision Interfaces (Interface for All Analyses); Scorpion (Explaining Outliers).


  1. Data Visualization Management System (DVMS)
     Co-design; end-to-end, human-in-the-loop data analysis.
     [Diagram labels: SQL, DB, Vis, ML / DB / BI / …, "The World"]
     [Image: Unflattening, Nick Sousanis]

  2. DVMS Projects
     [Diagram labels: Vis; ML / DB / BI / …; "The World". Questions on the diagram: "How to create and scale?", "What interfaces to create?", "Why?", "Prep the data?"]
     DeVIL: Use human limits [CIDR '17]
     SMOKE: Lineage for Interactive Vis [revision]
     VISTREAM: Prefetching architecture [in progress]
     PI: Scalable Interface Generation [HILDA '17]
     S4: Spreadsheet-style search [SIGMOD '15]
     PVD: Physical Visualization Design [in progress]
     Scorpion: Explaining Outliers [VLDB '13]
     NeuroFlash: Explaining Neural Networks [in progress]
     Explaining Social Media Popularities [in progress]
     ActiveClean: Interactive Cleaning for ML [VLDB '16]
     QFix: Cleaning past queries [SIGMOD '17]
     PreCog: Quality Push-down [in review]

     Agenda: Smoke (Fast Lineage + Interactions)
     [Diagram: Result 1 and Result 2, connected by backward_trace(), forward_trace(), and view_refresh()]

     Smoke: Fast Lineage + Interactions
     [Diagram: linked views Revenue, Profit, Price, Product. An interaction runs refresh(backward_trace( ,input)): backward_trace() from the selected marks to the input, then view_refresh() on the other views.]
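
The refresh(backward_trace( ,input)) pattern on the Smoke slide can be illustrated with a small sketch. This is not Smoke's implementation: the helper names `build_view` and `cross_filter`, the column names, and the sample rows are all hypothetical, and lineage is captured here as plain Python lists of input row ids.

```python
from collections import defaultdict

def build_view(rows, key_col):
    """COUNT(*) GROUP BY key_col, recording which input rids feed each bar."""
    counts = defaultdict(int)
    lineage = defaultdict(list)  # bar key -> input row ids (backward lineage)
    for rid, row in enumerate(rows):
        key = row[key_col]
        counts[key] += 1
        lineage[key].append(rid)
    return dict(counts), dict(lineage)

def cross_filter(rows, lineage, clicked_key, other_col):
    """Backward-trace the clicked bar to its input rows, then refresh another view."""
    selected = [rows[rid] for rid in lineage[clicked_key]]  # backward trace
    refreshed, _ = build_view(selected, other_col)          # view refresh
    return refreshed

# Hypothetical base table shared by two linked views.
rows = [
    {"product": "x", "price": 10},
    {"product": "x", "price": 20},
    {"product": "y", "price": 10},
]
by_product, product_lineage = build_view(rows, "product")
price_after_click = cross_filter(rows, product_lineage, "x", "price")
```

Clicking the "x" bar in the product view touches only the rows recorded in that bar's lineage, so the price view is recomputed from two rows instead of a full scan.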

  3. Smoke: Fast Lineage + Interactions
     interaction(vis(database))

     SPLOT = SELECT 8 AS radius, 'gray' AS stroke, 'gray' AS fill,
                    lscale(revenue, sx) AS center_x,
                    lscale(profit, sy) AS center_y
             FROM A, B, sx, sy WHERE …;
     HIST  = SELECT 4 AS width, 'blue' AS fill,
                    hscale(price, hx) AS height
             FROM B, C, hx WHERE …;
     render (SELECT * FROM SPLOT);
     render (SELECT * FROM HIST);

     BT = BACKWARD TRACE FROM HIST@vnow-1 AS HS, clicked
          WHERE clicked.id = HS.id
          TO A;
     SPLOT = SELECT ..., 'red' AS fill FROM BT, B WHERE …
             UNION
             SELECT ..., 'gray' AS fill FROM (A EXCEPT BT), B WHERE …
     HIST  = SELECT ..., 'red' AS fill FROM BT, C WHERE …
             UNION
             SELECT ..., 'blue' AS fill FROM (A EXCEPT BT), C WHERE …

     SQL(Lineage( ))
     [Diagram: linked views Revenue, Profit, Price, Product with refresh(backward_trace( ,input)), backward_trace(), view_refresh()]

     Fine-grained Lineage Capture
     Query: γ_{id, sum(qty∗$)}(A ⨝ B)

     A        id   $
     a1       1    40
     a2       2    5

     B        id   qty
     b1       1    6
     b2       1    1
     b3       2    9

     A ⨝ B    id   qty  $
     j1       1    6    40
     j2       1    1    40
     j3       2    9    5

     Output   id   sum
     o1       1    280
     o2       2    45

     How do people capture lineage today?
     Lazy, aka don't capture
     Eager via query rewrites
     Eager via instrumentation
     Goal: capture the lineage graph with low overhead to answer lineage queries efficiently.

     Lazy Approach [Cui et al. and Ikeda et al.]
     Rewrite lineage queries into SQL. Example: Backward_trace(o1, B) = σ_{id=1}(B)
     PROS: No capture overhead. Good for high selectivity.
     CONS: Bad for low selectivity. No support for non-invertible ops. Complex rewrite predicates.

     Eager Logical Denormalized [Perm, GProM, and DBNotes]
     Rewrite the original query into a single big query (γ' over ⨝') whose output carries the contributing input tuples.

                         A          B
     Output'  id   $     pid  $     pid  qty
     o1       1    280   1    40    1    6
     o2       1    280   1    40    1    1
     o3       2    45    2    5     2    9

     PROS: Leverage DB query opts. Flexibility. Use existing database.
     CONS: Introduces redundancy. Result must be further processed. Index result to use it. Addtl projection to get real result.
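
The Lazy and Eager Logical Denormalized slides above can be compared on the slide's own example tables. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the `$` column is renamed `price`, and plain Python loops stand in for a database engine.

```python
# Example relations from the slide, keyed by row label.
A = {"a1": {"id": 1, "price": 40}, "a2": {"id": 2, "price": 5}}
B = {"b1": {"id": 1, "qty": 6}, "b2": {"id": 1, "qty": 1}, "b3": {"id": 2, "qty": 9}}

def run_query(A, B):
    """Group A join B by id and sum qty * price, with no lineage capture."""
    sums = {}
    for a in A.values():
        for b in B.values():
            if a["id"] == b["id"]:
                sums[a["id"]] = sums.get(a["id"], 0) + b["qty"] * a["price"]
    return sums

def lazy_backward_trace(group_id, relation):
    """Lazy: answer the lineage query as a selection on the input (id = group_id)."""
    return sorted(label for label, row in relation.items() if row["id"] == group_id)

def eager_denormalized(A, B):
    """Eager logical: each result row repeats the output plus one contributing input pair."""
    sums = run_query(A, B)
    rows = []
    for a_label, a in A.items():
        for b_label, b in B.items():
            if a["id"] == b["id"]:
                rows.append((a["id"], sums[a["id"]], a_label, b_label))
    return rows

result = run_query(A, B)
traced = lazy_backward_trace(1, B)
denormalized = eager_denormalized(A, B)
```

The lazy trace pays nothing at query time but rescans B for every lineage question, while the denormalized result repeats the output (1, 280) once per contributing pair, which is the redundancy listed in the cons.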

  4. Eager Logical Normalized [Trio, and DBNotes]
     Lineage is stored in separate tables, e.g. a table mapping A to O with rows (1, 1) and (2, 2).
     PROS: Reduces redundancy. Easily add annotations. Use existing database.
     CONS: Extra Qs to make lineage tables. Need to index lineage tables. Lineage tracing requires join.

     Eager Physical [Subzero, NewT, Ramp, Clothia et al., Titian]
     Operators send lineage pairs such as (a1, j1) and (j1, o1) to a Lineage Subsystem.
     PROS: Avoid relational overheads. Control over physical rep. Easier to integrate.
     CONS: RPC/virtual function calls expensive. Write-inefficient lineage storage. No physical plan optimizations.

     Can Lineage Express Interactions?
     Performance issues: high capture overhead; slow lineage tracing.
     Issues come from: redundant work; inefficient representations; per-pointer overheads.
     Are these issues necessary? No. See Smoke.

     4 Design Principles
     Make capture fast:
       Tight Integration: operator instrumentation; write-efficient lineage idxs
       Reuse work: lineage indexes ≈ hash tables; intra-plan hash table reuse
     Workload-based optimizations (lineage workload):
       Apriori Knowledge: don't capture if not used
       Lineage Consumption: push computation into lineage capture

     Smoke Overview
     Eager physical approach
     Write- and read-efficient lineage indexes
     Two instrumentation approaches
     [Diagram: Result 1, Result 2]

     Lineage Index Representation
     [Diagram: an operator Op with input and output rids; a rid index and a rid array; N-to-1 and 1-to-1 relationships]
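
One way to picture the rid index and rid array from the Lineage Index Representation slide is the sketch below. It is an illustration, not Smoke's data structure: the class name `LineageIndex` and its layout are assumptions, with a list of rid lists for the backward direction and a flat rid array for the forward direction of an N-to-1 operator.

```python
from typing import List

class LineageIndex:
    """Hypothetical lineage index for one N-to-1 operator such as GROUP BY."""

    def __init__(self, num_inputs: int) -> None:
        # Backward: output rid -> list of input rids (one rid list per output).
        self.backward: List[List[int]] = []
        # Forward: input rid -> output rid (flat rid array, -1 means unset).
        self.forward: List[int] = [-1] * num_inputs

    def add_output(self) -> int:
        """Register a new output row and return its rid."""
        self.backward.append([])
        return len(self.backward) - 1

    def link(self, in_rid: int, out_rid: int) -> None:
        """Record that input row in_rid contributed to output row out_rid."""
        self.backward[out_rid].append(in_rid)
        self.forward[in_rid] = out_rid

    def backward_trace(self, out_rid: int) -> List[int]:
        return list(self.backward[out_rid])

    def forward_trace(self, in_rid: int) -> int:
        return self.forward[in_rid]

# Inputs j1, j2, j3 are rids 0, 1, 2; outputs o1, o2 are rids 0, 1.
index = LineageIndex(num_inputs=3)
o1 = index.add_output()
o2 = index.add_output()
index.link(0, o1)
index.link(1, o1)
index.link(2, o2)
```

Both directions are positional lookups, so a trace is an array access rather than a join against a lineage table.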

  5. Two Capture Approaches
     Defer and Inject.
     [Diagram: the group-by γ is split into γ build and γ agg over input rows j1 (1, 6, 40), j2 (1, 1, 40), j3 (2, 9, 5) with columns id, qty, $]

     Inject Capture for GROUPBY
     [Diagram, three build steps: the γ build hash table grows from (1, 1) to (1, 2), (2, 1); the entry for id 1 records j1, j2 and the entry for id 2 records j3; γ agg emits o1 and o2 linked to those rows]

     Experiments
     Setup
       Custom in-memory query compiled engine
       Execution comparable with MonetDB
       Smoke vs Logical vs Subsystem vs Lazy
       TPC-H, synthetic, and cross-filter
     Smoke is fast
       Lowest capture overhead
       Fastest tracing & lineage query perf
       Interactive capture and tracing speeds

     Capture Overhead: GROUPBY
     SELECT z, COUNT(*), SUM(v), SUM(v*v), SUM(sqrt(v)), MIN(v), MAX(v)
     FROM zipf GROUP BY z
     Smoke-I best overall → 0.7x overhead
     Array resizing → ~1/2 of Smoke overhead
     Write-inefficient idxs → ~4x overhead
     Virtual function calls → ~1.6x
     Logical penalized by denormalized representation
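
The Inject and Defer approaches named above can be sketched for a hash-based GROUP BY. This is one plausible reading of the slides, not the actual implementation: the function names are hypothetical, the aggregate is sum(qty * price) from the earlier example rather than the values drawn on the GROUPBY slides, and Defer is modeled as a second pass that probes the operator's own hash table.

```python
def groupby_inject(rows):
    """Inject: write lineage while the hash table is being built."""
    table = {}     # group key -> output rid (the operator's own hash table)
    outputs = []   # output rid -> [key, running sum]
    backward = []  # output rid -> input rids
    forward = []   # input rid -> output rid
    for in_rid, (key, qty, price) in enumerate(rows):
        out_rid = table.get(key)
        if out_rid is None:
            out_rid = len(outputs)
            table[key] = out_rid
            outputs.append([key, 0])
            backward.append([])
        outputs[out_rid][1] += qty * price
        backward[out_rid].append(in_rid)  # lineage captured inline
        forward.append(out_rid)
    return [tuple(o) for o in outputs], backward, forward

def groupby_defer(rows):
    """Defer: run the plain aggregation, then build lineage in a second pass."""
    table = {}
    outputs = []
    for key, qty, price in rows:
        out_rid = table.setdefault(key, len(outputs))
        if out_rid == len(outputs):
            outputs.append([key, 0])
        outputs[out_rid][1] += qty * price
    # Second pass: reuse the hash table to map each input rid to its group.
    backward = [[] for _ in outputs]
    forward = []
    for in_rid, (key, _qty, _price) in enumerate(rows):
        out_rid = table[key]
        backward[out_rid].append(in_rid)
        forward.append(out_rid)
    return [tuple(o) for o in outputs], backward, forward

# Join rows j1, j2, j3 as (id, qty, price).
J = [(1, 6, 40), (1, 1, 40), (2, 9, 5)]
injected = groupby_inject(J)
deferred = groupby_defer(J)
```

Both variants produce the same indexes. In this sketch Inject pays for lineage writes during the build, while Defer keeps the aggregation loop untouched and spends an extra scan afterwards.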

  6. Cross Filtering Experiments
     [Charts: cross-filtering results; no values recoverable]

     Stepping Back
     Lineage capture for high-throughput workflows need not cripple normal execution.
     Lineage capture can directly create idxs and pre-compute results for future Qs. Smells like cracking. Working on partial data cubes.
     Lineage tracing Qs are fast enough for interactive vis.
     Extending to other applications, e.g., ML.

     Agenda: Precision Interfaces (Interface for All Analyses)

     Building Interfaces
     Interfaces Make Life Easy
     [Diagram: SQL, Specs, Engineering]
     It takes work!

  7. Interfaces for Everybody
     [Chart: "# people who do the task" for each task i, split into "Worth building" and "Not worth building"]

     Our Vision: An Interface for Every Task

     Existing Approach 1: Help Developers Build Interfaces (Specs, Engineering)
     Existing Approach 2: Non-programmers Program (Specs, Engineering)

     Precision Interfaces (PI): Read Minds, Gen. Interface
     Precision Interfaces (PI): Mine Logs, Gen. Interface
     [Diagram labels: SQL, sparQL]
