Data Management Research @ UW Seattle uwdb.io
http://uwdb.io/
Research in database systems, theory, and programming languages
Magdalena Balazinska, Alvin Cheung, Dan Suciu
~15 students + postdocs
Research Areas
Big data processing in the cloud
• Theory: optimal query processing
• Systems: Myria, efficient & complex processing at scale, image analytics, DBMS+NN, data summarization
• Usability: Cloud SLAs, performance tuning, viz analytics
New Types of DBMSs
• Open World DBMS
• Image & video DBMS
• LightDB: VR/AR/MR DBMS
Scientific data management
• Collaborations with scientists & deep involvement with eScience Institute
Databases and programming languages
• DBMS & app co-optimization
Probabilistic Databases
Causality
(Walter Cai, Jenny Ortiz, Leilani Battle, Brandon Haynes, Laurel Orr)
Towards Application-Specific Databases uwplse.org uwdb.io
DB optimizer executor
[Figure: specialization of data stores by workload. Systems: Main Memory DB, Column Stores, SparkSQL, Storm, SciDB. Workloads: OLTP, OLAP, Analytics, Streams, Scientific.]
Can we generate customized data stores from application code?
Application Inefficiencies (Cong Yan)

Application                      # stars   # issues
Discourse (forum)                22k       85
Lobster (forum)                  1k        45
Gitlab (collaboration)           49k       23
Redmine (collaboration)          13k       59
Spree (E-commerce)               17k       20
ROR Ecommerce                    1.7k      11
Fulcrum (task mgmt)              697       2
Tracks (task mgmt)               3.5k      30
Diaspora (social network)        18k       57
Onebody (social network)         1.2k      76
Openstreetmap (map)              8k        4
Fallingfruit (map)               1.1k      16
Total                                      428

• Code translated to inefficient queries
• Misplaced computation
• Redundant data loads
• Issuing queries with known results
• Loading unused data
• Missing indexes

78% of fixes took fewer than 5 lines
Max app speedup: 39x
Image Blur Rotate Hash Join Partitioning
SEARCH Target code Proof of translation
PROGRAM SYNTHESIS Target code Proof of translation
Verified Lifting: Casper (Maaz Ahmad)

1. Define semantics of map and reduce
2. Synthesizer infers spec from source
3. Retarget spec to Hadoop (codegen)

Sequential implementation:

    int regress(Point[] points) {
        int SumXY = 0;
        for (Point p : points) {
            SumXY += p.x * p.y;
        }
        return SumXY;
    }

Inferred spec:

    SumXY = reduce(map(points, f_m), f_r)
    f_m(x, y) = x * y
    f_r(v1, v2) = v1 + v2

Generated Hadoop code:

    void map(Object key, Point[] value) {
        for (Point p : value)
            emit("sumxy", p.x * p.y);
    }

    void reduce(Text key, int[] vs) {
        int SumXY = 0;
        for (Integer val : vs)
            SumXY = SumXY + val;
        emit(key, SumXY);
    }

Lifted code can be optimized by Hadoop
Max 32x speedup
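The relationship between the sequential loop and the lifted spec can be checked on a small input. The following is a minimal Python sketch, not Casper's output: the `Point` class, the function names, and the sample data are assumptions made for illustration, and the initial value 0 passed to `reduce` mirrors `int SumXY = 0` in the loop.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Point:
    # Hypothetical stand-in for the Point type on the slide.
    x: int
    y: int


def regress_sequential(points):
    """Sequential loop from the slide: sum of x * y over all points."""
    sum_xy = 0
    for p in points:
        sum_xy += p.x * p.y
    return sum_xy


def f_m(p):
    """Map function from the spec: f_m(x, y) = x * y."""
    return p.x * p.y


def f_r(v1, v2):
    """Reduce function from the spec: f_r(v1, v2) = v1 + v2."""
    return v1 + v2


def regress_lifted(points):
    """Lifted form: SumXY = reduce(map(points, f_m), f_r), starting from 0."""
    return reduce(f_r, map(f_m, points), 0)


# Illustrative data only.
points = [Point(1, 2), Point(3, 4), Point(5, 6)]
print(regress_sequential(points), regress_lifted(points))
```

Because `f_r` is associative and commutative, a framework is free to apply it to partial results in any order, which is what makes the lifted form amenable to parallel execution.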
Q1: SELECT ... FROM ... WHERE ...
Q2: SELECT ... FROM ... WHERE ...
∃ D . Q1(D) ≠ Q2(D) ?
∀ D . Q1(D) = Q2(D)
Query Optimizers, Autograders, Application Caches
The equivalence of two arbitrary relational queries is undecidable (Boris Trakhtenbrot).
A full decision procedure exists for conjunctive queries.
Simple heuristics can already prove many common cases.
Cosette (Shumo Chu, Daniel Li, Nick Anderson)
Q1 =?= Q2
• Rosette (constraint solver): finding counterexamples, Q1 ≠ Q2
• Coq proof assistant: check validity of proofs, Q1 == Q2
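The counterexample side of the question (∃ D . Q1(D) ≠ Q2(D)) can be illustrated with a bounded brute-force search. This is a sketch only: it enumerates small tables in plain Python rather than using a constraint solver, and the queries, the table schema R(a, b), the domain, and the bounds are all hypothetical. Finding no counterexample within the bound is not a proof of equivalence.

```python
from collections import Counter
from itertools import combinations_with_replacement, product


def find_counterexample(q1, q2, domain=(0, 1, 2), arity=2, max_rows=3):
    """Search small tables D for one where q1(D) and q2(D) differ as bags.

    Returns the table as a list of tuples, or None if no difference
    shows up within the bound.
    """
    all_rows = list(product(domain, repeat=arity))
    for n in range(max_rows + 1):
        for table in combinations_with_replacement(all_rows, n):
            d = list(table)
            if Counter(q1(d)) != Counter(q2(d)):
                return d
    return None


# Hypothetical queries over a table R(a, b), written as Python functions.
def q1(r):
    # SELECT a FROM R WHERE a > 0 AND b > 0
    return [a for a, b in r if a > 0 and b > 0]


def q2(r):
    # SELECT a FROM R WHERE b > 0 AND a > 0   (conjuncts swapped)
    return [a for a, b in r if b > 0 and a > 0]


def q3(r):
    # SELECT DISTINCT a FROM R WHERE a > 0 AND b > 0
    return sorted({a for a, b in r if a > 0 and b > 0})


print(find_counterexample(q1, q2))  # no difference expected within the bound
print(find_counterexample(q1, q3))  # a table with a duplicated qualifying row
```

Comparing results as bags (multisets) matters here: `q1` and `q3` agree on every table without duplicate rows, so a set-based comparison would miss the difference.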
[Figure: pipeline that trains a caption-generating model. Stages shown: HTML data, Regex, Filter, Join, Generate Training Labels; Images, CNN (Conv, Conv, ...), RNN, Output; Repeat.]
Many regex and join algorithms to choose from! Likewise for convolution.
Cuttlefish: A Lightweight Primitive for Online Tuning (Tomer Kaftan)

    def loopConvolve(image, filters): ...
    def fftConvolve(image, filters): ...
    def mmConvolve(image, filters): ...

    for image, filters in convolutions:
        start = now()
        result = convolve(image, filters)
        elapsedTime = now() - start
        output result, elapsedTime
Cuttlefish: A Lightweight Primitive for Online Tuning (Tomer Kaftan)

    def loopConvolve(image, filters): ...
    def fftConvolve(image, filters): ...
    def mmConvolve(image, filters): ...

    tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

    for image, filters in convolutions:
        convolve, token = tuner.choose()
        start = now()
        result = convolve(image, filters)
        elapsedTime = now() - start
        tuner.observe(token, elapsedTime)
        output result, elapsedTime
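To make the choose/observe loop concrete, here is a small runnable sketch of a tuner with the same two-method interface. It is not Cuttlefish's implementation: the selection policy is a simple epsilon-greedy rule swapped in for illustration, and the `epsilon` and `seed` parameters and the placeholder convolution bodies are assumptions.

```python
import random
import time


class Tuner:
    """Epsilon-greedy stand-in for the tuner on the slide (hypothetical)."""

    def __init__(self, options, epsilon=0.1, seed=None):
        self.options = list(options)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.total = [0.0] * len(self.options)  # summed observed cost per option
        self.count = [0] * len(self.options)    # number of observations per option

    def choose(self):
        """Return (option, token); the token identifies the choice for observe()."""
        untried = [i for i, c in enumerate(self.count) if c == 0]
        if untried:
            i = untried[0]                       # try every option at least once
        elif self.rng.random() < self.epsilon:
            i = self.rng.randrange(len(self.options))  # explore
        else:
            # Exploit: pick the option with the lowest mean observed cost.
            i = min(range(len(self.options)),
                    key=lambda k: self.total[k] / self.count[k])
        return self.options[i], i

    def observe(self, token, cost):
        """Record the cost (for example, elapsed time) of a previous choice."""
        self.total[token] += cost
        self.count[token] += 1


# Placeholder variants; real implementations would differ in algorithm and speed.
def loop_convolve(image, filters):
    return sum(image) * sum(filters)


def fft_convolve(image, filters):
    return sum(image) * sum(filters)


def mm_convolve(image, filters):
    return sum(image) * sum(filters)


tuner = Tuner([loop_convolve, fft_convolve, mm_convolve], seed=0)
for image, filters in [([1, 2, 3], [1, 0])] * 20:
    convolve, token = tuner.choose()
    start = time.perf_counter()
    result = convolve(image, filters)
    tuner.observe(token, time.perf_counter() - start)
```

Returning a token from `choose()` keeps the tuner stateless between the two calls, so several rounds can be in flight before their costs are reported.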
Note: Y-axis is Log-scale
Scythe (Chenglong Wang)

Input tables (stored using specialized data structures):

    id   date
    1    12/25
    2    11/21
    4    12/24
    …    …

Output tables:

    id   date    max
    1    12/25   30
    2    11/21   10
    4    12/24   20
    …    …       …

• Search for abstract queries
• Prune query skeletons
• Instantiate abstract queries
• Rank results based on simplicity
Scythe (Chenglong Wang)
Supported features
• SPJ
• Grouping
• Aggregation
• Subqueries
• Outer join
• Exists
• Union
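The idea of synthesizing a query from input and output tables can be illustrated with a toy enumerate-and-check search. This is not Scythe's algorithm: it covers a single query shape (GROUP BY with one aggregate), skips abstract queries and pruning entirely, and the `val` column and all row values beyond those on the slide are hypothetical.

```python
from itertools import combinations

AGGREGATES = {"max": max, "min": min, "sum": sum, "count": len}


def run_group_by(rows, keys, agg_name, col):
    """Evaluate SELECT keys, agg(col) FROM rows GROUP BY keys as sorted tuples."""
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[k] for k in keys), []).append(row[col])
    return sorted(key + (AGGREGATES[agg_name](vals),)
                  for key, vals in groups.items())


def synthesize(rows, columns, expected):
    """Return every (keys, aggregate, column) whose result equals expected.

    Candidates with fewer group-by keys are tried, and so listed, first.
    """
    target = sorted(expected)
    matches = []
    for n in range(1, len(columns)):
        for keys in combinations(columns, n):
            for col in columns:
                if col in keys:
                    continue
                for agg_name in AGGREGATES:
                    try:
                        result = run_group_by(rows, keys, agg_name, col)
                    except TypeError:
                        continue  # e.g. sum() over a text column
                    if result == target:
                        matches.append((keys, agg_name, col))
    return matches


# Hypothetical input table; only id and date appear on the slide.
columns = ["id", "date", "val"]
rows = [
    {"id": 1, "date": "12/25", "val": 30},
    {"id": 1, "date": "12/25", "val": 10},
    {"id": 2, "date": "11/21", "val": 10},
    {"id": 4, "date": "12/24", "val": 20},
    {"id": 4, "date": "12/24", "val": 5},
]
expected = [(1, "12/25", 30), (2, "11/21", 10), (4, "12/24", 20)]

for keys, agg_name, col in synthesize(rows, columns, expected):
    print(f"SELECT {', '.join(keys)}, {agg_name}({col}) GROUP BY {', '.join(keys)}")
```

Exhaustive enumeration like this grows quickly with the number of columns and query shapes, which is the cost that searching over abstract queries and pruning skeletons is meant to reduce.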
Titles summarize post 80% of the time

Stack Overflow dataset
• Posts tagged with #sql, #oracle, #database (430k)
• Posts containing an accepted answer in SQL
• Results: 41k (title, query) pairs

Filtered away titles
• My query doesn't work!
• Why is my query slow?
• I hate SQL!
Srini Iyer

Model               Naturalness   Informativeness
Code-NN (Ours)      2.6           1.55
Nearest neighbor    1.9           1.55
MOSES               1.76          1.36
ATTEN               2.82          0.93
UWDB Collaborators

UW
• Bill Howe (iSchool)
• Andrew Connolly (Astronomy)
• Aaron Lee (Ophthalmology)
• Ariel Rokem (eScience)
• Emilio Zagheni (Sociology)
• Prog Lang & SW Eng group

Industry
• Adobe
• Huawei
• Intel
• Microsoft
• Teradata