Background AGM Friedgut Optimal Algorithms Summary Multi-join Query Evaluation on Big Data Lecture 1 Dan Suciu March, 2015 Dan Suciu Multi-Joins – Lecture 1 March, 2015 1 / 34
Background AGM Friedgut Optimal Algorithms Summary About Me Orignally from Romania Had fun with Math: 1976 IMO PhD from University of Pennsylvania: Parallel Query Languages Bell Labs and AT&T Labs: Semistructured Data, XML University of Washington: data privacy, probabilistic data, Big Data Today’s topic: Big Data! Dan Suciu Multi-Joins – Lecture 1 March, 2015 2 / 34
Background AGM Friedgut Optimal Algorithms Summary Course Organization Four lectures (1.5h): slides available on the course Website Two sections (1h): mostly interactive A problem set to pass the course: seven problems (simple to challenging); email me your solutions by April 30, 2015. I hope you can attend all lectures and sections: you need them in order to solve the problems. Dan Suciu Multi-Joins – Lecture 1 March, 2015 3 / 34
Background AGM Friedgut Optimal Algorithms Summary Multi-join Query Evaluation – Outline Part 1 Optimal Sequential Algorithms. Thursday 14:15-15:45 Part 2 Lower bounds for Parallel Algorithms. Friday 14:15-15:45 Part 3 Optimal Parallel Algorithms. Saturday 9-10:30 Part 3 Data Skew. Saturday 11-12 Dan Suciu Multi-Joins – Lecture 1 March, 2015 4 / 34
Background AGM Friedgut Optimal Algorithms Summary Multi-join Query Evaluation – Outline Part 1 Optimal Sequential Algorithms. Thursday 14:15-15:45 Part 2 Lower bounds for Parallel Algorithms. Friday 14:15-15:45 Part 3 Optimal Parallel Algorithms. Saturday 9-10:30 Part 3 Data Skew. Saturday 11-12 Dan Suciu Multi-Joins – Lecture 1 March, 2015 4 / 34
Background AGM Friedgut Optimal Algorithms Summary Multi-join Query Evaluation – Outline Part 1 Optimal Sequential Algorithms. Thursday 14:15-15:45 Part 2 Lower bounds for Parallel Algorithms. Friday 14:15-15:45 Part 3 Optimal Parallel Algorithms. Saturday 9-10:30 Part 3 Data Skew. Saturday 11-12 Dan Suciu Multi-Joins – Lecture 1 March, 2015 4 / 34
Background AGM Friedgut Optimal Algorithms Summary Multi-join Query Evaluation – Outline Part 1 Optimal Sequential Algorithms. Thursday 14:15-15:45 Part 2 Lower bounds for Parallel Algorithms. Friday 14:15-15:45 Part 3 Optimal Parallel Algorithms. Saturday 9-10:30 Part 3 Data Skew. Saturday 11-12 Dan Suciu Multi-Joins – Lecture 1 March, 2015 4 / 34
Background AGM Friedgut Optimal Algorithms Summary Bibliography E Friedgut, Hypergraphs, entropy, and inequalities, American Mathematical Monthly, 749-760, 2004. Albert Atserias, Martin Grohe, Dniel Marx: Size Bounds and Query Plans for Relational Joins. SIAM J. Comput. 42(4): 1737-1767 (2013) Hung Q. Ngo, Christopher R´ e, Atri Rudra: Skew strikes back: new developments in the theory of join algorithms. SIGMOD Record 42(4): 5-16 (2013) Paul Beame, Paraschos Koutris, Dan Suciu: Skew in parallel query processing. PODS 2014: 212-223 Paul Beame, Paraschos Koutris, Dan Suciu: Communication steps for parallel query processing. PODS 2013: 273-284 Dan Suciu Multi-Joins – Lecture 1 March, 2015 5 / 34
Background AGM Friedgut Optimal Algorithms Summary Outline for Lecture 1 Background: Queries, Databases, Query Evaluation The AGM inequality Friedgut’s inequality Worst-case optimal query evaluation Summary Dan Suciu Multi-Joins – Lecture 1 March, 2015 6 / 34
Background AGM Friedgut Optimal Algorithms Summary Relations and Databases Person Name Age City Hobby Alice 22 L´ odt´ z knitting Bob 33 Lyon karate Carol 44 L´ odt´ z kayaking David 33 Lima karate Eve 22 Lima knitting Schema Relation/table name Person ; Attribute/column names Name, Age, City, Hobby ; Key Name Instance Set of tuples/rows/records, e.g. (Alice, 22, L´ odt´ z, knitting) Size Number of tuples m = 5; note: relation is a set Database is a set of relations = a finite structure Dan Suciu Multi-Joins – Lecture 1 March, 2015 7 / 34
Background AGM Friedgut Optimal Algorithms Summary Relations and Databases Person Name Age City Hobby Alice 22 L´ odt´ z knitting Bob 33 Lyon karate Carol 44 L´ odt´ z kayaking David 33 Lima karate Eve 22 Lima knitting Schema Relation/table name Person ; Attribute/column names Name, Age, City, Hobby ; Key Name Instance Set of tuples/rows/records, e.g. (Alice, 22, L´ odt´ z, knitting) Size Number of tuples m = 5; note: relation is a set Database is a set of relations = a finite structure Dan Suciu Multi-Joins – Lecture 1 March, 2015 7 / 34
Background AGM Friedgut Optimal Algorithms Summary Relations and Databases Person Name Age City Hobby Alice 22 L´ odt´ z knitting Bob 33 Lyon karate Carol 44 L´ odt´ z kayaking David 33 Lima karate Eve 22 Lima knitting Schema Relation/table name Person ; Attribute/column names Name, Age, City, Hobby ; Key Name Instance Set of tuples/rows/records, e.g. (Alice, 22, L´ odt´ z, knitting) Size Number of tuples m = 5; note: relation is a set Database is a set of relations = a finite structure Dan Suciu Multi-Joins – Lecture 1 March, 2015 7 / 34
Background AGM Friedgut Optimal Algorithms Summary Relations and Databases Person Name Age City Hobby Alice 22 L´ odt´ z knitting Bob 33 Lyon karate Carol 44 L´ odt´ z kayaking David 33 Lima karate Eve 22 Lima knitting Schema Relation/table name Person ; Attribute/column names Name, Age, City, Hobby ; Key Name Instance Set of tuples/rows/records, e.g. (Alice, 22, L´ odt´ z, knitting) Size Number of tuples m = 5; note: relation is a set Database is a set of relations = a finite structure Dan Suciu Multi-Joins – Lecture 1 March, 2015 7 / 34
Background AGM Friedgut Optimal Algorithms Summary Relations and Databases Person Name Age City Hobby Alice 22 L´ odt´ z knitting Bob 33 Lyon karate Carol 44 L´ odt´ z kayaking David 33 Lima karate Eve 22 Lima knitting Schema Relation/table name Person ; Attribute/column names Name, Age, City, Hobby ; Key Name Instance Set of tuples/rows/records, e.g. (Alice, 22, L´ odt´ z, knitting) Size Number of tuples m = 5; note: relation is a set Database is a set of relations = a finite structure Dan Suciu Multi-Joins – Lecture 1 March, 2015 7 / 34
Background AGM Friedgut Optimal Algorithms Summary Basic Stuff that’s Good To Know Relational database systems: Oracle, SQL Server, DB2, Postgres, SQLite, Dremel, Scope, Spark SQL Relations are flat (atomic values only): 1st normal form. Relations are persistent: stored in file systems, or in distributed file systems like Hadoop Physical data independence: system is allowed to organize the relation how it wishes. E.g. indexes, column-oriented DBs, partition on distributed servers, replicated. Dan Suciu Multi-Joins – Lecture 1 March, 2015 8 / 34
Background AGM Friedgut Optimal Algorithms Summary Relational Algebra Cartesian product / Join: ⋈ Projection: Π A Selection: σ C Union: ∪ Set difference: − This course: select-project-join Dan Suciu Multi-Joins – Lecture 1 March, 2015 9 / 34
Background AGM Friedgut Optimal Algorithms Summary Join R ⋈ X = Y S The set of pairs ( t 1 , t 2 ) , with t 1 ∈ R and t 2 ∈ S , s.t. t 1 . X = t 2 . Y T = R ⋈ X = Y S R X U S Y V X U Y V a 1 b 1 a 1 c 1 a 1 b 1 a 1 c 1 a 1 b 2 a 1 c 2 a 1 b 1 a 1 c 2 a 2 b 3 a 3 c 3 a 1 b 2 a 1 c 1 a 3 b 4 a 4 c 4 a 1 b 2 a 1 c 2 a 3 b 4 a 3 c 3 Input schemas: R ( X , U ) , S ( Y , V ) Output schema: T ( X , U , Y , V ) Dan Suciu Multi-Joins – Lecture 1 March, 2015 10 / 34
Background AGM Friedgut Optimal Algorithms Summary Natural Join R ⋈ S Joins R , S on all common attributes, removes duplicate attributes T = R ⋈ S R A B S A C A B C a 1 b 1 a 1 c 1 a 1 b 1 c 1 a 1 b 2 a 1 c 2 a 1 b 1 c 2 a 2 b 3 a 3 c 3 a 1 b 2 c 1 a 3 b 4 a 4 c 4 a 1 b 2 c 2 a 3 b 4 c 3 Input schemas: R ( A , B ) , S ( A , C ) Output schema: T ( A , B , C ) Dan Suciu Multi-Joins – Lecture 1 March, 2015 11 / 34
Background AGM Friedgut Optimal Algorithms Summary Natural Join Examples Question In each case below: what is the output schema? What does the join do? R ( A , B , E , G ) ⋈ S ( A , C , D , E , F ) Dan Suciu Multi-Joins – Lecture 1 March, 2015 12 / 34
Background AGM Friedgut Optimal Algorithms Summary Natural Join Examples Question In each case below: what is the output schema? What does the join do? R ( A , B , E , G ) ⋈ S ( A , C , D , E , F ) Returns Output ( A , B , C , D , E , F , G ) = R ⋈ ( R . A = S . A )∧( R . E = S . E ) S Dan Suciu Multi-Joins – Lecture 1 March, 2015 12 / 34
Background AGM Friedgut Optimal Algorithms Summary Natural Join Examples Question In each case below: what is the output schema? What does the join do? R ( A , B , E , G ) ⋈ S ( A , C , D , E , F ) Returns Output ( A , B , C , D , E , F , G ) = R ⋈ ( R . A = S . A )∧( R . E = S . E ) S R ( A , B ) ⋈ S ( C , D , E ) Dan Suciu Multi-Joins – Lecture 1 March, 2015 12 / 34
Recommend
More recommend