Evaluation of Join Operations Ramakrishnan/Gehrke Chapter 14, Part A (Joins) 340151 Big Data & Cloud Computing (P. Baumann) 1
Relational Operations: Join Definition Natural join: • R S := L ( C ( R S ) ) Where • C: condition that equates all pairs of attributes of R and S that have the same name • Ex: R has x, S has x " R.x=S.x and… " in C • L: list of all attributes of R and S, except equate duplicates • Ex: C contains "R.x=S.x" only one x chosen for L Example: R(a,b) and S(b,c) relations • Then, R S = … Hence: join is shorthand (but more efficient to compute in 1 step) 340151 Big Data & Cloud Computing (P. Baumann) 2
Relational Operations: More Joins Theta join • Let R(a,b) and S(b,c) be relations • R S := C ( R S ) C Why "theta"? • Historically: R S where { =, , >, , <, } x y • Today: can be any condition Special case: C = "R.x = S.y" equijoin No projection! 340151 Big Data & Cloud Computing (P. Baumann) 3
Schema for Examples Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string) Similar to old schema; rname added for variations Reserves: tuple 40 bytes, 100 tuples per page, 1000 pages Sailors: tuple 50 bytes, 80 tuples per page, 500 pages 340151 Big Data & Cloud Computing (P. Baumann) 4
Equality Joins With One Join Column SELECT * FROM Reserves R, Sailors S WHERE R.sid=S.sid R S large R S followed by selection inefficient Assume: M #pages of R, p R tuples per page, N #pages of S, p S tuples per page Cost metric: # of I/Os • will ignore output costs, disk access patterns 340151 Big Data & Cloud Computing (P. Baumann) 5
Simple Nested Loops Join foreach tuple r in R do Reserves: foreach tuple s in S do M = 1000; p R = 100 if r i == s j Sailors: then add <r,s> to result N = 500; p S = 80 For each tuple in outer relation R, scan entire inner relation S • Cost: M + p R * M * N = 1,000 + 100*1000*500 I/Os = 50,001,000 I/Os Page-oriented Nested Loops join ( Block-Nested Loop Join ) : For each page of R: get each page of S, write out matching pairs of tuples <r,s>, where r in R-page and s in S-page • Cost: M + M*N = 1000 + 1000*500 = 501,000 • If smaller relation (S) is outer, cost = 500 + 500*1000 = 500,500 340151 Big Data & Cloud Computing (P. Baumann) 6
Index Nested Loops Join Assume index on join column of one relation (say S) can make it inner and exploit index • Cost: M + ( M*p R * cost of finding matching S tuples ) For each R tuple, cost of probing S index ~1.2 for hash index For each R tuple, cost of probing S index 2 … 4 for B+ tree • Cost of then finding S tuples (assuming Alt. (2) or (3) for data entries) depends on clustering: • Clustered index: 1 I/O for all tuples (typical), unclustered: up to 1 I/O per matching S tuple 340151 Big Data & Cloud Computing (P. Baumann) 7
Examples of Index Nested Loops SELECT * Hash-index on sid of Sailors (as inner): FROM Reserves R, Sailors S WHERE R.sid=S.sid • Scan Reserves: 1000 page I/Os, 100*1000 tuples • For each Reserves tuple: 1.2 I/Os to get data entry in index + 1 I/O to get (the exactly one) matching Sailors tuple • Total: 220,000 I/Os Hash-index on sid of Reserves (as inner): • Scan Sailors: 500 page I/Os, 80*500 tuples • For each Sailors tuple: 1.2 I/Os to find index page with data entries + cost of retrieving matching Reserves tuples (*) • (*) Assuming uniform distribution, 2.5 reservations per sailor (100,000 / 40,000) Cost is 1 or 2.5 I/Os, depending on whether index is clustered • Total: … 4,000 * (1.2 + 2.5) = 148,000 I/Os 340151 Big Data & Cloud Computing (P. Baumann) 8
Block Nested Loops Join 1 page as input buffer for scanning inner S 1 page as output buffer Earlier: block = page Now: block = sequence of pages all remaining pages hold "block" of outer R R & S Join Result Hash table for block of R (k < B-1 pages) . . . . . . . . . Input buffer for S Output buffer 340151 Big Data & Cloud Computing (P. Baumann) 9
Examples of Block Nested Loops Cost: Scan of outer + #outer blocks * scan of inner • #outer blocks = # of pages of outer / blocksize With Reserves (R) as outer, and 100 pages of R per block: • Cost of scanning R is 1000 I/Os; total of 10 blocks • Per block of R, scan Sailors (S); 10*500 I/Os With 100-page block of Sailors as outer: • Cost of scanning S is 500 I/Os; total of 5 blocks • Per block of S, scan Reserves; 5*1000 I/Os 340151 Big Data & Cloud Computing (P. Baumann) 10
Sort-Merge Join Approach: • Sort R & S on join column • scan them to do a ``merge’’ (on join column) • output result tuples Efficiency: • R scanned once; each S group scanned once per matching R tuple • Multiple scans of an S group likely to find needed pages in buffer Cost: M log M + N log N + (M+N) • Ex: with 35, 100 or 300 buffer pages, Reserves & Sailors sorted in 2 passes • total join cost: 7500 ( BNL cost: 2,500 to 15,000 I/Os ) In practice, cost of sort-merge join linear 340151 Big Data & Cloud Computing (P. Baumann) 11
Refinement of Sort-Merge Join combine merging phases in sorting of R & S with merging required for join • With B > , where L is size of larger relation, L using sorting refinement that produces runs of length 2B in Pass 0: #runs of each relation is < B/2 • Allocate 1 page per run of each relation, `merge’ while checking join condition • Cost: read+write each relation in Pass 0 + read each relation in (only) merging pass (+ writing of result tuples) • In example: cost goes down from 7,500 to 4,500 I/Os In practice, cost of sort-merge join is linear • like cost of external sorting 340151 Big Data & Cloud Computing (P. Baumann) 12
Recommend
More recommend