
ADVANCED DATABASE SYSTEMS: Parallel Join Algorithms (Hashing)

Lecture #17: ADVANCED DATABASE SYSTEMS, Parallel Join Algorithms (Hashing)
@Andy_Pavlo // 15-721 // Spring 2020
2 Background / Parallel Hash Join / Hash Functions / Hashing Schemes / Evaluation
15-721 (Spring 2020)
3 PARALLEL JOIN ALGO


  1. 20 BUILD PHASE
     The threads then scan either the tuples (or partitions) of R. For each tuple, hash the join key attribute for that tuple and add it to the appropriate bucket in the hash table.
     → The buckets should only be a few cache lines in size.
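
The build phase above can be sketched as follows. This is a minimal, single-threaded illustration, not the lecture's implementation: it assumes R is a list of (key, payload) pairs, uses Python's built-in hash() as a stand-in hash function, and adds a hypothetical probe() helper only to check the result.

```python
def build_hash_table(r_tuples, num_buckets):
    """Build phase: hash each tuple's join key and add the tuple to its bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for key, payload in r_tuples:
        slot = hash(key) % num_buckets  # stand-in hash function (assumption)
        buckets[slot].append((key, payload))
    return buckets


def probe(buckets, key):
    """Hypothetical helper: return the payloads of all tuples whose key matches."""
    slot = hash(key) % len(buckets)
    return [payload for k, payload in buckets[slot] if k == key]


# Keys 1 and 9 land in the same bucket when num_buckets is 8.
R = [(1, "a"), (2, "b"), (9, "c"), (1, "d")]
table = build_hash_table(R, num_buckets=8)
matches = probe(table, 1)
```

In a parallel join each thread would run the same loop over its own range or partition of R; how the threads synchronize on shared buckets is not shown here.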

  2. 21 HASH TABLE
     Design Decision #1: Hash Function
     → How to map a large key space into a smaller domain.
     → Trade-off between being fast vs. collision rate.
     Design Decision #2: Hashing Scheme
     → How to handle key collisions after hashing.
     → Trade-off between allocating a large hash table vs. additional instructions to find/insert keys.

  3. 22 HASH FUNCTIONS
     We do not want to use a cryptographic hash function for our join algorithm. We want something that is fast and will have a low collision rate.
     → Best Speed: Always return '1'
     → Best Collision Rate: Perfect hashing
     See SMHasher for a comprehensive hash function benchmark suite.
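
To make the speed vs. collision-rate trade-off concrete, the sketch below compares the "always return 1" function with a real checksum. It is only an illustration under stated assumptions: zlib.crc32 (CRC-32) stands in for a fast non-cryptographic hash, and max_bucket_size is a hypothetical helper name.

```python
import zlib


def constant_hash(key: bytes) -> int:
    """Fastest possible hash function: every key collides."""
    return 1


def crc32_hash(key: bytes) -> int:
    """Stand-in for a fast non-cryptographic hash (assumption: CRC-32 from zlib)."""
    return zlib.crc32(key)


def max_bucket_size(hash_fn, keys, num_buckets):
    """Return the size of the fullest bucket after hashing all keys."""
    counts = [0] * num_buckets
    for k in keys:
        counts[hash_fn(k) % num_buckets] += 1
    return max(counts)


keys = [str(i).encode() for i in range(1000)]
worst = max_bucket_size(constant_hash, keys, 64)   # all 1000 keys in one bucket
spread = max_bucket_size(crc32_hash, keys, 64)     # keys spread over the buckets
```

With the constant function every lookup degenerates into a scan of all keys, which is why raw hashing speed alone is not the goal.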

  4. 23 HASH FUNCTIONS
     CRC-64 (1975)
     → Used in networking for error detection.
     MurmurHash (2008)
     → Designed to be a fast, general-purpose hash function.
     Google CityHash (2011)
     → Designed to be faster for short keys (<64 bytes).
     Facebook XXHash (2012)
     → From the creator of zstd compression.
     Google FarmHash (2014)
     → Newer version of CityHash with better collision rates.

  5. 24 HASH FUNCTION BENCHMARK
     [Figure: throughput (MB/sec) versus key size (bytes) for crc64, std::hash, MurmurHash3, CityHash, FarmHash, and XXHash3 on an Intel Core i7-8700K @ 3.70GHz.]
     Source: Fredrik Widlund

  6. 25 HASHING SCHEMES
     Approach #1: Chained Hashing
     Approach #2: Linear Probe Hashing
     Approach #3: Robin Hood Hashing
     Approach #4: Hopscotch Hashing
     Approach #5: Cuckoo Hashing

  7. 26 CHAINED HASHING
     Maintain a linked list of buckets for each slot in the hash table. Resolve collisions by placing all elements with the same hash key into the same bucket.
     → To determine whether an element is present, hash to its bucket and scan for it.
     → Insertions and deletions are generalizations of lookups.
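
A minimal sketch of chained hashing, assuming unique keys and using Python lists in place of linked bucket chains. The class and method names are illustrative, and hash() is a stand-in hash function. Note how insert and delete both start with the same scan that lookup performs.

```python
class ChainedHashTable:
    def __init__(self, num_slots):
        # One chain per slot; a Python list stands in for a linked list of buckets.
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def lookup(self, key):
        """Hash to the chain and scan it for the key."""
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

    def insert(self, key, value):
        """Scan like a lookup; overwrite on a match, otherwise append."""
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def delete(self, key):
        """Scan like a lookup; remove the entry if it is found."""
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False


t = ChainedHashTable(4)
for k, v in [(0, "A"), (4, "D"), (8, "E"), (1, "B")]:
    t.insert(k, v)          # keys 0, 4 and 8 all collide in slot 0
removed = t.delete(4)
```

A join hash table would normally keep duplicate keys instead of overwriting them; the unique-key behaviour here is a simplification.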

  8-16. 27 CHAINED HASHING (animation frames)
     [Figure: keys A through F are hashed into the table one at a time. In the final frame, A and D share one bucket chain, C and E share another, and B and F each sit alone in their buckets.]
     The last frame adds: 64-bit bucket pointers = 48-bit pointer + 16-bit Bloom filter.
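
The 64-bit bucket pointer layout (48-bit pointer plus 16-bit Bloom filter) can be illustrated with plain integer arithmetic. The slide does not say which bits hold which field or how a filter bit is chosen, so the choices below (filter in the upper 16 bits, filter bit picked from the top 4 bits of a 64-bit hash) and all function names are assumptions made for this sketch.

```python
PTR_BITS = 48
PTR_MASK = (1 << PTR_BITS) - 1


def filter_bit(key_hash):
    """Pick one of the 16 filter bits from the hash (assumed: its top 4 bits)."""
    return 1 << ((key_hash >> 60) & 0xF)


def pack(pointer, bloom):
    """Combine a 48-bit pointer and a 16-bit filter into one 64-bit word."""
    assert 0 <= pointer <= PTR_MASK and 0 <= bloom < (1 << 16)
    return (bloom << PTR_BITS) | pointer


def unpack(word):
    """Split a 64-bit word back into (pointer, filter)."""
    return word & PTR_MASK, word >> PTR_BITS


def add_key(word, key_hash):
    """Record a key in the chain's filter without touching the pointer."""
    pointer, bloom = unpack(word)
    return pack(pointer, bloom | filter_bit(key_hash))


def may_contain(word, key_hash):
    """False means the key is definitely not in the chain; True means maybe."""
    _, bloom = unpack(word)
    return bool(bloom & filter_bit(key_hash))


h1 = 0xA000000000000001   # example hashes (made up); top 4 bits are 0xA
h2 = 0x3000000000000002   # top 4 bits are 0x3
word = add_key(pack(0x7FFFDEADBEE0, 0), h1)
```

A probe can test may_contain first and skip following the pointer when the filter rules the key out.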

  17. 28 LINEAR PROBE HASHING
     Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table.
     → To determine whether an element is present, hash to a location in the table and scan for it.
     → Must store the key in the table to know when to stop scanning.
     → Insertions and deletions are generalizations of lookups.

  18-25. 29 LINEAR PROBE HASHING (animation frames)
     [Figure: keys A through F are inserted one at a time; a key whose slot is already taken moves down to the next free slot. Final slot order, top to bottom: B, A, C, D, E, F.]
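
A minimal sketch of linear probe hashing, with hash() as a stand-in hash function and illustrative names. Deletion is left out because it needs extra handling (for example tombstones) so that later lookups do not stop too early.

```python
class LinearProbeTable:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # each slot holds (key, value) or None

    def insert(self, key, value):
        """Start at the hashed slot and take the first free (or matching) slot."""
        n = len(self.slots)
        start = hash(key) % n
        for step in range(n):
            i = (start + step) % n
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return i
        raise RuntimeError("table is full")

    def lookup(self, key):
        """Scan from the hashed slot; the stored keys tell us when to stop."""
        n = len(self.slots)
        start = hash(key) % n
        for step in range(n):
            i = (start + step) % n
            if self.slots[i] is None:
                return None          # reached an empty slot: key is absent
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None


t = LinearProbeTable(8)
pos_a = t.insert(1, "A")   # hashes to slot 1
pos_c = t.insert(9, "C")   # also hashes to slot 1, moves to slot 2
pos_d = t.insert(2, "D")   # hashes to slot 2, moves to slot 3
```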

  26. 30 OBSERVATION
     To reduce the # of wasteful comparisons during the join, it is important to avoid collisions of hashed keys. This requires a chained hash table with ~2× the number of slots as the # of elements in R.

  27. 31 ROBIN HOOD HASHING
     Variant of linear probe hashing that steals slots from "rich" keys and gives them to "poor" keys.
     → Each key tracks the number of positions it is from its optimal position in the table.
     → On insert, a key takes the slot of another key if the first key is farther away from its optimal position than the second key.
     ROBIN HOOD HASHING, FOUNDATIONS OF COMPUTER SCIENCE 1985

  28-38. 32 ROBIN HOOD HASHING (animation frames)
     [Figure: keys A through F are inserted one at a time; each entry is shown with its # of "jumps" from its first position in brackets.]
     → A and B land in their first positions: A[0], B[0].
     → C hashes to A's slot; A[0] == C[0], so C moves down one slot: C[1].
     → D hashes to C's slot; C[1] > D[0], so D moves down one slot: D[1].
     → E hashes to A's slot; A[0] == E[0] and C[1] == E[1], so E keeps moving. At D's slot, D[1] < E[2], so E takes the slot and D moves down one more: E[2], D[2].
     → F hashes to D's new slot; D[2] > F[0], so F moves down one slot: F[1].
     Final slot order, top to bottom: B[0], A[0], C[1], E[2], D[2], F[1].
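
The insertion rule can be sketched as below and replayed on the six keys from the example. The function name is illustrative, and the home slot indices passed in are assumptions chosen to reproduce the comparisons shown in the frames; the transcript does not give actual slot numbers.

```python
def robin_hood_insert(slots, key, home):
    """Insert `key` whose optimal position is `home`.

    Each slot holds (key, home) or None; the distance of an entry from its
    optimal position is recomputed from the stored home index.
    """
    n = len(slots)
    entry = (key, home)
    pos = home
    for _ in range(n):
        if slots[pos] is None:
            slots[pos] = entry
            return
        resident = slots[pos]
        dist_new = (pos - entry[1]) % n
        dist_res = (pos - resident[1]) % n
        if dist_new > dist_res:
            # The incoming key is "poorer": it takes the slot and the
            # displaced resident continues probing from the next slot.
            slots[pos], entry = entry, resident
        pos = (pos + 1) % n
    raise RuntimeError("table is full")


slots = [None] * 8
# Assumed home slots: A, C and E collide; D hashes to C's slot; F to D's final slot.
for key, home in [("A", 1), ("B", 0), ("C", 1), ("D", 2), ("E", 1), ("F", 4)]:
    robin_hood_insert(slots, key, home)

order = [s[0] if s else None for s in slots]
jumps = [(i - s[1]) % 8 for i, s in enumerate(slots) if s]
```

Under these assumptions the final layout matches the last frame: B, A, C, E, D, F with jump counts 0, 0, 1, 2, 2, 1.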

  39. 33 HOPSCOTCH HASHING
     Variant of linear probe hashing where keys can move between positions in a neighborhood.
     → A neighborhood is a contiguous range of slots in the table.
     → The size of a neighborhood is a configurable constant.
     A key is guaranteed to be in its neighborhood or not exist in the table.
     HOPSCOTCH HASHING, SYMPOSIUM ON DISTRIBUTED COMPUTING 2008

  40-60. 34 HOPSCOTCH HASHING (animation frames)
     [Figure: Neighborhood Size = 3. Neighborhoods overlap: Neighborhood #1 starts at the first slot, Neighborhood #2 at the second, and so on.]
     → A hashes to Neighborhood #3 and B to Neighborhood #1; each takes a free slot there.
     → C also hashes to Neighborhood #3 and takes the next free slot after A.
     → D is placed in the slot after C.
     → E hashes to Neighborhood #3, which is full by then; D is moved one slot down and E takes D's old slot.
     → F hashes to Neighborhood #6, where D now sits, and takes the slot after D.
     Final order of keys, top to bottom: B, A, C, E, D, F.
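
A minimal sketch of hopscotch insertion and lookup with a neighborhood size of 3. The function names, the 8-slot table, and the 0-based home slots (in particular D's, which the frames do not label) are assumptions chosen so that the replay ends in the same key order as the example.

```python
H = 3  # neighborhood size, as in the example


def hopscotch_insert(slots, key, home):
    """Slots hold (key, home) or None; a key must end up in home .. home+H-1."""
    n = len(slots)
    # 1. Linear-probe from the home slot for the first free slot.
    free = None
    for step in range(n):
        if slots[(home + step) % n] is None:
            free = (home + step) % n
            break
    if free is None:
        raise RuntimeError("table is full")
    # 2. While the free slot lies outside the neighborhood, hop it closer by
    #    moving an earlier entry into it, if that entry stays in its own
    #    neighborhood after the move.
    while (free - home) % n >= H:
        for back in range(H - 1, 0, -1):      # farthest candidate first
            cand = (free - back) % n
            _, cand_home = slots[cand]
            if (free - cand_home) % n < H:
                slots[free], slots[cand] = slots[cand], None
                free = cand
                break
        else:
            raise RuntimeError("no entry can be moved; resize needed")
    slots[free] = (key, home)


def hopscotch_lookup(slots, key, home):
    """Only the H slots of the key's neighborhood need to be checked."""
    n = len(slots)
    return any(slots[(home + i) % n] is not None and slots[(home + i) % n][0] == key
               for i in range(H))


slots = [None] * 8
homes = {"A": 2, "B": 0, "C": 2, "D": 3, "E": 2, "F": 5}  # assumed home slots
for key in "ABCDEF":
    hopscotch_insert(slots, key, homes[key])
order = [s[0] if s else None for s in slots]
```

When E arrives, its neighborhood (slots 2 to 4) is full, so D hops from slot 4 to slot 5, which is still inside D's own neighborhood, and E takes slot 4.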

  61. 35 CUCKOO HASHING
     Use multiple tables with different hash functions.
     → On insert, check every table and pick any one that has a free slot.
     → If no table has a free slot, evict the element from one of them and then re-hash it to find a new location.
     Look-ups are always O(1) because only one location per hash table is checked.

  62-66. 36 CUCKOO HASHING (animation frames)
     [Figure: two tables, Hash Table #1 and Hash Table #2.]
     → Insert X: compute hash1(X) and hash2(X); X is stored in Hash Table #1 at hash1(X).
     → Insert Y: compute hash1(Y) and hash2(Y); Y is stored in Hash Table #2 at hash2(Y).
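
A minimal sketch of cuckoo hashing with two tables. The class name, the max_kicks limit, and the two toy hash functions for integer keys (k mod size and (k div size) mod size) are assumptions made so that the example can be followed by hand; a real system would use two independent hash functions.

```python
class CuckooTable:
    def __init__(self, size, max_kicks=32):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]

    def _pos(self, which, key):
        # Toy hash functions for integer keys (assumption, for illustration only).
        return key % self.size if which == 0 else (key // self.size) % self.size

    def lookup(self, key):
        """Check exactly one location per table."""
        for which in (0, 1):
            entry = self.tables[which][self._pos(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        # Update in place if the key is already stored.
        for which in (0, 1):
            i = self._pos(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        # Otherwise take a free slot in either table.
        for which in (0, 1):
            i = self._pos(which, key)
            if self.tables[which][i] is None:
                self.tables[which][i] = (key, value)
                return
        # No free slot: evict the resident and re-hash it into the other table.
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._pos(which, entry[0])
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return
            which = 1 - which
        # The entry evicted last is dropped here; a real table would rebuild.
        raise RuntimeError("eviction cycle; rebuild with new hash functions")


t = CuckooTable(4)
t.insert(1, "X")    # free slot in table #1
t.insert(5, "Y")    # table #1 slot is taken by X, so Y goes to table #2
t.insert(21, "W")   # both slots taken: W evicts X, and X moves to table #2
```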
