hash tables

Hash Tables: Lecture #06, Database Systems, Andy Pavlo



  1. Hash Tables. Lecture #06, Database Systems. Andy Pavlo, Computer Science 15-445/15-645, Carnegie Mellon Univ., Fall 2018

  2. UPCOMING DATABASE EVENTS: MapD Talk → Thursday Sept 20th @ 12:00pm → CIC 4th Floor. CMU 15-445/645 (Fall 2018)

  3. ADMINISTRIVIA: Project #1 is due Wednesday Sept 26th @ 11:59pm. Homework #2 is due Friday Sept 28th @ 11:59pm.

  4. REMINDER: If you have a question during the lecture, raise your hand and stop me. Do not come up to the front after the lecture. There are no stupid questions (*).

  5. COURSE STATUS: We are now going to talk about how to support the DBMS's execution engine to read/write data from pages. Two types of data structures: → Hash Tables → Trees. [Figure: system layers, top to bottom: Query Planning, Operator Execution, Access Methods, Buffer Pool Manager, Disk Manager.]

  6. DATA STRUCTURES: Internal Meta-data. Core Data Storage. Temporary Data Structures. Table Indexes.

  7. DESIGN DECISIONS: Data Organization → How we lay out the data structure in memory/pages and what information to store to support efficient access. Concurrency → How to enable multiple threads to access the data structure at the same time without causing problems.

  8. HASH TABLES: A hash table implements an associative array abstract data type that maps keys to values. It uses a hash function to compute an offset into the array, from which the desired value can be found.

  9-10. STATIC HASH TABLE: Allocate a giant array that has one slot for every element that you need to record. To find an entry, mod the hash of the key by the number of elements to find the offset in the array. [Figure: hash(key) selects a slot in an array indexed 0 to n; example slots hold abc, Ø (empty), def, ..., xyz. A second build shows the same array with the longer entries abcdefghi, xyz123, and defghijk.]

  11. ASSUMPTIONS: You know the number of elements ahead of time. Each key is unique. Perfect hash function. → If key1≠key2, then hash(key1)≠hash(key2)
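
The static hash table and its three assumptions can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the class name `StaticHashTable` and its methods are hypothetical, and small integer keys are used so that Python's built-in `hash()` returns the key itself.

```python
class StaticHashTable:
    """Fixed-size array with one slot per expected element (hypothetical sketch)."""

    def __init__(self, num_elements):
        # Assumption 1: the number of elements is known ahead of time.
        self.slots = [None] * num_elements

    def _offset(self, key):
        # Mod the hashed key by the number of slots to find the array offset.
        return hash(key) % len(self.slots)

    def put(self, key, value):
        # Assumptions 2 and 3: keys are unique and the hash is perfect,
        # so the slot is simply overwritten and the key is not stored.
        self.slots[self._offset(key)] = value

    def get(self, key):
        return self.slots[self._offset(key)]


table = StaticHashTable(4)
table.put(0, "abc")
table.put(2, "def")
found = table.get(0)      # "abc"
missing = table.get(1)    # None, the slot was never written

# Violating the perfect-hash assumption: 6 % 4 == 2, so key 6 lands on key 2's slot.
table.put(6, "xyz")
clobbered = table.get(2)  # now "xyz", the value for key 2 was lost
```

The last two lines show why the assumptions matter: without a perfect hash function, two keys can map to the same offset and one silently overwrites the other.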

  12. HASH TABLE: Design Decision #1: Hash Function → How to map a large key space into a smaller domain. → Trade-off between being fast vs. collision rate. Design Decision #2: Hashing Scheme → How to handle key collisions after hashing. → Trade-off between allocating a large hash table vs. additional instructions to find/insert keys.
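
As a rough illustration of Design Decision #1, the sketch below maps 100 string keys into 16 slots. CRC32 from Python's standard `zlib` module is used only as a convenient, deterministic stand-in; it is not one of the hash functions discussed in the lecture, and the helper name `slot_for` is hypothetical.

```python
import zlib


def slot_for(key: str, num_slots: int) -> int:
    """Map an arbitrary string key into the range [0, num_slots)."""
    return zlib.crc32(key.encode("utf-8")) % num_slots


keys = [f"key{i}" for i in range(100)]
num_slots = 16

slots = [slot_for(k, num_slots) for k in keys]
distinct_slots = len(set(slots))

# With more keys than slots, collisions are unavoidable (pigeonhole principle):
# at most 16 keys can have a slot to themselves.
collisions = len(keys) - distinct_slots
```

However good the hash function is, shrinking a large key space into a smaller domain forces collisions, which is why the hashing scheme (Design Decision #2) is needed.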

  13. TODAY'S AGENDA: Hash Functions. Static Hashing Schemes. Dynamic Hashing Schemes.

  14. HASH FUNCTIONS: We don’t want to use a cryptographic hash function for our join algorithm. We want something that is fast and will have a low collision rate.

  15. HASH FUNCTIONS: MurmurHash (2008) → Designed to be a fast, general-purpose hash function. Google CityHash (2011) → Based on ideas from MurmurHash2. → Designed to be faster for short keys (<64 bytes). Google FarmHash (2014) → Newer version of CityHash with better collision rates. CLHash (2016) → Fast hashing function based on carry-less multiplication.

  16-17. HASH FUNCTION BENCHMARKS: [Figure: two charts of throughput (MB/sec) vs. key size (bytes, 1 to 251) for std::hash, MurmurHash3, CityHash, FarmHash, and CLHash on an Intel Core i7-8700K @ 3.70GHz, with key sizes 32, 64, 128, and 192 marked. Source: Fredrik Widlund.]

  18. STATIC HASHING SCHEMES: Approach #1: Linear Probe Hashing. Approach #2: Robin Hood Hashing. Approach #3: Cuckoo Hashing.

  19. LINEAR PROBE HASHING: Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table. → To determine whether an element is present, hash to a location in the index and scan for it. → Have to store the key in the index to know when to stop scanning. → Insertions and deletions are generalizations of lookups.

  20-25. LINEAR PROBE HASHING: [Figure, six builds: keys A through F are inserted one at a time via hash(key); each occupied slot stores <key> | <value>.]
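
Linear probe hashing as described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the lecture's implementation: the class name `LinearProbeTable` is hypothetical, deletions are handled with tombstone markers (one common approach that the slides do not specify), and small integer keys are used so that Python's `hash()` is deterministic.

```python
class LinearProbeTable:
    """Open-addressing table with linear probing (hypothetical sketch)."""

    _EMPTY = object()      # slot has never been used
    _TOMBSTONE = object()  # slot held an entry that was deleted (assumed strategy)

    def __init__(self, capacity=8):
        self.slots = [self._EMPTY] * capacity

    def _probe(self, key):
        # Visit every slot once, starting at the key's hash location.
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def put(self, key, value):
        first_free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is self._EMPTY:
                if first_free is None:
                    first_free = i
                break  # the key cannot be stored past an empty slot
            if slot is self._TOMBSTONE:
                if first_free is None:
                    first_free = i
            elif slot[0] == key:
                self.slots[i] = (key, value)  # update an existing key in place
                return
        if first_free is None:
            raise RuntimeError("table is full")
        self.slots[first_free] = (key, value)

    def get(self, key, default=None):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is self._EMPTY:
                return default  # stored keys tell us when to stop scanning
            if slot is not self._TOMBSTONE and slot[0] == key:
                return slot[1]
        return default

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is self._EMPTY:
                return False
            if slot is not self._TOMBSTONE and slot[0] == key:
                self.slots[i] = self._TOMBSTONE
                return True
        return False


table = LinearProbeTable(capacity=8)
for key, value in [(1, "a"), (9, "b"), (17, "c")]:
    table.put(key, value)  # 1, 9, and 17 all hash to slot 1
```

In the usage lines, all three keys collide at slot 1, so they end up in slots 1, 2, and 3. The tombstone keeps a later lookup of 17 from stopping early after 9 is deleted.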

  26-27. NON-UNIQUE KEYS: Choice #1: Separate Linked List → Store values in separate storage area for each key. Choice #2: Redundant Keys → Store duplicate key entries together in the hash table. [Figure: with value lists, XYZ points to value1, value2, value3 and ABC points to value1, value2; with redundant keys, the table holds XYZ | value1, ABC | value1, XYZ | value2, XYZ | value3, ABC | value2.]
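
The two choices for non-unique keys can be compared with a short sketch that uses the key/value pairs from the figure. The helper names (`home_slot`, `insert_redundant`, `lookup_redundant`) are hypothetical, CRC32 is an arbitrary deterministic stand-in for the hash function, and the redundant-key variant assumes a linear probe table with no deletions.

```python
import zlib
from collections import defaultdict

pairs = [("XYZ", "value1"), ("ABC", "value1"), ("XYZ", "value2"),
         ("XYZ", "value3"), ("ABC", "value2")]

# Choice #1: one entry per key, pointing at a separate list of values.
value_lists = defaultdict(list)
for key, value in pairs:
    value_lists[key].append(value)

# Choice #2: redundant keys, one <key, value> entry per pair in the table itself.
CAPACITY = 8
slots = [None] * CAPACITY


def home_slot(key):
    return zlib.crc32(key.encode("utf-8")) % CAPACITY


def insert_redundant(key, value):
    start = home_slot(key)
    for step in range(CAPACITY):
        i = (start + step) % CAPACITY
        if slots[i] is None:
            slots[i] = (key, value)
            return
    raise RuntimeError("table is full")


def lookup_redundant(key):
    # Scan from the home slot to the first empty slot, collecting every match.
    values = []
    start = home_slot(key)
    for step in range(CAPACITY):
        entry = slots[(start + step) % CAPACITY]
        if entry is None:
            break
        if entry[0] == key:
            values.append(entry[1])
    return values


for key, value in pairs:
    insert_redundant(key, value)
```

With value lists, a lookup finds the key once and then walks its list. With redundant keys, a lookup must keep scanning past the first match until it reaches an empty slot.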

  28. OBSERVATION: To reduce the # of wasteful comparisons, it is important to avoid collisions of hashed keys. This requires a hash table with ~2x the number of slots as the number of elements.
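
One way to see why roughly 2x the slots is a reasonable target is to look at the classical approximations for the expected number of probes in linear probing, attributed to Knuth. They are not part of the slides and assume a uniformly distributing hash function. The function name `expected_probes` is hypothetical, and `load_factor` is the number of elements divided by the number of slots.

```python
def expected_probes(load_factor: float) -> tuple:
    """Approximate expected probes for linear probing at a given load factor.

    Returns (successful_lookup, unsuccessful_lookup_or_insert), using the
    classical approximations under a uniform-hashing assumption.
    """
    if not 0.0 <= load_factor < 1.0:
        raise ValueError("load factor must be in [0, 1)")
    gap = 1.0 - load_factor
    hit = 0.5 * (1.0 + 1.0 / gap)
    miss = 0.5 * (1.0 + 1.0 / (gap * gap))
    return hit, miss


half_full = expected_probes(0.5)    # table with 2x the slots: (1.5, 2.5)
nearly_full = expected_probes(0.9)  # roughly (5.5, 50.5)
```

At a load factor of 0.5 a lookup touches only a couple of slots on average, while at 0.9 an unsuccessful lookup is expected to scan around fifty.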

  29. ROBIN HOOD HASHING: Variant of linear probe hashing that steals slots from "rich" keys and gives them to "poor" keys. → Each key tracks the number of positions it is from its optimal position in the table. → On insert, a key takes the slot of another key if the first key is farther away from its optimal position than the second key.

  30-38. ROBIN HOOD HASHING: [Figure, nine builds: keys A through F are inserted via hash(key); each entry stores <key> | <value> plus its # of "jumps" from its first position, shown in brackets. Comparisons shown during the inserts: A[0] == C[0], so C moves on and is stored with [1]. C[1] > D[0], so D moves on and is stored with [1]. A[0] == E[0] and C[1] == E[1], so E moves on; D[1] < E[2], so E takes D's slot with [2] and D moves down one slot with [2]. D[2] > F[0], so F moves on and is stored with [1]. Final table, top to bottom: B[0], A[0], C[1], E[2], D[2], F[1].]
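
The Robin Hood insert rule can be sketched in a few lines and checked against the jump counts in the figure. This is a minimal illustration, not the lecture's code: the function name `robin_hood_insert` is hypothetical, and the home slots in `HOME` are assumptions chosen so that the keys collide the way the figure's comparisons suggest. Keys are assumed unique and the table is assumed to have at least one free slot.

```python
def robin_hood_insert(slots, home, key, value):
    """Insert into a linear probe table using the Robin Hood rule.

    Each occupied slot holds (key, value, jumps), where jumps is the distance
    from the key's home slot. `home` maps a key to its home slot.
    """
    n = len(slots)
    pos = home(key) % n
    jumps = 0
    for _ in range(n):
        resident = slots[pos]
        if resident is None:
            slots[pos] = (key, value, jumps)
            return
        if resident[2] < jumps:
            # The resident is "richer" (closer to home) than the entry in hand:
            # take its slot and carry the resident forward instead.
            slots[pos] = (key, value, jumps)
            key, value, jumps = resident
        pos = (pos + 1) % n
        jumps += 1
    # Reached only if no free slot exists; the entry in hand is not stored.
    raise RuntimeError("table is full")


# Assumed home slots, picked to reproduce the comparisons in the figure.
HOME = {"A": 1, "B": 0, "C": 1, "D": 2, "E": 1, "F": 4}

slots = [None] * 8
for k in "ABCDEF":
    robin_hood_insert(slots, HOME.__getitem__, k, "val")

layout = [(s[0], s[2]) if s is not None else None for s in slots]
```

With these assumed home slots, `layout` matches the final build of the figure: B[0], A[0], C[1], E[2], D[2], F[1]. Ties leave the resident in place, so only a strictly "poorer" key steals a slot.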

  39. CUCKOO HASHING: Use multiple hash tables with different hash functions. → On insert, check every table and pick any one that has a free slot. → If no table has a free slot, evict the element from one of them and then re-hash it to find a new location. Look-ups and deletions are always O(1) because only one location per hash table is checked.

  40-45. CUCKOO HASHING: [Figure, six builds: two tables, Hash Table #1 and Hash Table #2. Insert A computes hash1(A) and hash2(A), and A | val is stored in a free slot. Insert B computes hash1(B) and hash2(B), and B | val is stored in a free slot. Insert C computes hash1(C) and hash2(C), and C | val takes the slot previously held by B.]
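
A two-table version of cuckoo hashing can be sketched as below. This is a minimal illustration under stated assumptions: the class name `CuckooHashTable`, the `max_kicks` bound, and the choice to always begin evicting from the first table are hypothetical, and the hash functions are plain lookup tables (`H1`, `H2`) so the example is deterministic. It does not reproduce the exact slot layout of the figure.

```python
class CuckooHashTable:
    """Two hash tables, each with its own hash function (hypothetical sketch)."""

    def __init__(self, size, hash1, hash2, max_kicks=16):
        self.size = size
        self.hashes = (hash1, hash2)
        self.tables = ([None] * size, [None] * size)
        self.max_kicks = max_kicks  # assumed bound on the eviction chain

    def _slot(self, which, key):
        return self.hashes[which](key) % self.size

    def get(self, key):
        # Only one location per table is checked.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key, value):
        # Update in place if the key is already stored.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        # Otherwise pick any table that has a free slot for this key.
        for which in (0, 1):
            i = self._slot(which, key)
            if self.tables[which][i] is None:
                self.tables[which][i] = (key, value)
                return
        # No free slot: evict a resident and re-hash it into the other table.
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._slot(which, entry[0])
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return
            which = 1 - which
        # A real system needs a recovery strategy here; the entry in hand is not stored.
        raise RuntimeError("eviction chain too long")


# Assumed hash values: all three keys collide in table #1; B and C collide in table #2.
H1 = {"A": 0, "B": 0, "C": 0}
H2 = {"A": 2, "B": 1, "C": 1}

table = CuckooHashTable(4, H1.__getitem__, H2.__getitem__)
for k, v in [("A", 1), ("B", 2), ("C", 3)]:
    table.put(k, v)
```

In this run, A lands in table #1 and B in table #2. Inserting C finds both of its slots occupied, so it evicts A from table #1, and A is re-hashed into its free slot in table #2.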
