general overview rel model carnegie mellon univ dept of
play

General Overview - rel. model Carnegie Mellon Univ. Dept. of - PDF document

C. Faloutsos General Overview - rel. model Carnegie Mellon Univ. Dept. of Computer Science Relational model - SQL Formal & commercial query languages 15-415 - Database Applications Functional Dependencies Normalization


  1. C. Faloutsos General Overview - rel. model Carnegie Mellon Univ. Dept. of Computer Science • Relational model - SQL – Formal & commercial query languages 15-415 - Database Applications • Functional Dependencies • Normalization C. Faloutsos • Physical Design Indexing and Hashing – part II • Indexing Carnegie Mellon Carnegie Mellon 15-415 - C. Faloutsos 2 Indexing- overview (Static) Hashing • ISAM and B-trees Problem: “ find EMP record with ssn=123” • hashing What if disk space was free, and time was at premium? • Hashing vs B-trees • Indices in SQL • Advanced topics: – dynamic hashing – multi-attribute indexing Carnegie Mellon 15-415 - C. Faloutsos 3 Carnegie Mellon 15-415 - C. Faloutsos 4 Hashing Hashing A: Brilliant idea: key-to-address transformation: Since space is NOT free: • use M, instead of 999,999,999 slots #0 page • hash function: h(key) = slot-id #0 page 123; Smith; Main str #123 page 123; Smith; Main str #123 page #999,999,999 #999,999,999 Carnegie Mellon 15-415 - C. Faloutsos 5 Carnegie Mellon 15-415 - C. Faloutsos 6 CMU - 15-415 1

  2. C. Faloutsos Hashing Hashing Typically: each hash bucket is a page, holding Notice: could have clustering , or non-clustering many records: versions: #0 page #0 page 123; Smith; Main str. #h(123) 123; Smith; Main str #h(123) M M Carnegie Mellon 15-415 - C. Faloutsos 7 Carnegie Mellon 15-415 - C. Faloutsos 8 Hashing Indexing- overview • ISAM and B-trees Notice: could have clustering, or non-clustering • hashing versions: EMP file – hashing functions ... #0 page – size of hash table ... – collision resolution 234; Johnson; Forbes ave 123 #h(123) • Hashing vs B-trees 123; Smith; Main str. • Indices in SQL ... M • Advanced topics: 345; Tompson; Fifth ave ... Carnegie Mellon 15-415 - C. Faloutsos 9 Carnegie Mellon 15-415 - C. Faloutsos 10 Design decisions Design decisions - functions 1) formula h() for hashing function • Goal: uniform spread of keys over hash buckets 2) size of hash table M 3) collision resolution method • Popular choices: – Division hashing – Multiplication hashing Carnegie Mellon 15-415 - C. Faloutsos 11 Carnegie Mellon 15-415 - C. Faloutsos 12 CMU - 15-415 2

  3. C. Faloutsos Division hashing Division hashing • eg ., M =2; hash on driver-license number h(x) = (a*x+b) mod M (dln), where last digit is ‘gender’ (0/1 = M/F) • eg., h(ssn) = (ssn) mod 1,000 • in an army unit with predominantly male soldiers – gives the last three digits of ssn • M : size of hash table - choose a prime • Thus: avoid cases where M and keys have common divisors - prime M guards against number, defensively (why?) that! Carnegie Mellon 15-415 - C. Faloutsos 13 Carnegie Mellon 15-415 - C. Faloutsos 14 Multiplication hashing Other hashing functions h(x) = [ fractional-part- ��������� �������� • quadratic hashing (bad) • : golden ratio ( 0.618... = ( sqrt(5)-1)/2 ) • ... • in general, we need an irrational number • conclusion: use division hashing • advantage: M need not be a prime number • but � must be irrational Carnegie Mellon 15-415 - C. Faloutsos 15 Carnegie Mellon 15-415 - C. Faloutsos 16 Design decisions Size of hash table 1) formula h() for hashing function • eg., 50,000 employees, 10 employee- records / page 2) size of hash table M 3) collision resolution method • Q: M =?? pages/buckets/slots Carnegie Mellon 15-415 - C. Faloutsos 17 Carnegie Mellon 15-415 - C. Faloutsos 18 CMU - 15-415 3

  4. C. Faloutsos Size of hash table Design decisions • eg., 50,000 employees, 10 employees/page 1) formula h() for hashing function 2) size of hash table M • Q: M =?? pages/buckets/slots 3) collision resolution method • A: utilization ~ 90% and – M : prime number Eg., in our case: M= closest prime to 50,000/10 / 0.9 = 5,555 Carnegie Mellon 15-415 - C. Faloutsos 19 Carnegie Mellon 15-415 - C. Faloutsos 20 Collision resolution Collision resolution • Q: what is a ‘collision’? #0 page • A: ?? #h(123) 123; Smith; Main str. M Carnegie Mellon 15-415 - C. Faloutsos 21 Carnegie Mellon 15-415 - C. Faloutsos 22 Collision resolution Collision resolution • Q: what is a ‘collision’? • open addressing • A: ?? – linear probing (ie., put to next slot/bucket) – re-hashing • Q: why worry about collisions/overflows? • separate chaining (ie., put links to overflow (recall that buckets are ~90% full) pages) • A: ‘birthday paradox’ Carnegie Mellon 15-415 - C. Faloutsos 23 Carnegie Mellon 15-415 - C. Faloutsos 24 CMU - 15-415 4

  5. C. Faloutsos Collision resolution Collision resolution linear probing: re-hashing #0 page #0 page h1() #h(123) #h(123) 123; Smith; Main str. 123; Smith; Main str. h2() M M Carnegie Mellon 15-415 - C. Faloutsos 25 Carnegie Mellon 15-415 - C. Faloutsos 26 Collision resolution Design decisions - conclusions • function: division hashing separate chaining – h(x) = ( a*x+b ) mod M • size M : ~90% util.; prime number. • collision resolution: separate chaining 123; Smith; Main str. – easier to implement (deletions!); – no danger of becoming full Carnegie Mellon 15-415 - C. Faloutsos 27 Carnegie Mellon 15-415 - C. Faloutsos 28 Indexing- overview Hashing vs B-trees: • ISAM and B-trees Hashing offers • hashing • speed ! ( O(1) avg . search time) • Hashing vs B-trees ..but: • Indices in SQL • Advanced topics: – dynamic hashing – multi-attribute indexing Carnegie Mellon 15-415 - C. Faloutsos 29 Carnegie Mellon 15-415 - C. Faloutsos 30 CMU - 15-415 5

  6. C. Faloutsos Hashing vs B-trees: Hashing vs B-trees: ..but B-trees give: thus: • key ordering: • B-trees are implemented in most systems – range queries – proximity queries footnotes : – sequential scan • hashing is not (why not?) • O(log(N)) guarantees for search, ins./del. • ‘dbm’ and ‘ndbm’ of UNIX: offer one or both • graceful growing/shrinking Carnegie Mellon 15-415 - C. Faloutsos 31 Carnegie Mellon 15-415 - C. Faloutsos 32 Indexing- overview Indexing in SQL • ISAM and B-trees • create index <index-name> on <relation- name> (<attribute-list>) • hashing • create unique index <index-name> on • Hashing vs B-trees <relation-name> (<attribute-list>) • Indices in SQL • drop index <index-name> • Advanced topics: – dynamic hashing – multi-attribute indexing Carnegie Mellon 15-415 - C. Faloutsos 33 Carnegie Mellon 15-415 - C. Faloutsos 34 Indexing in SQL Indexing- overview • eg., • ISAM and B-trees create index ssn-index • hashing on STUDENT (ssn) • Hashing vs B-trees • or (eg., on TAKES(ssn,cid, grade) ): • Indices in SQL create index sc-index • Advanced topics: (theoretical interest) on TAKES (ssn, c-id) – dynamic hashing – multi-attribute indexing Carnegie Mellon 15-415 - C. Faloutsos 35 Carnegie Mellon 15-415 - C. Faloutsos 36 CMU - 15-415 6

  7. C. Faloutsos Solution: Dynamic/extendible Problem with static hashing hashing • problem: overflow? • idea: shrink / expand hash table on demand.. • problem: underflow? (underutilization) • ..dynamic hashing Details: how to grow gracefully, on overflow? Many solutions - One of them: ‘extendible hashing’ Carnegie Mellon 15-415 - C. Faloutsos 37 Carnegie Mellon 15-415 - C. Faloutsos 38 Extendible hashing Extendible hashing #0 page #0 page solution: split the bucket in two #h(123) #h(123) 123; Smith; Main str. 123; Smith; Main str. M M Carnegie Mellon 15-415 - C. Faloutsos 39 Carnegie Mellon 15-415 - C. Faloutsos 40 Extendible hashing Extendible hashing directory in detail: 0001... • keep a directory, with ptrs to hash-buckets 0111... 00... 01... • Q: how to divide contents of bucket in two? 10101... 10... • A: hash each key into a very long bit string; 10011... 10110... 11... keep only as many bits as needed 1101... Eventually: 101001... Carnegie Mellon 15-415 - C. Faloutsos 41 Carnegie Mellon 15-415 - C. Faloutsos 42 CMU - 15-415 7

  8. C. Faloutsos Extendible hashing Extendible hashing directory directory 0001... 0001... 0111... 0111... 00... 00... 01... 01... 10101... 10101... 10... 10... split on 3-rd bit 10011... 10011... 10110... 10110... 11... 11... 101001... 1101... 1101... 101001... Carnegie Mellon 15-415 - C. Faloutsos 43 Carnegie Mellon 15-415 - C. Faloutsos 44 Extendible hashing Extendible hashing directory directory (doubled) 0001... 0001... 0111... 0111... 00... 000... 01... 001... new page / bucket new page / bucket 10... 010... 10011... 10101... 10011... 10101... 101001... 011... 101001... 11... 10110... 10110... 100... 1101... 1101... 101... 110... 111... Carnegie Mellon 15-415 - C. Faloutsos 45 Carnegie Mellon 15-415 - C. Faloutsos 46 Extendible hashing Extendible hashing BEFORE AFTER • Summary: directory doubles on demand 0001... 0001... 000... 0111... 0111... 00... 001... • or halves, on shrinking files 01... 010... 10101... • needs ‘local’ and ‘global’ depth (see book) 10... 10101... 011... 10011... 10011... 10110... 101001... • Mainly, of theoretical interest - same for 11... 100... 101001... 10110... 101... – ‘linear hashing’ of Litwin 1101... 1101... 110... – ‘order preserving’ – ‘perfect hashing’ (no collisions!) 111... Carnegie Mellon 15-415 - C. Faloutsos 47 Carnegie Mellon 15-415 - C. Faloutsos 48 CMU - 15-415 8

Recommend


More recommend