Hashing (Chapter 9, Sections 9.1–9.4.1)

Searching
• A systematic method for locating a record with a key value k_j = K.
• Related terms:
  – successful search
  – unsuccessful search
  – exact match query
  – range query

Maps
• A map models a searchable collection of key-value entries.
• The main operations of a map are for searching, inserting, and deleting items.
• Multiple entries with the same key are not allowed.
• Applications:
  – address book
  – student-record database

A Simple List-Based Map
• We can efficiently implement a map using an unsorted list.
• We store the items of the map in a list S (based on a doubly-linked list), in arbitrary order.
[Figure: a doubly-linked list of map entries with keys 5, 8, 9, and 6.]

Performance of a List-Based Map
• put takes O(1) time, since we can insert the new item at the beginning or at the end of the sequence.
• get and remove take O(n) time, since in the worst case (the item is not found) we traverse the entire sequence looking for an item with the given key.
• The unsorted list implementation is effective only for maps of small size, or for maps in which puts are the most common operations while searches and removals are rarely performed (e.g., a historical record of logins to a workstation). A sketch follows below.
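To make the list-based map concrete, here is a minimal Python sketch (the class and method names are illustrative assumptions, not from the slides). put appends in O(1) and assumes the caller never reuses a key, matching the no-duplicates rule above; get and remove scan the list in O(n):

class ListMap:
    """Map backed by an unsorted list: put is O(1), get/remove are O(n)."""

    def __init__(self):
        self._entries = []                    # (key, value) pairs, arbitrary order

    def put(self, key, value):
        # O(1): insert at the end; assumes key is not already present
        self._entries.append((key, value))

    def get(self, key):
        for k, v in self._entries:            # O(n) worst case: key not found
            if k == key:
                return v
        return None                           # unsuccessful search

    def remove(self, key):
        for i, (k, _) in enumerate(self._entries):
            if k == key:                      # O(n) scan, then O(1) removal
                return self._entries.pop(i)[1]
        return None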

Hashing: Table Representations of Data
• Goal: a 1-to-1 mapping from keys onto table spaces (ex. 5000 employees).
[Figure: each key mapped to its own space in the table.]
• To provide a unique set of keys: YOU MUST HAVE A UNIQUE KEY!
• Compare binary search: f(n) = 2 log2 N, so 2 log2 5000 ≈ 24.6 comparisons, i.e., O(log2 n). How would you like O(1)?
1. Need a hashing function, h(k), that maps key K onto an address in the table.
2. Must ensure that h(k1) ≠ h(k2) whenever k1 ≠ k2.

Simple, naïve example
• Table size M = 7, h(k) = k mod M.
• Example: add 10, 2, 19, 14, 24, 23.
  – 10 mod 7 = 3
  – 2 mod 7 = 2
  – 19 mod 7 = 5
• After these three inserts: slot 2 holds 2, slot 3 holds 10, slot 5 holds 19 (slots 0, 1, 4, and 6 are empty).
• Where do 14, 24 and 23 go? (Worked in the sketch below: 14 fits, but 24 and 23 collide.)
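Running the naive example confirms where the last three keys try to go (a small sketch using a plain Python list as the table):

M = 7
table = [None] * M

for k in (10, 2, 19, 14, 24, 23):
    slot = k % M
    if table[slot] is None:
        table[slot] = k
        print(f"{k} mod 7 = {slot}: placed")
    else:
        print(f"{k} mod 7 = {slot}: COLLISION with {table[slot]}")

# 14 mod 7 = 0 is placed, but 24 (slot 3) collides with 10
# and 23 (slot 2) collides with 2; we need a resolution policy.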

Choosing a Hash Function
• A good hash function maps keys uniformly and randomly.
• A poor hash function maps keys non-uniformly, or maps contiguous clusters of keys into clusters of hash table locations.

Example hash functions
1. Division Method
   – choose a prime number as table size, M
   – interpret keys as integers
   – h(k) = k mod M
2. Shift Folding (see the sketch after this list)
   – key is divided into sections
   – sections are added together
   – ex. 9-digit key k = 013402122: 013 + 402 + 122 = 537 = h(k)
   – can use multiplication, subtraction, addition (whatever), in some fashion, to combine the sections into a final value
3. Boundary Folding (also sketched below)
   – like shift folding, but every other section is reversed before folding
   – ex. k = 013402122: 013 + 204 + 122 = 339 = h(k)
   – not much difference from shift folding; decide which method gives the better-scattered result based on experiments
4. Middle Squaring
   – take the middle portion of the key, square it, and adjust
   – ex. k = 013402122: h(k) = 402^2 = 161604
   – adjust: 1) could mod M, or 2) could take the middle 4 digits (6160)
5. Truncation
   – simply delete part of the key and use the remaining digits
   – ex. k = 013402122: h(k) = 122
   – easy to compute, but not very random or uniform
   – seldom used alone; commonly used in conjunction with another method
6. Digit or Character Extraction
   – similar to truncation
   – when the key has a predictable pattern of values, extract the useful digits before applying another hash function
   – ex. a company coding scheme; used for character codes
7. Random Number Generator
   – use the key as the "seed"; the next number computed is the hash value
   – unlikely that 2 different seeds will give the same random number
   – can be computationally expensive
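A minimal sketch of the two folding methods applied to the 9-digit key from the examples above (the function names and the 3-digit section width are assumptions drawn from the example):

def sections(key: str, width: int = 3):
    """Split the key string into fixed-width sections."""
    return [key[i:i + width] for i in range(0, len(key), width)]

def shift_fold(key: str) -> int:
    # add the sections together: 013 + 402 + 122 = 537
    return sum(int(s) for s in sections(key))

def boundary_fold(key: str) -> int:
    # like shift folding, but every other section is reversed first:
    # 013 + 204 + 122 = 339
    return sum(int(s if i % 2 == 0 else s[::-1])
               for i, s in enumerate(sections(key)))

assert shift_fold("013402122") == 537
assert boundary_fold("013402122") == 339
# Either result would then typically be reduced mod M to fit the table.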

Collisions
• def.: a collision occurs when h(k1) = h(k2) for distinct keys k1 ≠ k2.
• Recall we need to map keys in a UNIFORM and RANDOM fashion. (For an arbitrary key, any possible table address, 0 to M−1, should be equally likely to be chosen by your hashing function.)

A BAD EXAMPLE
• Keys: variable names (registers in assembly), up to 3 characters, using 8-bit ASCII chars (a 24-bit integer, divided into 3 equal 8-bit sections).
• K = C3 C2 C1; numerically, K = C3·256^2 + C2·256^1 + C1·256^0.
• M = 2^8 = 256, h(k) = k mod M = k mod 256; thus K reduced mod 256 has value C1.
• Hash 6 keys: RX1, RX2, RX3, RY1, RY2, RY3:
  – h(RX1) = h(RY1) = '1' = 49
  – h(RX2) = h(RY2) = '2' = 50
  – h(RX3) = h(RY3) = '3' = 51
• Problem with this policy: it has the effect of selecting the low-order character as the value of h(k), so
  1) the 6 original keys map into only 3 unique addresses
  2) contiguous runs of keys, in general, result in contiguous runs of table space

Hashing: how often do collisions really happen?
• Von Mises Birthday Paradox: with 23 or more people, the probability of a birthday match exceeds 50%.
• Q(n) is the probability that, if you randomly toss n balls into a table with 365 slots, there will be no collision; P(n) = 1 − Q(n) is the probability of a collision.
• Develop a recurrence relation:
  – Q(1) = 1 // with one ball there can be no collision
  – Q(2) = Q(1) × (364/365)
  – Q(3) = Q(2) × (363/365)
  – in general, Q(n) = Q(n−1) × (365 − n + 1)/365
• As soon as the table is 12.9% full, there is greater than a 95% chance that 2 keys will collide (verified in the sketch below).
• Moral of the story: even in a sparsely occupied hash table, collisions are relatively common.
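A short sketch of the recurrence, finding the classic 50% point and the 95% point quoted above:

SLOTS = 365
q, n = 1.0, 1                         # q = Q(1) = 1: one ball never collides
first_over_half = None
while q > 0.05:                       # stop once P(n) = 1 - q exceeds 0.95
    n += 1
    q *= (SLOTS - n + 1) / SLOTS      # Q(n) from Q(n-1)
    if first_over_half is None and 1 - q > 0.5:
        first_over_half = n

print(first_over_half)                # 23: the birthday-paradox threshold
print(n, f"{n / SLOTS:.1%}")          # 47 balls, 12.9% full, P(47) > 0.95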
Collision Resolution Policies: Open Addressing
Inserting keys into other empty locations in the table.
1) linear probing (sketched below)
   – go to the next open space
   – wrap around, if necessary
2) double hashing
   – calculate a probe decrement P(K) = max(1, K mod M)
3) rehashing
   – apply h(k); if a collision, apply h1(k); if still a collision, apply h2(k)
   – use an entire sequence of hash functions
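A minimal sketch of linear probing, reusing the naive mod-7 example so the three problem keys now find homes (plain Python; the helper name is an assumption):

M = 7
table = [None] * M

def insert_linear(k):
    slot = k % M
    while table[slot] is not None:    # assumes at least one slot is free
        slot = (slot + 1) % M         # next open space, with wraparound
    table[slot] = k

for k in (10, 2, 19, 14, 24, 23):
    insert_linear(k)

print(table)                          # [14, None, 2, 10, 24, 19, 23]
# 24 collides at slot 3 and lands at 4; 23 collides at 2, 3, 4, 5
# and finally lands at slot 6.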
Collision Resolution Policies: Chaining
• use linked lists: each table slot holds a list of the keys that hash there

Quadratic Collision Processing
• examines locations whose distance from the initial collision point increases as the square of the distance from the previous location tried
• ex. h(k) = A and we collide: try A+1^2, A+2^2, A+3^2, ..., A+R^2
• uses wraparound
• leaves increasingly larger gaps between successive relocation positions

Example of Double Hashing
• Consider a hash table storing integer keys that handles collisions with double hashing:
  – N = 13
  – h(k) = k mod 13
  – d(k) = 7 − (k mod 7)
• Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order:

  k    h(k)  d(k)  probes
  18    5     3    5
  41    2     1    2
  22    9     6    9
  44    5     5    5, 10
  59    7     4    7
  32    6     3    6
  31    5     4    5, 9, 0
  73    8     4    8

• Final table (slots 0–12): 31, –, 41, –, –, 18, 32, 59, 73, 22, 44, –, –
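The same example run in code (a sketch; the helper name insert_double is an assumption, and the table is assumed never to fill):

N = 13
table = [None] * N

def insert_double(k):
    slot, d = k % 13, 7 - (k % 7)     # h(k) and the probe decrement d(k)
    while table[slot] is not None:
        slot = (slot + d) % N         # step by d(k), wrapping around
    table[slot] = k

for k in (18, 41, 22, 44, 59, 32, 31, 73):
    insert_double(k)

print(table)
# [31, None, 41, None, None, 18, 32, 59, 73, 22, 44, None, None]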
Clusters
• def. contiguous runs of occupied entries

Primary Clustering
• look at linear probing: it causes a small "puddle" of keys to form at the collision location
• the small puddle grows larger, and the larger it grows, the faster it grows
• small puddles connect to form large puddles
• note: linear probing is subject to primary clustering; double hashing is not

Secondary Clustering
• when any 2 keys collide at a given location, they both subsequently examine the same sequence of alternative locations until the collision is resolved
• not as bad as primary clustering: secondary clusters do not connect to form larger secondary clusters
• quadratic collision processing is subject to secondary clustering

Load Factors
• Suppose table T is of size M, and N entries are occupied (M − N are empty).
• α = N/M is the load factor of T.
• ex. M = 100, N = 75, α = 0.75; we say T is 75% full.

Performance depends on:
1) the uniformity and randomness of h(k)
2) the collision resolution policy
3) the load factor (density of the table)
Hashing is a "density-dependent search technique", meaning you can achieve a highly efficient result if you are willing to waste enough vacant records.