Hash tables


1. Hash tables

Most data structures that we're going to see are about storing and manipulating data. When only the operations Insert, Search and Delete are needed as dictionary operations, hash tables can be quite good. There are many variations of hash tables (or rather of the functions implementing them), from not-so-fast but simple to extremely fast but complicated.

Elements are pairs (key, data); keys are distinct.

Intuition: you have some, say, "clever array", and
• Insert(elem) inserts elem somewhere into the array
• Search(elem) knows where elem is stored and returns the corresponding data
• Delete(elem) also knows where elem is and removes it

2. Actual "positions" (somehow) depend on keys

Important: we want to maintain a dynamic set (insertions and deletions).

Given universe U = {0, ..., u − 1} for some (typically large) u; keys are from U.

Simple approach: an array of size |U|; operations are straightforward, the element with key k is stored in slot k. (A sketch of this direct-address approach follows below.)

But what if K, the set of keys actually stored, is much, much smaller than U? Waste of memory (but time-efficient).
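A minimal Python sketch of such a direct-address table (class and method names are illustrative, not from the slides):

    # Direct-address table over U = {0, ..., u-1}: one slot per possible key.
    class DirectAddressTable:
        def __init__(self, u):
            self.slots = [None] * u      # memory proportional to |U|, not |K|

        def insert(self, key, data):
            self.slots[key] = data       # element with key k lives in slot k

        def search(self, key):
            return self.slots[key]       # None signals "not present"

        def delete(self, key):
            self.slots[key] = None

All three operations are O(1), but the array itself is hopeless for a large universe, which motivates hashing.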

3. What we want is to reduce the size of the table

Hashing: the element with key k is stored in slot h(k); we use a hash function h to compute the slot.

We hope to be able to reduce the size of the table to, say, m:

h: U → {0, ..., m − 1} for some m << |U|

We say the element with key k hashes into slot h(k), and that h(k) is the hash value of k.

But... two or more keys may hash to the same slot (collisions).

4. Best idea: just avoid collisions; tailor the hash function accordingly

However: by assumption |U| > m, so there must be at least two keys with the same hash value, thus complete avoidance is impossible.

Thus: whatever h we choose, we still need some form of collision resolution.

5. Hashing with chaining

The simplest of all collision-resolution protocols. Does just what you'd expect: each slot really is a list. When elements collide, just insert the new guy into the list ("the chain").

Suppose T is your hash table, h your hash function:

Chained-Hash-Insert(T, x): insert x at the head of list T[h(key[x])]
Chained-Hash-Search(T, k): search for an element with key k in list T[h(k)]
Chained-Hash-Delete(T, x): delete x from list T[h(key[x])]
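A minimal Python sketch of hashing with chaining (names are illustrative; Python lists stand in for the linked lists, so delete here scans the chain rather than doing the O(1) unlink discussed on the next slide):

    class ChainedHashTable:
        def __init__(self, m):
            self.m = m
            self.table = [[] for _ in range(m)]    # one chain per slot

        def _h(self, key):
            return key % self.m                    # placeholder hash function

        def insert(self, key, data):
            chain = self.table[self._h(key)]
            chain.insert(0, (key, data))           # insert at head of chain

        def search(self, key):
            for k, d in self.table[self._h(key)]:  # walk the chain
                if k == key:
                    return d
            return None                            # unsuccessful search

        def delete(self, key):
            chain = self.table[self._h(key)]
            for i, (k, _) in enumerate(chain):
                if k == key:
                    del chain[i]
                    return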

6. What about running times?

• Insert: clearly O(1) under the assumption that the element is not yet in the table; otherwise search first
• Search: proportional to the length of the list; more details to come
• Delete: note that the argument is x, not k, thus constant-time access, then another O(1) if doubly-linked lists are used. If the argument were a key, then a search would be necessary. If the lists are singly-linked, essentially a search is still necessary (we need the predecessor of x)

7. Given a hash table T with m slots that stores n elements

Def: the load factor is α = n/m (the average list size).

The analysis is in terms of α (not necessarily greater than one!).

Clear: worst-case performance is poor: if all n keys hash to the same slot, then we might just as well have used a single list.

Average performance depends on how well the hash function h (that we still don't know) distributes keys, on average.

8. We'll see more details, but for now a (very strong) assumption:

Any given element is equally likely to hash into any of the m slots, independently of where other elements hash to.

This assumption is called simple uniform hashing.

Two intuitions come to mind:
1. the input is some random sample, and the hash function is fixed
2. the input is fixed, and the hash function is somehow randomised

9. For j ∈ {0, ..., m − 1} let n_j = length(T[j])

Clearly, n_0 + n_1 + ··· + n_{m−1} = n.

Also, the average value of n_j is E[n_j] = α = n/m (recall: "equally likely...").

Another assumption (not necessarily true): the hash function h can be evaluated in O(1) time.

Thus, the time required to search for some element with key k depends linearly on the length n_{h(k)} of list T[h(k)].

10. We consider unsuccessful (no element in the table has key k) and successful searches.

Theorem. Under simple uniform hashing, if collisions are resolved by chaining, then an unsuccessful search takes expected time Θ(1 + α), with α = n/m.

Proof.
• any key k not already in the table (recall: unsuccessful) is equally likely to hash to any of the m slots (read: they all look the same to us)
• the expected time to search unsuccessfully for k is the expected time to search to the end of T[h(k)]
• T[h(k)] has expected length E[n_{h(k)}] = α
• thus the expected # of examined elements is α
• add 1 for the evaluation of h

Recall: α could be very small, thus Θ(1 + α) does make sense!
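A quick empirical check of this claim, with simple uniform hashing modeled by assigning every key a uniformly random slot (all parameter values are arbitrary illustration choices):

    import random

    def avg_unsuccessful_cost(n, m, trials=1000):
        # Average chain length in the slot a fresh key would hash to.
        total = 0
        for _ in range(trials):
            lengths = [0] * m
            for _ in range(n):
                lengths[random.randrange(m)] += 1   # each insert picks a slot
            total += lengths[random.randrange(m)]   # slot scanned by the new key
        return total / trials

    print(avg_unsuccessful_cost(n=1000, m=400))     # α = 2.5; prints ≈ 2.5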

11. For successful searches, not all lists are equally likely to be searched

The probability that a list is searched is proportional to the # of elements it contains (under certain assumptions).

We assume the element being searched for is equally likely to be any of the n elements in the table. Then we get:

Theorem. Under simple uniform hashing, if collisions are resolved by chaining, then a successful search takes expected time Θ(1 + α), with α = n/m.

Proof.
• the # of elements examined is 1 more than the # of elements before x in x's list
• the elements before x were inserted after x itself (new elements are placed at the front)

12. Let x_i be the i-th element inserted into the table, 1 ≤ i ≤ n, and let k_i = key(x_i)

For keys k_i, k_j, define the Bernoulli r.v. X_ij = 1 iff h(k_i) = h(k_j).

Under simple uniform hashing,

P(X_ij = 1) = Σ_{z=0}^{m−1} P(h(k_i) = z) · P(h(k_j) = z) = Σ_{z=0}^{m−1} (1/m)² = 1/m

13. Thus E[X_ij] = P(X_ij = 1) = 1/m, and the expected number of elements examined in a successful search is

E[ (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} X_ij ) ]
  = 1 + (1/n) Σ_{i=1}^{n} Σ_{j=i+1}^{n} E[X_ij]
  = 1 + (1/n) Σ_{i=1}^{n} Σ_{j=i+1}^{n} 1/m
  = 1 + (1/(nm)) Σ_{i=1}^{n} (n − i)
  = 1 + (1/(nm)) ( n² − n(n+1)/2 )
  = 1 + (n − 1)/(2m)
  = 1 + α/2 − α/(2n)
  = Θ(1 + α)

14. Consequence: if m (# of slots) is at least proportional to n (# of elements), then n = O(m) and α = n/m = O(1), thus searching takes constant time on average!

Insertion and deletion also take constant time (even in the worst case, if doubly-linked lists are used), thus all operations take constant time on average!

(However: we need the assumption of simple uniform hashing.)

15. So far, we haven't seen a single hash function

What makes a good hash function? It satisfies (more or less) the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of where other keys hash to.

However, this is typically impossible to guarantee, certainly depending on how keys are chosen (think of an evil adversary).

Sometimes we know the key distribution. Ex: if keys are real random numbers k ∈ [0, 1), independently and uniformly chosen, then h(k) = ⌊k · m⌋ satisfies the condition.
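A toy check of that last example in Python (m and the sample size are arbitrary illustration values):

    import random

    m = 8
    counts = [0] * m
    for _ in range(100_000):
        k = random.random()      # uniform key in [0, 1)
        counts[int(k * m)] += 1  # h(k) = floor(k * m)
    print(counts)                # each slot gets roughly 100000/8 hits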

16. Usual assumption: the universe of keys is {0, 1, 2, ...}, i.e., somehow interpret real keys as natural numbers ("usually" easy enough...)

Two very simple hash functions:

1. Division method: h(k) = k mod m

Ex: the hash table has size 25 and key k = 234, then h(k) = 234 mod 25 = 9

Quite fast, but with drawbacks: we want to avoid certain values of m, e.g. powers of 2. Why? If m = 2^p, then h(k) = k mod m = k mod 2^p is just the p lowest-order bits of k.

Ex: m = 2^5 = 32, k = 168, h(k) = 168 mod 32 = 8 = (1000)_2, and k = 168 = (10101000)_2

Better to make the hash depend on all bits of the key. A good idea (usually) for m: a prime not too close to a power of two.
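A small sketch of the division method and its power-of-two pitfall (the keys are illustrative; all three share the low-order bits 01000):

    def h_div(k, m):
        return k % m             # division method: h(k) = k mod m

    keys = [0b00001000, 0b10101000, 0b11101000]   # 8, 168, 232
    print([h_div(k, 32) for k in keys])   # [8, 8, 8]: only low 5 bits count
    print([h_div(k, 31) for k in keys])   # [8, 13, 15]: prime m uses all bits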

17. 2. Multiplication method: h(k) = ⌊m(kA mod 1)⌋

Uh, what's that?
• A is a constant with 0 < A < 1
• Thus kA is real with 0 ≤ kA < k
• kA mod 1 is the fractional part of kA, i.e., kA − ⌊kA⌋. Ex: A = 0.23, k = 234, then kA = 53.82 and kA mod 1 = 0.82. IOW: kA mod 1 ∈ [0, 1)
• Therefore m(kA mod 1) ∈ [0, m), and ⌊m(kA mod 1)⌋ ∈ {0, 1, ..., m − 1}

Voila!

Advantage: the value of m is not critical. It is typically a power of two (no good with the division method!), since then the implementation is easy (some comments in the textbook).
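A sketch of the multiplication method; A = (√5 − 1)/2 ≈ 0.618 is a commonly suggested constant (due to Knuth) and is used here purely for illustration:

    import math

    A = (math.sqrt(5) - 1) / 2    # ≈ 0.6180, a popular choice for A

    def h_mul(k, m):
        frac = (k * A) % 1.0      # fractional part of k*A, in [0, 1)
        return int(m * frac)      # scaled into {0, 1, ..., m-1}

    print([h_mul(k, 32) for k in [8, 168, 232]])   # m a power of two is fine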
