Hash tables

Most data structures that we're going to see are about storing and manipulating data. When only the dictionary operations Insert, Search and Delete are needed, hash tables can be quite good.

There are many variations of hash tables (or rather of the functions implementing them), from not-so-fast but simple to extremely fast but complicated.

Elements are pairs (key, data); keys are distinct.

Intuition: you have some, say, “clever array”, and
• Insert(elem) inserts elem somewhere into the array
• Search(elem) knows where elem is stored and returns the corresponding data
• Delete(elem) also knows where elem is and removes it
Actual “positions” (somehow) depend on keys.

Important: we want to maintain a dynamic set (insertions and deletions).

Given universe U = {0, . . . , u − 1} for some (typically large) u; keys are from U.

Simple approach: an array of size |U|; operations are straightforward, the element with key k is stored in slot k (sketch below).

But what if K, the set of keys actually stored, is much, much smaller than U? Waste of memory (but time-efficient).
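A toy Python sketch of this simple approach (my own illustration, with a made-up tiny universe):

    u = 16                               # universe U = {0, ..., 15}
    T = [None] * u                       # one slot per possible key

    def insert(k, data): T[k] = data     # element with key k goes into slot k
    def search(k): return T[k]           # O(1)
    def delete(k): T[k] = None           # O(1)

    # All three operations are O(1), but T always occupies |U| slots:
    # wasteful when the set K of stored keys is much smaller than U.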
What we want is to reduce the size of the table.

Hashing: the element with key k is stored in slot h(k); we use a hash function h to compute the slot.

We hope to be able to reduce the size of the table to some m:

    h : U → {0, . . . , m − 1} for some m ≪ |U|

We say the element with key k hashes into slot h(k), and that h(k) is the hash value of k.

But. . . two or more keys may hash to the same slot (collisions).
Best idea: just avoid collisions; tailor the hash function accordingly.

However: by assumption |U| > m, so there must be at least two keys with the same hash value; complete avoidance is impossible.

Thus: whatever h we choose, we still need some form of collision resolution.
Hashing with chaining

The simplest of all collision-resolution protocols. It does just what you'd expect: each slot really is a list. When elements collide, just insert the new guy into the list (“the chain”).

Suppose T is your hash table and h your hash function:

Chained-Hash-Insert(T, x)
    insert x at the head of list T[h(key[x])]

Chained-Hash-Search(T, k)
    search for an element with key k in list T[h(k)]

Chained-Hash-Delete(T, x)
    delete x from list T[h(key[x])]
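A minimal Python sketch of chaining (class and method names are my own; it uses plain Python lists as chains, and its delete takes a key rather than a reference to the element, so unlike Chained-Hash-Delete(T, x) it must search first; see the running-time discussion on the next slide):

    class ChainedHashTable:
        def __init__(self, m=8):
            self.m = m                               # number of slots
            self.slots = [[] for _ in range(m)]      # one chain per slot

        def _h(self, key):
            return hash(key) % self.m                # placeholder hash function

        def insert(self, key, data):
            # assumes key is not yet in the table; insert at head of chain: O(1)
            self.slots[self._h(key)].insert(0, (key, data))

        def search(self, key):
            # walk the chain: time proportional to the chain's length
            for k, d in self.slots[self._h(key)]:
                if k == key:
                    return d
            return None                              # unsuccessful search

        def delete(self, key):
            chain = self.slots[self._h(key)]
            for i, (k, _) in enumerate(chain):       # key-based, so a search is needed
                if k == key:
                    del chain[i]
                    return

    t = ChainedHashTable()
    t.insert(42, "answer")
    print(t.search(42))                              # prints: answer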
What about the running times?
• Insert: clearly O(1), under the assumption that the element is not yet in the table; otherwise search first
• Search: proportional to the length of the list; more details to come
• Delete: note that the argument is x, not k, so we have constant-time access, then another O(1) if lists are doubly linked. If the argument were a key, a search would be necessary. If lists are singly linked, essentially a search is still necessary (we need the predecessor of x)
Given a hash table T with m slots that stores n elements.

Def.: the load factor is α = n/m (the average list length); note it is not necessarily greater than one! The analysis is in terms of α.

Clear: the worst-case performance is poor: if all n keys hash to the same slot, we might just as well have used a single list.

Average performance depends on how well the hash function h (which we still haven't chosen) distributes keys, on average.
We'll see more details, but for now a (very strong) assumption:

Any given element is equally likely to hash into any of the m slots, independently of where other elements hash to.

This assumption is called simple uniform hashing.

Two intuitions come to mind:
1. the input is some random sample, and the hash function is fixed
2. the input is fixed, and the hash function is somehow randomised
For j ∈ {0, . . . , m − 1} let n_j = length(T[j]). Clearly, n_0 + n_1 + · · · + n_{m−1} = n.

Also, the average value of n_j is E[n_j] = α = n/m (recall: “equally likely. . . ”).

Another assumption (not necessarily true): the hash function h can be evaluated in O(1) time.

Thus, the time required to search for an element with key k depends linearly on the length n_{h(k)} of the list T[h(k)].
We consider unsuccessful searches (no element in the table has key k) and successful searches.

Theorem. Under simple uniform hashing, with collision resolution by chaining, an unsuccessful search takes expected time Θ(1 + α), where α = n/m.

Proof.
• any key k not already in the table (recall: unsuccessful) is equally likely to hash to any of the m slots (read: they all look the same to us)
• the expected time to search unsuccessfully for k is the expected time to search to the end of T[h(k)]
• T[h(k)] has expected length E[n_{h(k)}] = α
• thus the expected number of examined elements is α
• add 1 for the evaluation of h

Recall: α could be very small, so Θ(1 + α) does make sense!
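A small simulation (my own sketch, not part of the slides) of the unsuccessful case: distribute n keys over m slots uniformly at random, then scan one random chain end to end; the average number of examined elements approaches α:

    import random

    random.seed(1)
    m, n, trials = 101, 500, 2_000
    examined = 0
    for _ in range(trials):
        lengths = [0] * m
        for _ in range(n):
            lengths[random.randrange(m)] += 1     # simple uniform hashing
        examined += lengths[random.randrange(m)]  # scan one whole chain
    print(examined / trials, n / m)               # observed average vs. alpha ~ 4.95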
For successful searches, not all lists are equally likely to be searched: the probability that a list is searched is proportional to the number of elements it contains (under certain assumptions).

We assume that the element being searched for is equally likely to be any of the n elements in the table. Then we get:

Theorem. Under simple uniform hashing, with collision resolution by chaining, a successful search takes expected time Θ(1 + α), where α = n/m.

Proof.
• the number of elements examined is 1 more than the number of elements before x in x's list
• the elements before x were inserted after x itself (new elements are placed at the front)
Let x_i be the i-th element inserted into the table, 1 ≤ i ≤ n, and let k_i = key(x_i).

For keys k_i, k_j, define the Bernoulli r.v. X_ij = 1 iff h(k_i) = h(k_j).

Under simple uniform hashing,

    P(X_ij = 1) = Σ_{z=0}^{m−1} P(h(k_i) = z) · P(h(k_j) = z) = Σ_{z=0}^{m−1} (1/m)² = 1/m

Thus E[X_ij] = P(X_ij = 1) = 1/m, and the expected number of elements examined in a successful search is

    E[ (1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} X_ij) ]
      = (1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} E[X_ij])
      = (1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} 1/m)
      = 1 + (1/(nm)) Σ_{i=1}^{n} (n − i)
      = 1 + (1/(nm)) (n² − n(n+1)/2)
      = 1 + (n − 1)/(2m)
      = 1 + α/2 − α/(2n)
      = Θ(1 + α)
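The same setup allows an empirical check of this bound (again my own sketch): in a chain of length L, its elements cost 1, 2, . . . , L comparisons to find, so the average successful-search cost in one table is the sum of L(L + 1)/2 over all chains, divided by n:

    import random

    random.seed(0)
    m, n, trials = 101, 500, 2_000
    total = 0.0
    for _ in range(trials):
        lengths = [0] * m
        for _ in range(n):
            lengths[random.randrange(m)] += 1     # simple uniform hashing
        # elements in a chain of length L cost 1, 2, ..., L to find
        total += sum(L * (L + 1) / 2 for L in lengths) / n
    alpha = n / m
    print(total / trials)                         # observed average cost
    print(1 + alpha / 2 - alpha / (2 * n))        # theoretical value, ~ 3.47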
Consequence: if m (# slots) is at least proportional to n (# elements), then n = O(m) and α = n/m = O(1), so searching takes constant time on average!

Insertion and deletion also take constant time (even worst-case, if doubly-linked lists are used), thus all operations take constant time on average!

(However: we need the assumption of simple uniform hashing.)
So far, we haven't seen a single hash function. What makes a good hash function?

It satisfies (more or less) the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of where other keys hash to.

However, this is typically impossible to guarantee, certainly depending on how the keys are chosen (think of an evil adversary).

Sometimes we know the key distribution. Ex: if the keys are random real numbers k ∈ [0, 1), independently and uniformly chosen, then h(k) = ⌊k · m⌋ satisfies the condition.
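A quick illustrative check (my addition) that h(k) = ⌊k · m⌋ spreads uniform keys evenly:

    import random

    random.seed(2)
    m = 10
    counts = [0] * m
    for _ in range(100_000):
        k = random.random()              # uniform key in [0, 1)
        counts[int(k * m)] += 1          # h(k) = floor(k * m)
    print(counts)                        # each entry close to 10_000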
Usual assumption: the universe of keys is {0, 1, 2, . . .}, i.e., we somehow interpret real keys as natural numbers (“usually” easy enough. . . ).

Two very simple hash functions:

1. Division method: h(k) = k mod m

Ex: the hash table has size 25 and the key is k = 234; then h(k) = 234 mod 25 = 9.

Quite fast, but with drawbacks: we want to avoid certain values of m, e.g. powers of 2. Why? If m = 2^p, then h(k) = k mod 2^p is just the p lowest-order bits of k.

Ex: m = 2^5 = 32, k = 168 = (10101000)_2, h(k) = 168 mod 32 = 8 = (1000)_2

Better to make the hash depend on all bits of the key. A good choice (usually) for m: a prime not too close to a power of two.
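A short Python sketch (the sample keys are made up for illustration) of why m = 2^p is bad: keys that agree in their p low-order bits always collide, while a prime m lets the higher-order bits influence the slot:

    keys = [168, 40, 8, 1192]            # all end in ...01000 in binary
    for m in (32, 31):                   # 32 = 2^5 vs. the prime 31
        print(m, [k % m for k in keys])
    # m = 32: [8, 8, 8, 8]; only the 5 low-order bits matter
    # m = 31: [13, 9, 8, 14]; all bits of the key now influence h(k)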
2. Multiplication method: h(k) = ⌊m(kA mod 1)⌋

Uh, what's that?
• A is a constant with 0 < A < 1
• thus kA is a real number with 0 ≤ kA < k
• kA mod 1 is the fractional part of kA, i.e., kA − ⌊kA⌋; in other words, kA mod 1 ∈ [0, 1)

Ex: A = 0.23, k = 234, then kA = 53.82 and kA mod 1 = 0.82

• therefore m(kA mod 1) ∈ [0, m), and ⌊m(kA mod 1)⌋ ∈ {0, 1, . . . , m − 1}

Voilà!

Advantage: the value of m is not critical. Typically a power of two (no good with the division method!), since then the implementation is easy (some comments in the textbook).
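A direct Python transcription (a sketch; the default A = (√5 − 1)/2 ≈ 0.618 is Knuth's often-suggested constant, not something fixed by these slides):

    import math

    def mult_hash(k, m, A=(math.sqrt(5) - 1) / 2):
        # h(k) = floor(m * (k*A mod 1)); m may safely be a power of two
        frac = (k * A) % 1.0             # fractional part kA mod 1, in [0, 1)
        return int(m * frac)             # scaled into {0, 1, ..., m - 1}

    print(mult_hash(234, 25, A=0.23))    # the slide's example: floor(25 * 0.82) = 20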