Hashing

Searching
• Consider the problem of searching an array for a given value
  – If the array is not sorted, the search requires O(n) time
    • If the value isn’t there, we need to search all n elements
    • If the value is there, we search n/2 elements on average
  – If the array is sorted, we can do a binary search
    • A binary search requires O(log n) time
    • About equally fast whether the element is found or not
  – It doesn’t seem like we could do much better
    • How about an O(1), that is, constant time search?
    • We can do it if the array is organized in a particular way

Hashing
• Suppose we were to come up with a “magic function” that, given a value to search for, would tell us exactly where in the array to look
  – If it’s in that location, it’s in the array
  – If it’s not in that location, it’s not in the array
• This function would have no other purpose
• If we look at the function’s inputs and outputs, they probably won’t “make sense”
• This function is called a hash function because it “makes hash” of its inputs

Example (ideal) hash function
• Suppose our hash function gave us the following values:
    hashCode("apple") = 5
    hashCode("watermelon") = 3
    hashCode("grapes") = 8
    hashCode("cantaloupe") = 7
    hashCode("kiwi") = 0
    hashCode("strawberry") = 9
    hashCode("mango") = 6
    hashCode("banana") = 2
• Array contents: 0 kiwi, 1 (empty), 2 banana, 3 watermelon, 4 (empty), 5 apple, 6 mango, 7 cantaloupe, 8 grapes, 9 strawberry
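
The ideal case can be sketched in Java. This is only an illustration under stated assumptions: magicHash is a hypothetical stand-in that hard-codes the values from the example (no general function of this kind exists), and the class and method names are invented for the sketch.

```java
/** Sketch of lookup with an "ideal" hash function: one probe, no searching. */
final class IdealHashDemo {
    // Array laid out as in the example; null marks an unused location.
    static final String[] TABLE = {
        "kiwi", null, "banana", "watermelon", null,
        "apple", "mango", "cantaloupe", "grapes", "strawberry"
    };

    // Hypothetical "magic function": hard-codes the values from the example.
    // Any other input is sent to location 1, which is unused in this table.
    static int magicHash(String value) {
        switch (value) {
            case "kiwi":       return 0;
            case "banana":     return 2;
            case "watermelon": return 3;
            case "apple":      return 5;
            case "mango":      return 6;
            case "cantaloupe": return 7;
            case "grapes":     return 8;
            case "strawberry": return 9;
            default:           return 1;
        }
    }

    // Constant time: look in exactly one location, whatever the array size.
    static boolean contains(String value) {
        return value.equals(TABLE[magicHash(value)]);
    }
}
```

IdealHashDemo.contains("apple") inspects only location 5, and IdealHashDemo.contains("pear") inspects only location 1. Neither call scans the array.
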
Why hash tables?
• We don’t (usually) use hash tables just to see if something is there or not; instead, we put key/value pairs into the table
  – We use a key to find a place in the table
  – The value holds the information we are actually interested in
• Example table:
          key       value
    . . .
    141
    142   robin     robin info
    143   sparrow   sparrow info
    144   hawk      hawk info
    145   seagull   seagull info
    146
    147   bluejay   bluejay info
    148   owl       owl info
    . . .

Finding the hash function
• How can we come up with this magic function?
• In general, we cannot; there is no such magic function
  – In a few specific cases, where all the possible values are known in advance, it has been possible to compute a perfect hash function
• What is the next best thing?
  – A perfect hash function would tell us exactly where to look
  – In general, the best we can do is a function that tells us where to start looking!

Example imperfect hash function
• Suppose our hash function gave us the following values:
    hash("apple") = 5
    hash("watermelon") = 3
    hash("grapes") = 8
    hash("cantaloupe") = 7
    hash("kiwi") = 0
    hash("strawberry") = 9
    hash("mango") = 6
    hash("banana") = 2
    hash("honeydew") = 6
• Array contents: 0 kiwi, 1 (empty), 2 banana, 3 watermelon, 4 (empty), 5 apple, 6 mango, 7 cantaloupe, 8 grapes, 9 strawberry
• Now what?

Collisions
• When two values hash to the same array location, this is called a collision
• Collisions are normally treated as “first come, first served”: the first value that hashes to the location gets it
• We have to find something to do with the second and subsequent values that hash to this same location
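
A collision is easy to provoke with a simple hash function. The sketch below is an assumption-laden illustration, not the function used in the example above: toyHash (a hypothetical name) just takes the string length modulo the table size, which is deliberately poor so that collisions show up quickly.

```java
/** Sketch: an imperfect hash function with first-come, first-served placement. */
final class CollisionDemo {
    private final String[] table;

    CollisionDemo(int size) {
        table = new String[size];
    }

    // Toy hash function (an assumption for this sketch): the string's length,
    // reduced to a legal index. Many different strings share a length.
    int toyHash(String value) {
        return value.length() % table.length;
    }

    // Returns true if the value occupies its home location after the call,
    // false if a different value got there first (a collision).
    boolean place(String value) {
        int i = toyHash(value);
        if (table[i] == null) {
            table[i] = value;   // first come, first served
            return true;
        }
        return table[i].equals(value);
    }
}
```

With a table of size 10, "apple" and "mango" both map to location 5. Whichever is placed first keeps the location, and place returns false for the other.
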
Handling collisions
• What can we do when two different values attempt to occupy the same place in an array?
  – Solution #1: Search from there for an empty location
    • Can stop searching when we find the value or an empty location
    • Search must be end-around
  – Solution #2: Use a second hash function
    • ...and a third, and a fourth, and a fifth, ...
  – Solution #3: Use the array location as the header of a linked list of values that hash to this location
• All these solutions work, provided:
  – We use the same technique to add things to the array as we use to search for things in the array

Insertion, I
• Suppose you want to add seagull to this hash table
• Also suppose:
  – hashCode(seagull) = 143
  – table[143] is not empty
  – table[143] != seagull
  – table[144] is not empty
  – table[144] != seagull
  – table[145] is empty
• Therefore, put seagull at location 145
• Table contents (after insertion): 141 (empty), 142 robin, 143 sparrow, 144 hawk, 145 seagull, 146 (empty), 147 bluejay, 148 owl

Searching, I
• Suppose you want to look up seagull in this hash table
• Also suppose:
  – hashCode(seagull) = 143
  – table[143] is not empty
  – table[143] != seagull
  – table[144] is not empty
  – table[144] != seagull
  – table[145] is not empty
  – table[145] == seagull !
• We found seagull at location 145
• Table contents: 141 (empty), 142 robin, 143 sparrow, 144 hawk, 145 seagull, 146 (empty), 147 bluejay, 148 owl

Searching, II
• Suppose you want to look up cow in this hash table
• Also suppose:
  – hashCode(cow) = 144
  – table[144] is not empty
  – table[144] != cow
  – table[145] is not empty
  – table[145] != cow
  – table[146] is empty
• If cow were in the table, we should have found it by now
• Therefore, it isn’t here
• Table contents: 141 (empty), 142 robin, 143 sparrow, 144 hawk, 145 seagull, 146 (empty), 147 bluejay, 148 owl
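
Solution #1, as walked through in the insertion and search examples, might look like the following in Java. This is a minimal sketch under assumptions: the class name, the method names, and the idea of passing the hash function in as a parameter are all invented here, and removal is not handled.

```java
import java.util.function.ToIntFunction;

/** Sketch of Solution #1: search forward (end-around) from the hashed location. */
final class LinearProbeTable {
    private final String[] slots;                 // null marks an empty location
    private final ToIntFunction<String> hash;     // caller-supplied hash function

    LinearProbeTable(int capacity, ToIntFunction<String> hash) {
        this.slots = new String[capacity];
        this.hash = hash;
    }

    // Location where the search for this value starts.
    private int home(String value) {
        return Math.floorMod(hash.applyAsInt(value), slots.length);
    }

    // Returns the location holding the value after the call, or -1 if the table is full.
    int insert(String value) {
        int start = home(value);
        for (int step = 0; step < slots.length; step++) {
            int i = (start + step) % slots.length;   // end-around
            if (slots[i] == null) {
                slots[i] = value;
                return i;
            }
            if (slots[i].equals(value)) {
                return i;                            // already present: do nothing
            }
        }
        return -1;
    }

    // Same probe sequence as insert; stops at the value or at an empty location.
    int find(String value) {
        int start = home(value);
        for (int step = 0; step < slots.length; step++) {
            int i = (start + step) % slots.length;
            if (slots[i] == null) {
                return -1;                           // would have been found by now
            }
            if (slots[i].equals(value)) {
                return i;
            }
        }
        return -1;
    }
}
```

For instance, with new LinearProbeTable(10, String::length), inserting sparrow and then seagull (both of length 7) puts sparrow at location 7 and seagull at location 8. Using the string length as the hash function is purely for demonstration. Note that insert and find follow the same probe sequence, which is the condition under which the technique works.
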
Insertion, II
• Suppose you want to add hawk to this hash table
• Also suppose:
  – hashCode(hawk) = 143
  – table[143] is not empty
  – table[143] != hawk
  – table[144] is not empty
  – table[144] == hawk
• hawk is already in the table, so do nothing
• Table contents: 141 (empty), 142 robin, 143 sparrow, 144 hawk, 145 seagull, 146 (empty), 147 bluejay, 148 owl

Insertion, III
• Suppose:
  – You want to add cardinal to this hash table
  – hashCode(cardinal) = 147
  – The last location is 148
  – 147 and 148 are occupied
• Solution:
  – Treat the table as circular; after 148 comes 0
  – Hence, cardinal goes in location 0 (or 1, or 2, or ...)
• Table contents: 141 (empty), 142 robin, 143 sparrow, 144 hawk, 145 seagull, 146 (empty), 147 bluejay, 148 owl

Clustering
• One problem with the above technique is the tendency to form “clusters”
• A cluster is a group of items not containing any open slots
• The bigger a cluster gets, the more likely it is that new values will hash into the cluster, and make it ever bigger
• Clusters cause efficiency to degrade
• Here is a non-solution: instead of stepping one ahead, step n locations ahead
  – The clusters are still there, they’re just harder to see
  – Unless n and the table size are mutually prime, some table locations are never checked

Efficiency
• Hash tables are actually surprisingly efficient
• Until the table is about 70% full, the number of probes (places looked at in the table) is typically only 2 or 3
• Sophisticated mathematical analysis is required to prove that the expected cost of inserting into a hash table, or looking something up in the hash table, is O(1)
• Even if the table is nearly full (leading to long searches), efficiency is usually still quite high
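
The remark under Clustering that a step of n can miss locations is easy to check by counting. The helper below is a small sketch; the class and method names are hypothetical, and it assumes a non-negative step small enough that start + step does not overflow an int.

```java
/** Sketch: how many distinct locations a fixed-step probe sequence can reach. */
final class ProbeCoverage {
    // Counts the distinct locations visited when probing start, start + step,
    // start + 2*step, ... end-around in a table of the given size.
    // Assumes tableSize > 0 and step >= 0.
    static int reachable(int tableSize, int step, int start) {
        boolean[] seen = new boolean[tableSize];
        int count = 0;
        int i = Math.floorMod(start, tableSize);
        while (!seen[i]) {            // stop when the sequence starts repeating
            seen[i] = true;
            count++;
            i = (i + step) % tableSize;
        }
        return count;
    }
}
```

With a table of 10 locations, a step of 4 reaches only 5 of them (10 and 4 share the factor 2), while a step of 3 reaches all 10.
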
Solution #2: Rehashing
• In the event of a collision, another approach is to rehash: compute another hash function
  – Since we may need to rehash many times, we need an easily computable sequence of functions
• Simple example: in the case of hashing Strings, we might take the previous hash code and add the length of the String to it
  – Probably better if the length of the String was not a component in computing the original hash function
• Possibly better yet: add the length of the String plus the number of probes made so far
  – Problem: are we sure we will look at every location in the array?
• Rehashing is a fairly uncommon approach, and we won’t pursue it any further here

Solution #3: Bucket hashing
• The previous solutions used open addressing (sometimes called closed hashing): all entries went into a “flat” (unstructured) array
• Another solution is to make each array location the header of a linked list of values that hash to that location
• Table contents: 141 (empty), 142 robin, 143 sparrow -> seagull, 144 hawk, 145 (empty), 146 (empty), 147 bluejay, 148 owl

The hashCode function
• public int hashCode() is defined in Object
• Like equals, the default implementation of hashCode is based on the identity of the object (typically its address), which is probably not what you want for your own objects
• You can override hashCode for your own objects
• As you might expect, String overrides hashCode with a version appropriate for strings
• Note that the supplied hashCode method does not know the size of your array; you have to adjust the returned int value yourself

Writing your own hashCode method
• A hashCode method must:
  – Return a value that is (or can be converted to) a legal array index
  – Always return the same value for the same input
    • It can’t use random numbers, or the time of day
  – Return the same value for equal inputs
    • Must be consistent with your equals method
• It does not need to return different values for different inputs
• A good hashCode method should:
  – Be efficient to compute
  – Give a uniform distribution of array indices
  – Not assign similar numbers to similar input values
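
The requirements above can be illustrated with a small class. This is a sketch under assumptions: Bird, its fields, and the HashIndex helper are hypothetical names invented for the example. It uses java.util.Objects.hash to combine fields and Math.floorMod to turn the returned int into a legal array index.

```java
import java.util.Objects;

/** Sketch: a class whose hashCode is consistent with its equals. */
final class Bird {
    private final String name;
    private final int wingspanCm;

    Bird(String name, int wingspanCm) {
        this.name = name;
        this.wingspanCm = wingspanCm;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Bird)) return false;
        Bird b = (Bird) other;
        return wingspanCm == b.wingspanCm && Objects.equals(name, b.name);
    }

    // Uses exactly the fields that equals compares, so equal birds get
    // equal hash codes. No random numbers, no time of day.
    @Override
    public int hashCode() {
        return Objects.hash(name, wingspanCm);
    }
}

final class HashIndex {
    // hashCode() may return any int, including a negative one;
    // floorMod maps it into the range 0 .. tableSize - 1.
    static int indexFor(Object key, int tableSize) {
        return Math.floorMod(key.hashCode(), tableSize);
    }
}
```

Math.floorMod is used rather than the % operator because % can yield a negative result for a negative hash code. For example, HashIndex.indexFor(Integer.valueOf(-3), 10) is 7, whereas -3 % 10 is -3.
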