C: Hashmaps
SWEN-250 Personal Software Engineering


  1. C: Hashmaps, SWEN-250 Personal Software Engineering. Starting this week the slides will have notes associated with them. Since I'm not giving lectures live, these notes are what I would say to you if we were in class. I'll try not to put anything in here that would go on a test, but you are strongly advised to read them. They will add a lot of context to some slides or give additional examples. Some slides will have only a couple of notes, or even none, and others may have a lot. If, after reading the slides and notes, you have questions, please visit the course Discord server and post them in the lecture-discussions channel.

  2. Data Structures
     • Data structures are ways of organizing information
       – So far, we have looked at:
         • Arrays
         • Linked lists
         • (and structs for groups of data)
     • Arrays and lists are sequential organizations of data
     Notes: What we have done so far.

  3. Recap
     [Figure: a C array with indexes 0, 1, 2, 3, 4, 5, ... holding the values 800, 150, 100, 25, 952, 216, ...; and a linked list where p_head points to 800, whose next points to 150, whose next points to 100, and so on.]
     In each case you organize data, but access it in a linear fashion: by position, or by sequence.
     Notes: C arrays and linked lists.

  4. More flexibility: Hash(…)
     • HashTables or HashMaps
       – Associative organization of data
       – Uses ‘key’ and ‘value’ pairs
       – Fast access using keys (vs. index or sequential search)
       – Dynamic addition of data
     Hopefully everyone remembers hash tables from CS1; we wrote one as part of the last lecture there. We'll be doing the same activity this week in 250, just using C instead of Python. You are welcome, and even encouraged, to look back at your CS notes as an additional set of reference material.

  5. Other considerations
     • Arrays: statically allocate memory to store a sequence of data; you can access elements by index in constant time, but a static array usually can't be extended to hold more elements.
     • Linked lists: you can add elements dynamically, and each element points to the next one; but to access the element at index i, you have to walk linearly through all the elements before it.
     • Hash tables: a data structure that gives fast access to elements (using their hash values) and lets you add more data dynamically; it is also an abstraction on top of one or more other data structures.
     Hash tables are nice because they are fast and they don't require integer keys (unlike arrays). They do have a downside: they cannot be ordered. So they aren't the perfect data structure, and there will often be times when it's easier, or more beneficial, to use a different one. There really isn't one data structure that is best in all situations. That's why we teach you so many, so you can choose the best one on an application-by-application basis.

  6. Hash (Map | Table)
     • Sometimes referred to as a dictionary (though with a slightly different implementation)
     • The general approach is:
       – Instead of links or indexes, we use a key
       – However, we need to convert the key into something usable as an index. That conversion is the ‘hash’
       – New entry: Key -> Hash; table[hash] = value
       – Lookup: Key -> Hash; value = table[hash]
     Technically Python uses a dictionary. Personally I don't feel there is any real difference between the two. However, some languages make a distinction, even if it is arbitrary. You can think of both as implementations of a Map that uses an array for its underlying data structure. In all hash-based systems there is a function that converts a key (normally a string) into an integer, which is then used to index into the underlying array. The colloquial term for this function is hash, though the same word is often used for the result of executing that function. This can get a little confusing when discussing the implementation, so I'll often refer to the function as the 'hash function' and the result as the 'hash value'. The main slides use the more common 'hash' for both.

  7. Visually (pseudo-code)
     Hashmap m;
     m.add(“dog”, 100);
     m.add(“cat”, 50);
     m.add(“lion”, 200);

     hashFunction(“dog”)  => 2398;
     hashFunction(“cat”)  => 1371;
     hashFunction(“lion”) => 199121;

     Key/value array (a very big array!!): index 1371 holds 50, index 2398 holds 100, index 199121 holds 200.

     Let's use a hash function that attempts to uniquely identify each word. This is the same one we used in CS. For a word of n characters c1 c2 … cn:

       hash = ord(c1)*26^(n-1) + ord(c2)*26^(n-2) + … + ord(c(n-1))*26^1 + ord(cn)*26^0

     where ord is the ordinal value of a character (a = 0, b = 1, …, z = 25). Using this system:

       cat  = 2*26^2 + 0*26^1 + 19*26^0                = 1371
       dog  = 3*26^2 + 14*26^1 + 6*26^0                = 2398
       lion = 11*26^3 + 8*26^2 + 14*26^1 + 13*26^0     = 199121

     It's easy to see that even small words require an enormous array to hold all the hashed indexes. This quickly becomes unrealistic.

  8. Visually (pseudo-code)
     We can make this array size more reasonable by using the modulus operator!

     Hashmap m;
     m.add(“dog”, 100);
     m.add(“cat”, 50);
     m.add(“lion”, 200);

     hashFunction(“dog”)  => 2398 % 5 = 3;
     hashFunction(“cat”)  => 1371 % 5 = 1;
     hashFunction(“lion”) => 199121 % 5 = 1;   Collision @ 1

     Table (size 5): index 1 holds 50, index 3 holds 100; indexes 0, 2, and 4 are empty.

     Rather than having a sparsely populated, ginormous array, we can force all of our hashes to fit within an array of any size using modulus. The downside is that our keys no longer have unique indexes: both "cat" and "lion" have an index of 1. When two keys have the same index, this is called a collision. The smaller the internal array is, and the fuller it is, the greater the chance of a collision. At the extreme, an array of size 1, or a completely full array, will always encounter a collision when another element is added.

     There are a couple of common schemes for dealing with collisions. The first is known as open addressing. In this scheme, when a collision occurs the value is placed in the next available location. The downside is that you now have to hunt for the item when retrieving it from the hash table. Clusters, runs of occupied slots that build up around popular indexes, can quickly decrease the efficiency of the hash table. We won't be implementing open addressing in this course.

  9. Visually (pseudo-code)
     Hashmap m;
     m.add(“dog”, 100);
     m.add(“cat”, 50);
     m.add(“lion”, 200);

     hashFunction(“dog”)  => 2398 % 5 = 3;
     hashFunction(“cat”)  => 1371 % 5 = 1;
     hashFunction(“lion”) => 199121 % 5 = 1;

     Table (size 5): index 1 -> cat|50 -> lion|200; index 3 -> dog|100.

     We have the hashes point to a LIST of key/value pairs! The other common collision scheme is called chaining. With chaining, each index in the array is actually a list (or array). If a collision occurs, the new item is added to the list at that index. This eliminates the cascading clustering effect of open addressing.

  10. Visually (pseudo-code)
     (Same picture as the previous slide: index 1 -> cat|50 -> lion|200; index 3 -> dog|100.)

     Each ‘object’ holds the original key, the value, and a reference to the next object. To find/look up a key-value pair:
     - Hash the key
     - Look up the first object using the hash and see if it has the same key
     - If it does: done. If not: keep searching the list

     Note, it is vital that both the key and the value are stored. Since multiple keys can reside at the same index, we have to search the list for the key we are looking for in order to return the correct value. Chaining ensures we never run out of space. However, if the linked lists get long, the efficiency of the hash table will begin to degrade. For this reason it is preferred to never let the number of items in a hash table grow larger than the size of the underlying array.

  11. Dynamic sizing
     • Hash tables gain speed from using arrays for indexing, but need to solve the problem of arrays being fixed size, i.e. they need dynamic arrays
     • Hash tables are created at some size N
     • When you run out of slots, you dynamically create a new array (usually of size 2N) and rehash the old elements into the new array
     • Other topics (beyond the scope of our current work): hash collisions (we'll cover one); hash algorithms; …

     When the number of elements in a hash table nears the size of its array, the table needs to create a new array that is twice as big. Even with chaining, this is important for maintaining near-peak lookup efficiency. It is very important that the items in the original array are not copied over directly. Instead, each key must be hashed again to find its correct location in the new array. This is known as rehashing.

     Example: with array_size = 10, "cat" and "lion" have the same index:
       cat  = 1371 % 10   = 1
       lion = 199121 % 10 = 1
     After resizing to array_size = 20, "cat" and "lion" have different indexes:
       cat  = 1371 % 20   = 11
       lion = 199121 % 20 = 1

  12. ON TO THE ACTIVITY
