CSCI 104 Hash Tables & Functions, Mark Redekopp, David Kempe

1 CSCI 104
Hash Tables & Functions
Mark Redekopp, David Kempe, Sandra Batista

2 Dictionaries/Maps
• An array maps integers to values
  – Given i, array[i] returns the value in O(1)
  – Arrays associate an integer with some arbitrary type as the value (i.e. the key is always an integer)
• Dictionaries map keys to values
  – Given key, k, map[k] returns the associated value
  – Key can be anything provided…
    • It has a '<' operator defined for it (C++ map) or some other comparator functor
    • Most languages' dictionary implementations require something similar to operator< for key types
  – C++ maps allow any type to be the key
[Figure: an array with indices 0 to 5 holding 3.2, 2.7, 3.45, 2.91, 3.8, 4.0, where index 2 returns 3.45; a map<string, double> holding Pair<string,double> entries "Tommy" 2.5 and "Jill" 3.45, where key "Jill" returns 3.45]
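The map<string, double> from the slide can be sketched in a few lines of C++. This is only an illustration: the helper names makeExampleGpas and gpaOrDefault are hypothetical and do not come from the slides.

```cpp
#include <map>
#include <string>

// Build the two-entry map shown in the slide's figure.
std::map<std::string, double> makeExampleGpas() {
    std::map<std::string, double> gpas;
    gpas["Tommy"] = 2.5;   // operator[] inserts the key if it is absent
    gpas["Jill"] = 3.45;
    return gpas;
}

// Hypothetical helper: return the value for name, or fallback if absent.
// std::map keeps keys ordered with operator<, so find() takes
// logarithmic time in the number of entries.
double gpaOrDefault(const std::map<std::string, double>& gpas,
                    const std::string& name, double fallback) {
    auto it = gpas.find(name);
    return it == gpas.end() ? fallback : it->second;
}
```

Calling gpaOrDefault(makeExampleGpas(), "Jill", 0.0) yields the stored 3.45, while a missing key such as "Mark" yields the fallback. Iterating the map visits "Jill" before "Tommy" because the keys are kept in sorted order.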

3 Dictionary Implementation
• A dictionary/map can be implemented with a balanced BST
  – Insert, Find, Remove = O(______________)
[Figure: a BST of key, value pairs whose keys are "Jordan", "Frank", "Percy", "Anne", "Greg", "Tommy", each with a Student object as its value; example calls Map::find("Mark") and Map::find("Greg")]

4 Dictionary Implementation
• A dictionary/map can be implemented with a balanced BST
  – Insert, Find, Remove = O(log_2 n)
• Can we do better?
  – Hash tables (unordered maps) offer the promise of O(1) access time
[Figure: a BST of key, value pairs whose keys are "Jordan", "Frank", "Percy", "Anne", "Greg", "Tommy", each with a Student object as its value; example calls Map::find("Mark") and Map::find("Greg")]

5 Unordered_Maps / Hash Tables
• Can we use non-integer keys but still use an array?
• What if we just convert the non-integer key to an integer?
  – For now, make the unrealistic assumption that each unique key converts to a unique integer
• This is the idea behind a hash table
• The conversion function is known as a hash function, h(k)
  – It should be fast/easy to compute (i.e. O(1))
[Figure: the key "Jill" passes through the conversion function to index 2 of an array with indices 0 to 5 holding Bo 3.2, Tom 2.7, Jill 3.45, Joe 2.91, Tim 3.8, Lee 4.0; the lookup returns 3.45]

6 Unordered_Maps / Hash Tables
• A hash table implements a map ADT
  – Add(key,value)
  – Remove(key)
  – Lookup/Find(key) : returns value
• In a BST the keys are kept in order
  – A Binary Search Tree implements an ORDERED MAP
• In a hash table keys are evenly distributed throughout the table (unordered)
  – A hash table implements an UNORDERED MAP
[Figure: the key "Jill" passes through the conversion function to index 2 of an array with indices 0 to 5 holding Bo 3.2, Tom 2.7, Jill 3.45, Joe 2.91, Tim 3.8, Lee 4.0; the lookup returns 3.45]

7 C++11 Implementation
• C++11 added new container classes:
  – unordered_map
  – unordered_set
• Each uses a hash table to provide average-case O(1) insert, erase, and find
• Must compile with the -std=c++11 option (or a later standard) in g++
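A minimal sketch of the two C++11 containers in use follows. The function names makeGpaTable and containsId are hypothetical and chosen only for this illustration; the container calls themselves (insert, operator[], erase, find) are standard.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Insert two entries, then erase one; each operation is O(1) on average.
std::unordered_map<std::string, double> makeGpaTable() {
    std::unordered_map<std::string, double> gpas;
    gpas.insert({"Jill", 3.45});  // insert a key, value pair
    gpas["Tommy"] = 2.5;          // operator[] also inserts
    gpas.erase("Tommy");          // remove by key
    return gpas;
}

// unordered_set stores keys only; find() returns end() when absent.
bool containsId(const std::unordered_set<int>& ids, int id) {
    return ids.find(id) != ids.end();
}
```

Unlike std::map, iteration order over these containers is unspecified, which matches the "unordered" in their names.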

8 Hash Tables
• A hash table is an array that stores key,value pairs
  – Usually smaller than the size of the possible set of keys, |S|
    • USC ID's = 10^10 options
  – But larger than the expected number of keys to be entered (defined as n)
• The table is coupled with a function, h(k), that maps keys to an integer in the range [0..tableSize-1] (i.e. [0 to m-1])
• What are the considerations…
  – How big should the table be?
  – How to select a hash function?
  – What if two keys map to the same array location? (i.e. h(k1) == h(k2))
    • Known as a collision
[Figure: a key passes through h(k) to one slot of an array of key, value entries indexed 0 to tableSize-1; m = tableSize, n = # of keys entered]

9 Table Size
• How big should our table be?
• Example 1: We have 1000 employees with 3 digit IDs and want to store a record for each
• Solution 1: Keep array a[1000]. Let the key be both the ID and the location, so a[ID] holds the employee record.
• Example 2: Using 10 digit USC ID, store student records
  – USC ID's = 10^10 options
• Pick a hash table of some much smaller size (how many students do we have at any particular time?)
[Figure: a key passes through h(k) to one slot of an array of key, value entries indexed 0 to tableSize-1; m = tableSize, n = # of keys entered]

10 General Table Size Guidelines
• The table size should be bigger than the amount of expected entries (m > n)
  – Don't pick a table size that is smaller than your expected number of entries
• But anything smaller than the size of all possible keys admits the chance that two keys map to the same location in the table (a.k.a. COLLISION)
• You will see that tableSize should usually be a prime number
[Figure: a key passes through h(k) to one slot of an array of key, value entries indexed 0 to tableSize-1; m = tableSize, n = # of keys entered]

11 Hash Functions First Look
• Challenge: Distribute keys to locations in the hash table such that
  – Values are easy to compute and retrieve given the key
  – Keys are evenly spread throughout the table
  – Distribution is consistent for retrieval
• If necessary, the key data type is converted to an integer before the hash is applied
  – Akin to the operator<() needed to use a data type as a key for the C++ map
• Example: Strings
  – Use ASCII codes for each character and add them or group them
  – "hello" => 'h' = 104, 'e' = 101, 'l' = 108, 'l' = 108, 'o' = 111, sum = 532
  – Hash function is then applied to the integer value 532 such that it maps to a value between 0 and M-1, where M is the table size
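The "hello" example can be sketched as a two-step conversion: sum the character codes, then reduce modulo the table size. The function names sumOfCodes and hashString are hypothetical, and taking the remainder is just one possible choice for the final step.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Step 1: convert the string to an integer by adding its character codes.
unsigned long sumOfCodes(const std::string& s) {
    unsigned long total = 0;
    for (unsigned char c : s) total += c;
    return total;
}

// Step 2: map the integer into the range 0 to m-1 (m is the table size).
std::size_t hashString(const std::string& s, std::size_t m) {
    assert(m > 0);  // a table needs at least one slot
    return sumOfCodes(s) % m;
}
```

With this sketch sumOfCodes("hello") is 532, matching the slide, and hashString("hello", 11) is 532 mod 11 = 4. Note that adding codes ignores character order, so anagrams such as "olleh" convert to the same integer and therefore collide.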

12 Possible Hash Functions
• Define n = # of entries stored, m = Table Size, k is a non-negative integer key
• h(k) = 0 ?
• h(k) = k mod m ?
• h(k) = rand() mod m ?
• Rules of thumb
  – The hash function should examine the entire search key, not just a few digits or a portion of the key
  – When modulo hashing is used, the base should be prime
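One way to see why a prime base helps with h(k) = k mod m is to count how many distinct slots a patterned set of keys reaches. The sketch below is illustrative only; modHash and distinctSlots are hypothetical names, and the key set is an assumed example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// h(k) = k mod m
std::size_t modHash(unsigned long k, std::size_t m) {
    assert(m > 0);
    return k % m;
}

// Count how many different table slots the given keys hash to.
std::size_t distinctSlots(const std::vector<unsigned long>& keys,
                          std::size_t m) {
    std::set<std::size_t> used;
    for (unsigned long k : keys) used.insert(modHash(k, m));
    return used.size();
}
```

For the keys 10, 20, 30, 40, 50, a table of size m = 10 sends every key to slot 0, while the prime size m = 11 sends them to slots 10, 9, 8, 7, and 6, so no two keys collide.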

13 Hash Function Goals
• A "perfect hash function" should map each of the n keys to a unique location in the table
  – Recall that we will size our table to be larger than the expected number of keys…i.e. n < m
  – Perfect hash functions are not practically attainable
• A "good" hash function or Universal Hash Function
  – Is easy and fast to compute
  – Scatters data uniformly throughout the hash table
    • P( h(k) = x ) = 1/m (i.e. pseudorandom)

14 Universal Hash Example
• Suppose we want a universal hash for words in the English language
• First, we select a prime table size, m
• For any word, w, made of the sequence of letters w_1 w_2 … w_n, we translate each letter into its position in the alphabet (0-25)
• Let z be the length of the longest word in the English language
• Choose a random key word, K, of length z, K = k_1 k_2 … k_z
• The random key K is created once when the hash table is created and kept
• Hash function: $h(w) = \left( \sum_{i=1}^{\mathrm{len}(w)} k_i \cdot w_i \right) \bmod m$
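The construction on this slide can be sketched as a small class. Several details are assumptions made for the illustration and are not stated on the slide: the class name WordHasher, drawing each k_i uniformly from 0 to m-1, seeding the generator explicitly, and accepting only lowercase letters 'a' to 'z'.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

class WordHasher {
public:
    // m: (prime) table size, z: longest word length, seed: RNG seed.
    // The random key K = k_1 ... k_z is drawn once, here, and then kept.
    WordHasher(unsigned long m, std::size_t z, unsigned seed)
        : m_(m), key_(z) {
        assert(m > 0);
        std::mt19937 gen(seed);
        std::uniform_int_distribution<unsigned long> dist(0, m - 1);
        for (unsigned long& k : key_) k = dist(gen);
    }

    // h(w) = (sum over i of k_i * w_i) mod m, with w_i in 0..25.
    unsigned long operator()(const std::string& w) const {
        assert(w.size() <= key_.size());  // word no longer than z
        unsigned long sum = 0;
        for (std::size_t i = 0; i < w.size(); ++i) {
            assert(w[i] >= 'a' && w[i] <= 'z');
            unsigned long letter = static_cast<unsigned long>(w[i] - 'a');
            sum = (sum + key_[i] * letter) % m_;  // reduce as we go
        }
        return sum;
    }

private:
    unsigned long m_;
    std::vector<unsigned long> key_;
};
```

Because the key is fixed at construction, the same word always hashes to the same slot for a given WordHasher, which is what retrieval requires. A different seed gives a different key and therefore a different function from the same family.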

15 Pigeon Hole Principle
• Recall for hash tables we let…
  – n = # of entries (i.e. keys)
  – m = size of the hash table
• If n > m, is every entry in the table used?
  – No. Some may be blank.
• Is it possible we haven't had a collision?
  – No. Some entries have hashed to the same location
  – Pigeon Hole Principle says given n items to be slotted into m holes and n > m, there is at least one hole with more than 1 item
  – So if n > m, we know we've had a collision
• We can only avoid a collision when n ≤ m

16 Resolving Collisions
• Collisions occur when two keys, k1 and k2, are not equal, but h(k1) = h(k2).
• Collisions are inevitable if the number of entries, n, is greater than table size, m (by pigeonhole principle)
• Methods
  – Closed Addressing (e.g. buckets or chaining)
  – Open addressing (aka probing)
    • Linear Probing
    • Quadratic Probing
    • Double-hashing

17 Buckets/Chaining
• Simply allow collisions to all occupy the location they hash to by making each entry in the table an ARRAY (bucket) or LINKED LIST (chain) of items/entries
  – Closed Addressing => You will live in the location you hash to (it's just that there may be many places at that location)
• Buckets
  – How big should you make each array?
  – Too much wasted space
• Chaining
  – Each entry is a linked list
[Figure: a table indexed 0 to tableSize-1 in which each slot is a bucket holding several k,v entries; a second table, labeled Array of Linked Lists, in which each slot heads a chain of key, value nodes]
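A chained table can be sketched as an array of linked lists, as the figure label suggests. The class name ChainedTable, the string-to-double entry type, and the use of std::hash as the hash function are all assumptions for this illustration, not details from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

class ChainedTable {
public:
    explicit ChainedTable(std::size_t m) : chains_(m) { assert(m > 0); }

    // Add a pair, or overwrite the value if the key is already present.
    void add(const std::string& key, double value) {
        auto& chain = chains_[slot(key)];
        for (auto& entry : chain) {
            if (entry.first == key) { entry.second = value; return; }
        }
        chain.emplace_back(key, value);  // colliding keys share the chain
    }

    // Return a pointer to the value, or nullptr if the key is absent.
    const double* find(const std::string& key) const {
        for (const auto& entry : chains_[slot(key)]) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }

    // Remove the key; returns true if something was removed.
    bool remove(const std::string& key) {
        auto& chain = chains_[slot(key)];
        for (auto it = chain.begin(); it != chain.end(); ++it) {
            if (it->first == key) { chain.erase(it); return true; }
        }
        return false;
    }

private:
    // Every key lives in the chain at the slot it hashes to.
    std::size_t slot(const std::string& key) const {
        return std::hash<std::string>{}(key) % chains_.size();
    }
    std::vector<std::list<std::pair<std::string, double>>> chains_;
};
```

Constructing the table with m = 1 forces every key into the same chain, which is a convenient way to check that colliding keys remain individually retrievable. The sketch never resizes, so chains grow without bound as n increases.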

18 Open Addressing
• Open addressing means an item with key, k, may not be located at h(k)
• If location 2 is occupied and a new item hashes to location 2, we need to find another location to store it.
• Let i be number of failed inserts
• Linear Probing
  – h(k,i) = (h(k)+i) mod m
  – Example: Check h(k)+1, h(k)+2, h(k)+3, …
• Quadratic Probing
  – h(k,i) = (h(k)+i^2) mod m
  – Check location h(k)+1^2, h(k)+2^2, h(k)+3^2, …
[Figure: a key passes through h(k) to slot 2 of an array indexed 0 to tableSize-1; several slots, including slot 2, already hold k, v entries]
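The two probe formulas translate directly into code. The sketch below also shows a linear-probing insert under simplifying assumptions that are not from the slides: keys are non-negative integers, h(k) = k mod m, the value -1 (kEmpty) marks an empty slot, and duplicate keys are not detected. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// h(k,i) = (h(k) + i) mod m
std::size_t linearProbe(std::size_t home, std::size_t i, std::size_t m) {
    return (home + i) % m;
}

// h(k,i) = (h(k) + i^2) mod m
std::size_t quadraticProbe(std::size_t home, std::size_t i, std::size_t m) {
    return (home + i * i) % m;
}

const long kEmpty = -1;  // sentinel for an unused slot (assumption)

// Insert key with linear probing; i counts the failed attempts so far.
// Returns the slot used, or -1 if every slot was occupied.
long insertLinear(std::vector<long>& table, long key) {
    assert(key >= 0 && !table.empty());
    std::size_t m = table.size();
    std::size_t home = static_cast<std::size_t>(key) % m;
    for (std::size_t i = 0; i < m; ++i) {
        std::size_t slot = linearProbe(home, i, m);
        if (table[slot] == kEmpty) {
            table[slot] = key;
            return static_cast<long>(slot);
        }
    }
    return -1;
}
```

In a table of size 7, the keys 2, 9, and 16 all have home slot 2; linear probing places them in slots 2, 3, and 4. For the same home slot, quadratic probing would visit slots 2, 3, 6, and 4 for i = 0, 1, 2, 3.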

19 Linear Probing Issues
• If certain data patterns lead to many collisions, linear probing leads to clusters of occupied areas in the table called primary clustering
• How would quadratic probing help fight primary clustering?
  – Quadratic probing tends to spread out data across the table by taking larger and larger steps until it finds an empty location
[Figure: two example tables of key, value slots with occupied slots marked, one labeled Linear Probing and one labeled Quadratic Probing]
