CSE 332 Data Abstractions: B Trees and Hash Tables Make a Complete Breakfast
Kate Deibel
Summer 2012
July 9, 2012, CSE 332 Data Abstractions, Summer 2012

HASH TABLES
The national data structure of the Netherlands

Hash Tables
A hash table is an array of some fixed size
Basic idea: a hash function maps each key in the key space (e.g., integers, strings) to an index of the table, from 0 to size - 1:
    index = h(key)
The goal: Aim for constant-time find, insert, and delete "on average" under reasonable assumptions

An Ideal Hash Function
- Is fast to compute
- Rarely hashes two keys to the same index
  - Two keys hashing to the same index is known as a collision
  - Zero collisions often impossible in theory but reasonably achievable in practice

What to Hash?
We will focus on the two most common things to hash: ints and strings
If you have objects with several fields, it is usually best to hash most of the "identifying fields" to avoid collisions:

    class Person {
        String firstName, middleName, lastName;
        Date birthDate;
        …
    }
    (use these four values)

An inherent trade-off: hashing-time vs. collision-avoidance
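One way the "hash the identifying fields" advice might look in Java is sketched below. This is an illustration, not the course's code: the class name `PersonKey`, the constructor, and the use of `java.time.LocalDate` in place of the slide's `Date` are all assumptions made so the example is self-contained. `Objects.hash` is a standard library helper that mixes its arguments into one int.

```java
import java.time.LocalDate;
import java.util.Objects;

// Hypothetical stand-in for the slide's Person class (names are assumptions).
final class PersonKey {
    private final String firstName, middleName, lastName;
    private final LocalDate birthDate; // assumption: LocalDate instead of Date

    PersonKey(String firstName, String middleName, String lastName, LocalDate birthDate) {
        this.firstName = firstName;
        this.middleName = middleName;
        this.lastName = lastName;
        this.birthDate = birthDate;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof PersonKey)) return false;
        PersonKey p = (PersonKey) other;
        return Objects.equals(firstName, p.firstName)
            && Objects.equals(middleName, p.middleName)
            && Objects.equals(lastName, p.lastName)
            && Objects.equals(birthDate, p.birthDate);
    }

    @Override
    public int hashCode() {
        // Use all four identifying fields, matching the fields compared in equals.
        return Objects.hash(firstName, middleName, lastName, birthDate);
    }

    public static void main(String[] args) {
        PersonKey a = new PersonKey("Pat", "Q", "Smith", LocalDate.of(1990, 1, 2));
        PersonKey b = new PersonKey("Pat", "Q", "Smith", LocalDate.of(1990, 1, 2));
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```

Hashing fewer fields would be faster but would make people who share, say, a last name collide more often, which is the trade-off the slide names.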

Hashing Integers
key space = integers
Simple hash function: h(key) = key % TableSize
- Client: f(x) = x
- Library: g(x) = f(x) % TableSize
- Fairly fast and natural
Example:
- TableSize = 10
- Insert keys 7, 18, 41, 34, 10
Resulting table (index: key):
    0: 10
    1: 41
    2:
    3:
    4: 34
    5:
    6:
    7: 7
    8: 18
    9:
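The integer example above can be reproduced with a few lines of Java. This is a minimal sketch: the class and method names are hypothetical, and it uses `Math.floorMod` rather than a bare `%` as an assumption, because Java's `%` returns a negative result for negative keys and would produce an invalid index.

```java
final class IntHash {
    // h(key) = key % TableSize, with floorMod so negative keys still map to 0..tableSize-1.
    static int index(int key, int tableSize) {
        return Math.floorMod(key, tableSize);
    }

    public static void main(String[] args) {
        int tableSize = 10;
        Integer[] table = new Integer[tableSize];
        for (int key : new int[] {7, 18, 41, 34, 10}) {
            table[index(key, tableSize)] = key; // no collisions occur for these keys
        }
        for (int i = 0; i < tableSize; i++) {
            System.out.println(i + ": " + (table[i] == null ? "" : table[i]));
        }
    }
}
```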

Hashing non-integer keys
If keys are not ints, the client must provide a means to convert the key to an int
Programming trade-off:
- Calculation speed
- Avoiding distinct keys hashing to the same ints

Hashing Strings
Key space: $K = s_0 s_1 s_2 \ldots s_{k-1}$, where the $s_i$ are chars: $s_i \in [0, 255]$
Some choices: Which ones best avoid collisions?
1. $h(K) = s_0 \bmod \text{TableSize}$
2. $h(K) = \left( \sum_{i=0}^{k-1} s_i \right) \bmod \text{TableSize}$
3. $h(K) = \left( \sum_{i=0}^{k-1} s_i \cdot 37^{i} \right) \bmod \text{TableSize}$
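To compare the three candidate string hashes, a small Java sketch follows. The class and method names are hypothetical, the table size of 101 in `main` is an arbitrary choice for illustration, and returning 0 for an empty key is an assumption the slide does not address. The third function reduces modulo the table size at every step so the running total cannot overflow.

```java
final class StringHashes {
    // Choice 1: first character only.
    static int firstChar(String key, int tableSize) {
        return key.isEmpty() ? 0 : key.charAt(0) % tableSize;
    }

    // Choice 2: sum of all characters.
    static int charSum(String key, int tableSize) {
        long sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i);
        }
        return (int) (sum % tableSize);
    }

    // Choice 3: sum of s_i * 37^i, reduced mod tableSize as we go.
    static int polynomial(String key, int tableSize) {
        long hash = 0;
        long power = 1; // holds 37^i mod tableSize
        for (int i = 0; i < key.length(); i++) {
            hash = (hash + key.charAt(i) * power) % tableSize;
            power = (power * 37) % tableSize;
        }
        return (int) hash;
    }

    public static void main(String[] args) {
        int size = 101;
        // Same first letter: choice 1 collides.
        System.out.println(firstChar("abcde", size) == firstChar("apple", size));
        // Same letters in a different order: choice 2 collides.
        System.out.println(charSum("abcde", size) == charSum("ebcda", size));
        // Position-dependent weights tell the two strings apart here.
        System.out.println(polynomial("abcde", size) != polynomial("ebcda", size));
    }
}
```

Choice 1 ignores almost all of the key and choice 2 ignores character order, so only choice 3 uses both the characters and their positions.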

Combining Hash Functions
A few rules of thumb / tricks:
1. Use all 32 bits (be careful with negative numbers)
2. Use different overlapping bits for different parts of the hash
   - This is why a factor of 37^i works better than 256^i
   - Example: "abcde" and "ebcda"
3. When smashing two hashes into one hash, use bitwise-xor
   - bitwise-and produces too many 0 bits
   - bitwise-or produces too many 1 bits
4. Rely on expertise of others; consult books and other resources for standard hashing functions
5. Advanced: If keys are known ahead of time, a perfect hash can be calculated
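A few of these rules of thumb can be seen directly in 32-bit Java arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and the strings "abcde" and "abcdz" are chosen here (they are not the slide's example) because they differ only in the fifth character. With a factor of 256, the fifth weight is 256^4 = 2^32, which wraps to 0 in an `int`, so that character has no effect on the hash.

```java
final class HashMixing {
    // Sum of s_i * base^i in plain int arithmetic; overflow silently wraps around.
    static int polyHash(String s, int base) {
        int hash = 0;
        int power = 1; // base^i, truncated to 32 bits
        for (int i = 0; i < s.length(); i++) {
            hash += s.charAt(i) * power;
            power *= base;
        }
        return hash;
    }

    // Rule 3: combine two hashes with xor, which keeps 0 and 1 bits balanced.
    static int combine(int h1, int h2) {
        return h1 ^ h2;
    }

    // Rule 1: a full 32-bit hash may be negative, so avoid a bare % when indexing.
    static int toIndex(int hash, int tableSize) {
        return Math.floorMod(hash, tableSize);
    }

    public static void main(String[] args) {
        // Base 256 shifts each char into its own byte, so the fifth char falls off the end.
        System.out.println(polyHash("abcde", 256) == polyHash("abcdz", 256));
        // Base 37 keeps the bits overlapping, so the fifth char still matters.
        System.out.println(polyHash("abcde", 37) != polyHash("abcdz", 37));
        System.out.println(toIndex(-7, 10));
    }
}
```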

COLLISION RESOLUTION
Calling a State Farm agent is not an option…

Collision Avoidance
With (x % TableSize), the number of collisions depends on
- the ints inserted
- TableSize
Larger table-size tends to help, but not always
- Example: 70, 24, 56, 43, 10 with TableSize = 10 and TableSize = 60
Technique: Pick table size to be prime. Why?
- Real-life data tends to have a pattern
- "Multiples of 61" are probably less likely than "multiples of 60"
- Some collision strategies do better with prime size
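The table-size example can be checked by counting collisions directly. In this sketch (hypothetical class and method names), a collision is counted whenever a key lands on an index that an earlier key already occupies; the prime size 61 is added for comparison, following the slide's "multiples of 61" remark.

```java
final class CollisionCount {
    // Counts keys whose index (key % tableSize) was already taken by an earlier key.
    static int collisions(int[] keys, int tableSize) {
        boolean[] occupied = new boolean[tableSize];
        int count = 0;
        for (int key : keys) {
            int index = Math.floorMod(key, tableSize);
            if (occupied[index]) {
                count++;
            } else {
                occupied[index] = true;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] keys = {70, 24, 56, 43, 10};
        for (int size : new int[] {10, 60, 61}) {
            System.out.println("TableSize " + size + ": " + collisions(keys, size));
        }
    }
}
```

With both 10 and 60, the keys 70 and 10 share an index (0 and 10 respectively), so the six-times-larger table does not help; with 61 the five keys land on five distinct indices.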

Collision Resolution
Collision: when two keys map to the same location in the hash table
We try to avoid collisions, but the number of possible keys almost always exceeds the table size
Ergo, hash tables generally must support some form of collision resolution

Flavors of Collision Resolution
Separate Chaining
Open Addressing
- Linear Probing
- Quadratic Probing
- Double Hashing

Terminology Warning
We and the book use the terms
- "chaining" or "separate chaining"
- "open addressing"
Very confusingly, others use the terms
- "open hashing" for "chaining"
- "closed hashing" for "open addressing"
We also do trees upside-down

Separate Chaining
All keys that map to the same table location are kept in a linked list (a.k.a. a "chain" or "bucket")
As easy as it sounds
Example: insert 10, 22, 86, 12, 42 with h(x) = x % 10

Table state after each insert (index: chain, "/" marks the end of a chain, new keys go at the front; buckets not listed are empty):
    Initially:        all buckets 0 through 9 are empty
    After insert 10:  0: 10 /
    After insert 22:  0: 10 /    2: 22 /
    After insert 86:  0: 10 /    2: 22 /                6: 86 /
    After insert 12:  0: 10 /    2: 12 -> 22 /          6: 86 /
    After insert 42:  0: 10 /    2: 42 -> 12 -> 22 /    6: 86 /
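The insertion sequence in this example can be reproduced with a small separate-chaining table. This is a sketch under assumptions, not the course's implementation: the class and method names are hypothetical, keys are plain ints, duplicates are ignored, and new keys go at the front of the chain as in the slide's diagrams.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ChainedTable {
    private final List<LinkedList<Integer>> buckets = new ArrayList<>();

    ChainedTable(int tableSize) {
        for (int i = 0; i < tableSize; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    private int indexOf(int key) {
        return Math.floorMod(key, buckets.size()); // h(x) = x % TableSize
    }

    void insert(int key) {
        LinkedList<Integer> chain = buckets.get(indexOf(key));
        if (!chain.contains(key)) {
            chain.addFirst(key); // insert at the front of the chain
        }
    }

    boolean find(int key) {
        return buckets.get(indexOf(key)).contains(key);
    }

    // Copy of one chain, front to back, for inspection.
    List<Integer> chainAt(int index) {
        return new ArrayList<>(buckets.get(index));
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(10);
        for (int key : new int[] {10, 22, 86, 12, 42}) {
            table.insert(key);
        }
        System.out.println(table.chainAt(2)); // prints [42, 12, 22]
    }
}
```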

Thoughts on Separate Chaining
Worst-case time for find? Linear
- But only with really bad luck or bad hash function
- Not worth avoiding (e.g., with balanced trees at each bucket)
  - Keep small number of items in each bucket
  - Overhead of tree balancing not worthwhile for small n
Beyond asymptotic complexity, some "data-structure engineering" can improve constant factors
- Linked list, array, or a hybrid
- Insert at end or beginning of list
- Sorting the lists gains and loses performance
- Splay-like: Always move item to front of list
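As one illustration of the "splay-like" idea, the sketch below moves a key to the front of its chain every time it is found, so frequently requested keys tend to be reached sooner. It models a single bucket only; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// One bucket's chain with a move-to-front policy on successful finds.
final class MoveToFrontChain {
    private final LinkedList<Integer> chain = new LinkedList<>();

    void insertFront(int key) {
        chain.addFirst(key);
    }

    boolean find(int key) {
        // remove(Object) deletes the first match and reports whether one existed.
        // Integer.valueOf avoids the remove(int index) overload.
        if (!chain.remove(Integer.valueOf(key))) {
            return false;
        }
        chain.addFirst(key); // found: move it to the front
        return true;
    }

    List<Integer> snapshot() {
        return new ArrayList<>(chain);
    }

    public static void main(String[] args) {
        MoveToFrontChain bucket = new MoveToFrontChain();
        for (int key : new int[] {22, 12, 42}) {
            bucket.insertFront(key);
        }
        bucket.find(22);
        System.out.println(bucket.snapshot()); // prints [22, 42, 12]
    }
}
```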

Rigorous Separate Chaining Analysis
The load factor, λ, of a hash table is calculated as
    $\lambda = \frac{n}{\text{TableSize}}$
where n is the number of items currently in the table

Load Factor?
Table (index: chain):
    0: 10 /
    1: /
    2: 42 -> 12 -> 22 /
    3: /
    4: /
    5: /
    6: 86 /
    7: /
    8: /
    9: /
$\lambda = \frac{n}{\text{TableSize}} = \;?$
Answer: $\lambda = \frac{5}{10} = 0.5$

Load Factor?
[Figure: the same ten-bucket chaining table, now holding 21 keys spread across its chains]
$\lambda = \frac{n}{\text{TableSize}} = \;?$
Answer: $\lambda = \frac{21}{10} = 2.1$

Rigorous Separate Chaining Analysis
The load factor, λ, of a hash table is calculated as
    $\lambda = \frac{n}{\text{TableSize}}$
where n is the number of items currently in the table
Under chaining, the average number of elements per bucket is ___
So if some inserts are followed by random finds, then on average:
- Each unsuccessful find compares against ___ items
- Each successful find compares against ___ items
How big should TableSize be?

Rigorous Separate Chaining Analysis
The load factor, λ, of a hash table is calculated as
    $\lambda = \frac{n}{\text{TableSize}}$
where n is the number of items currently in the table
Under chaining, the average number of elements per bucket is λ
So if some inserts are followed by random finds, then on average:
- Each unsuccessful find compares against λ items (it scans a whole chain)
- Each successful find compares against about λ/2 items (it stops roughly halfway along a chain)
- If λ is low, find and insert are likely to be O(1)
- We like to keep λ around 1 for separate chaining
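The link between the load factor and the average bucket size can be checked numerically. In the sketch below (hypothetical class and method names), the chain lengths are those of the earlier five-key example table: one key in bucket 0, three in bucket 2, one in bucket 6. Averaging the chain lengths over all buckets gives the same value as n / TableSize, which is also the number of comparisons an unsuccessful find makes on average when every bucket is equally likely.

```java
final class LoadFactor {
    // lambda = n / TableSize
    static double loadFactor(int itemCount, int tableSize) {
        return (double) itemCount / tableSize;
    }

    // Mean chain length over all buckets, empty ones included.
    static double averageChainLength(int[] chainLengths) {
        int total = 0;
        for (int length : chainLengths) {
            total += length;
        }
        return (double) total / chainLengths.length;
    }

    public static void main(String[] args) {
        int[] chainLengths = {1, 0, 3, 0, 0, 0, 1, 0, 0, 0};
        System.out.println(loadFactor(5, 10));
        System.out.println(averageChainLength(chainLengths));
    }
}
```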

Separate Chaining Deletion
Not too bad and quite easy
- Find in table
- Delete from bucket
Similar run-time as insert
- Sensitive to underlying bucket structure
Table (index: chain):
    0: 10 /
    1: /
    2: 42 -> 12 -> 22 /
    3: /
    4: /
    5: /
    6: 86 /
    7: /
    8: /
    9: /
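The two deletion steps (find the bucket, then delete from it) might be sketched as follows. The class, the `build` helper, and the method names are hypothetical, and the buckets are assumed to be `LinkedList` chains of int keys with insertion at the front.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ChainedDelete {
    // Helper for the example: builds a chained table and inserts keys at the front.
    static List<LinkedList<Integer>> build(int tableSize, int... keys) {
        List<LinkedList<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < tableSize; i++) {
            buckets.add(new LinkedList<>());
        }
        for (int key : keys) {
            buckets.get(Math.floorMod(key, tableSize)).addFirst(key);
        }
        return buckets;
    }

    // Step 1: hash to the bucket. Step 2: unlink the key from that chain.
    // Returns true if the key was present.
    static boolean delete(List<LinkedList<Integer>> buckets, int key) {
        LinkedList<Integer> chain = buckets.get(Math.floorMod(key, buckets.size()));
        return chain.remove(Integer.valueOf(key)); // remove by value, not by index
    }

    public static void main(String[] args) {
        List<LinkedList<Integer>> buckets = build(10, 10, 22, 86, 12, 42);
        System.out.println(delete(buckets, 12)); // prints true
        System.out.println(buckets.get(2));      // prints [42, 22]
    }
}
```

The cost is dominated by the scan of one chain, the same as for find, which is why the run time depends on the bucket structure chosen.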
