Universal Hashing
by Peter Bro Miltersen

This lecture note was written for the course "Pearls of Theory" at University of Aarhus. Most recent revision, March 5, 1998.


1. Introduction

Universal hashing is theory at its best! Hashing started out as a purely heuristic method for implementing symbol tables. It moved into the hardcore theory of algorithms with Carter and Wegman's analysis of the concept of universality. It went on to play an important role in several of the most important constructions in abstract complexity theory and cryptography. And now, these constructions start to creep back into practice. Thus, having matured inside theory, hashing gets applied in ways the original symbol table implementors could not have dreamed of! In this note, we track the exciting career of the hash function.

2. The prehistory of universal hashing

The heuristic concept of hashing, as is nowadays known to most (all?) programmers, was introduced by Dumey in 1956 [4]. It was introduced as a solution to the symbol table problem (nowadays called the dictionary problem). In the dictionary problem, we are given a sequence of Insert$(k, x)$, Delete$(k)$, and Lookup$(k)$ operations which must be performed on-line (i.e. one operation must be completely performed before the next is considered) on an initially empty set $S$. Insert$(k, x)$ inserts the key $k$ with associated information $x$ into the set, Delete$(k)$ deletes the key $k$ and its associated information from the set, and Lookup$(k)$ returns the information associated with $k$, if $k$ is indeed in the set. For simplicity in the analysis which is to come, we assume that single keys and single pieces of associated information fit into single machine words, but that two keys or two pieces of information do not fit into a machine word.
This is often called, believe it or not, the transdichotomous model of computation.

Exercise 1 (for language freaks) Explain the term transdichotomous.

The goal is to perform the operations while minimizing the time and space used. The space used is measured in terms of memory registers. In general, we aim for linear space, i.e. space comparable to the size of the set being stored. Of course, the size of the set varies as the operations are performed, and this causes some complications in the solutions we'll look at. For simplicity, we will assume that we know a single upper bound $N$ on the size of the set at all times, and we will allow ourselves to use $O(N)$ registers, even when the set is much smaller (but see Problem 19).

Exercise 2 Recall some solutions to the dictionary problem. Do they use linear space? How fast are they?

Dumey's solution to the dictionary problem was the following. Assume the keys and pieces of information are both taken from the universe $U$. Pick some "crazy", "chaotic", "random" function $h$ (the hash function) mapping $U$ to $\{1, \ldots, N\}$. Initialize an array $A[1..N]$. At any given time, in $A[i]$ we keep a linked list containing the keys $k$ currently in the set, for which $h(k) = i$. For each key we attach the associated information. This is called chained hashing. There are other kinds of hashing which we'll happily ignore.

Exercise 3 (for language freaks) Why hash-function?

Exercise 4 Convince yourself that it is fairly simple to program this data structure, not much worse than implementing a single linked list.

Intuitively, it is fairly clear why this solution should work well. If the function $h$ is indeed "crazy", "chaotic" and "random", mapping our set $S$ to $\{1, \ldots, N\}$ using $h$ should behave as if we were just distributing elements of $S$ at random in $N$ buckets. Since the size of $S$ is at most $N$, we should expect the buckets to be quite small in general. As the crazy function, Dumey suggested $h(x) = x \bmod p$ for $p$ a prime.

Exercise 5 Why a prime??

Hashing is widely used in practice and experience shows that it does indeed work very well! But what about a rigorous analysis?
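
The chained hashing scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not Dumey's original formulation: the class name `ChainedHashTable` and its method names are hypothetical, keys are assumed to be non-negative integers, Python lists stand in for the linked lists, the buckets are indexed $0, \ldots, p-1$ rather than $1, \ldots, N$, and inserting an existing key is assumed to overwrite its information.

```python
class ChainedHashTable:
    """Chained hashing with h(k) = k mod p (illustrative sketch)."""

    def __init__(self, p):
        self.p = p  # number of buckets; the text suggests choosing a prime
        # One chain per bucket; a Python list stands in for a linked list.
        self.buckets = [[] for _ in range(p)]

    def _h(self, k):
        # The hash function h(x) = x mod p, with values in 0..p-1.
        return k % self.p

    def insert(self, k, x):
        chain = self.buckets[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                chain[i] = (k, x)  # assumption: re-inserting a key overwrites
                return
        chain.append((k, x))

    def delete(self, k):
        chain = self.buckets[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                del chain[i]
                return

    def lookup(self, k):
        for key, x in self.buckets[self._h(k)]:
            if key == k:
                return x
        return None  # assumption: None signals "k is not in the set"


table = ChainedHashTable(p=7)
table.insert(10, "a")
table.insert(17, "b")    # 17 mod 7 == 10 mod 7 == 3: both keys share one chain
table.delete(10)
print(table.lookup(17))  # prints "b"
print(table.lookup(10))  # prints "None"
```

Every operation touches exactly one chain, so its cost is governed by the length of that chain, which is why the quality of $h$ matters so much.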
It is easy to see that the above intuition cannot be formalized so that the argument above will be true for all sets $S$.

Exercise 6 Why not?

Even given the answer to Exercise 6, hashing was intensely analyzed in the two decades following Dumey's invention. The problem exposed in Exercise 6 was dealt with in two different ways.

  1. In some papers, it is assumed that the set to be stored is not a worst case set. Instead, we assume that it is chosen according to some probability distribution or has some structural property we can explore.

  2. In some papers, we do not assume anything about the set $S$, but we assume that $h$ really is a random function, i.e. chosen uniformly at random from the set of all functions mapping $U$ to $\{1, \ldots, N\}$.

There are papers of both kinds with deep and beautiful mathematics. However, both kinds do leave you a bit nervous about the relevance or the meaningfulness of the results. The first kind is based on assumptions on the input set which may be hard or impossible to guarantee in practice, and the second is simply based on a false assumption! No matter how long you stare at the function $h(x) = x \bmod p$, it will not morph into a random function.

3. An analysis of the second kind

In spite of the above, it turns out that the first really satisfactory analysis of hashing is based on an analysis of the second kind, so we shall proceed along those lines.

Theorem 7 Assume that $h$ really is chosen uniformly at random from the set of all functions between $U$ and $\{1, \ldots, N\}$. Furthermore assume that $h$ can be evaluated in constant time. Then the expected time required to perform any sequence of $m$ operations (satisfying the upper bound $N$ on the maximum size of the set) by chained hashing is $O(m)$.

In other words, we can perform the operations in constant expected amortized time per operation!

Exercise 8 The constant amortized time bound in the above theorem may seem so attractive that the reader may consider actually ensuring that the premise is true, i.e. actually choosing $h$ uniformly at random from the set of all functions between $U$ and $\{1, \ldots, N\}$. This is, as we shall see later, in a way a good idea, but explain the big problem.

Let's prove the theorem.
Assume that the sequence of operations is

$$\mathrm{op}_1(k_1), \mathrm{op}_2(k_2), \ldots, \mathrm{op}_m(k_m)$$

with $\mathrm{op}_i \in \{\text{Insert}, \text{Delete}, \text{Lookup}\}$. We are only mentioning the key-parameters $k_i$, since the information-parameters $x_i$ are unimportant for the analysis.
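
As an informal sanity check on this setup (it is not part of the proof), the sketch below runs a sequence of (operation, key) pairs through chained hashing and counts the work done. Everything in it is an assumption made for illustration: the function name `run_operations`, the cost measure (one unit for evaluating $h$ plus the length of the chain that is scanned), and the way the random function is simulated, namely by drawing each value $h(k)$ uniformly and independently the first time $k$ is seen and remembering it in a Python dict. That last device is a simulation convenience only, not a practical hash function.

```python
import random


def run_operations(ops, n_buckets, seed=0):
    """Run (op, key) pairs on-line; return an upper bound on the total steps."""
    rng = random.Random(seed)
    h = {}  # lazily sampled random function (simulation only)
    buckets = [[] for _ in range(n_buckets)]
    steps = 0
    for op, k in ops:
        if k not in h:
            # Each value is uniform over the buckets and independent of the others.
            h[k] = rng.randrange(n_buckets)
        chain = buckets[h[k]]
        # One unit for evaluating h, plus a full scan of the chain at worst.
        steps += 1 + len(chain)
        if op == "insert":
            if k not in chain:
                chain.append(k)
        elif op == "delete":
            if k in chain:
                chain.remove(k)
        # "lookup" only scans the chain, which is already counted above.
    return steps


N = 1000
keys = random.Random(1).sample(range(10**9), N)  # N distinct keys
ops = [("insert", k) for k in keys] + [("lookup", k) for k in keys]
m = len(ops)
print(run_operations(ops, N) / m)  # average steps per operation
```

With at most $N$ keys spread over $N$ buckets, the printed average should stay a small constant as $N$ grows, which is the behaviour Theorem 7 asserts in expectation.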
