Simple and Space-Efficient Minimal Perfect Hash Functions

Fabiano C. Botelho, Department of Computer Science, Federal University of Minas Gerais, Brazil
Rasmus Pagh, Computational Logic and Algorithms Group, IT University of Copenhagen, Denmark
Nivio Ziviani, Department of Computer Science, Federal University of Minas Gerais, Brazil

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

What Is the Problem to Solve?

Design, analyze and implement MPHFs (minimal perfect hash functions) that:
• Use space close to the optimal
• Are faster to generate than the ones available in the literature
• Are fast to compute
• Need little memory to generate

Perfect Hash Function

[Figure: a key set S of size n, with S ⊆ U and |U| = u, is mapped by a perfect hash function into a hash table with slots 0, 1, ..., m-1, each key getting its own slot. The keys are numbered 0, 1, ..., n-1.]

Minimal Perfect Hash Function

[Figure: a key set S of size n, with S ⊆ U and |U| = u, is mapped by a minimal perfect hash function into a hash table with exactly n slots 0, 1, ..., n-1, each key getting its own slot. The keys are numbered 0, 1, ..., n-1.]

Lower Bounds for Storage Space

PHFs (m ≈ n): storage space $\geq \frac{n^2}{m} \log e$
MPHFs (m = n): storage space $\geq n \log e$
where $\log e = 1.4427$
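
As a rough check of these bounds, the sketch below evaluates them in bits per key. It assumes the logarithm is base 2 (consistent with log e = 1.4427) and uses m = 1.23n, the table size quoted later in the talk for r = 3, purely as an example; the function names are illustrative.

```python
import math

LOG2_E = math.log2(math.e)  # about 1.4427

def mphf_lower_bound_bits(n: int) -> float:
    """Lower bound for an MPHF (m = n): n * log2(e) bits."""
    return n * LOG2_E

def phf_lower_bound_bits(n: int, m: int) -> float:
    """Lower bound for a PHF with m close to n: (n^2 / m) * log2(e) bits."""
    return (n * n / m) * LOG2_E

n = 1_000_000
m = round(1.23 * n)
mphf_bits_per_key = mphf_lower_bound_bits(n) / n
phf_bits_per_key = phf_lower_bound_bits(n, m) / n
print(f"MPHF: {mphf_bits_per_key:.4f} bits/key")
print(f"PHF (m = 1.23n): {phf_bits_per_key:.2f} bits/key")
```

The PHF bound shrinks as m grows relative to n, since a larger table leaves more freedom in where keys may land.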

Related Work

• Theoretical Results
• Practical Results
• Heuristics

Theoretical Results

Work                         Gen. Time          Eval. Time   Size (bits)
Mehlhorn (1984)              Expon.             Expon.       O(n)
Schmidt and Siegel (1990)    Not analyzed       O(1)         O(n)
Hagerup and Tholey (2001)    O(n + log log u)   O(1)         O(n)

Practical Results

Work                                        Gen. Time   Eval. Time   Size (bits)
Czech, Havas and Majewski (1992)            O(n)        O(1)         O(n log n)
Majewski, Wormald, Havas and Czech (1996)   O(n)        O(1)         O(n log n)
Pagh (1999)                                 O(n)        O(1)         O(n log n)

Heuristics

Work                               Application            Gen. Time   Eval. Time   Size (bits)
Fox, Chen and Heath (1992)         Index data in CD-ROM   Exp.        O(1)         O(n)
Lefebvre and Hoppe (2006)          Sparse spatial data    O(n)        O(n)         O(1)
Chang, Lin and Chou (2005, 2006)   Data mining            O(n)        O(1)         Not analyzed

Our Family of Algorithms

• Near-optimal space
• Evaluation in constant time
• Function generation in linear time
• Simple to describe and implement

Algorithms in the literature with near-optimal space either:
• Require exponential time for construction and evaluation, or
• Use near-optimal space only asymptotically, for large n

Acyclic random hypergraphs:
• Used before by Majewski et al. (1996): O(n log n) bits
• We proceed differently: O(n) bits (we reduce the space complexity, getting close to the theoretical lower bound)

Our Family of Algorithms - Remark

• Chazelle et al. (SODA 2004) presented a way of constructing PHFs that is equivalent to ours
• It is explained as a modification of the "Bloomier Filter" data structure, but they do not make explicit that a PHF is constructed

Random Hypergraphs (r-graphs)

3-graph:
h0(jan) = 1   h1(jan) = 3   h2(jan) = 5
h0(feb) = 1   h1(feb) = 2   h2(feb) = 5
h0(mar) = 0   h1(mar) = 3   h2(mar) = 4

[Figure: the 3-graph on vertices 0 to 5, with one hyperedge per key.]

• A 3-graph is induced by three uniform hash functions
• Our best result uses 3-graphs
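
A minimal sketch of how keys induce an r-graph: each key contributes one hyperedge whose vertices are its r hash values. `SLIDE_EDGES` copies the values from the slide. The function `h` is a hypothetical stand-in for a uniform hash function (built on BLAKE2 from the standard library, not the authors' choice), and giving each h_i its own block of vertices is an assumption that happens to agree with the values on the slide.

```python
import hashlib
from collections import Counter

# Hash values copied from the slide: one hyperedge of 3 vertices per key.
SLIDE_EDGES = {
    "jan": (1, 3, 5),
    "feb": (1, 2, 5),
    "mar": (0, 3, 4),
}

def h(i: int, key: str, m: int, r: int = 3) -> int:
    """Hypothetical i-th hash function (an assumption, for illustration only).

    Maps key into the i-th block of m // r vertices, so the r vertices
    of an edge are always distinct.
    """
    block = m // r
    digest = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([i])).digest()
    return i * block + int.from_bytes(digest, "big") % block

def build_r_graph(keys, m, r=3):
    """Return one hyperedge (a tuple of r vertices) per key."""
    return {key: tuple(h(i, key, m, r) for i in range(r)) for key in keys}

# Vertex degrees in the slide's 3-graph: how many edges touch each vertex.
degree = Counter(v for edge in SLIDE_EDGES.values() for v in edge)
random_edges = build_r_graph(["jan", "feb", "mar"], m=6)
print(dict(degree))
print(random_edges)
```

In the slide's graph, vertices 0, 2 and 4 have degree 1, which is what allows edges to be removed one at a time when testing for acyclicity.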

Acyclic 2-graph

[Figure, five animation steps: the 2-graph G_r has vertices 0 to 3 (range of h0) and 4 to 7 (range of h1), and one edge for each of the keys jan, feb, mar and apr. At each step one edge is removed and appended to the list L; the edge labels 0, 1, 2, 3 give the removal order.]

Step 0: L = Ø; remaining edges: jan, feb, mar, apr
Step 1: L = {0,5}; remaining edges: jan, feb, apr
Step 2: L = {0,5}, {2,6}; remaining edges: jan, apr
Step 3: L = {0,5}, {2,6}, {2,7}; remaining edge: jan
Step 4: L = {0,5}, {2,6}, {2,7}, {2,5}; no edges remain, so G_r is acyclic
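
The acyclicity test animated on these slides can be sketched as a peeling loop: repeatedly remove an edge that contains a vertex of degree 1 and append it to L. If every edge gets removed, the graph is acyclic. The edge list below is read from the figure (in the order jan, feb, mar, apr); the function name `peel` and the queue discipline are illustrative assumptions, not taken from the talk.

```python
from collections import defaultdict, deque

# Edges of G_r as read from the slides, in the order jan, feb, mar, apr.
EDGES = [(2, 5), (2, 6), (0, 5), (2, 7)]

def peel(edges):
    """Return the edges in removal order if the graph is acyclic, else None.

    Assumes the vertices of an edge are distinct (h0 and h1 have
    disjoint ranges), so there are no self-loops.
    """
    incident = defaultdict(set)  # vertex -> ids of edges still present
    for eid, edge in enumerate(edges):
        for v in edge:
            incident[v].add(eid)

    # Start from every vertex that currently has degree 1.
    queue = deque(v for v in sorted(incident) if len(incident[v]) == 1)
    order = []
    while queue:
        v = queue.popleft()
        if len(incident[v]) != 1:
            continue  # its only edge was already removed
        (eid,) = incident[v]
        order.append(edges[eid])
        for u in edges[eid]:
            incident[u].discard(eid)
            if len(incident[u]) == 1:
                queue.append(u)
    return order if len(order) == len(edges) else None

L = peel(EDGES)
print(L)
```

With this queue discipline the removal order on the example is {0,5}, {2,6}, {2,7}, {2,5}, the same list L as on the slides. Other valid removal orders exist.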

The Family of Algorithms (r = 2)

[Figure, four animation steps: the key set S = {jan, feb, mar, apr} goes through three steps, Mapping, Assigning and Ranking. Mapping builds the acyclic 2-graph G_r and the list L = {0,5}, {2,6}, {2,7}, {2,5}. Assigning fills the array g. Ranking produces the hash table.]

Array g (entries 0, 2, 6 and 7 are assigned; r marks the others):
index:  0  1  2  3  4  5  6  7
g:      0  r  0  r  r  r  1  1

• Values in the range {0, 1, ..., r}
• r = 2 or r = 3
• At most 2 bits for each vertex in g

Hash table: 0 → mar, 1 → jan, 2 → feb, 3 → apr

Example:
i = (g(h0(feb)) + g(h1(feb))) mod r = (g(2) + g(6)) mod 2 = 1
phf(feb) = h_i(feb) = h1(feb) = 6
mphf(feb) = rank(phf(feb)) = rank(6) = 2
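
The assigning and ranking steps on the running example can be sketched as follows. The hash values per key and the list L are read from the slides. The assignment rule (process L in reverse, pick the first unvisited vertex of each edge) is one way to obtain the g shown in the figure, and `rank` here is a plain linear scan, a simplification of whatever ranking structure is actually stored. All names are illustrative.

```python
R = 2  # edge size; the value r also marks an unassigned vertex in g
M = 8  # number of vertices

# (h0, h1) per key, read from the figure; lookup tables stand in for hash functions.
H = {"jan": (2, 5), "feb": (2, 6), "mar": (0, 5), "apr": (2, 7)}

# Removal order L produced by the acyclicity test on the slides.
L = [(0, 5), (2, 6), (2, 7), (2, 5)]

def assign(order, m, r):
    """Fill g so that sum(g[v] for v in edge) mod r selects a distinct vertex per edge."""
    g = [r] * m
    visited = [False] * m
    for edge in reversed(order):
        # In reverse removal order every edge still has an unvisited vertex.
        j = next(i for i, v in enumerate(edge) if not visited[v])
        others = sum(g[v] for i, v in enumerate(edge) if i != j)
        g[edge[j]] = (j - others) % r
        for v in edge:
            visited[v] = True
    return g

g = assign(L, M, R)

def phf(key):
    """Perfect hash: the vertex of the key's edge selected by g."""
    hs = H[key]
    return hs[sum(g[v] for v in hs) % R]

def rank(v):
    """Number of assigned vertices before v (linear scan, for illustration)."""
    return sum(1 for x in g[:v] if x != R)

def mphf(key):
    return rank(phf(key))

print(g)
print({key: mphf(key) for key in H})
```

An unassigned entry holds r, which is 0 modulo r, so it does not disturb the sums for the edges that touch it.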

Use of Acyclic Random Hypergraphs

• Acyclicity is a sufficient condition for the family of algorithms to work (Majewski et al. (1996))
• Repeatedly select h0, h1, ..., h_{r-1} until the resulting r-graph is acyclic
• For r = 2, m = cn and c > 2: $Pr_a = \sqrt{1 - (2/c)^2}$
  For c = 2.09, Pr_a = 0.29
• For r = 3 and c ≥ 1.23: the probability tends to 1
• Number of iterations is 1/Pr_a:
  r = 2: 3.5 iterations
  r = 3: 1.0 iteration
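
The figures for r = 2 can be checked numerically. The sketch assumes the closed form Pr_a = sqrt(1 - (2/c)^2) for a random 2-graph with m = cn vertices and c > 2, and uses the fact that the expected number of independent trials until the first success is 1/Pr_a. The function name is illustrative.

```python
import math

def pr_acyclic_2graph(c: float) -> float:
    """Probability that a random 2-graph with m = c*n vertices and n edges is acyclic."""
    if c <= 2:
        raise ValueError("the formula requires c > 2")
    return math.sqrt(1 - (2 / c) ** 2)

p = pr_acyclic_2graph(2.09)
expected_iterations = 1 / p
print(f"Pr_a = {p:.2f}")
print(f"expected iterations = {expected_iterations:.2f}")
```

For c = 2.09 this gives Pr_a close to 0.29 and an expected iteration count a little under the 3.5 quoted on the slide. Raising c makes the graph sparser and lowers the iteration count, at the cost of a larger g.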

Space to Represent the Functions (r = 2)

PHFs (ranking information not required):
• g: [0, m-1] → {0, 1}
• m = cn bits; c = 2.09 ⇒ 2.09n bits

MPHFs (ranking information required):
• g: [0, m-1] → {0, 1, 2}
• 2m + εm = (2 + ε)cn bits
• For c = 2.09 and ε = 0.125 ⇒ 4.44n bits

Packed MPHFs (range of size 3):
• log 3 bits for each entry of g (arithmetic coding)
• (log 3 + ε)cn bits; for c = 2.09 and ε = 0.125 ⇒ 3.6n bits

[Figure: the array g = (0, r, 0, r, r, r, 1, 1) from the running example.]
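
To give a feel for packing a g with range of size 3, the sketch below stores five entries per byte, which works because 3^5 = 243 fits in 8 bits. This is plain fixed-width base-3 packing, a simpler stand-in for the arithmetic coding named on the slide, and it costs 8/5 = 1.6 bits per entry rather than log 3 ≈ 1.585. The function names are illustrative.

```python
def pack_trits(values):
    """Pack values from {0, 1, 2} five at a time into one byte each."""
    out = bytearray()
    for start in range(0, len(values), 5):
        chunk = values[start:start + 5]
        byte = 0
        # Build the base-3 number with chunk[0] as the least significant digit.
        for v in reversed(chunk):
            if v not in (0, 1, 2):
                raise ValueError("entries must be 0, 1 or 2")
            byte = byte * 3 + v
        out.append(byte)
    return bytes(out)

def get_trit(packed, index):
    """Read entry number index back without unpacking the whole array."""
    return (packed[index // 5] // 3 ** (index % 5)) % 3

# The g array of the running example (r = 2, so the unassigned marker is 2).
g = [0, 2, 0, 2, 2, 2, 1, 1]
packed = pack_trits(g)
unpacked = [get_trit(packed, i) for i in range(len(g))]
print(len(packed), unpacked)
```

Reading an entry takes one division and one modulo, so evaluation stays constant-time, though a little slower than an unpacked 2-bit array.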

Space to Represent the Functions (r = 3)

PHFs (ranking information not required):
• g: [0, m-1] → {0, 1, 2}
• 2m = 2cn bits; c = 1.23 ⇒ 2.46n bits

Packed PHFs (range of size 3):
• log 3 bits for each entry of g (arithmetic coding)
• (log 3)cn bits; c = 1.23 ⇒ 1.95n bits
• Optimal: 1.17n bits

MPHFs (ranking information required):
• g: [0, m-1] → {0, 1, 2, 3}
• 2m + εm = (2 + ε)cn bits
• For c = 1.23 and ε = 0.125 ⇒ 2.62n bits
• Optimal: 1.4427n bits
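
The bits-per-key figures on the two space slides all follow one pattern: (bits per entry of g + ranking overhead) × c. The sketch below recomputes them under that reading, with the logarithm taken base 2; the helper name and its parameters are illustrative.

```python
import math

def bits_per_key(c: float, bits_per_entry: float, eps: float = 0.0) -> float:
    """Space per key: (bits per entry of g + ranking overhead eps) * c, with m = c*n."""
    return (bits_per_entry + eps) * c

LOG2_3 = math.log2(3)

space = {
    "r=2 PHF":         bits_per_key(2.09, 1),
    "r=2 MPHF":        bits_per_key(2.09, 2, 0.125),
    "r=2 packed MPHF": bits_per_key(2.09, LOG2_3, 0.125),
    "r=3 PHF":         bits_per_key(1.23, 2),
    "r=3 packed PHF":  bits_per_key(1.23, LOG2_3),
    "r=3 MPHF":        bits_per_key(1.23, 2, 0.125),
}
for name, value in space.items():
    print(f"{name}: {value:.3f} bits/key")
```

The results agree with the slide figures to within rounding. The smaller c allowed by 3-graphs is what brings the r = 3 MPHF below the packed r = 2 MPHF even without packing.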

Experimental Results

Metrics:
• Generation time
• Storage space
• Evaluation time

Collection:
• URLs collected from the web, 64 bytes long on average

Experiments:
• Run on a commodity PC with a cache of 2 Mbytes

Related Algorithms

• Botelho, Kohayakawa, Ziviani (2005) - BKZ
• Fox, Chen and Heath (1992) - FCH
• Czech, Havas and Majewski (1992) - CHM
• Majewski, Wormald, Havas and Czech (1996) - MWHC
• Pagh (1999) - PAGH

Generation Time and Storage Space

                                       Storage Space
Algorithms     Generation Time (sec)   Bits/Key   Size (MB)
Ours, r = 2    19.49 ± 3.750           3.60       1.52
Ours, r = 3    9.80 ± 0.007            2.62       1.11
BKZ            16.85 ± 1.85            21.76      9.19
FCH            5901.9 ± 1489.6         3.66       1.55
MWHC           10.63 ± 0.09            26.76      11.30
PAGH           52.55 ± 2.66            44.16      18.65

n = 3,541,615 keys
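
The Size (MB) column can be related to the Bits/Key column with a short conversion. The sketch assumes 1 MB = 2^20 bytes, which reproduces the table's figures to within rounding; that unit is an inference, not something the slide states.

```python
N_KEYS = 3_541_615  # number of keys in the experiments

def size_mb(bits_per_key: float, n: int = N_KEYS) -> float:
    """Convert bits/key into megabytes, assuming 1 MB = 2**20 bytes."""
    return bits_per_key * n / 8 / 2**20

for name, bpk in [("Ours, r = 2", 3.60), ("Ours, r = 3", 2.62), ("BKZ", 21.76)]:
    print(f"{name}: {size_mb(bpk):.2f} MB")
```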

Evaluation Time

Algorithms     Evaluation Time (sec)
Ours, r = 2    2.63
Ours, r = 3    2.73
BKZ            2.81
FCH            2.14
MWHC           2.85
PAGH           2.78

n = 3,541,615 keys

Comparison of the Resulting PHFs and MPHFs

                                                                   Storage Space
r   Packed   m       Generation Time (sec)   Evaluation Time (sec)   Bits/Key   Size (MB)
2   No       2.09n   19.41 ± 3.736           1.83                    2.09       0.88
2   Yes      n       19.49 ± 3.750           2.63                    3.60       1.52
3   No       1.23n   9.73 ± 0.009            2.16                    2.46       1.04
3   Yes      1.23n   9.95 ± 0.009            2.14                    1.95       0.82
3   No       n       9.80 ± 0.007            2.73                    2.62       1.11

n = 3,541,615 keys