
SmartCuckoo: A Fast and Cost-Efficient Hashing Index Scheme for Cloud Storage Systems



  1. SmartCuckoo: A Fast and Cost-Efficient Hashing Index Scheme for Cloud Storage Systems
     Yuanyuan Sun, Yu Hua, Song Jiang*, Qiuyu Li, Shunde Cao, Pengfei Zuo
     Huazhong University of Science and Technology; *University of Texas, Arlington
     Presented at USENIX ATC 2017

  2. Indexing services in cloud storage
     - Large amounts of data
       - From small hand-held devices to large-scale data centers
       - 44 ZB in total, 5.2 TB for each user in 2020 (IDC 2014)
     - Fast query services are important to both users and systems
       - Returning accurate results in a real-time manner
       - Improving system performance and storage efficiency

  3. The importance of hash tables
     - Hash tables are widely used in data stores and caches
       - Key-value stores, e.g., Memcached, Redis
       - Relational databases, e.g., MonetDB, HyPer
       - In-cache index (ICS 2014, MICRO 2015)
     - Strengths:
       - Constant-scale addressing complexity, ~O(1)
       - Fast query response
     - Weakness:
       - Risk of high latency when handling hash collisions
     - Cuckoo hashing

  4. Cuckoo hashing
     - Kick-out operations: like cuckoo birds
     - Open addressing
     - Supporting fast lookups: O(1) time complexity
     - However, insertion latency can be very high and unpredictable, especially when an endless loop occurs!
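
The kick-out behavior summarized on this slide can be sketched as follows. This is a minimal illustration and not the presenters' code: the class name `CuckooTable`, the two hash functions, and the bound `MAX_KICKS` are assumptions made for the example (the value 500 borrows the relocation limit that a later slide attributes to MemC3).

```python
from typing import List, Optional

MAX_KICKS = 500  # assumed bound on kick-outs before the insertion gives up


class CuckooTable:
    """Two-table cuckoo hash with one slot per bucket (illustrative only)."""

    def __init__(self, size: int = 8) -> None:
        self.size = size
        self.tables: List[List[Optional[int]]] = [[None] * size, [None] * size]

    def _bucket(self, which: int, key: int) -> int:
        # Hypothetical hash functions for non-negative integer keys.
        if which == 0:
            return key % self.size
        return (key // self.size) % self.size

    def lookup(self, key: int) -> bool:
        # At most two probes: one candidate bucket in each table.
        return any(self.tables[t][self._bucket(t, key)] == key for t in (0, 1))

    def insert(self, key: int) -> bool:
        if self.lookup(key):
            return True
        which = 0
        for _ in range(MAX_KICKS):
            idx = self._bucket(which, key)
            victim = self.tables[which][idx]
            self.tables[which][idx] = key   # take the bucket
            if victim is None:
                return True                 # the bucket was empty: done
            key = victim                    # the evicted item must move on
            which = 1 - which               # to its bucket in the other table
        # Bound exhausted: the item still being carried is left unplaced.
        # A real table would stash it or rehash at this point.
        return False
```

With `size=8` and these hypothetical hash functions, the keys 0, 64, and 128 share both candidate buckets, so inserting the third one keeps kicking items between the same two buckets until `MAX_KICKS` is exhausted and `insert` returns `False`.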

  5. How is an endless loop formed?
     [Figure: a hash table with buckets 0 to 7, item a, and hash function H1( ).]

  6. How is an endless loop formed?
     [Figure: buckets 0 to 7 with items c and a and hash function H1( ).]

  7. How is an endless loop formed?
     [Figure: buckets 0 to 7 with items a, c, and b and hash functions H1( ) and H2( ).]

  8. How is an endless loop formed?
     [Figure: two tables T1 and T2 (buckets 0 to 7) holding items a, b, and c.]

  9. How is an endless loop formed?
     [Figure: tables T1 and T2 holding items a, b, c, d, and e; a new item x is hashed with H1( ) and H2( ).]

  10. How is an endless loop formed?
      [Figure: tables T1 and T2 with x, a, b, c, d, and e; labels "Kickout for empty buckets" and "My alternative location".]

  11. How is an endless loop formed?
      [Figure: same tables after c has moved to its alternative location; labels "Kickout for empty buckets" and "My alternative location".]

  12. How is an endless loop formed?
      [Figure: same tables with x placed; labels "Kickout for empty buckets" and "My alternative location".]

  13. How is an endless loop formed?
      [Figure: tables T1 and T2 with the kick-out path among a, b, x, and d.]
      - An endless loop is formed.
      - Endless kickouts for any insertion within the loop.

  14. Observations
      - Endless loops widely exist in cuckoo hashing structures.
        - More than 25% (cuckoo hashing with a stash)
      - Loop ratio: the percentage of insertion failures due to loops
      [Figure: loop ratio (%) versus load factor (0.10 to 0.90) for the RandomInteger, MacOS, and DocWords traces.]

  15. Existing works
      - ChunkStash @ USENIX ATC'10
        - Collisions: recursive strategy to relocate one of the keys in the candidate positions
        - Loops: an auxiliary linked list (or hash table)
      - MemC3 @ NSDI'13
        - Collisions: random and repeated relocation (500 times)
        - Loops: an expansion process
        - Stand-alone implementation: libcuckoo @ EuroSys'14
      - Horton tables @ USENIX ATC'16
        - Recursively evicting keys within a certain search tree height

  16. Motivations
      - Due to endless loops:
        - Substantial resource consumption
          - A large number of step-by-step kick-out operations
        - Unbounded performance
          - Fruitless effort
      - Design goal:
        - Predetermining and avoiding the occurrence of endless loops

  17. Our approach: SmartCuckoo
      - Tracking item placements in the hash table
        - Representing the hashing relationship as a directed pseudoforest
        - Classifying item insertions into three cases
        - Predetermining and avoiding loops during insertion without any kick-out attempts

  18. How to identify loop(s)?
      - Pseudoforest:
        - A graph in which each vertex has an outdegree of at most one
        - Each connected component (subgraph) has at most one cycle (loop)
        - In a subgraph:
          - Loop: #Vertices = #Edges
          - No loop: #Vertices = #Edges + 1
      [Figure: two example subgraphs, one labeled "Maximal" and one labeled "Non-maximal" with a vacancy.]
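
The vertex and edge counting rule on this slide can be checked mechanically. The sketch below is an illustration under stated assumptions, not the presenters' implementation: it treats the graph as undirected for counting purposes, and the function name `component_status` and the use of a small union-find are choices made for this example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def component_status(edges: Iterable[Edge]) -> List[Tuple[int, int, str]]:
    """Return sorted (num_vertices, num_edges, status) per connected subgraph."""
    parent: Dict[Hashable, Hashable] = {}

    def find(v: Hashable) -> Hashable:
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    edge_list = list(edges)
    for u, v in edge_list:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v        # merge the two subgraphs

    n_vertices: Dict[Hashable, int] = defaultdict(int)
    n_edges: Dict[Hashable, int] = defaultdict(int)
    for v in list(parent):
        n_vertices[find(v)] += 1
    for u, _ in edge_list:
        n_edges[find(u)] += 1

    result = []
    for root, nv in n_vertices.items():
        ne = n_edges[root]
        if nv == ne:
            status = "maximal"             # contains a loop
        elif nv == ne + 1:
            status = "non-maximal"         # no loop
        else:
            status = "not a pseudoforest"  # more than one cycle
        result.append((nv, ne, status))
    return sorted(result)
```

For example, a triangle `a-b-c-a` has three vertices and three edges and is reported as maximal, while a single edge `d-e` has two vertices and one edge and is reported as non-maximal.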

  19. Classification and predetermination
      - Three cases depending on the number of vertices added to the graph: v+0, v+1, and v+2
      - v+0: 5 possible scenarios based on the status of the corresponding subgraph(s)

      | Case | Two insert positions of a key | Subgraph status             | Scenario |
      |------|-------------------------------|-----------------------------|----------|
      | v+0  | Same subgraph                 | Non-maximal                 | (a)      |
      | v+0  | Same subgraph                 | Maximal                     | (e)      |
      | v+0  | Different subgraphs           | Both non-maximal            | (b)      |
      | v+0  | Different subgraphs           | A maximal and a non-maximal | (c)      |
      | v+0  | Different subgraphs           | Both maximal                | (d)      |
      | v+1  | A new one                     | -                           | -        |
      | v+2  | Two new ones                  | -                           | -        |
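
The classification on this slide can be sketched as a small graph tracker. This is a hedged illustration, not the SmartCuckoo source: the class `Pseudoforest`, its method names, and the string labels are hypothetical, and the sketch assumes that the two candidate buckets of a key are always distinct (one per table) and that a bucket becomes a vertex the first time any key maps to it. It tracks only the graph, not the item movement inside the tables.

```python
from typing import Dict, Hashable


class Pseudoforest:
    """Tracks subgraphs of buckets and whether each one already has a loop."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}
        self.maximal: Dict[Hashable, bool] = {}  # valid for root vertices

    def _find(self, v: Hashable) -> Hashable:
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def classify(self, b1: Hashable, b2: Hashable) -> str:
        """Label an insertion whose candidate buckets are b1 and b2."""
        known1, known2 = b1 in self.parent, b2 in self.parent
        if not known1 and not known2:
            return "v+2"
        if known1 != known2:
            return "v+1"
        r1, r2 = self._find(b1), self._find(b2)
        if r1 == r2:
            return "v+0(e)" if self.maximal[r1] else "v+0(a)"
        if self.maximal[r1] and self.maximal[r2]:
            return "v+0(d)"
        if self.maximal[r1] or self.maximal[r2]:
            return "v+0(c)"
        return "v+0(b)"

    def insert(self, b1: Hashable, b2: Hashable) -> bool:
        """Record the new edge, or report failure without any kick-outs."""
        if self.classify(b1, b2) in ("v+0(d)", "v+0(e)"):
            return False                    # every reachable bucket is taken
        for b in (b1, b2):
            if b not in self.parent:
                self.parent[b] = b
                self.maximal[b] = False
        r1, r2 = self._find(b1), self._find(b2)
        if r1 == r2:
            self.maximal[r1] = True         # scenario (a): the edge closes a loop
        else:
            self.parent[r1] = r2            # merge; a loop on either side persists
            self.maximal[r2] = self.maximal[r1] or self.maximal[r2]
        return True
```

Calling `classify` before `insert` mirrors the idea of deciding the outcome in advance: scenarios (d) and (e) are rejected immediately, and every other case is accepted and updates the subgraph status.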

  20. v+0: (a) One non-maximal subgraph
      - One empty bucket
      - Success!
      [Figure: tables T1 and T2 (buckets 0 to 7) and the corresponding pseudoforest, before and after inserting x1 via H1(x1) and H2(x1).]

  21. v+0: (b) Two non-maximal subgraphs
      - Two empty buckets
      - Success!
      [Figure: tables T1 and T2 (buckets 0 to 7) and the corresponding pseudoforest, before and after inserting x2 via H1(x2) and H2(x2).]

  22. v+0: (c) One maximal and one non-maximal
      - One loop and one empty bucket
      - Conventional cuckoo hashing: taking a random walk
        - T1: executing extra useless kick-out operations
        - T2: making a success
      - SmartCuckoo: directly selecting to enter from T2
      - Success!
      [Figure: tables T1 and T2 (buckets 0 to 7) and the corresponding pseudoforest, before and after inserting x3 via H1(x3) and H2(x3).]

  23. v+0: (d) Two maximal subgraphs
      - Two loops!
      - Execution:
        - Conventional cuckoo hashing: sufficient attempts, then reporting a failure
        - SmartCuckoo: reporting a failure without any kick-out operations
      [Figure: tables T1 and T2 (buckets 0 to 7) and the corresponding pseudoforest for inserting x4 via H1(x4) and H2(x4); marked "Failure!".]

  24. v+0: (e) One maximal subgraph
      - One loop!
      [Figure: tables T1 and T2 (buckets 0 to 7) and the corresponding pseudoforest for inserting x5 via H1(x5) and H2(x5); marked "Failure!".]

  25. Case: v+1
      - A new vertex after the item's insertion
      - Success!
      [Figure: tables T1 and T2 (buckets 0 to 7) and the corresponding pseudoforest, before and after inserting x6 via H1(x6) and H2(x6).]

  26. Case: v+2
      - Two new vertices after the insertion
      - Success!
      [Figure: tables T1 and T2 (buckets 0 to 7) and the corresponding pseudoforest, before and after inserting x7 via H1(x7) and H2(x7).]

  27. Evaluation methodology
      - Comparisons:
        - Baseline (cuckoo hashing with a stash @ SIAM Journal on Computing '09)
        - libcuckoo @ EuroSys'14
        - BCHT (bucketized cuckoo hash table)
      - Traces:
        - RandomInteger: random integer generator @ TOMACS'98
        - MacOS: http://tracer.filesystems.org
        - DocWords: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
        - YCSB: https://github.com/brianfrankcooper/YCSB @ SOCC'11
      - Metrics (in millions of operations per second):
        - Insertion throughput
        - Lookup throughput: positive/negative
        - Throughput of workload with mixed queries (YCSB)

  28. Insertion throughput
      [Figure: millions of insertions per second versus load factor (0.1 to 0.9) for Baseline, libcuckoo, BCHT, and SmartCuckoo, annotated with 0.5× and 5×.]
      - SmartCuckoo significantly increases insertion throughput.
      - 0.5× to 5× speedups compared to Baseline.

  29. Lookup throughput
      [Figure: millions of lookups per second for Baseline, libcuckoo, BCHT, and SmartCuckoo, with 100% and 0% of existent keys in the lookup requests.]
      - 0%: all candidate positions for a key have to be accessed.
      - Almost the same lookup throughput as Baseline.
      - Significantly higher than libcuckoo and BCHT.

  30. Throughput of workload with mixed queries

      | Workload | Insert (%) | Lookup (%) | Update (%) |
      |----------|------------|------------|------------|
      | YCSB-1   | 100        | 0          | 0          |
      | YCSB-2   | 75         | 25         | 0          |
      | YCSB-3   | 50         | 50         | 0          |
      | YCSB-4   | 25         | 75         | 0          |
      | YCSB-5   | 0          | 95         | 5          |

      [Figure: millions of operations per second for Baseline, libcuckoo, BCHT, and SmartCuckoo on workloads YCSB-1 to YCSB-5.]
      - As the percentage of insertions decreases, throughput increases for all schemes.
      - In each workload, SmartCuckoo produces higher throughput than the other three schemes.
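
The operation mixes listed on this slide can be turned into a reproducible request sequence for a benchmark driver. The sketch below only illustrates the percentages from the slide; the names `WORKLOADS` and `build_ops` are hypothetical, and this is not how the YCSB tool itself generates requests.

```python
import random
from typing import Dict, List

# Operation mixes in percent, copied from the workload table on the slide.
WORKLOADS: Dict[str, Dict[str, int]] = {
    "YCSB-1": {"insert": 100, "lookup": 0, "update": 0},
    "YCSB-2": {"insert": 75, "lookup": 25, "update": 0},
    "YCSB-3": {"insert": 50, "lookup": 50, "update": 0},
    "YCSB-4": {"insert": 25, "lookup": 75, "update": 0},
    "YCSB-5": {"insert": 0, "lookup": 95, "update": 5},
}


def build_ops(workload: str, total: int, seed: int = 0) -> List[str]:
    """Return a shuffled list of operation names with the given mix.

    Counts are rounded down, so the list can be shorter than `total`
    when `total` is not a multiple of 100.
    """
    ops: List[str] = []
    for op, percent in WORKLOADS[workload].items():
        ops.extend([op] * (total * percent // 100))
    random.Random(seed).shuffle(ops)  # fixed seed keeps runs repeatable
    return ops
```

Throughput in millions of operations per second would then be the number of operations executed divided by the elapsed seconds, divided by one million.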

  31. Conclusion and future work
      - Cuckoo hashing offers O(1) query performance cost-efficiently.
      - We address the problem of potential endless loops in item insertion.
      - SmartCuckoo helps improve predictable performance in storage systems.
      - To-do list:
        - SmartCuckoo in hash tables with more than two hash functions
        - The use of multiple slots in each bucket

  32. Thanks and questions?
      Open-source code: https://github.com/syy804123097/SmartCuckoo
