Leveraging Caches to Accelerate Hash Tables and Memoization

Guowei Zhang and Daniel Sanchez, MICRO 2019


  1. Leveraging Caches to Accelerate Hash Tables and Memoization. Guowei Zhang and Daniel Sanchez, MICRO 2019

  2. Executive summary
     - Hash tables suffer from poor core utilization and poor spatial locality
     - HTA accelerates hash tables with simple ISA and hardware changes
       - Adopts the HTA table format, which leverages cache characteristics
       - Leaves rare cases to software
     - HTA accelerates hash-table-intensive applications by up to 2x
     - HTA-based memoization improves performance significantly
     [Figure: cores with L1I/L1D and L2 caches sharing an LLC. Flat-HTA reduces runtime overheads; Hierarchical-HTA improves spatial locality.]

  3. Hash table performance is critical
     - Hash table operations:
           found, value = hashtable.lookup(key);
           hashtable.insert(key, value);
           …
           hashtable.delete(key);
     - Used in databases, key-value stores, networking, genomics, and memoization
     - Hash table performance is critical for memoization
       - Memoization uses hash tables to skip repetitive computation
       - It is beneficial only if hash table lookups are cheaper than the memoized code
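The break-even condition in the last bullet can be made concrete with a simple cost model. This is an illustrative sketch, not taken from the slides: the symbols $t_f$ (cost of the memoized code), $t_{lookup}$, $t_{update}$, and the hit rate $h$ are assumptions introduced here.

```latex
% Expected cost per call with memoization (a lookup always happens;
% on a miss the code runs and the table is updated):
\[
  t_{memo} = t_{lookup} + (1 - h)\,\bigl(t_f + t_{update}\bigr)
\]
% Memoization pays off when t_{memo} < t_f, which rearranges to:
\[
  h\, t_f > t_{lookup} + (1 - h)\, t_{update}
\]
```

For example, with made-up numbers $t_f = 100$ cycles, $t_{update} = 20$ cycles, and $h = 0.5$, a 20-cycle lookup satisfies the condition ($50 > 30$) while a 60-cycle lookup does not ($50 < 70$). Cheaper lookups therefore widen the set of code regions worth memoizing.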

  4. Issue 1: Poor core utilization
     - Data-dependent branches
       - High misprediction rate
       - High penalty
     - Poor use of core backend
       - Frequent misses
       - Hard to overlap due to too many µops
     - Flat-HTA reduces runtime overheads!
     [Figure: normalized cycles for Baseline and Flat-HTA, broken down into backend stalls, wrong path execution, and other]

  7. Issue 2: Poor spatial locality
     - Conventional system: wastes cache capacity
     - Hierarchical-HTA: improves spatial locality!
     [Figure: the LLC stores (k0, v0) and (k1, v1) in line 0, (k2, v2) in line 16, and (k3, v3) in line 32. A conventional L2 caches the same three lines; a Hierarchical-HTA L2 holds all four pairs in one line.]

  8. Prior hardware acceleration underused caches
     - Domain-specific management [Costa 2000, Choi 2008, Chalamalasetti 2013, Lim 2013, Gope 2017…]
       - E.g., PHP processing, distributed key-value store, memoization
       - Requires dedicated on-chip storage (e.g., 98KB [Costa et al 2000])
       - Or bypasses memory hierarchy [Lloyd 2017, Tanaka 2014, Xu 2016…]
     - HTA is general
     - HTA avoids dedicated on-chip storage
     - HTA exploits memory hierarchy for spatial locality

  9. HTA: Hash Table Acceleration

  10. HTA overview
      - Make the common case fast!
      - (1) Table format
      - (2) ISA extensions, accelerated by the HTA function unit
      - (3) Hardware implementation
        - Flat-HTA reduces runtime overheads
        - Hierarchical-HTA improves spatial locality
      [Figure: core pipeline (Fetch, Decode, Issue, Execute, Mem, Commit) with an HTA function unit for address calculation and line comparison, next to the Flat-HTA and Hierarchical-HTA cache hierarchies]

  11. HTA table format
      - Conventional table
        - Variable number of probes
        - Introduces hard-to-predict branches, e.g.: while (key != curSlot.key) { /* probe next slot */ }
      - HTA table
        - Small, fixed number of probes
        - Minimizes work
        - Avoids hard-to-predict branches
        - Overflows are handled by software path
        - Enables hardware acceleration
      [Figure: a 128-bit key held in Reg0 and Reg1 is hashed (H) to M bits to select one of 2^M cache lines in memory. Each line holds Key 0 (128b), Key 1 (128b), Value 0 (64b), Value 1 (64b), and 128b unused.]
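To make the line layout above concrete, here is a minimal Python sketch that packs two key-value pairs into one 64-byte line and probes it with a fixed number of comparisons. The field order follows the slide (two 128-bit keys, two 64-bit values, 128 unused bits). The names `pack_line`, `probe_line`, `fold_hash`, and `line_address` are hypothetical, the XOR-fold hash is only a placeholder for the unspecified function H, and the encoding of empty slots is not modeled.

```python
import struct

LINE_BYTES = 64
# Key 0, Key 1 (16 bytes each), Value 0, Value 1 (8 bytes each), 16 unused bytes.
LINE_LAYOUT = struct.Struct("<16s16sQQ16x")
assert LINE_LAYOUT.size == LINE_BYTES


def pack_line(key0: int, key1: int, value0: int, value1: int) -> bytes:
    """Build one 64-byte line from two 128-bit keys and two 64-bit values."""
    return LINE_LAYOUT.pack(
        key0.to_bytes(16, "little"), key1.to_bytes(16, "little"), value0, value1
    )


def probe_line(line: bytes, key: int):
    """Compare the key against both slots; no data-dependent probing loop."""
    k0, k1, v0, v1 = LINE_LAYOUT.unpack(line)
    key_bytes = key.to_bytes(16, "little")
    if key_bytes == k0:
        return v0
    if key_bytes == k1:
        return v1
    return None  # not in this line; a real table would consult the software path


def fold_hash(key: int, m_bits: int) -> int:
    """Placeholder hash (an assumption): XOR-fold 128 bits, keep the low m_bits."""
    folded = (key ^ (key >> 64)) & ((1 << 64) - 1)
    return folded & ((1 << m_bits) - 1)


def line_address(base: int, key: int, m_bits: int) -> int:
    """Address of the single line to fetch in a table of 2**m_bits lines."""
    return base + fold_hash(key, m_bits) * LINE_BYTES
```

Because every lookup touches exactly one line and performs at most two comparisons, the work per lookup is bounded, which is the property the slide contrasts with the open-ended probing loop of a conventional table.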

  12. HTA ISA extensions
      - Single-threaded lookup:
            lookup: hta_lookup <table_id>, <key_reg>, <value_reg>, done
                    call swLookup          # Accesses software hash table
            done:   …
      - Branch semantics (easy to predict; exploits existing predictors):
            if (key is found) or (line is not full):
                taken          # done
            else:
                not taken      # call swLookup
      - Single-threaded insert:
            insert: hta_swap <table_id>, <key_reg>, <value_reg>, done
                    call swHandleInsert    # Accesses software hash table
            done:   …
      - Multi-threaded insert:
            insert:  hta_update <table_id>, <key_reg>, <value_reg>, done
                     call swLockLine
                     hta_swap <table_id>, <key_reg>, <value_reg>, release
                     call swHandleInsert
            release: call swUnlockLine
            done:    …
      - We prototype a CISC (x86) implementation; RISC is also possible
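The branch semantics of `hta_lookup` can be mimicked in software to see why the branch is easy to predict: it is taken both on a hit and on a definite miss, and falls through only when the line is full and holds no match. The following Python sketch is a behavioral model only. `hta_lookup_model`, `lookup`, and `LINE_CAPACITY` are hypothetical names, a plain dict stands in for `swLookup`, and returning `None` for an absent key is an assumption (the slide does not say how absence is reported).

```python
from typing import Dict, List, Optional, Tuple

Line = List[Tuple[int, int]]  # the (key, value) pairs currently stored in one line
LINE_CAPACITY = 2             # assumed number of pairs per line


def hta_lookup_model(line: Line, key: int) -> Tuple[bool, Optional[int]]:
    """Return (branch_taken, value) following the slide's branch semantics."""
    for stored_key, value in line:
        if stored_key == key:
            return True, value      # key found: taken, jump to done
    if len(line) < LINE_CAPACITY:
        return True, None           # line not full: taken, key cannot have overflowed
    return False, None              # full line without a match: fall through


def lookup(line: Line, sw_table: Dict[int, int], key: int) -> Optional[int]:
    """Mirror the lookup listing: fast path first, software path only if not taken."""
    taken, value = hta_lookup_model(line, key)
    if not taken:
        value = sw_table.get(key)   # stands in for `call swLookup`
    return value
```

In this model the fall-through path runs only for keys that map to a full line, so a table with few overflows sees an almost always taken branch.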

  13. Flat-HTA implementation
      - Adds an HTA function unit to the core pipeline (Fetch, Decode, Issue, Execute, Mem, Commit)
        - Address calculation: key → lineAddr
        - Line comparison: lineValue → outcome
      - Area: 0.055% of core

  14. Hierarchical-HTA overview
      [Figure: LLC lines 0 to 15 and L2 lines 0 to 3; the legend distinguishes frequently-accessed pairs, infrequently-accessed pairs, empty slots, and cache lines]

  17. Check out paper for more
      - Hierarchical-HTA implementation
        - Maintains coherence conservatively
        - Handles overflows conservatively
      - Details on ISA and Flat-HTA implementation

  18. Methodology
      - Schemes
        - Baseline: best of Google dense_hash_map and C++11 unordered_map
        - HTA-SW: with HTA table format, without HTA function unit
        - Flat-HTA
        - Hierarchical-HTA
      - Simulation with zsim
      - System
        - 1 to 16 cores
        - 2MB LLC per core
      - Applications
        - bfcounter (bioinformatics)
        - lzw (data compression)
        - hashjoin (database)
        - ycsb-read (key-value store)
        - ycsb-write (key-value store)
      [Figure: cores with L1I/L1D and L2 caches and a shared LLC]

  19. Flat-HTA speedups
      [Figure: speedup of Baseline, HTA-SW (software-only), and Flat-HTA on bfcounter, lzw, hashjoin, ycsb-read, and ycsb-write]

  22. Flat-HTA cycles breakdown
      [Figure: normalized cycles split into backend stall, wrong path execution, and others for B (Baseline), S (HTA-SW, software-only), and F (Flat-HTA) on bfcounter, lzw, hashjoin, ycsb-read, and ycsb-write]

  23. Flat-HTA on multithreaded applications
      [Figure: speedup of Baseline and Flat-HTA from 1 to 16 cores on ycsb-read and ycsb-write]

  24. HTA on memoization
      - Example:
            memo_exp: hta_lookup <table id>, <key reg>, <value reg>, done
                      call exp
                      hta_swap <table id>, <key reg>, <value reg>, done
            done:     …
      - Schemes
        - Baseline (no memoization)
        - Software memoization
        - HTA memoization
      - Applications selected from
        - SPECCPU2006
        - SPECOMP2001
        - SPECOMP2012
        - PARSEC
        - SPLASH2
        - BioParallel
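The control flow of the `memo_exp` listing (look up, call `exp` on a miss, then store the result) can be sketched in Python as follows. This is an illustrative software model, not the hardware mechanism. `MemoTable`, its sizes, and the `stats` dict are hypothetical, and the sketch assumes that a pair displaced by the store is simply discarded, which seems consistent with both outcomes of `hta_swap` reaching `done` in the listing.

```python
import math
from typing import Dict, List, Optional, Tuple


class MemoTable:
    """Small lossy table: fixed-size lines, no software overflow path."""

    def __init__(self, num_lines: int = 256, slots_per_line: int = 2) -> None:
        self.lines: List[List[Tuple[float, float]]] = [[] for _ in range(num_lines)]
        self.slots_per_line = slots_per_line

    def _line(self, key: float) -> List[Tuple[float, float]]:
        return self.lines[hash(key) % len(self.lines)]

    def lookup(self, key: float) -> Optional[float]:
        for stored_key, value in self._line(key):
            if stored_key == key:
                return value
        return None

    def swap(self, key: float, value: float) -> None:
        line = self._line(key)
        if len(line) >= self.slots_per_line:
            line.pop(0)  # assumption: the displaced pair is dropped
        line.append((key, value))


def memo_exp(table: MemoTable, x: float, stats: Dict[str, int]) -> float:
    cached = table.lookup(x)          # plays the role of hta_lookup
    if cached is not None:
        stats["hits"] += 1
        return cached                 # hit: skip the call to exp
    result = math.exp(x)              # miss: run the memoized code
    table.swap(x, result)             # plays the role of hta_swap
    return result
```

Dropping displaced entries is safe for memoization because a lost entry only costs a recomputation, never a wrong result.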

  25. Flat-HTA speedups on memoization
      [Figure: speedup of Baseline, Software Memoization, and HTA Memoization on bschols, semphy, bwaves, equake, nab, and water]

  27. Conclusion
      - HTA accelerates hash tables and memoization
        - Adopts a new hash table format
        - Accelerates common cases in HW; leaves rare cases to SW
      - Flat-HTA reduces runtime overheads significantly
        - Requires minor (0.055% area) changes to cores
      - Hierarchical-HTA improves spatial locality
        - Needs changes to cores and cache controllers
      - HTA improves hash-table-intensive applications by up to 2x
      - HTA enables memoization of small code regions

  28. Thanks for your attention! Questions are welcome!
