EXPLOITING LOCALITY IN GRAPH ANALYTICS THROUGH HARDWARE ACCELERATED TRAVERSAL SCHEDULING

  1. EXPLOITING LOCALITY IN GRAPH ANALYTICS THROUGH HARDWARE ACCELERATED TRAVERSAL SCHEDULING
     Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, Daniel Sanchez (MICRO 2018)

  2. The locality problem of graph processing
     - The irregular structure of graphs causes seemingly random memory references
     - On-chip caches are too small to fit most real-world graphs
     - Software frameworks improve locality through offline preprocessing
     - Preprocessing is expensive and often impractical
     [Figure: main memory accesses and execution time for one PageRank iteration on the UK web graph, Baseline vs. GOrder, with GOrder's preprocessing overhead shown separately]

  3. Improving locality in an online fashion
     - The traversal schedule decides the order in which graph edges are processed
     - Many real-world graphs have strong community structure
     - Traversals that follow community structure have good locality
     - Doing this in software without preprocessing is impractical due to scheduling overheads

  4. Contributions
     - BDFS: Bounded Depth-First Scheduling
       - Performs a series of bounded depth-first explorations
       - Improves locality for graphs with good community structure
     - HATS: Hardware Accelerated Traversal Scheduling
       - A simple unit specialized for traversal scheduling
       - Cheap, and implementable in reconfigurable logic
     [Figures: PageRank Delta on the UK web graph, showing a 1.8x reduction in main memory accesses with BDFS and up to a 2.7x speedup with HATS]

  5. Agenda: Background, BDFS, HATS, Evaluation

  6. Graph data structures
     - Graphs are stored in Compressed Sparse Row (CSR) format, plus an array of algorithm-specific vertex data
     - Example graph with 4 vertices and 9 directed edges:
       - Offset array: 0 2 4 7 9
       - Neighbor array: 1 2 0 3 0 1 3 0 2
       - Vertex data: 0.8 7.9 3.6 1.2
     (A CSR sketch in code follows below.)
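     A minimal sketch of this layout in C++, assuming illustrative names (CSRGraph,
     vertexData) rather than anything from the paper's or Ligra's actual code:

         #include <cstdint>
         #include <vector>

         struct CSRGraph {
             std::vector<uint64_t> offsets;    // offsets[v]..offsets[v+1] index into neighbors
             std::vector<uint32_t> neighbors;  // concatenated adjacency lists
             std::vector<double>   vertexData; // algorithm-specific data (e.g., PageRank scores)

             uint32_t numVertices() const {
                 return static_cast<uint32_t>(offsets.size()) - 1;
             }
             uint64_t degree(uint32_t v) const { return offsets[v + 1] - offsets[v]; }
         };

         // The 4-vertex example above: offsets = {0, 2, 4, 7, 9},
         // neighbors = {1, 2, 0, 3, 0, 1, 3, 0, 2}, vertexData = {0.8, 7.9, 3.6, 1.2}.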

  7. The vertex-ordered (VO) schedule follows layout order
     - Simplifies scheduling and parallelism
     - Poor locality for vertex data accesses
     [Figure: PageRank on the UK web graph under VO; offset and neighbor array accesses have full spatial locality but no temporal locality, while vertex data accesses have low spatial and low temporal locality, so main memory traffic stays high]
     (A sketch of the VO traversal loop follows below.)
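     A minimal sketch of the VO schedule, reusing the CSRGraph struct sketched under
     slide 6 (processEdge stands in for algorithm-specific logic such as a PageRank
     update; it is an assumption, not the framework's API):

         // Process every edge in layout order: sequential scans of offsets and
         // neighbors (good spatial locality), but the vertexData[dst] accesses
         // inside processEdge hit near-random locations (poor locality).
         template <typename EdgeFunc>
         void traverseVO(const CSRGraph& g, EdgeFunc processEdge) {
             for (uint32_t src = 0; src < g.numVertices(); ++src) {
                 for (uint64_t e = g.offsets[src]; e < g.offsets[src + 1]; ++e) {
                     processEdge(src, g.neighbors[e]);
                 }
             }
         }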

  8. Agenda: Background, BDFS, HATS, Evaluation

  9. BDFS: Bounded Depth-First Scheduling
     - Vertex data accesses have high potential temporal locality
     - Following community structure helps harness this locality
     - BDFS performs a series of bounded depth-first explorations (sketched below)
       - The traversal starts at the vertex with id 0
       - All edges of the first community are processed before moving to the second
     - Divide-and-conquer nature of BDFS:
       - Small depth bounds capture most of the locality
       - Good locality at all cache levels
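     A minimal sketch of the BDFS idea, again reusing the CSRGraph struct from the
     slide 6 example. The visited bitmap, recursion, and depth-bound handling are
     illustrative assumptions; the paper's hardware version uses an explicit stack
     and FSM instead (slide 14):

         #include <vector>

         // Bounded depth-first exploration: process all edges of v, and recurse
         // into unvisited neighbors while depth budget remains.
         template <typename EdgeFunc>
         void exploreBDFS(const CSRGraph& g, uint32_t v, int depthLeft,
                          std::vector<bool>& visited, EdgeFunc processEdge) {
             visited[v] = true;
             for (uint64_t e = g.offsets[v]; e < g.offsets[v + 1]; ++e) {
                 uint32_t dst = g.neighbors[e];
                 processEdge(v, dst);              // every edge is processed exactly once
                 if (depthLeft > 0 && !visited[dst])
                     exploreBDFS(g, dst, depthLeft - 1, visited, processEdge);
             }
         }

         // Series of bounded explorations: start at vertex 0 and pick the next
         // unvisited vertex in layout order as each exploration finishes.
         template <typename EdgeFunc>
         void traverseBDFS(const CSRGraph& g, int maxDepth, EdgeFunc processEdge) {
             std::vector<bool> visited(g.numVertices(), false);
             for (uint32_t v = 0; v < g.numVertices(); ++v)
                 if (!visited[v])
                     exploreBDFS(g, v, maxDepth, visited, processEdge);
         }

     With a small maxDepth, each exploration touches a community-sized working set
     that fits in the cache hierarchy, which is where the temporal locality comes from.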

  10. BDFS reduces total main memory accesses
      [Figure: access patterns for PageRank on the UK web graph; under VO, neighbor array accesses have high spatial but no temporal locality and vertex data accesses have low spatial and low temporal locality; under BDFS, neighbor array accesses lose some spatial locality but vertex data accesses gain high temporal locality, reducing total main memory accesses versus VO]

  11. BDFS in software does not improve performance
      - Scheduling overheads negate the benefits of better locality
      - Higher instruction count
      - Limited ILP and MLP
        - Interleaved execution of traversal scheduling and edge processing
        - Unpredictable, data-dependent branches
      [Figure: VO vs. software BDFS; BDFS reduces main memory accesses but executes far more instructions, and execution time does not improve]

  12. Agenda: Background, BDFS, HATS, Evaluation

  13. HATS: Hardware Accelerated Traversal Scheduling
      - Decouples traversal scheduling from edge-processing logic
      - A small hardware unit near each core performs traversal scheduling
      - The general-purpose core runs the algorithm-specific edge-processing logic
      - HATS is decoupled from the core and runs ahead of it
      [Figure: chip organization with main memory, a shared L3, and per-core L2/L1 caches; one HATS unit sits beside each core]

  14. HATS operation and design
      - VO-HATS fetches offsets and neighbors in layout order and prefetches the vertex data accesses
      - BDFS-HATS adds configuration registers, a small stack, and an exploration FSM to scan offsets and neighbors in bounded depth-first order, again prefetching vertex data
      - HATS issues its accesses through the cache hierarchy (L1/L2 shown) and hands scheduled edges to the core through a FIFO edge buffer
      (A sketch of the core-side loop follows below.)
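      A minimal sketch of what the core-side loop could look like with HATS; the
      hats_configure and hats_fetch_edge wrappers and applyUpdate are hypothetical
      stand-ins (the real interface is the single new instruction described on the
      next slide, plus framework-level configuration):

          struct Edge { uint32_t src, dst; };

          // Hypothetical wrappers, not a real API: configure HATS with the CSR
          // arrays, then pop scheduled edges from the FIFO edge buffer.
          void hats_configure(const uint64_t* offsets, const uint32_t* neighbors,
                              const double* vertexData, uint32_t numVertices);
          bool hats_fetch_edge(Edge* out);               // false once the traversal is done
          void applyUpdate(uint32_t src, uint32_t dst);  // algorithm-specific logic

          void processGraphWithHATS(const CSRGraph& g) {
              hats_configure(g.offsets.data(), g.neighbors.data(),
                             g.vertexData.data(), g.numVertices());
              Edge e;
              while (hats_fetch_edge(&e)) {
                  // HATS runs ahead of the core, so vertexData[e.dst] is likely
                  // already prefetched into the cache by the time we get here.
                  applyUpdate(e.src, e.dst);
              }
          }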

  15. HATS costs
      - Adds only one new instruction, which fetches an edge from the FIFO buffer into core registers
      - Very cheap and energy-efficient compared to a general-purpose core
        - RTL synthesis with a 65nm process at a 1GHz target frequency
        - ASIC: area 0.4% of a core, TDP 0.2% of a core; FPGA: 3200 LUTs

  16. HATS benefits
      - Reduces work for the general-purpose core even under the VO schedule
      - Enables sophisticated scheduling such as BDFS
      - Performs accurate indirect prefetching of vertex data
      - Accelerates a wide range of algorithms
      - Requires changes only to the graph framework, not to algorithm code

  17. Agenda: Background, BDFS, HATS, Evaluation

  18. Evaluation methodology
      - Event-driven simulation using zsim
      - 16-core processor
        - Haswell-like OOO cores
        - 32 MB L3 cache
        - 4 memory controllers
      - 5 applications from the Ligra framework: PageRank (PR), PageRank Delta (PRD), Connected Components (CC), Radii Estimation (RE), Maximal Independent Set (MIS)
      - 5 large real-world graph inputs with millions of vertices and billions of edges
      - Comparison against IMP [Yu, MICRO'15], an Indirect Memory Prefetcher configured with graph data structure information for accurate prefetching

  19. HATS improves performance significantly
      - IMP improves performance by hiding latency
      - VO-HATS outperforms IMP by offloading traversal scheduling from the general-purpose core
      - BDFS-HATS gives further gains by reducing memory accesses
      [Figure: speedup over VO (%) of IMP, VO-HATS, and BDFS-HATS on PR, PRD, CC, RE, and MIS]

  20. HATS reduces both on-chip and off-chip energy
      - IMP reduces static energy due to its shorter execution time
      - VO-HATS reduces core energy due to its lower instruction count
      - BDFS-HATS reduces memory energy due to better locality
      [Figure: normalized on-chip (core + cache) and off-chip (memory) energy of VO, IMP, VO-HATS, and BDFS-HATS on PR, PRD, CC, RE, and MIS]

  21. See the paper for more results
      - HATS on an on-chip reconfigurable fabric
        - Parallelism enhancements to maintain throughput at a slower clock frequency
      - Sensitivity to the on-chip location of HATS (L1, L2, LLC)
      - Adaptive-HATS
        - Avoids performance loss on graphs with no community structure
      - HATS versus other locality optimizations

  22. Conclusion
      - Graph processing is bottlenecked by main memory accesses
      - BDFS exploits community structure to improve cache locality
      - HATS accelerates traversal scheduling to make BDFS practical
      Thanks for your attention! Questions are welcome!
