Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling
Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, Daniel Sanchez
MICRO 2018
The locality problem of graph processing
• Irregular structure of graphs causes seemingly random memory references
• On-chip caches are too small to fit most real-world graphs
• Software frameworks improve locality through offline preprocessing
• Preprocessing is expensive and often impractical
[Figure: main memory accesses and execution time for one PageRank iteration on the UK web graph, Baseline vs. GOrder; GOrder's preprocessing overhead far exceeds its per-iteration savings]
Improving locality in an online fashion
• The traversal schedule decides the order in which graph edges are processed
• Many real-world graphs have strong community structure
• Traversals that follow community structure have good locality
• Performing this in software without preprocessing is not practical due to scheduling overheads
Contributions
• BDFS: Bounded Depth-First Scheduling
  - Performs a series of bounded depth-first explorations
  - Improves locality for graphs with good community structure
• HATS: Hardware-Accelerated Traversal Scheduling
  - A simple unit specialized for traversal scheduling
  - Cheap and implementable in reconfigurable logic
[Figure: PageRank Delta on the UK web graph — BDFS reduces main memory accesses by 1.8x, and BDFS-HATS delivers a 2.7x speedup over the baseline]
Agenda
• Background
• BDFS
• HATS
• Evaluation
Graph data structures
Compressed Sparse Row (CSR) format stores the graph as three arrays:
• Offset array: 0 2 4 7 9
• Neighbor array: 1 2 0 3 0 1 3 0 2
• Algorithm-specific vertex data: 0.8 7.9 3.6 1.2
(Example graph with vertices 0-3; vertex v's neighbors are stored at positions offsets[v] through offsets[v+1]-1 of the neighbor array.)
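A minimal C++ sketch of this CSR layout, populated with the slide's example arrays. The names (CSRGraph, offsets, neighbors, vertexData) are illustrative, not the ones used by the paper's framework.

```cpp
#include <cstdint>
#include <vector>

// Minimal CSR graph representation, mirroring the slide's example.
struct CSRGraph {
    std::vector<uint32_t> offsets;    // offsets[v]..offsets[v+1]-1 index v's edges
    std::vector<uint32_t> neighbors;  // destination vertex of each edge
    std::vector<double>   vertexData; // algorithm-specific per-vertex data

    uint32_t numVertices() const { return offsets.size() - 1; }
};

// The 4-vertex example from the slide.
CSRGraph exampleGraph() {
    return CSRGraph{
        {0, 2, 4, 7, 9},              // offset array
        {1, 2, 0, 3, 0, 1, 3, 0, 2},  // neighbor array
        {0.8, 7.9, 3.6, 1.2},         // vertex data
    };
}
```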
Vertex-ordered (VO) schedule follows layout order
• Simplifies scheduling and parallelism
• Poor locality for vertex data accesses (sketch below)
[Figure: access pattern of PageRank on the UK web graph under VO — the offset and neighbor arrays are streamed with full spatial locality but no temporal locality, while vertex data accesses have low spatial and low temporal locality, so they dominate main memory accesses]
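A sketch of the VO schedule over the CSRGraph type above: edges are processed in the order they appear in the layout, so the offset and neighbor arrays are scanned sequentially while vertex data is accessed at effectively random indices. processEdge stands in for the algorithm-specific edge function (e.g., PageRank's rank accumulation) and is an assumed name.

```cpp
template <typename EdgeFn>
void traverseVO(const CSRGraph& g, EdgeFn processEdge) {
    for (uint32_t src = 0; src < g.numVertices(); src++) {
        // Sequential scan of offsets/neighbors: full spatial, no temporal locality.
        for (uint32_t e = g.offsets[src]; e < g.offsets[src + 1]; e++) {
            uint32_t dst = g.neighbors[e];
            // processEdge touches vertexData[dst]: a near-random access on large graphs.
            processEdge(src, dst);
        }
    }
}
```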
Agenda
• Background
• BDFS
• HATS
• Evaluation
BDFS: Bounded Depth-First Scheduling
• Vertex data accesses have high potential temporal locality
• Following community structure helps harness this locality
• BDFS performs a series of bounded depth-first explorations (sketched below)
• Traversal starts at the vertex with id 0
• Processes all edges of the first community before moving to the second
• Divide-and-conquer nature of BDFS:
  - Small depth bounds capture most locality
  - Good locality at all cache levels
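A simplified software sketch of BDFS based on the slide's description, reusing the CSRGraph type above: scan vertices in layout order and, from each not-yet-visited vertex, run a depth-first exploration bounded by maxDepth, processing all of a vertex's edges when it is first visited. Bookkeeping details and parallelism differ from the paper's implementation; maxDepth, visited, and processEdge are illustrative names.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

template <typename EdgeFn>
void traverseBDFS(const CSRGraph& g, uint32_t maxDepth, EdgeFn processEdge) {
    std::vector<bool> visited(g.numVertices(), false);

    // Bounded exploration: process all of v's edges, then descend into
    // unvisited neighbors while the depth bound allows.
    std::function<void(uint32_t, uint32_t)> explore = [&](uint32_t v, uint32_t depth) {
        visited[v] = true;
        for (uint32_t e = g.offsets[v]; e < g.offsets[v + 1]; e++) {
            uint32_t dst = g.neighbors[e];
            processEdge(v, dst);  // vertexData[dst] is likely still cached within a community
            if (depth < maxDepth && !visited[dst])
                explore(dst, depth + 1);
        }
    };

    // Outer scan in layout order; the first exploration starts at vertex 0.
    for (uint32_t v = 0; v < g.numVertices(); v++)
        if (!visited[v]) explore(v, 0);
}
```

Each vertex's edges are processed exactly once, when the vertex is first reached, either by a bounded exploration or by the outer scan.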
BDFS reduces total main memory accesses
[Figure: access patterns and main memory accesses for PageRank on the UK web graph. Under VO, neighbor-array accesses have high spatial locality but no temporal locality, and vertex-data accesses have low spatial and low temporal locality. Under BDFS, neighbor-array accesses have somewhat lower spatial locality, but vertex-data accesses gain high temporal locality, reducing total main memory accesses versus VO.]
BDFS in software does not improve performance
• Scheduling overheads negate the benefits of better locality
• Higher instruction count
• Limited ILP and MLP:
  - Interleaved execution of traversal scheduling and edge processing
  - Unpredictable data-dependent branches
[Figure: execution time, main memory accesses, and instruction count for VO vs. software BDFS — BDFS reduces memory accesses but greatly increases instruction count and does not improve execution time]
Agenda
• Background
• BDFS
• HATS
• Evaluation
HATS: Hardware Accelerated Traversal Scheduling
[Diagram: a HATS unit sits next to each core and its private L1/L2 caches, below the shared L3 and main memory]
• Decouples traversal scheduling from edge-processing logic
• Small hardware unit near each core performs traversal scheduling
• General-purpose core runs algorithm-specific edge-processing logic
• HATS is decoupled from the core and runs ahead of it
HATS operation and design
[Diagram: HATS sits beside the core and its L1/L2. VO-HATS fetches offsets, scans neighbors, and prefetches vertex-data accesses. BDFS-HATS adds configuration registers, an explicit stack, and an exploration FSM, and likewise fetches offsets, scans neighbors, and prefetches vertex data. Scheduled edges are handed to the core through a FIFO edge buffer.]
HATS costs
• Adds only one new instruction
  - Fetches an edge from the FIFO buffer into core registers
• Very cheap and energy-efficient compared to a general-purpose core
  - RTL synthesis with a 65nm process at a 1GHz target frequency

  ASIC: area 0.4% of a core, TDP 0.2% of a core
  FPGA: area 3200 LUTs
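A hedged sketch of what the core-side edge-processing loop might look like with HATS. Here hats_fetch_edge() is a hypothetical wrapper for the single new instruction that pops a scheduled edge from the HATS FIFO into core registers; the actual ISA encoding and framework integration are described in the paper.

```cpp
#include <cstdint>

struct Edge { uint32_t src, dst; bool valid; };

// Assumed intrinsic: blocks until HATS produces an edge, or returns
// valid == false when the traversal is complete.
Edge hats_fetch_edge();

template <typename EdgeFn>
void processWithHATS(EdgeFn processEdge) {
    // The core never walks the CSR arrays itself: HATS schedules the traversal
    // (VO or BDFS) and prefetches vertex data, while the core only runs the
    // algorithm-specific edge function.
    for (Edge e = hats_fetch_edge(); e.valid; e = hats_fetch_edge())
        processEdge(e.src, e.dst);
}
```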
HATS benefits
• Reduces work for the general-purpose core even with the VO schedule
• Enables sophisticated scheduling like BDFS
• Performs accurate indirect prefetching of vertex data
• Accelerates a wide range of algorithms
• Requires changes only to the graph framework, not to algorithm code
Agenda
• Background
• BDFS
• HATS
• Evaluation
Evaluation methodology
• Event-driven simulation using zsim
• 16-core processor
  - Haswell-like OOO cores
  - 32 MB L3 cache
  - 4 memory controllers
• 5 applications from the Ligra framework
  - PageRank (PR)
  - PageRank Delta (PRD)
  - Connected Components (CC)
  - Radii Estimation (RE)
  - Maximal Independent Set (MIS)
• 5 large real-world graph inputs
  - Millions of vertices
  - Billions of edges
• IMP [Yu, MICRO'15]
  - Indirect Memory Prefetcher
  - Configured with graph data structure information for accurate prefetching
HATS improves performance significantly
• IMP improves performance by hiding latency
• VO-HATS outperforms IMP by offloading traversal scheduling from the general-purpose core
• BDFS-HATS gives further gains by reducing memory accesses
[Figure: speedup over VO (%) for IMP, VO-HATS, and BDFS-HATS on PR, PRD, CC, RE, and MIS across the graph inputs]
HATS reduces both on-chip and off-chip energy
• IMP reduces static energy due to shorter execution time
• VO-HATS reduces core energy due to lower instruction count
• BDFS-HATS reduces memory energy due to better locality
[Figure: normalized on-chip (core + cache) and off-chip (memory) energy for VO, IMP, VO-HATS, and BDFS-HATS on PR, PRD, CC, RE, and MIS]
See paper for more results
• HATS on an on-chip reconfigurable fabric
  - Parallelism enhancements to maintain throughput at the fabric's slower clock frequency
• Sensitivity to the on-chip location of HATS (L1, L2, LLC)
• Adaptive-HATS
  - Avoids performance loss on graphs with no community structure
• HATS versus other locality optimizations
Conclusion
• Graph processing is bottlenecked by main memory accesses
• BDFS exploits community structure to improve cache locality
• HATS accelerates traversal scheduling to make BDFS practical

Thanks for your attention! Questions are welcome!