Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling
Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, Daniel Sanchez
MICRO 2018
The locality problem of graph processing
• Irregular structure of graphs causes seemingly random memory references
• On-chip caches are too small to fit most real-world graphs
• Software frameworks improve locality through offline preprocessing
• Preprocessing is expensive and often impractical
[Figure: main memory accesses and execution time for one PageRank iteration on the UK web graph, Baseline vs. GOrder; GOrder's preprocessing overhead far exceeds its per-iteration savings]
Improving locality in an online fashion
• The traversal schedule decides the order in which graph edges are processed
• Many real-world graphs have strong community structure
• Traversals that follow community structure have good locality
• Performing this in software without preprocessing is not practical due to scheduling overheads
Contributions
• BDFS: Bounded Depth-First Scheduling
  - Performs a series of bounded depth-first explorations
  - Improves locality for graphs with good community structure
• HATS: Hardware-Accelerated Traversal Scheduling
  - A simple unit specialized for traversal scheduling
  - Cheap and implementable in reconfigurable logic
[Figure: PageRank Delta on the UK web graph — BDFS reduces main memory accesses by 1.8x, and BDFS-HATS delivers a 2.7x speedup over the baseline]
Agenda
• Background
• BDFS
• HATS
• Evaluation
Graph data structures
Compressed Sparse Row (CSR) format stores the graph as three arrays:
• Offset array: 0 2 4 7 9
• Neighbor array: 1 2 0 3 0 1 3 0 2
• Algorithm-specific vertex data: 0.8 7.9 3.6 1.2
(Example graph with vertices 0-3; vertex v's neighbors are stored at positions offsets[v] through offsets[v+1]-1 of the neighbor array.)
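A minimal C++ sketch of this CSR layout, populated with the slide's example arrays. The names (CSRGraph, offsets, neighbors, vertexData) are illustrative, not the ones used by the paper's framework.

```cpp
#include <cstdint>
#include <vector>

// Minimal CSR graph representation, mirroring the slide's example.
struct CSRGraph {
    std::vector<uint32_t> offsets;    // offsets[v]..offsets[v+1]-1 index v's edges
    std::vector<uint32_t> neighbors;  // destination vertex of each edge
    std::vector<double>   vertexData; // algorithm-specific per-vertex data

    uint32_t numVertices() const { return offsets.size() - 1; }
};

// The 4-vertex example from the slide.
CSRGraph exampleGraph() {
    return CSRGraph{
        {0, 2, 4, 7, 9},              // offset array
        {1, 2, 0, 3, 0, 1, 3, 0, 2},  // neighbor array
        {0.8, 7.9, 3.6, 1.2},         // vertex data
    };
}
```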
Vertex-ordered (VO) schedule follows layout order
• Simplifies scheduling and parallelism
• Poor locality for vertex data accesses (sketch below)
[Figure: access pattern of PageRank on the UK web graph under VO — the offset and neighbor arrays are streamed with full spatial locality but no temporal locality, while vertex data accesses have low spatial and low temporal locality, so they dominate main memory accesses]
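A sketch of the VO schedule over the CSRGraph type above: edges are processed in the order they appear in the layout, so the offset and neighbor arrays are scanned sequentially while vertex data is accessed at effectively random indices. processEdge stands in for the algorithm-specific edge function (e.g., PageRank's rank accumulation) and is an assumed name.

```cpp
template <typename EdgeFn>
void traverseVO(const CSRGraph& g, EdgeFn processEdge) {
    for (uint32_t src = 0; src < g.numVertices(); src++) {
        // Sequential scan of offsets/neighbors: full spatial, no temporal locality.
        for (uint32_t e = g.offsets[src]; e < g.offsets[src + 1]; e++) {
            uint32_t dst = g.neighbors[e];
            // processEdge touches vertexData[dst]: a near-random access on large graphs.
            processEdge(src, dst);
        }
    }
}
```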
Agenda
• Background
• BDFS
• HATS
• Evaluation
BDFS: Bounded Depth-First Scheduling
• Vertex data accesses have high potential temporal locality
• Following community structure helps harness this locality
• BDFS performs a series of bounded depth-first explorations (sketched below)
• Traversal starts at the vertex with id 0
• Processes all edges of the first community before moving to the second
• Divide-and-conquer nature of BDFS:
  - Small depth bounds capture most locality
  - Good locality at all cache levels
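A simplified software sketch of BDFS based on the slide's description, reusing the CSRGraph type above: scan vertices in layout order and, from each not-yet-visited vertex, run a depth-first exploration bounded by maxDepth, processing all of a vertex's edges when it is first visited. Bookkeeping details and parallelism differ from the paper's implementation; maxDepth, visited, and processEdge are illustrative names.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

template <typename EdgeFn>
void traverseBDFS(const CSRGraph& g, uint32_t maxDepth, EdgeFn processEdge) {
    std::vector<bool> visited(g.numVertices(), false);

    // Bounded exploration: process all of v's edges, then descend into
    // unvisited neighbors while the depth bound allows.
    std::function<void(uint32_t, uint32_t)> explore = [&](uint32_t v, uint32_t depth) {
        visited[v] = true;
        for (uint32_t e = g.offsets[v]; e < g.offsets[v + 1]; e++) {
            uint32_t dst = g.neighbors[e];
            processEdge(v, dst);  // vertexData[dst] is likely still cached within a community
            if (depth < maxDepth && !visited[dst])
                explore(dst, depth + 1);
        }
    };

    // Outer scan in layout order; the first exploration starts at vertex 0.
    for (uint32_t v = 0; v < g.numVertices(); v++)
        if (!visited[v]) explore(v, 0);
}
```

Each vertex's edges are processed exactly once, when the vertex is first reached, either by a bounded exploration or by the outer scan.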
BDFS reduces total main memory accesses
[Figure: access patterns and main memory accesses for PageRank on the UK web graph. Under VO, neighbor-array accesses have high spatial locality but no temporal locality, and vertex-data accesses have low spatial and low temporal locality. Under BDFS, neighbor-array accesses have somewhat lower spatial locality, but vertex-data accesses gain high temporal locality, reducing total main memory accesses versus VO.]
BDFS in software does not improve performance
• Scheduling overheads negate the benefits of better locality
• Higher instruction count
• Limited ILP and MLP:
  - Interleaved execution of traversal scheduling and edge processing
  - Unpredictable data-dependent branches
[Figure: execution time, main memory accesses, and instruction count for VO vs. software BDFS — BDFS reduces memory accesses but greatly increases instruction count and does not improve execution time]
Agenda
• Background
• BDFS
• HATS
• Evaluation
HATS: Hardware Accelerated Traversal Scheduling
[Diagram: a HATS unit sits next to each core and its private L1/L2 caches, below the shared L3 and main memory]
• Decouples traversal scheduling from edge-processing logic
• Small hardware unit near each core performs traversal scheduling
• General-purpose core runs algorithm-specific edge-processing logic
• HATS is decoupled from the core and runs ahead of it
HATS operation and design
[Diagram: HATS sits beside the core and its L1/L2. VO-HATS fetches offsets, scans neighbors, and prefetches vertex-data accesses. BDFS-HATS adds configuration registers, an explicit stack, and an exploration FSM, and likewise fetches offsets, scans neighbors, and prefetches vertex data. Scheduled edges are handed to the core through a FIFO edge buffer.]
HATS costs
• Adds only one new instruction
  - Fetches an edge from the FIFO buffer into core registers
• Very cheap and energy-efficient compared to a general-purpose core
  - RTL synthesis with a 65nm process at a 1GHz target frequency

  ASIC: area 0.4% of a core, TDP 0.2% of a core
  FPGA: area 3200 LUTs
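A hedged sketch of what the core-side edge-processing loop might look like with HATS. Here hats_fetch_edge() is a hypothetical wrapper for the single new instruction that pops a scheduled edge from the HATS FIFO into core registers; the actual ISA encoding and framework integration are described in the paper.

```cpp
#include <cstdint>

struct Edge { uint32_t src, dst; bool valid; };

// Assumed intrinsic: blocks until HATS produces an edge, or returns
// valid == false when the traversal is complete.
Edge hats_fetch_edge();

template <typename EdgeFn>
void processWithHATS(EdgeFn processEdge) {
    // The core never walks the CSR arrays itself: HATS schedules the traversal
    // (VO or BDFS) and prefetches vertex data, while the core only runs the
    // algorithm-specific edge function.
    for (Edge e = hats_fetch_edge(); e.valid; e = hats_fetch_edge())
        processEdge(e.src, e.dst);
}
```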
HATS benefits
• Reduces work for the general-purpose core even with the VO schedule
• Enables sophisticated scheduling like BDFS
• Performs accurate indirect prefetching of vertex data
• Accelerates a wide range of algorithms
• Requires changes only to the graph framework, not to algorithm code
Agenda
• Background
• BDFS
• HATS
• Evaluation
Evaluation methodology
• Event-driven simulation using zsim
• 16-core processor
  - Haswell-like OOO cores
  - 32 MB L3 cache
  - 4 memory controllers
• 5 applications from the Ligra framework
  - PageRank (PR)
  - PageRank Delta (PRD)
  - Connected Components (CC)
  - Radii Estimation (RE)
  - Maximal Independent Set (MIS)
• 5 large real-world graph inputs
  - Millions of vertices
  - Billions of edges
• IMP [Yu, MICRO'15]
  - Indirect Memory Prefetcher
  - Configured with graph data structure information for accurate prefetching
HATS improves performance significantly
• IMP improves performance by hiding latency
• VO-HATS outperforms IMP by offloading traversal scheduling from the general-purpose core
• BDFS-HATS gives further gains by reducing memory accesses
[Figure: speedup over VO (%) for IMP, VO-HATS, and BDFS-HATS on PR, PRD, CC, RE, and MIS across the graph inputs]
HATS reduces both on-chip and off-chip energy
• IMP reduces static energy due to shorter execution time
• VO-HATS reduces core energy due to lower instruction count
• BDFS-HATS reduces memory energy due to better locality
[Figure: normalized on-chip (core + cache) and off-chip (memory) energy for VO, IMP, VO-HATS, and BDFS-HATS on PR, PRD, CC, RE, and MIS]
See paper for more results
• HATS on an on-chip reconfigurable fabric
  - Parallelism enhancements to maintain throughput at the fabric's slower clock frequency
• Sensitivity to the on-chip location of HATS (L1, L2, LLC)
• Adaptive-HATS
  - Avoids performance loss on graphs with no community structure
• HATS versus other locality optimizations
Conclusion
• Graph processing is bottlenecked by main memory accesses
• BDFS exploits community structure to improve cache locality
• HATS accelerates traversal scheduling to make BDFS practical

Thanks for your attention! Questions are welcome!