  1. A Portable, High-Level Graph Analytics Framework Targeting Distributed, Heterogeneous Systems
     Robert Searles*, Stephen Herbein*, and Sunita Chandrasekaran
     November 14, 2016

  2. Motivation
  ◮ HPC and Big Data communities are converging
  ◮ Heterogeneous and distributed systems are becoming increasingly common
  ◮ Distributing data and leveraging specialized hardware (e.g. accelerators) is critical
  ◮ Graph analytics are important to both communities

  3. Goal
  ◮ Develop a portable, high-level framework for programming current and future HPC systems that:
    ◮ Distributes data automatically
    ◮ Utilizes heterogeneous hardware
  ◮ Accelerate two real-world graph analytics applications
  ◮ Demonstrate portability by running on a variety of hardware, including multi-core Intel CPUs, NVIDIA GPUs, and AMD GPUs

  4. Our Framework: Spark + X
  ◮ Utilize the MapReduce framework, Spark, to handle data and task distribution (see the sketch below)
    ◮ Automatic data/task distribution
    ◮ Fault-tolerant
    ◮ Minimal programmer overhead
  ◮ Leverage heterogeneous resources to compute the tasks local to each node
    ◮ Accelerators and other emerging trends in HPC technology
  [Figure: each cluster node runs Spark alongside a local compute component "X" targeting that node's CPU or GPU]
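To make the division of labor concrete, here is a minimal PySpark sketch of the Spark + X pattern: Spark owns data/task distribution and fault tolerance, while a per-partition function stands in for the node-local "X" component. The workload and process_item task below are hypothetical placeholders, not part of the original framework.

    from pyspark import SparkContext

    def process_item(x):
        # Hypothetical per-item task; in the real framework this is where a
        # node-local kernel (CPU, CUDA, or OpenCL) would do the actual work.
        return x * x

    def local_component(partition):
        # "X": runs once per Spark partition on whichever node it lands on,
        # so it can batch up the partition and hand it to local hardware.
        return [process_item(x) for x in partition]

    sc = SparkContext(appName="spark-plus-x")
    workload = range(1000)                 # hypothetical input data
    rdd = sc.parallelize(workload)         # Spark: distribution + fault tolerance
    results = rdd.mapPartitions(local_component).collect()
    sc.stop()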

  5. Case Study Applications
  ◮ Fast Subtree Kernel (FSK)
    ◮ Call graph similarity analysis
    ◮ Program characterization
    ◮ Malware analysis
  ◮ Triangle enumeration
    ◮ Spam detection
    ◮ Web link recommendation
    ◮ Social network analysis

  6. What is FSK?
  ◮ Compute-bound graph kernel
  ◮ Measures the similarity of graphs in a dataset
  ◮ A graph is represented by a list of feature vectors
    ◮ Each feature vector represents a subtree
  [Figure: pipeline — Binaries → (Decomp) → Call Graphs → (FSK) → Similarity Matrix → (SVM) → Program Characterization]

  7. FSK in our framework
  ◮ Spark Component
    ◮ Split up pairwise graph comparisons (see the sketch below)
  ◮ Local Component
    ◮ For each pair of graphs, compare all feature vectors
  [Figure: Spark distributes call graphs to per-node "Compare" tasks]
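As a sketch of this split, assuming each call graph is a list of NumPy feature vectors: Spark enumerates the pairwise comparisons, and a local routine scores each pair by comparing all feature-vector combinations. The mean cosine similarity used here is only a stand-in with the same shape as FSK's comparison, not the actual kernel, and the toy dataset is hypothetical.

    from itertools import combinations
    import numpy as np
    from pyspark import SparkContext

    def compare_pair(pair):
        # Local component: compare every feature vector of graph i against
        # every feature vector of graph j. Mean cosine similarity is a
        # stand-in score, not the real FSK subtree comparison.
        (i, gi), (j, gj) = pair
        sims = [float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
                for u in gi for v in gj]
        return (i, j, sum(sims) / len(sims))

    sc = SparkContext(appName="fsk-sketch")
    rng = np.random.default_rng(0)
    call_graphs = [[rng.random(8) for _ in range(5)] for _ in range(4)]  # toy stand-in data
    indexed = list(enumerate(call_graphs))

    # Spark component: split up the pairwise graph comparisons
    entries = sc.parallelize(list(combinations(indexed, 2))).map(compare_pair).collect()
    sc.stop()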

  8. What is Triangle Enumeration?
  ◮ Data-bound graph operation
  ◮ Finds all cycles of size 3 (a.k.a. triangles) within a graph
  [Figure: a six-node example graph (vertices 0–5) containing 2 triangles, highlighted in red]

  9. Triangle Enumeration in our framework
  ◮ Spark Component
    ◮ Partition the graph
    ◮ Distribute the vertices/edges across the cluster
  ◮ Local Component
    ◮ Count triangles within each subgraph
    ◮ Done using matrix-matrix multiplication (BLAS; see the sketch below)
  ◮ Spark Component
    ◮ Count triangles between subgraphs
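The BLAS-based local step can be realized with the standard identity that, for an undirected 0/1 adjacency matrix A with zero diagonal, trace(A³) counts each triangle six times (three vertices times two orientations). A minimal NumPy sketch of the per-subgraph count (a GPU variant would perform the same multiplications through ScikitCUDA):

    import numpy as np

    def count_local_triangles(adj):
        # adj: dense, symmetric 0/1 adjacency matrix with zero diagonal.
        # trace(A @ A @ A) counts every triangle 6 times, so divide by 6.
        a3 = adj @ adj @ adj          # two BLAS-backed matrix multiplies
        return int(np.trace(a3)) // 6

    # Toy check: triangle 0-1-2 plus a pendant edge 2-3 -> exactly one triangle
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]])
    assert count_local_triangles(A) == 1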

  10. Hardware/Software
  ◮ Triangle Enumeration
    ◮ Software: PySpark, ScikitCUDA
    ◮ Hardware: NVIDIA GPUs (GTX 470, GTX 970, Tesla K20c)
  ◮ Fast Subtree Kernel
    ◮ Software: PySpark, PyOpenCL
    ◮ Hardware: AMD GPU (Fury X)

  11. FSK Results - Single-Node Parallelism
  [Figure: Call Graph Similarity - Single Node Performance. Runtime (in seconds) vs. dataset size (10, 100, 500, 1000) for single-thread CPU, 8-thread CPU, and GPU]
  ◮ Single-node runtimes (single thread, 8 threads, and GPU)

  12. FSK Results - Multi-Node Scalability
  [Figure: Call Graph Similarity - Single Node vs. Multi Node. Runtime (in seconds) vs. dataset size (10, 100, 500, 1000) for single-node CPU and multi-node CPU (3 nodes)]
  ◮ Multiple-node runtimes (CPU saturated on all nodes)

  13. Triangle Enumeration - Optimizing Data Movement
  ◮ Runtime of the Spark component of Triangle Enumeration with a variable number of partitions, for Erdos-Renyi random graphs with differing densities
  [Figure: Global time (seconds) vs. number of Spark partitions (36, 72, 144) for three configurations (CPU, GPU-1 Executor, GPU-4 Executors); left panel: denser graphs (N=5000, P=.05), right panel: sparse graphs (N=5000, P=.001)]
  ◮ More partitions means oversubscription of the GPU
    ◮ Overlaps communication with computation
  ◮ Fewer partitions allows for more triangles to be counted locally (see the sketch below)
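In PySpark the partition count is the only knob swept here; it can be set when the edge data is parallelized (or changed later with repartition). A rough sketch, where edge_list and count_partition are hypothetical placeholders for the edge data and the node-local counting kernel:

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-sweep")
    edge_list = [(0, 1), (1, 2), (2, 0), (2, 3)]   # hypothetical toy edge list

    def count_partition(edges):
        # Hypothetical local kernel: build this partition's subgraph and
        # count its triangles (e.g. with count_local_triangles above).
        yield 0

    # Sweep the partition counts from the slide: more partitions oversubscribe
    # the GPU and overlap communication with computation; fewer partitions keep
    # more of each subgraph, and hence more triangles, node-local.
    for n_parts in (36, 72, 144):
        local_total = sc.parallelize(edge_list, n_parts).mapPartitions(count_partition).sum()
    sc.stop()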

  14. Triangle Enumeration - Optimizing Local Computation
  ◮ Performance of the local component of Triangle Enumeration on the CPU and GPU for graphs of varying size and density
  [Figure: run time (s) as a function of graph size (1000–7000 nodes) and graph density (0.00–0.05), for the GPU (ScikitCUDA) and the CPU (SciPy)]
  ◮ Running on the GPU is preferred unless the graph is sparse (density < .01), in which case running on the CPU is preferred (see the dispatch sketch below)
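That crossover suggests a simple density-based dispatch for the local component. A sketch assuming SciPy's sparse matrix multiply on the CPU path and an optional, hypothetical gpu_count callable (e.g. a ScikitCUDA-backed multiply) on the dense path; the 0.01 cutoff is the one observed above:

    import numpy as np
    import scipy.sparse as sp

    DENSITY_CUTOFF = 0.01   # crossover observed in the measurements above

    def count_triangles_dispatch(adj, gpu_count=None):
        # adj: scipy.sparse adjacency matrix (symmetric 0/1, zero diagonal).
        # gpu_count: optional callable taking a dense ndarray -- a hypothetical
        # hook for a GPU trace(A^3)/6 kernel (e.g. built on ScikitCUDA).
        n = adj.shape[0]
        density = adj.nnz / float(n * n)
        if gpu_count is None or density < DENSITY_CUTOFF:
            # Sparse graphs favor the CPU: SciPy sparse matmul, trace(A^3)/6
            a = sp.csr_matrix(adj)
            return int((a @ a @ a).diagonal().sum()) // 6
        # Denser graphs favor the GPU
        return gpu_count(np.asarray(adj.todense()))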

  15. Conclusion
  ◮ FSK
    ◮ Linear scaling
    ◮ GPU outperforms CPU
    ◮ Free load balancing with Spark
  ◮ Triangle Enumeration
    ◮ Optimize data movement by changing the number of Spark partitions
    ◮ Improve local performance by choosing where to execute tasks
  ◮ Our high-level framework
    ◮ Demonstrated portability using a variety of hardware

  16. Future Work
  ◮ Additional case-study application
    ◮ Spiking neural network training
    ◮ Detecting common subgraphs within neural networks
  ◮ Additional tests
    ◮ Scalability test on a large-scale homogeneous cluster
    ◮ Add the latest NVIDIA GPUs (K40/K80) to our heterogeneous cluster

  17. Reproducibility
  ◮ All data and code on GitHub
  ◮ https://github.com/rsearles35/WACCPD-2016
