
NUMA-aware Graph-structured Analytics

Kaiyuan Zhang, Rong Chen, Haibo Chen. Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, China


  1. NUMA-aware Graph-structured Analytics. Kaiyuan Zhang, Rong Chen, Haibo Chen. Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, China

  2. Big Data Everywhere: 100 hours of video every minute; 1.11 billion users; 6 billion photos; 400 million tweets/day. How do we understand and use Big Data?

  3. Data Analytics: 100 hours of video every minute; 1.11 billion users; 6 billion photos; 400 million tweets/day. Machine Learning and Data Mining; NLP

  4. It’s about the graphs ...

  5. NUMA & Graph-analytics. Application (e.g. NLP) on top of hardware. Processor: single core, then multi-core, then multi-socket. Memory: unified, then NUCA, then NUMA. Now: e.g. 80 cores with 1 TB RAM = 8 sockets × (10 cores with 128 GB local RAM)

  6. How about graph analytics on NUMA systems?

  7. Contribution. Polymer: NUMA-aware Graph-structured Analytics □ A comprehensive analysis that uncovers issues in running graph-analytics systems on NUMA platforms □ A new system that exploits both NUMA-aware data layout and memory access strategies □ Three optimizations for global synchronization efficiency, load balance and data structure flexibility □ A detailed evaluation that demonstrates the performance and scalability benefits

  8. Outline: Background & Issues; Design of Polymer; Evaluation

  9. Outline: Background & Issues; Design of Polymer; Evaluation

  10. Example: PageRank. A centrality analysis algorithm to measure the relative rank for each element of a linked set. Characteristics: □ Linked set → data dependence □ Rank of who links it → predictable accesses □ Convergence → iterative computation [Figure: example graph with vertices 1-5 and its ranks over successive iterations]
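The three characteristics on this slide (data dependence on in-links, accesses driven by the link structure, iteration until convergence) can be sketched as a short power iteration. This is a minimal sketch, not code from the talk: the damping factor 0.85, the tolerance, the function name `pagerank`, and the 5-vertex edge list are all assumptions chosen for illustration (the edges of the slide's figure are not recoverable).

```python
def pagerank(num_vertices, edges, damping=0.85, tol=1e-10, max_iters=1000):
    """Power-iteration PageRank on a directed edge list of (src, dst) pairs.

    Assumes every vertex has at least one out-edge (no dangling vertices).
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(max_iters):
        # Every vertex starts from the constant "teleport" share.
        nxt = [(1.0 - damping) / num_vertices] * num_vertices
        # Data dependence: a vertex's rank depends on the ranks of who links to it.
        for src, dst in edges:
            nxt[dst] += damping * rank[src] / out_degree[src]
        # Convergence test ends the iterative computation.
        delta = sum(abs(a - b) for a, b in zip(nxt, rank))
        rank = nxt
        if delta < tol:
            break
    return rank


# Hypothetical 5-vertex graph (0-indexed); NOT the graph drawn on the slide.
EDGES = [(0, 1), (1, 2), (2, 0), (3, 0), (4, 0), (0, 3), (3, 4)]
ranks = pagerank(5, EDGES)
```

In this made-up graph vertex 0 has the most in-links, so it ends up with the highest rank.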

  11. Graph-analytics. The scatter-gather model: □ "scatter": propagate the current value of a vertex to its neighbors along edges □ "gather": accumulate values from neighbors to compute the next value of a vertex. In-memory data structure: □ Graph topology (TOPO): vertex array 1 2 3 4 5 6; edge array 2 3 3 5 2 5 6 1 3 5 1 2 3 6 2 □ Application-specific data (DATA): curr = D1 D2 D3 D4 D5 D6, plus next □ Runtime state (STAT): curr = 1 0 1 1 1 0, plus next
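The scatter-gather model and the three kinds of in-memory data (TOPO, DATA, STAT, each with curr/next copies where applicable) can be sketched as below. This is an illustrative sketch under assumptions: the per-vertex offsets of the slide's edge array are not given, so the code uses a small hypothetical graph in an offsets-plus-edges layout, and it uses minimum-label propagation as the application. The function names are made up for the example.

```python
def scatter_gather_step(offsets, edges, data_curr, stat_curr):
    """One iteration: active vertices scatter, neighbors gather with min()."""
    n = len(data_curr)
    data_next = list(data_curr)   # DATA/next
    stat_next = [0] * n           # STAT/next
    for v in range(n):
        if not stat_curr[v]:      # STAT/curr marks the active vertices
            continue
        # Scatter: push v's current value along its out-edges (TOPO).
        for e in range(offsets[v], offsets[v + 1]):
            dst = edges[e]
            # Gather: the neighbor accumulates (here: keeps the minimum).
            if data_curr[v] < data_next[dst]:
                data_next[dst] = data_curr[v]
                stat_next[dst] = 1   # dst changed, so it is active next round
    return data_next, stat_next


def min_label_propagation(offsets, edges):
    """Iterate until no vertex is active; each vertex ends with its component's min id."""
    n = len(offsets) - 1
    data, stat = list(range(n)), [1] * n
    while any(stat):
        data, stat = scatter_gather_step(offsets, edges, data, stat)
    return data


# Hypothetical graph (0-indexed): paths 0-1-2 and 3-4-5, edges stored in both directions.
OFFSETS = [0, 1, 3, 4, 5, 7, 8]
EDGE_ARRAY = [1, 0, 2, 1, 4, 3, 5, 4]
labels = min_label_propagation(OFFSETS, EDGE_ARRAY)
```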

  12. Vertex-centric (e.g. Ligra). Arrays involved: STAT/curr, TOPO/vertex, TOPO/out-edge, DATA/curr, DATA/next, STAT/next [Figure: example graph with vertices 1-6]

  13. Vertex-centric (e.g. Ligra). Same arrays as slide 12, annotated with access patterns: sequential reads (SEQ R) and random writes (RND W) [Figure: example graph with vertices 1-6]

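  14. Edge-centric (e.g. X-Stream). Arrays involved: TOPO/edge, STAT/curr, DATA/curr, DATA/Uout, shuffle phase, DATA/Uin, DATA/next, STAT/next [Figure: example graph with vertices 1-6, partitioned]

  15. Edge-centric (e.g. X-Stream). Same arrays as slide 14, annotated with access patterns: RND R on DATA/curr and SEQ W to DATA/Uout; a shuffle phase with SEQ R and SEQ W; SEQ R from DATA/Uin and RND W to DATA/next and STAT/next [Figure: example graph with vertices 1-6, partitioned]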
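The scatter, shuffle, and gather phases named on these slides can be sketched as one edge-centric iteration. This is a simplified sketch and not X-Stream's implementation: it streams a single global edge list rather than per-partition edge files, uses minimum-label propagation as the application, and the names `edge_centric_iteration`, `u_out`, and `u_in` are assumptions that only mirror the slide's DATA/Uout and DATA/Uin labels.

```python
def edge_centric_iteration(edge_list, data_curr, num_parts):
    """One scatter / shuffle / gather pass over an edge list of (src, dst) pairs."""
    n = len(data_curr)
    part_size = (n + num_parts - 1) // num_parts   # vertices per partition

    # Scatter: stream the edges sequentially, read the source value
    # (random read of DATA/curr), append an update (sequential write to Uout).
    u_out = [(dst, data_curr[src]) for src, dst in edge_list]

    # Shuffle: read Uout sequentially and route each update to the
    # partition that owns its destination vertex (sequential write to Uin).
    u_in = [[] for _ in range(num_parts)]
    for dst, value in u_out:
        u_in[dst // part_size].append((dst, value))

    # Gather: stream each partition's updates sequentially and apply them
    # to the destination vertices (random write to DATA/next).
    data_next = list(data_curr)
    for updates in u_in:
        for dst, value in updates:
            data_next[dst] = min(data_next[dst], value)
    return data_next


# Hypothetical graph (0-indexed): paths 0-1-2 and 3-4-5, both edge directions listed.
EDGE_LIST = [(0, 1), (1, 0), (1, 2), (2, 1), (3, 4), (4, 3), (4, 5), (5, 4)]
after_one_pass = edge_centric_iteration(EDGE_LIST, [0, 1, 2, 3, 4, 5], num_parts=2)
```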

  16. NUMA Characteristics. A commodity NUMA machine: □ Multiple processor nodes (i.e., sockets) □ Processor = multiple cores + a local DRAM □ A globally shared memory abstraction (cache coherence) □ Hallmark: non-uniform memory access

      Latency (cycles), 80-core Intel Xeon machine
      Inst.    0-hop   1-hop   2-hop
      Load     117     271     372
      Store    108     304     409

      Bandwidth (MB/s), 80-core Intel Xeon machine
      Access   0-hop   1-hop   2-hop   IL
      SEQ      3207    2455    2101    2333
      RAND     720     348     307     344

  17. NUMA Characteristics. Same machine description and measurements as slide 16. Takeaways: sequential remote access is faster than random local access, and random remote access is awful!
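The takeaways can be checked with simple arithmetic on the bandwidth figures from the table. The numbers below are copied from the slide; the dictionary layout and variable names are just one illustrative way to hold them.

```python
# Bandwidth in MB/s, copied from the slide's table (80-core Intel Xeon machine).
BANDWIDTH = {
    "SEQ":  {"0-hop": 3207, "1-hop": 2455, "2-hop": 2101, "IL": 2333},
    "RAND": {"0-hop": 720,  "1-hop": 348,  "2-hop": 307,  "IL": 344},
}

# Even the farthest sequential access beats local random access.
seq_remote_vs_rand_local = BANDWIDTH["SEQ"]["2-hop"] / BANDWIDTH["RAND"]["0-hop"]

# Random access loses more than half its bandwidth once it goes remote.
rand_local_vs_rand_remote = BANDWIDTH["RAND"]["0-hop"] / BANDWIDTH["RAND"]["2-hop"]

print(f"SEQ 2-hop / RAND 0-hop = {seq_remote_vs_rand_local:.2f}x")
print(f"RAND 0-hop / RAND 2-hop = {rand_local_vs_rand_remote:.2f}x")
```

On these figures, 2-hop sequential access has roughly 2.9 times the bandwidth of local random access, and local random access roughly 2.3 times that of 2-hop random access.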

  18. NUMA Characteristics. The world we live in: the "first-touch" policy, "binding virtual pages to physical frames located on the memory node where a thread first touches the pages". Data layouts: centralized, interleaved, associated [Figure: CPU and MEM placement for each layout, contrasted with the world we want to live in]

  19. NUMA Characteristics. Same layouts as slide 18. Both centralized and interleaved data layouts hamper locality and parallelism; the associated layout is the ideal one.
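How the three layouts arise can be illustrated with a toy model of page placement. This is a pure simulation, not a call into any operating-system or libnuma API: the functions, the contiguous "owner" rule (each node later processes one contiguous chunk of pages), and the page counts are all assumptions made for the example.

```python
def first_touch_placement(num_pages, touches):
    """Bind each page to the node of the first thread that touches it.

    touches is an ordered list of (node, page) accesses.
    """
    placement = [None] * num_pages
    for node, page in touches:
        if placement[page] is None:
            placement[page] = node
    return placement


def owner(page, num_pages, num_nodes):
    # Assumed work split: each node later processes one contiguous chunk of pages.
    per_node = -(-num_pages // num_nodes)  # ceiling division
    return page // per_node


def local_fraction(placement, num_nodes):
    """Fraction of pages that live on the node that will process them."""
    n = len(placement)
    hits = sum(1 for p, node in enumerate(placement) if node == owner(p, n, num_nodes))
    return hits / n


NUM_PAGES, NUM_NODES = 8, 2

# Centralized: a single thread on node 0 initializes (first-touches) everything.
centralized = first_touch_placement(NUM_PAGES, [(0, p) for p in range(NUM_PAGES)])

# Interleaved: pages are spread round-robin over the nodes by policy.
interleaved = [p % NUM_NODES for p in range(NUM_PAGES)]

# Associated: each node's thread first-touches the chunk it will process.
associated = first_touch_placement(
    NUM_PAGES, [(owner(p, NUM_PAGES, NUM_NODES), p) for p in range(NUM_PAGES)]
)
```

In this toy setup only the associated layout keeps every page local to the node that uses it; the centralized and interleaved layouts leave half of the pages remote.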

  20. NUMA Characteristics. Lack of locality (accessing neighboring vertices): □ It is inevitable to access remote memory □ Random access is always there. How to mix sequential/random (SEQ/RND) accesses with local/global ones? [Figure: example graph with vertices 1-6 and a per-vertex update array]

  21. Access Strategy on NUMA. Vertex-centric model: □ Completely overlooked (e.g. Ligra). Access patterns across nodes N0 and N1: SEQ R L, RND W G (L = local, G = global) [Figure: example graph with vertices 1-6]

  22. Access Strategy on NUMA. Edge-centric model: □ Inefficient way (e.g. X-Stream). Access patterns across nodes N0 and N1: RND R L, SEQ W L; shuffle phase: SEQ R L, SEQ W G; then SEQ R L, RND W L [Figure: example graph with vertices 1-6]

  23. Scalability & Performance on NUMA. [Figure: four plots comparing Ligra, X-Stream and Galois. Top row, Intel 80-core (8S×10C): normalized speedup vs. #cores (1-10) and normalized speedup vs. #sockets (1-8). Bottom row: runtime (sec) vs. #sockets (1-8) on the Intel 80-core (8S×10C) machine, and normalized speedup vs. #sockets (1-8) on the AMD 64-core (8S×8C) machine. Callouts: 8C: 6.92X; 10C: 6.19X; 8S: 4.58X; 8S: 2.90X; 1S: 132s, 8S: 29s; 1S: 33s, 8S: 12s; 8 sockets: LG 2.9X, XS 1.4X; "worse!"]

  24. Scalability & Performance on NUMA. Same plots as slide 23. Takeaway: minimize remote and random accesses, and eliminate the combination of them.

  25. Outline: Background & Issues; Design of Polymer; Evaluation

  26. Goal#1: Reduce remote accesses. Co-locate data and computation within the same NUMA node

  27. Graph-aware Data Layout. Co-locating data and computation within the same NUMA node. Step 1: graph-aware partitioning. TOPO: vertices 1 2 3 4 5 6; edges 2 3 3 5 2 5 6 1 3 5 1 2 3 6 2. DATA: D1 D2 D3 D4 D5 D6. STAT: 1 0 1 1 1 0 [Figure: example graph with vertices 1-6]

  28. Graph-aware Data Layout. Step 1: graph-aware partitioning, intuitive version: the arrays of slide 27 (vertices 1-6; edges 2 3 3 5 2 5 6 1 3 5 1 2 3 6 2; D1-D6; 1 0 1 1 1 0) are divided between nodes N0 and N1

  29. Graph-aware Data Layout. Step 1: graph-aware partitioning, sophisticated version: across N0 and N1 the edge array is regrouped as 2 3 3 2 1 3 1 2 3 2 (targets 1-3) followed by 5 5 6 5 6 (targets 4-6); DATA (D1-D6) and STAT (1 0 1 1 1 0) as before

  30. Graph-aware Data Layout. Step 1: graph-aware partitioning, sophisticated version with agents: N0 and N1 each list vertices 1 2 3 4 5 6; edges 2 3 3 2 1 3 1 2 3 2 and 5 5 6 5 6; DATA (D1-D6) and STAT (1 0 1 1 1 0) as before

  31. Graph-aware Data Layout. Same partitioning as slide 30, plus step 2: NUMA-aware allocation, mapping virtual memory to physical memory according to three properties of each kind of data:

              TOPO     DATA      STAT
      1.      seq      seq/rnd   seq/rnd
      2.      local    global    global
      3.      long     long      short
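The regrouping of the edge array between slides 27 and 29 is consistent with grouping edges by the node that owns their target vertex. The sketch below reproduces that regrouping under two assumptions that are inferred from the numbers rather than stated on the slides: the edge array holds target vertex IDs, and vertices 1-3 live on N0 while 4-6 live on N1. The function name is hypothetical, and this is not presented as Polymer's actual partitioning code.

```python
def partition_by_target(edge_targets, num_vertices, num_nodes):
    """Group edge targets by the node that owns the target vertex.

    Vertices are 1-indexed (as on the slides) and split into equal
    contiguous ranges, one range per node.
    """
    per_node = -(-num_vertices // num_nodes)  # ceiling division
    parts = [[] for _ in range(num_nodes)]
    for dst in edge_targets:
        owner = (dst - 1) // per_node
        parts[owner].append(dst)   # keeps the original relative order
    return parts


# Edge array from slide 27.
EDGE_TARGETS = [2, 3, 3, 5, 2, 5, 6, 1, 3, 5, 1, 2, 3, 6, 2]
parts = partition_by_target(EDGE_TARGETS, num_vertices=6, num_nodes=2)
```

With this grouping every write to a target vertex stays on the node that stores that vertex; the source vertices of those edges may live elsewhere, which appears to be the role of the "agent" vertex lists on slide 30.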

  32. Goal#2: Eliminate "random + remote". Random remote access → accessing neighboring vertices on other nodes. Distribute the computations on a single vertex over multiple nodes

  33. Goal#2: Eliminate "random + remote". Random remote access → accessing neighboring vertices on other nodes. Distribute the computations on a single vertex over multiple nodes. Two ways to divide the work: each node handles all edges of partial vertices, or each node handles partial edges of all vertices [Figure: both options on the example graph, annotated RND L / SEQ G and RND G / SEQ L]

  34. NUMA-aware Access Strategy. Distribute the computations on a single vertex over multiple NUMA nodes (N0, N1). Arrays involved: STAT/curr, TOPO/vertex, TOPO/out-edge, DATA/curr, DATA/next, STAT/next [Figure: example graph with vertices 1-6]

  35. NUMA-aware Access Strategy. Distribute the computations on a single vertex over multiple NUMA nodes (N0, N1) [Figure: accesses to DATA/curr and DATA/next on N0 and N1, annotated SEQ R L, RND W G, SEQ R G, RND W L]
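The trade between the two annotated patterns (sequential local reads with random global writes, versus sequential global reads with random local writes) can be illustrated by counting remote accesses under two ways of assigning edges to nodes. This is a toy accounting model under stated assumptions, not Polymer's implementation: the edge list is hypothetical, vertices are split into equal contiguous ranges, and the strategy names `by_source` and `by_target` are made up for the example.

```python
def owner(v, num_vertices, num_nodes):
    per_node = -(-num_vertices // num_nodes)  # ceiling division
    return v // per_node


def count_remote_accesses(edges, num_vertices, num_nodes, strategy):
    """Return (remote_reads, remote_writes) for one pass over the edges.

    'by_source': each node processes the out-edges of the vertices it owns,
                 so reads of curr[src] are local and writes to next[dst]
                 may land on another node (random remote writes).
    'by_target': each node processes the edges whose destination it owns,
                 so writes to next[dst] are local and reads of curr[src]
                 may come from another node; scanning sources in order
                 makes those reads sequential.
    """
    if strategy not in ("by_source", "by_target"):
        raise ValueError(f"unknown strategy: {strategy}")
    remote_reads = remote_writes = 0
    for src, dst in edges:
        src_node = owner(src, num_vertices, num_nodes)
        dst_node = owner(dst, num_vertices, num_nodes)
        worker = src_node if strategy == "by_source" else dst_node
        remote_reads += src_node != worker
        remote_writes += dst_node != worker
    return remote_reads, remote_writes


# Hypothetical 6-vertex graph (0-indexed): vertices 0-2 on node 0, 3-5 on node 1.
EDGES = [(0, 1), (0, 4), (1, 2), (2, 5), (3, 0), (4, 3), (5, 1), (5, 4)]
by_source = count_remote_accesses(EDGES, 6, 2, "by_source")
by_target = count_remote_accesses(EDGES, 6, 2, "by_target")
```

The same four cross-node edges show up in both cases; the difference is whether they cost a random remote write or a remote read that can be made sequential.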

  36. Optimizations 1. Rolling update 2. Hierarchical and efficient barrier 3. Adaptive data structure

  37. Outline: Background & Issues; Design of Polymer; Evaluation
