NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China
Big Data Everywhere
100 hrs of video every minute · 1.11 billion users · 6 billion photos · 400 million tweets/day
How do we understand and use Big Data?
Data Analytics
100 hrs of video every minute · 1.11 billion users · 6 billion photos · 400 million tweets/day
Machine Learning and Data Mining, NLP
It’s about the graphs ...
NUMA & Graph-analytics
Application: NLP
Hardware
Processor: Single → Multi-Core → Multi-Socket
Memory: Unified → NUCA → NUMA (now)
e.g. 80 cores with 1 TB RAM = 8 sockets × (10 cores with 128 GB local RAM)
How about graph analytics on NUMA systems?
Contribution
Polymer: NUMA-aware Graph-structured Analytics
□ A comprehensive analysis that uncovers issues with running graph-analytics systems on NUMA platforms
□ A new system that exploits both NUMA-aware data layout and memory access strategies
□ Three optimizations for global synchronization efficiency, load balance and data structure flexibility
□ A detailed evaluation that demonstrates the performance and scalability benefits
Outline Background & Issues Design of Polymer Evaluation
Example: PageRank
A centrality analysis algorithm to measure the relative rank for each element of a linked set
Characteristics
□ Linked set → data dependence
□ Rank of who links it → predictable accesses
□ Convergence → iterative computation
[Figure: example graph with vertices 1-5, shown over successive iterations]
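The three characteristics above can be seen in a minimal PageRank sketch. This is not Polymer's code: the four-vertex graph, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for illustration, and every vertex is assumed to have at least one out-edge.

```python
# Hypothetical example graph (not the one on the slide): source -> targets.
GRAPH = {1: [2, 3], 2: [3], 3: [1], 4: [3]}


def pagerank(out_edges, damping=0.85, iterations=20):
    """Plain iterative PageRank; assumes every vertex has an out-edge."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(iterations):              # convergence: iterate
        nxt = {v: (1.0 - damping) / n for v in out_edges}
        for src, targets in out_edges.items():
            # data dependence: a vertex's rank flows to the vertices it links to
            share = damping * rank[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        rank = nxt
    return rank


ranks = pagerank(GRAPH)
```

Because the edge set never changes, every iteration touches the same vertices in the same order, which is what makes the accesses predictable.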
Graph-analytics
The scatter-gather model
□ “scatter”: propagate the current value of a vertex to its neighbors along edges
□ “gather”: accumulate values from neighbors to compute the next value of a vertex
In-memory data structure
□ Graph Topology (TOPO): vertex 1 2 3 4 5 6; edge 2 3 3 5 2 5 6 1 3 5 1 2 3 6 2
□ Application-specific Data (DATA): curr D1 D2 D3 D4 D5 D6; next
□ Runtime State (STAT): curr 1 0 1 1 1 0; next
Vertex-centric (e.g. Ligra)
[Figure: example graph with vertices 1-6]
STAT/curr → TOPO/vertex → TOPO/out-edge
DATA/curr: SEQ R
DATA/next: RND W
STAT/next
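A minimal sketch of one vertex-centric push step, to show where the SEQ R and RND W annotations come from. It is not Ligra's API. The CSR offsets are an assumption: the slides give only the flat edge array, and the split used here reads it as six sorted adjacency lists. Vertex ids are shifted to 0-based.

```python
# TOPO/vertex (assumed offsets) and TOPO/out-edge (slide's edge array, 0-based).
OFFSETS = [0, 2, 4, 7, 10, 14, 15]
TARGETS = [1, 2, 2, 4, 1, 4, 5, 0, 2, 4, 0, 1, 2, 5, 1]


def push_step(data_curr, stat_curr):
    """Active vertices push their current value along their out-edges."""
    n = len(OFFSETS) - 1
    data_next = [0.0] * n
    stat_next = [0] * n
    for v in range(n):                       # sequential scan of STAT/curr
        if not stat_curr[v]:
            continue
        value = data_curr[v]                 # SEQ R: DATA/curr in vertex order
        for e in range(OFFSETS[v], OFFSETS[v + 1]):
            dst = TARGETS[e]
            data_next[dst] += value          # RND W: DATA/next in neighbor order
            stat_next[dst] = 1
    return data_next, stat_next


# STAT/curr taken from the slides (1 0 1 1 1 0); DATA values are placeholders.
data_next, stat_next = push_step([1.0] * 6, [1, 0, 1, 1, 1, 0])
```

The inner write lands wherever the neighbor happens to live, which is the access that turns remote once the arrays span several NUMA nodes.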
Edge-centric (e.g. X-Stream)
[Figure: example graph with vertices 1-6, split into partitions]
Scatter: TOPO/edge, STAT/curr; DATA/curr: RND R; DATA/Uout: SEQ W
Shuffle phase: DATA/Uout: SEQ R; DATA/Uin: SEQ W
Gather: DATA/Uin: SEQ R; DATA/next: RND W; STAT/next
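The three phases can be sketched as follows. This is an illustration of the scatter / shuffle / gather structure only, not X-Stream's implementation: the edge list is one reading of the slides' edge array (0-based), and the two-way range partitioning is an assumption.

```python
# Assumed edge list (src, dst), 0-based, derived from the slides' edge array.
EDGES = [(0, 1), (0, 2), (1, 2), (1, 4), (2, 1), (2, 4), (2, 5), (3, 0),
         (3, 2), (3, 4), (4, 0), (4, 1), (4, 2), (4, 5), (5, 1)]
NUM_PARTS = 2  # hypothetical partition count


def part(v, n):
    """Range partitioning: the first half of the vertices goes to partition 0."""
    return v * NUM_PARTS // n


def edge_centric_step(data_curr, stat_curr):
    n = len(data_curr)
    # Scatter: stream edges, read the source value (RND R), append updates (SEQ W).
    u_out = [[] for _ in range(NUM_PARTS)]
    for src, dst in EDGES:
        if stat_curr[src]:
            u_out[part(src, n)].append((dst, data_curr[src]))
    # Shuffle: stream Uout (SEQ R) into the target partition's Uin (SEQ W).
    u_in = [[] for _ in range(NUM_PARTS)]
    for buf in u_out:
        for dst, value in buf:
            u_in[part(dst, n)].append((dst, value))
    # Gather: stream Uin (SEQ R), apply updates to DATA/next (RND W).
    data_next = [0.0] * n
    stat_next = [0] * n
    for buf in u_in:
        for dst, value in buf:
            data_next[dst] += value
            stat_next[dst] = 1
    return data_next, stat_next


data_next, stat_next = edge_centric_step([1.0] * 6, [1, 0, 1, 1, 1, 0])
```

The result matches a vertex-centric push over the same graph; the difference is that random accesses stay inside one partition at the cost of writing and re-reading the update buffers.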
NUMA Characteristics
A commodity NUMA machine
□ Multiple processor nodes (i.e., sockets)
□ Processor = multiple cores + a local DRAM
□ A globally shared memory abstraction (cache coherence)
□ Hallmark: non-uniform memory access
Latency (cycles), 80-core Intel Xeon machine
Inst.   0-hop  1-hop  2-hop
Load    117    271    372
Store   108    304    409
Bandwidth (MB/s), 80-core Intel Xeon machine
Access  0-hop  1-hop  2-hop  IL
SEQ     3207   2455   2101   2333
RAND    720    348    307    344
Sequential remote access is faster than random local access & random remote access is the slowest of all!
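The two takeaways follow directly from the bandwidth figures. A small worked check, using only the numbers on the slide (the dictionary layout and variable names are illustrative):

```python
# Bandwidth in MB/s on the 80-core Intel Xeon machine, as given on the slide.
BANDWIDTH = {
    "SEQ":  {"0-hop": 3207, "1-hop": 2455, "2-hop": 2101, "IL": 2333},
    "RAND": {"0-hop": 720,  "1-hop": 348,  "2-hop": 307,  "IL": 344},
}

# Even the farthest sequential access beats local random access.
seq_far_vs_rand_local = BANDWIDTH["SEQ"]["2-hop"] / BANDWIDTH["RAND"]["0-hop"]

# Random access loses more than half its bandwidth when it goes remote.
rand_far_vs_rand_local = BANDWIDTH["RAND"]["2-hop"] / BANDWIDTH["RAND"]["0-hop"]

print(f"SEQ 2-hop / RAND 0-hop  = {seq_far_vs_rand_local:.2f}x")
print(f"RAND 2-hop / RAND 0-hop = {rand_far_vs_rand_local:.2f}x")
```

So the access pattern matters more than the distance: a remote sequential stream is still about three times faster than a local random one.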
NUMA Characteristics
The world we live in: “first-touch” policy
“binding virtual pages to physical frames locating on a memory node where a thread first touches the pages”
Data layouts: Centralized, Interleaved, Associated
[Figure: placement of CPU and MEM under each layout]
Both centralized and interleaved data layouts hamper locality and parallelism; the associated layout is the ideal one.
The world we want to live in: the associated layout
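A toy model of the three layouts makes the locality and parallelism argument concrete. It does not call any OS or libnuma interface; the node count, page count and the rule that each worker processes a contiguous block of pages are all assumptions for illustration.

```python
NUM_NODES = 2
PAGES_PER_NODE = 4
NUM_PAGES = NUM_NODES * PAGES_PER_NODE


def worker_of(page):
    """Node whose worker thread processes this page during computation."""
    return page // PAGES_PER_NODE


# Each layout maps a page to the node that holds its physical frame.
LAYOUTS = {
    # One thread on node 0 touches every page first (e.g. while loading).
    "centralized": lambda page: 0,
    # Pages are spread round-robin over the nodes.
    "interleaved": lambda page: page % NUM_NODES,
    # Each worker first-touches the pages it will later process.
    "associated": worker_of,
}


def summarize(layout):
    home = [layout(p) for p in range(NUM_PAGES)]
    local = sum(1 for p in range(NUM_PAGES) if home[p] == worker_of(p))
    pages_per_node = [home.count(node) for node in range(NUM_NODES)]
    return local / NUM_PAGES, pages_per_node


summary = {name: summarize(layout) for name, layout in LAYOUTS.items()}
```

In this model the centralized layout keeps half the accesses local but puts every page on one node, the interleaved layout balances the nodes but still sends half the accesses remote, and only the associated layout gets both.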
NUMA Characteristics
Lack of locality (access neighboring vertices)
□ It is inevitable to access remote memory
□ Random access is always there
How to mix SEQ / RND with Local / Global?
[Figure: example graph with vertices 1-6 and an update to the vertex array 1 2 3 4 5 6]
Access Strategy on NUMA
Legend: SEQ / RND access, L (local) / G (global)
Vertex-centric Model
□ Completely overlooked (e.g. Ligra)
[Figure: example graph with vertices 1-6 split over nodes N0 and N1]
SEQ R: L
RND W: G
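Access Strategy on NUMA
Legend: SEQ / RND access, L (local) / G (global)
Edge-centric Model
□ Inefficient way (e.g. X-Stream)
[Figure: example graph with vertices 1-6 split over nodes N0 and N1]
Scatter: RND R: L; SEQ W: L
Shuffle phase: SEQ R: L; SEQ W: G
Gather: SEQ R: L; RND W: L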
Scalability & Performance on NUMA
[Figure: Scalability, #cores vs. #sockets, Intel 80-core (8S×10C). Normalized speedup of Ligra, X-Stream and Galois over 1-10 cores and over 1-8 sockets. Annotations: 8C: 6.92X, 10C: 6.19X, 8S: 4.58X, 8S: 2.90X]
[Figure: Performance, runtime (sec) vs. #sockets, Intel 80-core (8S×10C), and Scalability, normalized speedup vs. #sockets, AMD 64-core (8S×8C), for Ligra, X-Stream and Galois. Annotations: 1S: 132s, 8S: 29s, 1S: 33s, 8S: 12s, "worse!"; at 8 sockets LG: 2.9X, XS: 1.4X]
Minimize remote & random accesses + eliminate the combination of them
Outline Background & Issues Design of Polymer Evaluation
Goal#1: Reduce remote accesses Co-locating data and computation within the same NUMA node
Graph-aware Data Layout
Co-locating data and computation within the same NUMA node
[Figure: example graph with vertices 1-6]
1. Graph-aware partitioning
Original layout
TOPO: vertex 1 2 3 4 5 6; edge 2 3 3 5 2 5 6 1 3 5 1 2 3 6 2
DATA: D1 D2 D3 D4 D5 D6
STAT: 1 0 1 1 1 0
Intuitive: split the arrays above as they are between N0 and N1
Sophisticated: regroup the edges between N0 and N1 and give each node agent vertices
vertex (with agents), on N0 and on N1: 1 2 3 4 5 6
edge (regrouped): 2 3 3 2 1 3 1 2 3 2 5 5 6 5 6
2. NUMA-aware allocation (Virt-Memory → Phys-Memory)
      TOPO   DATA     STAT
1.    seq    seq/rnd  seq/rnd
2.    local  global   global
3.    long   long     short
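The regrouped edge array on the slide can be reproduced by partitioning edges according to the node that owns their target vertex. The sketch below is one reading of the slide, not Polymer's actual code: the per-vertex adjacency lists are inferred from the flat edge array by assuming each list is sorted, vertices 1-3 are assumed to live on N0 and 4-6 on N1, and treating non-owned source vertices as "agents" is an interpretation.

```python
# Assumed adjacency lists (source -> targets), inferred from the slide's edge array.
OUT_EDGES = {1: [2, 3], 2: [3, 5], 3: [2, 5, 6],
             4: [1, 3, 5], 5: [1, 2, 3, 6], 6: [2]}
NUM_NODES = 2


def owner(v, num_vertices=6):
    """Range placement: vertices 1-3 on node 0, vertices 4-6 on node 1."""
    return (v - 1) * NUM_NODES // num_vertices


def partition_by_target(out_edges):
    """Give each node the edges whose target vertex it owns."""
    parts = []
    for node in range(NUM_NODES):
        edges = {}
        for src, targets in out_edges.items():
            local = [dst for dst in targets if owner(dst) == node]
            if local:
                edges[src] = local
        # Sources owned by another node are represented locally by an agent.
        agents = [src for src in edges if owner(src) != node]
        flat = [dst for targets in edges.values() for dst in targets]
        parts.append({"edges": edges, "agents": agents, "flat": flat})
    return parts


parts = partition_by_target(OUT_EDGES)
```

Under these assumptions, `parts[0]["flat"]` followed by `parts[1]["flat"]` gives the same sequence as the regrouped edge array on the slide, and every write to a target vertex stays on the node that holds it.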
Goal#2: Eliminate “random + remote”
Random remote access → access neighboring vertices on other nodes
Distribute the computations on a single vertex over multiple nodes
□ Each node handles all edges of partial vertices
□ Each node handles partial edges of all vertices
[Figure: the two options on the example graph, annotated RND L / SEQ G and RND G / SEQ L]
NUMA-aware Access Strategy
Distribute the computations on a single vertex over multiple NUMA nodes
[Figure: example graph with vertices 1-6; STAT/curr, TOPO/vertex, TOPO/out-edge, DATA/curr, DATA/next and STAT/next laid out over N0 and N1]
Vertex-centric: DATA/curr SEQ R L, DATA/next RND W G
Polymer: DATA/curr SEQ R G, DATA/next RND W L
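To see what the switch buys, the sketch below counts accesses by kind for the two ways of dividing work. It is a counting model, not Polymer's implementation: the adjacency lists are one reading of the slides' edge array, the placement (vertices 1-3 on node 0, 4-6 on node 1) is an assumption, and each source vertex is counted as one sequential read per node that processes it.

```python
from collections import Counter

# Assumed adjacency lists (source -> targets), inferred from the slides' edge array.
OUT_EDGES = {1: [2, 3], 2: [3, 5], 3: [2, 5, 6],
             4: [1, 3, 5], 5: [1, 2, 3, 6], 6: [2]}


def owner(v):
    """Hypothetical placement: vertices 1-3 on node 0, vertices 4-6 on node 1."""
    return 0 if v <= 3 else 1


def by_source():
    """Each node handles all edges of the vertices it owns."""
    counts = Counter()
    for src, targets in OUT_EDGES.items():
        node = owner(src)
        counts["SEQ R L"] += 1               # own DATA/curr entry
        for dst in targets:
            counts["RND W L" if owner(dst) == node else "RND W G"] += 1
    return counts


def by_target():
    """Each node handles, for every vertex, only the edges that end locally."""
    counts = Counter()
    for node in (0, 1):
        for src, targets in OUT_EDGES.items():
            local = [dst for dst in targets if owner(dst) == node]
            if not local:
                continue
            counts["SEQ R L" if owner(src) == node else "SEQ R G"] += 1
            counts["RND W L"] += len(local)  # every write stays on this node
    return counts


source_counts = by_source()
target_counts = by_target()
```

In this model the by-source division sends 9 of the 15 random writes to the other node, while the by-target division sends none and pays for it with 5 remote sequential reads, the cheaper kind of remote access according to the bandwidth table.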
Optimizations 1. Rolling update 2. Hierarchical and efficient barrier 3. Adaptive data structure
Outline Background & Issues Design of Polymer Evaluation