

  1. Hardware/Software Vectorization for Closeness Centrality on Multi-/Many-Core Architectures. Ahmet Erdem Sarıyüce, Erik Saule, Kamer Kaya, Ümit V. Çatalyürek. The Ohio State University (BMI, CS, ECE), University of North Carolina at Charlotte (CS). MTAAP 2014.

  2. Outline: 1. Introduction, 2. An SpMM-based approach, 3. Experiments, 4. Conclusion.

  3. Centralities - Concept. Answer questions such as: Who controls the flow in a network? Who is more important? Who has more influence? Whose contribution is significant for connections? Applications: covert networks (e.g., terrorist identification), contingency analysis (e.g., weakness/robustness of networks), viral marketing (e.g., who will spread the word best), traffic analysis, store locations. Different kinds of graphs: road networks, social networks, power grids, mechanical meshes.

  4. Closeness Centrality. Definition: Let G = (V, E) be an unweighted graph with vertex set V and edge set E. Then cc[v] = 1 / (∑_{u ∈ V} d(v, u)), where d(u, v) is the shortest path length between u and v. The best known algorithm computes the shortest path graph rooted in each vertex of the graph. The complexity is O(E) per source, O(VE) in total, which makes it computationally expensive.
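As a small illustration (not from the slides): on a path graph a - b - c, cc[b] = 1/(d(b,a) + d(b,c)) = 1/2 while cc[a] = 1/(1 + 2) = 1/3, so the middle vertex is the most central.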

  5. Closeness Centrality. Definition: Let G = (V, E) be an unweighted graph with vertex set V and edge set E. Then cc[v] = 1 / (∑_{u ∈ V} d(v, u)), where d(u, v) is the shortest path length between u and v. The best known algorithm computes the shortest path graph rooted in each vertex of the graph. The complexity is O(E) per source, O(VE) in total, which makes it computationally expensive. Typical algorithms (one BFS per source): top-down or bottom-up; direction optimizing; level synchronous BFS. [Figure: frontier expansion illustrated on the adjacency matrix (From/To).] No regularity in the computation: no use of the vector processing units.
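As a point of reference, here is a minimal scalar sketch of that baseline (illustrative code, not from the slides): one level-synchronous BFS per source over a CSR graph (xadj/adj, as in the vectorized code later), summing the distances to the reached vertices and inverting the sum as in the definition above.

#include <algorithm>
#include <queue>
#include <vector>

// Graph in CSR form: adj[xadj[i]] .. adj[xadj[i+1]-1] are the neighbors of i.
std::vector<double> closeness_bfs(const std::vector<int>& xadj,
                                  const std::vector<int>& adj, int n) {
  std::vector<double> cc(n, 0.0);
  std::vector<int> dist(n);
  for (int s = 0; s < n; ++s) {                 // O(E) work per source, O(VE) total
    std::fill(dist.begin(), dist.end(), -1);
    std::queue<int> q;
    dist[s] = 0;
    q.push(s);
    double far = 0.0;                           // sum of d(s,u) over reached u
    while (!q.empty()) {
      int u = q.front(); q.pop();
      far += dist[u];
      for (int j = xadj[u]; j < xadj[u + 1]; ++j) {
        int v = adj[j];
        if (dist[v] < 0) { dist[v] = dist[u] + 1; q.push(v); }
      }
    }
    cc[s] = (far > 0.0) ? 1.0 / far : 0.0;      // cc as defined on this slide
  }
  return cc;
}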

  6. Vector processing units. SIMD: a key source of performance. MMX (1996): 64-bit registers (x86). SSE (1999): 128-bit registers (x86). AVX (2008): 256-bit registers (x86). MIC (2012): 512-bit registers (Xeon Phi); 512-bit registers to come on x86. Operations: add, mul, and, or, ... Ignoring vectorization wastes 75% (SSE), 87% (AVX), 93% (MIC) of the available performance in single precision. Vectorization is also often necessary to saturate the memory bandwidth.
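In other words: scalar single-precision code uses only 1 of the 4 lanes of a 128-bit SSE register, 1 of the 8 AVX lanes, and 1 of the 16 lanes of a 512-bit register, which is where the roughly 75%, 87.5% and 93.75% figures come from.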

  7. Outline: 1. Introduction, 2. An SpMM-based approach, 3. Experiments, 4. Conclusion.

  8. An SpMV-based approach. A simpler definition of level synchronous BFS: vertex v is at level ℓ if and only if one of the neighbors of v is at level ℓ-1 and v is not at any level ℓ' < ℓ. Let x^ℓ_i = true if vertex i is part of the frontier at level ℓ. y^{ℓ+1} is the set of neighbors of level ℓ: y^{ℓ+1}_k = OR_{j ∈ Γ(k)} x^ℓ_j (an (OR, AND)-SpMV). Compute the next level frontier: x^{ℓ+1}_i = y^{ℓ+1}_i & ¬(OR_{ℓ' ≤ ℓ} x^{ℓ'}_i). The contribution of the source to cc[i] is x^ℓ_i / ℓ. Bottom-up (gather reads): for each vertex, are the neighbors in the frontier? Complexity O(ED), where D is the diameter of the graph. Writes are performed once, linearly. Reads are (hopefully) close-by. Top-down (scatter writes): for each element of the frontier, touch the neighbors. Complexity: O(E). Writes are scattered in memory. Reads are linear.
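A minimal sketch of one bottom-up level of this formulation (illustrative names, not the authors' code), with one byte per vertex for the frontier and visited flags and a CSR graph:

#include <cstdint>
#include <vector>

// One (OR, AND)-SpMV step of a single BFS: from the frontier x at level l,
// compute the frontier xnext at level l+1.
void next_level(const std::vector<int>& xadj, const std::vector<int>& adj, int n,
                const std::vector<uint8_t>& x,   // x[i] = 1 if i is at level l
                std::vector<uint8_t>& visited,   // OR of all levels <= l
                std::vector<uint8_t>& xnext) {   // output: level l+1
  for (int i = 0; i < n; ++i) {
    uint8_t y = 0;
    for (int j = xadj[i]; j < xadj[i + 1]; ++j)  // gather reads (bottom-up)
      y |= x[adj[j]];                            // y_i = OR over the neighbors of i
    xnext[i] = (uint8_t)(y & ~visited[i]);       // in the new frontier iff unseen
  }
  for (int i = 0; i < n; ++i) visited[i] |= xnext[i];
}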

  9. From SpMV to SpMM.

Data: G = (V, E), b
Output: cc[.]
⊲ Init
cc[v] ← 0, ∀v ∈ V
partition V into k batches Π = {V_1, V_2, ..., V_k} of size b
for each batch of vertices V_p ∈ Π do
    x^0_{s,s} ← 1 if s ∈ V_p, 0 otherwise
    ℓ ← 0
    while ∑_{i,s} x^ℓ_{i,s} > 0 do
        ⊲ SpMM
        y^{ℓ+1}_{i,s} = OR_{j ∈ Γ(i)} x^ℓ_{j,s}, ∀s, ∀i
        ⊲ Update
        x^{ℓ+1}_{i,s} = y^{ℓ+1}_{i,s} & ¬(OR_{ℓ' ≤ ℓ} x^{ℓ'}_{i,s}), ∀s, ∀i
        ℓ ← ℓ + 1
        for all v ∈ V do
            cc[v] ← cc[v] + (∑_s x^ℓ_{v,s}) / ℓ
return cc[.]

[Figure: the b sources of the current batch shown as b columns of the frontier matrix.]
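The same scheme in portable scalar C++ (a sketch, not the authors' code), assuming b = 64 concurrent BFS packed into one uint64_t per vertex and the GCC/Clang popcount builtin; the next slides show the 256-bit AVX version:

#include <cstdint>
#include <vector>

void cc_spmm64(const std::vector<int>& xadj, const std::vector<int>& adj,
               int n, std::vector<double>& cc) {
  const int b = 64;                                   // one BFS per bit
  std::vector<uint64_t> cur(n), vis(n), nei(n);
  cc.assign(n, 0.0);
  for (int s = 0; s < n; s += b) {                    // batch of sources V_p
    for (int i = 0; i < n; ++i) {
      uint64_t init = (i >= s && i < s + b) ? (uint64_t)1 << (i - s) : 0;
      cur[i] = vis[i] = init;                         // x^0_{s,s} = 1
      nei[i] = 0;
    }
    int level = 0;
    bool cont = true;
    while (cont) {
      cont = false;
      ++level;
      for (int i = 0; i < n; ++i) {                   // SpMM: OR of neighbor rows
        uint64_t y = 0;
        for (int j = xadj[i]; j < xadj[i + 1]; ++j) y |= cur[adj[j]];
        nei[i] = y;
      }
      for (int i = 0; i < n; ++i) {                   // Update + accumulate
        uint64_t nxt = nei[i] & ~vis[i];
        vis[i] |= nxt;
        cur[i] = nxt;
        int cnt = __builtin_popcountll(nxt);          // #sources reaching i now
        if (cnt > 0) { cc[i] += (double)cnt / level; cont = true; }
      }
    }
  }
}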

  10. Some simple analysis. Complexity of O(VED) instead of O(VE), but D is typically small. Vectorizable. The matrix is transferred VD/b times instead of V times: D is small and b can be big (512-bit registers on MIC). Increasing b increases the size of the right-hand side: it potentially trashes the cache, but it regularizes the memory access patterns.
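To give a sense of scale (illustrative numbers): with b = 256 and a diameter around D = 20, the matrix is streamed about VD/b ≈ V/13 times, more than an order of magnitude fewer full passes over the graph than the V passes of the one-BFS-per-source baseline.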

  11. Vectorization

void cc_cpu_256_spmm (int* xadj, int* adj, int n, float* cc) {
  int b = 256;
  size_t size_alloc = n * b / 8;
  char* neighbor = (char*)_mm_malloc(size_alloc, 32);
  char* current  = (char*)_mm_malloc(size_alloc, 32);
  char* visited  = (char*)_mm_malloc(size_alloc, 32);
  for (int s = 0; s < n; s += b) {
    //Init
    #pragma omp parallel for schedule (dynamic, CC_CHUNK)
    for (int i = 0; i < n; ++i) {
      __m256i neigh = _mm256_setzero_si256();
      int il[8] = {0, 0, 0, 0, 0, 0, 0, 0};
      if (i >= s && i < s + b) il[(i-s)>>5] = 1 << ((i-s) & 0x1F);
      __m256i cu = _mm256_set_epi32(il[7], il[6], il[5], il[4],
                                    il[3], il[2], il[1], il[0]);
      _mm256_store_si256 ((__m256i *)(neighbor + 32 * i), neigh);
      _mm256_store_si256 ((__m256i *)(current + 32 * i), cu);
      _mm256_store_si256 ((__m256i *)(visited + 32 * i), cu);
    }
    int cont = 1;
    int level = 0;
    while (cont != 0) {
      cont = 0;
      level++;
      //SpMM
      #pragma omp parallel for schedule (dynamic, CC_CHUNK)
      for (int i = 0; i < n; ++i) {
        __m256 vali = _mm256_setzero_ps();
        for (int j = xadj[i]; j < xadj[i+1]; ++j) {
          int v = adj[j];
          __m256 state_v = _mm256_load_ps((float*)(current + 32 * v));
          vali = _mm256_or_ps (vali, state_v);
        }
        _mm256_store_ps ((float*)(neighbor + 32 * i), vali);
      }
      //Update
      float flevel = 1.0f / (float) level;
      #pragma omp parallel for schedule (dynamic, CC_CHUNK)
      for (int i = 0; i < n; ++i) {
        __m256 nei = _mm256_load_ps ((float *)(neighbor + 32 * i));
        __m256 vis = _mm256_load_ps ((float *)(visited + 32 * i));
        __m256 cu = _mm256_andnot_ps (vis, nei);
        vis = _mm256_or_ps (nei, vis);
        int bcnt = bitCount_256(cu);
        if (bcnt > 0) {
          cc[i] += bcnt * flevel;
          cont = 1;
        }
        _mm256_store_ps ((float *)(visited + 32 * i), vis);
        _mm256_store_ps ((float *)(current + 32 * i), cu);
      }
    }
  }
  _mm_free(neighbor);
  _mm_free(current);
  _mm_free(visited);
}
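To read the code above: b = 256 sources are handled per batch, so every vertex i owns a 32-byte (256-bit) row in current, neighbor and visited. The Init loop sets bit (i - s) in the row of each source of the batch; the SpMM loop ORs the rows of i's neighbors; the Update loop keeps the newly reached bits (andnot with visited), counts them with bitCount_256, and adds that count times 1/level to cc[i]. The float-typed OR/ANDNOT intrinsics are used purely as 256-bit bitwise operations here.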

  12. Vectorization (same code as on the previous slide). Variants: similar SSE and MIC implementations. Also implemented in a generic way in C++, using various tags to inform the compiler of what it can do (restrict, unroll) and using templates to fix the number of BFS, in order to generate dedicated assembly code for each variant.

  13. Software vectorization. Observation: performing multiple BFS at once does not only allow using the vector registers; it also reduces the number of times the graph is traversed. Idea: why limit the number of concurrent sources to the size of the vector register? We use compiler-vectorized code to generate kernels for different numbers of concurrent BFS. We call this technique software vectorization.
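A minimal sketch of what such a generated kernel can look like (illustrative, not the authors' generic C++ code): the number of 64-bit words per vertex is a template parameter, so every instantiation has fixed-trip-count inner loops that the compiler can unroll and auto-vectorize; NWORDS = 4 corresponds to 256 concurrent BFS and NWORDS = 16 to 1024. The __restrict qualifier mirrors the restrict tag mentioned above and is a common compiler extension.

#include <cstdint>
#include <vector>

// Each vertex owns NWORDS 64-bit words, i.e. 64*NWORDS concurrent BFS.
template <int NWORDS>
void spmm_level(const std::vector<int>& xadj, const std::vector<int>& adj, int n,
                const uint64_t* __restrict cur, uint64_t* __restrict nei) {
  #pragma omp parallel for schedule(dynamic, 64)
  for (int i = 0; i < n; ++i) {
    uint64_t acc[NWORDS] = {0};
    for (int j = xadj[i]; j < xadj[i + 1]; ++j) {
      const uint64_t* row = cur + (size_t)NWORDS * adj[j];
      for (int w = 0; w < NWORDS; ++w)      // fixed trip count: vectorizable
        acc[w] |= row[w];
    }
    for (int w = 0; w < NWORDS; ++w)
      nei[(size_t)NWORDS * i + w] = acc[w];
  }
}

// Dedicated kernels generated per width, e.g.:
// spmm_level<4>(xadj, adj, n, cur, nei);    // 256 bits per vertex
// spmm_level<16>(xadj, adj, n, cur, nei);   // 1024 bits per vertex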
