An Early Evaluation of the Scalability of Graph Algorithms on the - PowerPoint PPT Presentation


  1. An Early Evaluation of the Scalability of Graph Algorithms on the Intel MIC Architecture. Erik Saule¹ and Ümit V. Çatalyürek¹,², {esaule,umit}@bmi.osu.edu. ¹Department of Biomedical Informatics, ²Department of Electrical and Computer Engineering, The Ohio State University. MTAAP 2012. Ohio State University, Biomedical Informatics. Graph Algorithms on Intel MIC. Ümit V. Çatalyürek. HPC Lab, http://bmi.osu.edu/hpc

  2. The Intel MIC Architecture. Features: high-performance computing with generic x86 cores; high core count; large SIMD units; heavily hyper-threaded. The Knights Ferry prototype: 32 cores (1 reserved for system purposes in our experiments). Knights Corner: 50+ cores.

  3. Graph Algorithms and Irregular Kernels. Many applications where GPUs are held back: essentially all applications based on indirection and pointer chasing, such as sparse linear algebra (solvers, factorisation), graph problems (shortest path, travelling salesman, network analysis), and text search (inexact pattern matching, indexing). Graph coloring: given a graph, assign a color (an integer) to each vertex so that any two adjacent vertices have different colors. Breadth-first search traversal: given a graph and a particular vertex, build a list of all the vertices ordered from the closest to the farthest.

  5. Programming Models. OpenMP: pragma directives that enable parallel processing; support for sections, locks, ... Cilk Plus: asynchronous function calls powered by work stealing; allows nested parallelism; the focus is on programmability, by looking like sequential execution. Intel TBB: a collection of tools for parallel processing; object-oriented programming paradigm; a versatile programming model supporting recursive decomposition, filter-based parallel processing, ...

  6. Outline. 1 Introduction. 2 Coloring (Algorithm, Experimental Results). 3 Loaded Computation (Algorithm, Experimental Results). 4 Breadth First Search (Algorithms, Experimental Results). 5 Conclusions.

  7. Speculative Coloring. Each processor independently colors some vertices. Conflicts might occur. They are detected in parallel, and some vertices are uncolored. The process repeats itself.

     Algorithm 1: TentativeColoring
         Data: G = (V, E), Visit ⊂ V, color[1 : |V|]
         maxcolor ← 1
         localMC ← 1
         for each v ∈ Visit in parallel do
             for each w ∈ adj(v) do
                 localFC[color[w]] ← v
             color[v] ← min { i > 0 : localFC[i] ≠ v }
             if color[v] > localMC then
                 localMC ← color[v]
         maxcolor ← Reduce(max) localMC
         return maxcolor

     Algorithm 2: DetectConflict
         Data: G = (V, E), Visit ⊂ V, color[1 : |V|]
         Conflict ← ∅
         for each v ∈ Visit in parallel do
             for each w ∈ adj(v) do
                 if color[v] = color[w] then
                     if v < w then
                         atomic Conflict ← Conflict ∪ {v}
         return Conflict
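The two procedures on this slide can be sketched in C++ with OpenMP pragmas. This is a minimal illustration, not the authors' code: the `Graph` alias, the function names, the use of color 0 for "uncolored", the atomic reads and writes, and the critical section guarding the conflict list are all assumptions made for the example. Because only pragmas are used, the sketch also builds as plain sequential C++ when OpenMP is not enabled.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical adjacency-list graph type (not taken from the slides).
using Graph = std::vector<std::vector<int>>;

// Tentatively colors every vertex of `visit`; returns the largest color assigned.
// Colors are positive integers; 0 is assumed to mean "not colored yet".
int tentative_coloring(const Graph& g, const std::vector<int>& visit,
                       std::vector<int>& color) {
    assert(color.size() == g.size());
    int* col = color.data();
    int maxcolor = 1;
    #pragma omp parallel reduction(max : maxcolor)
    {
        // Per-thread array of forbidden colors: localFC[c] == v means that
        // color c is already used by a neighbor of v.
        std::vector<int> localFC(g.size() + 2, -1);
        #pragma omp for
        for (long i = 0; i < static_cast<long>(visit.size()); ++i) {
            const int v = visit[i];
            for (int w : g[v]) {
                int cw;
                #pragma omp atomic read
                cw = col[w];
                localFC[cw] = v;
            }
            int c = 1;  // smallest positive color not forbidden for v
            while (localFC[c] == v) ++c;
            #pragma omp atomic write
            col[v] = c;
            maxcolor = std::max(maxcolor, c);
        }
    }
    return maxcolor;
}

// Returns the vertices of `visit` that must be recolored: for each
// conflicting edge, only the endpoint with the smaller index is kept.
std::vector<int> detect_conflicts(const Graph& g, const std::vector<int>& visit,
                                  const std::vector<int>& color) {
    std::vector<int> conflict;
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(visit.size()); ++i) {
        const int v = visit[i];
        for (int w : g[v]) {
            if (color[v] == color[w] && v < w) {
                #pragma omp critical
                conflict.push_back(v);
                break;  // record v once, matching the set union in the pseudocode
            }
        }
    }
    return conflict;
}

// Repeats both phases until no conflict remains. The returned value is the
// largest color assigned in any round, so it is an upper bound on the final one.
int speculative_coloring(const Graph& g, std::vector<int>& color) {
    color.assign(g.size(), 0);
    std::vector<int> visit(g.size());
    std::iota(visit.begin(), visit.end(), 0);
    int maxcolor = 1;
    while (!visit.empty()) {
        maxcolor = std::max(maxcolor, tentative_coloring(g, visit, color));
        visit = detect_conflicts(g, visit, color);
    }
    return maxcolor;
}
```

Calling `speculative_coloring(g, color)` on an undirected graph stored with both directions of every edge leaves a valid coloring in `color`. The loop terminates because the highest-numbered vertex of each round is never put back into the conflict set.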

  8. Variants. OpenMP: implementation based on the parallel for construct, with three scheduling policies (static, dynamic, guided); memory is allocated up front and indexed by thread ID. Cilk Plus: recursive decomposition of the loop iterations, executed with work stealing; per-thread memory is obtained either by using holders, which allocate memory dynamically, or by hacking worker IDs and allocating the memory beforehand. Intel TBB: tbb::parallel_for can use several types of partitioner: simple recursively divides the range down to a given size; auto uses work-stealing events to decide when to stop dividing; affinity tries to maximize cache reuse based on the index ordering.
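As an illustration of the OpenMP variant, the sketch below shows a parallel for loop whose scheduling policy is chosen at run time, with scratch memory allocated up front and indexed by thread ID. The kernel (summing vertex degrees), the `Graph` alias, and the function name are hypothetical; only `schedule(runtime)`, `omp_get_max_threads`, and `omp_get_thread_num` are standard OpenMP. The serial fallbacks are an assumption added so that the example also builds without OpenMP.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds when OpenMP is not enabled.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Hypothetical adjacency-list graph type (not taken from the slides).
using Graph = std::vector<std::vector<int>>;

// Sums the degrees of all vertices, using one accumulator per thread.
long total_degree(const Graph& g) {
    // Per-thread memory, allocated before the loop and indexed by thread ID.
    std::vector<long> partial(omp_get_max_threads(), 0);

    // schedule(runtime) defers the policy (static, dynamic, or guided) to the
    // OMP_SCHEDULE environment variable, so all three can be tried unchanged.
    #pragma omp parallel for schedule(runtime)
    for (long v = 0; v < static_cast<long>(g.size()); ++v) {
        partial[omp_get_thread_num()] += static_cast<long>(g[v].size());
    }

    long total = 0;
    for (long p : partial) total += p;
    return total;
}
```

Running the program with, for example, `OMP_SCHEDULE=dynamic` or `OMP_SCHEDULE=guided` switches the policy without recompiling. A production version would likely pad the per-thread slots to avoid false sharing.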

  9. Experiments. [Figure: speedup versus number of threads (0 to 140) for speculative coloring. Panels: (a) OpenMP (dynamic, static, guided); (b) Cilk Plus (CilkPlus, CilkPlus-holder); (c) TBB (simple, auto, affinity); (d) Randomly Ordered Graph (OpenMP, TBB, CilkPlus).]

  11. Loaded Computation. The variants are the same as in speculative coloring. The computational intensity can be changed by tuning the number of iterations.

     Algorithm 3: IrregularComputation
         Data: G = (V, E), Visit ⊂ V, state[1 : |V|]
         for each v ∈ V in parallel do
             for i = 0; i < iter; i++ do
                 sum ← state[v]
                 for each w ∈ adj(v) do
                     sum ← sum + state[w]
                 state[v] ← sum / (|adj(v)| + 1)
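The loaded computation kernel can be sketched as follows. This is an illustrative reading of the pseudocode rather than the authors' implementation: the `Graph` alias, the function name, and the use of `double` for the state are assumptions, and the atomic reads and writes are added only to keep the concurrent in-place updates well defined.

```cpp
#include <cassert>
#include <vector>

// Hypothetical adjacency-list graph type (not taken from the slides).
using Graph = std::vector<std::vector<int>>;

// Repeats `iter` times, for every vertex, the replacement of its state by the
// average of its own state and the states of its neighbors. Updates are done
// in place, so a vertex may read values that neighbors have already updated.
void irregular_computation(const Graph& g, std::vector<double>& state, int iter) {
    assert(state.size() == g.size());
    double* s = state.data();
    #pragma omp parallel for
    for (long v = 0; v < static_cast<long>(g.size()); ++v) {
        for (int i = 0; i < iter; ++i) {
            double sum;
            #pragma omp atomic read
            sum = s[v];
            for (int w : g[v]) {
                double sw;
                #pragma omp atomic read
                sw = s[w];
                sum += sw;
            }
            const double avg = sum / static_cast<double>(g[v].size() + 1);
            #pragma omp atomic write
            s[v] = avg;
        }
    }
}
```

Raising `iter` increases the arithmetic done per vertex while the memory footprint stays the same, which is how the slide's "number of iterations" knob changes the computational intensity.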

  12. Experiments. [Figure: speedup versus number of threads (0 to 140) for the loaded computation. Panels: (e) Using OpenMP, (f) Using Cilk, (g) Using TBB, each with 1, 3, 5, and 10 iterations; (h) 10 iterations (OpenMP, TBB, CilkPlus).]

  14. Parallel Layered Breadth First Search. Sources of parallelism: parallel vertex traversal; parallel edge traversal (inefficient). Synchronizations: at the end of each level; management of next.

     Algorithm 4: ParLayeredBFS
         Data: G = (V, E), source ∈ V
         for v ∈ V in parallel do
             bfs[v] ← −1
         bfs[source] ← 0
         cur.add(source)
         level ← 1
         while !cur.empty() do
             for v ∈ cur in parallel do
                 for each w ∈ adj(v) in parallel do
                     if bfs[w] = −1 then
                         bfs[w] ← level
                         uniquely next.add(w)
             SWAP(cur, next)
             level ← level + 1
         return bfs
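A minimal C++ sketch of the layered BFS follows, parallelizing only the vertex traversal of each level. The slide does not say how the "uniquely next.add(w)" step is synchronized; this example assumes a single critical section around the test, the level assignment, and the insertion, which is simple but coarse. The `Graph` alias and function name are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical adjacency-list graph type (not taken from the slides).
using Graph = std::vector<std::vector<int>>;

// Returns the BFS level of every vertex, or -1 for vertices that are not
// reachable from `source`.
std::vector<int> par_layered_bfs(const Graph& g, int source) {
    assert(source >= 0 && static_cast<std::size_t>(source) < g.size());
    std::vector<int> bfs(g.size(), -1);
    bfs[source] = 0;
    std::vector<int> cur{source};
    std::vector<int> next;
    int level = 1;
    while (!cur.empty()) {
        next.clear();
        // Parallel vertex traversal of the current level; the implicit
        // barrier at the end of the loop is the per-level synchronization.
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(cur.size()); ++i) {
            const int v = cur[i];
            for (int w : g[v]) {
                // Assumed synchronization: one critical section ensures each
                // vertex is claimed and added to `next` exactly once.
                #pragma omp critical
                {
                    if (bfs[w] == -1) {
                        bfs[w] = level;
                        next.push_back(w);
                    }
                }
            }
        }
        cur.swap(next);
        ++level;
    }
    return bfs;
}
```

The critical section serializes every edge inspection, so this version mainly shows the structure of the algorithm. Cheaper schemes, such as per-thread `next` queues merged at the end of each level, fit the same skeleton.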
