

  1. An Extension of Charm++ to Optimize Fine-Grained Applications. Alexander Frolov (frolov@dislab.org), Data-Intensive Systems Laboratory (DISLab), NICEVT. 14th Annual Workshop on Charm++ and its Applications, Urbana-Champaign, Illinois, April 19-20, 2016.

  2. Talk Outline
     • Introduction
       - Fine-grained vs. Coarse-grained Parallelism
       - Approaches to Large-scale Graph Processing in Charm++
       - Problems of Expressing the Vertex-centric Model in Charm++
     • uChareLib Programming Model
       - uChareLib Programming Model & Library Design
       - Comparing uChareLib & Charm++ (on Alltoall)
     • Performance Evaluation
       - HPCC RandomAccess
       - Graphs: Asynchronous Breadth-first Search
       - Graphs: PageRank
       - Graphs: Single-Source Shortest Paths
       - Graphs: Connected Components
     • Conclusion & Future Plans

  3. Fine-grained vs. Coarse-grained Parallelism

     Fine-grained:
     • large number of processes/threads (≫ #CPUs), can be changed dynamically
     • small messages (payload up to ~1 KB)
     • dynamic partitioning of the problem
     • load balancing

     Coarse-grained:
     • number of processes/threads equals #CPUs
     • large messages (payload from 1 KB)
     • static workload assignment
     • load balancing is a rare case

     Applications where fine-grained parallelism arises naturally:
     • PDE solvers (unstructured, adaptive meshes)
     • graph applications
     • molecular dynamics
     • discrete simulation
     • etc.

     Applications where coarse-grained parallelism arises naturally:
     • PDE solvers (fixed structured meshes)
     • rendering
     • etc.

     Common HPC practice: for performance reasons, granularity is coarsened by aggregating objects/messages, which increases utilization of system resources.

  4. Approaches to Large-scale Graph Processing on Charm++
     Vertex-centric [= fine-grained] vs. Subgraph-centric [= coarse/medium-grained]

     • Vertex-centric:
       Graph (G) is an array of chares distributed across parallel processes (PEs).
       Vertex = chare (1:1).
       Vertices communicate via asynchronous active messages (entry method calls).
       Program completion is detected by CkStartQD.

     • Subgraph-centric:
       Graph (G) is an array of chares distributed between parallel processes (PEs).
       Vertex = chare (n:1); any local representation is possible.
       Algorithms consist of local (sequential) parts and global parts (parallel, Charm++).
       Application-level optimizations (aggregation, local reductions, etc.) are possible.
       Program completion is detected by CkStartQD or manually.

     [Figure: an example 4-vertex graph (vertices 0-3) mapped one chare per vertex (vertex-centric) and several vertices per chare (subgraph-centric).]
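     To make the vertex-centric pattern concrete, here is a minimal Charm++-style sketch (not code from the talk; the names Vertex, visit, neighbors, and vertex.decl.h are illustrative assumptions) of one chare per vertex running an asynchronous BFS-like traversal purely through entry method calls:

       #include <vector>
       #include "vertex.decl.h"  // assumed .ci: array [1D] Vertex
                                 // { entry Vertex(); entry void visit(int level); };

       class Vertex : public CBase_Vertex {
         std::vector<int> neighbors;   // global indices of adjacent vertices
         int level;                    // -1 means "not visited yet"
       public:
         Vertex() : level(-1) {}
         Vertex(CkMigrateMessage *m) {}
         // Entry method: one small message per edge, i.e. fine-grained traffic.
         void visit(int l) {
           if (level != -1) return;    // already visited, ignore duplicates
           level = l;
           for (int u : neighbors)     // asynchronous active messages to neighbors
             thisProxy[u].visit(l + 1);
         }
       };
       // The driver detects termination of the asynchronous traversal with
       // CkStartQD(CkCallbackResumeThread()).

     With millions of vertices this creates far more chares than PEs and far more small messages than a coarse-grained code would generate, which is exactly the regime the rest of the talk targets.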

  5. HPCC RandomAccess
     Table size/PE: 2^20 × 8 bytes; HPC system: [2x Xeon E5-2630]/IB FDR; np=8, ppn=8.
     [Plot: performance (GUP/s, log scale) vs. number of chares (1 to 65536) for charm-randomaccess and tram-randomaccess.]

  6. uChareLib Programming Model & Design
     • uChareLib (micro-Chare Library) is a small extension of Charm++ that provides a way to mitigate RTS overheads for fine-grained parallelism:
       - a uchare object is introduced into the Charm++ model
       - entry method calls are supported for uchares
       - a uchare array is provided to define arrays of uchares (same as a chare array)
       - uchares are distributed between common chares
       - message aggregation is supported inside uChareLib
       - a new entry method type, reentrant, is added (only for uchares)
     • uChareLib can be downloaded from https://github.com/DISLab/xcharm
     [Figure: on each PE (PE[0]..PE[N-1]) a uChareSet chare holds a Proxy, an EP table, a slice of the uchare array A[...], and an Aggregator (Naive or TRAM).]
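     The following is an illustrative sketch only (it is not the uChareLib API; the names UChareSetSketch, MicroVertex, deliver, and the .decl.h file are assumptions) of the hosting idea in the diagram above: many lightweight "uchare" objects live inside one ordinary chare array element, and a single entry method dispatches incoming messages to them by local index.

       #include <vector>
       #include "ucharesetsketch.decl.h"   // assumed .ci: array [1D] UChareSetSketch
                                           // { entry UChareSetSketch(int); entry void deliver(int, double); };

       struct MicroVertex {                // a lightweight object, not a chare
         double state = 0.0;
         void update(double x) { state += x; }
       };

       class UChareSetSketch : public CBase_UChareSetSketch {
         std::vector<MicroVertex> locals;  // k "uchares" hosted by this chare
       public:
         UChareSetSketch(int k) : locals(k) {}
         UChareSetSketch(CkMigrateMessage *m) {}
         // Plays the role of the EP table: route a message to the addressed object.
         void deliver(int localIdx, double x) { locals[localIdx].update(x); }
       };

     uChareLib additionally provides the proxy/EP-table dispatch and the outgoing-message aggregation (naive or TRAM-based) shown in the diagram, which this sketch leaves out.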

  7. [Figure-only slide; no text was extracted.]

  8. Performance Evaluation: HPCC RandomAccess
     • The original TRAM implementation is used (from the Charm++ trunk).
     • The Charm++ and uChareLib implementations are straightforward conversions of the TRAM-based RandomAccess code.
     [Figure: the chare array of Updaters, each owning a local table, distributed across PEs (four chares per PE shown for PE0-PE3).]
     NB: the update function does not contain calls to other chares, so there are no nested calls (insertData/entry method) for TRAM and uChareLib.
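     For reference, the per-update work is just a local XOR into the owning chare's table; a minimal sketch of the update entry method (the member names localTable and localTableSize are assumptions, not necessarily those of the benchmark code) looks like this:

       void Updater::update(CmiUInt8 key) {
         // RandomAccess core: XOR the random key into the locally owned slot.
         localTable[key & (localTableSize - 1)] ^= key;
         // No calls to other chares are made here, which is why there are no
         // nested insertData()/entry-method calls for TRAM or uChareLib.
       }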

  9. Performance Evaluation: HPCC RandomAccess
     • The original TRAM implementation is used (from the Charm++ trunk).
     • The Charm++ and uChareLib implementations are straightforward conversions of the TRAM-based RandomAccess code.

     Charm++ & uChareLib (randomAccess.C):

       void Updater::generateUpdates() {
         int arrayN = N - (int) log2((double) numElementsPerPe);
         int numElements = CkNumPes() * numElementsPerPe;
         CmiUInt8 key = HPCC_starts(4 * globalStartmyProc);
         for (CmiInt8 i = 0; i < 4 * localTableSize; i++) {
           key = key << 1 ^ ((CmiInt8) key < 0 ? POLY : 0);
           int destinationIndex = (key >> arrayN) & (numElements - 1);
           thisProxy[destinationIndex].update(key);
         }
       }

     TRAM (randomAccess.C):

       void Updater::generateUpdates() {
         ...
         ArrayMeshStreamer<dtype, int, Updater, SimpleMeshRouter>
             *localAggregator = aggregator.ckLocalBranch();
         for (CmiInt8 i = 0; i < 4 * localTableSize; i++) {
           key = key << 1 ^ ((CmiInt8) key < 0 ? POLY : 0);
           int destinationIndex = (key >> arrayN) & (numElements - 1);
           localAggregator->insertData(key, destinationIndex);
         }
         localAggregator->done();
       }

  10. Performance Evaluation: HPCC RandomAccess (N = 2^20 per PE), HPC system: [2x Xeon E5-2630]/IB FDR
      [Plots: updates per PE per second (GUP/s, log scale) vs. chares/PE (1 to ~10^6) for charm++, tram, and ucharelib (ppn=8), on 2, 4, 8, and 16 nodes.]

  11. Performance Evaluation: HPCC RandomAccess (N = 2^22 per PE), HPC system: [2x Xeon E5-2630]/IB FDR
      [Plots: updates per PE per second (GUP/s, log scale) vs. chares/PE (1 to ~10^6) for charm++, tram, and ucharelib (ppn=8), on 2, 4, 8, and 16 nodes.]

  12. Performance Evaluation: PageRank
      • Problem description: iteratively compute ranks for all v ∈ G:
          PR_v^(i+1) = (1 − d)/N + d × Σ_{u ∈ Adj(v)} PR_u^(i) / L_u,
        where d is the damping factor, N the number of vertices, and L_u the out-degree of u (source: Wikipedia).
      • Implementations:
        - Charm++, naive
        - Charm++, with incoming message counting
        - TRAM, naive
        - uChareLib, naive
      NB: the update function does not contain calls to other chares, so there are no nested calls (insertData/entry method) for TRAM and uChareLib.
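      In LaTeX, the same update rule (standard PageRank notation) reads:

        PR_v^{(i+1)} = \frac{1-d}{N} + d \sum_{u \in \mathrm{Adj}(v)} \frac{PR_u^{(i)}}{L_u}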

  13. Performance Evaluation: PageRank, naive algorithm

      Algorithm: Naive PageRank
       1: function PageRankVertex::doPageRankStep_init
       2:     PR_old ← PR_new
       3:     PR_new ← (1.0 − d)/N
       4: end function
       5: function PageRankVertex::doPageRankStep_update
       6:     for u ∈ AdjList do
       7:         thisProxy[u].update(PR_old / L)
       8:     end for
       9: end function
      10: function PageRankVertex::update(r)
      11:     PR_new ← PR_new + d × r
      12: end function
      13: function TestDriver::doPageRank
      14:     for i = 0; i < N_iters; i ← i + 1 do
      15:         g.doPageRankStep_init()
      16:         CkStartQD(CkCallbackResumeThread())
      17:         g.doPageRankStep_update()
      18:         CkStartQD(CkCallbackResumeThread())
      19:     end for
      20: end function

      [Figure: TestDriver broadcasts doPageRankStep_init and then doPageRankStep_update to vertices G[0]-G[5]; update messages flow along the edges, and each phase is followed by quiescence detection.]
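      A hedged Charm++ rendering of the naive scheme (the member names adj, prOld, prNew, and pagerank.decl.h are illustrative; d and N are assumed to be available as readonly globals) could look like this:

        #include <vector>
        #include "pagerank.decl.h"   // assumed generated interface

        class PageRankVertex : public CBase_PageRankVertex {
          std::vector<int> adj;            // outgoing edges of this vertex
          double prOld, prNew;
        public:
          PageRankVertex() : prOld(0.0), prNew(1.0 / N) {}
          PageRankVertex(CkMigrateMessage *m) {}
          void doPageRankStep_init() {     // phase 1: swap and reset
            prOld = prNew;
            prNew = (1.0 - d) / N;
          }
          void doPageRankStep_update() {   // phase 2: scatter contributions
            for (int u : adj)
              thisProxy[u].update(prOld / adj.size());
          }
          void update(double r) {          // accumulate a neighbor's share
            prNew += d * r;
          }
        };

      Because the two phases must not overlap, TestDriver::doPageRank places a CkStartQD(CkCallbackResumeThread()) barrier after each broadcast, i.e. two quiescence detections per iteration, which is the overhead the message-counting variant on the next slide removes.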

  14. Performance Evaluation: PageRank, with counting of incoming messages

      Algorithm: PageRank with message counting
       1: function PageRankVertex::doPageRankStep
       2:     let PR_old refer to (n_iter % 2) ? rank_0 : rank_1
       3:     for u ∈ AdjList do
       4:         thisProxy[u].update(PR_old / L)
       5:     end for
       6: end function
       7: function PageRankVertex::update(r)
       8:     let PR_new refer to (n_iter % 2) ? rank_1 : rank_0
       9:     PR_new ← PR_new + d × r
      10:     n_msg ← n_msg − 1
      11:     if n_msg = 0 then
      12:         n_msg ← D_in
      13:         n_iter ← n_iter + 1
      14:         let PR_new refer to (n_iter % 2) ? rank_1 : rank_0
      15:         PR_new ← (1.0 − d)/N
      16:     end if
      17: end function
      18: function TestDriver::doPageRank
      19:     for i = 0; i < N_iters; i ← i + 1 do
      20:         g.doPageRankStep()
      21:         CkStartQD(CkCallbackResumeThread())
      22:     end for
      23: end function

      [Figure: TestDriver broadcasts doPageRankStep to vertices G[0]-G[5]; update messages flow along the edges, with a single quiescence detection per iteration.]
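      The message-counting variant needs only one quiescence detection per iteration because every vertex advances its own iteration locally once it has heard from all of its in-neighbors. A sketch of the update entry method in Charm++ (the members rank[2], nMsg, inDegree, and nIter are illustrative names for the double buffer and counters above) is:

        void PageRankVertex::update(double r) {
          double &prNew = (nIter % 2) ? rank[1] : rank[0];  // current buffer
          prNew += d * r;                  // accumulate this neighbor's share
          if (--nMsg == 0) {               // heard from all in-neighbors (D_in)
            nMsg = inDegree;               // re-arm the counter
            nIter++;                       // advance to the next iteration
            double &next = (nIter % 2) ? rank[1] : rank[0];
            next = (1.0 - d) / N;          // initialize the next iteration's buffer
          }
        }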

  15. Performance Evaluation: PageRank on Kronecker/Graph500 graphs, HPC system: [2x Xeon E5-2630]/IB FDR
      [Plots: PageRank performance; the annotations mark speedups of x6, x36, x16, and x3.2.]
