

  1. An Extension of Charm++ to Optimize Fine-Grained Applications. Alexander Frolov (frolov@dislab.org), Data-Intensive Systems Laboratory (DISLab), NICEVT. 14th Annual Workshop on Charm++ and its Applications, Urbana-Champaign, Illinois, April 19-20, 2016.

  2. Talk Outline
     • Introduction
       - Fine-grained vs. Coarse-grained Parallelism
       - Approaches to Large-scale Graph Processing in Charm++
       - Problems of Expressing the Vertex-centric Model in Charm++
     • uChareLib Programming Model
       - uChareLib Programming Model & Library Design
       - Comparing uChareLib & Charm++ (on Alltoall)
     • Performance Evaluation
       - HPCC RandomAccess
       - Graphs: Asynchronous Breadth-first Search
       - Graphs: PageRank
       - Graphs: Single-Source Shortest Paths
       - Graphs: Connected Components
     • Conclusion & Future Plans

  3. Fine-grained vs. Coarse-grained Parallelism

     Fine-grained:
     • large number of processes/threads (≫ #CPUs), can be changed dynamically
     • small messages (payload up to ~1 KB)
     • dynamic partitioning of the problem
     • load balancing

     Coarse-grained:
     • number of processes/threads equals #CPUs
     • large messages (payload from 1 KB)
     • static workload assignment
     • load balancing is a rare case

     Applications where fine-grained parallelism arises naturally:
     • PDE solvers (unstructured, adaptive meshes)
     • graph applications
     • molecular dynamics
     • discrete simulation
     • etc.

     Applications where coarse-grained parallelism arises naturally:
     • PDE solvers (fixed structured meshes)
     • rendering
     • etc.

     Common HPC practice: for performance reasons, granularity is coarsened by aggregating objects/messages, which increases utilization of system resources.

  4. Approaches to Large-scale Graph Processing on Charm++
     Vertex-centric [= fine-grained] vs. Subgraph-centric [= coarse/medium-grained]

     • Vertex-centric:
       Graph (G) is an array of chares distributed across parallel processes (PEs).
       Vertex = chare (1:1).
       Vertices communicate via asynchronous active messages (entry method calls).
       Program completion is detected by CkStartQD.

     • Subgraph-centric:
       Graph (G) is an array of chares distributed between parallel processes (PEs).
       Vertex = chare (n:1); any local representation is possible.
       Algorithms consist of local (sequential) parts and global parts (parallel, Charm++).
       Application-level optimizations (aggregation, local reductions, etc.) are possible.
       Program completion is detected by CkStartQD or manually.

     [Figure: an example 4-vertex graph (vertices 0-3) mapped one chare per vertex (vertex-centric) and several vertices per chare (subgraph-centric).]
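     To make the vertex-centric pattern concrete, here is a minimal Charm++-style sketch (not code from the talk; the names Vertex, visit, neighbors, and vertex.decl.h are illustrative assumptions) of one chare per vertex running an asynchronous BFS-like traversal purely through entry method calls:

       #include <vector>
       #include "vertex.decl.h"  // assumed .ci: array [1D] Vertex
                                 // { entry Vertex(); entry void visit(int level); };

       class Vertex : public CBase_Vertex {
         std::vector<int> neighbors;   // global indices of adjacent vertices
         int level;                    // -1 means "not visited yet"
       public:
         Vertex() : level(-1) {}
         Vertex(CkMigrateMessage *m) {}
         // Entry method: one small message per edge, i.e. fine-grained traffic.
         void visit(int l) {
           if (level != -1) return;    // already visited, ignore duplicates
           level = l;
           for (int u : neighbors)     // asynchronous active messages to neighbors
             thisProxy[u].visit(l + 1);
         }
       };
       // The driver detects termination of the asynchronous traversal with
       // CkStartQD(CkCallbackResumeThread()).

     With millions of vertices this creates far more chares than PEs and far more small messages than a coarse-grained code would generate, which is exactly the regime the rest of the talk targets.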

  5. HPCC RandomAccess
     Table size/PE: 2^20 × 8 bytes; HPC system: [2x Xeon E5-2630]/IB FDR; np=8, ppn=8.
     [Plot: performance (GUP/s, log scale) vs. number of chares (1 to 65536) for charm-randomaccess and tram-randomaccess.]

  6. uChareLib Programming Model & Design
     • uChareLib (micro-Chare Library) is a small extension of Charm++ that provides a way to mitigate RTS overheads for fine-grained parallelism:
       - a uchare object is introduced into the Charm++ model
       - entry method calls are supported for uchares
       - a uchare array is provided to define arrays of uchares (same as a chare array)
       - uchares are distributed between common chares
       - message aggregation is supported inside uChareLib
       - a new entry method type, reentrant, is added (only for uchares)
     • uChareLib can be downloaded from https://github.com/DISLab/xcharm
     [Figure: on each PE (PE[0]..PE[N-1]) a uChareSet chare holds a Proxy, an EP table, a slice of the uchare array A[...], and an Aggregator (Naive or TRAM).]
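     The following is an illustrative sketch only (it is not the uChareLib API; the names UChareSetSketch, MicroVertex, deliver, and the .decl.h file are assumptions) of the hosting idea in the diagram above: many lightweight "uchare" objects live inside one ordinary chare array element, and a single entry method dispatches incoming messages to them by local index.

       #include <vector>
       #include "ucharesetsketch.decl.h"   // assumed .ci: array [1D] UChareSetSketch
                                           // { entry UChareSetSketch(int); entry void deliver(int, double); };

       struct MicroVertex {                // a lightweight object, not a chare
         double state = 0.0;
         void update(double x) { state += x; }
       };

       class UChareSetSketch : public CBase_UChareSetSketch {
         std::vector<MicroVertex> locals;  // k "uchares" hosted by this chare
       public:
         UChareSetSketch(int k) : locals(k) {}
         UChareSetSketch(CkMigrateMessage *m) {}
         // Plays the role of the EP table: route a message to the addressed object.
         void deliver(int localIdx, double x) { locals[localIdx].update(x); }
       };

     uChareLib additionally provides the proxy/EP-table dispatch and the outgoing-message aggregation (naive or TRAM-based) shown in the diagram, which this sketch leaves out.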

  7. [Figure-only slide; no text was extracted.]

  8. Performance Evaluation: HPCC RandomAccess
     • The original TRAM implementation is used (from the Charm++ trunk).
     • The Charm++ and uChareLib implementations are straightforward conversions of the TRAM-based RandomAccess code.
     [Figure: the chare array of Updaters, each owning a local table, distributed across PEs (four chares per PE shown for PE0-PE3).]
     NB: the update function does not contain calls to other chares, so there are no nested calls (insertData/entry method) for TRAM and uChareLib.
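     For reference, the per-update work is just a local XOR into the owning chare's table; a minimal sketch of the update entry method (the member names localTable and localTableSize are assumptions, not necessarily those of the benchmark code) looks like this:

       void Updater::update(CmiUInt8 key) {
         // RandomAccess core: XOR the random key into the locally owned slot.
         localTable[key & (localTableSize - 1)] ^= key;
         // No calls to other chares are made here, which is why there are no
         // nested insertData()/entry-method calls for TRAM or uChareLib.
       }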

  9. Performance Evaluation: HPCC RandomAccess
     • The original TRAM implementation is used (from the Charm++ trunk).
     • The Charm++ and uChareLib implementations are straightforward conversions of the TRAM-based RandomAccess code.

     Charm++ & uChareLib (randomAccess.C):

       void Updater::generateUpdates() {
         int arrayN = N - (int) log2((double) numElementsPerPe);
         int numElements = CkNumPes() * numElementsPerPe;
         CmiUInt8 key = HPCC_starts(4 * globalStartmyProc);
         for (CmiInt8 i = 0; i < 4 * localTableSize; i++) {
           key = key << 1 ^ ((CmiInt8) key < 0 ? POLY : 0);
           int destinationIndex = (key >> arrayN) & (numElements - 1);
           thisProxy[destinationIndex].update(key);
         }
       }

     TRAM (randomAccess.C):

       void Updater::generateUpdates() {
         ...
         ArrayMeshStreamer<dtype, int, Updater, SimpleMeshRouter>
             *localAggregator = aggregator.ckLocalBranch();
         for (CmiInt8 i = 0; i < 4 * localTableSize; i++) {
           key = key << 1 ^ ((CmiInt8) key < 0 ? POLY : 0);
           int destinationIndex = (key >> arrayN) & (numElements - 1);
           localAggregator->insertData(key, destinationIndex);
         }
         localAggregator->done();
       }

  10. Performance Evaluation: HPCC RandomAccess (N = 2^20 per PE), HPC system: [2x Xeon E5-2630]/IB FDR
      [Plots: updates per PE per second (GUP/s, log scale) vs. chares/PE (1 to ~10^6) for charm++, tram, and ucharelib (ppn=8), on 2, 4, 8, and 16 nodes.]

  11. Performance Evaluation: HPCC RandomAccess (N = 2^22 per PE), HPC system: [2x Xeon E5-2630]/IB FDR
      [Plots: updates per PE per second (GUP/s, log scale) vs. chares/PE (1 to ~10^6) for charm++, tram, and ucharelib (ppn=8), on 2, 4, 8, and 16 nodes.]

  12. Performance Evaluation: PageRank
      • Problem description: iteratively compute ranks for all v ∈ G:
          PR_v^(i+1) = (1 − d)/N + d × Σ_{u ∈ Adj(v)} PR_u^(i) / L_u,
        where d is the damping factor, N the number of vertices, and L_u the out-degree of u (source: Wikipedia).
      • Implementations:
        - Charm++, naive
        - Charm++, with incoming message counting
        - TRAM, naive
        - uChareLib, naive
      NB: the update function does not contain calls to other chares, so there are no nested calls (insertData/entry method) for TRAM and uChareLib.
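      In LaTeX, the same update rule (standard PageRank notation) reads:

        PR_v^{(i+1)} = \frac{1-d}{N} + d \sum_{u \in \mathrm{Adj}(v)} \frac{PR_u^{(i)}}{L_u}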

  13. Performance Evaluation: PageRank, naive algorithm

      Algorithm: Naive PageRank
       1: function PageRankVertex::doPageRankStep_init
       2:     PR_old ← PR_new
       3:     PR_new ← (1.0 − d)/N
       4: end function
       5: function PageRankVertex::doPageRankStep_update
       6:     for u ∈ AdjList do
       7:         thisProxy[u].update(PR_old / L)
       8:     end for
       9: end function
      10: function PageRankVertex::update(r)
      11:     PR_new ← PR_new + d × r
      12: end function
      13: function TestDriver::doPageRank
      14:     for i = 0; i < N_iters; i ← i + 1 do
      15:         g.doPageRankStep_init()
      16:         CkStartQD(CkCallbackResumeThread())
      17:         g.doPageRankStep_update()
      18:         CkStartQD(CkCallbackResumeThread())
      19:     end for
      20: end function

      [Figure: TestDriver broadcasts doPageRankStep_init and then doPageRankStep_update to vertices G[0]-G[5]; update messages flow along the edges, and each phase is followed by quiescence detection.]
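      A hedged Charm++ rendering of the naive scheme (the member names adj, prOld, prNew, and pagerank.decl.h are illustrative; d and N are assumed to be available as readonly globals) could look like this:

        #include <vector>
        #include "pagerank.decl.h"   // assumed generated interface

        class PageRankVertex : public CBase_PageRankVertex {
          std::vector<int> adj;            // outgoing edges of this vertex
          double prOld, prNew;
        public:
          PageRankVertex() : prOld(0.0), prNew(1.0 / N) {}
          PageRankVertex(CkMigrateMessage *m) {}
          void doPageRankStep_init() {     // phase 1: swap and reset
            prOld = prNew;
            prNew = (1.0 - d) / N;
          }
          void doPageRankStep_update() {   // phase 2: scatter contributions
            for (int u : adj)
              thisProxy[u].update(prOld / adj.size());
          }
          void update(double r) {          // accumulate a neighbor's share
            prNew += d * r;
          }
        };

      Because the two phases must not overlap, TestDriver::doPageRank places a CkStartQD(CkCallbackResumeThread()) barrier after each broadcast, i.e. two quiescence detections per iteration, which is the overhead the message-counting variant on the next slide removes.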

  14. Performance Evaluation: PageRank, with counting of incoming messages

      Algorithm: PageRank with message counting
       1: function PageRankVertex::doPageRankStep
       2:     let PR_old refer to (n_iter % 2) ? rank_0 : rank_1
       3:     for u ∈ AdjList do
       4:         thisProxy[u].update(PR_old / L)
       5:     end for
       6: end function
       7: function PageRankVertex::update(r)
       8:     let PR_new refer to (n_iter % 2) ? rank_1 : rank_0
       9:     PR_new ← PR_new + d × r
      10:     n_msg ← n_msg − 1
      11:     if n_msg = 0 then
      12:         n_msg ← D_in
      13:         n_iter ← n_iter + 1
      14:         let PR_new refer to (n_iter % 2) ? rank_1 : rank_0
      15:         PR_new ← (1.0 − d)/N
      16:     end if
      17: end function
      18: function TestDriver::doPageRank
      19:     for i = 0; i < N_iters; i ← i + 1 do
      20:         g.doPageRankStep()
      21:         CkStartQD(CkCallbackResumeThread())
      22:     end for
      23: end function

      [Figure: TestDriver broadcasts doPageRankStep to vertices G[0]-G[5]; update messages flow along the edges, with a single quiescence detection per iteration.]
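      The message-counting variant needs only one quiescence detection per iteration because every vertex advances its own iteration locally once it has heard from all of its in-neighbors. A sketch of the update entry method in Charm++ (the members rank[2], nMsg, inDegree, and nIter are illustrative names for the double buffer and counters above) is:

        void PageRankVertex::update(double r) {
          double &prNew = (nIter % 2) ? rank[1] : rank[0];  // current buffer
          prNew += d * r;                  // accumulate this neighbor's share
          if (--nMsg == 0) {               // heard from all in-neighbors (D_in)
            nMsg = inDegree;               // re-arm the counter
            nIter++;                       // advance to the next iteration
            double &next = (nIter % 2) ? rank[1] : rank[0];
            next = (1.0 - d) / N;          // initialize the next iteration's buffer
          }
        }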

  15. Performance Evaluation: PageRank on Kronecker/Graph500 graphs, HPC system: [2x Xeon E5-2630]/IB FDR
      [Plots: PageRank performance; the annotations mark speedups of x6, x36, x16, and x3.2.]
