An Empirical Performance Study of Chapel Programming Language
Nan Dun✝ and Kenjiro Taura
The University of Tokyo
✝ dun@logos.ic.i.u-tokyo.ac.jp
Monday, May 21, 12

Background
  Modern parallel machines
    Massive parallelism: 100K~ cores
    Heterogeneous architecture: CPUs + GPGPUs
  Modern parallel programming languages
    Programmability, portability, robustness, performance
    Chapel, X10, Fortress, etc.

Motivation
  Programmability has been well illustrated
    Abstraction of parallelism
  Performance is yet unknown
    Performance implications
    Performance tuning
    Language improvements
  [Chart: "My First FMM Program in Chapel", Relative Elapsed Time (axis 0 to 30), Chapel vs. C]
  The performance should not surprise newbies...

Agenda
  Short overview of Chapel
  Approach
  Evaluation
  Microbenchmark results
  Suggestions for writing efficient Chapel programs
  N-body FMM results
  Conclusions

The Chapel Language
  Developed by Cray Inc., initiated by HPCS in 2003
  Designed to improve programmability
    Global view model vs. fragmented model
    Abstraction of parallelism (task parallelism, data parallelism, etc.)
    Object-oriented, generic programming
  For more details: http://chapel.cray.com

Evaluation Approach
  Chapel benchmarks (data structures, language features, etc.) -> intermediate C code -> assembly code -> executable
  Equivalent C implementation -> assembly code -> executable
  Comparisons: intermediate C code vs. C implementation, assembly code vs. assembly code
  Both executables produce the performance results

Environment
  Xeon 2.33GHz 8 core CPU, 32GB MEM
  Linux 2.6.26, GCC 4.6.2, Chapel 1.4.0
  Compile options
      $ chpl -o prog --fast prog.chpl    // Chapel
      $ gcc -o prog -O3 -lm prog.c       // C
  Use "--savec" to keep intermediate C code
  "$CHPL_COMM=none" for single locale, malloc series used
  Synthesized benchmarks from N-Body simulations

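The charts in this deck report elapsed time in microseconds and "Relative Performance (vs. Cref)". The slides do not show the timing code, so the following is only a minimal C sketch of how such numbers might be obtained. The helper names (now_monotonic, elapsed_usec, relative_performance) are hypothetical, and the ratio t_ref / t_test is an assumed definition chosen so that 1.0 means parity with the C reference.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under strict -std modes */
#include <assert.h>
#include <time.h>

/* Read a monotonic clock, which is not affected by wall-clock adjustments. */
struct timespec now_monotonic(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;  /* keeps the variable used when NDEBUG disables assert */
    return ts;
}

/* Elapsed time between two readings, in microseconds. */
double elapsed_usec(struct timespec start, struct timespec end) {
    return (double)(end.tv_sec - start.tv_sec) * 1e6
         + (double)(end.tv_nsec - start.tv_nsec) / 1e3;
}

/* Assumed definition: reference time divided by measured time,
 * so values below 1.0 mean "slower than the C reference". */
double relative_performance(double t_ref_usec, double t_test_usec) {
    assert(t_test_usec > 0.0);
    return t_ref_usec / t_test_usec;
}
```

A benchmark would take one reading before and one after the kernel under test, for both the Chapel-generated and the hand-written executable, and pass the two elapsed times to relative_performance.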
Primitive Types (1/3)
  Chapel source:
      var res: int(32);
      for i in 1..N do res = res + i;
  Types compared: int(32) vs. C int32, int(64) vs. C int64, real(32) vs. C float, real(64) vs. C double
  Generated C (real(32) case):
      while (...) {
        T1 = ((_real32)(i));
        T2 = (resReal32 + T1);
        resReal32 = T2;
        i = ...;
      }
  Assembly:
      .L1046:
        cvtsi2ss %eax, %xmm0
        addl     $1, %eax
        cmpl     %eax, %r12d
        addss    %xmm2, %xmm0
        movaps   %xmm0, %xmm2
        jge      .L1046
  The redundant instruction can be removed by combining T2 assignments
  [Chart: Relative Performance (vs. Cref), scale 0 to 1, for add, sub, mul, div]

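The hand-written C reference for this microbenchmark is not shown on the slide. A plausible shape, given here only as a sketch, is a plain accumulation loop over 1..N for each primitive type. The function names sum_int32 and sum_real32 are hypothetical, and only the add variant is illustrated.

```c
#include <assert.h>
#include <stdint.h>

/* Integer accumulation over 1..n, mirroring "res = res + i" for int(32).
 * n must be small enough that the sum fits in int32_t. */
int32_t sum_int32(int32_t n) {
    assert(n >= 0);
    int32_t res = 0;
    for (int32_t i = 1; i <= n; i++)
        res += i;
    return res;
}

/* Same loop for real(32): the loop index is converted to float on each
 * iteration, which is where a cvtsi2ss instruction comes from. */
float sum_real32(int32_t n) {
    assert(n >= 0);
    float res = 0.0f;
    for (int32_t i = 1; i <= n; i++)
        res += (float)i;
    return res;
}
```

Writing the update as a single statement, res += (float)i, avoids the extra temporary that the generated code introduces through T2.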
Primitive Types (2/3)
  Chapel source:
      var arr: [1..N] int;   // int and real
      for d in arr.domain do
        res = res + arr(d);  // read only
  Types compared: int vs. C int, real vs. C double
  Generated C:
      while (T80) {
        _ret42 = arrInt;
        _ret43 = (_ret42->origin);
        _ret_10 = (&(_ret42->blk));
        _ret_x110 = (*_ret_10)[0];
        T82 = (i5 * _ret_x110);
        T83 = (_ret43 + T82);
        _ret44 = (_ret42->factoredOffs);
        T84 = (T83 - _ret44);
        T85 = (_ret42->data);
        T86 = (&((T85)->_data[T84]));
        _ret45 = *(T86);
        T87 = (resInt / _ret45);
        resInt = T87;
        T88 = (i5 + 1);
        i5 = T88;
        T89 = (T88 != end5);
        T80 = T89;
      }
  Vectorization report:
      $ gcc ... -ftree-vectorize -ftree-vectorizer-verbose=5
  [Chart: Relative Performance (vs. Cref), scale 0 to 1, for add, sub, mul, div]

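The generated loop recomputes each element address from the array descriptor as origin + i * blk - factoredOffs, while a hand-written C loop indexes the buffer directly. The sketch below contrasts the two forms. The array_desc struct is hypothetical: it borrows the field names visible in the generated code but flattens the data pointer, and the values origin = 0, blk = 1, factoredOffs = 1 for a 1-based array are an assumption. Addition is used to match the Chapel source line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified array descriptor (field names taken from
 * the generated code on the slide; the real runtime type differs). */
typedef struct {
    int64_t origin;
    int64_t blk;
    int64_t factoredOffs;
    const int64_t *data;
} array_desc;

/* Descriptor-based access: the offset is recomputed on every iteration. */
int64_t sum_via_descriptor(const array_desc *a, int64_t lo, int64_t hi) {
    assert(a != NULL && a->data != NULL);
    int64_t res = 0;
    for (int64_t i = lo; i <= hi; i++) {
        int64_t off = a->origin + i * a->blk - a->factoredOffs;
        res += a->data[off];
    }
    return res;
}

/* Direct access, as a hand-written C reference would do it. */
int64_t sum_direct(const int64_t *data, size_t n) {
    assert(data != NULL || n == 0);
    int64_t res = 0;
    for (size_t i = 0; i < n; i++)
        res += data[i];
    return res;
}
```

Both functions visit the same elements; the difference is how much address arithmetic the compiler must see through before it can optimize or vectorize the loop.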
Primitive Types (3/3)
  Chapel source:
      var arr: [1..N] int;     // int and real
      for d in arr.domain do
        arr(d) = arr(d) + d;   // read + write
  Types compared: int vs. C int, real vs. C double
  # Assembly of Chapel C mappings
      .L1046:
        cvtsi2sd %edx, %xmm1
        addl     $1, %edx
        movsd    (%rax), %xmm0
        divsd    %xmm1, %xmm0
        movsd    %xmm0, (%rax)
        addq     %rcx, %rax
        cmpl     %edx, %r12d
        jne      .L1046
  # Assembly of hand-written C
      .L32:
        leal     (%rsi,%rax), %ecx
        movsd    (%rdx,%rax,8), %xmm0
        cvtsi2sd %ecx, %xmm1
        divsd    %xmm1, %xmm0
        movsd    %xmm0, (%rdx,%rax,8)
        addq     $1, %rax
        cmpq     %rdi, %rax
        jne      .L32
  LEA instruction is executed by a separate addressing unit
  [Chart: Relative Performance (vs. Cref), scale 0 to 1, for asg, add, sub, mul, div]

Structured Types (1/3)
  Tuple:
      var Tuple: (real, real, real);
      var 2D_Tuple: (Tuple, Tuple, Tuple);
  C Mapping of Tuple:
      double Tuple[3];
      double Tuple[3][3];
  Record:
      record Record {
        var x, y, z: real;
      }
      record 2D_Record {
        var x, y, z: Record;
      }
  C Mapping of Record:
      struct Record {
        double x, y, z;
      }
      struct 2D_Record {
        struct Record x, y, z;
      }

Structured Types (2/3)
  Walk through the array and manipulate each element
  [Chart: Relative Performance (vs. Cref), scale 0 to 1, for asg, add, sub, mul, div. Series: tuple vs. C array, tuple+ vs. C array, record vs. C struct, record+ vs. C struct, 2D-tuple vs. C 2D-array, 2D-tuple+ vs. C 2D-array, 2D-record vs. C 2D-struct, 2D-record+ vs. C 2D-struct]

Structured Types (3/3)
  Redundant address substitution in 2D-Tuple
    Asm: 197 vs. 33 of C ref
    Complex for GCC to optimize
      Data references
      Redundant operations
  May be related to construction of heterogeneous tuple
  Generated C:
      while (...) {
        _tmp_37 = (&(_ret57[0]));
        _tmp_x139 = (*_tmp_37)[0];
        _tmp_x239 = (*_tmp_37)[1];
        _tmp_x339 = (*_tmp_37)[2];
        ...
        chpl__tupleRestHelper(...)
        ...
        T297[0] = _tmp_x139;
        T297[1] = _tmp_x239;
        T297[2] = _tmp_x339;
        ...
      }

Iterators for Loops (1/2)
      iter myIter(min: int, max: int, step: int = 1) {
        var i = min;  // formals are const by default, so advance a local copy
        while i <= max {
          yield i;
          i += step;
        }
      }

      // Nested loops: the four inner loops below are alternatives
      var dom = [1..N];  // or 1..N
      for i in 1..M do
        for j in [1..N] do ...;         // domain
        for j in 1..N do ...;           // range
        for j in dom do ...;            // pre-defined domain
        for j in myIter(1, N) do ...;   // iterator

Iterators for Loops (2/2)
  Generated C for each inner-loop form:
      // Domain: [1..N]
      chpl__buildDomainExpr(...);
      while (loop_variable) { ... }
      chpl__autoDestroy(...);

      // Range: (1..N) or 1..N
      _build_range(...);
      while (loop_variable) { ... }

      // Pre-defined domain
      _ret10 = dom; ...
      _ret12 = (T45._low);
      _ret13 = (T45._high); ...
      while (loop_variable) { ... }

      // User defined iterator
      while (loop_variable) { ... }
  [Chart: Elapsed Time (usec, log-scale), 1E+02 to 1E+06, for Inner Loop=1 and Inner Loop=100. Series: Domain, Range, Pre-defined domain, Iterator. Annotated ratios: 6x, 890x, 1.2x, 42x, 3.x]

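The generated code for the domain form builds and destroys the loop's index set around every inner loop, which matters most when the inner loop is short. The C sketch below only mimics that shape with a heap-allocated bounds object; the bounds type and both function names are hypothetical and this is not the Chapel runtime. It contrasts rebuilding the object on every outer iteration with building it once outside the loop nest, which corresponds to the pre-defined domain form.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a runtime-constructed index set. */
typedef struct { long lo, hi; } bounds;

static bounds *bounds_new(long lo, long hi) {
    bounds *b = malloc(sizeof *b);
    if (b != NULL) { b->lo = lo; b->hi = hi; }
    return b;
}

/* Builds and frees the bounds object once per outer iteration.
 * Returns the total inner-iteration count, or -1 on allocation failure. */
long count_rebuilt(long m, long n) {
    long count = 0;
    for (long i = 1; i <= m; i++) {
        bounds *b = bounds_new(1, n);
        if (b == NULL) return -1;
        for (long j = b->lo; j <= b->hi; j++)
            count++;
        free(b);
    }
    return count;
}

/* Builds the bounds object once, outside the loop nest. */
long count_hoisted(long m, long n) {
    bounds *b = bounds_new(1, n);
    if (b == NULL) return -1;
    long count = 0;
    for (long i = 1; i <= m; i++)
        for (long j = b->lo; j <= b->hi; j++)
            count++;
    free(b);
    return count;
}
```

Both versions perform the same inner iterations; the first pays m construction and destruction costs, the second pays one.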
Domain and Array
      var rctDom3D: domain(3) = [1..N, 1..N, 1..N];  // rectangular domain
      var rctArr3D: [rctDom3D] real;
      var irrDom3D: domain(3*int);                   // irregular domain
      var irrArr3D: [irrDom3D] real;
  Domain, i.e. index set
  Array, i.e. space allocation
  [Two charts: Relative Performance (vs. Cref), log-scale, for alloc, add, sub, mul, div. Series: 1D-Rect, 1D-Associate, 2D-Rect, 2D-Associate, 3D-Rect, 3D-Associate]

Domain Maps (1/2)
      var space = [1..N, 1..N];
      var blockSpace = space dmapped Block(space);
      var arrBlock: [blockSpace] real;
      var cyclicSpace = space dmapped Cyclic(space);
      var arrCyclic: [cyclicSpace] real;
      var blkCycSpace = space dmapped BlockCyclic(space);
      var arrBlkCyc: [blkCycSpace] real;
      var replicatedSpace = space dmapped ReplicatedDist();
      var arrRep: [replicatedSpace] real;

      for d in arr.domain do
        on Locales(here.id) do
          /* arithmetic on arr(d) */
  [Figure: Block Distribution and Cyclic Distribution of the index space over locales L0 to L7]

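To make the difference between the Block and Cyclic layouts concrete, the following C sketch computes which locale owns a given index in one dimension. It uses the usual textbook mappings (contiguous chunks for block, round-robin for cyclic); the function names are hypothetical, and Chapel's actual Block and Cyclic implementations, which are multidimensional, may differ in details.

```c
#include <assert.h>

/* Block: indices lo..lo+n-1 are split into nloc contiguous chunks. */
int block_owner(long i, long lo, long n, int nloc) {
    assert(n > 0 && nloc > 0 && i >= lo && i < lo + n);
    return (int)(((i - lo) * nloc) / n);
}

/* Cyclic: indices are dealt out to locales in round-robin order. */
int cyclic_owner(long i, long lo, int nloc) {
    assert(nloc > 0 && i >= lo);
    return (int)((i - lo) % nloc);
}
```

With indices 1..8 on 4 locales, block_owner places indices 1 and 2 on locale 0 and index 8 on locale 3, while cyclic_owner places indices 1 and 5 on locale 0. Neighbouring indices share a locale under the block mapping and land on different locales under the cyclic one.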
Domain Maps (2/2)
  [Chart, Single Locale: Elapsed Time (usec, log-scale), 1E+03 to 1E+06, for asg (wo), add, sub, mul, div. Series: Block ro, Block rw, Cyclic ro, Cyclic rw, BlockCyclic ro, BlockCyclic rw, Replicated ro, Replicated rw]
  [Chart, Two Locales: Elapsed Time (usec, log-scale), 1E+04 to 1E+09, same operations and series]
  Two locales: 300Kbps achieved << 434Mbps measured by Iperf

Speeding Up the FMM Application
  Manipulate a large array of structured elements
    Use record instead of tuple
  Optimize small inner loop
  Auxiliary data structure
    Use rectangular domain instead of associative domain
  Reduce locks to improve scalability (increase computation in some cases)

Molecular Dynamics (1/2)
  Fast Multipole Method
    Calculate the N-body interactions in O(N) time
  [Chart: Relative Performance (vs. Serial Cref), scale 0 to 1. Series: Parallel Version, Serial Version]

Molecular Dynamics (2/2)
  [Chart: Speedup (1 to 8) vs. # of Threads (1 to 8). Series: N=8^3, N=16^3, N=32^3, N=64^3]

Related Work
  Evaluations of the Chapel language
    Programmability [Chamberlain et al. ’06, ’07, ’08, ’11]
    Performance potential [Barrett et al. ’08]
    HPCC benchmark [Chamberlain et al. ’11]
      95% for EP STREAM & 50% for Random Access
    Task parallel feature [Weiland et al. ’09]
    On GPGPU [Ren et al. ’11]

Conclusions
  Chapel can achieve performance comparable to C
    70%~ on single locale (w/ current v1.4.0)
  Users should be aware of performance implications
    Choose proper data structures
    Write programs in a proper structure
  Current performance penalties are FIXABLE
    By improving the Chapel compiler

Questions?