  1. NP SciDAC Project: JLab Site Report
     Bálint Joó, Jefferson Lab, Oct 18, 2013
     Thomas Jefferson National Accelerator Facility

  2. JLab Year 1 Tasks
     • T1: Extend the Just-In-Time (JIT) based version of QDP++ to use multiple GPUs, for deployment on large-scale GPU-based systems
     • T2: Optimize QDP++ and Chroma for large-scale multi-GPU resources
     • T3: Continue the collaboration with Intel Corporation on MIC
     • T4: Develop a generalized contraction code suitable for a small number of initial algorithms and final-state particles, using a three-dimensional implementation over QDP++
     • T5: Implement "distillation" for the study of hadronic structure and matrix elements
     • T6: Optimize QDP++ and Chroma to efficiently use the new floating-point features of the BG/Q

  3. Tasks 1 & 2: QDP++ & Chroma on GPU
     • Status: Done
       - QDP-JIT/PTX is in production on Titan, using Chroma + QUDA solvers and the device interface
       - Paper in preparation for IPDPS
     • Staffing:
       - Frank Winter and Bálint Joó; NVIDIA colleague (not funded by us): Mike Clark

  4. A Sampling of Results
     (Plot: achieved memory bandwidth vs. local lattice size, with the "good mem B/W" region and the shoulder marked)
     • NVIDIA K20m GPUs have a maximum memory bandwidth of ~180 (208) GB/sec with ECC on (off)
     • QDP-JIT/PTX achieves 150 (162) GB/sec, i.e. ~83% (~78%) of peak with ECC on (off); see the arithmetic check below
     • Maximum performance is reached around a 12^4 to 14^4 local lattice on a single node
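     The quoted efficiencies are simply the ratio of achieved to peak bandwidth. A minimal check in C++, using only the numbers from the bullets above:

       #include <cstdio>

       int main() {
           // Peak memory bandwidth of a K20m (GB/s), from the slide: ECC on / ECC off
           const double peak_ecc_on = 180.0, peak_ecc_off = 208.0;
           // Bandwidth achieved by QDP-JIT/PTX (GB/s), ECC on / ECC off
           const double got_ecc_on = 150.0, got_ecc_off = 162.0;

           std::printf("ECC on : %.0f%% of peak\n", 100.0 * got_ecc_on / peak_ecc_on);   // ~83%
           std::printf("ECC off: %.0f%% of peak\n", 100.0 * got_ecc_off / peak_ecc_off); // ~78%
           return 0;
       }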

  5. Large Scale Running
     V = 40x40x40x256, mπ ~ 230 MeV, anisotropic clover
     (Plot: trajectory time in seconds vs. number of Titan nodes, 400 to 1600, for CPU (all MPI) and GPU (QDP-JIT/PTX + QUDA); annotations mark 0.5x execution time = 2x speedup, and 0.65x execution time = 1.53x speedup at the start of the shoulder region)
     • Runs from Titan in the summer
     • In terms of local volume we hit the "shoulder region" at 400 GPUs (~14^4 sites/node); local volume 40x8x8x16 = 40960 sites (see the sketch below)
     • Cannot strong-scale this volume beyond 800 nodes
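     One way to see where a given partitioning hits the shoulder is to compare the per-node site count against the roughly 14^4 single-node sweet spot from the previous slide. A minimal sketch; the lattice size and node counts are taken from the slide, and the 14^4 threshold is the rule of thumb quoted there:

       #include <cstdio>

       int main() {
           // Global anisotropic-clover lattice from the slide: 40^3 x 256
           const long global_sites = 40L * 40L * 40L * 256L;   // 16,384,000 sites
           const long shoulder     = 14L * 14L * 14L * 14L;    // ~14^4 sites/node sweet spot

           // Node counts sampled in the Titan strong-scaling runs
           const int nodes[] = {400, 800, 1600};
           for (int n : nodes) {
               const long local = global_sites / n;
               std::printf("%4d nodes: %6ld sites/node %s\n", n, local,
                           local >= shoulder ? "(at or above the shoulder)" : "(below the shoulder)");
           }
           return 0;
       }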

  6. Collaboration with SUPER
     • New student on the project: Diptorup Deb
     • Identifying potential inefficiencies in QDP-JIT/PTX/LLVM
       - e.g. C++ literal constants used in operations impact performance: up to 20% overhead in certain operations
     • Future work on QDP-JIT (see the sketch below):
       - code refactoring (improve performance of huge kernels; requires LLVM work)
       - kernel fusion (requires higher-level work)
     • Staffing: Frank Winter (JLab); Rob Fowler, Allan Porterfield, Diptorup Deb (RENCI)
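     As an illustration of why kernel fusion matters for a bandwidth-bound framework like QDP-JIT, one can count the memory traffic of evaluating an expression through a temporary versus in a single fused kernel. This is a schematic traffic model only, not QDP-JIT code; the 24-floats-per-site payload is an illustrative assumption:

       #include <cstdio>

       int main() {
           const double sites = 16384000.0;        // e.g. the 40^3 x 256 lattice above
           const double bytes_per_site = 24 * 4;   // a single-precision spin-colour vector (illustrative)

           // r = a*x + y via a temporary t = a*x, then r = t + y:
           // kernel 1 reads x and writes t; kernel 2 reads t and y and writes r.
           const double unfused = sites * bytes_per_site * (1 + 1 + 2 + 1);
           // A fused kernel reads x and y once and writes r once.
           const double fused   = sites * bytes_per_site * (2 + 1);

           std::printf("unfused traffic: %.1f GB\n", unfused / 1e9);
           std::printf("fused traffic  : %.1f GB\n", fused / 1e9);
           return 0;
       }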

  7. Task 3: Xeon Phi & Work with Intel
     • Status
       - Paper at ISC'13, Leipzig, June 16-20, 2013
       - Running code on Stampede @ TACC for testing purposes
       - Current focus: integration with Chroma, boundaries, anisotropy, double precision
     • Prognosis: Excellent
       - Intel colleagues are engaged and enthusiastic
     • Staffing:
       - Bálint Joó, Jie Chen, plus Intel colleagues (not funded by us): primarily M. Smelyanskiy, D. Kalamkar, K. Vaidyanathan
     • Future steps:
       - Aiming for a public code release soon

  8. Optimizing QDP++ on Xeon Phi
     (Plot: single-precision QDP++ "parscalarvec" performance vs. problem size on a Xeon Phi for several threads-per-core settings, compared with Xeon host code)
     • QDP++ "parscalarvec": work by Jie Chen & B. Joó
       - vector-friendly layout in QDP++
       - a single Xeon Phi is comparable to 2 SNB sockets (no intrinsics, no prefetch)
       - intrinsic-free "parscalarvec" host code is comparable to SSE-optimized host code for single precision

  9. Optimized Code: Wilson Dslash
     (Bar chart: Wilson Dslash performance in GFLOPS on dual-socket Xeon (SNB) nodes, a Xeon Phi coprocessor, and an NVIDIA Kepler GPU, for several lattice volumes)
     • Blocking scheme maximises the number of cores used
     • SOA layout with a tuned "inner array length" (see the sketch below)
     • CPU performance is excellent also (used 2.6 GHz SNB)
     • Here a Xeon Phi is comparable to 4 sockets
     From: B. Joó, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnani, V. W. Lee, P. Dubey, W. Watson III, "Lattice QCD on Intel(R) Xeon Phi(tm) Coprocessors", Proceedings of ISC'13 (Leipzig), Lecture Notes in Computer Science Vol 7905 (to appear)
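     The "SOA layout with a tuned inner array length" is an array-of-structures-of-arrays: the innermost dimension is a short block of neighbouring sites sized to the SIMD width, so each load fills a vector register. A minimal sketch of such a layout follows; the type names and the specific SOALEN values are illustrative, not the actual data structures from the paper:

       #include <cstddef>

       // One block of SOALEN consecutive sites; 4 spins x 3 colours x re/im,
       // each stored contiguously so a single vector load fills a SIMD register.
       template <typename FT, int SOALEN>
       struct SpinorBlock {
           FT data[4][3][2][SOALEN];
       };

       // A field stored as an array of such blocks (AoSoA). SOALEN is the
       // tunable inner array length, e.g. 8 or 16 floats to match the
       // 512-bit vectors on Xeon Phi.
       template <typename FT, int SOALEN>
       struct SpinorField {
           SpinorBlock<FT, SOALEN>* blocks;
           std::size_t              nblocks;
       };

       // Instantiations one might compare when tuning the inner array length:
       template struct SpinorBlock<float, 8>;
       template struct SpinorBlock<float, 16>;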

  10. Multi-Node Performance
     (Plots: Wilson Dslash and Wilson CG single-precision performance vs. number of Xeon Phi nodes, 1 to 32, for two lattice volumes)
     • 2D comms only (Z & T) for the Wilson Dslash
       - vectorization mixes X & Y
     • Intel Endeavor cluster, 1 Xeon Phi device per node
     • MPI proxy (see the sketch below):
       - pick the fastest-bandwidth path between devices (via the host in this case)
       - similar to GPU strong scaling at this level (expected)
     • Space to explore here: e.g. CCL-Proxy, MVAPICH2-MIC
     From: B. Joó, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnani, V. W. Lee, P. Dubey, W. Watson III, "Lattice QCD on Intel(R) Xeon Phi(tm) Coprocessors", Proceedings of ISC'13 (Leipzig), Lecture Notes in Computer Science Vol 7905 (to appear)
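     The proxy idea is to route the coprocessor's halo traffic over whichever path offers the most bandwidth, here via the host. A schematic host-side relay is sketched below; the function name, the staging placeholders, and the use of MPI_Sendrecv_replace are illustrative and are not the CML proxy's actual interface:

       #include <mpi.h>
       #include <vector>

       // Host-side relay: exchange a halo buffer on behalf of the coprocessor.
       // The stage_from_device / stage_to_device comments stand in for the real
       // host<->coprocessor transfer a proxy such as CML would perform.
       void relay_halo(std::vector<char>& buf, int peer_rank, int tag, MPI_Comm comm)
       {
           // stage_from_device(buf);   // host pulls the halo off the card
           MPI_Sendrecv_replace(buf.data(), (int)buf.size(), MPI_CHAR,
                                peer_rank, tag, peer_rank, tag,
                                comm, MPI_STATUS_IGNORE);
           // stage_to_device(buf);     // host pushes the received halo back
       }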

  11. Weak Scaling on Stampede
     (Plot: Wilson Dslash weak scaling, 48x48x24x64 sites per node, single precision; GFLOPS per node vs. number of nodes, 1 to 128, without proxy and with the CML proxy; annotations mark the regions with communication in 1 and 2 dimensions)
     • Without proxy
       - drop in performance when going to multiple nodes
       - performance halves when introducing a second comms direction
       - suggests the issue is with asynchronous progress rather than attainable bandwidth or latency
     • With proxy
       - small drop in performance from 1D to 2D comms; more likely due to B/W constraints
     • Dslash scaled to 31 TF on 128 nodes (CG to 16.8 TF)

  12. Strong Scaling
     (Plot: Dslash and CG GFLOPS vs. number of nodes, 4 to 64, for 48x48x48x256 sites, strong scaling, single precision, using the CML proxy, on Endeavor and Stampede)
     • Endeavor results are from the ISC'13 paper
     • Need to understand the performance difference better, but:
     • Stampede
       - no icache_snoop_off
       - the shape difference may be due to a different virtual topology on 16 nodes
     • Over 4 TF reached in Dslash and 3.6 TF in CG on Stampede

  13. Tasks 4 & 5: Contractions etc.
     • Status
       - "redstar": generalized contraction code
         • computes quark-line diagrams and generates the list of propagators for Chroma to compute
       - "harom": 3D version of QDP++
         • performs timeslice-by-timeslice contractions
       - In production now: used for multi-particle 2pt and 3pt calculations with the variational method
     • Staff: Robert Edwards, Jie Chen
     • Follow-on tasks (see the sketch below):
       - more optimization via BLAS/LAPACK library integration (GPU acceleration via CUBLAS?)
       - support "isobar"-like operator constructions (recursive contractions)
       - improve I/O performance
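     Because harom works timeslice by timeslice, each contraction step reduces to independent dense complex matrix products, which is what makes BLAS (and batched CUBLAS on GPUs) attractive for the follow-on optimization task. A minimal sketch under that assumption follows; the function name, the flattening into N x N matrices per timeslice, and the row-major storage are illustrative choices, not redstar/harom's actual interfaces:

       #include <cblas.h>
       #include <complex>
       #include <cstddef>
       #include <vector>

       // Contract two timeslice operators into C_t = A_t * B_t for each timeslice t.
       // Each timeslice is an N x N complex matrix stored contiguously, row-major.
       void contract_timeslices(const std::vector<std::complex<double>>& A,
                                const std::vector<std::complex<double>>& B,
                                std::vector<std::complex<double>>& C,
                                int N, int Lt)
       {
           const std::complex<double> one(1.0, 0.0), zero(0.0, 0.0);
           for (int t = 0; t < Lt; ++t) {
               const auto* At = &A[(std::size_t)t * N * N];
               const auto* Bt = &B[(std::size_t)t * N * N];
               auto*       Ct = &C[(std::size_t)t * N * N];
               // One ZGEMM per timeslice; on a GPU this loop maps onto batched CUBLAS.
               cblas_zgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                           N, N, N, &one, At, N, Bt, N, &zero, Ct, N);
           }
       }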

  14. Task 6: Chroma on BG/Q
     • Status: Slight progress
       - Chroma has compiled with XLC on BG/Q, but performance is low
       - Have a Clover solver through BAGEL, not yet integrated
       - IBM has re-coded the SSE cpp_wilson_dslash package for BG/Q under contract with ANL
         • Integrated this on Cetus; best observed performance is on the order of ~11-12% of peak in my single-node tests, and ~7-9% of peak when communicating in all directions
       - Tried parscalarvec on a single node, but this still needs work
     • Staffing:
       - Bálint Joó
