acesiii
play

ACESIII Outline Collaborators Design philosophy Mr. Mark - PowerPoint PPT Presentation

ACESIII Outline Collaborators Design philosophy Mr. Mark Ponton, ACES Q. C. (SIP/SIAL/Compiler) Implementation Dr. Norbert Flocke, QTP Results (Integral package) Conclusions Dr. Erik Deumens,


  1. ACESIII • Outline • Collaborators • Design philosophy • Mr. Mark Ponton, ACES Q. C. (SIP/SIAL/Compiler) • Implementation • Dr. Norbert Flocke, QTP • Results (Integral package) • Conclusions • Dr. Erik Deumens, QTP (Architect) • Dr. Ajith Perera, QTP • Dr. H. Lei, ACES Q. C. (Compiler) • Dr. Anthony Yau, HPTi • Dr. Rodney Bartlett, QTP ACES Q. C.

  2. Traditional Design control compute code communication disk input output hardware

  3. ACESIII Design control code compute communication Disk I/O hardware

  4. ACESIII design High level Problem Performance Low level concepts communication Data structures algorithms Input/output Super instruction Super instruction Assembly language Processor SIAL SIP (xaces3) input output

  5. SIAL (Super Instruction Assembly Language) • Key features • Advantageous • Index segmentation • Flexibility • Data blocking • Tune ability: Fast • Task isolation optimization • New methods implemented in reduced time • Portable

  6. Implemented • SCF • RHF, UHF • MBPT(2) gradient • RHF, UHF, ROHF • CCSD gradient • RHF, UHF • CCSD(T) • RHF, UHF • MBPT(2) Hessian • RHF, UHF, ROHF • EOM CCSD (Tomasz • RHF, UHF Kus)

  7. DMMP MBPT(2) gradient timings N bf = 397, N corr occ = 33 Ideal 689 min Actual Normal scaling region Time/[sec] 4 10 Super scaling region 46 min 43 min 1 2 8 10 10 128 Number of processors

  8. CCSD(T) • SCF • Easy if you have a good integrals package • Hard but small cost • Transformation • Hard as highly • CCSD nonlinear • Trivial !!! At least that • CCSD(T) is the common wisdom

  9. E4 o4 (T) Strategy E3 o3 E(TOTAL) occupied E2 o2 E1 o1

  10. E4 o4 (T) Strategy E3 o3 E(TOTAL) occupied o2 E2 E1 o1

  11. Advantages of DUAL layer parallelism • Less data replication or I/O bottlenecks • Trivial restart capability • Better turnaround due to queuing • Since more processors are used the effective (T) time is comparable to the CCSD time making the CCSD as/more important that the (T)!

  12. CCSD(T) • Luciferin( C 11 H 8 O 3 S 2 N 2 ) • Sucrose ( C 12 H 22 O 11 ) • RHF • RHF • C 1 symmetry • C 1 symmetry • Basis = aug-cc-pvdz • Basis = 6-311G** (498bf) (546bf) • N corrocc = 46 • 68

  13. Luciferin CCSD timings, N bf = 498, N corr occ = 46 Ideal 2 10 Actual 115.9 Normal scaling region Time/iteration [min] Super scaling region 14.5 13.1 1 10 2 10 32 256 Number of processors

  14. Luciferin CCSD timings, N bf = 498, N corr occ = 46 Ideal 2 10 Actual 115.9 Normal scaling region Time/iteration [min] CCSD(T)=420 min/8 orb Super scaling region 14.5 13.1 1 10 2 10 32 256 Number of processors

  15. Sucrose CCSD timings, N bf = 546, N corr occ = 68 3 10 Ideal Actual 908.6 Time/iteration [min] Normal scaling region 2 10 56.8 Super scaling region 24.0 2 10 32 512 Number of processors

  16. DMMP+OH • H 10 C 3 O 4 P • Number of processors = 64 • C 1 • Time CCSD = 69 • 208 basis functions minutes (3.8 min/iter) • 75 electrons • Time (T) =111 minutes *** • *** 7 dual jobs

  17. Systematic set of Benchmarks • Why? To remove confusion over technological verses algorithmic advances. • Allow users informed choices. • Provide a set of calculations to evaluate each program so strengths and weaknesses become evident. • Remove ambiguity in literature.

  18. Ar N Cluster Benchmarks(Performance) • Specifications(Mine!) • Methods • N=6 • MBPT(2) gradient • UHF • CCSD gradient • C 1 symmetry • CCSD(T) (core dropped) • Basis = aug-cc-pvtz (300bf) • MBPT(2) Hessian (RHF) • N corrocc = 54 • R = 5 bohr

  19. Ar6 MBPT(2) gradient timings, N bf = 300, N corr occ = 54 C 1 Symmetry 2 10 Ideal Actual 67 Normal scaling region Time [min] 16.0 Super scaling region 1 10 8.4 2 10 32 256 Number of processors

  20. Ar6 UCCSD timings, N bf = 300, N corr occ = 54, C 1 Symmetry Ideal 2 10 Actual 103.5 Normal scaling region Time/iteration [min] Super scaling region 13.6 12.9 1 10 2 10 32 256 32 32 256 Number of processors

  21. Ar6 ULAMBDA timings, N bf = 300, N corr Ar6 ULAMBDA timings, N bf = 300, N corr Ar6 ULAMBDA timings, N bf = 300, N corr Ar6 ULAMBDA timings, N bf = 300, N corr occ = 54, C 1 Symmetry occ = 54, C 1 Symmetry occ = 54, C 1 Symmetry occ = 54, C 1 Symmetry Ideal Ideal Ideal Ideal 2 2 2 Actual Actual Actual 10 10 10 2 Actual 10 119.7 119.7 119.7 Normal scaling region Normal scaling region Normal scaling region Normal scaling region Time/iteration [min] Time/iteration [min] Time/iteration [min] Time/iteration [min] Super scaling region Super scaling region Super scaling region 16.0 Super scaling region 16.0 16.0 15.0 15.0 16.0 15.0 15.0 2 2 2 10 10 10 2 32 256 10 32 32 256 256 Number of processors Number of processors Number of processors Number of processors

  22. Ar6 UGRAD timings, N bf = 300, N corr occ = 54 C 1 Symmetry Ideal 3 10 Actual 1273 Normal scaling region Time [min] Super scaling region 159 141 2 10 256 32 Number of processors

  23. bf = 300, N corr bf = 300, N corr Ar6 Total time for one UHF CCSD gradient, N Ar6 Total time for one UHF CCSD gradient, N occ = 54 occ = 54 C 1 Symmetry C 1 Symmetry Ideal Ideal Actual Actual 3505 3505 Normal scaling region Normal scaling region Time [min] Time [min] 3 3 10 10 Super scaling region Super scaling region 438 437 437 438 2 2 10 10 32 256 32 256 Number of processors Number of processors

  24. Ar6 UCCSD(T) timings, N bf = 300, C 1 Symmetry Ar6 UCCSD(T) timings, N bf = 300, C 1 Symmetry Ideal Ideal Actual Actual 784 784 Normal scaling region Normal scaling region Time[min] Time[min] 131 131 Super scaling region Super scaling region 2 2 98 98 10 10 2 2 32 32 10 10 256 256 Number of processors Number of processors

  25. MBPT(2) Hessian perturbations d d[ [V ab ij V ab ij D ab ij ] / dp ]dq dV/dp*d V /dq V*d 2 V/dp/dq

  26. Details of calculation • Number of basis functions = 300 • Number of correlated occupied = 54 • Number of Hessian elements = 324/2 • Number of processors = 128 • RHF reference

  27. Results • V*d 2 V/dpdq • T=381 minutes • dV/dp • 155 sec / pert p • d V /dq • 330 sec / pert q • dV/dp*d V /dq • 16 sec / element

  28. Observations • Ideally suited for dual layer parallelization with ‘dual’ layer being over the perturbations. • Dual layer strategy not optimal from an operation viewpoint as some computation must be repeated but many advantages: restart capability, real time of calculation, queuing, data storage.

  29. Conclusions • ACESIII provides an ideal parallel environment in which to implement computationally intense methods. • MBPT(2) gradient achieved over 90% scaling until work exhausted • CCSD achieved better than ideal scaling up to 512 processors (32 as reference) indicating an optimal range of processors exists for each computation. • CCSD(T) perturbative triples can be computed quit effectively using a dual layer parallelization strategy so that (T) and CCSD are comparable to compute in a pragmatic way. • CCSD gradients (Ar 6 ) exhibit ideal scaling from 32-256 processors.

  30. Conclusions • MBPT(2) Hessians (and others also) benefit from dual layer parallelism but care bust be taken to segment the work optimally. • A set of benchmark calculations would be very valuable to the quantum chemistry community to remove ambiguities among various programs. • ACESIII has been successfully ported to the following systems: IBM SP4 SP5, ALTIX, Linux cluster, Opteron cluster and is available on many DOD machines. • ACESIII benefits from ‘many’ processors indicating potential in the massively parallel regime. • The flexibility offered by the ACESIII environment allows for rapid tuning and implementation of codes.

Recommend


More recommend