ACESIII • Outline • Collaborators • Design philosophy • Mr. Mark Ponton, ACES Q. C. (SIP/SIAL/Compiler) • Implementation • Dr. Norbert Flocke, QTP • Results (Integral package) • Conclusions • Dr. Erik Deumens, QTP (Architect) • Dr. Ajith Perera, QTP • Dr. H. Lei, ACES Q. C. (Compiler) • Dr. Anthony Yau, HPTi • Dr. Rodney Bartlett, QTP ACES Q. C.
Traditional Design control compute code communication disk input output hardware
ACESIII Design control code compute communication Disk I/O hardware
ACESIII design High level Problem Performance Low level concepts communication Data structures algorithms Input/output Super instruction Super instruction Assembly language Processor SIAL SIP (xaces3) input output
SIAL (Super Instruction Assembly Language) • Key features • Advantageous • Index segmentation • Flexibility • Data blocking • Tune ability: Fast • Task isolation optimization • New methods implemented in reduced time • Portable
Implemented • SCF • RHF, UHF • MBPT(2) gradient • RHF, UHF, ROHF • CCSD gradient • RHF, UHF • CCSD(T) • RHF, UHF • MBPT(2) Hessian • RHF, UHF, ROHF • EOM CCSD (Tomasz • RHF, UHF Kus)
DMMP MBPT(2) gradient timings N bf = 397, N corr occ = 33 Ideal 689 min Actual Normal scaling region Time/[sec] 4 10 Super scaling region 46 min 43 min 1 2 8 10 10 128 Number of processors
CCSD(T) • SCF • Easy if you have a good integrals package • Hard but small cost • Transformation • Hard as highly • CCSD nonlinear • Trivial !!! At least that • CCSD(T) is the common wisdom
E4 o4 (T) Strategy E3 o3 E(TOTAL) occupied E2 o2 E1 o1
E4 o4 (T) Strategy E3 o3 E(TOTAL) occupied o2 E2 E1 o1
Advantages of DUAL layer parallelism • Less data replication or I/O bottlenecks • Trivial restart capability • Better turnaround due to queuing • Since more processors are used the effective (T) time is comparable to the CCSD time making the CCSD as/more important that the (T)!
CCSD(T) • Luciferin( C 11 H 8 O 3 S 2 N 2 ) • Sucrose ( C 12 H 22 O 11 ) • RHF • RHF • C 1 symmetry • C 1 symmetry • Basis = aug-cc-pvdz • Basis = 6-311G** (498bf) (546bf) • N corrocc = 46 • 68
Luciferin CCSD timings, N bf = 498, N corr occ = 46 Ideal 2 10 Actual 115.9 Normal scaling region Time/iteration [min] Super scaling region 14.5 13.1 1 10 2 10 32 256 Number of processors
Luciferin CCSD timings, N bf = 498, N corr occ = 46 Ideal 2 10 Actual 115.9 Normal scaling region Time/iteration [min] CCSD(T)=420 min/8 orb Super scaling region 14.5 13.1 1 10 2 10 32 256 Number of processors
Sucrose CCSD timings, N bf = 546, N corr occ = 68 3 10 Ideal Actual 908.6 Time/iteration [min] Normal scaling region 2 10 56.8 Super scaling region 24.0 2 10 32 512 Number of processors
DMMP+OH • H 10 C 3 O 4 P • Number of processors = 64 • C 1 • Time CCSD = 69 • 208 basis functions minutes (3.8 min/iter) • 75 electrons • Time (T) =111 minutes *** • *** 7 dual jobs
Systematic set of Benchmarks • Why? To remove confusion over technological verses algorithmic advances. • Allow users informed choices. • Provide a set of calculations to evaluate each program so strengths and weaknesses become evident. • Remove ambiguity in literature.
Ar N Cluster Benchmarks(Performance) • Specifications(Mine!) • Methods • N=6 • MBPT(2) gradient • UHF • CCSD gradient • C 1 symmetry • CCSD(T) (core dropped) • Basis = aug-cc-pvtz (300bf) • MBPT(2) Hessian (RHF) • N corrocc = 54 • R = 5 bohr
Ar6 MBPT(2) gradient timings, N bf = 300, N corr occ = 54 C 1 Symmetry 2 10 Ideal Actual 67 Normal scaling region Time [min] 16.0 Super scaling region 1 10 8.4 2 10 32 256 Number of processors
Ar6 UCCSD timings, N bf = 300, N corr occ = 54, C 1 Symmetry Ideal 2 10 Actual 103.5 Normal scaling region Time/iteration [min] Super scaling region 13.6 12.9 1 10 2 10 32 256 32 32 256 Number of processors
Ar6 ULAMBDA timings, N bf = 300, N corr Ar6 ULAMBDA timings, N bf = 300, N corr Ar6 ULAMBDA timings, N bf = 300, N corr Ar6 ULAMBDA timings, N bf = 300, N corr occ = 54, C 1 Symmetry occ = 54, C 1 Symmetry occ = 54, C 1 Symmetry occ = 54, C 1 Symmetry Ideal Ideal Ideal Ideal 2 2 2 Actual Actual Actual 10 10 10 2 Actual 10 119.7 119.7 119.7 Normal scaling region Normal scaling region Normal scaling region Normal scaling region Time/iteration [min] Time/iteration [min] Time/iteration [min] Time/iteration [min] Super scaling region Super scaling region Super scaling region 16.0 Super scaling region 16.0 16.0 15.0 15.0 16.0 15.0 15.0 2 2 2 10 10 10 2 32 256 10 32 32 256 256 Number of processors Number of processors Number of processors Number of processors
Ar6 UGRAD timings, N bf = 300, N corr occ = 54 C 1 Symmetry Ideal 3 10 Actual 1273 Normal scaling region Time [min] Super scaling region 159 141 2 10 256 32 Number of processors
bf = 300, N corr bf = 300, N corr Ar6 Total time for one UHF CCSD gradient, N Ar6 Total time for one UHF CCSD gradient, N occ = 54 occ = 54 C 1 Symmetry C 1 Symmetry Ideal Ideal Actual Actual 3505 3505 Normal scaling region Normal scaling region Time [min] Time [min] 3 3 10 10 Super scaling region Super scaling region 438 437 437 438 2 2 10 10 32 256 32 256 Number of processors Number of processors
Ar6 UCCSD(T) timings, N bf = 300, C 1 Symmetry Ar6 UCCSD(T) timings, N bf = 300, C 1 Symmetry Ideal Ideal Actual Actual 784 784 Normal scaling region Normal scaling region Time[min] Time[min] 131 131 Super scaling region Super scaling region 2 2 98 98 10 10 2 2 32 32 10 10 256 256 Number of processors Number of processors
MBPT(2) Hessian perturbations d d[ [V ab ij V ab ij D ab ij ] / dp ]dq dV/dp*d V /dq V*d 2 V/dp/dq
Details of calculation • Number of basis functions = 300 • Number of correlated occupied = 54 • Number of Hessian elements = 324/2 • Number of processors = 128 • RHF reference
Results • V*d 2 V/dpdq • T=381 minutes • dV/dp • 155 sec / pert p • d V /dq • 330 sec / pert q • dV/dp*d V /dq • 16 sec / element
Observations • Ideally suited for dual layer parallelization with ‘dual’ layer being over the perturbations. • Dual layer strategy not optimal from an operation viewpoint as some computation must be repeated but many advantages: restart capability, real time of calculation, queuing, data storage.
Conclusions • ACESIII provides an ideal parallel environment in which to implement computationally intense methods. • MBPT(2) gradient achieved over 90% scaling until work exhausted • CCSD achieved better than ideal scaling up to 512 processors (32 as reference) indicating an optimal range of processors exists for each computation. • CCSD(T) perturbative triples can be computed quit effectively using a dual layer parallelization strategy so that (T) and CCSD are comparable to compute in a pragmatic way. • CCSD gradients (Ar 6 ) exhibit ideal scaling from 32-256 processors.
Conclusions • MBPT(2) Hessians (and others also) benefit from dual layer parallelism but care bust be taken to segment the work optimally. • A set of benchmark calculations would be very valuable to the quantum chemistry community to remove ambiguities among various programs. • ACESIII has been successfully ported to the following systems: IBM SP4 SP5, ALTIX, Linux cluster, Opteron cluster and is available on many DOD machines. • ACESIII benefits from ‘many’ processors indicating potential in the massively parallel regime. • The flexibility offered by the ACESIII environment allows for rapid tuning and implementation of codes.
Recommend
More recommend