ACESIII Outline Collaborators Design philosophy Mr. Mark - PowerPoint PPT Presentation

ACESIII • Outline • Collaborators • Design philosophy • Mr. Mark Ponton, ACES Q. C. (SIP/SIAL/Compiler) • Implementation • Dr. Norbert Flocke, QTP • Results (Integral package) • Conclusions • Dr. Erik Deumens, QTP (Architect) • Dr. Ajith Perera, QTP • Dr. H. Lei, ACES Q. C. (Compiler) • Dr. Anthony Yau, HPTi • Dr. Rodney Bartlett, QTP ACES Q. C.

Traditional Design control compute code communication disk input output hardware

ACESIII Design control code compute communication Disk I/O hardware

ACESIII design High level Problem Performance Low level concepts communication Data structures algorithms Input/output Super instruction Super instruction Assembly language Processor SIAL SIP (xaces3) input output

SIAL (Super Instruction Assembly Language) • Key features • Advantageous • Index segmentation • Flexibility • Data blocking • Tune ability: Fast • Task isolation optimization • New methods implemented in reduced time • Portable

Implemented • SCF • RHF, UHF • MBPT(2) gradient • RHF, UHF, ROHF • CCSD gradient • RHF, UHF • CCSD(T) • RHF, UHF • MBPT(2) Hessian • RHF, UHF, ROHF • EOM CCSD (Tomasz • RHF, UHF Kus)

DMMP MBPT(2) gradient timings N bf = 397, N corr occ = 33 Ideal 689 min Actual Normal scaling region Time/[sec] 4 10 Super scaling region 46 min 43 min 1 2 8 10 10 128 Number of processors

CCSD(T) • SCF • Easy if you have a good integrals package • Hard but small cost • Transformation • Hard as highly • CCSD nonlinear • Trivial !!! At least that • CCSD(T) is the common wisdom

E4 o4 (T) Strategy E3 o3 E(TOTAL) occupied E2 o2 E1 o1

E4 o4 (T) Strategy E3 o3 E(TOTAL) occupied o2 E2 E1 o1

Advantages of DUAL layer parallelism • Less data replication or I/O bottlenecks • Trivial restart capability • Better turnaround due to queuing • Since more processors are used the effective (T) time is comparable to the CCSD time making the CCSD as/more important that the (T)!

CCSD(T) • Luciferin( C 11 H 8 O 3 S 2 N 2 ) • Sucrose ( C 12 H 22 O 11 ) • RHF • RHF • C 1 symmetry • C 1 symmetry • Basis = aug-cc-pvdz • Basis = 6-311G** (498bf) (546bf) • N corrocc = 46 • 68

Luciferin CCSD timings, N bf = 498, N corr occ = 46 Ideal 2 10 Actual 115.9 Normal scaling region Time/iteration [min] Super scaling region 14.5 13.1 1 10 2 10 32 256 Number of processors

Luciferin CCSD timings, N bf = 498, N corr occ = 46 Ideal 2 10 Actual 115.9 Normal scaling region Time/iteration [min] CCSD(T)=420 min/8 orb Super scaling region 14.5 13.1 1 10 2 10 32 256 Number of processors

Sucrose CCSD timings, N bf = 546, N corr occ = 68 3 10 Ideal Actual 908.6 Time/iteration [min] Normal scaling region 2 10 56.8 Super scaling region 24.0 2 10 32 512 Number of processors

DMMP+OH • H 10 C 3 O 4 P • Number of processors = 64 • C 1 • Time CCSD = 69 • 208 basis functions minutes (3.8 min/iter) • 75 electrons • Time (T) =111 minutes *** • *** 7 dual jobs

Systematic set of Benchmarks • Why? To remove confusion over technological verses algorithmic advances. • Allow users informed choices. • Provide a set of calculations to evaluate each program so strengths and weaknesses become evident. • Remove ambiguity in literature.

Ar N Cluster Benchmarks(Performance) • Specifications(Mine!) • Methods • N=6 • MBPT(2) gradient • UHF • CCSD gradient • C 1 symmetry • CCSD(T) (core dropped) • Basis = aug-cc-pvtz (300bf) • MBPT(2) Hessian (RHF) • N corrocc = 54 • R = 5 bohr

Ar6 MBPT(2) gradient timings, N bf = 300, N corr occ = 54 C 1 Symmetry 2 10 Ideal Actual 67 Normal scaling region Time [min] 16.0 Super scaling region 1 10 8.4 2 10 32 256 Number of processors

Ar6 UCCSD timings, N bf = 300, N corr occ = 54, C 1 Symmetry Ideal 2 10 Actual 103.5 Normal scaling region Time/iteration [min] Super scaling region 13.6 12.9 1 10 2 10 32 256 32 32 256 Number of processors

Ar6 ULAMBDA timings, N bf = 300, N corr Ar6 ULAMBDA timings, N bf = 300, N corr Ar6 ULAMBDA timings, N bf = 300, N corr Ar6 ULAMBDA timings, N bf = 300, N corr occ = 54, C 1 Symmetry occ = 54, C 1 Symmetry occ = 54, C 1 Symmetry occ = 54, C 1 Symmetry Ideal Ideal Ideal Ideal 2 2 2 Actual Actual Actual 10 10 10 2 Actual 10 119.7 119.7 119.7 Normal scaling region Normal scaling region Normal scaling region Normal scaling region Time/iteration [min] Time/iteration [min] Time/iteration [min] Time/iteration [min] Super scaling region Super scaling region Super scaling region 16.0 Super scaling region 16.0 16.0 15.0 15.0 16.0 15.0 15.0 2 2 2 10 10 10 2 32 256 10 32 32 256 256 Number of processors Number of processors Number of processors Number of processors

Ar6 UGRAD timings, N bf = 300, N corr occ = 54 C 1 Symmetry Ideal 3 10 Actual 1273 Normal scaling region Time [min] Super scaling region 159 141 2 10 256 32 Number of processors

bf = 300, N corr bf = 300, N corr Ar6 Total time for one UHF CCSD gradient, N Ar6 Total time for one UHF CCSD gradient, N occ = 54 occ = 54 C 1 Symmetry C 1 Symmetry Ideal Ideal Actual Actual 3505 3505 Normal scaling region Normal scaling region Time [min] Time [min] 3 3 10 10 Super scaling region Super scaling region 438 437 437 438 2 2 10 10 32 256 32 256 Number of processors Number of processors

Ar6 UCCSD(T) timings, N bf = 300, C 1 Symmetry Ar6 UCCSD(T) timings, N bf = 300, C 1 Symmetry Ideal Ideal Actual Actual 784 784 Normal scaling region Normal scaling region Time[min] Time[min] 131 131 Super scaling region Super scaling region 2 2 98 98 10 10 2 2 32 32 10 10 256 256 Number of processors Number of processors

MBPT(2) Hessian perturbations d d[ [V ab ij V ab ij D ab ij ] / dp ]dq dV/dp*d V /dq V*d 2 V/dp/dq

Details of calculation • Number of basis functions = 300 • Number of correlated occupied = 54 • Number of Hessian elements = 324/2 • Number of processors = 128 • RHF reference

Results • V*d 2 V/dpdq • T=381 minutes • dV/dp • 155 sec / pert p • d V /dq • 330 sec / pert q • dV/dp*d V /dq • 16 sec / element

Observations • Ideally suited for dual layer parallelization with ‘dual’ layer being over the perturbations. • Dual layer strategy not optimal from an operation viewpoint as some computation must be repeated but many advantages: restart capability, real time of calculation, queuing, data storage.

Conclusions • ACESIII provides an ideal parallel environment in which to implement computationally intense methods. • MBPT(2) gradient achieved over 90% scaling until work exhausted • CCSD achieved better than ideal scaling up to 512 processors (32 as reference) indicating an optimal range of processors exists for each computation. • CCSD(T) perturbative triples can be computed quit effectively using a dual layer parallelization strategy so that (T) and CCSD are comparable to compute in a pragmatic way. • CCSD gradients (Ar 6 ) exhibit ideal scaling from 32-256 processors.

Conclusions • MBPT(2) Hessians (and others also) benefit from dual layer parallelism but care bust be taken to segment the work optimally. • A set of benchmark calculations would be very valuable to the quantum chemistry community to remove ambiguities among various programs. • ACESIII has been successfully ported to the following systems: IBM SP4 SP5, ALTIX, Linux cluster, Opteron cluster and is available on many DOD machines. • ACESIII benefits from ‘many’ processors indicating potential in the massively parallel regime. • The flexibility offered by the ACESIII environment allows for rapid tuning and implementation of codes.

ACESIII Outline Collaborators Design philosophy Mr. Mark - PowerPoint PPT Presentation

ACESIII Outline Collaborators Design philosophy Mr. Mark Ponton, ACES Q. C. (SIP/SIAL/Compiler) Implementation Dr. Norbert Flocke, QTP Results (Integral package) Conclusions Dr. Erik Deumens,

The dual Voronoi diagrams with respect to representational Bregman divergences Frank Nielsen and

More Practical Single-Trace Attacks on the Number Theoretic Transform Peter Pessl, Robert Primas

Stochastic Iterative Hard Thresholding for Graph-Structured Sparsity Optimization Baojian Zhou 1 ,

Comparison of the 2005 Weimer and HAO Empirical High Latitude Models of Energy Transfer in terms

Se Search arch for for th the e Lep Lepton ton Fl Flavor avor Vio Violating lating De

Vassil Roussev The Current Forensic Workflow Forensic Target (3TB) Clone Process @150MB/s

Matching and Inequality in the World Economy Arnaud Costinot Jonathan Vogel MIT & Columbia

Online Algorithms Lecture 3 Ji r Sgall Computer Science Institute of the Charles Univ.,

Randomness with CA Bruno Martin Universit e C ote dAzur, I3S-CNRS Journ ee Al ea

Mesh Denoising via L 0 Minimization Lei He Scott Schaefer Texas A&M University Surface

Massive, Open, Online Courses (MOOCs) as Components of Rich Landscapes of Learning Gerhard

Bifrost: Easy GPU Pipeline Development github.com/ledatelescope/bifrost Presenter: Miles

First Quarter 2020 Earnings Presentation April 30, 2020 www.ussteel.com Forward-looking

Graph Algorithms Graphs Nodes/vertexes: Edges: (undirected) (directed) b a Representations

Adaptive Metric-Aware Job Scheduling for Production Supercomputers Wei Tang, Dongxu Ren,

Acylindrically hyperbolic groups Denis Osin Vanderbilt University June 6, 2013 1 / 12 Some

The b-functions of quiver semi-invariants Andr as Cristian L orincz Northeastern University

The role of double trace deformations in AdS/CMT Gary Horowitz UC Santa Barbara Based on

k-Maximum Likelihood Estimator for mixtures of generalized Gaussians ICPR 2012, Tokyo, Japan

CSE 101 Algorithm Design and Analysis Sanjoy Dasgupta Russell Impagliazzo Ragesh Jaiswal

Conformal theory of MacDowell-Mansouri type Micha Szczachor Capstone Institute for Theoretical

The Coq proof assistant : From graphical presentation to principles and practice Coq syntax

Chapter 4: Foundations for inference OpenIntro Statistics, 2nd Edition Variability in estimates

9/23/2009 C O NFERENC ES Short Name Full Name Special Interest Group on Management Of SIGMOD

Sambuz

Useful Links

Newsletter

Mail Us

ACESIII Outline Collaborators Design philosophy Mr. Mark - PowerPoint PPT Presentation

ACESIII Outline Collaborators Design philosophy Mr. Mark Ponton, ACES Q. C. (SIP/SIAL/Compiler) Implementation Dr. Norbert Flocke, QTP Results (Integral package) Conclusions Dr. Erik Deumens,

The dual Voronoi diagrams with respect to representational Bregman divergences Frank Nielsen and

More Practical Single-Trace Attacks on the Number Theoretic Transform Peter Pessl, Robert Primas

Stochastic Iterative Hard Thresholding for Graph-Structured Sparsity Optimization Baojian Zhou 1 ,

Comparison of the 2005 Weimer and HAO Empirical High Latitude Models of Energy Transfer in terms

Se Search arch for for th the e Lep Lepton ton Fl Flavor avor Vio Violating lating De

Vassil Roussev The Current Forensic Workflow Forensic Target (3TB) Clone Process @150MB/s

Matching and Inequality in the World Economy Arnaud Costinot Jonathan Vogel MIT &amp; Columbia

Online Algorithms Lecture 3 Ji r Sgall Computer Science Institute of the Charles Univ.,

Randomness with CA Bruno Martin Universit e C ote dAzur, I3S-CNRS Journ ee Al ea

Mesh Denoising via L 0 Minimization Lei He Scott Schaefer Texas A&amp;M University Surface

Massive, Open, Online Courses (MOOCs) as Components of Rich Landscapes of Learning Gerhard

Bifrost: Easy GPU Pipeline Development github.com/ledatelescope/bifrost Presenter: Miles

First Quarter 2020 Earnings Presentation April 30, 2020 www.ussteel.com Forward-looking

Graph Algorithms Graphs Nodes/vertexes: Edges: (undirected) (directed) b a Representations

Adaptive Metric-Aware Job Scheduling for Production Supercomputers Wei Tang, Dongxu Ren,

Acylindrically hyperbolic groups Denis Osin Vanderbilt University June 6, 2013 1 / 12 Some

The b-functions of quiver semi-invariants Andr as Cristian L orincz Northeastern University

The role of double trace deformations in AdS/CMT Gary Horowitz UC Santa Barbara Based on

k-Maximum Likelihood Estimator for mixtures of generalized Gaussians ICPR 2012, Tokyo, Japan

CSE 101 Algorithm Design and Analysis Sanjoy Dasgupta Russell Impagliazzo Ragesh Jaiswal

Conformal theory of MacDowell-Mansouri type Micha Szczachor Capstone Institute for Theoretical

The Coq proof assistant : From graphical presentation to principles and practice Coq syntax

Chapter 4: Foundations for inference OpenIntro Statistics, 2nd Edition Variability in estimates

9/23/2009 C O NFERENC ES Short Name Full Name Special Interest Group on Management Of SIGMOD

Sambuz

Useful Links

Newsletter

Mail Us

Matching and Inequality in the World Economy Arnaud Costinot Jonathan Vogel MIT & Columbia

Mesh Denoising via L 0 Minimization Lei He Scott Schaefer Texas A&M University Surface