Functional Requirements
Don Holmgren, Fermilab (djholm@fnal.gov)
LQCD-ext II CD-1 Review, Germantown, MD, February 25, 2014
Outline
• Computational needs
• Functional requirements
Capability and Capacity Computing
LQCD computing involves a mixture of capability and capacity computing tasks.
Capability versus Capacity Computing: Gauge Files
Capability tasks, such as gauge configuration generation, depend critically on achieving minimum time-to-solution, because each step depends on the prior step.
• They benefit from architectures and algorithms that allow the maximum computing power (FLOPs/sec) to be achieved on individual large problems.
• Large gauge configurations are generated at the DOE LCFs and at other large supercomputing sites (e.g., NCSA Blue Waters).
• Smaller configurations, e.g. for BSM or thermodynamics, are often generated on USQCD dedicated hardware.
Capability versus Capacity Computing: Quark Propagators
Capacity tasks, such as propagator generation, achieve high aggregate computing throughput by working on many independent computations ("jobs") simultaneously, using systems with large numbers of processors.
• The jobs are relatively small – O(10) to O(1K) core counts – compared to gauge configuration jobs, with relatively large simulation volumes addressed by each core.
• Clusters, with good performance on jobs using hundreds to a few thousand cores, or a few to many dozens of GPU accelerators, are well suited to capacity tasks.
• A large number of very small jobs – from one to a few cluster nodes – are required for computing correlations between propagators ("tie-ups") and for doing fits.
Computational Needs
• LQCD is dependent on capability computing for generating ensembles of gauge configurations.
• The need for computing capacity for analysis is at least as large, and growing:
  – The complexity of analysis jobs is typically much greater, and the number of jobs is about 1000X greater.
  – The need for capacity is essentially unbounded – USQCD would use 100X more capacity if it were available.
  – Because the desired capacity is not available, different approaches and algorithms are used to make tradeoffs to achieve specific physics goals (for example, different actions – HISQ, DWF, anisotropic clover – are employed by the various subfields).
  – Because of this variety of approaches, it is possible to optimize capacity systems for specific problems. For example, the first GPU-accelerated cluster in 2009 was tailored for NP spectrum calculations that required very high numbers of inversions to produce propagators.
• The LQCD community has a long history of producing well-optimized code for all available hardware. DOE support through SciDAC has been an important part of this.
Functional Requirements
• Computational capacity
  – Individual analysis jobs (e.g., propagator generation) require from 8 to 128 cluster nodes (0.4 TF/s to 6.4 TF/s based on 50 GF/s per node).
    • The larger jobs – 64 to 128 nodes – are used for eigenvector projection methods that also have high file I/O requirements.
  – GPU-accelerated propagator generation requires from 4 to 16 accelerators (0.6 TF/s to 2.5 TF/s based on 150 GF/s per GPU).
    • In the next talk, Chip Watson will define LQCD metrics such as sustained TF (conventional hardware) and effective TF (GPUs).
  – In aggregate, at least 188 TF/s of capacity will be needed by the end of FY16 (see the sketch after this list):
    • Because of the funding profile, no new hardware will be purchased in FY15.
    • At the beginning of FY15, an estimated 200 TF/s of aggregate capacity will be carried forward from the prior project (LQCD-ext).
    • A combination of new hardware deployment and old system decommissioning in FY16 will result in the 188 TF/s aggregate (+50 TF/s, −62 TF/s).
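A minimal sketch of the capacity arithmetic quoted above, using the nominal per-node and per-GPU ratings from this slide (50 GF/s and 150 GF/s); it is a back-of-the-envelope check, not a capacity-planning tool.

/* Capacity arithmetic from the requirements slide (illustrative sketch). */
#include <stdio.h>

int main(void)
{
    double gf_per_node = 50.0, gf_per_gpu = 150.0;

    /* Per-job requirements, in TF/s */
    printf("cluster jobs: %.1f - %.1f TF/s\n",
           8 * gf_per_node / 1000.0, 128 * gf_per_node / 1000.0);  /* 0.4 - 6.4 */
    printf("GPU jobs:     %.1f - %.1f TF/s\n",
           4 * gf_per_gpu / 1000.0, 16 * gf_per_gpu / 1000.0);     /* 0.6 - 2.4 */

    /* FY16 aggregate: capacity carried forward from LQCD-ext, plus new FY16
       deployments, minus decommissioned systems. */
    double carried = 200.0, added_fy16 = 50.0, retired_fy16 = 62.0;
    printf("FY16 aggregate: %.0f TF/s\n",
           carried + added_fy16 - retired_fy16);                   /* 188 */
    return 0;
}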
Functional Requirements
• Characteristics of production LQCD codes:
  – SU(3) algebra dominates (low arithmetic intensity).
    • Single-precision complex matrix (3x3) times vector (3x1): 96 bytes read, 24 bytes written, 66 FLOPs → roughly 1.8:1 bytes:flops (see the sketch below).
    • Memory bandwidth is more important than peak FLOPs.
  – Inter-node communications for message passing require roughly (an oversimplification) 1 Gbit/sec of bandwidth for each GFLOP/sec of node capability.
    • Low latency is also required for efficient global reductions and for good strong scaling.
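A minimal sketch of the SU(3) matrix–vector kernel whose byte and flop counts are quoted above; the data layout is illustrative, not the one used by production codes.

/* y = U * x for a single-precision SU(3) matrix and color vector. */
#include <complex.h>

typedef float complex su3_matrix[3][3];  /* 9 complex values = 72 bytes */
typedef float complex su3_vector[3];     /* 3 complex values = 24 bytes */

void mult_su3_mat_vec(const su3_matrix u, const su3_vector x, su3_vector y)
{
    /* Per call: reads 72 + 24 = 96 bytes, writes 24 bytes (120 bytes moved).
       Each output component: 3 complex multiplies (6 flops each) plus
       2 complex adds (2 flops each) = 22 flops; 3 components = 66 flops.
       120 bytes / 66 flops ~ 1.8 bytes per flop. */
    for (int i = 0; i < 3; i++) {
        y[i] = u[i][0] * x[0] + u[i][1] * x[1] + u[i][2] * x[2];
    }
}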
Functional Requirements
• Access to large shared file systems and to tape storage
  – Analysis jobs read gauge configurations, and read and write propagators (for each configuration, O(10) or more propagators). Gauge configurations are of order 10 GB (the latest are 250 GB). Individual propagators are up to 12X the volume of gauge configurations. (A size estimate follows below.)
  – Tape provides intermediate storage for long analysis campaigns with high data volumes (cheaper than disk), and archival storage of important files (gauge configurations, expensive propagators).
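A rough gauge-file size estimate, assuming (for illustration only) a 144^3 x 288 lattice stored in single precision with no compression; actual ensembles and file formats vary.

/* Illustrative gauge configuration size estimate. */
#include <stdio.h>

int main(void)
{
    long long sites = 144LL * 144 * 144 * 288;      /* lattice sites */
    /* 4 link matrices per site, each 3x3 complex = 18 floats of 4 bytes. */
    double gauge_bytes = (double)sites * 4 * 18 * 4;
    printf("gauge configuration: ~%.0f GB\n", gauge_bytes / 1e9);  /* ~248 GB */
    return 0;
}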
LQCD Machine Architectures
• A number of architectures are currently used by USQCD:
  – Traditional supercomputers (BlueGene, Cray)
  – Conventional clusters based on x86_64 processors and InfiniBand networking
  – Accelerated clusters based on NVIDIA GPUs, x86_64 processors, and InfiniBand networking, of two types:
    • "Gaming-card" GPU systems, using graphics cards designed for the display hardware of desktop computers
    • "Tesla-class" systems, using GPUs designed for numerical work
• LQCD hardware deployed at FNAL, JLab, and BNL:
  – BlueGene/Q half-rack at BNL
  – Conventional, gaming-GPU, and Tesla-GPU clusters at JLab
  – Conventional and Tesla-GPU clusters at FNAL
Matching Architectures to Job Requirements
• Gauge configuration generation
  – Large lattices are generated at DOE and NSF leadership centers (BlueGene and Cray architectures)
  – Small lattices are generated on the BG/Q half-rack at BNL, or in some cases (BSM, some thermodynamics) on conventional clusters
• Quark propagator production (traditional)
  – Propagators are produced on conventional clusters and/or on accelerated clusters (for actions with available GPU code)
  – Propagators from small lattices can be produced on gaming GPUs with additional correctness checks
• Quark propagator production (eigenvector projection)
  – Large jobs on conventional clusters (suitable for BlueGene/Q)
• Combining propagators ("tie-ups"): conventional clusters (I/O bound)
• Physics parameter extraction: conventional clusters
End
Backup Slides
Hardware Requirements
• Either memory bandwidth, floating point performance, or network performance (bandwidth at the message sizes used) will be the limit on performance on a given parallel machine.
• On current single commodity nodes, memory bandwidth is the constraint (see the sketch below).
  – GPUs have proven to be very cost effective for LQCD because they have the lowest price per unit of memory bandwidth. Intel Xeon Phi accelerators have similar memory bandwidth costs to NVIDIA GPUs.
• On current parallel clusters, the constraint is either memory bandwidth or network performance, depending upon how many nodes are used for a given job.
  – Network performance limits scaling: the surface-area-to-volume ratio increases as more nodes are used, causing relatively more communications and smaller messages.
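A minimal sketch of why memory bandwidth caps single-node performance: with roughly 1.8 bytes moved per flop (previous requirements slide), sustained performance is limited to bandwidth/1.8, usually far below peak. The bandwidth and peak figures below are illustrative assumptions, not measurements of any specific node.

/* Memory-bandwidth-limited performance estimate (illustrative numbers). */
#include <stdio.h>

int main(void)
{
    double mem_bw_gbs     = 60.0;    /* assumed node memory bandwidth, GB/s */
    double peak_gflops    = 300.0;   /* assumed node peak, GF/s             */
    double bytes_per_flop = 1.8;     /* from the SU(3) kernel counting      */

    double bw_limited = mem_bw_gbs / bytes_per_flop;     /* ~33 GF/s */
    printf("bandwidth-limited: %.0f GF/s of %.0f GF/s peak (%.0f%% of peak)\n",
           bw_limited, peak_gflops, 100.0 * bw_limited / peak_gflops);
    return 0;
}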
Balanced Design Requirements: Communications for Dslash
• Modified for improved staggered from Steve Gottlieb's staggered model: physics.indiana.edu/~sg/pcnets/
• Assume:
  – An L^4 lattice
  – Communications in 4 directions
• Then:
  – L implies the message size needed to communicate a hyperplane
  – Sustained MFlop/sec together with message size implies the achieved communications bandwidth
• Required network bandwidth increases as L decreases, and as sustained MFlop/sec increases (see the sketch below)
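A minimal sketch of this communications-balance estimate, loosely adapted from the staggered model cited above. The flops-per-site figure, 24-byte color-vector messages, nearest-neighbor-only accounting, and the assumed sustained node rate are all illustrative simplifications, not the model's exact coefficients.

/* Required network bandwidth versus local lattice size L (illustrative). */
#include <stdio.h>

int main(void)
{
    double sustained_mflops = 25000.0;  /* assumed sustained rate per node      */
    double flops_per_site   = 1146.0;   /* ~improved staggered Dslash, per site */

    for (int L = 6; L <= 16; L += 2) {
        /* Time for one Dslash sweep over the local L^4 volume */
        double t = (double)L * L * L * L * flops_per_site
                   / (sustained_mflops * 1e6);

        /* One L^3 hyperplane of 24-byte color vectors, both directions,
           in all 4 dimensions (nearest neighbor only, for simplicity) */
        double bytes = 8.0 * L * L * L * 24.0;

        printf("L=%2d: ~%.1f Gbit/s per node\n", L, 8.0 * bytes / t / 1e9);
    }
    return 0;   /* required bandwidth scales like 1/L for fixed node rate */
}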
SDR vs. DDR InfiniBand
Sample Analysis Workflow (Fermilab-MILC "Superscript")
Using a 64^3 x 96 ensemble, for each configuration:
1. Generate a set of staggered quark propagators
2. Extract many extended sources from the propagators from #1
3. Compute many clover propagators, write some to disk, and tie them together with each other and with the propagators from #1
4. Compute another set of clover propagators and tie them together with each other, with the propagators from #3, and with the propagators from #1

System           Cores   Total Inverter Time   Total I/O Time    Perf/Core
Ds (cluster)     1024    5480 sec              396 sec (7%)      488.8 MF (clover), 473.6 MF (asqtad)
Intrepid (BG/P)  4096    1872 sec              1845 sec (50%)    345.5 MF (clover), 316.5 MF (asqtad)

The dedicated capacity system, "Ds", was more appropriate and was used for production:
• better per-core performance
• better I/O performance (smaller fraction of time with idle cores)
• more cost effective by a factor of more than 4X based on hardware costs
  (Ds: $184/core; BG/P: $317/core based on $1.3M per rack;
  4 x $317/$184 x 3717 sec/5876 sec = 4.36; inverter only = 2.35 – see the sketch below)
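A minimal sketch reproducing the cost-effectiveness ratios above from the per-core prices and run times quoted on this slide.

/* Hardware-cost x wall-time comparison, Ds cluster vs. Intrepid (BG/P). */
#include <stdio.h>

int main(void)
{
    /* Ds cluster */
    double ds_cores = 1024, ds_cost_per_core = 184.0;
    double ds_inv = 5480.0, ds_total = 5480.0 + 396.0;   /* seconds */

    /* Intrepid (BG/P), $317/core based on $1.3M per rack */
    double bg_cores = 4096, bg_cost_per_core = 317.0;
    double bg_inv = 1872.0, bg_total = 1872.0 + 1845.0;  /* seconds */

    /* (cores x $/core x wall time) per configuration, relative to Ds */
    double total_ratio = (bg_cores * bg_cost_per_core * bg_total)
                       / (ds_cores * ds_cost_per_core * ds_total);
    double inv_ratio   = (bg_cores * bg_cost_per_core * bg_inv)
                       / (ds_cores * ds_cost_per_core * ds_inv);

    printf("inverter + I/O: BG/P costs %.2fx more\n", total_ratio);  /* ~4.36 */
    printf("inverter only:  BG/P costs %.2fx more\n", inv_ratio);    /* ~2.35 */
    return 0;
}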