File Systems for HPC – A Data Centre View
Professor Mark Parsons, EPCC Director, Associate Dean for e-Research, The University of Edinburgh
Dagstuhl Seminar 17202, 15th May 2017
Advanced Computing Facility
• The ‘ACF’
• Opened 2005
• Purpose built, secure, world class data centre

Advanced Computing Facility – HPC and Big Data
• Houses wide variety of leading-edge systems
• Major expansion in 2013
• 7.5 MW, 850 m² plant room, 550 m² machine room
• Next … the Exascale
Principal services
• Houses a variety of leading-edge systems and infrastructures
• UK national services
  • ARCHER – 118,080 cores (Cray XC30), funded by EPSRC and NERC
    • Service opened in 2013; 5,053 users since opening, 3,494 users in the past 12 months
    • ARCHER 2 procurement starting
  • DiRAC – 98,304 cores (IBM BlueGene/Q)
  • UK RDF – 25 PB disk / 50 PB tape
• Local services
  • Cirrus – industry and MSc machine
  • ULTRA – SGI UV2000
  • ECDF – DELL and IBM clusters for University researchers
  • FARR – system for the Farr Institute and NHS Scotland
Data centre file systems in 2017
• Complexity has greatly increased in the past decade
• Most HPC systems have:
  • Multiple storage systems
  • Multiple file systems per storage system (see the sketch below)
• Filesystems are predominantly:
  • Via directly attached storage
  • GPFS (IBM Spectrum Scale)
  • Lustre (versions 2.6 – 2.8 are most common)
• Resiliency
  • Storage platforms generally use some form of RAID
  • Isn’t good enough for “golden” data
• A lot of tape is still used
  • Generally LTO6 or LTO7 or IBM Enterprise formats
• UPS focussed on keeping file systems up while shutting down smoothly
• In 2017 compute is robust – storage is not
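One practical consequence of running several file systems side by side is that users (and their scripts) often need to check which one a path actually lives on before choosing an I/O strategy. A minimal sketch of one way to do that, assuming a Linux client node where Lustre and GPFS mounts are visible in /proc/mounts; the example paths are made up:

```python
# Sketch: report which mounted file system (and type) serves a given path.
# Assumes a Linux node where Lustre mounts show type "lustre" and
# GPFS / Spectrum Scale mounts show type "gpfs" in /proc/mounts.
import os

def fs_for(path):
    """Return (mount_point, fs_type) of the longest-matching mount for path."""
    path = os.path.realpath(path)
    best = ("/", "unknown")
    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mount_point, fs_type = line.split()[:3]
            if path.startswith(mount_point) and len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best

if __name__ == "__main__":
    for p in ["/lustre/home", "/gpfs/projects", "/tmp"]:   # hypothetical paths
        print(p, "->", fs_for(p))
```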
Systems grow rapidly … it gets complex very quickly
April 2016 – our new system “cirrus” is installed
Schematic layout, April 2016 (diagram): 110 TB LFS, 800 TB LFS
Schematic layout, September 2016 (diagram): 800 TB LFS
Schematic layout, March 2017 – compute (diagram): from 5,184 cores to 13,248 cores
Schematic layout, March 2017 – storage (diagram): 1.9 PB WOS
General data centre I/O challenges
• Many application codes do not use parallel I/O (see the sketch below)
• Most users still have a simple POSIX FS view of the world
• Even when they do use parallel I/O we find libraries fighting against FS optimisations
• Users not thinking about underlying constraints
• Start/End of job read/write performance wastes investment
• Some examples
  • Genome processing
    • ~400 TB moves through storage every week
    • One step creates many small files – real Lustre challenges
    • HSM solutions not up to the job
  • Buying storage is terribly complicated and confusing
  • A user on a national service created 240+ million files in a single directory last year
    • Performance degrades …
    • Issues exist with Lustre and GPFS
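To illustrate the first point, the difference between file-per-process POSIX output and genuinely parallel I/O fits in a few lines. A minimal sketch using MPI-IO via mpi4py (not any of the codes discussed here): every rank writes its block into one shared file with a collective call instead of creating its own small file. The filename and sizes are made up.

```python
# Minimal MPI-IO sketch (mpi4py): every rank writes its block into ONE shared
# file with a collective call, instead of creating its own small POSIX file.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.full(1_000_000, rank, dtype=np.float64)    # this rank's 8 MB block

fh = MPI.File.Open(comm, "shared_output.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
offset = rank * data.nbytes                          # contiguous, non-overlapping slots
fh.Write_at_all(offset, data)                        # collective: library can aggregate and align requests
fh.Close()
```

Run with something like mpirun -n 4 python shared_write.py. The collective write gives the MPI library and the file system a chance to aggregate and align requests, which one-small-file-per-rank output denies them.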
Performance and benchmarking
• A real challenge is managing the difference in performance between the day you buy storage and benchmark it and 6 months later (see the sketch below)
• We see enormous differences in file system performance
  • Write can be 3-4X slower
  • Read can be 2-3X slower
  • Significant degradation in performance
• Very difficult to predict performance
  • IOR and IOzone are commonly used but neither predicts performance once the file system has significant amounts of real user data stored on it
• We need new parallel I/O benchmarks urgently for procurement purposes
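Even a crude probe makes the ageing effect visible if it is run on day one and again months later. The sketch below times a streaming write and read of a single large file; it is a stand-in for a tiny slice of what IOR measures and says nothing about parallel or metadata performance. The target path and sizes are made up.

```python
# Crude single-node streaming-bandwidth probe: times one large sequential write
# and read on the file system under test (no parallel clients, no metadata load).
import os, time

TARGET = "/lustre/scratch/bw_probe.dat"   # hypothetical location
BLOCK = 4 * 1024 * 1024                   # 4 MiB per request
BLOCKS = 1024                             # 4 GiB in total

def write_probe():
    buf = os.urandom(BLOCK)
    t0 = time.perf_counter()
    with open(TARGET, "wb", buffering=0) as f:
        for _ in range(BLOCKS):
            f.write(buf)
        os.fsync(f.fileno())              # ensure the data really reached the storage
    return BLOCK * BLOCKS / (time.perf_counter() - t0)

def read_probe():
    # Note: the fresh write leaves data in the client page cache, which flatters
    # the read figure unless the cache is dropped or a different file is read.
    t0 = time.perf_counter()
    with open(TARGET, "rb", buffering=0) as f:
        while f.read(BLOCK):
            pass
    return BLOCK * BLOCKS / (time.perf_counter() - t0)

if __name__ == "__main__":
    print(f"write: {write_probe() / 1e9:.2f} GB/s")
    print(f"read:  {read_probe() / 1e9:.2f} GB/s")
    os.remove(TARGET)
```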
Performance and user configuration
• “Setting striping to 1 has reduced total read time for his 36000 small files from 2 hours to 6 minutes” – comment on the resolution of an ARCHER helpdesk query
• The user was performing I/O on 36000 separate files of ~300 KB with 10000 processes
• They had set parallel striping to the maximum possible (48 OSTs / -1), assuming this would give the best performance
• The overhead of querying every OST for every file dominated the access time
• Moral: more stripes does not mean better performance (see the sketch below)
• But how do users learn non-intuitive configurations?
(Thanks to David Henty for this slide)
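For reference, the fix itself is a single Lustre client command; a minimal sketch wrapping it is shown below. The directory path is made up, and striping set on a directory only applies to files created in it afterwards, so existing files must be copied to pick it up.

```python
# Sketch: set a stripe count of 1 on a directory of small files on Lustre,
# so each new file lives on a single OST instead of being spread over all 48.
# 'lfs setstripe' / 'lfs getstripe' are standard Lustre client commands.
import subprocess

DIR = "/lustre/home/project/small_files"   # made-up directory

# Stripe count 1 suits many ~300 KB files; -1 (all OSTs) suits huge shared files.
subprocess.run(["lfs", "setstripe", "-c", "1", DIR], check=True)

# Show the layout that new files created in DIR will inherit.
print(subprocess.run(["lfs", "getstripe", "-d", DIR],
                     capture_output=True, text=True, check=True).stdout)
```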
A new hierarchy
• Next generation NVRAM technologies will profoundly change memory and storage hierarchies (see the sketch below)
• HPC systems and Data Intensive systems will merge
• Profound changes are coming to ALL Data Centres
• … but in HPC we need to develop software – OS and application – to support their use
(Diagram: “Memory & Storage Latency Gaps”, HPC systems today versus HPC systems of the future – CPU registers, cache, DRAM DIMMs, NVRAM DIMMs (future), SSD, spinning disk, MAID storage disk and backup tape, with latency gaps between levels ranging from 1x and 10x near the CPU to 1,000x–100,000x at disk and tape.)
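As a rough illustration of what byte-addressable NVRAM on the memory bus changes for application code: persistence can be reached with ordinary loads and stores through a memory mapping rather than through read/write system calls. A minimal sketch, assuming a Linux node with an NVDIMM namespace mounted with DAX at a made-up path; production persistent-memory code would flush CPU cache lines (e.g. via libpmem) rather than rely on msync.

```python
# Sketch: treating byte-addressable NVRAM as memory rather than as a file system.
# Assumes /mnt/pmem0 is a hypothetical DAX-mounted NVDIMM namespace; on a DAX
# mount, loads and stores through the mapping reach the persistent media with
# no page cache in between.
import mmap, os

PMEM_FILE = "/mnt/pmem0/state.bin"     # made-up path
SIZE = 64 * 1024 * 1024                # 64 MiB region

fd = os.open(PMEM_FILE, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, SIZE)
region = mmap.mmap(fd, SIZE)

region[0:16] = b"checkpoint-00042"     # plain store, no write() syscall per update
region.flush()                         # msync; real pmem code would flush CPU cache lines instead

region.close()
os.close(fd)
```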
The future – I/O is the Exascale challenge
• Parallelism beyond 100 million threads demands a new approach to I/O
• Today’s Petascale systems struggle with I/O
  • Inter-processor communication limits performance
  • Reading and writing data to parallel filesystems is a major bottleneck
• New technologies are needed
  • To improve inter-processor communication
  • To help us rethink data movement and processing on capability systems
• Truly parallel file systems with reproducible performance are required
  • Current technologies simply will not scale
  • Large jobs will spend hours reading initial data and writing results
NEXTGenIO project objectives
• Develop a new server architecture using next generation processor and memory advances
  • New Fujitsu server motherboard
  • Built around Intel Xeon and 3D XPoint memory technologies
• Investigate the best ways of utilising these technologies in HPC
• Develop the systemware to support their use at the Exascale
• Model three different I/O workloads and use this understanding in a co-design process
  • Representative of real HPC centre workloads
  • Predict performance of changes to I/O infrastructure and workloads