Lessons Learned in Deploying the World's Largest Scale Lustre File System
Presented by David Dillow
Galen Shipman, David Dillow, Sarp Oral, Feiyi Wang, Douglas Fuller, Jason Hill, and Zhe Zhang
Brief overview of Spider
● 10 PB of storage to users
● 244 GB/s demonstrated bandwidth
● Currently serves 26,887 clients
● Based on Lustre 1.6.5 plus Cray and Oracle patches
Spider Hardware
● 13,696 1 TB SATA drives
  – 13,440 used for object storage
  – 256 used for metadata and management
● 48 DDN 9900 couplets (IB)
● 1 Engenio 7900 Storage Server (FC)
● 192 Dell PowerEdge 1950 object servers
● 3 Dell R900 metadata servers
● Various other management servers
Why Spider?
● Data availability
  – Better MTTI/MTTF than locally attached Lustre
  – Available during system maintenance periods
● Data accessibility
  – No need to copy from the simulation platform to visualization/analysis clusters
  – Allows use of dedicated transfer nodes for movement off-site
What could go wrong?
● Interference from other systems
● Interactive performance
● Hardware failures
● Management headaches
● Lustre bugs
● Scapegoat syndrome
System Interference
● It's a shared resource
  – Big jobs on the XT5 necessarily impact smaller clusters during heavy I/O
● Turns out to be mostly a non-issue
  – Most users' I/O needs seem to be more modest
  – Occurs within a single cluster as well if multiple jobs are scheduled
● “Mostly....”
I/O Shadow
Metadata scaling
● Conventional wisdom is that one needs huge disk IOPS for metadata service
● Provisioned the Engenio for the MDT
  – RAID 10 of 80 1 TB SATA disks
  – Short-stroked to an 8 TB volume
  – Write-back caching enabled, with mirroring!
● Achieved 18,485 read and 23,150 write IOPS
  – 4 KB requests, random seeks over 8 TB
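To make that benchmark concrete, here is a minimal single-threaded sketch of the measurement: random 4 KB reads spread across the volume. The device path and run length are hypothetical, and a real run would use a tool like fio with O_DIRECT and deep queues to reach numbers like those above.

```python
# Sketch: estimate random read IOPS with 4 KB requests across an 8 TB span.
# Single-threaded and cache-unaware; shown only to illustrate the workload.
import os
import random
import time

DEVICE = "/dev/mapper/mdt-lun"   # hypothetical MDT volume
BLOCK = 4096                     # 4 KB requests
SPAN = 8 * 2**40                 # seek across the full 8 TB volume
DURATION = 30                    # seconds to run

fd = os.open(DEVICE, os.O_RDONLY)
ops = 0
deadline = time.time() + DURATION
while time.time() < deadline:
    offset = random.randrange(0, SPAN // BLOCK) * BLOCK
    os.pread(fd, BLOCK, offset)  # one random 4 KB read
    ops += 1
os.close(fd)
print("approx read IOPS: %.0f" % (ops / DURATION))
```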
Metadata scaling
● Conventional wisdom may be a bit inaccurate
● I/O rates very low during steady-state operation
  – Bursts of ~1000 8 KB write requests every 5 seconds
  – Bursts of ~1500 to 3000 8 KB writes every 30 seconds
  – Occasional reads
● MDS memory sizing is paramount!
  – Try to fit the working set into memory
  – We have 32 GB, which does well so far
● IOPS more important when the cache is cold after mount
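A rough sketch of how one might watch for those write bursts by sampling /proc/diskstats on the MDS every few seconds. The device name is hypothetical; the eighth whitespace-separated field of each diskstats line is the count of completed writes since boot.

```python
# Sketch: report writes completed per interval for the MDT's backing device,
# enough to see the burst-every-few-seconds pattern described above.
import time

DEVICE = "dm-0"      # hypothetical block device backing the MDT
INTERVAL = 5         # seconds between samples

def completed_writes(dev):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[7])   # writes completed since boot
    raise RuntimeError("device %s not found" % dev)

prev = completed_writes(DEVICE)
while True:
    time.sleep(INTERVAL)
    cur = completed_writes(DEVICE)
    print("%d writes in the last %d s" % (cur - prev, INTERVAL))
    prev = cur
```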
Metadata scaling
● Lock ping-pong hurts large jobs with shared files
● Opening a file with O_CREAT holds the lock across a full round trip from the MDS to the client
  – A 65,536-core job opening with O_CREAT takes 50 seconds
  – The same job without O_CREAT takes 5 seconds
● Lustre 1.8.3 fixes this if the file already exists
  – Still takes an exclusive lock, though
● But why does this hurt other jobs?
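One common mitigation at scale is to pre-create the shared file from a single rank so every other rank can open it without O_CREAT. A minimal sketch, assuming mpi4py is available and using a hypothetical file path:

```python
# Sketch: avoid per-rank O_CREAT on a shared file by creating it once on
# rank 0, then opening it everywhere else without O_CREAT after a barrier.
import os
from mpi4py import MPI   # assumption: mpi4py is installed on the compute nodes

comm = MPI.COMM_WORLD
path = "/lustre/scratch/shared_output.dat"   # hypothetical shared file

if comm.Get_rank() == 0:
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.close(fd)
comm.Barrier()                    # file exists before anyone else opens it

fd = os.open(path, os.O_WRONLY)   # no O_CREAT, so no create-lock round trips
# ... each rank writes its own region of the file here ...
os.close(fd)
```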
Metadata scaling
● Lock ping-pong hurts interactivity due to Lustre's request model
  – Every request is handled by a thread
  – If the request needs to wait on a lock, the thread sleeps
  – If you run out of threads, request handling stalls
● No quick fix for this
  – Can bump up the thread count
  – High thread counts can cause high CPU contention
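The stall is easy to reproduce with a toy model that is not Lustre code at all: a small fixed pool of handler threads, where handlers blocked on a contended lock leave no thread free for requests that need no lock.

```python
# Toy model of the stall: a small handler pool, requests that sleep while
# holding their handler thread on a contended lock, and fast requests that
# wait simply because no handler thread is free.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

contended = threading.Lock()
pool = ThreadPoolExecutor(max_workers=4)   # stand-in for the service threads

def locking_request(i):
    with contended:                        # handlers queue here and sleep
        time.sleep(1.0)
    return i

def fast_request(i):
    return i                               # needs no lock at all

start = time.time()
slow = [pool.submit(locking_request, i) for i in range(8)]
fast = [pool.submit(fast_request, i) for i in range(4)]
for f in fast:
    f.result()
print("fast requests waited %.1f s for a free handler" % (time.time() - start))
pool.shutdown()
```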
Hardware failures
● We've dodged many bullets
  – Server hardware has been very reliable
  – Relatively few disk failures (one to two per week)
● More singlet failures than we'd like
  – Upper bound of about 2 per month
  – Some of those were likely software faults rather than HW
  – Multipath has had good results surviving a singlet failure
Hardware failures
● We've also had some issues
  – SRP has had a few issues releasing the old connection
  – Leaf modules in the core IB switches seem to prefer Byzantine failures
  – OpenSM has not dealt gracefully with flapping links
  – No one seems to make a good power supply
  – MDS soft lockups
  – OSTs transitioning to read-only due to external events
Management issues
● How do you find needles with particular attributes when you have 280 million of them?
  – lfs find -R --obd <OST>: over five days to complete
  – ne2scan: almost two days
  – find: three days with no arguments that require a stat call
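A minimal sketch of why the stat call matters: a tree walk that takes the entry type from the directory listing itself instead of issuing a per-file stat. The mount point is hypothetical, and a real sweep at this scale would be parallelized across many clients rather than run as a single serial walk.

```python
# Sketch: count regular files without a stat per entry. os.scandir() can
# report the entry type straight from the directory; calling entry.stat()
# would add the per-file metadata RPC that makes these sweeps take days.
import os

def count_files(root):
    total = 0
    stack = [root]
    while stack:
        path = stack.pop()
        try:
            with os.scandir(path) as it:
                for entry in it:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        total += 1          # type from readdir, no stat
        except OSError:
            pass                            # skip permission/transient errors
    return total

print(count_files("/lustre/scratch"))       # hypothetical mount point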
Management issues
● Who is beating up the file system?
● Who is using the most space?
● How do our data and I/O operation rates trend over time?
● Can we tie that back to an application?
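For the space question, a hedged sketch that sums allocated bytes per UID during a walk. The mount point is hypothetical, and in practice quotas or a purpose-built scanner would be used rather than a serial pass over 280 million files.

```python
# Sketch: aggregate allocated space per owner. This does need a stat per
# file, so it is the slow kind of sweep from the previous slide; it is shown
# only to make the bookkeeping concrete.
import os
from collections import defaultdict

usage = defaultdict(int)                     # uid -> bytes

for dirpath, dirnames, filenames in os.walk("/lustre/scratch"):
    for name in filenames:
        try:
            st = os.lstat(os.path.join(dirpath, name))
        except OSError:
            continue
        usage[st.st_uid] += st.st_blocks * 512   # allocated bytes

for uid, nbytes in sorted(usage.items(), key=lambda kv: -kv[1])[:10]:
    print("uid %d: %.1f GB" % (uid, nbytes / 2**30))
```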
Scapegoat Syndrome
● Lustre has bugs
● Modern Lustre is much more of a “canary in the coal mine” than the walking trouble ticket it has been in the past
● Users' first response is that “the file system is slow!”
  – Even if the root cause is that the HSN is melting down
Questions?
● Contact info:
  David Dillow
  865-241-6602
  dillowda@ornl.gov