Efficient Object Storage Journaling in a Distributed Parallel File System
Presented by Sarp Oral
Sarp Oral, Feiyi Wang, David Dillow, Galen Shipman, Ross Miller, and Oleg Drokin
FAST’10, Feb 25, 2010
A Demanding Computational Environment
• Jaguar XT5: 18,688 nodes, 224,256 cores, 300+ TB memory, 2.3 PFlops
• Jaguar XT4: 7,832 nodes, 31,328 cores, 63 TB memory, 263 TFlops
• Frost (SGI Ice): 128-node institutional cluster
• Smoky: 80-node software development cluster
• Lens: 30-node visualization and analysis cluster
Spider
• Fastest Lustre file system in the world
– Demonstrated bandwidth of 240 GB/s on the center-wide file system
• Largest-scale Lustre file system in the world
– Demonstrated stability and concurrent mounts on major OLCF systems: Jaguar XT5, Jaguar XT4, Opteron Dev Cluster (Smoky), and Visualization Cluster (Lens)
– Over 26,000 clients mounting the file system and performing I/O
• General availability on Jaguar XT5, Lens, Smoky, and GridFTP servers
• Cutting-edge resiliency at scale
– Demonstrated resiliency features on Jaguar XT5: DM Multipath and Lustre router failover
Designed to Support Peak Performance
[Chart: hourly maximum read and write bandwidth (GB/s) over the January 2010 timeline, measured on half of the available storage controllers]
Motivations for a Center-Wide File System
• Building dedicated file systems for individual platforms does not scale
– Storage is often 10% or more of new system cost
– Storage is often not poised to grow independently of its attached machine
– Storage and compute technologies follow different curves
– Data needs to be moved between different compute islands, e.g., from the simulation platform to the visualization platform
– Dedicated storage is only accessible when its machine is available
– Managing multiple file systems requires more manpower
[Diagram: Jaguar XT5, Jaguar XT4, Smoky, Lens, and Ewok connected to the Spider system over the SION network, with the data sharing path shown]
Spider: A System At Scale
• Over 10.7 PB of RAID 6 formatted capacity
• 13,440 1 TB drives
• 192 Lustre I/O servers
• Over 3 TB of memory (on the Lustre I/O servers)
• Available to many compute systems through the high-speed SION network
– Over 3,000 IB ports
– Over 3 miles (5 kilometers) of cables
• Over 26,000 client mounts for I/O
• Peak I/O performance is 240 GB/s
• Current status: in production use on all major OLCF computing platforms
Lustre File System
• Developed and maintained by CFS, then Sun, now Oracle
• POSIX-compliant, open-source parallel file system; development driven by DOE labs
• A metadata server (MDS) manages the namespace, stored on a metadata target (MDT)
• Object storage servers (OSS) manage object storage targets (OSTs)
• OSTs manage block devices
– ldiskfs on OSTs: a superset of ext3 in v1.6; a superset of ext3 or ext4 in v1.8+
• High performance
– Parallelism by object striping across OSTs
• Highly scalable
• Tuned for parallel block I/O
[Diagram: Lustre clients reach the MDS/MDT and the OSSes/OSTs over a high-performance interconnect]
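As a side note on how object striping spreads a single file across OSTs, here is a minimal conceptual sketch in Python (not Lustre source code; the 1 MB stripe size and 4-OST stripe count are illustrative defaults, not Spider's actual layout):

    # Conceptual sketch of round-robin object striping: a logical file offset is
    # mapped to an OST index and an offset within that OST's backing object.
    def stripe_location(offset, stripe_size=1 << 20, stripe_count=4):
        """Return (ost_index, object_offset) for a logical file offset."""
        stripe_number = offset // stripe_size        # which stripe unit holds the offset
        ost_index = stripe_number % stripe_count     # stripe units go round-robin over OSTs
        object_offset = (stripe_number // stripe_count) * stripe_size + (offset % stripe_size)
        return ost_index, object_offset

    if __name__ == "__main__":
        # With 1 MB stripes over 4 OSTs, a 4 MB sequential write touches each OST
        # exactly once, which is the source of Lustre's parallelism for large I/O.
        for off in range(0, 4 << 20, 1 << 20):
            print(off, stripe_location(off))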
Spider - Overview
• Currently providing high-performance scratch space to all major OLCF platforms
[Diagram: 96 DDN S2A9900 controllers (singlets) with SAS connections to 13,440 SATA-II disks arranged as 1,344 (8+2) RAID level 6 arrays (tiers); 192 Lustre I/O servers attached over 192 4x DDR IB links; the SION IB network (two core switches, two aggregation switches, and 48 leaf switches) connects Spider to the Jaguar XT5 segment (192 DDR links), the Jaguar XT4 segment (96 DDR links), Smoky, and Lens/Everest]
Spider - Speeds and Feeds
[Diagram: data path from Jaguar XT5 (SeaStar2+ 3D torus, 9.6 GB/s links), Jaguar XT4, and other systems (viz, clusters) through Lustre router nodes and the SION InfiniBand network (16 Gbit/s links) to the storage nodes and enterprise storage (Serial ATA, 3 Gbit/s); aggregate bandwidths of 384 GB/s and 366 GB/s on the XT5 path and 96 GB/s on the XT4 path]
• Enterprise storage: 48 DataDirect S2A9900 controller pairs with 1 TB drives and four InfiniBand connections per pair; the controllers and large racks of disks are connected via InfiniBand
• Storage nodes: 192 dual quad-core Xeon servers with 16 GB of RAM each; they run parallel file system software and manage incoming FS traffic
• SION network: a 3,000+ port 16 Gbit/s InfiniBand switch complex; it provides connectivity between OLCF resources and primarily carries storage traffic
• Lustre router nodes: 192 (XT5) and 48 (XT4) dual-core Opteron nodes with 8 GB of RAM each; they run parallel file system client software and forward I/O operations from HPC clients
Spider - Couplet and Scalable Cluster
[Diagram: a DDN S2A9900 couplet is a pair of controllers with 280 1 TB disks in 5 disk trays; a Spider Scalable Cluster (SC) combines three such units, each pairing a couplet with 4 Dell Lustre I/O (OSS) nodes through a 24-port Flextronics IB switch that uplinks to the Cisco core switches; 16 SC units on the floor, 2 racks for each SC]
Spider - DDN S2A9900 Couplet
[Diagram: couplet internals; two controllers, each with house and UPS power supplies, connect through paired I/O modules for channels A through H plus P and S to five disk enclosures; each enclosure houses its disks behind disk expansion modules (DEMs) and has its own house and UPS power supplies]
Spider - DDN S2A9900 (cont’d)
• RAID 6 (8+2): 8 data drives and 2 parity drives per tier
[Diagram: each disk controller drives 14 disk positions across channels A through P plus S; the disks in one position across the channels form a tier, giving tiers 1-14 behind controller 1 and tiers 15-28 behind controller 2, each tier an 8+2 RAID 6 array]
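For reference, the capacity figures quoted earlier (13,440 x 1 TB drives, 1,344 RAID 6 tiers, over 10.7 PB formatted) are consistent with this 8+2 layout; a quick arithmetic check in Python:

    # Consistency check of the Spider capacity figures against the RAID 6 (8+2) layout.
    drives = 13_440            # 1 TB SATA-II drives
    tier_width = 10            # 8 data + 2 parity drives per tier
    data_drives_per_tier = 8

    tiers = drives // tier_width
    formatted_tb = tiers * data_drives_per_tier * 1   # 1 TB per drive

    print(tiers)                   # 1344 (8+2) tiers
    print(formatted_tb / 1000.0)   # ~10.75 PB of formatted capacity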
Spider - How Did We Get Here?
• A 4-year project
• We didn’t just pick up the phone and order a center-wide file system
– No single vendor could deliver this system
– Trail blazing was required
• Collaborative effort was key to success
– ORNL, Cray, DDN, Cisco, and CFS, then Sun, now Oracle
Spider – Solved Technical Challenges
• Performance
– Asynchronous journaling
– Network congestion avoidance
• Scalability
– 26,000 file system clients
• Fault-tolerant design
– Network
– I/O servers
– Storage arrays
• InfiniBand support on XT SIO
[Chart: SeaStar torus congestion test; a hard bounce of 7,844 nodes via 48 routers, with combined R/W MB/s and IOPS plotted as a percentage of observed peak against elapsed time; the XT4 is bounced at 206 s, RDMA timeouts, bulk timeouts, and OST evictions follow, I/O returns at 435 s, and full I/O resumes at 524 s]
ldiskfs Journaling Overhead
• Even sequential writes exhibit random I/O behavior due to journaling
– Observed 4-8 KB writes alongside the 1 MB sequential writes on the DDNs
– The DDN S2A9900s are not well tuned for small I/O accesses
– For enhanced reliability, the write-back cache on the DDNs is turned off
• A special file (contiguous block space) is reserved for journaling on ldiskfs
– Labeled as the journal device
– Located at the beginning of the physical disk layout
• Ordered mode: after the file data portion is committed to disk, the journal metadata portion must also be committed
• An extra head seek is needed for every journal transaction commit!
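To make the seek penalty concrete, the following is a small conceptual model in Python (not ldiskfs code; the block counts and region locations are illustrative): each large sequential data write is followed by a small commit record written to the journal region near the start of the disk, so the head alternates between the two regions.

    # Conceptual model of ordered-mode journaling on a single OST disk: bulk data
    # writes land in the data region, and every transaction commit writes a small
    # record to the journal region at the beginning of the layout.
    JOURNAL_LBA = 0            # journal device reserved at the start of the disk
    DATA_START_LBA = 1 << 20   # illustrative start of the file data region

    def io_trace(num_bulk_writes, bulk_blocks=2048, commit_blocks=2):
        """Yield (lba, blocks) pairs: 1 MB data writes interleaved with small commits."""
        lba = DATA_START_LBA
        for _ in range(num_bulk_writes):
            yield (lba, bulk_blocks)            # sequential 1 MB data write
            lba += bulk_blocks
            yield (JOURNAL_LBA, commit_blocks)  # small (4-8 KB) journal commit far from the data

    def count_long_seeks(trace, threshold=1 << 16):
        """Count head movements longer than `threshold` blocks."""
        seeks, prev_end = 0, None
        for lba, blocks in trace:
            if prev_end is not None and abs(lba - prev_end) > threshold:
                seeks += 1
            prev_end = lba + blocks
        return seeks

    # Nearly two long seeks per bulk write: one to the journal, one back to the data.
    print(count_long_seeks(io_trace(10)))   # -> 19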
ldiskfs Journaling Overhead (Cont’d)
• Block-level benchmarking (writes) for 28 tiers: 5608.15 MB/s (baseline)
• File system-level benchmark (obdfilter) gives 1398.99 MB/s
– 24.9% of the baseline bandwidth
– One couplet, 4 OSSes, each with 7 OSTs
– 28 clients, mapped one-to-one to the OSTs
• Analysis
– Large number of 4 KB writes in addition to the 1 MB writes
– Traced back to ldiskfs journal updates
[Table 1: XDD baseline performance — read and write bandwidths (MB/s) for sequential and random I/O]
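A quick check of the efficiency figure quoted above, using the two bandwidths reported on this slide:

    # obdfilter (file-system-level) throughput as a fraction of the XDD block-level baseline.
    baseline_mb_s = 5608.15     # block-level sequential write bandwidth, 28 tiers
    obdfilter_mb_s = 1398.99    # file-system-level write bandwidth on the same hardware

    print(f"{obdfilter_mb_s / baseline_mb_s:.1%}")   # -> 24.9%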