Cluster 2010 Presentation
Optimization Techniques at the I/O Forwarding Layer
Kazuki Ohta (presenter): Preferred Infrastructure, Inc. / University of Tokyo
Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross: Argonne National Laboratory
Yutaka Ishikawa: University of Tokyo
Contact: kazuki.ohta@gmail.com
Background: Compute and Storage Imbalance
• Leadership-class computational scale:
  • 100,000+ processes
  • Advanced multi-core architectures, compute-node OSs
• Leadership-class storage scale:
  • 100+ servers
  • Commercial storage hardware, cluster file system
• Current leadership-class machines supply only 1 GB/s of storage throughput for every 10 TF of compute performance. This gap has grown by a factor of 10 in recent years.
• Bridging this imbalance between compute and storage is a critical problem for large-scale computation.
Previous Studies: Current I/O Software Stack
• Parallel/Serial Applications
• High-Level I/O Libraries (HDF5, NetCDF, ADIOS): storage abstraction, data portability
• MPI-IO (ROMIO), POSIX I/O (VFS, FUSE): organizing accesses from many clients
• Parallel File Systems (PVFS2, Lustre, GPFS, PanFS, Ceph, etc.): logical file system over many storage devices
• Storage Devices
Challenge: Millions of Concurrent Clients
• 1,000,000+ concurrent clients present a challenge to the current I/O stack
  • e.g. metadata performance, locking, the network incast problem, etc.
• The I/O forwarding layer is introduced.
  • All I/O requests are delegated to a dedicated I/O forwarder process.
  • The I/O forwarder reduces the number of clients seen by the file system for all applications, without collective I/O.
[Figure: I/O path from many compute processes through I/O forwarders to the parallel file system (PVFS2 servers and disks).]
I/O Software Stack with I/O Forwarding
• Parallel/Serial Applications
• High-Level I/O Libraries
• MPI-IO, POSIX I/O (VFS, FUSE)
• I/O Forwarding (IBM ciod, Cray DVS, IOFSL): bridge between compute processes and the storage system
• Parallel File Systems
• Storage Devices
Example I/O System: Blue Gene/P Architecture
I/O Forwarding Challenges
• Large requests
  • Latency of the forwarding
  • Memory limit of the I/O node
  • Variety of backend file system performance
• Small requests
  • The current I/O forwarding mechanism reduces the number of clients, but does not reduce the number of requests.
  • Request-processing overheads at the file systems
• We propose two optimization techniques for the I/O forwarding layer:
  • Out-of-Order I/O Pipelining, for large requests.
  • I/O Request Scheduler, for small requests.
Out-of-Order I/O Pipelining
• Split large I/O requests into small fixed-size chunks.
• These chunks are forwarded in an out-of-order way.
• Good points:
  • Reduces forwarding latency by overlapping the I/O requests and the network transfers.
  • I/O sizes are not limited by the memory size at the forwarding node.
  • Little effect from the slowest file system node.
[Figure: client, IOFSL threads, and file system timelines comparing no pipelining with out-of-order pipelining.]
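To make the pipelining idea concrete, here is a minimal sketch in C with pthreads. It is not the actual IOFSL code: CHUNK_SIZE, NUM_WORKERS, forward_chunk, and the queue layout are hypothetical. A large request is split into fixed-size chunks, and a small pool of worker threads claims and forwards chunks independently, so completions can arrive in any order and the forwarder only ever buffers a few chunks at once.

/* Sketch: out-of-order forwarding of one large request in fixed-size chunks.
 * All names and sizes are illustrative, not taken from IOFSL. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define CHUNK_SIZE (512 * 1024)   /* assumed pipeline buffer size */
#define NUM_WORKERS 4             /* assumed number of I/O threads */

struct chunk {                    /* one pipeline unit */
    off_t  offset;
    size_t length;
};

static struct chunk *queue;       /* shared work queue of chunks */
static size_t queue_len, next_chunk;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for forwarding one chunk to the backend file system. */
static void forward_chunk(const struct chunk *c)
{
    printf("forwarding offset=%lld len=%zu\n", (long long)c->offset, c->length);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        if (next_chunk == queue_len) {        /* all chunks claimed */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        struct chunk c = queue[next_chunk++]; /* claim the next chunk */
        pthread_mutex_unlock(&lock);
        forward_chunk(&c);       /* issued independently of the other workers */
    }
}

int main(void)
{
    /* Split one large request (offset 0, 8 MiB) into CHUNK_SIZE pieces. */
    size_t total = 8 * 1024 * 1024;
    queue_len = (total + CHUNK_SIZE - 1) / CHUNK_SIZE;
    queue = malloc(queue_len * sizeof(*queue));
    for (size_t i = 0; i < queue_len; i++) {
        queue[i].offset = (off_t)(i * CHUNK_SIZE);
        queue[i].length = (i + 1 < queue_len) ? CHUNK_SIZE : total - i * CHUNK_SIZE;
    }

    pthread_t tids[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tids[i], NULL);
    free(queue);
    return 0;
}

Because each worker only holds one chunk at a time, the forwarder's memory footprint is bounded by NUM_WORKERS × CHUNK_SIZE regardless of the original request size, which is the property the slide refers to.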
I/O Request Scheduler
• Schedules and merges the small requests at the forwarder.
  • Reduces the number of seeks.
  • Reduces the number of requests the file system actually sees.
• Scheduling overhead must be minimal.
  • Handle-Based Round-Robin (HBRR) algorithm for fairness between files.
  • Ranges are managed by an interval tree.
  • Contiguous requests are merged.
[Figure: per-handle request queues; the scheduler picks N requests in round-robin order across handles and issues the merged I/O.]
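As a companion to the slide above, here is a minimal C sketch of handle-based round-robin scheduling with merging of contiguous ranges. It is not the IOFSL implementation: the real scheduler keeps per-handle ranges in an interval tree, whereas this sketch uses a small sorted array, and the names (handle_queue, issue_merged, MAX_RANGES) are hypothetical.

/* Sketch: HBRR-style scheduling with contiguous-range merging.
 * A sorted array stands in for the interval tree used by IOFSL. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_RANGES 64

struct range { long offset, length; };

struct handle_queue {             /* pending ranges for one file handle */
    const char  *name;
    struct range pending[MAX_RANGES];
    int          count;
};

static int cmp_range(const void *a, const void *b)
{
    const struct range *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Sort, merge contiguous ranges, and "issue" them for one handle. */
static void issue_merged(struct handle_queue *h)
{
    if (h->count == 0)
        return;
    qsort(h->pending, h->count, sizeof(struct range), cmp_range);
    struct range cur = h->pending[0];
    for (int i = 1; i < h->count; i++) {
        if (h->pending[i].offset == cur.offset + cur.length) {
            cur.length += h->pending[i].length;   /* contiguous: merge */
        } else {
            printf("%s: issue offset=%ld len=%ld\n", h->name, cur.offset, cur.length);
            cur = h->pending[i];
        }
    }
    printf("%s: issue offset=%ld len=%ld\n", h->name, cur.offset, cur.length);
    h->count = 0;
}

int main(void)
{
    /* Two handles with small, partly contiguous 4 KiB requests. */
    struct handle_queue h[2] = { { .name = "fileA" }, { .name = "fileB" } };
    struct range a[] = { {0, 4096}, {4096, 4096}, {16384, 4096} };
    struct range b[] = { {8192, 4096}, {12288, 4096} };
    for (int i = 0; i < 3; i++) h[0].pending[h[0].count++] = a[i];
    for (int i = 0; i < 2; i++) h[1].pending[h[1].count++] = b[i];

    /* Visit handles in round-robin order so no single file starves the others. */
    for (int i = 0; i < 2; i++)
        issue_merged(&h[i]);
    return 0;
}

In this example fileA's first two 4 KiB requests collapse into one 8 KiB I/O and fileB's two requests collapse into another, so the backend file system sees three requests instead of five.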
I/O Forwarding and Scalability Layer (IOFSL)
• IOFSL project [Nawab 2009]
  • Open-source I/O forwarding implementation
  • http://www.iofsl.org/
• Portable on most HPC environments
• Network independent
  • All network communication is done by BMI [Carns 2005]
  • TCP/IP, InfiniBand, Myrinet, Blue Gene/P tree network, Portals, etc.
• File system independent
• MPI-IO (ROMIO) / FUSE clients
IOFSL Software Stack
[Figure: application → POSIX interface (FUSE) / MPI-IO interface (ROMIO-ZOIDFS) → ZOIDFS client API → BMI (TCP, InfiniBand, Myrinet, etc.) carrying I/O requests via the ZOIDFS protocol → IOFSL server (request dispatcher) → file system backends (PVFS2, libsysio).]
• Out-of-Order I/O Pipelining and the I/O request scheduler have been implemented in IOFSL and evaluated on two environments:
  • T2K Tokyo (Linux cluster) and ANL Surveyor (Blue Gene/P).
Evaluation on T2K: Specs
• T2K Open Supercomputer, Tokyo site
  • http://www.open-supercomputer.org/
• 32-node research cluster
  • 16 cores per node: 4× 2.3 GHz quad-core Opteron
  • 32 GB memory
  • 10 Gbps Myrinet network
  • SATA disk (read: 49.52 MB/s, write: 39.76 MB/s)
• One IOFSL server, four PVFS2 servers, 128 MPI processes
• Software
  • MPICH2 1.1.1p1
  • PVFS2 CVS (almost 2.8.2)
Evaluation on T2K: IOR Benchmark
• Each process issues the same amount of I/O.
• Gradually increase the message size and observe the bandwidth change.
• Note: modified to call fsync() for MPI-IO.
[Figure: access pattern illustrating the message size.]
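For readers unfamiliar with IOR, the access pattern used in this evaluation can be sketched with plain MPI-IO as below. This is an illustrative reconstruction, not the benchmark itself: each process writes the same total amount into its own contiguous region of a shared file, the transfer (message) size grows each round, and MPI_File_sync is called to flush data, matching the fsync() modification noted above. The file name and the sizes are assumptions.

/* Sketch of an IOR-like segmented write pattern (illustrative values). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset total_per_proc = 16 * 1024 * 1024;   /* 16 MiB per process */

    /* Grow the message (transfer) size each round, as in the IOR runs. */
    for (size_t msg = 4 * 1024; msg <= 4 * 1024 * 1024; msg *= 4) {
        char *buf = malloc(msg);
        memset(buf, rank, msg);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "ior_test.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Segmented layout: each process owns a contiguous region of the file. */
        MPI_Offset base = (MPI_Offset)rank * total_per_proc;
        double t0 = MPI_Wtime();
        for (MPI_Offset done = 0; done < total_per_proc; done += (MPI_Offset)msg)
            MPI_File_write_at(fh, base + done, buf, (int)msg, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        MPI_File_sync(fh);              /* flush, as in the modified benchmark */
        double t1 = MPI_Wtime();

        MPI_File_close(&fh);
        free(buf);
        if (rank == 0)
            printf("msg=%zu KB  time=%.3f s\n", msg / 1024, t1 - t0);
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}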
Evaluation on T2K: IOR Benchmark, 128 procs
[Chart: bandwidth (MB/s) vs. message size (4 KB to 65536 KB) for ROMIO PVFS2, IOFSL none, and IOFSL hbrr. Out-of-Order Pipelining improvement: ~29.5%; I/O scheduler improvement: ~40.0%.]
Evaluation on Blue Gene/P: Specs
• Argonne National Laboratory BG/P “Surveyor”
  • Blue Gene/P platform for research and development
  • 1,024 nodes, 4,096 cores
  • Four PVFS2 servers
  • DataDirect Networks S2A9550 SAN
• 256 compute nodes with 4 I/O nodes were used.
[Figure: node card (4 cores), node board (128 cores), rack (4,096 cores).]
Evaluation on BG/P: BMI PingPong
[Chart: bandwidth (MB/s) vs. buffer size (1 KB to 4096 KB) for BMI TCP/IP and BMI ZOID (CNK, BG/P tree network).]
Evaluation on BG/P: IOR Benchmark, 256 nodes
[Chart: bandwidth (MiB/s) vs. message size (1 KB to 65536 KB) for CIOD and IOFSL FIFO (32 threads). Performance improvement: ~42.0%; performance drop: ~-38.5%.]
Evaluation on BG/P: Thread Count Effect
[Chart: bandwidth (MiB/s) vs. message size (1 KB to 65536 KB) for IOFSL FIFO with 16 threads and 32 threads; 16 threads outperforms 32 threads.]
Related Work
• Computational Plant project @ Sandia National Laboratories
  • First introduced the I/O forwarding layer.
• IBM Blue Gene/L, Blue Gene/P
  • All I/O requests are forwarded to I/O nodes.
  • The compute-node OS can be stripped down to minimal functionality, reducing OS noise.
• ZOID: I/O forwarding project [Kamil 2008]
  • Only on Blue Gene
• Lustre Network Request Scheduler (NRS) [Qian 2009]
  • Request scheduler at the parallel file system nodes
  • Only simulation results
Future Work
• Event-driven server architecture
  • Reduces thread contention
• Collaborative caching at the I/O forwarding layer
  • Multiple I/O forwarders work collaboratively to cache data and also metadata
• Hints from MPI-IO
  • Better cooperation with collective I/O
• Evaluation on other leadership-scale machines
  • ORNL Jaguar, Cray XT4/XT5 systems