Improving I/O Performance of HPC Applications Using Intra-Job Scheduling
Arnab K. Paul†, Olaf Faaland‡, Adam Moody‡, Elsa Gonsiorowski‡, Kathryn Mohror‡, Ali R. Butt†
†Virginia Tech, ‡Lawrence Livermore National Laboratory
PDSW-DISCS 2019; collocated with SC’19, Denver, CO
Motivation: The Increasing Gap
[Figure: Processor Performance vs Disk Access Time. Source: https://newsroom.intel.com/editorials/3d-xpoint-memory-storage/#gs.gqtcop]
Motivation
• I/O operations become a limiting factor in application efficiency.
• Improve I/O performance of HPC applications using intra-job scheduling.
Lustre Parallel File System
[Figure: Lustre architecture. Lustre clients connect over an Ethernet or InfiniBand network to the Management Server (MGS) and Management Target (MGT), to the Metadata Server (MDS) and Metadata Target (MDT), with additional DNE metadata servers and targets, and to the Object Storage Servers and Targets (OSS & OSTs). Clients have direct, parallel file access to the object storage.]
System Design
[Figure: modeling pipeline. Diagram labels: Job Statistics, Dataset, Machine Learning Modeling, Validation, Models are stored.]
System Design
[Figure: scheduling pipeline. Diagram labels: Currently running jobs, New jobs, Model DB, Current and new jobs’ future requests, Job scheduler.]
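The scheduling side of the design can be illustrated with a small sketch. It is not the authors' scheduler: it assumes, purely for illustration, that each job's predicted future requests are periodic write bursts on a discrete time axis, and that the scheduler may delay a new job's start by a few slots. The names `predicted_burst_slots` and `choose_start_delay` are hypothetical.

```python
from typing import Dict, List, Set


def predicted_burst_slots(start: int, period: int, horizon: int) -> Set[int]:
    """Time slots in [0, horizon) where a job is predicted to write.

    Assumption: bursts are strictly periodic, beginning at slot `start`.
    """
    return set(range(start, horizon, period))


def choose_start_delay(
    running: List[Set[int]], period: int, horizon: int, max_delay: int
) -> int:
    """Pick the delay (0..max_delay) for a new job whose predicted bursts
    overlap least with the predicted bursts of currently running jobs.

    Ties are broken in favour of the smallest delay.
    """
    # Count how many running jobs are predicted to write in each slot.
    busy: Dict[int, int] = {}
    for slots in running:
        for slot in slots:
            busy[slot] = busy.get(slot, 0) + 1

    def overlap(delay: int) -> int:
        new_slots = predicted_burst_slots(delay, period, horizon)
        return sum(busy.get(slot, 0) for slot in new_slots)

    return min(range(max_delay + 1), key=overlap)


# Two running jobs burst every 4 slots, starting at slots 0 and 1.
running_jobs = [predicted_burst_slots(0, 4, 20), predicted_burst_slots(1, 4, 20)]
best_delay = choose_start_delay(running_jobs, period=4, horizon=20, max_delay=3)
```

In this toy case, delays 0 and 1 collide with every burst of a running job, so the sketch selects a delay of 2 slots. A real scheduler would weigh burst sizes, storage targets, and queueing policy, none of which are modeled here.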
Preliminary Results
• Built a Lustre simulator on NS3.
• Results from time-series modeling show an accuracy of 95% in predicting job write bursts.
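To make "predicting job write bursts" concrete, here is a minimal sketch of one way to forecast bursts and score the forecast. The slides do not say which time-series model was used; this example swaps in a simple periodicity-based predictor on a synthetic binary series (1 = write burst in that interval), and all function names are hypothetical.

```python
from typing import List


def estimate_period(series: List[int], max_lag: int) -> int:
    """Return the smallest lag (1..max_lag) at which the series best matches itself.

    Assumes max_lag < len(series).
    """
    best_lag, best_score = 1, -1.0
    for lag in range(1, max_lag + 1):
        pairs = len(series) - lag
        matches = sum(1 for i in range(pairs) if series[i] == series[i + lag])
        score = matches / pairs
        if score > best_score:  # strict comparison keeps the smallest best lag
            best_lag, best_score = lag, score
    return best_lag


def predict_bursts(history: List[int], horizon: int, period: int) -> List[int]:
    """Forecast `horizon` future intervals by repeating the value one period back."""
    extended = list(history)
    for _ in range(horizon):
        extended.append(extended[-period])
    return extended[len(history):]


def accuracy(predicted: List[int], actual: List[int]) -> float:
    """Fraction of intervals where the forecast matches what was observed."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


# Synthetic job that checkpoints (writes) every 5th interval; not real job data.
series = [1, 0, 0, 0, 0] * 8
train, test = series[:30], series[30:]
period = estimate_period(train, max_lag=10)
forecast = predict_bursts(train, horizon=len(test), period=period)
score = accuracy(forecast, test)
```

On this perfectly periodic synthetic series the forecast matches the held-out intervals exactly; real job traces are noisier, which is why a reported accuracy figure such as the one above is measured on a validation set.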
Next Steps
• Modify the scheduler to reduce I/O contention.
• Measure the I/O performance of the jobs as well as the overall performance of the system.
Thank You! Q & A
akpaul@vt.edu
http://research.cs.vt.edu/dssl/