Improving I/O Performance of HPC Applications Using Intra-Job Scheduling


  1. Improving I/O Performance of HPC Applications Using Intra-Job Scheduling. Arnab K. Paul†, Olaf Faaland‡, Adam Moody‡, Elsa Gonsiorowski‡, Kathryn Mohror‡, Ali R. Butt†. †Virginia Tech, ‡Lawrence Livermore National Laboratory. PDSW-DISCS 2019; co-located with SC’19, Denver, CO.

  2. Motivation: The Increasing Gap. Figure: Processor Performance vs Disk Access Time (source: https://newsroom.intel.com/editorials/3d-xpoint-memory-storage/#gs.gqtcop)

  3. Motivation: I/O operations become a limiting factor in application efficiency. Figure: Processor Performance vs Disk Access Time (source: https://newsroom.intel.com/editorials/3d-xpoint-memory-storage/#gs.gqtcop)

  4. Motivation: I/O operations become a limiting factor in application efficiency. Goal: improve I/O performance of HPC applications using intra-job scheduling. Figure: Processor Performance vs Disk Access Time (source: https://newsroom.intel.com/editorials/3d-xpoint-memory-storage/#gs.gqtcop)

  5. Lustre Parallel File System. Diagram: Lustre clients connect over an Ethernet or InfiniBand network to the Management Server (MGS) and Management Target (MGT), the Metadata Server (MDS) and Metadata Target (MDT), the DNE Metadata Servers and Metadata Targets, and the Object Storage Servers and Targets (OSS & OSTs). Clients get direct, parallel file access.

  6. System Design. Diagram labels: Job Statistics Dataset; Machine Learning Modeling; Validation; Models are stored.

  7. System Design. Diagram labels: Currently running jobs; New jobs; Model DB; Job scheduler; Current and new jobs’ future requests.

  8. Preliminary Results • Built a Lustre simulator on NS3. • Results from time-series modeling show 95% accuracy in predicting job write bursts.
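The slides do not say which time-series model was used, so the following is only a minimal sketch of the general idea of predicting a job's next write burst from its past bursts. It assumes bursts are roughly periodic (for example, regular checkpoints) and uses the median inter-burst interval as the forecast. The function names, the tolerance, and the sample history are all hypothetical and are not taken from the presentation.

```python
from statistics import median


def predict_next_burst(burst_times):
    """Predict the start time of the next write burst.

    Assumption (not from the slides): bursts are roughly periodic, so the
    median gap between past bursts is a reasonable forecast of the next gap.
    """
    if len(burst_times) < 2:
        raise ValueError("need at least two past bursts")
    intervals = [b - a for a, b in zip(burst_times, burst_times[1:])]
    return burst_times[-1] + median(intervals)


def hit_rate(burst_times, tolerance, warmup=3):
    """Fraction of bursts predicted within `tolerance` seconds.

    Walks forward through the history: each burst after the first `warmup`
    ones is predicted using only the bursts that came before it.
    """
    hits = 0
    total = 0
    for i in range(warmup, len(burst_times)):
        predicted = predict_next_burst(burst_times[:i])
        hits += abs(predicted - burst_times[i]) <= tolerance
        total += 1
    return hits / total if total else 0.0


# Hypothetical job that writes a checkpoint roughly every 600 seconds.
history = [0, 600, 1205, 1800, 2398, 3000]
print(predict_next_burst(history))
print(hit_rate(history, tolerance=10))
```

A walk-forward check like `hit_rate` is one simple way to attach an accuracy figure to a burst predictor; the 95% figure on the slide comes from the authors' own models and evaluation, not from this sketch.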

  9. Next Steps • Modify the scheduler to reduce I/O contention. • Measure the I/O performance of the jobs as well as the overall performance of the system.
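One way a scheduler could use predicted bursts to reduce I/O contention is to stagger them so that only a limited number write at once. The sketch below is an illustration under that assumption, not the authors' scheduler: `Burst`, `stagger`, and `max_concurrent` are hypothetical names, and the policy is a simple first-come greedy delay.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Burst:
    job: str         # job identifier
    start: float     # predicted burst start time (seconds)
    duration: float  # predicted burst length (seconds)


def stagger(bursts, max_concurrent=1):
    """Delay predicted bursts so at most `max_concurrent` overlap.

    Bursts are handled in order of predicted start time. A burst that would
    exceed the limit is pushed back to the moment the earliest active burst
    finishes. Returns a list of (job, start, end) tuples.
    """
    active_ends = []  # min-heap of end times of bursts already placed
    schedule = []
    for b in sorted(bursts, key=lambda b: b.start):
        start = b.start
        # Forget bursts that have already finished by this start time.
        while active_ends and active_ends[0] <= start:
            heapq.heappop(active_ends)
        if len(active_ends) >= max_concurrent:
            # Wait for the earliest active burst to finish.
            start = heapq.heappop(active_ends)
            while active_ends and active_ends[0] <= start:
                heapq.heappop(active_ends)
        end = start + b.duration
        heapq.heappush(active_ends, end)
        schedule.append((b.job, start, end))
    return schedule


# Hypothetical predictions: jobs A and B would overlap between t=10 and t=30.
predicted = [Burst("A", 0, 30), Burst("B", 10, 20), Burst("C", 100, 15)]
for job, start, end in stagger(predicted):
    print(job, start, end)
```

Here job B is delayed until A finishes, while C is untouched because nothing else is writing at its predicted start. A real scheduler would also have to weigh the cost of delaying a job's I/O against the contention it avoids, which is what the measurement step on this slide would quantify.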

  10. Thank You! Q & A. akpaul@vt.edu, http://research.cs.vt.edu/dssl/
