  1. Collective Prefetching for Parallel I/O Systems
     Yong Chen and Philip C. Roth, Oak Ridge National Laboratory

  2. Outline
     • I/O gap in high-performance computing
     • I/O prefetching and its limitations
     • Collective prefetching design and implementation
     • Preliminary experimental evaluation
     • Conclusion and future work

  3. High-Performance Computing Trend

  4. I/O for Large-scale Scientific Computing
     • Reading input and restart files
     • Reading and processing large amounts of data
     • Writing checkpoint files
     • Writing movie and history files
     • Applications tend to be data intensive
     [Figure: compute nodes connected to a metadata server and object storage servers]

  5. The I/O Gap
     • Widening gap between computing and I/O
     • Widening gap between demands and I/O capability
     • Long I/O access latency leads to severe overall performance degradation
     • Limited I/O capability attributed as the cause of low sustained performance
     [Figure: FLOPS vs. disk bandwidth; the I/O gap between application I/O demand and I/O system capability grows with system size]

  6. Bridging the Gap: Prefetching
     • Move data in advance and closer to where it is needed
     • Improve I/O system capability
     [Figure: timeline comparing compute and I/O phases with and without prefetching, with prefetches overlapped with computation]
     • Representative existing works
       – Patterson and Gibson, TIP, SOSP’95
       – Tran and Reed, time-series-model based, TPDS’04
       – Yang et al., speculative execution, USENIX’02
       – Byna et al., signature based, SC’08
       – Blas et al., multi-level caching and prefetching for BG/P, PVM/MPI’09

  7. Limitation of Existing Strategies
     • The effectiveness of I/O prefetching depends on carrying out prefetches efficiently and moving data swiftly
     • Existing studies take an independent approach, without considering the correlation of accesses among processes
       – Independent prefetching
     • The multiple processes of a parallel application have strong correlation with each other with respect to I/O accesses
       – Foundation of collective I/O, data sieving, etc.
     • We propose to take advantage of this correlation
       – Parallel I/O prefetching should be done in a collective way rather than in an ad hoc, individual, and independent way

  8. Collective Prefetching Idea
     • Take advantage of the correlation among the I/O accesses of multiple processes to optimize prefetching
     • Benefits/features (a request-coalescing sketch follows below)
       – Filter overlapping and redundant prefetch requests
       – Combine prefetch requests from multiple processes
       – Combine demand requests with prefetch requests
       – Form large and contiguous requests
       – Reduce system calls
     • A similar mechanism is exploited in optimizations such as collective I/O and data sieving, but it has not yet been studied for prefetching
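
The following is a minimal, self-contained sketch of two of the benefits listed above: filtering overlapping or redundant (offset, length) prefetch requests collected from several processes and coalescing them into fewer, larger contiguous requests. It is illustrative only; the struct and function names are hypothetical and this is not the paper's ROMIO-based implementation.

```c
/* Hypothetical sketch: coalesce (offset, length) prefetch requests from
 * multiple processes into non-overlapping, contiguous requests. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { long offset; long length; } io_req_t;

static int cmp_offset(const void *a, const void *b) {
    long diff = ((const io_req_t *)a)->offset - ((const io_req_t *)b)->offset;
    return (diff > 0) - (diff < 0);
}

/* Sort requests by offset and merge any that overlap or touch.
 * Returns the number of coalesced requests left in reqs[]. */
static int coalesce(io_req_t *reqs, int n) {
    if (n == 0) return 0;
    qsort(reqs, n, sizeof(io_req_t), cmp_offset);
    int out = 0;
    for (int i = 1; i < n; i++) {
        long end = reqs[out].offset + reqs[out].length;
        if (reqs[i].offset <= end) {          /* overlapping or adjacent */
            long new_end = reqs[i].offset + reqs[i].length;
            if (new_end > end)
                reqs[out].length = new_end - reqs[out].offset;
        } else {
            reqs[++out] = reqs[i];            /* disjoint: keep separately */
        }
    }
    return out + 1;
}

int main(void) {
    /* Prefetch requests as they might arrive from four processes. */
    io_req_t reqs[] = { {0, 1048576}, {1048576, 1048576},
                        {524288, 1048576}, {4194304, 1048576} };
    int n = coalesce(reqs, 4);
    for (int i = 0; i < n; i++)
        printf("prefetch offset=%ld length=%ld\n", reqs[i].offset, reqs[i].length);
    return 0;
}
```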

  9. Collective Prefetching Framework
     [Figure: software stack]
     • Application processes
     • Parallel I/O middleware/library hosting collective prefetching (prefetch delegates), alongside collective I/O, two-phase I/O, and caching
     • Parallel file systems (PVFS, Lustre, GPFS, PanFS)
     • I/O hardware, storage devices

  10. MPI-IO with Collective Prefetching
      • MPI-IO and ROMIO (ADIO layer over UFS, NFS, PVFS2, and other file systems)
      • Collective I/O and two-phase implementation (a minimal MPI-IO usage example follows below)
      [Figure: processes 0-3 issue requests through MPI-IO/ROMIO; in the communication phase the requests are mapped to file domains owned by aggregators 0-3, and in the I/O phase the aggregators access file servers 0-3 over the interconnect]
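
The sketch below shows the standard MPI-IO collective read path that, in ROMIO, triggers the two-phase protocol which collective prefetching extends. It is not code from the paper; the file name "datafile" and the block counts are hypothetical, and each process reads an interleaved, strided portion of a shared file similar to the access pattern used in the evaluation.

```c
/* Minimal MPI-IO collective read sketch (not code from the paper). */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int blk = 1 << 20;        /* 1 MB blocks, as in the strided tests */
    const int nblocks = 16;         /* blocks per process (hypothetical)    */
    char *buf = malloc((size_t)blk * nblocks);

    /* File layout: blocks from all processes are interleaved, so process
     * r owns blocks r, r + nprocs, r + 2*nprocs, ...                      */
    MPI_Datatype filetype;
    MPI_Type_vector(nblocks, blk, blk * nprocs, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * blk, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);

    /* Collective read: ROMIO aggregates the requests of all processes,
     * partitions them into file domains, and performs two-phase I/O.    */
    MPI_File_read_all(fh, buf, nblocks * blk, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}
```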

  11. Two-Phase Read Protocol in ROMIO
      • Each aggregator calculates the span of the I/O requests and exchanges it (calc offsets & exchange)
      • The aggregated span is partitioned into file domains (calc FDs & requests)
      • Each aggregator carries out the I/O requests for its own file domain (reads)
      • All aggregators send data to the requesting processes, and each process receives its required data (exchange)
      (A sketch of the first two steps follows below.)
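
The following is a simplified, hedged illustration of the first two protocol steps above: aggregators exchange the offsets of their requests to find the overall span, then split that span into per-aggregator file domains. It is not ROMIO's actual code; the request list and the assumption that every process acts as an aggregator are hypothetical.

```c
/* Sketch of "calc offsets & exchange" and "calc FDs & requests". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Hypothetical local request: a 1 MB block at a rank-dependent offset. */
    long my_start = (long)rank * (1 << 20);
    long my_end   = my_start + (1 << 20);

    /* Step 1: exchange offsets to compute the global [start, end) span. */
    long gstart, gend;
    MPI_Allreduce(&my_start, &gstart, 1, MPI_LONG, MPI_MIN, MPI_COMM_WORLD);
    MPI_Allreduce(&my_end,   &gend,   1, MPI_LONG, MPI_MAX, MPI_COMM_WORLD);

    /* Step 2: partition the aggregated span into per-aggregator file domains. */
    long span     = gend - gstart;
    long fd_size  = (span + nprocs - 1) / nprocs;
    long fd_start = gstart + (long)rank * fd_size;
    long fd_end   = fd_start + fd_size > gend ? gend : fd_start + fd_size;

    printf("aggregator %d: file domain [%ld, %ld)\n", rank, fd_start, fd_end);
    /* Steps 3 and 4 (reads within the file domain, then exchange of data
     * back to the requesting processes) are omitted from this sketch.    */

    MPI_Finalize();
    return 0;
}
```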

  12. Extended Protocol with Collective Prefetching
      The two-phase read steps (calc offsets & exchange, calc FDs & requests, reads, exchange) are extended with:
      A. Maintain history
      B. Predict
      C. Place prefetched data
      D. Check demand requests against the cache buffer (a sketch follows below)
      E. Calc FDs & requests with prefetches
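
The sketch below illustrates step D only: before reads are issued, a demand request is checked against the cache buffer holding data placed by earlier collective prefetches. The cache layout and names are hypothetical, not the data structures of the paper's implementation.

```c
/* Hypothetical cache-buffer check for demand requests (step D). */
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 4
#define SLOT_BYTES  (1 << 20)           /* 1 MB per cached region */

typedef struct {
    long offset;                        /* file offset of cached region */
    int  valid;
    char data[SLOT_BYTES];              /* data placed by a prefetch    */
} cache_slot_t;

static cache_slot_t cache[CACHE_SLOTS];

/* Return 1 and copy from the cache if [offset, offset+len) is fully
 * covered by one cached region; return 0 so the caller falls back to
 * a demand read otherwise. */
static int cache_lookup(long offset, long len, char *dst) {
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid &&
            offset >= cache[i].offset &&
            offset + len <= cache[i].offset + SLOT_BYTES) {
            memcpy(dst, cache[i].data + (offset - cache[i].offset), (size_t)len);
            return 1;                   /* served from prefetched data */
        }
    }
    return 0;                           /* miss: issue a normal read   */
}

int main(void) {
    /* Pretend a prefetch already placed the region starting at 4 MB. */
    cache[0].offset = 4L << 20;
    cache[0].valid  = 1;

    char buf[4096];
    printf("hit at 4 MB:  %d\n", cache_lookup(4L << 20, sizeof buf, buf));
    printf("miss at 8 MB: %d\n", cache_lookup(8L << 20, sizeof buf, buf));
    return 0;
}
```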

  13. Collective Prefetching Algorithm
      Algorithm cpf  /* Collective Prefetching at MPI-IO */
      Input: I/O request offset list, I/O request length list
      Output: none
      Begin
      1. Each aggregator maintains recent access history of window size w
      2. Aggregators/prefetch delegates run prediction or mining algorithms on all tracked global access history
         – Algorithms can be streaming, strided, Markov, or advanced mining algorithms such as PCA/ANN
      3. Generate prefetch requests and enqueue them in the prefetch queue (PFQ)
      4. Process requests in the PFQs together with demand accesses
      5. Filter out overlapping and redundant requests
      6. Perform the extended two-phase I/O protocol with prefetch requests
         – Prefetched data are kept in the cache buffer to satisfy future requests
         – Exchange data to satisfy demand requests (move data to the user buffer)
      End
      (A sketch of steps 1-3 follows below.)
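
The following is a hedged sketch of steps 1-3 of the cpf algorithm: keep a small access-history window, detect a constant stride across the tracked offsets, and generate the next few prefetch requests for the prefetch queue. The window size, prefetch depth, and function names are hypothetical, and a real prefetch delegate would track the global history of all processes rather than a single local stream.

```c
/* Hypothetical strided prediction over a small access-history window. */
#include <stdio.h>

#define WINDOW 8        /* history window size w (hypothetical) */
#define DEPTH  2        /* prefetch requests generated per prediction */

static long history[WINDOW];
static int  nhist = 0;

/* Step 1: record a demand access offset in the sliding history window. */
static void record_access(long offset) {
    if (nhist == WINDOW) {
        for (int i = 1; i < WINDOW; i++) history[i - 1] = history[i];
        nhist--;
    }
    history[nhist++] = offset;
}

/* Steps 2-3: if the history shows a constant stride, emit DEPTH prefetch
 * offsets into pfq[] and return how many were generated; 0 means no
 * confident prediction, so nothing is enqueued. */
static int predict_strided(long *pfq) {
    if (nhist < 3) return 0;
    long stride = history[1] - history[0];
    for (int i = 2; i < nhist; i++)
        if (history[i] - history[i - 1] != stride) return 0;
    for (int i = 0; i < DEPTH; i++)
        pfq[i] = history[nhist - 1] + (long)(i + 1) * stride;
    return DEPTH;
}

int main(void) {
    /* Accesses with a 4 MB stride, as in one of the evaluated patterns. */
    for (long off = 0; off < 4 * (4L << 20); off += 4L << 20)
        record_access(off);

    long pfq[DEPTH];
    int n = predict_strided(pfq);
    for (int i = 0; i < n; i++)
        printf("enqueue prefetch at offset %ld\n", pfq[i]);
    return 0;
}
```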

  14. Preliminary Results
      • Strided access pattern, with 1MB and 4MB strides
      [Figure: sustained bandwidth (MB/s) of MPI-IO, collective prefetching, and individual prefetching at 8, 16, 32, 64, and 128 processes]
      • With 1MB stride: collective prefetching up to 22%, 19% on average; individual prefetching up to 12%, 8% on average
      • With 4MB stride: collective prefetching up to 17%, 15% on average; individual prefetching up to 8%, 6% on average

  15. Preliminary Results
      • Strided access pattern, with 1MB and 4MB strides
      [Figure: speedup of individual and collective prefetching at 8, 16, 32, 64, and 128 processes, for 1MB and 4MB strides]
      • Collective prefetching outperformed individual prefetching by over one fold
      • Collective prefetching had a more stable performance trend

  16. Preliminary Results
      • Nested strided access pattern, with a (1MB, 3MB) stride
      [Figure: sustained bandwidth (MB/s) and speedup of MPI-IO, individual prefetching, and collective prefetching at 8, 16, 32, 64, and 128 processes]
      • Collective prefetching outperformed individual prefetching by over 66%
      • Collective prefetching had a similarly stable performance trend

  17. Conclusion
      • I/O has been widely recognized as the performance bottleneck for many HEC/HPC applications
      • The correlation of I/O accesses has been exploited in data sieving and collective I/O, but not yet for prefetching
      • We propose a new form of collective prefetching for parallel I/O systems
      • Preliminary results have demonstrated its potential
      • A general idea that can be applied at many levels, such as the storage device level or the server level

  18. Ongoing and Future Work
      • Exploit the potential at the server level
      • LACIO: a new collective I/O strategy, and I/O customization
      [Figure: processes issue requests over logical blocks LB0-LB11 striped across servers 0-3; logical file domains are assigned to aggregators 0-3, which access the physical layout on the four file servers over the interconnect in the I/O phase]

  19. Any Questions? Thank you.
      • Acknowledgement: Prof. Xian-He Sun of Illinois Institute of Technology, Dr. Rajeev Thakur of Argonne National Laboratory, Prof. Wei-Keng Liao and Prof. Alok Choudhary of Northwestern University.
