  1. [www.bsc.es] echofs: Enabling Transparent Access to Node-local NVM Burst Buffers for Legacy Applications [… with the scheduler’s collaboration]. Alberto Miranda, PhD, Researcher on HPC I/O, alberto.miranda@bsc.es. Dagstuhl, May 2017. The NEXTGenIO project has received funding from the European Union’s Horizon 2020 Research and Innovation programme under Grant Agreement no. 671951.

  2. I/O -> a fundamental challenge
     • Petascale already struggles with I/O…
       – Extreme parallelism w/ millions of threads
       – Job’s input data read from the external PFS
       – Checkpoints: periodic writes to the external PFS
       – Job’s output data written to the external PFS
     • HPC & data-intensive systems merging
       – Modelling, simulation and analytics workloads increasing…
     [diagram: compute nodes connected to an external filesystem over a high-performance network]

  3. I/O -> a fundamental challenge [build of the previous slide]
     • … and it will only get worse at Exascale

  4. Burst buffers -> remote
     • Fast storage devices that temporarily store application data before sending it to the PFS
       – Goal: absorb peak I/O to avoid overtaxing the PFS
       – Cray DataWarp, DDN IME
     • Growing interest in adding them to next-gen HPC architectures
       – NERSC’s Cori, LLNL’s Sierra, ANL’s Aurora, …
       – Typically a resource separate from the PFS
       – Usage/allocation/data movement/etc. become the user’s responsibility
     [diagram: compute nodes, burst filesystem and external filesystem connected over a high-performance network]

  5. Burst buffers -> on-node
     • Non-volatile storage coming to the node
       – Argonne’s Theta has 128 GB SSDs in each compute node
     [diagram: compute nodes with node-local storage, external filesystem reached over a high-performance network]

  6. NEXTGenIO EU Project [http://www.nextgenio.eu]
     • Node-local, high-density NVRAM becomes a fundamental component of the I/O stack
       – Intel 3D XPoint™
       – Capacity much larger than DRAM
       – Slightly slower than DRAM but significantly faster than SSDs
       – DIMM form factor -> standard memory controller
       – No refresh -> no/low energy leakage
     [diagram: I/O stack layers: cache, memory, storage]

  7. NEXTGenIO EU Project [http://www.nextgenio.eu] [build of the previous slide]
     [diagram: the NVRAM layer slots into the I/O stack: cache, memory, nvram, fast storage, slow storage]

  8. NEXTGenIO EU Project [http://www.nextgenio.eu] [build of the previous slide]
     • Question 1: how do we manage access to these layers?
     • Question 2: how can we bring the benefits of these layers to legacy code?

  9. OUR SOLUTION: MANAGING ACCESS THROUGH A USER-LEVEL FILESYSTEM

  10. echofs -> objectives
      • First goal: allow legacy applications to transparently benefit from new storage layers
        – Accessible storage layers under a unique mount point
        – Make new layers readily available to applications
        – I/O stack complexity hidden from applications
        – Allows for automatic management of data location
        – POSIX interface [sorry]
      • PFS namespace is “echoed”: /mnt/PFS/User/App -> /mnt/ECHOFS/User/App (see the sketch after this slide)
      [diagram: I/O stack: cache, memory, nvram, SSD, Lustre]
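As a rough illustration of the namespace “echoing” above, the sketch below maps a PFS path onto the echofs mount point. It is a minimal sketch, assuming hypothetical mount points /mnt/PFS and /mnt/ECHOFS and a function name (echo_path) that does not come from the slides; the real mapping is internal to echofs.

```python
from pathlib import PurePosixPath

PFS_ROOT = PurePosixPath("/mnt/PFS")        # assumed parallel filesystem mount
ECHOFS_ROOT = PurePosixPath("/mnt/ECHOFS")  # assumed echofs mount point

def echo_path(pfs_path: str) -> PurePosixPath:
    """Map a PFS path to its echoed location under the echofs mount point."""
    rel = PurePosixPath(pfs_path).relative_to(PFS_ROOT)
    return ECHOFS_ROOT / rel

# /mnt/PFS/User/App/input.dat -> /mnt/ECHOFS/User/App/input.dat
print(echo_path("/mnt/PFS/User/App/input.dat"))
```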

  11. echofs -> objectives
      • Second goal: construct a collaborative burst buffer (CBB) by joining the NVM regions assigned to a batch job by the scheduler [SLURM]
        – Filesystem’s lifetime linked to the batch job’s lifetime (see the lifecycle sketch after this slide)
        – Input files staged into NVM before the job starts
        – Allow HPC jobs to perform collaborative NVM I/O
        – Output files staged out to the PFS when the job ends
      [diagram: parallel processes issue POSIX reads/writes against the echofs collaborative burst buffer spanning the NVM of the compute nodes; data is staged in/out from the external filesystem]
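A toy sketch of the “filesystem lifetime linked to the job lifetime” idea: stage inputs into node-local storage before the job, hand the job a path, and stage results back afterwards. Everything here (function names, directories, the single-node copy logic) is an assumption for illustration; echofs itself builds a distributed buffer over the NVM of all allocated nodes.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def collaborative_burst_buffer(input_files, pfs_dir, nvm_dir):
    """Illustrative job-lifetime buffer: stage in before the job, stage out after."""
    nvm = Path(nvm_dir)
    nvm.mkdir(parents=True, exist_ok=True)
    for name in input_files:                      # stage-in: PFS -> node-local storage
        shutil.copy2(Path(pfs_dir) / name, nvm / name)
    try:
        yield nvm                                 # the job runs against the staged copies
    finally:
        for f in nvm.iterdir():                   # stage-out: copy everything back to the PFS
            shutil.copy2(f, Path(pfs_dir) / f.name)

# Hypothetical usage:
# with collaborative_burst_buffer(["input.dat"], "/mnt/PFS/User/App", "/nvm/job42") as d:
#     run_job(d)   # run_job is a placeholder for the actual batch job
```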

  12. echofs -> intended workflow
      • User provides job I/O requirements through SLURM (an illustrative hint structure follows this slide)
        – Nodes required, files accessed, type of access [in|out|inout], expected lifetime [temporary|persistent], expected “survivability”, required POSIX semantics [?], …
      • SLURM allocates nodes and mounts echofs across them
        – Also forwards the I/O requirements through an API
      • echofs builds the CBB and fills it with the input files
        – When finished, SLURM starts the batch job
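The slides list the kinds of hints a user might provide but not a concrete interface; a hedged example of how such per-job I/O requirements could be written down is below. All field names and values are hypothetical, not the NEXTGenIO/SLURM API.

```python
# Hypothetical per-job I/O requirements, mirroring the hint categories on the slide.
job_io_hints = {
    "nodes": 4,
    "files": [
        {"path": "/mnt/PFS/User/App/input.dat",   "access": "in",
         "lifetime": "persistent"},
        {"path": "/mnt/PFS/User/App/ckpt-%d.dat", "access": "inout",
         "lifetime": "temporary"},    # checkpoints need not reach the PFS
        {"path": "/mnt/PFS/User/App/output.dat",  "access": "out",
         "lifetime": "persistent"},
    ],
    "posix_semantics": "relaxed",     # strict POSIX only where the job needs it
}
```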

  13. echofs -> intended workflow [build of the previous slide]
      • Side note: we can’t expect optimization details from users, but maybe we can expect them to offer us enough hints…

  14. echofs -> intended workflow
      • Job I/O absorbed by the collaborative burst buffer
        – Non-CBB open()s forwarded to the PFS, throttled to limit PFS congestion (see the sketch after this slide)
        – Temporary files do not need to make it to the PFS (e.g. checkpoints)
        – Metadata attributes for temporary files cached -> distributed key-value store
      • When the job completes, the future of its files is managed by echofs
        – Persistent files eventually sync’d to the PFS
        – Decision orchestrated by SLURM & the DataScheduler component depending on the requirements of upcoming jobs
        – Side note: if some other job reuses these files, we can leave them “as is”
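As a small sketch of the “throttled forwarding to the PFS” point above, the code below caps the number of concurrent open() calls that reach the PFS with a semaphore. The limit, the names, and the idea of passing in the real open function are assumptions for illustration; the slides do not describe how echofs implements the throttling.

```python
import threading

# Illustrative throttle on PFS forwarding (limit and names are assumptions).
MAX_CONCURRENT_PFS_OPENS = 8
_pfs_slots = threading.BoundedSemaphore(MAX_CONCURRENT_PFS_OPENS)

def forward_open_to_pfs(path, flags, real_open):
    """Forward an open() that echofs does not absorb, limiting PFS concurrency."""
    with _pfs_slots:                 # blocks while too many opens are in flight
        return real_open(path, flags)

# Hypothetical usage:
# fd = forward_open_to_pfs("/mnt/PFS/User/App/other.dat", os.O_RDONLY, os.open)
```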

  15. echofs -> data distribution
      • Distributed data servers
        – Job’s data space partitioned across compute nodes
        – Each node acts as data server for its partition
        – Each node acts as data client for the other partitions
      • Pseudo-random file segment distribution (a placement sketch follows this slide)
        – No replication ⇒ avoids coherence mechanisms
        – Resiliency through erasure codes (eventually)
        – Each node acts as lock manager for its partition
      [diagram: segments of a shared file ([0-8MB), [8-16MB), [16-32MB), …) are hashed onto the NVM partitions of the compute nodes]
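A minimal sketch of the pseudo-random segment placement described above: the owner of an 8 MiB segment is derived from a hash of the file path and segment index, so any client can compute it locally without a metadata request. The hash choice and function names are assumptions, and partition-size weighting is ignored for simplicity.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024 * 1024   # 8 MiB segments, as in the slide's [0-8MB) example

def segment_owner(path: str, offset: int, nodes: list) -> str:
    """Pick the node responsible for the file segment containing `offset`.

    Any client can compute this locally, so no metadata request is needed
    to look a segment up.
    """
    segment = offset // SEGMENT_SIZE
    digest = hashlib.sha1(f"{path}:{segment}".encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

nodes = ["node0", "node1", "node2", "node3"]
print(segment_owner("/mnt/ECHOFS/User/App/shared.dat", 20 * 1024 * 1024, nodes))
```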

  16. echofs -> data distribution
      • Why pseudo-random?
        – Efficient & decentralized segment lookup [no metadata request needed to look a segment up]
        – Balances the workload w.r.t. partition size
        – Allows for collaborative I/O
      [diagram: same hash-based segment-to-node mapping as the previous slide]

  17. echofs -> data distribution [build of the previous slide]
      • Why pseudo-random? (continued)
        – Guarantees minimal movement of data if the node allocation changes [future research on elasticity]
      [diagram: the job scheduler changes the allocated nodes across job phases over time (+1 node, +1 node, -2 nodes); only the affected NVM partitions need a data transfer]

  18. echofs -> data distribution [build of the previous slide]
      • Side note: other strategies would be possible depending on job semantics (one candidate placement scheme with the minimal-movement property is sketched after this slide)
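The slides do not name the placement function that “guarantees minimal movement of data if the node allocation changes”; rendezvous (highest-random-weight) hashing is one well-known scheme with that property, sketched here purely as an illustration of the idea.

```python
import hashlib

def rendezvous_owner(segment_key: str, nodes: list) -> str:
    """Highest-random-weight hashing: each segment goes to the node with the
    highest score. Adding or removing a node only relocates the segments whose
    top-scoring node changed, which keeps data movement minimal."""
    def score(node):
        return hashlib.sha1(f"{node}:{segment_key}".encode()).digest()
    return max(nodes, key=score)

before = ["node0", "node1", "node2", "node3"]
after = before + ["node4"]                       # +1 node, as in the slide
keys = [f"shared.dat:{i}" for i in range(1000)]
moved = sum(rendezvous_owner(k, before) != rendezvous_owner(k, after) for k in keys)
print(f"{moved} of {len(keys)} segments move")   # roughly 1/5 of them in expectation
```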

  19. echofs -> integration with the batch scheduler
      • Data Scheduler daemon external to echofs (a toy request/ACK sketch follows this slide)
        – Interfaces SLURM & echofs -> allows SLURM to send requests to echofs and echofs to ACK these requests
        – Offers an API to [non-legacy] applications willing to send I/O hints to echofs
        – In the future will coordinate w/ SLURM to decide when different echofs instances should access the PFS [data-aware job scheduling]
      [diagram: applications send static I/O requirements to SLURM and dynamic I/O requirements to the data scheduler, which exchanges asynchronous stage-in/stage-out requests with echofs]
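A toy sketch of the asynchronous request/ACK pattern between SLURM and echofs via the Data Scheduler described above. All names, the queue-based transport, and the request format are assumptions; the real daemon presumably talks to SLURM and echofs over an RPC or socket interface rather than an in-process queue.

```python
import queue
import threading

# Hypothetical Data Scheduler core: SLURM enqueues requests, echofs answers
# them asynchronously with an ACK (all names are illustrative).
requests = queue.Queue()

def slurm_submit(request):
    """Called on behalf of SLURM: ask echofs to stage data in/out for a job."""
    done = threading.Event()
    requests.put((request, done))
    return done                       # SLURM can wait on or poll this ACK

def echofs_worker():
    """Runs alongside echofs: serve requests and acknowledge them."""
    while True:
        request, done = requests.get()
        print("handling", request)    # e.g. stage-in of the job's input files
        done.set()                    # asynchronous ACK back to SLURM

threading.Thread(target=echofs_worker, daemon=True).start()
slurm_submit({"job": 42, "op": "stage-in"}).wait()
```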

  20. Summary
      • Main features:
        – Ephemeral filesystem linked to the job’s lifetime
        – Allows legacy applications to benefit from newer storage technologies
        – Provides aggregate I/O for applications
      • Research goals:
        – Improve coordination w/ the job scheduler and other HPC management infrastructure
        – Investigate ad-hoc data distributions tailored to each job’s I/O
        – Scheduler-triggered optimizations specific to jobs/files

  21. Food for thought
      • POSIX compliance is hard…
        – But maybe we don’t need FULL COMPLIANCE for ALL jobs…
      • Adding I/O-awareness to the scheduler is important…
        – Avoids wasting I/O work already done…
        – … but requires user/developer collaboration (tricky…)
      • User-level filesystems/libraries solve very specific I/O problems…
        – Can we reuse/integrate these efforts? Can we learn what works for a specific application, characterize it & automatically run similar ones in a “best fit” FS?
