  1. [www.bsc.es] echofs: Enabling Transparent Access to Node-local NVM Burst Buffers for Legacy Applications [… with the scheduler’s collaboration]. Alberto Miranda, PhD, Researcher on HPC I/O, alberto.miranda@bsc.es. Dagstuhl, May 2017. The NEXTGenIO project has received funding from the European Union’s Horizon 2020 Research and Innovation programme under Grant Agreement no. 671951.

  2. I/O -> a fundamental challenge
     • Petascale already struggles with I/O…
       – Extreme parallelism w/ millions of threads
       – Job’s input data read from the external PFS
       – Checkpoints: periodic writes to the external PFS
       – Job’s output data written to the external PFS
     • HPC & data-intensive systems merging
       – Modelling, simulation and analytics workloads increasing…
     [diagram: compute nodes connected to an external filesystem over a high-performance network]

  3. I/O -> a fundamental challenge [build of the previous slide]
     • … and it will only get worse at Exascale

  4. Burst buffers -> remote
     • Fast storage devices that temporarily store application data before sending it to the PFS
       – Goal: absorb peak I/O to avoid overtaxing the PFS
       – Cray DataWarp, DDN IME
     • Growing interest in adding them to next-gen HPC architectures
       – NERSC’s Cori, LLNL’s Sierra, ANL’s Aurora, …
       – Typically a resource separate from the PFS
       – Usage/allocation/data movement/etc. become the user’s responsibility
     [diagram: compute nodes, burst filesystem and external filesystem connected over a high-performance network]

  5. Burst buffers -> on-node
     • Non-volatile storage coming to the node
       – Argonne’s Theta has 128 GB SSDs in each compute node
     [diagram: compute nodes with node-local storage, external filesystem reached over a high-performance network]

  6. NEXTGenIO EU Project [http://www.nextgenio.eu]
     • Node-local, high-density NVRAM becomes a fundamental component of the I/O stack
       – Intel 3D XPoint™
       – Capacity much larger than DRAM
       – Slightly slower than DRAM but significantly faster than SSDs
       – DIMM form factor -> standard memory controller
       – No refresh -> no/low energy leakage
     [diagram: I/O stack layers: cache, memory, storage]

  7. NEXTGenIO EU Project [http://www.nextgenio.eu] [build of the previous slide]
     [diagram: the NVRAM layer slots into the I/O stack: cache, memory, nvram, fast storage, slow storage]

  8. NEXTGenIO EU Project [http://www.nextgenio.eu] [build of the previous slide]
     • Question 1: how do we manage access to these layers?
     • Question 2: how can we bring the benefits of these layers to legacy code?

  9. OUR SOLUTION: MANAGING ACCESS THROUGH A USER-LEVEL FILESYSTEM

  10. echofs -> objectives
      • First goal: allow legacy applications to transparently benefit from new storage layers
        – Accessible storage layers under a unique mount point
        – Make new layers readily available to applications
        – I/O stack complexity hidden from applications
        – Allows for automatic management of data location
        – POSIX interface [sorry]
      • PFS namespace is “echoed”: /mnt/PFS/User/App -> /mnt/ECHOFS/User/App (see the sketch after this slide)
      [diagram: I/O stack: cache, memory, nvram, SSD, Lustre]
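As a rough illustration of the namespace “echoing” above, the sketch below maps a PFS path onto the echofs mount point. It is a minimal sketch, assuming hypothetical mount points /mnt/PFS and /mnt/ECHOFS and a function name (echo_path) that does not come from the slides; the real mapping is internal to echofs.

```python
from pathlib import PurePosixPath

PFS_ROOT = PurePosixPath("/mnt/PFS")        # assumed parallel filesystem mount
ECHOFS_ROOT = PurePosixPath("/mnt/ECHOFS")  # assumed echofs mount point

def echo_path(pfs_path: str) -> PurePosixPath:
    """Map a PFS path to its echoed location under the echofs mount point."""
    rel = PurePosixPath(pfs_path).relative_to(PFS_ROOT)
    return ECHOFS_ROOT / rel

# /mnt/PFS/User/App/input.dat -> /mnt/ECHOFS/User/App/input.dat
print(echo_path("/mnt/PFS/User/App/input.dat"))
```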

  11. echofs -> objectives
      • Second goal: construct a collaborative burst buffer (CBB) by joining the NVM regions assigned to a batch job by the scheduler [SLURM]
        – Filesystem’s lifetime linked to the batch job’s lifetime (see the lifecycle sketch after this slide)
        – Input files staged into NVM before the job starts
        – Allow HPC jobs to perform collaborative NVM I/O
        – Output files staged out to the PFS when the job ends
      [diagram: parallel processes issue POSIX reads/writes against the echofs collaborative burst buffer spanning the NVM of the compute nodes; data is staged in/out from the external filesystem]
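A toy sketch of the “filesystem lifetime linked to the job lifetime” idea: stage inputs into node-local storage before the job, hand the job a path, and stage results back afterwards. Everything here (function names, directories, the single-node copy logic) is an assumption for illustration; echofs itself builds a distributed buffer over the NVM of all allocated nodes.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def collaborative_burst_buffer(input_files, pfs_dir, nvm_dir):
    """Illustrative job-lifetime buffer: stage in before the job, stage out after."""
    nvm = Path(nvm_dir)
    nvm.mkdir(parents=True, exist_ok=True)
    for name in input_files:                      # stage-in: PFS -> node-local storage
        shutil.copy2(Path(pfs_dir) / name, nvm / name)
    try:
        yield nvm                                 # the job runs against the staged copies
    finally:
        for f in nvm.iterdir():                   # stage-out: copy everything back to the PFS
            shutil.copy2(f, Path(pfs_dir) / f.name)

# Hypothetical usage:
# with collaborative_burst_buffer(["input.dat"], "/mnt/PFS/User/App", "/nvm/job42") as d:
#     run_job(d)   # run_job is a placeholder for the actual batch job
```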

  12. echofs -> intended workflow
      • User provides job I/O requirements through SLURM (an illustrative hint structure follows this slide)
        – Nodes required, files accessed, type of access [in|out|inout], expected lifetime [temporary|persistent], expected “survivability”, required POSIX semantics [?], …
      • SLURM allocates nodes and mounts echofs across them
        – Also forwards the I/O requirements through an API
      • echofs builds the CBB and fills it with the input files
        – When finished, SLURM starts the batch job
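The slides list the kinds of hints a user might provide but not a concrete interface; a hedged example of how such per-job I/O requirements could be written down is below. All field names and values are hypothetical, not the NEXTGenIO/SLURM API.

```python
# Hypothetical per-job I/O requirements, mirroring the hint categories on the slide.
job_io_hints = {
    "nodes": 4,
    "files": [
        {"path": "/mnt/PFS/User/App/input.dat",   "access": "in",
         "lifetime": "persistent"},
        {"path": "/mnt/PFS/User/App/ckpt-%d.dat", "access": "inout",
         "lifetime": "temporary"},    # checkpoints need not reach the PFS
        {"path": "/mnt/PFS/User/App/output.dat",  "access": "out",
         "lifetime": "persistent"},
    ],
    "posix_semantics": "relaxed",     # strict POSIX only where the job needs it
}
```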

  13. echofs -> intended workflow [build of the previous slide]
      • Side note: we can’t expect optimization details from users, but maybe we can expect them to offer us enough hints…

  14. echofs -> intended workflow
      • Job I/O absorbed by the collaborative burst buffer
        – Non-CBB open()s forwarded to the PFS, throttled to limit PFS congestion (see the sketch after this slide)
        – Temporary files do not need to make it to the PFS (e.g. checkpoints)
        – Metadata attributes for temporary files cached -> distributed key-value store
      • When the job completes, the future of its files is managed by echofs
        – Persistent files eventually sync’d to the PFS
        – Decision orchestrated by SLURM & the DataScheduler component depending on the requirements of upcoming jobs
        – Side note: if some other job reuses these files, we can leave them “as is”
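As a small sketch of the “throttled forwarding to the PFS” point above, the code below caps the number of concurrent open() calls that reach the PFS with a semaphore. The limit, the names, and the idea of passing in the real open function are assumptions for illustration; the slides do not describe how echofs implements the throttling.

```python
import threading

# Illustrative throttle on PFS forwarding (limit and names are assumptions).
MAX_CONCURRENT_PFS_OPENS = 8
_pfs_slots = threading.BoundedSemaphore(MAX_CONCURRENT_PFS_OPENS)

def forward_open_to_pfs(path, flags, real_open):
    """Forward an open() that echofs does not absorb, limiting PFS concurrency."""
    with _pfs_slots:                 # blocks while too many opens are in flight
        return real_open(path, flags)

# Hypothetical usage:
# fd = forward_open_to_pfs("/mnt/PFS/User/App/other.dat", os.O_RDONLY, os.open)
```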

  15. echofs -> data distribution
      • Distributed data servers
        – Job’s data space partitioned across compute nodes
        – Each node acts as data server for its partition
        – Each node acts as data client for the other partitions
      • Pseudo-random file segment distribution (a placement sketch follows this slide)
        – No replication ⇒ avoids coherence mechanisms
        – Resiliency through erasure codes (eventually)
        – Each node acts as lock manager for its partition
      [diagram: segments of a shared file ([0-8MB), [8-16MB), [16-32MB), …) are hashed onto the NVM partitions of the compute nodes]
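A minimal sketch of the pseudo-random segment placement described above: the owner of an 8 MiB segment is derived from a hash of the file path and segment index, so any client can compute it locally without a metadata request. The hash choice and function names are assumptions, and partition-size weighting is ignored for simplicity.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024 * 1024   # 8 MiB segments, as in the slide's [0-8MB) example

def segment_owner(path: str, offset: int, nodes: list) -> str:
    """Pick the node responsible for the file segment containing `offset`.

    Any client can compute this locally, so no metadata request is needed
    to look a segment up.
    """
    segment = offset // SEGMENT_SIZE
    digest = hashlib.sha1(f"{path}:{segment}".encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

nodes = ["node0", "node1", "node2", "node3"]
print(segment_owner("/mnt/ECHOFS/User/App/shared.dat", 20 * 1024 * 1024, nodes))
```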

  16. echofs -> data distribution
      • Why pseudo-random?
        – Efficient & decentralized segment lookup [no metadata request needed to look a segment up]
        – Balances the workload w.r.t. partition size
        – Allows for collaborative I/O
      [diagram: same hash-based segment-to-node mapping as the previous slide]

  17. echofs -> data distribution [build of the previous slide]
      • Why pseudo-random? (continued)
        – Guarantees minimal movement of data if the node allocation changes [future research on elasticity]
      [diagram: the job scheduler changes the allocated nodes across job phases over time (+1 node, +1 node, -2 nodes); only the affected NVM partitions need a data transfer]

  18. echofs -> data distribution [build of the previous slide]
      • Side note: other strategies would be possible depending on job semantics (one candidate placement scheme with the minimal-movement property is sketched after this slide)
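The slides do not name the placement function that “guarantees minimal movement of data if the node allocation changes”; rendezvous (highest-random-weight) hashing is one well-known scheme with that property, sketched here purely as an illustration of the idea.

```python
import hashlib

def rendezvous_owner(segment_key: str, nodes: list) -> str:
    """Highest-random-weight hashing: each segment goes to the node with the
    highest score. Adding or removing a node only relocates the segments whose
    top-scoring node changed, which keeps data movement minimal."""
    def score(node):
        return hashlib.sha1(f"{node}:{segment_key}".encode()).digest()
    return max(nodes, key=score)

before = ["node0", "node1", "node2", "node3"]
after = before + ["node4"]                       # +1 node, as in the slide
keys = [f"shared.dat:{i}" for i in range(1000)]
moved = sum(rendezvous_owner(k, before) != rendezvous_owner(k, after) for k in keys)
print(f"{moved} of {len(keys)} segments move")   # roughly 1/5 of them in expectation
```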

  19. echofs -> integration with the batch scheduler
      • Data Scheduler daemon external to echofs (a toy request/ACK sketch follows this slide)
        – Interfaces SLURM & echofs -> allows SLURM to send requests to echofs and echofs to ACK these requests
        – Offers an API to [non-legacy] applications willing to send I/O hints to echofs
        – In the future will coordinate w/ SLURM to decide when different echofs instances should access the PFS [data-aware job scheduling]
      [diagram: applications send static I/O requirements to SLURM and dynamic I/O requirements to the data scheduler, which exchanges asynchronous stage-in/stage-out requests with echofs]
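A toy sketch of the asynchronous request/ACK pattern between SLURM and echofs via the Data Scheduler described above. All names, the queue-based transport, and the request format are assumptions; the real daemon presumably talks to SLURM and echofs over an RPC or socket interface rather than an in-process queue.

```python
import queue
import threading

# Hypothetical Data Scheduler core: SLURM enqueues requests, echofs answers
# them asynchronously with an ACK (all names are illustrative).
requests = queue.Queue()

def slurm_submit(request):
    """Called on behalf of SLURM: ask echofs to stage data in/out for a job."""
    done = threading.Event()
    requests.put((request, done))
    return done                       # SLURM can wait on or poll this ACK

def echofs_worker():
    """Runs alongside echofs: serve requests and acknowledge them."""
    while True:
        request, done = requests.get()
        print("handling", request)    # e.g. stage-in of the job's input files
        done.set()                    # asynchronous ACK back to SLURM

threading.Thread(target=echofs_worker, daemon=True).start()
slurm_submit({"job": 42, "op": "stage-in"}).wait()
```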

  20. Summary
      • Main features:
        – Ephemeral filesystem linked to the job’s lifetime
        – Allows legacy applications to benefit from newer storage technologies
        – Provides aggregate I/O for applications
      • Research goals:
        – Improve coordination w/ the job scheduler and other HPC management infrastructure
        – Investigate ad-hoc data distributions tailored to each job’s I/O
        – Scheduler-triggered optimizations specific to jobs/files

  21. Food for thought
      • POSIX compliance is hard…
        – But maybe we don’t need FULL COMPLIANCE for ALL jobs…
      • Adding I/O-awareness to the scheduler is important…
        – Avoids wasting I/O work already done…
        – … but requires user/developer collaboration (tricky…)
      • User-level filesystems/libraries solve very specific I/O problems…
        – Can we reuse/integrate these efforts? Can we learn what works for a specific application, characterize it & automatically run similar ones in a “best fit” FS?
