Automatic and Transparent I/O Optimization With Storage Integrated Runtime Support
Noah Watkins, Zhihao Jia, Galen Shipman, Carlos Maltzahn, Alex Aiken, Pat McCormick
UC Santa Cruz, Stanford, LANL
What is this talk about?
● Convince you that storage should be integrated into [HPC] application execution models
[Diagram: applications sending an Add-Index() request to a database engine that manages the underlying data]
Application and System Development
[Diagram: application (SCIENCE) writing bits through POSIX to storage]
● Isolated development
● Maximize FLOPS
● Checkpoint / Restart
Application and System Development
[Diagram: storage system (I/O) receiving bits through POSIX]
● Isolated development
● Maximize BW/Latency
● File system interface
Conflict of Interest in System Development
[Diagram: application (SCIENCE) and storage system (I/O) connected only through POSIX]
Application side:
● Isolated development
● Maximize FLOPS
● Checkpoint / Restart
Storage side:
● Isolated development
● Maximize BW/Latency
● File system interface
Abstractions hide important parameters: application intent, data model
[Diagram: application (SCIENCE) and storage system (I/O) separated by the POSIX interface]
pwrite(fd, data, 1.5MB, 1MB)
● unaligned write
● update multiple blocks
● locking protocols
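To make the slide's example concrete, here is a minimal POSIX sketch of the quoted call (the file name and payload are made up for illustration). The storage system receives only a file descriptor, a byte count, and an offset, so the bulleted consequences are decided entirely below the interface.

```cpp
// Sketch: write 1.5 MB of array data at a 1 MB offset through POSIX.
// The file system sees only (fd, buffer, length, offset); the array's data
// model and the writer's intent are lost at this interface.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main() {
  const size_t len = 3 * 512 * 1024;      // 1.5 MB payload
  const off_t  off = 1 * 1024 * 1024;     // 1 MB offset
  std::vector<char> data(len, 0x42);

  int fd = open("field.dat", O_WRONLY | O_CREAT, 0644);
  if (fd < 0) return 1;
  // Unaligned write: the file system may read-modify-write several blocks
  // and take locks, but none of that is visible or controllable here.
  ssize_t n = pwrite(fd, data.data(), len, off);
  close(fd);
  return n == (ssize_t)len ? 0 : 1;
}
```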
Inflexible applications cannot adapt: storage system state, configuration
[Diagram: several applications (SCIENCE) sharing one POSIX path to the storage system]
● App1 and App2 both issue write(...) at t0: contention
● App1 writes at t0, App2's write is delayed to t1: blocking
Communicating application requirements
● Database engines use SQL to communicate declarative requirements
● HPC applications are entirely different, and require different mechanisms
[Diagram: Application → SQL → Database Engine → Data, versus HPC Application → ??? → ??? (Runtime) → Data]
I/O Middleware Stacks
● Common data model (array)
● Collective I/O
● I/O pattern transforms
● Data sieving
● Hints
[Diagram: I/O middleware layered between the application's array data model and POSIX storage]
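As a concrete illustration of the mechanisms this layer offers, here is a small MPI-IO sketch that requests collective buffering and data sieving through hints and uses a collective write. The hint names are ROMIO-specific and purely advisory; this is only an example of the interface style, not code from the talk.

```cpp
// Sketch of MPI-IO middleware features: hints plus a collective write.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_cb_write", "enable");   // ask for collective buffering
  MPI_Info_set(info, "romio_ds_write", "enable");   // ask for data sieving on writes

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "out.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

  // Each rank contributes a contiguous chunk; the collective call lets the
  // middleware reorganize the I/O pattern across ranks before hitting POSIX.
  std::vector<double> chunk(1024, rank);
  MPI_Offset offset = (MPI_Offset)rank * chunk.size() * sizeof(double);
  MPI_File_write_at_all(fh, offset, chunk.data(), (int)chunk.size(),
                        MPI_DOUBLE, MPI_STATUS_IGNORE);

  MPI_File_close(&fh);
  MPI_Info_free(&info);
  MPI_Finalize();
  return 0;
}
```

Even with these mechanisms, the hints are free-form strings and the middleware still sits on a byte-stream interface underneath, which is the gap the rest of the talk addresses.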
Remainder of the talk
● Illustrate challenges for existing I/O stacks and application design
● Describe our work integrating storage into the Legion runtime
● Preliminary results
Motivating Example
● Heterogeneous memory hierarchy
  ○ Multiple tiers and networks
● Adaptive mesh refinement (AMR)
  ○ Resolution-aware I/O
● Workflow systems and in-transit
  ○ Data rendezvous
● Out-of-core algorithms
● Data management challenges
  ○ Metadata
  ○ Consistency
● Independent I/O
Motivating Example: Independent I/O
[Diagram: computation interleaved with independent I/O tasks writing bits to System A]
Motivating Example: Independent I/O
compute();
a = async_io();
compute();
wait(a);
compute(a);
[Diagram: the hand-coded I/O schedule overlapping computation with independent I/O to System A]
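A minimal sketch of the slide's pseudocode pattern, using std::async as a stand-in for async_io(); all function names here are hypothetical. The point is that the application itself schedules the overlap and decides where to wait.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-ins for the slide's compute() / async_io() pseudocode.
std::vector<double> compute() { return std::vector<double>(1024, 1.0); }
std::vector<double> async_io_read(const char *) { return std::vector<double>(1024, 2.0); }
double compute_with(const std::vector<double> &d) { return std::accumulate(d.begin(), d.end(), 0.0); }

int main() {
  compute();
  // a = async_io(): launch the read on another thread so it overlaps with compute().
  auto a = std::async(std::launch::async, async_io_read, "checkpoint.dat");
  compute();                              // work independent of the I/O
  std::vector<double> data = a.get();     // wait(a)
  return compute_with(data) > 0 ? 0 : 1;  // compute(a)
}
```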
Independent I/O: Consistency Challenges
[Diagram: Timesteps 1 to 3 of computation with independent I/O tasks writing to System A]
Independent I/O: Consistency Challenges
[Diagram: Timesteps 1 to 3 with independent I/O to System A; the application tracks its own metadata table]
TS | State        | Done
1  | Red (T1)     | Yes
2  | Green (Xfer) | No
Independent I/O: Consistency Challenges
[Diagram: Timesteps 1 to 3 with independent I/O to System A; the application tracks its own metadata table]
TS | State                   | Done
1  | Red (T1)                | Yes
2  | Green (T1), Blue (Xfer) | No
3  | Yellow (T2)             | Yes
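A purely illustrative sketch (not code from the talk) of the bookkeeping the slide's table implies: with independent I/O, the application has to track which shards of which timestep are durable and keep that metadata consistent on its own.

```cpp
// Hand-rolled checkpoint metadata: the application's own table of which
// shards of which timestep have reached durable storage.
#include <map>
#include <string>

enum class ShardState { InMemory, Transferring, Durable };

struct TimestepRecord {
  std::map<std::string, ShardState> shards;   // e.g. "red", "green", "blue"
  bool done() const {
    for (const auto &s : shards)
      if (s.second != ShardState::Durable) return false;
    return true;
  }
};

int main() {
  std::map<int, TimestepRecord> checkpoint_log;   // TS -> state table
  checkpoint_log[1].shards = {{"red",    ShardState::Durable}};
  checkpoint_log[2].shards = {{"green",  ShardState::Durable},
                              {"blue",   ShardState::Transferring}};
  checkpoint_log[3].shards = {{"yellow", ShardState::Durable}};

  // A restart may only use timesteps whose record is done; losing or
  // corrupting this metadata makes the checkpoint unusable.
  return checkpoint_log[2].done() ? 1 : 0;
}
```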
Independent I/O: Portability
[Diagram: computation with independent I/O tasks writing to System A]
Independent I/O: Portability
[Diagram: the same computation with independent I/O tasks, now writing to System B]
Independent I/O: Portability
compute();
a = async_io();
compute();
wait(a);
compute(a);
[Diagram: the hand-coded I/O schedule now targeting System B]
Melding I/O and Application Semantics
[Diagram: application runtime and storage system (I/O) exchanging intent]
1. Data model
2. Memory model
3. Data dependence
Legion Programming Model and Runtime
● Prototype built in Legion
  ○ Parallel, data-centric, task-based
● Logical region data model
  ○ Do not commit to physical layout
● Memory hierarchy
  ○ Unified model across memory types
● Data dependencies extracted from application
  ○ Managed by runtime
  ○ Optimizations
[Diagram: Legion runtime mapping onto a machine with GASNet memory, GPU RAM, and zero-copy RAM]
“Legion: Expressing Locality and Independence with Logical Regions”, Michael Bauer, Sean Treichler, Elliott Slaughter, Alex Aiken, SC12
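For readers unfamiliar with Legion, here is a brief sketch, loosely adapted from the public Legion C++ tutorials, of what a logical region and a task launch look like. Exact type and enum names (FieldAccessor, TaskVariantRegistrar, and so on) follow recent Legion releases and may not match the version used for this work.

```cpp
#include "legion.h"
using namespace Legion;

enum TaskIDs { TOP_LEVEL_TASK_ID, INIT_TASK_ID };
enum FieldIDs { FID_TEMPERATURE };

// Fill the field with zeros; the accessor hides whatever layout and memory
// the runtime chose for the physical instance.
void init_task(const Task *task, const std::vector<PhysicalRegion> &regions,
               Context ctx, Runtime *runtime) {
  const FieldAccessor<WRITE_DISCARD, double, 1> acc(regions[0], FID_TEMPERATURE);
  Rect<1> rect = runtime->get_index_space_domain(
      ctx, task->regions[0].region.get_index_space());
  for (PointInRectIterator<1> pir(rect); pir(); pir++)
    acc[*pir] = 0.0;
}

void top_level_task(const Task *task, const std::vector<PhysicalRegion> &regions,
                    Context ctx, Runtime *runtime) {
  // A logical region is an index space crossed with a field space; nothing
  // here commits to a physical layout or a particular memory.
  IndexSpace is = runtime->create_index_space(ctx, Rect<1>(0, 1023));
  FieldSpace fs = runtime->create_field_space(ctx);
  {
    FieldAllocator alloc = runtime->create_field_allocator(ctx, fs);
    alloc.allocate_field(sizeof(double), FID_TEMPERATURE);
  }
  LogicalRegion lr = runtime->create_logical_region(ctx, is, fs);

  // The region requirement declares what the task touches and how, so
  // dependences and data movement are extracted by the runtime.
  TaskLauncher launcher(INIT_TASK_ID, TaskArgument(NULL, 0));
  launcher.add_region_requirement(
      RegionRequirement(lr, WRITE_DISCARD, EXCLUSIVE, lr));
  launcher.add_field(0, FID_TEMPERATURE);
  runtime->execute_task(ctx, launcher);

  runtime->destroy_logical_region(ctx, lr);
  runtime->destroy_field_space(ctx, fs);
  runtime->destroy_index_space(ctx, is);
}

int main(int argc, char **argv) {
  Runtime::set_top_level_task_id(TOP_LEVEL_TASK_ID);
  {
    TaskVariantRegistrar r(TOP_LEVEL_TASK_ID, "top_level");
    r.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
    Runtime::preregister_task_variant<top_level_task>(r, "top_level");
  }
  {
    TaskVariantRegistrar r(INIT_TASK_ID, "init");
    r.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
    Runtime::preregister_task_variant<init_task>(r, "init");
  }
  return Runtime::start(argc, argv);
}
```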
Legion and Persistent Memory Integration
● Our work introduces persistent memory into Legion
● HDF5 and RADOS targets
● Legion tracks instances like any other memory
● Persistence is transparent to the application
● Integrated with dependence tracking and coherence control
[Diagram: Legion runtime mapping onto a machine with GASNet memory, GPU RAM, zero-copy RAM, HDF5, and RADOS memories]
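A rough sketch of what "persistent storage as just another Legion memory" can look like at the API level, based on the attach interface in the current public Legion runtime (AttachLauncher, attach_hdf5). This is illustrative only and not necessarily the interface of the prototype described in the talk.

```cpp
// Attach an HDF5 file to a logical region so the runtime treats the file as
// another memory holding a physical instance of that region.
#include <map>
#include "legion.h"
using namespace Legion;

enum FieldIDs { FID_TEMPERATURE };

void checkpoint_region(Context ctx, Runtime *runtime, LogicalRegion lr) {
  // Map each Legion field to an HDF5 dataset path inside the file.
  std::map<FieldID, const char *> field_map;
  field_map[FID_TEMPERATURE] = "/temperature";

  AttachLauncher launcher(LEGION_EXTERNAL_HDF5_FILE, lr, lr);
  launcher.attach_hdf5("checkpoint.h5", field_map, LEGION_FILE_READ_WRITE);

  // The attached file becomes a physical instance; task launches that use
  // 'lr' are ordered against it by normal dependence analysis, so copying
  // to the file is just another runtime-managed data movement.
  PhysicalRegion file_instance = runtime->attach_external_resource(ctx, launcher);

  // ... launch tasks that read or write lr; the runtime handles coherence ...

  runtime->detach_external_resource(ctx, file_instance);
}
```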
Preliminary Results: Microbenchmark
[Diagram: Legion runtime backed by HDF5 (file system) and librados (object store)]
Preliminary Results: Microbenchmark (checkpoint)
[Diagram: checkpoint path from the Legion runtime to HDF5 (FS) and librados (object)]
Preliminary Results: Microbenchmark (restart)
[Diagram: restart path from HDF5 (FS) and librados (object) back into the Legion runtime]
Preliminary Results: Optimizations
● Sharding
● Independent I/O
[Diagram: Legion runtime backed by HDF5 (FS) and librados (object)]
Preliminary Results: Weak Scaling
● Application state partitioned into 256 shards
● Scaled from 4 GB to 32 GB across 2 to 16 nodes
● Compared throughput against IOR (N-1), HDF5, MPI-IO
[Plots: Lustre, HDF5 (read) and Lustre, HDF5 (write); annotations: MPI-IO caching, N-N, N-1]
Preliminary Results: Weak Scaling
● Application state partitioned into 256 shards
● Scaled from 4 GB to 32 GB across 2 to 10 nodes
● Transparent integration with non-POSIX backends
[Plot: RADOS target (read/write)]
Checkpoint without global barrier
● Application state partitioned into 256 shards
● 14 GB data set size (56 MB shards), fixed set of 12 nodes
● Tracked read-write phases for each shard
[Plots: Legion with Lustre/HDF5 (shared storage) and Legion with RADOS target (dedicated storage)]
Preliminary Results: Strong Scaling
● Application state partitioned into 256 shards
● Total application state size 14 GB
● Scaled from 4 to 32 nodes (Lustre), 2 to 12 nodes (RADOS)
[Plots: Lustre, HDF5 and RADOS target; annotations: caching, noise, limited DMA threads, journaling, OSD cache]
Conclusion
● Memory hierarchies are becoming complex!
● We cannot continue to just evolve applications
● Storage should be integrated into application execution models
  ○ Hard-coding optimizations is bad
  ○ Restricts flexibility and portability
● The Legion runtime and programming model support pluggable memories
  ○ Integrate persistent storage as a memory
● Initial results show the feasibility of the system design
  ○ Enables a wide range of transparent optimizations
● Questions?
  ○ Noah Watkins (jayhawk@soe.ucsc.edu)