
Optimized Scatter/Gather for Parallel Storage, PDSW-DISCS 2017



  1. Optimized Scatter/Gather for Parallel Storage
     PDSW-DISCS 2017
     Latchesar Ionkov, Carlos Maltzahn, Michael Lang
     Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
     LA-UR-17-2163

  2. HPC Storage: Stuck in the Past
     Los Alamos National Laboratory

  3. Replacing POSIX is hard
     • Great interface
     • Easy to understand and use
     • Easy to implement almost correctly
     • Not scalable for shared use
     • A lot of unsettled corner cases
     • Made for programmers
     • Scientists don’t care about files
       • they have datasets
       • they have other things to worry about
       • best case: they know how data is laid out in memory

  4. Middleware
     • Different (better?) user interface
       • HDF5
       • MPI I/O
       • ADIOS
       • ArrayQL
     • Better performance
       • MPI I/O
       • PLFS
       • DeltaFS
       • GIGA+
     • They all have to deal with POSIX idiosyncrasies

  5. Complete Systems
     • Huge effort
     • Feature creep, which makes them even harder to finish
     • Interoperability?

  6. Interfaces are important
     • Simple
     • Not too extensible, not too many knobs
     • Too much freedom is bad; the designer should make the right choices
     • ASGARD tries to be the best interface for something specific
       • right level of description of data
       • for a distributed environment
       • so data can be efficiently gathered from pieces scattered across many nodes
       • language and library independent

  7. Fragments
     • Describe part of the dataset
     • Contiguous
     • Can be materialized in memory, or stored on disk

  8. Blocks
     • Fragments consist of Blocks
     • Each Block describes a contiguous region of the fragment
     • Can be connected to Blocks in other fragments
     • Each Block has:
       • offset
       • size
       • list of Blocks
     • Three types of Blocks

  9. SBlock
     • “Simple” Block
     • Properties
       • offset
       • size
       • list of Blocks (connections, same type and size)
     • Examples:
       • double -> SBlock of size 8
       • uint32_t -> SBlock of size 4

  10. TBlock
      • “sTruct” Block
      • Groups other Blocks (of different sizes)
      • Properties
        • offset
        • size
        • list of Blocks (fields)
        • list of Blocks (connections, same type)
      • Offsets of the field Blocks are relative to the start of the TBlock
      • Can have holes
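The SBlock and TBlock properties listed above can be sketched as Go types. This is only an illustration: the type names, the `Conns` field, the `Gap` type, and the `Holes` helper are hypothetical and not taken from the ASGARD code, and fields are restricted to SBlocks to keep the sketch short.

```go
package main

import (
	"fmt"
	"sort"
)

// SBlock is a hypothetical rendering of a "simple" block: an opaque run
// of bytes such as a double (size 8) or a uint32_t (size 4).
type SBlock struct {
	Offset int       // byte offset relative to the enclosing block
	Size   int       // size in bytes
	Conns  []*SBlock // connections to same-size SBlocks in other fragments
}

// TBlock groups field blocks of possibly different sizes.
// Simplification of this sketch: fields are SBlocks only.
type TBlock struct {
	Offset int
	Size   int
	Fields []*SBlock // offsets are relative to the start of the TBlock
	Conns  []*TBlock
}

// Gap is a byte range inside a TBlock that no field covers (a "hole").
type Gap struct{ Offset, Size int }

// Holes returns the byte ranges of t that are not covered by any field.
func (t *TBlock) Holes() []Gap {
	fields := append([]*SBlock(nil), t.Fields...)
	sort.Slice(fields, func(i, j int) bool { return fields[i].Offset < fields[j].Offset })
	var gaps []Gap
	pos := 0
	for _, f := range fields {
		if f.Offset > pos {
			gaps = append(gaps, Gap{pos, f.Offset - pos})
		}
		if end := f.Offset + f.Size; end > pos {
			pos = end
		}
	}
	if pos < t.Size {
		gaps = append(gaps, Gap{pos, t.Size - pos})
	}
	return gaps
}

// example builds a 24-byte struct with 4 bytes of padding before its last field.
func example() *TBlock {
	return &TBlock{Size: 24, Fields: []*SBlock{
		{Offset: 0, Size: 8},  // float64
		{Offset: 8, Size: 4},  // float32
		{Offset: 16, Size: 8}, // float64, aligned to 8 bytes
	}}
}

func main() {
	fmt.Println(example().Holes()) // prints [{12 4}]
}
```

The padded layout in `example` is an assumption chosen to produce a hole; a packed struct would report none.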

  11. ABlock
      • “Array” Block
      • Groups Blocks of the same type and size
      • Properties:
        • offset
        • dimension sizes
        • element order (row-major, column-major, etc.)
        • element Block
        • list of Destinations (connections)
      • Destination
        • Block
        • $(a_i, b_i, c_i, d_i, \mathrm{idx}_i)$ for each dimension
        • $\mathrm{idx}_i = \dfrac{a_i x_i + b_i}{c_i x_i + d_i}$
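A minimal sketch of how a Destination could map a source element index to a destination index and byte offset, using the per-dimension formula above. Several details are assumptions of the sketch rather than statements from the slides: the division is treated as exact integer division, indices outside the destination are simply rejected, only row-major order is handled, and the names `Dest`, `MapIndex`, and `RowMajorOffset` are hypothetical.

```go
package main

import "fmt"

// Dest holds the per-dimension coefficients (a, b, c, d) from the slide.
type Dest struct{ A, B, C, D int }

// Index maps source index x to (A*x + B) / (C*x + D).
// Assumption: the mapping is valid only when the division is exact.
func (d Dest) Index(x int) (int, bool) {
	num := d.A*x + d.B
	den := d.C*x + d.D
	if den == 0 || num%den != 0 {
		return 0, false
	}
	return num / den, true
}

// MapIndex maps an n-dimensional source index to a destination index and
// reports false when any dimension falls outside the destination array.
func MapIndex(x []int, coef []Dest, dims []int) ([]int, bool) {
	y := make([]int, len(x))
	for i := range x {
		v, ok := coef[i].Index(x[i])
		if !ok || v < 0 || v >= dims[i] {
			return nil, false
		}
		y[i] = v
	}
	return y, true
}

// RowMajorOffset converts an index into a byte offset for a row-major array.
func RowMajorOffset(idx, dims []int, elemSize int) int {
	n := 0
	for i := range dims {
		n = n*dims[i] + idx[i]
	}
	return n * elemSize
}

func main() {
	// A plain shift by -25 in both dimensions: a=1, b=-25, c=0, d=1.
	shift := []Dest{{1, -25, 0, 1}, {1, -25, 0, 1}}
	dims := []int{50, 50}

	y, ok := MapIndex([]int{30, 40}, shift, dims)
	fmt.Println(y, ok, RowMajorOffset(y, dims, 10)) // [5 15] true 2650

	_, ok = MapIndex([]int{10, 40}, shift, dims)
	fmt.Println(ok) // false: 10-25 is negative
}
```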

  12. Fragment
      • Fragment
        • Collection of Blocks
        • Top-level Blocks
      • Transformation(src, dest)
        • For each top-level Block in src
          • SBlock: copy to each destination Block ∈ dest
          • TBlock: recursively run for each field (keep offsets)
          • ABlock: for each element with index $[x_1, x_2, \ldots, x_n]$
            • calculate offset in src
            • for each destination Block ∈ dest
              • calculate index $[y_1, y_2, \ldots, y_n]$ in dest
              • calculate offset
              • recursively run transformation for the element Block
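The recursive transformation above can be sketched in a few lines of Go. This is a deliberately reduced model, not the ASGARD engine: arrays are one-dimensional, the index mapping is a plain shift (`Shift`), destination offsets are taken relative to the enclosing destination element, all destinations of an array are assumed to share one element layout, and the `Block` type with its fields is hypothetical.

```go
package main

import "fmt"

type Kind int

const (
	SKind Kind = iota // simple block: raw bytes
	TKind             // struct block: fields at relative offsets
	AKind             // array block: Count elements of Size bytes each
)

// Block is a hypothetical, simplified stand-in for SBlock/TBlock/ABlock.
type Block struct {
	Kind   Kind
	Offset int      // relative to the enclosing block or element
	Size   int      // byte size (for AKind: size of one element)
	Fields []*Block // TKind only
	Elem   *Block   // AKind only
	Count  int      // AKind only, one dimension in this sketch
	Shift  int      // AKind only: destination index = source index + Shift
	Dests  []*Block // connections into the destination fragment
}

// Transform copies the bytes described by b from src to dst.
// sBase and dBase are the absolute offsets of the enclosing source and
// destination containers.
func Transform(b *Block, src []byte, sBase int, dst []byte, dBase int) {
	so := sBase + b.Offset
	switch b.Kind {
	case SKind: // copy to each destination block
		for _, d := range b.Dests {
			do := dBase + d.Offset
			copy(dst[do:do+b.Size], src[so:so+b.Size])
		}
	case TKind: // recurse into each field, keeping offsets
		for _, f := range b.Fields {
			Transform(f, src, so, dst, dBase)
		}
	case AKind: // map each element index, then recurse into the element
		for i := 0; i < b.Count; i++ {
			for _, d := range b.Dests {
				j := i + b.Shift
				if j < 0 || j >= d.Count {
					continue // element has no place in this destination
				}
				Transform(b.Elem, src, so+i*b.Size, dst, dBase+d.Offset+j*d.Size)
			}
		}
	}
}

// demo gathers field b of source elements 1 and 2 into a two-element array.
func demo() []byte {
	dstField := &Block{Kind: SKind, Offset: 0, Size: 2}
	elem := &Block{Kind: TKind, Size: 4, Fields: []*Block{
		{Kind: SKind, Offset: 0, Size: 2},                            // a: not connected
		{Kind: SKind, Offset: 2, Size: 2, Dests: []*Block{dstField}}, // b: connected
	}}
	dstArr := &Block{Kind: AKind, Size: 2, Count: 2}
	srcArr := &Block{Kind: AKind, Size: 4, Count: 4, Shift: -1, Elem: elem, Dests: []*Block{dstArr}}

	src := make([]byte, 16)
	for i := 0; i < 4; i++ {
		src[4*i], src[4*i+1] = byte(0x10+i), byte(0x10+i)   // field a
		src[4*i+2], src[4*i+3] = byte(0x20+i), byte(0x20+i) // field b
	}
	dst := make([]byte, 4)
	Transform(srcArr, src, 0, dst, 0)
	return dst
}

func main() {
	fmt.Printf("% x\n", demo()) // prints 21 21 22 22
}
```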

  13. Transformation Rules
      fragment dataset {
          var p struct {
              a, b, c float32
          }
      }
      fragment default {
          var p = p
      }
      fragment viz {
          var pa { a } = p
          var pba { b, a } = p
      }
      [Figure: block graph for the fragments dataset, default, and viz. Each variable is a T block (p and pa at 0000, pba at 0004) whose fields are S blocks at offsets 0000, 0004, and 0008, linked across fragments by dest edges.]
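To make the `viz` fragment concrete, the following sketch flattens its rules into byte-copy triples and applies them to one `p` value. It assumes a packed layout (`pa` at offset 0, `pba` at offset 4) and little-endian float32 encoding; the `Rule` type and the helper functions are hypothetical and only illustrate the effect of the rules, not how ASGARD represents them.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// Rule copies Size bytes from offset Src in the source fragment to
// offset Dst in the destination fragment.
type Rule struct{ Src, Dst, Size int }

// vizRules flattens "var pa { a } = p" and "var pba { b, a } = p".
var vizRules = []Rule{
	{Src: 0, Dst: 0, Size: 4}, // pa.a  <- p.a
	{Src: 4, Dst: 4, Size: 4}, // pba.b <- p.b
	{Src: 0, Dst: 8, Size: 4}, // pba.a <- p.a
}

// Apply runs every rule against the two buffers.
func Apply(rules []Rule, src, dst []byte) {
	for _, r := range rules {
		copy(dst[r.Dst:r.Dst+r.Size], src[r.Src:r.Src+r.Size])
	}
}

func putF32(b []byte, off int, v float32) {
	binary.LittleEndian.PutUint32(b[off:], math.Float32bits(v))
}

func getF32(b []byte, off int) float32 {
	return math.Float32frombits(binary.LittleEndian.Uint32(b[off:]))
}

// Viz encodes p = {a, b, c}, applies the rules, and decodes pa.a, pba.b, pba.a.
func Viz(a, b, c float32) (float32, float32, float32) {
	src := make([]byte, 12)
	putF32(src, 0, a)
	putF32(src, 4, b)
	putF32(src, 8, c)
	dst := make([]byte, 12)
	Apply(vizRules, src, dst)
	return getF32(dst, 0), getF32(dst, 4), getF32(dst, 8)
}

func main() {
	fmt.Println(Viz(1.5, 2.5, 3.5)) // prints 1.5 2.5 1.5
}
```

Field `c` never appears in the output because no rule in `viz` references it.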

  14. Fragment Sources
      [Figure: graph of fragments A through J. Nodes carry labels such as (A:0.25), (B:0.3), (D:0.4), (E:0.2), and (H:1.0), alongside combined labels such as (A:0.25 J:0.25), (B:0.2 J:0.2), (D:0.1 J:0.1), and (E:0.3 J:0.3).]

  15. Transformation Rules
      [Figure: Nodes A, B, C, D, and T, each applying local transformations (local xform) and exchanging remote transformations (rmt xform).]

  16. fragment ds {
          type P struct {
              a float64
              b float32
              c float64
              d int16
          }
          var data [100, 100] P
      }
      fragment f1 {
          var d1 = data
      }
      fragment f2 {
          var d2 { a, c } = data
      }
      fragment f3 {
          var d3[i, j] {d, c} = data[i-25, j-25]
      }
      [Figure: block graph for the four fragments, with dest edges labeled { i } | { j }, { i-25 } | { j-25 }, and { i+25 } | { j+25 }. Block labels are offset:size.
       ds: A 000000:220000 (100, 100), element T 000000:22, fields S 000000:8, S 000008:4, S 000012:8, S 000020:2
       f1: A 000000:110000 (100, 50), element T 000000:22, fields S 000000:8, S 000008:4, S 000012:8, S 000020:2
       f2: A 000000:80000 (50, 100), element T 000000:16, fields S 000000:8, S 000008:8
       f3: A 000000:25000 (50, 50), element T 000000:10, fields S 000000:2, S 000002:8]
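The offset:size labels in the figure follow from a packed layout with no padding, which can be checked with a short sketch. The `Field` type and the `Pack` and `ArrayBytes` helpers are hypothetical names introduced for this check; the field sizes and array dimensions are the ones shown on the slide.

```go
package main

import "fmt"

// Field is a named member with a size in bytes.
type Field struct {
	Name string
	Size int
}

// Pack lays fields out back to back with no padding and returns each
// field's offset together with the total struct size.
func Pack(fields []Field) ([]int, int) {
	offs := make([]int, len(fields))
	size := 0
	for i, f := range fields {
		offs[i] = size
		size += f.Size
	}
	return offs, size
}

// ArrayBytes returns the byte size of an array with the given dimensions.
func ArrayBytes(dims []int, elemSize int) int {
	n := elemSize
	for _, d := range dims {
		n *= d
	}
	return n
}

func main() {
	// type P struct { a float64; b float32; c float64; d int16 }
	offs, size := Pack([]Field{{"a", 8}, {"b", 4}, {"c", 8}, {"d", 2}})
	fmt.Println(offs, size, ArrayBytes([]int{100, 100}, size)) // [0 8 12 20] 22 220000

	// Projection { a, c } over a (50, 100) array, as labeled for f2.
	_, ac := Pack([]Field{{"a", 8}, {"c", 8}})
	fmt.Println(ac, ArrayBytes([]int{50, 100}, ac)) // 16 80000

	// Projection { d, c } over a (50, 50) array, as labeled for f3.
	_, dc := Pack([]Field{{"d", 2}, {"c", 8}})
	fmt.Println(dc, ArrayBytes([]int{50, 50}, dc)) // 10 25000
}
```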

  17. Optimizations
      [Figure: trees of ABlocks, TBlocks, and SBlocks before and after each optimization.]
      a. Merging neighboring fields
      b. Replacing a TBlock with an SBlock
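One plausible reading of these two optimizations, sketched on flattened copy rules: fields that are adjacent in both the source and the destination merge into a single larger copy, and a struct whose fields all merge collapses into one simple block. The slides do not spell out the exact conditions, so the `Rule` type and the `Merge` function are hypothetical and the adjacency test is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// Rule copies Size bytes from source offset Src to destination offset Dst.
type Rule struct{ Src, Dst, Size int }

// Merge coalesces rules that are contiguous in both source and destination.
func Merge(rules []Rule) []Rule {
	rs := append([]Rule(nil), rules...) // leave the caller's slice untouched
	sort.Slice(rs, func(i, j int) bool { return rs[i].Src < rs[j].Src })
	var out []Rule
	for _, r := range rs {
		if n := len(out); n > 0 {
			last := &out[n-1]
			if last.Src+last.Size == r.Src && last.Dst+last.Size == r.Dst {
				last.Size += r.Size // extend the previous copy
				continue
			}
		}
		out = append(out, r)
	}
	return out
}

func main() {
	// Copying every field of a packed 22-byte struct: four copies become
	// one, so the struct can be treated as a single simple block.
	full := []Rule{{0, 0, 8}, {8, 8, 4}, {12, 12, 8}, {20, 20, 2}}
	fmt.Println(Merge(full)) // [{0 0 22}]

	// A projection that skips a field leaves a gap in the source, so the
	// two copies stay separate.
	proj := []Rule{{0, 0, 8}, {12, 8, 8}}
	fmt.Println(Merge(proj)) // [{0 0 8} {12 8 8}]
}
```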

  18. Ceph Integration
      • RADOS Objects - custom object class extension
        • Dataset object
          • metadata: dataset + stripe definitions
          • no data
        • Stripe object
          • partial read/write using transformation rules
          • write triggers updates to secondary replicas
      • Client Side
        • access unit is a fragment
        • server sends back a list of objects and transformation rules
        • client executes the local transformation rules and sends the remote transformation rules (+ data) to the OSD

  19. Results: MPI Tile I/O
      [Figure: four panels showing Write and Read Bandwidth (MB/s) versus Tile Size (500 to 4500) and versus Number Of Ranks (10 to 100), comparing ASGARD, Collective MPI I/O, and Non-collective MPI I/O.]

  20. Results: HPIO Read
      [Figure: four panels showing Bandwidth (MB/s) versus Region Count (100000 to 1x10^7) for Contiguous Memory / Contiguous Storage, Contiguous Memory / Non-contiguous Storage, Non-contiguous Memory / Contiguous Storage, and Non-contiguous Memory / Non-contiguous Storage, comparing ASGARD, Collective MPI I/O, and Non-collective MPI I/O.]

  21. Ceph Bandwidth
      [Figure: three panels showing Bandwidth (MB/s, 0 to 400) versus Time (s, 0 to 2000) for ASGARD, Collective MPI I/O, and Non-collective MPI I/O.]

  22. Ceph Operations
      [Figure: three panels showing Operations per second versus Time (s) for ASGARD, Collective MPI I/O, and Non-collective MPI I/O.]

  23. Conclusions
      • ASGARD defines a language- and library-independent data description
      • Compact transformation rules
      • Small transformation engine (3K LOC) with implementations in Go and C
      • Easy to integrate into storage systems and libraries
      • Questions:
        • is it the right level of data description?
        • does it make sense to push for general file systems?
        • what else did we miss?
        • do we need byte order (LSB, MSB) and/or primary type (IEEE 754, integer)?
