UNITY: Unified Memory and File Space


  1. UNITY: Unified Memory and File Space
Terry Jones, ORNL, June 27, 2017
ORNL is managed by UT-Battelle for the US Department of Energy
PIs: Terry Jones (ORNL), Michael Lang (LANL), Ada Gavrilovska (GaTech)

  2. Talk Outline
• Motivation
• UNITY's Architecture
• Early Results
• Conclusion

  3. Timeline to a Predicament – APIs & Growing Memory Complexity
• Problem: The simple bifurcated memory hierarchy of the 1950s has evolved into a much more complex hierarchy while interfaces have remained relatively fixed. At the same time, computer architectures have evolved from single nodes to large parallel systems.
• Solution: Update the interface to support a prescriptive (directive-based) approach, and manage dynamic placement and movement with a smart distributed runtime system.
• Impact: Enable domain scientists to focus on their specialty rather than requiring them to become experts on memory architectures, and enable target-independent programmability and target-independent performance.
Fig 1: An early memory hierarchy.
Timeline figure (1950 to 2020): MIT core memory; IBM RAMAC 305 magnetic disk storage (5 MB); IBM OS/360; Multics; UNIX (originally UNICS); Intel DRAM; cheap 64K DRAM; Seagate's first magnetic disk for microcomputers (5 MB); the POSIX effort begins; SGI releases NUMAlink; MPI-IO (a non-POSIX API for parallel I/O); CompactFlash for consumer electronics; HDDs reach 1 TB with 3.5" platters; nVidia HBM (stacked memory); UNITY is funded.

  4. Exascale and beyond promises to continue the trend towards complexity
Figure: the memory/storage hierarchy, from fastest and smallest to slowest and largest:
• Processor registers: extremely fast, extremely expensive, tiny capacity
• CPU cache (SRAM, levels 1, 2, 3): faster, expensive, small capacity
• Byte-addressable NVM (STT-RAM, PCRAM, ReRAM): nonvolatile, reasonably fast, denser than NAND, byte-addressable, limited lifetime, product not yet available
• DRAM main memory (random access memory): volatile, reasonably fast, reasonably priced, reasonable capacity
• Secondary storage level 1, page-based NVM (NAND flash): non-byte-addressable, faster than magnetic disk, cheaper than DRAM, limited density, limited lifetime
• Secondary storage level 2, magnetic disk: large capacity, slow, cheap
• Tertiary storage, tape: very large capacity, very slow, very cheap
Access speed increases toward the top of the hierarchy; capacity increases toward the bottom.

  5. The Traditional Dimensions…
Figure: the hierarchy characterized along two traditional axes, access speed and capacity.

  6. The Traditional Dimensions are Being Expanded…
Figure: the access speed and capacity axes, now being expanded.

  7. The Traditional Dimensions Are Being Expanded Into Future Directions
Figure: beyond access speed and capacity, the new dimensions include compatibility with legacy apps, concurrency, resiliency, and energy.

  8. …But the Exposed Path to Our Memory Hierarchy Remains Bifurcated
Figure: user space (applications, libraries) reaches physical devices through two separate paths:
• memory access: mmap, a memory-mapping FS, and virtual-to-physical translation onto NVM and DRAM
• file I/O: the VFS, a traditional FS, the buffer cache, and a block device onto disks and SSDs
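The bifurcation can be made concrete with a minimal C sketch (not from the slides): the same file is updated once through the file-I/O path and once through the memory path. The file name and sizes are illustrative only.

    /* Minimal sketch contrasting today's two exposed paths to the hierarchy. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        const size_t len = 4096;

        /* Path 1: file I/O -- VFS, file system, buffer cache, block device. */
        int fd = open("/tmp/unity_demo.dat", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }
        const char *msg = "written via the file-I/O path";
        if (pwrite(fd, msg, strlen(msg) + 1, 0) < 0) { perror("pwrite"); return 1; }

        /* Path 2: memory access -- mmap, memory-mapping FS, virtual-to-physical. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        strcpy(p + 2048, "written via the load/store path");
        msync(p, len, MS_SYNC);   /* durability must be requested separately here */

        munmap(p, len);
        close(fd);
        return 0;
    }

Note how power, resilience, and durability decisions are taken independently down each path, which is exactly the problem the next slide describes.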

  9. Implications
• Complexities arise in managing power and resilience when actions can be taken independently down the two paths (file-based I/O and memory-based I/O).
• Applications end up factoring in multiple data layouts for different architectural reasons.
• Computers are good at handling the details dynamically.
• Burst buffers, new memory layers, concurrency, power, and resilience make data placement difficult for domain scientists.
Figure: user space (applications, libraries) again split between file-based I/O onto disks and SSDs and memory-based I/O onto NVM and DRAM.

  10. UNITY Architecture
• Local node runtime: a persistent daemon that handles subsequent accesses and performs post-job security cleanup.
• "Dynamic" components: active throughout the life of the application, able to adjust strategies, and incorporating copy-on-write (COW) optimizations.
• Local & global optimizers: direct data placement; the global optimizer also considers collective and machine-status optimizations.
• Nameserver for metadata management: efficiently describes data mappings, keeps track of published objects, and runs as a persistent daemon at a well-known address.
Fig 1: The UNITY architecture. A global-distributed runtime (fragment name server, global data placement, SNOFlake aggregated system statistics) coordinates per-node runtimes that place versioned data fragments across RAM, HBM, NVM, and DRAM close to the applications. The architecture is designed for an environment that is (a) prescriptive; (b) distributed; (c) dynamic; (d) cooperative.
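As an illustration of the kind of per-fragment metadata such a name server and the node runtimes might exchange, here is a hypothetical C sketch; none of these type or field names come from UNITY, they only mirror the concepts named above (published objects, versioned fragments, placement tiers, COW).

    /* Hypothetical sketch only: not the real UNITY data structures. */
    #include <stdint.h>

    typedef enum { TIER_HBM, TIER_DRAM, TIER_NVM, TIER_SSD, TIER_DISK } tier_t;

    typedef struct fragment_record {
        char      object_name[64];   /* published object the fragment belongs to  */
        uint32_t  fragment_id;       /* position within the object                */
        uint32_t  version;           /* versioned data fragments                  */
        uint32_t  home_node;         /* node whose local runtime owns this copy   */
        tier_t    tier;              /* where the local optimizer placed it       */
        uint64_t  length;            /* size in bytes                             */
        int32_t   cow_parent;        /* -1, or the fragment it shares pages with  */
    } fragment_record_t;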

  11. UNITY Design Objectives
A unified data environment based on a smart runtime system:
1. frees applications from the complexity of directly placing and moving data within multi-tier storage hierarchies,
2. while still meeting application-prescribed requirements for data access performance, efficient data sharing, and data durability.
Fig 1: the same UNITY architecture diagram as the previous slide, designed for an environment that is (a) prescriptive; (b) distributed; (c) dynamic; (d) cooperative.

  12. Automated Movement with UNITY Data Placement Domains
Figure: with UNITY, data placement domains span RAM, HBM, NVM, SSD, disk, and tape (numbered layers 1 through 26) across compute nodes, I/O nodes, burst buffers, the storage system (e.g., Lustre or GPFS), and tertiary storage (e.g., HPSS). The legend distinguishes UNITY user components, UNITY service components, existing system software, and tiers not likely in future architectures. The orange triangle marks the UNITY Memory Hierarchy Layer (MHL): lower numbers present faster access to the application, while higher numbers present more aggregated capacity. The placement domains cover both the compute domain (e.g., Titan) and the storage domain (separate systems).
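A minimal sketch of the MHL idea, assuming a simplified numbering (the slide's figure enumerates layers 1 through 26 across the compute and storage domains): lower numbers trade capacity for access speed, higher numbers trade speed for aggregated capacity. The tiers, thresholds, and policy function below are illustrative only, not part of UNITY.

    /* Hypothetical sketch of MHL-style placement; numbering is simplified. */
    #include <stdbool.h>

    typedef struct { int mhl; const char *tier; } mhl_entry_t;

    /* Example layering for one compute node plus the storage domain. */
    static const mhl_entry_t hierarchy[] = {
        {1, "HBM"}, {2, "DRAM"}, {3, "NVM"}, {4, "burst buffer SSD"},
        {5, "parallel file system disk"}, {6, "tape"},
    };

    /* Hot, latency-sensitive fragments go to low MHL numbers; cold or very
     * large fragments go to high MHL numbers. */
    static int choose_mhl(bool hot, unsigned long long bytes) {
        if (hot && bytes < (1ULL << 30)) return 1;   /* fits in fast memory   */
        if (hot)                         return 3;   /* hot but large: NVM    */
        if (bytes < (1ULL << 40))        return 5;   /* cold: file system     */
        return 6;                                    /* archival: tape        */
    }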

  13. Providing a Prescriptive API
Scientific Achievement: A new API enables domain scientists to describe how their data is to be used. This permits a smart runtime system to do the tedious work of managing data placement and movement.
Significance & Impact: Vendors are providing multiple APIs to deal with their novel abilities. Through our co-design oriented project, we provide a unifying way to achieve what the domain scientists want in a machine-independent way.
Research Details: We currently have a functional prototype that provides most of the functionality that will be visible to application developers using the runtime. We have created a global name service that can be used to query datasets and where their data is located. We have created a runtime service that runs on each node and keeps track of the data available locally on that node. The runtime, in conjunction with the global name server, creates a distributed data store that can utilize both the volatile and nonvolatile memory available on the supercomputer. We have designed and implemented hooks in the system so that intelligent data placement services can identify and update the data location and format in order to improve overall performance. We have modified the SNAP proxy application to use UNITY's API. We can checkpoint and restart the application, and can demonstrate the advantages of UNITY by checkpointing an N-rank SNAP job and restarting it as an M-rank job. Datasets/checkpoints can be pulled from multiple hosts over TCP or InfiniBand.
Fig 1: The UNITY API is designed to be extensible and flexible; this example uses meshes with ghost cells:

    IMD unity_create_object("a", objDescr);
    workingset = unity_attach_fragment(workFrag, flags);
    xghost = unity_attach_fragment(ghosttop, flags);
    yghost = unity_attach_fragment(ghostleft, flags);

    for (x = 0; x < 1000; x++) {
        if (x > 0) {
            reattach(xghost, x);
            reattach(yghost, x);
        }
        // do work
        unity_publish(workingset);
    }
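To make the N-rank checkpoint / M-rank restart scenario above concrete, here is a hedged C sketch (not from the slides). unity_create_object, unity_attach_fragment, and unity_publish appear in the slide's example; the lookup call, handle types, fragment descriptors, and MPI glue below are hypothetical stand-ins for illustration only.

    /* Hedged sketch: restart on M ranks from a checkpoint published by N ranks. */
    #include <mpi.h>

    typedef void *unity_object_t;                                 /* hypothetical  */
    typedef void *unity_fragment_t;                               /* hypothetical  */
    unity_object_t   unity_lookup_object(const char *name);       /* hypothetical  */
    unity_fragment_t unity_attach_fragment(void *descr, int flags); /* per slide   */
    void            *fragment_descr(unity_object_t obj, int index); /* hypothetical*/
    void             load_domain_from_fragment(unity_fragment_t f); /* app-specific*/

    void restart_from_checkpoint(const char *ckpt_name, int total_fragments) {
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);  /* M may differ from the original N */

        /* Resolve the published checkpoint object via the global name service. */
        unity_object_t ckpt = unity_lookup_object(ckpt_name);

        /* Each of the M ranks attaches its share of the N-rank checkpoint's
         * fragments; the runtime fetches them from wherever they were placed
         * (local NVM, a peer node over TCP/InfiniBand, or the file system). */
        for (int f = rank; f < total_fragments; f += nranks) {
            unity_fragment_t frag = unity_attach_fragment(fragment_descr(ckpt, f), 0);
            load_domain_from_fragment(frag);
        }
    }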
