An Exploration into Object Storage Lance Evans Raghu Chandrasekar Office of the CTO Storage R&D
Safe Harbor Statement
This presentation may contain forward-looking statements that are based on our current expectations. Forward-looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets, and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.
Dagstuhl Seminar #17202
Lexicon?
Attribute, Namespace, Storage, Persistent, Cache, Ephemeral, File, Object, Consistent, Tier, Resilient, Coherent, Durable, Reliable, Authenticate, Secure
Motivations
• Transition from files to objects (wild west)
• Convergence of analytics and HPC (which is which?)
• Hardware evolution (SMR, NAND, PCM, TLAs)
• Open-source movement (mostly commercial)
• Dev cost and agility (10-yr gestation cycle)
• Specialized frameworks (still require common infrastructure)
• Flat namespaces (enable bazillion objects)
• Scaled DBs (graph, schema-less, NewSQL)
• Required scale (256 to 256k compute nodes, 64 to 64k storage devices)
HPC & Analytics Convergence
[Figure: system diagram; labels as best recoverable from the slide. Top: two user applications, one using POSIX files, HDF5 containers, or other formats over an HPC file or object layer with optional caching, the other using Spark RDDs or K/V over an analytics framework K/V layer with local caching. Below them: application object interface; compute nodes with flash and pmem; RDMA transport; high-speed dragonfly fabric; storage object interface; flash storage devices. Other labels: Discover, Query, Store; 256k Node; Scalable Metadata Service; Management, Monitor, Infrastructure Services.]
SAROJA: Architecture
[Figure: layered architecture diagram. Layers, top to bottom: HPC codes and analytics frameworks; access interfaces: Native API, POSIX, HDF5/NetCDF, RESTful/S3/CDMI (put/get semantics); SAROJA user-space library; control plugins, metadata plugins, data plugins.]
“An Exploration into Object Storage for Exascale Supercomputers”, Raghunath Raja Chandrasekar, Lance Evans, Robert Wespetal, Cray User Group Conference (CUG) 2017
The Client
• libsaroja.so, Put/Get semantics, FUSE, and wrapfs (in-kernel)
• Client-side intelligence
• Pluggable backends: metadata, data, control
• Algorithmic data node selection
• POSIX <-> KV/NoSQL metadata translation
• Interface with consensus agents
• Retain fidelity of structured data formats
“…performance degradation caused by FUSE can be completely imperceptible or as high as 83% even when optimized; and relative CPU utilization can increase by 31%…”
Metadata Services
• Traditional HPC metadata services
  • Interface dependent
  • Strongly consistent and normalized
  • HA-based fault tolerance
• Desirable characteristics
  • The right Consistency, Availability, Partition Tolerance (CAP) balance
  • {Scalability and performance} vs {Strong consistency} tradeoff
  • API agnostic: serves POSIX, objects, structured datasets, etc.
  • Storage-device conscious (NVMe, PMEM)
  • Analytics on the metadata
NoSQL an option?
• Distributed Hash Tables for data placement
• Fault tolerance with failure domains
• Built-in consensus mechanisms
• Log-Structured Merge Trees to optimize KV storage
• Cassandra for our initial proof-of-concept
  • Scaled to thousands of nodes, 10 PB of data, 1 trillion requests/day
  • In use at 1500+ companies in production
  • Low-level APIs in C
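To make the "Distributed Hash Tables for data placement" bullet concrete, here is a minimal sketch of a consistent-hash token ring, assuming a simplified model: the `HashRing` class, the virtual-node count, and the choice of SHA-256 are illustrative assumptions, not Cassandra's partitioner or SAROJA's code, and failure domains are not modeled.

```python
import bisect
import hashlib


def _token(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical token ring: each node owns several virtual tokens."""

    def __init__(self, nodes, vnodes=64):
        nodes = list(nodes)
        self._ring = sorted(
            (_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._ring]
        self._node_count = len(set(nodes))

    def replicas(self, key, n=3):
        """Walk clockwise from the key's token, collecting n distinct nodes."""
        n = min(n, self._node_count)
        i = bisect.bisect_right(self._tokens, _token(key))
        found = []
        while len(found) < n:
            node = self._ring[i % len(self._ring)][1]
            if node not in found:
                found.append(node)
            i += 1
        return found


if __name__ == "__main__":
    ring = HashRing(["md0", "md1", "md2", "md3"])
    print(ring.replicas("/mnt/bar/b/file123"))
```

Adding a server to such a ring only moves the keys that the new server's tokens take over; every other key keeps its previous owner, which is what lets placement stay algorithmic without a central lookup table.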
Metadata models for POSIX
Example tree: /mnt; /mnt/bar; /mnt/foo; /mnt/bar/a; /mnt/bar/b (#6789); /mnt/bar/b/file123 (#1234)
• Pathname-as-key
  KEY: /mnt/bar/b/file123
  VAL: atime;mtime;size;xattrs
• Two-tier collections
  KEY: 6789
  VAL: 1234;
  KEY: 1234
  VAL: atime;mtime;size;xattrs
• IndexFS
  KEY: 6789, part_key, hash(file123)
  VAL: atime;mtime;size;xattrs; *data_obj;
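The first two models can be contrasted with a small in-memory sketch. Everything here is a hypothetical illustration: plain dicts stand in for the KV store, the class and method names are invented, and the two-tier directory collection is assumed to map child names to IDs (the slide shows only the IDs).

```python
import itertools


class PathnameAsKey:
    """One KV entry per full pathname: stat is a single get."""

    def __init__(self):
        self.kv = {}

    def create(self, path, attrs):
        self.kv[path] = attrs

    def stat(self, path):
        return self.kv[path]

    def rename_dir(self, old, new):
        """Every key under the old prefix must be rewritten."""
        moved = [k for k in self.kv if k == old or k.startswith(old + "/")]
        for k in moved:
            self.kv[new + k[len(old):]] = self.kv.pop(k)
        return len(moved)  # number of KV entries touched


class TwoTier:
    """Directory collections (id -> {name: id}) plus per-object attributes."""

    ROOT = 0

    def __init__(self):
        self._ids = itertools.count(1)
        self.children = {self.ROOT: {}}
        self.attrs = {self.ROOT: {}}

    def _resolve(self, path):
        node = self.ROOT
        for name in filter(None, path.split("/")):
            node = self.children[node][name]  # one lookup per path component
        return node

    def create(self, path, attrs, is_dir=False):
        parent, _, name = path.rpartition("/")
        oid = next(self._ids)
        self.children[self._resolve(parent)][name] = oid
        self.attrs[oid] = attrs
        if is_dir:
            self.children[oid] = {}
        return oid

    def stat(self, path):
        return self.attrs[self._resolve(path)]

    def rename_dir(self, old, new):
        """Only the parent collections change; descendants are untouched."""
        old_parent, _, old_name = old.rpartition("/")
        new_parent, _, new_name = new.rpartition("/")
        oid = self.children[self._resolve(old_parent)].pop(old_name)
        self.children[self._resolve(new_parent)][new_name] = oid
        return 1  # number of directory entries moved
```

In this sketch the tradeoff is visible in the return values of `rename_dir`: pathname-as-key makes `stat` a single lookup but rewrites every descendant key on a directory rename, while the two-tier layout pays one lookup per path component and renames in constant work.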
Scaling with POSIX-over-NoSQL
[Figure: File creation rate (mdtest). Y-axis: creates per second (0 to 180,000). X-axis: number of Cassandra servers (1, 2, 4, 8). Reference annotation: peak Lustre file creation rate (w/o DNE).]
• POSIX over SAROJA
• 4480 MPI ranks
• 2.2 million files total
• TCP over GNI on XC40
• Lots of cheating
Data Path Services
• Algorithmic mapping of object shards to distributed servers
• Data resilience by means of replication or erasure-coding
• On-the-fly recovery in the face of failures
• Data movement between tiers within and outside the store
• Efficient and granular use of underlying fabric/memory/storage
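As a rough illustration of erasure-coded resilience and on-the-fly recovery, the sketch below uses the simplest possible code, k data shards plus one XOR parity shard, which tolerates the loss of any single shard. This is an assumed stand-in chosen for brevity; production stores typically use Reed-Solomon-style codes that survive multiple failures, and the function names are hypothetical.

```python
def _xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list:
    """Split data into k zero-padded shards and append one parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = bytes(shard_len)
    for shard in shards:
        parity = _xor(parity, shard)
    return shards + [parity]


def recover(shards: list, missing: int) -> bytes:
    """Rebuild the shard at index `missing` by XOR-ing all surviving shards."""
    survivors = [s for i, s in enumerate(shards) if i != missing]
    rebuilt = bytes(len(survivors[0]))
    for shard in survivors:
        rebuilt = _xor(rebuilt, shard)
    return rebuilt


def decode(shards: list, original_len: int) -> bytes:
    """Concatenate the data shards (all but the parity) and drop the padding."""
    return b"".join(shards[:-1])[:original_len]


if __name__ == "__main__":
    blob = b"object storage for exascale"
    stored = encode(blob, k=4)
    stored[2] = recover(stored, missing=2)  # server holding shard 2 was lost
    print(decode(stored, len(blob)))
```

With k = 4 the storage overhead is 25%, compared with 100% or more for replication, at the cost of reading all surviving shards to rebuild a lost one.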
Ceph in the Data Path
• Data plugin with librados in C
• Stripes of a BLOB/File/HDF5 container -> Ceph object
• Static approach to track extents (currently), à la Lustre/DataWarp
• Clients track servers based on ID+offset
• Two-tier path: replicated pmem tier and erasure-coded flash tier
• Ceph OSD backends: FileStore, BlueStore, PMStore, and others
• Ceph Messengers
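A static stripe-to-object mapping of the kind described above can be sketched as pure arithmetic on the container ID and byte offset. The object naming scheme, the 4 MiB default stripe size, and the `map_range` helper are assumptions for illustration only; they are not taken from the SAROJA data plugin or from librados.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    object_name: str    # hypothetical name of the object holding this stripe
    object_offset: int  # byte offset inside that object
    length: int         # number of bytes of the request served by this object


def map_range(container_id: str, offset: int, length: int,
              stripe_size: int = 4 * 1024 * 1024) -> list:
    """Translate a byte range of a container into per-object extents.

    Assumes one fixed-size stripe per object, so the mapping needs no
    lookup table: any client can compute it from the ID and the offset.
    """
    extents = []
    end = offset + length
    while offset < end:
        index = offset // stripe_size
        within = offset % stripe_size
        chunk = min(stripe_size - within, end - offset)
        extents.append(Extent(f"{container_id}.{index:08x}", within, chunk))
        offset += chunk
    return extents


if __name__ == "__main__":
    # A 10 MiB read starting 1 MiB into the container spans three stripes.
    for extent in map_range("dataset42", 1 << 20, 10 << 20):
        print(extent)
```

Because the mapping is deterministic, clients need no per-extent metadata round trip; the cost is that the layout cannot adapt per file, which is the limitation of a static approach.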
Co-design Concepts
ALL POSSIBLE FROM USER-SPACE!
• NVMe Flash
  • Intel SPDK: Polling vs Interrupts
  • Controller Memory Buffers
• Persistent Memory
  • 64B, 512B, atomicity guarantees, “sector tearing”
  • NVML suite of libraries
  • DAX+mmap()
• Open-Channel SSDs
  • LightNVM framework in kernel
  • Storage data structures like LSM-Trees become the FTL
• Fabric concepts from MPI/SHMEM
Questions?
Lance Evans (lance@cray.com), Raghu Chandrasekar (raghu@cray.com)