Bare Metal: Library Abstractions for Modern Hardware
Cyprien Noel
Plan
1. Modern Hardware?
2. New challenges & opportunities
3. Three use cases
   ○ Current solutions
   ○ Leveraging hardware
   ○ Simple abstraction
Myself
● High performance trading systems
  ○ Lock-free algos, distributed systems
● H2O
  ○ Distributed CPU machine learning, async SGD
● Flickr
  ○ Scaling deep learning on GPU ⎼ Multi GPU Caffe
  ○ RDMA, multicast, distributed Hogwild ⎼ CaffeOnSpark
● UC Berkeley
  ○ NCCL Caffe, GPU cluster tooling
  ○ Bare Metal
Modern Hardware?
Device-to-device networks
Moving from ms software to µs hardware
● Number crunching ➔ GPU
● FS, block io, virt mem ➔ Pmem
● Network stack ➔ RDMA
● RAID, replication ➔ Erasure codes
● Device mem ➔ Coherent fabrics
● And more: video, crypto, etc.
OS abstractions replaced by
● CUDA
● OFED
● Libpmem
● DPDK
● SPDK
● Libfabric
● UCX
● VMA
● More every week...
More powerful, but also more complex and non-interoperable.
Summary So Far
● Big changes coming!
  ○ At least for high-performance applications
● CPU should orchestrate
  ○ Not in the critical path
  ○ Device-to-device networks
● Retrofitting existing architectures is difficult
  ○ CPU-centric abstractions
  ○ ms software on µs hardware (e.g. 100s of instructions per packet)
  ○ OK in some cases, e.g. VMA (kernel-bypass sockets), but much lower acceleration; most features inexpressible
What do we do?
● Start from scratch?
  ○ E.g. Google Fuchsia - no fs, block io, network, etc.
  ○ Very interesting, but future work
● Use already-accelerated frameworks?
  ○ E.g. PyTorch, BeeGFS
  ○ Not general purpose, no interop, not device-to-device
● Work incrementally from use cases
  ○ Look for the simplest hardware solution
  ○ Hopefully useful abstractions will emerge
Use cases
● Build datasets
  ○ Add, update elements
  ○ Apply functions to sets, map-reduce
  ○ Data versioning
● Training & inference
  ○ Compute graphs, pipelines
  ○ Deployment
  ○ Model versioning
Datasets
● Typical solution
  ○ Protobuf messages
  ○ KV store
  ○ Dist. file system
● Limitations
  ○ Serialization granularity (x12)
  ○ Copies: kv log, kernel1, replication, kernel2, fs
  ○ Remote CPU involved, stragglers
  ○ Cannot place data in device
Datasets ⎼ EC shard
● Simplest hardware implementation (sketched below)
  ○ Write protobuf in an arena, like Flatbuffers
  ○ Pick an offset on disks, e.g. a namespace
  ○ Call ibv_exp_ec_encode_async
● Comments
  ○ Management, coordination, crash resiliency (x12)
  ○ Thin wrapper over HW: line-rate perf.
● User abstraction?
  ○ Simple, familiar
  ○ Efficient, device friendly
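A minimal Python sketch of that write path, under stated assumptions: arena_alloc, pick_namespace_offset, and ec_encode_async are hypothetical helpers (the last standing in for the ibv_exp_ec_encode_async verb); only the protobuf call is a real API.

  import sample_pb2  # hypothetical generated protobuf module

  def append_record(record):
      # Serialize the protobuf into flat bytes (arena-style layout).
      data = record.SerializeToString()
      # Reserve a slot in a pre-registered, pinned arena so the NIC can
      # DMA it without extra copies (hypothetical helper).
      buf = arena_alloc(len(data))
      buf[:len(data)] = data
      # Pick a fixed offset on the target disks, e.g. a namespace
      # (hypothetical helper).
      offset = pick_namespace_offset(len(data))
      # Kick off the erasure-coding offload and return the completion;
      # this wraps ibv_exp_ec_encode_async on the C side (hypothetical wrapper).
      return ec_encode_async(buf, offset)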
mmap
● Extension to classic mmap
  ○ Distributed
  ○ Typed ⎼ Protobuf, other formats planned
● Protobuf is amazing
  ○ Forward and backward compatible
  ○ Lattice
mmap
● C++
  const Test& test = mmap<Test>("/test");
  int i = test.field();
● Python
  test = Test()
  bm.mmap("/test", test)
  i = test.field()
mmap, recap
● Simple abstraction for data storage
● Fully accelerated, “mechanically friendly”
  ○ Thin wrapper over HW, device-to-device, zero copy
  ○ ~1.5x replication factor
  ○ Network automatically balanced
  ○ Solves straggler problem
  ○ No memory pinning or TLB thrashing, NUMA aware
Use cases
● Compute
  ○ Map-reduce, compute graphs, pipelines
● Typical setup
  ○ Spark, DL frameworks
  ○ Distribution using Akka, gRPC, MPI
  ○ Kubernetes or SLURM scheduling
● Limitations
  ○ No interop
  ○ Placement difficult
  ○ Inefficient resource allocation
Compute
● Simplest hardware implementation (sketched below)
  ○ Define a task, e.g. image resize, CUDA kernel, PyTorch graph
  ○ Place tasks in a queue
  ○ Work stealing ⎼ RDMA atomics
  ○ Device-to-device chaining ⎼ GPU Direct Async
● User abstraction?
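A rough sketch of the work-stealing queue, with a local atomic counter standing in for the RDMA fetch-and-add; TaskQueue and its methods are illustrative names, not the library's API.

  import itertools

  class TaskQueue:
      def __init__(self):
          self.tasks = []                 # appended by producers
          self._next = itertools.count()  # stand-in for an RDMA atomic counter

      def push(self, fn, *args):
          self.tasks.append((fn, args))

      def steal(self):
          # Each worker atomically claims the next index; the first to
          # increment owns the task, so no central scheduler is needed.
          i = next(self._next)
          if i < len(self.tasks):
              fn, args = self.tasks[i]
              return fn(*args)
          return None

  q = TaskQueue()
  q.push(lambda x, y: x * y, 1, 2)
  q.steal()  # any idle worker (CPU or device proxy) can claim and run this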
task
● Python
  @bm.task
  def compute(x, y):
      return x * y

  # Runs locally
  compute(1, 2)

  # Might be rebalanced on the cluster
  data = bm.list()
  bm.mmap("/data", data)
  compute(data, 2)
task, recap
● Simple abstraction for CPU and device kernels
● Work stealing instead of explicit scheduling
  ○ No GPU hoarding
  ○ Better work balancing
  ○ Dynamic placement, HA
● Device-to-device chaining (see the sketch below)
  ○ Data placed directly in device memory
  ○ Efficient pipelines, even for very short tasks
  ○ E.g. model parallelism, low-latency inference
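As an illustration of chaining, a hypothetical two-stage pipeline; how bm actually composes tasks device-to-device is an assumption here, not something shown in the slides.

  @bm.task
  def resize(image):
      # e.g. a CUDA resize kernel
      return image

  @bm.task
  def infer(image):
      # e.g. a PyTorch graph
      return image

  images = bm.list()
  bm.mmap("/images", images)

  # Intermediate results can stay in device memory and feed the next task
  # directly (GPU Direct Async), instead of bouncing through the host.
  predictions = [infer(resize(img)) for img in images]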
Use cases
● Versioning
  ○ Track datasets and models
  ○ Deploy / rollback models
● Typical setup
  ○ Copy before update
  ○ Symlinks as versions to data
  ○ Staging / production environments split
Versioning
● Simplest hardware implementation (sketched below)
  ○ Keep multiple write-ahead logs
  ○ mmap updates
  ○ task queues
● User abstraction?
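A toy sketch of the idea, keeping one write-ahead log per branch; BranchLog and replay are placeholder names, and the real logs would live on the EC-sharded storage described earlier.

  class BranchLog:
      def __init__(self, name):
          self.name = name
          self.entries = []              # append-only (path, update) records

      def append(self, path, update):
          self.entries.append((path, update))

      def replay(self, store):
          # State = replay of the log; rollback = replay of an earlier prefix.
          for path, update in self.entries:
              store[path] = update

  staging = BranchLog("staging")
  staging.append("/test", {"field": 12})

  production = {}
  staging.replay(production)             # promoting a branch = replaying its log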
branch
● Like a git branch
  ○ But any size data
  ○ Simplifies collaboration, experimentation
  ○ Generalized staging / production split
● Simplifies HA
  ○ File system fsync, msync (Very hard! Rajimwale et al. DSN ‘11)
  ○ Replaces transactions, e.g. queues, persistent memory
  ○ Allows duplicate work merge
branch
● C++
  Test* test = mutable_mmap<Test>("/test");
  branch b;
  // Only visible in current branch
  test->set_field(12);
● Similar in Python
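The slide only says the Python side is similar; a guess at what that could look like, with the context-manager form of bm.branch being purely an assumption:

  test = Test()
  bm.mmap("/test", test)

  with bm.branch():      # hypothetical context-manager form
      # Only visible in the current branch
      test.field = 12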
Summary
● mmap, task, and branch simplify hardware acceleration
● Help build pipelines, manage cluster resources, etc.
● Early micro-benchmarks suggest very high performance
Thank You!
Will be open sourced (BSD)
Contact me if interested - cyprien.noel@berkeley.edu
Thanks to our sponsor