

  1. Bare Metal Library: Abstractions for modern hardware (Cyprien Noel)

  2. Plan
     1. Modern Hardware?
     2. New challenges & opportunities
     3. Three use cases
        ○ Current solutions
        ○ Leveraging hardware
        ○ Simple abstraction

  3. Myself
  ● High performance trading systems
    ○ Lock-free algos, distributed systems
  ● H2O
    ○ Distributed CPU machine learning, async SGD
  ● Flickr
    ○ Scaling deep learning on GPU
      ⎼ Multi-GPU Caffe: RDMA, multicast, distributed Hogwild
      ⎼ CaffeOnSpark
  ● UC Berkeley
    ○ NCCL Caffe, GPU cluster tooling
    ○ Bare Metal

  4. Modern Hardware?

  5. Device-to-device networks

  6. Moving from ms software to µs hardware
  ● Number crunching ➔ GPU
  ● FS, block io, virt mem ➔ Pmem
  ● Network stack ➔ RDMA
  ● RAID, replication ➔ Erasure codes
  ● Device mem ➔ Coherent fabrics
  ● And more: video, crypto, etc.

  7. OS abstractions replaced by
  ● CUDA
  ● OFED
  ● Libpmem
  ● DPDK
  ● SPDK
  ● Libfabric
  ● UCX
  ● VMA
  ● More every week...
  More powerful, but also more complex and non-interoperable.

  8. Summary So Far
  ● Big changes coming!
    ○ At least for high-performance applications
  ● The CPU should orchestrate
    ○ Not sit in the critical path
    ○ Device-to-device networks
  ● Retrofitting existing architectures is difficult
    ○ CPU-centric abstractions
    ○ ms software on µs hardware (e.g. 100s of instructions per packet)
    ○ OK in some cases, e.g. VMA (kernel-bypass sockets), but with much lower acceleration, and most features are inexpressible

  9. What do we do?
  ● Start from scratch?
    ○ E.g. Google Fuchsia: no fs, block io, network, etc.
    ○ Very interesting, but future work
  ● Use already-accelerated frameworks?
    ○ E.g. PyTorch, BeeGFS
    ○ Not general purpose, no interop, not device-to-device
  ● Work incrementally from use cases
    ○ Look for the simplest hardware solution
    ○ Hopefully useful abstractions will emerge

  10. Use cases
  ● Build datasets
    ○ Add, update elements
    ○ Apply functions to sets, map-reduce
    ○ Data versioning
  ● Training & inference
    ○ Compute graphs, pipelines
    ○ Deployment
    ○ Model versioning

  11. Datasets
  ● Typical solution
    ○ Protobuf messages
    ○ KV store
    ○ Distributed file system
  ● Limitations
    ○ Serialization granularity (x12)
    ○ Copies: kv log, kernel1, replication, kernel2, fs
    ○ Remote CPU involved, stragglers
    ○ Cannot place data in device memory

  12. Datasets: EC shard
  ● Simplest hardware implementation (see the sketch below)
    ○ Write protobuf in an arena, like Flatbuffers
    ○ Pick an offset on disks, e.g. a namespace
    ○ Call ibv_exp_ec_encode_async
  ● Comments
    ○ Management, coordination, crash resiliency (x12)
    ○ Thin wrapper over HW: line-rate perf.
  ● User abstraction?
    ○ Simple, familiar
    ○ Efficient, device friendly
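
A rough C++ sketch of that write path, for illustration only. Arena, ec_encode_async, and the shard counts are names invented here; ec_encode_async merely stands in for the Mellanox experimental verbs call ibv_exp_ec_encode_async, whose real setup (ibv_exp_ec_calc, registered memory regions, SGE lists) is omitted.

    // Illustrative only: serialize into a flat arena, pick a disk offset,
    // then hand the buffer to the NIC's erasure-coding offload.
    #include <cstdint>
    #include <string>
    #include <vector>

    constexpr int DATA_SHARDS = 10, CODE_SHARDS = 2;  // e.g. 10+2 Reed-Solomon

    struct Arena {                  // flat, pointer-free, RDMA-registerable
      std::vector<uint8_t> buf;
      size_t append(const void* p, size_t n) {
        size_t off = buf.size();
        auto b = static_cast<const uint8_t*>(p);
        buf.insert(buf.end(), b, b + n);
        return off;
      }
    };

    // Stand-in for ibv_exp_ec_encode_async: on real hardware this posts the
    // data blocks plus computed parity blocks and completes asynchronously.
    void ec_encode_async(const Arena&, uint64_t disk_offset,
                         void (*on_done)(uint64_t)) {
      on_done(disk_offset);         // pretend immediate completion
    }

    int main() {
      Arena arena;
      std::string record = "arena-serialized protobuf bytes";
      arena.append(record.data(), record.size());
      uint64_t offset = 0;          // from a per-namespace offset allocator
      ec_encode_async(arena, offset, [](uint64_t) { /* durable at offset */ });
    }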

  13. mmap
  ● Extension to classic mmap
    ○ Distributed
    ○ Typed: Protobuf, other formats planned
  ● Protobuf is amazing
    ○ Forward and backward compatible
    ○ Lattice

  14. mmap
  ● C++
      const Test& test = mmap<Test>("/test");
      int i = test.field();
  ● Python
      test = Test()
      bm.mmap("/test", test)
      i = test.field()

  15. mmap, recap
  ● Simple abstraction for data storage
  ● Fully accelerated, “mechanically friendly”
    ○ Thin wrapper over HW, device-to-device, zero copy
    ○ ~1.5x replication factor
    ○ Network automatically balanced
    ○ Solves the straggler problem
    ○ No memory pinning or TLB thrashing, NUMA aware

  16. Use cases
  ● Compute
    ○ Map-reduce, compute graphs, pipelines
  ● Typical setup
    ○ Spark, DL frameworks
    ○ Distribution using Akka, gRPC, MPI
    ○ Kubernetes or SLURM scheduling
  ● Limitations
    ○ No interop
    ○ Placement difficult
    ○ Inefficient resource allocation

  17. Compute
  ● Simplest hardware implementation
    ○ Define a task, e.g. image resize, CUDA kernel, PyTorch graph
    ○ Place tasks in a queue
    ○ Work stealing via RDMA atomics (see the sketch below)
    ○ Device-to-device chaining via GPU Direct Async
  ● User abstraction?
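
A minimal sketch of such a work-stealing queue, with std::atomic standing in for RDMA atomics: on real hardware the head/tail counters would live in RDMA-registered memory and be advanced with remote fetch-and-add / compare-and-swap, so an idle node can steal without involving the owner's CPU. StealQueue, Task, and all names here are hypothetical.

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <optional>

    struct Task { uint64_t id; };                 // opaque task descriptor

    template <size_t N>
    struct StealQueue {
      std::array<Task, N> slots;
      std::atomic<uint64_t> head{0}, tail{0};     // RDMA-visible counters

      bool push(Task t) {                         // owner appends locally
        uint64_t tl = tail.load(std::memory_order_relaxed);
        if (tl - head.load(std::memory_order_acquire) == N) return false;
        slots[tl % N] = t;
        tail.store(tl + 1, std::memory_order_release);
        return true;
      }

      // Any node claims work by atomically bumping head; with RDMA this is
      // a remote compare-and-swap, keeping the victim CPU out of the path.
      std::optional<Task> steal() {
        uint64_t hd = head.load(std::memory_order_acquire);
        while (hd < tail.load(std::memory_order_acquire)) {
          Task t = slots[hd % N];                 // read before claiming
          if (head.compare_exchange_weak(hd, hd + 1,
                                         std::memory_order_acq_rel))
            return t;                             // slot hd is now ours
        }
        return std::nullopt;                      // queue drained
      }
    };

    int main() {
      StealQueue<1024> q;
      q.push({1});
      auto t = q.steal();   // in practice, issued remotely by an idle node
      (void)t;
    }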

  18. task
  ● Python
      @bm.task
      def compute(x, y):
          return x * y

      # Runs locally
      compute(1, 2)

      # Might be rebalanced on cluster
      data = bm.list()
      bm.mmap("/data", data)
      compute(data, 2)

  19. task, recap
  ● Simple abstraction for CPU and device kernels
  ● Work stealing instead of explicit scheduling
    ○ No GPU hoarding
    ○ Better work balancing
    ○ Dynamic placement, HA
  ● Device-to-device chaining
    ○ Data placed directly in device memory
    ○ Efficient pipelines, even for very short tasks
    ○ E.g. model parallelism, low-latency inference

  20. Use cases
  ● Versioning
    ○ Track datasets and models
    ○ Deploy / roll back models
  ● Typical setup
    ○ Copy before update
    ○ Symlinks as version pointers to data
    ○ Staging / production environment split

  21. Versioning
  ● Simplest hardware implementation (sketched below)
    ○ Keep multiple write-ahead logs
    ○ mmap updates
    ○ task queues
  ● User abstraction?
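
One way to picture the per-branch write-ahead logs is the sketch below. Branch, Store, and LogEntry are hypothetical names; a real implementation would EC-encode the logs to disk (as in the dataset path) rather than keep them in a std::vector, and forking would share the log prefix instead of copying it.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct LogEntry { uint64_t offset; std::string bytes; };  // one mmap update

    struct Branch {
      std::vector<LogEntry> wal;            // append-only: replay = state
      void update(uint64_t off, std::string data) {
        wal.push_back({off, std::move(data)});  // visible in this branch only
      }
    };

    struct Store {
      std::map<std::string, Branch> branches;
      Branch& fork(const std::string& from, const std::string& name) {
        branches[name] = branches[from];    // cheap in reality: shared prefix
        return branches[name];
      }
    };

    int main() {
      Store s;
      s.branches["main"];                   // base branch
      Branch& exp = s.fork("main", "experiment");
      exp.update(0, "new model weights");   // main is untouched
    }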

  22. branch
  ● Like a git branch
    ○ But for data of any size
    ○ Simplifies collaboration, experimentation
    ○ Generalizes the staging / production split
  ● Simplifies HA
    ○ Replaces file system fsync, msync (very hard to get right: Rajimwale et al., DSN ’11)
    ○ Replaces transactions, e.g. for queues, persistent memory
    ○ Allows merging duplicate work

  23. branch
  ● C++
      Test* test = mutable_mmap<Test>("/test");
      branch b;
      // Only visible in current branch
      test->set_field(12);
  ● Similar in Python

  24. Summary
  ● mmap, task, and branch simplify hardware acceleration
  ● They help build pipelines, manage cluster resources, etc.
  ● Early micro-benchmarks suggest very high performance

  25. Thank You!
  ● Will be open sourced under the BSD license
  ● Contact me if interested: cyprien.noel@berkeley.edu
  ● Thanks to our sponsor
