Managing NVM in The Machine
Rocky Craig, Master Linux Technologist
Linux Foundation Vault 2016
The Machine Project from Hewlett Packard Enterprise
● Massive SoC pool
● Photonic fabric
● Massive memory pool
http://www.labs.hpe.com/research/themachine/ ("The Machine: A New Kind of Computer")
Memory-Centric Computing: "No IO" from NVM persistence

// Give me some space in a way I can find it again tomorrow
int *vaddr = TheMachineVoodoo(...., identifier, ….., size, ….);

// Use it
*vaddr = 42;

// Don't lose it
exit(0);
The NVM Fabric of The Machine

[Diagram: SoCs with local DRAM attach through Fabric Bridges to a Fabric Switch; the switch connects all nodes to the shared pool of NVM.]
Hardware Point of View for Fabric-Attached Memory (FAM)
● Basic unit of SoC HW memory access is still the page
  – Looks like DRAM, smells like DRAM...
  – But it's not identified as DRAM
● Basic unit of NVM access granularity is the 8 GB "book"
  – A collection of pages
  – 4 TB per node == 512 books; goal of 80 nodes
● Memory-mapping operations provide direct load/store access
  – FAM on the same node as the SoC doing the load/store is cache-coherent
  – FAM on a different node is not cache-coherent
Hardware Platform Basics

[Diagram: Node 1 through Node N, each running Linux on an SoC with a Fabric Bridge and local NVM; the bridges connect through the fabric switches.]
Single Load/Store Domain

[Diagram: each SoC has 256 GB of DRAM and reaches 1-4 TB of Fabric-Attached Memory through its Fabric Bridge; together the nodes form a single load/store domain.]
TheMachineVoodoo(): rough consensus and running code
● Provide a new file system for FAM allocation
● File system daemon
  – Runs on each node
  – File system API under a mount point, typically "/lfs"
  – Communicates with the metadata server over SoC Ethernet
  – Provides access to FAM books for applications on the SoC
● Librarian
  – Runs on the Top of Rack Management Server (ToRMS)
  – FS metadata ("shelves" and attributes) managed in an SQL database
  – Never sees actual book contents in FAM
Memory-Centric Computing under LFS

fd = open("/lfs/bigone", O_CREAT | O_RDWR, 0666);
ftruncate(fd, 10 * TB);
int *vaddr = mmap(NULL, 10 * TB, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);
*vaddr = 42;
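For completeness, the same pattern as a hedged, compilable sketch with error handling; the shelf name and 16 GB size are illustrative, and msync() stands in here for the FAM-aware cache flushing discussed later with libpmem.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 16ULL << 30;          /* 16 GB: two 8 GB books */

    /* Create (or reopen) a shelf under the LFS mount point */
    int fd = open("/lfs/bigone", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("open"); return 1; }

    /* Size the shelf; LFS allocates whole books behind the scenes */
    if (ftruncate(fd, size) < 0) { perror("ftruncate"); return 1; }

    /* Map it for direct load/store access */
    int *vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (vaddr == MAP_FAILED) { perror("mmap"); return 1; }

    *vaddr = 42;                              /* plain store, no write() */

    /* Generic persistence hint; FAM-aware flushing comes via libpmem */
    msync(vaddr, sysconf(_SC_PAGESIZE), MS_SYNC);

    munmap(vaddr, size);
    close(fd);
    return 0;
}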
Possible usage pattern
● open(.....)
● truncate(1 or 2 books)
● mmap() and use "briefly"
● read() or write() mixed in
● truncate(up or down) a lot
● close()
● copy it, unlink it, save it for later...
● open(....)
● truncate(1 or 2 books)
● lather, rinse, repeat, especially across SoCs
Expected use patterns (contrast with the possible pattern on the previous slide)
● open()
● truncate(thousands of books)
● mmap() sections across many cores/SoCs (see the sketch after this list)
● Run until solution convergence
● Sporadically, truncate(increase size)
● close()
● unlink()
● Implications:
  – Solution architectures need re-thinking
  – It's not only about persistence
  – File-system performance is not critical
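A minimal sketch of that expected pattern; the worker rank passed on the command line, the shelf name, and the one-book-per-worker section size are all invented for illustration.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define BOOK (8ULL << 30)                    /* 8 GB book */

int main(int argc, char **argv)
{
    int    rank       = (argc > 1) ? atoi(argv[1]) : 0;  /* hypothetical worker id */
    size_t shelf_size = 4096 * BOOK;                      /* thousands of books */
    size_t section    = BOOK;                             /* one book per worker */
    off_t  offset     = (off_t)rank * section;

    int fd = open("/lfs/solver_state", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("open"); return 1; }

    /* For simplicity every worker sizes the shelf to the same length */
    if (ftruncate(fd, shelf_size) < 0) { perror("ftruncate"); return 1; }

    double *part = mmap(NULL, section, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, offset);
    if (part == MAP_FAILED) { perror("mmap"); return 1; }

    /* Run until solution convergence (placeholder loop) */
    for (int iter = 0; iter < 1000; iter++)
        part[iter % (section / sizeof(double))] += 1.0;

    munmap(part, section);
    close(fd);                               /* shelf persists in FAM */
    return 0;
}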
NUMA and cache coherency

[Diagram: same fabric topology as before; FAM behind a node's own Fabric Bridge is cache-coherent for that node's SoCs, while FAM reached across the Fabric Switch is not.]
LFS POSIX Extended File Attributes

$ touch /lfs/myshelf
$ getfattr -d /lfs/myshelf
getfattr: Removing leading '/' from absolute path names
# file: lfs/myshelf
user.LFS.AllocationPolicy="RandomBooks"
user.LFS.AllocationPolicyList="RandomBooks,LocalNode,Nearest,...."
user.LFS.<other stuff but you get the idea>

$ truncate -s 40G /lfs/myshelf
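The same attributes can be driven from a program; a minimal sketch using setxattr(2)/getxattr(2) with the user.LFS.AllocationPolicy name shown above (the LocalNode value is taken from the advertised policy list).

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(void)
{
    const char *shelf = "/lfs/myshelf";

    /* Ask LFS to place this shelf's books on the local node */
    if (setxattr(shelf, "user.LFS.AllocationPolicy",
                 "LocalNode", strlen("LocalNode"), 0) < 0) {
        perror("setxattr");
        return 1;
    }

    /* Read the policy back to confirm */
    char value[64] = "";
    ssize_t n = getxattr(shelf, "user.LFS.AllocationPolicy",
                         value, sizeof(value) - 1);
    if (n < 0) { perror("getxattr"); return 1; }
    value[n] = '\0';
    printf("AllocationPolicy = %s\n", value);
    return 0;
}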
Librarian and Librarian File System

[Diagram: on each SoC, myprocess calls the FS API; system calls pass through VFS and fuse.ko to /dev/fuse and up to lfs_fuse.py (via libfuse.so and fuse.py), which exposes files under /lfs; lfs_fuse.py talks over Ethernet to librarian.py on the ToRMS, which keeps books and shelves in an SQL database.]

The database is initialized with the book layout and the topology of all nodes / enclosures / racks. During runtime it tracks shelves, usage, and attributes.

Where's the beef?
Oh, this one again

[Diagram: the same hardware platform picture; Linux on each node's SoC, Fabric Bridges into the fabric switches, NVM behind each bridge.]
Developing without hardware

[Diagram: Encapsulation 1 through Encapsulation N, each running lfs_fuse.py over its own physical memory; all connect over a LAN to librarian.py, and sharing of FAM is emulated.]
Early LFS development: self-hosted

[Diagram: the same per-node FUSE stack, but librarian.py, its SQL database, and multiple lfs_fuse.py instances all run on localhost, with a shadow file standing in for global NVM.]

$ vi smalltm.ini            # node count, book size, book total
$ create_db.py smalltm.ini smalltm.db
$ librarian.py …. --db_file=smalltm.db
$ truncate -s 16G /tmp/GlobalNVM
$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs1 1
$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs2 2
  :
  :
Address Translations

[Diagram: translation path from SoC virtual addresses to fabric addresses]
● VA: 48 bits (256 TB); ARM cores translate VA → PA
● PA: 44-48 bits (16-256 TB)
● SoC Fabric Bridge: 14.9 TB of apertures; ~1900 book descriptors translate PA → LA (worst case)
● Book firewall in front of the fabric requester; cores share a coherent interconnect with PCI, etc.
● "Book space" (LA): 53 bits (8 PB)
● Fabric space: 75 bits (32 ZB)
● DRAM: max 1 TB per SoC
Page and book faults

fd = open("/lfs/bigone", ...)
● Passthrough to lfs_fuse::open()
● lfs_fuse converses with the Librarian: create a new shelf
● lfs_fuse returns a file descriptor for VFS

ftruncate(fd, 20 * TB);
● Passthrough to lfs_fuse::ftruncate()
● Requests keyed on fd
● lfs_fuse converses with the Librarian: allocate books (LA)

int *vaddr = mmap(..., fd, ...);
● Stays in the kernel (FUSE hook)
● Allocate a VMA
● LFS changes: set up caching structures to assist faulting

*vaddr = 42;
● Starts in the kernel LFS page fault handler
● If this is the first fault in a book:
  – Overload getxattr() into lfs_fuse
  – lfs_fuse converses with the Librarian: get book LA info
  – Kernel caches the book LA
● Get book LA info from the cache
● Select and program an unused descriptor
● Map with vma_insert_pfn()
Librarian File System – Data in FAM

[Diagram: the same stack on real hardware; myprocess uses the FS API, system calls flow through VFS and tm-fuse.ko to /dev/fuse and up to lfs_fuse.py (via tm-libfuse.so and tm-fuse.py), which talks over Ethernet to librarian.py and its SQL database on the ToRMS; below the kernel sits the Fabric Bridge FPGA hardware.]
Descriptors are in short supply

*(vaddr + 1G) = 43;
● Starts in the kernel LFS page fault handler
● If this is the first fault in a book:
  – Overload the getxattr hook to lfs_fuse
  – lfs_fuse converses with the Librarian: get book LA info
  – Kernel caches the book LA
● Get book LA info from the cache
● Reuse a previous descriptor/aperture as the address base
● Map with vma_insert_pfn()

Lather, rinse, repeat (touch enough space to use all descriptors)

*onetoomany = 43;
● Need to reclaim a descriptor
● Select an LRU candidate
● For all VMAs mapped into that descriptor (book LA):
  – Flush caches
  – zap_vma_pte()
● Reprogram the selected descriptor with the LA, vma_insert_pfn()
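A minimal userspace sketch of the reclaim idea only, not the actual driver: a fixed pool of descriptors, an LRU pick when the pool is exhausted, and a reprogram() stub standing in for the flush caches / zap_vma_pte() / reprogram-hardware sequence. All names here are hypothetical.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_DESCRIPTORS 8              /* tiny stand-in for ~1900 apertures */

struct descriptor {
    uint64_t book_la;                  /* book LA currently programmed, 0 = unused */
    uint64_t last_used;                /* LRU timestamp */
};

static struct descriptor pool[NUM_DESCRIPTORS];
static uint64_t clock_tick;

/* Stand-in for: flush caches, zap_vma_pte() on every mapping, reprogram HW */
static void reprogram(struct descriptor *d, uint64_t book_la)
{
    if (d->book_la)
        printf("evict LA 0x%" PRIx64 ", ", d->book_la);
    printf("program LA 0x%" PRIx64 "\n", book_la);
    d->book_la = book_la;
}

/* Return a descriptor covering book_la, reclaiming an LRU one if needed */
static struct descriptor *get_descriptor(uint64_t book_la)
{
    struct descriptor *victim = &pool[0];

    /* Hit: the book is already programmed into some descriptor */
    for (int i = 0; i < NUM_DESCRIPTORS; i++) {
        if (pool[i].book_la == book_la) {
            pool[i].last_used = ++clock_tick;
            return &pool[i];
        }
    }

    /* Miss: prefer an unused descriptor, otherwise the LRU one */
    for (int i = 0; i < NUM_DESCRIPTORS; i++) {
        if (pool[i].book_la == 0) { victim = &pool[i]; break; }
        if (pool[i].last_used < victim->last_used)
            victim = &pool[i];
    }
    reprogram(victim, book_la);
    victim->last_used = ++clock_tick;
    return victim;
}

int main(void)
{
    /* Touch more books than there are descriptors to force reclaim */
    for (uint64_t book = 1; book <= NUM_DESCRIPTORS + 2; book++)
        get_descriptor(book << 33);    /* 8 GB-aligned book LAs */
    return 0;
}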
LFS & Driver Development on QEMU and IVSHMEM

[Diagram: several QEMU guests act as nodes, each running lfs_fuse.py; a modified Nahanni IVSHMEM server manages a file used as the backing store for global FAM; guest-private IVSHMEM regions emulate the bridge resource space (apertures); librarian.py serves all guests.]
Platforms and environments

[Diagram: three environments side by side: Fabric-Attached Memory Emulation (Develop), The Machine Architectural Simulator (Validate), and The Machine itself; each runs the same stack of Application, POSIX APIs plus new APIs, LFS, Drivers, and Librarian; the simulator and the real machine add the firmware and hardware layers.]
libpmem
● Part of http://pmem.io/nvml/
● API for controlling data persistence
  – Flushing SoC caches
  – Clearing memory controller buffers
● Accelerated APIs for persistent data movement
  – Non-temporal copies
  – Bypass SoC caches
● Additions for The Machine
  – APIs for invalidating SoC caches
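A minimal sketch using the stock NVML calls pmem_persist() and pmem_memcpy_persist() against an LFS shelf; the /lfs/journal name is illustrative, and The Machine-specific cache-invalidation additions are not shown. Build with -lpmem.

#include <fcntl.h>
#include <libpmem.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 8ULL << 30;                 /* one 8 GB book */

    int fd = open("/lfs/journal", O_CREAT | O_RDWR, 0666);
    if (fd < 0 || ftruncate(fd, len) < 0) { perror("shelf"); return 1; }

    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Plain stores, then an explicit flush of SoC caches toward the media */
    strcpy(base, "committed record");
    pmem_persist(base, strlen(base) + 1);

    /* Non-temporal copy: moves the data and makes it persistent while
     * bypassing the SoC caches */
    pmem_memcpy_persist(base + 4096, "second record", sizeof("second record"));

    munmap(base, len);
    close(fd);
    return 0;
}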
Fabric-Attached Memory Atomics
● Native SoC atomic instructions are cache-dependent
  – Do not work between nodes
● Bridge and switch hardware includes fabric-native atomic operations
● Proprietary fam-atomic library provides the API
  – Atomic read/write, compare/exchange, add, bitwise and/or
  – Cross-node spin locks
  – Depends on LFS for VA → PA → FA translations
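To illustrate why such a library is needed, here is a sketch of a cross-node spinlock built on a fabric-native compare-and-swap. fam_cas64() and fam_write64() are hypothetical stand-ins, not the actual fam-atomic API, and the stubs below use GCC builtins that are only coherent within one node; the point of the real library is that the bridge/switch hardware performs the operation so it works across nodes.

#include <sched.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for fabric-native atomics (node-local stubs) */
static uint64_t fam_cas64(uint64_t *fam_addr, uint64_t expect, uint64_t newval)
{
    __atomic_compare_exchange_n(fam_addr, &expect, newval, 0,
                                __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
    return expect;                      /* value observed at fam_addr */
}

static void fam_write64(uint64_t *fam_addr, uint64_t newval)
{
    __atomic_store_n(fam_addr, newval, __ATOMIC_RELEASE);
}

/* Cross-node spinlock: the lock word lives in a FAM-backed mapping
 * (e.g. a shelf under /lfs) shared by every node. */
static void fam_spin_lock(uint64_t *lock)
{
    while (fam_cas64(lock, 0, 1) != 0)
        sched_yield();                  /* back off while another node holds it */
}

static void fam_spin_unlock(uint64_t *lock)
{
    fam_write64(lock, 0);               /* fabric-side store, visible to all nodes */
}

int main(void)
{
    static uint64_t lock;               /* would be FAM-resident in practice */
    fam_spin_lock(&lock);
    puts("critical section");
    fam_spin_unlock(&lock);
    return 0;
}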
LFS native block devices
● Legacy applications or frameworks that need a block device
  – File-system dependent (ext4)
  – Ceph
● Triggered via mknod
● Simplifications for proof-of-concept
  – Plagiarize drivers/nvdimm/pmem.c
  – Avoid cache complications: node-local only
  – Lock the descriptors
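A hedged sketch of the mknod trigger; the device path and the major/minor numbers are invented for illustration, since the slide only says the block device is triggered via mknod and the real values come from whatever the LFS block driver registered.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

int main(void)
{
    /* Hypothetical major/minor; substitute the driver's registered values */
    dev_t dev = makedev(260, 0);

    if (mknod("/dev/lfs_block0", S_IFBLK | 0660, dev) < 0) {
        perror("mknod");
        return 1;
    }
    /* The node can now back an ext4 file system or a Ceph store */
    return 0;
}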
The Future
● Short-term
  – Full integration into the management infrastructure of The Machine
  – Frameworks / Middleware / Demos / Applications / Stress testing
  – Optimizations (e.g., huge pages)
  – Learn, learn, learn
● And beyond
  – More capable or specialized SoCs
  – Deeper integration of the fabric
  – Enablement of NVM technologies at production scale
  – Harden proven software (e.g., replace FUSE with a "real" file system)
  – True concurrent file system
  – Eliminate the separate ToRMS server
  – ???????