Managing NVM in The Machine
Rocky Craig, Master Linux Technologist
Linux Foundation Vault 2016
The Machine Project from Hewlett Packard Enterprise
● Massive SoC pool
● Photonic fabric
● Massive memory pool
http://www.labs.hpe.com/research/themachine/ ("The Machine: A New Kind of Computer")
Memory-Centric Computing: "No IO" from NVM persistence

// Give me some space in a way I can find it again tomorrow
int *vaddr = TheMachineVoodoo(...., identifier, ….., size, ….);

// Use it
*vaddr = 42;

// Don't lose it
exit(0);
The NVM Fabric of The Machine

[Diagram: SoCs with local DRAM attach through Fabric Bridges to a Fabric Switch; the switch connects all nodes to the shared pool of NVM.]
Hardware Point of View for Fabric-Attached Memory (FAM)
● Basic unit of SoC HW memory access is still the page
  – Looks like DRAM, smells like DRAM...
  – But it's not identified as DRAM
● Basic unit of NVM access granularity is the 8 GB "book"
  – A collection of pages
  – 4 TB per node == 512 books; goal of 80 nodes
● Memory-mapping operations provide direct load/store access
  – FAM on the same node as the SoC doing the load/store is cache-coherent
  – FAM on a different node is not cache-coherent
Hardware Platform Basics

[Diagram: Node 1 through Node N, each running Linux on an SoC with a Fabric Bridge and local NVM; the bridges connect through the fabric switches.]
Single Load/Store Domain

[Diagram: each SoC has 256 GB of DRAM and reaches 1-4 TB of Fabric-Attached Memory through its Fabric Bridge; together the nodes form a single load/store domain.]
TheMachineVoodoo(): rough consensus and running code
● Provide a new file system for FAM allocation
● File system daemon
  – Runs on each node
  – File system API under a mount point, typically "/lfs"
  – Communicates with the metadata server over SoC Ethernet
  – Provides access to FAM books for applications on the SoC
● Librarian
  – Runs on the Top of Rack Management Server (ToRMS)
  – FS metadata ("shelves" and attributes) managed in an SQL database
  – Never sees actual book contents in FAM
Memory-Centric Computing under LFS

fd = open("/lfs/bigone", O_CREAT | O_RDWR, 0666);
ftruncate(fd, 10 * TB);
int *vaddr = mmap(NULL, 10 * TB, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);
*vaddr = 42;
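For completeness, the same pattern as a hedged, compilable sketch with error handling; the shelf name and 16 GB size are illustrative, and msync() stands in here for the FAM-aware cache flushing discussed later with libpmem.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 16ULL << 30;          /* 16 GB: two 8 GB books */

    /* Create (or reopen) a shelf under the LFS mount point */
    int fd = open("/lfs/bigone", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("open"); return 1; }

    /* Size the shelf; LFS allocates whole books behind the scenes */
    if (ftruncate(fd, size) < 0) { perror("ftruncate"); return 1; }

    /* Map it for direct load/store access */
    int *vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (vaddr == MAP_FAILED) { perror("mmap"); return 1; }

    *vaddr = 42;                              /* plain store, no write() */

    /* Generic persistence hint; FAM-aware flushing comes via libpmem */
    msync(vaddr, sysconf(_SC_PAGESIZE), MS_SYNC);

    munmap(vaddr, size);
    close(fd);
    return 0;
}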
Possible usage pattern
● open(.....)
● truncate(1 or 2 books)
● mmap() and use "briefly"
● read() or write() mixed in
● truncate(up or down) a lot
● close()
● copy it, unlink it, save it for later...
● open(....)
● truncate(1 or 2 books)
● lather, rinse, repeat, especially across SoCs
Expected use patterns (contrast with the possible pattern on the previous slide)
● open()
● truncate(thousands of books)
● mmap() sections across many cores/SoCs (see the sketch after this list)
● Run until solution convergence
● Sporadically, truncate(increase size)
● close()
● unlink()
● Implications:
  – Solution architectures need re-thinking
  – It's not only about persistence
  – File-system performance is not critical
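A minimal sketch of that expected pattern; the worker rank passed on the command line, the shelf name, and the one-book-per-worker section size are all invented for illustration.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define BOOK (8ULL << 30)                    /* 8 GB book */

int main(int argc, char **argv)
{
    int    rank       = (argc > 1) ? atoi(argv[1]) : 0;  /* hypothetical worker id */
    size_t shelf_size = 4096 * BOOK;                      /* thousands of books */
    size_t section    = BOOK;                             /* one book per worker */
    off_t  offset     = (off_t)rank * section;

    int fd = open("/lfs/solver_state", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("open"); return 1; }

    /* For simplicity every worker sizes the shelf to the same length */
    if (ftruncate(fd, shelf_size) < 0) { perror("ftruncate"); return 1; }

    double *part = mmap(NULL, section, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, offset);
    if (part == MAP_FAILED) { perror("mmap"); return 1; }

    /* Run until solution convergence (placeholder loop) */
    for (int iter = 0; iter < 1000; iter++)
        part[iter % (section / sizeof(double))] += 1.0;

    munmap(part, section);
    close(fd);                               /* shelf persists in FAM */
    return 0;
}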
NUMA and cache coherency

[Diagram: same fabric topology as before; FAM behind a node's own Fabric Bridge is cache-coherent for that node's SoCs, while FAM reached across the Fabric Switch is not.]
LFS POSIX Extended File Attributes

$ touch /lfs/myshelf
$ getfattr -d /lfs/myshelf
getfattr: Removing leading '/' from absolute path names
# file: lfs/myshelf
user.LFS.AllocationPolicy="RandomBooks"
user.LFS.AllocationPolicyList="RandomBooks,LocalNode,Nearest,...."
user.LFS.<other stuff but you get the idea>

$ truncate -s 40G /lfs/myshelf
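The same attributes can be driven from a program; a minimal sketch using setxattr(2)/getxattr(2) with the user.LFS.AllocationPolicy name shown above (the LocalNode value is taken from the advertised policy list).

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(void)
{
    const char *shelf = "/lfs/myshelf";

    /* Ask LFS to place this shelf's books on the local node */
    if (setxattr(shelf, "user.LFS.AllocationPolicy",
                 "LocalNode", strlen("LocalNode"), 0) < 0) {
        perror("setxattr");
        return 1;
    }

    /* Read the policy back to confirm */
    char value[64] = "";
    ssize_t n = getxattr(shelf, "user.LFS.AllocationPolicy",
                         value, sizeof(value) - 1);
    if (n < 0) { perror("getxattr"); return 1; }
    value[n] = '\0';
    printf("AllocationPolicy = %s\n", value);
    return 0;
}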
Librarian and Librarian File System

[Diagram: on each SoC, myprocess calls the FS API; system calls pass through VFS and fuse.ko to /dev/fuse and up to lfs_fuse.py (via libfuse.so and fuse.py), which exposes files under /lfs; lfs_fuse.py talks over Ethernet to librarian.py on the ToRMS, which keeps books and shelves in an SQL database.]

The database is initialized with the book layout and the topology of all nodes / enclosures / racks. During runtime it tracks shelves, usage, and attributes.

Where's the beef?
Oh, this one again

[Diagram: the same hardware platform picture; Linux on each node's SoC, Fabric Bridges into the fabric switches, NVM behind each bridge.]
Developing without hardware

[Diagram: Encapsulation 1 through Encapsulation N, each running lfs_fuse.py over its own physical memory; all connect over a LAN to librarian.py, and sharing of FAM is emulated.]
Early LFS development: self-hosted

[Diagram: the same per-node FUSE stack, but librarian.py, its SQL database, and multiple lfs_fuse.py instances all run on localhost, with a shadow file standing in for global NVM.]

$ vi smalltm.ini            # node count, book size, book total
$ create_db.py smalltm.ini smalltm.db
$ librarian.py …. --db_file=smalltm.db
$ truncate -s 16G /tmp/GlobalNVM
$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs1 1
$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs2 2
  :
  :
Address Translations

[Diagram: translation path from SoC virtual addresses to fabric addresses]
● VA: 48 bits (256 TB); ARM cores translate VA → PA
● PA: 44-48 bits (16-256 TB)
● SoC Fabric Bridge: 14.9 TB of apertures; ~1900 book descriptors translate PA → LA (worst case)
● Book firewall in front of the fabric requester; cores share a coherent interconnect with PCI, etc.
● "Book space" (LA): 53 bits (8 PB)
● Fabric space: 75 bits (32 ZB)
● DRAM: max 1 TB per SoC
Page and book faults

fd = open("/lfs/bigone", ...)
● Passthrough to lfs_fuse::open()
● lfs_fuse converses with the Librarian: create a new shelf
● lfs_fuse returns a file descriptor for VFS

ftruncate(fd, 20 * TB);
● Passthrough to lfs_fuse::ftruncate()
● Requests keyed on fd
● lfs_fuse converses with the Librarian: allocate books (LA)

int *vaddr = mmap(..., fd, ...);
● Stays in the kernel (FUSE hook)
● Allocate a VMA
● LFS changes: set up caching structures to assist faulting

*vaddr = 42;
● Starts in the kernel LFS page fault handler
● If this is the first fault in a book:
  – Overload getxattr() into lfs_fuse
  – lfs_fuse converses with the Librarian: get book LA info
  – Kernel caches the book LA
● Get book LA info from the cache
● Select and program an unused descriptor
● Map with vma_insert_pfn()
Librarian File System – Data in FAM

[Diagram: the same stack on real hardware; myprocess uses the FS API, system calls flow through VFS and tm-fuse.ko to /dev/fuse and up to lfs_fuse.py (via tm-libfuse.so and tm-fuse.py), which talks over Ethernet to librarian.py and its SQL database on the ToRMS; below the kernel sits the Fabric Bridge FPGA hardware.]
Descriptors are in short supply

*(vaddr + 1G) = 43;
● Starts in the kernel LFS page fault handler
● If this is the first fault in a book:
  – Overload the getxattr hook to lfs_fuse
  – lfs_fuse converses with the Librarian: get book LA info
  – Kernel caches the book LA
● Get book LA info from the cache
● Reuse a previous descriptor/aperture as the address base
● Map with vma_insert_pfn()

Lather, rinse, repeat (touch enough space to use all descriptors)

*onetoomany = 43;
● Need to reclaim a descriptor
● Select an LRU candidate
● For all VMAs mapped into that descriptor (book LA):
  – Flush caches
  – zap_vma_pte()
● Reprogram the selected descriptor with the LA, vma_insert_pfn()
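A minimal userspace sketch of the reclaim idea only, not the actual driver: a fixed pool of descriptors, an LRU pick when the pool is exhausted, and a reprogram() stub standing in for the flush caches / zap_vma_pte() / reprogram-hardware sequence. All names here are hypothetical.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_DESCRIPTORS 8              /* tiny stand-in for ~1900 apertures */

struct descriptor {
    uint64_t book_la;                  /* book LA currently programmed, 0 = unused */
    uint64_t last_used;                /* LRU timestamp */
};

static struct descriptor pool[NUM_DESCRIPTORS];
static uint64_t clock_tick;

/* Stand-in for: flush caches, zap_vma_pte() on every mapping, reprogram HW */
static void reprogram(struct descriptor *d, uint64_t book_la)
{
    if (d->book_la)
        printf("evict LA 0x%" PRIx64 ", ", d->book_la);
    printf("program LA 0x%" PRIx64 "\n", book_la);
    d->book_la = book_la;
}

/* Return a descriptor covering book_la, reclaiming an LRU one if needed */
static struct descriptor *get_descriptor(uint64_t book_la)
{
    struct descriptor *victim = &pool[0];

    /* Hit: the book is already programmed into some descriptor */
    for (int i = 0; i < NUM_DESCRIPTORS; i++) {
        if (pool[i].book_la == book_la) {
            pool[i].last_used = ++clock_tick;
            return &pool[i];
        }
    }

    /* Miss: prefer an unused descriptor, otherwise the LRU one */
    for (int i = 0; i < NUM_DESCRIPTORS; i++) {
        if (pool[i].book_la == 0) { victim = &pool[i]; break; }
        if (pool[i].last_used < victim->last_used)
            victim = &pool[i];
    }
    reprogram(victim, book_la);
    victim->last_used = ++clock_tick;
    return victim;
}

int main(void)
{
    /* Touch more books than there are descriptors to force reclaim */
    for (uint64_t book = 1; book <= NUM_DESCRIPTORS + 2; book++)
        get_descriptor(book << 33);    /* 8 GB-aligned book LAs */
    return 0;
}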
LFS & Driver Development on QEMU and IVSHMEM

[Diagram: several QEMU guests act as nodes, each running lfs_fuse.py; a modified Nahanni IVSHMEM server manages a file used as the backing store for global FAM; guest-private IVSHMEM regions emulate the bridge resource space (apertures); librarian.py serves all guests.]
Platforms and environments

[Diagram: three environments side by side: Fabric-Attached Memory Emulation (Develop), The Machine Architectural Simulator (Validate), and The Machine itself; each runs the same stack of Application, POSIX APIs plus new APIs, LFS, Drivers, and Librarian; the simulator and the real machine add the firmware and hardware layers.]
libpmem
● Part of http://pmem.io/nvml/
● API for controlling data persistence
  – Flushing SoC caches
  – Clearing memory controller buffers
● Accelerated APIs for persistent data movement
  – Non-temporal copies
  – Bypass SoC caches
● Additions for The Machine
  – APIs for invalidating SoC caches
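A minimal sketch using the stock NVML calls pmem_persist() and pmem_memcpy_persist() against an LFS shelf; the /lfs/journal name is illustrative, and The Machine-specific cache-invalidation additions are not shown. Build with -lpmem.

#include <fcntl.h>
#include <libpmem.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 8ULL << 30;                 /* one 8 GB book */

    int fd = open("/lfs/journal", O_CREAT | O_RDWR, 0666);
    if (fd < 0 || ftruncate(fd, len) < 0) { perror("shelf"); return 1; }

    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Plain stores, then an explicit flush of SoC caches toward the media */
    strcpy(base, "committed record");
    pmem_persist(base, strlen(base) + 1);

    /* Non-temporal copy: moves the data and makes it persistent while
     * bypassing the SoC caches */
    pmem_memcpy_persist(base + 4096, "second record", sizeof("second record"));

    munmap(base, len);
    close(fd);
    return 0;
}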
Fabric-Attached Memory Atomics
● Native SoC atomic instructions are cache-dependent
  – Do not work between nodes
● Bridge and switch hardware includes fabric-native atomic operations
● Proprietary fam-atomic library provides the API
  – Atomic read/write, compare/exchange, add, bitwise and/or
  – Cross-node spin locks
  – Depends on LFS for VA → PA → FA translations
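To illustrate why such a library is needed, here is a sketch of a cross-node spinlock built on a fabric-native compare-and-swap. fam_cas64() and fam_write64() are hypothetical stand-ins, not the actual fam-atomic API, and the stubs below use GCC builtins that are only coherent within one node; the point of the real library is that the bridge/switch hardware performs the operation so it works across nodes.

#include <sched.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for fabric-native atomics (node-local stubs) */
static uint64_t fam_cas64(uint64_t *fam_addr, uint64_t expect, uint64_t newval)
{
    __atomic_compare_exchange_n(fam_addr, &expect, newval, 0,
                                __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
    return expect;                      /* value observed at fam_addr */
}

static void fam_write64(uint64_t *fam_addr, uint64_t newval)
{
    __atomic_store_n(fam_addr, newval, __ATOMIC_RELEASE);
}

/* Cross-node spinlock: the lock word lives in a FAM-backed mapping
 * (e.g. a shelf under /lfs) shared by every node. */
static void fam_spin_lock(uint64_t *lock)
{
    while (fam_cas64(lock, 0, 1) != 0)
        sched_yield();                  /* back off while another node holds it */
}

static void fam_spin_unlock(uint64_t *lock)
{
    fam_write64(lock, 0);               /* fabric-side store, visible to all nodes */
}

int main(void)
{
    static uint64_t lock;               /* would be FAM-resident in practice */
    fam_spin_lock(&lock);
    puts("critical section");
    fam_spin_unlock(&lock);
    return 0;
}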
LFS native block devices
● Legacy applications or frameworks that need a block device
  – File-system dependent (ext4)
  – Ceph
● Triggered via mknod
● Simplifications for proof-of-concept
  – Plagiarize drivers/nvdimm/pmem.c
  – Avoid cache complications: node-local only
  – Lock the descriptors
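A hedged sketch of the mknod trigger; the device path and the major/minor numbers are invented for illustration, since the slide only says the block device is triggered via mknod and the real values come from whatever the LFS block driver registered.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

int main(void)
{
    /* Hypothetical major/minor; substitute the driver's registered values */
    dev_t dev = makedev(260, 0);

    if (mknod("/dev/lfs_block0", S_IFBLK | 0660, dev) < 0) {
        perror("mknod");
        return 1;
    }
    /* The node can now back an ext4 file system or a Ceph store */
    return 0;
}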
The Future
● Short-term
  – Full integration into the management infrastructure of The Machine
  – Frameworks / Middleware / Demos / Applications / Stress testing
  – Optimizations (e.g., huge pages)
  – Learn, learn, learn
● And beyond
  – More capable or specialized SoCs
  – Deeper integration of the fabric
  – Enablement of NVM technologies at production scale
  – Harden proven software (e.g., replace FUSE with a "real" file system)
  – True concurrent file system
  – Eliminate the separate ToRMS server
  – ???????