NUMA and VM Scalability
Mark Johnston
markj@FreeBSD.org
FreeBSD Developer Summit, MeetBSD 2018
October 18, 2018

Non-Uniform Memory Access
Motivation
◮ Scalable multiprocessing
◮ Target commodity systems
Assumptions
◮ CPU caches are coherent
◮ Small number of NUMA domains (usually 2 or 4)
◮ Low NUMA factor (20-50%)
◮ NUMA domains are balanced

OS Goals
◮ Balance resource (memory controller) utilization
◮ Sane default NUMA allocation policies
◮ Allow applications to declare intent
◮ DTRT for static allocations (per-CPU data, DMA, etc.)
◮ Handle local memory shortages gracefully

OS Support
NUMA awareness:
◮ CPU scheduler
◮ cpuset(2)
◮ busdma(9)
◮ Memory allocators: UMA, malloc(9), kmem_malloc(9), kstacks, etc.
SMP scalability:
◮ Page allocator
◮ Page queues
◮ Buffer cache

FreeBSD History
◮ SRAT parser and vm_phys domain awareness
  ◮ r210550, r210552 (2010)
  ◮ First-touch allocation policy, useful with CPU pinning
  ◮ Changed to round-robin in r250601 (2013)
◮ Per-domain page queues
  ◮ r254065 (2013)
◮ projects/numa (2014)
◮ VM_NUMA_ALLOC, numactl(8)
  ◮ r285387 (2015)
  ◮ First attempt at user-configurable policies
  ◮ Included a SLIT parser, currently not used by the kernel

NUMA/Scalability project
◮ 2017/2018, many commits
◮ Work by Jeff Roberson, sponsored by Limelight, Netflix, Isilon
◮ Plumb int domain through various layers
◮ Define NUMA allocation policy abstraction
◮ Provide userland interface for specifying allocation policy
◮ Address VM and buffer cache bottlenecks

domainset(9)
◮ Structure defining a domain selection policy
  ◮ Immutable
◮ Iterator state is defined externally (struct domainset_ref)
  ◮ Contains a pointer to a domainset
  ◮ Embedded in struct thread and vm_object_t
◮ vm_domainset_*() applies a domainset to an iterator
◮ Can restrict to a subset of the system's domains
◮ Some predefined policies can be used
  ◮ DOMAINSET_PREF(1): "Allocate from domain 1 or fall back"
  ◮ DOMAINSET_RR(): Global round-robin

domainset(9) policies
DOMAINSET_POLICY_ROUNDROBIN
◮ Cycles through domains: d = iter++ % ds->ds_cnt
◮ 0, 1, 2, 3, 0, 1, 2, 3, 0, ...
DOMAINSET_POLICY_FIRSTTOUCH
◮ Pick the domain of the current CPU: d = PCPU_GET(domain)
DOMAINSET_POLICY_PREFER
◮ Pick the domain specified in the policy: d = ds->ds_prefer
◮ Fall back to round-robin when free pages are scarce
DOMAINSET_POLICY_INTERLEAVE
◮ Domain is a function of the pindex
◮ Round-robin with a stride, for successive indices
◮ 0, 0, ..., 0, 1, 1, ..., 1, 0, 0, ...
◮ Superpage-friendly: use a stride of 512
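
The four policies above can be sketched as a single selection function. This is a simplified userland illustration, not the kernel's code: struct ds_sketch, pick_domain() and INTERLEAVE_STRIDE are hypothetical names, curdomain stands in for PCPU_GET(domain), and the sketch assumes the set covers domains 0..cnt-1 and omits the fallback that the prefer policy performs when pages are scarce.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel's policy constants. */
enum policy {
	POLICY_ROUNDROBIN,
	POLICY_FIRSTTOUCH,
	POLICY_PREFER,
	POLICY_INTERLEAVE
};

struct ds_sketch {
	enum policy policy;
	int cnt;	/* number of domains in the set (ds_cnt on the slide) */
	int prefer;	/* preferred domain (ds_prefer on the slide) */
};

#define INTERLEAVE_STRIDE 512	/* pages per stride, as suggested on the slide */

/*
 * Pick a domain index. iter is iterator state held outside the policy,
 * curdomain stands in for PCPU_GET(domain), pindex is the page index.
 */
static int
pick_domain(const struct ds_sketch *ds, unsigned *iter, int curdomain,
    uint64_t pindex)
{
	switch (ds->policy) {
	case POLICY_ROUNDROBIN:
		/* 0, 1, 2, 3, 0, 1, ... */
		return ((int)((*iter)++ % (unsigned)ds->cnt));
	case POLICY_FIRSTTOUCH:
		return (curdomain);
	case POLICY_PREFER:
		return (ds->prefer);
	case POLICY_INTERLEAVE:
		/* Same domain for a whole stride of consecutive pindexes. */
		return ((int)((pindex / INTERLEAVE_STRIDE) % (uint64_t)ds->cnt));
	}
	return (0);
}
```

With a stride of 512 pages, every 4 KB page backing one 2 MB superpage maps to the same domain, which is why the slide calls this choice superpage-friendly.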

vm_domainset

    vm_domainset_iter_page_init(&di, obj, pindex, &domain, &flags);
    do {
        m = vm_page_alloc_domain(obj, pindex, domain, flags);
        if (m != NULL)
            break;
    } while (vm_domainset_iter_page(&di, obj, &domain) == 0);
    return (m);
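
The shape of that loop (try the domain chosen by the policy, then move on to another domain until an allocation succeeds or the candidates are exhausted) can be imitated in userland. The sketch below is only an illustration under assumptions: free_pages, alloc_from() and alloc_with_fallback() are hypothetical stand-ins, the fallback order is plain round-robin, and any waiting the real iterator might do is left out.

```c
#define NDOM 4

/* Hypothetical per-domain free page counts. */
static int free_pages[NDOM] = { 0, 0, 3, 1 };

/* Stand-in for a per-domain allocator: 1 on success, 0 if the domain is empty. */
static int
alloc_from(int domain)
{
	if (free_pages[domain] == 0)
		return (0);
	free_pages[domain]--;
	return (1);
}

/*
 * Try the initial domain, then each remaining domain once in round-robin
 * order. Returns the domain that satisfied the request, or -1 if every
 * domain was empty.
 */
static int
alloc_with_fallback(int first)
{
	int domain = first, tried = 0;

	do {
		if (alloc_from(domain))
			return (domain);
		domain = (domain + 1) % NDOM;
	} while (++tried < NDOM);
	return (-1);
}
```

Starting from the table above, a request whose policy picks domain 0 is satisfied from domain 2, the first domain in rotation that still has free pages.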

Userland interface
◮ Domain selection policies integrated into cpuset(1)
  ◮ Each cpuset has an associated struct domainset
  ◮ Allows specification of a policy for a thread, process, jail
  ◮ cpuset -n rr:0,2 make buildworld
  ◮ cpuset -g -s 0
◮ cpuset_getdomain(2), cpuset_setdomain(2)
◮ Userland threads default to first-touch
◮ Domain selection overridden to preserve superpage reservations
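
To make the rr:0,2 argument concrete, here is a hedged sketch of how a policy:domain-list string could be split into a policy name and a bitmask of domains. parse_policy_spec() is a hypothetical helper written for this illustration, not code from cpuset(1); it handles only comma-separated domain numbers and does not interpret the policy name.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Split "policy:domain-list" (e.g. "rr:0,2") into a policy name and a
 * bitmask with one bit set per listed domain. Returns 0 on success and
 * -1 on malformed input. Hypothetical helper; ranges are not handled.
 */
static int
parse_policy_spec(const char *spec, char *name, size_t namelen,
    unsigned long *mask)
{
	const char *colon, *p;
	char *end;
	unsigned long d;
	size_t n;

	colon = strchr(spec, ':');
	if (colon == NULL || colon == spec)
		return (-1);
	n = (size_t)(colon - spec);
	if (n >= namelen)
		return (-1);
	memcpy(name, spec, n);
	name[n] = '\0';

	*mask = 0;
	p = colon + 1;
	for (;;) {
		d = strtoul(p, &end, 10);
		/* Reject empty fields and indices that do not fit the mask. */
		if (end == p || d >= sizeof(*mask) * 8)
			return (-1);
		*mask |= 1UL << d;
		if (*end == '\0')
			return (0);
		if (*end != ',')
			return (-1);
		p = end + 1;
	}
}
```

For "rr:0,2" this yields the name "rr" and a mask with bits 0 and 2 set, which is the information a caller would need before building a domain set for cpuset_setdomain(2).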

Memory allocators (1)
UMA, malloc(9)
◮ No policy at the caching layer (fast path)
◮ Default round-robin policy at the slab layer (zone iterator)
◮ UMA zone policy: UMA_ZONE_NUMA for first-touch
◮ uma_zalloc_domain(9), malloc_domain(9)
kmem_malloc(9) and friends
◮ Round-robin policy (thread iterator)
◮ Multiple vmem(9) arenas provide striping for superpages
busdma(9)
◮ Bus can be queried for domain affinity (_PXM method)
◮ DMA tags cache local domain index
◮ DMA allocations use malloc_domain(9) with local domain

Memory allocators (2)
vm_page_alloc() and friends
◮ Source of user memory allocations (page faults, etc.)
◮ Not always under user control (e.g., libc.so)
◮ Policy specified by VM object (may be absent), or thread
◮ vm_page_alloc_domain()
Kernel stacks
◮ Global round-robin policy (thread iterator)
◮ Kernel stacks are cached
◮ We can do better (e.g., ithread kstacks)

Low memory handling
◮ Each domain has page queues, page daemon, laundry thread
◮ Page domains are mostly independent
◮ Per-domain free page targets, laundry targets
◮ OOM kills occur only when all domains are depleted
  ◮ Does not work well if most of a domain is wired (e.g., by ARC)
◮ vm_wait_doms(): sleep until one of the specified domains has some free pages

Scalability improvements
◮ PID controller for free page target
◮ Split free page mutex and add per-CPU free page cache
◮ Fine-grained reservation locking
◮ Lockless page daemon wakeups and v_free_count updates
◮ Per-CPU v_wire_count accounting
◮ Page queue batching
◮ Lazy dequeue of wired pages
◮ Buffer cache sharding, locking improvements
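
For readers unfamiliar with the first item, a textbook PID controller might drive a free page target roughly as follows. This is a generic sketch under assumptions, not the kernel's implementation: struct pid_sketch, pid_step() and the divisor-style gains are hypothetical, and the output is read as "pages to reclaim this interval".

```c
/* Hypothetical integer PID controller; gains are expressed as divisors. */
struct pid_sketch {
	int setpoint;	/* desired number of free pages */
	int kp_div;	/* proportional gain = 1 / kp_div */
	int ki_div;	/* integral gain = 1 / ki_div */
	int kd_div;	/* derivative gain = 1 / kd_div */
	int integral;	/* accumulated error */
	int prev_error;	/* error seen on the previous step */
};

/*
 * Feed the current free page count and get back the number of pages to
 * reclaim during the next interval (never negative).
 */
static int
pid_step(struct pid_sketch *pc, int free_count)
{
	int error, deriv, out;

	error = pc->setpoint - free_count;	/* positive when short of pages */
	pc->integral += error;
	if (pc->integral < 0)
		pc->integral = 0;		/* crude anti-windup clamp */
	deriv = error - pc->prev_error;
	pc->prev_error = error;

	out = error / pc->kp_div + pc->integral / pc->ki_div +
	    deriv / pc->kd_div;
	return (out > 0 ? out : 0);
}
```

The appeal over a fixed threshold is that the reclaim rate tracks both how far the free count is from the target and how quickly it is moving, so the page daemon can ramp up under sustained pressure and back off as the shortfall shrinks.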

Future Work
NUMA:
◮ Non-x86 support (arm64 and powerpc64)
◮ Statistics collection
◮ libnuma, msetdomain(2)
◮ Static allocations (pcpu(9), kernel thread stacks, etc.)
◮ More affinity plumbing (per-mountpoint policy?)
◮ ZFS integration
◮ taskqueue(9) integration
Scalability:
◮ Split user (mlock(2)) and kernel wired page accounting
◮ Lockless per-page queue state
◮ Lockless vm_page_hold()
◮ Improve PQ_ACTIVE scalability in the page fault handler
