Persistent Memory Architecture Research at UCSC – Workload Characterization and Hardware Support for Persistence
Jishen Zhao (jishen.zhao@ucsc.edu)
Computer Engineering, UC Santa Cruz
July 12, 2016
What is persistent memory?
• NVRAM-based persistent memory: a tier that combines the load/store interface of memory with the durability of storage
[Figure: persistent memory sits at the intersection of memory and storage.]
NVRAM is here … STT-RAM, PCM, ReRAM, NVDIMM, 3D XPoint, etc.
Design Opportunities with NVRAM
[Figure: the CPU accesses DRAM via loads/stores (not persistent) and disk/flash storage via fopen(), fread(), fwrite(), … (persistent); NVRAM-based persistent memory offers load/store access that is also persistent.]
• Allows in-memory data structures to become permanent immediately
• Demonstrated 32x speedup compared with using storage devices [Condit+ SOSP'09, Volos+ ASPLOS'11, Coburn+ ASPLOS'11, Venkataraman+ FAST'11]
Executing Applications in Persistent Memory
[Figure: applications reach persistent memory through the file system via open() and mmap().]
Source: Jeff Moyer, "Persistent Memory in Linux," SNIA NVM Summit, 2016.
Our research – At the software/hardware boundary
[Figure: system stack – Applications / System Software (VM, File System, Database System) / ISA / CPU / DRAM + NVRAM / SSD/HDD.]
• Workload characterization
  • Exploring persistent memory use cases
  • Identifying system bottlenecks
  • Implications for software/hardware design
• System software
  • Efficient fault tolerance and data persistence mechanisms
• Hardware
  • Developing storage accelerators
  • Redefining the boundary between software and hardware
Workload Characterization from a Hardware Perspective
• Motivation
  • Persistent memory is managed by both hardware and software
  • Most prior work profiles only software statistics, e.g., system throughput
• Objectives
  • Help system designers better understand performance bottlenecks
  • Help application designers better utilize persistent memory hardware
• Approach
  • Profile hardware and software counter statistics
  • Instrument application and system software to obtain insights at the microarchitecture level
Hardware and software configurations
• CPU: Intel Xeon E5-2620 v3
• Memory: 12GB of pmem plus 4GB of main memory, partitioned on DRAM via the memmap kernel parameter
• Operating system: Linux with the 4.4.0 kernel
• Profiling tools
  • Linux perf: collects software and hardware counter statistics (example invocation below)
  • Intel Pin 3.0 instrumentation tool with in-house Pintools
• File systems evaluated
  • Ext4: journaling of metadata, running on a RAMDisk
  • Ext4-DAX: journaling of metadata, bypassing the page cache with DAX
  • NOVA: a log-structured file system accelerated for nonvolatile memory [Li+ FAST'16]
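As a concrete illustration of the counter-based profiling, the invocation below collects the TLB, LLC, and page-fault statistics used in the correlation study later in this deck. The event names are standard Linux perf generic events; the workload command line is a placeholder for the actual Filebench/FFSB run.

    perf stat -e dTLB-load-misses,iTLB-load-misses,LLC-load-misses,LLC-store-misses,page-faults \
        ./run_workload.sh   # placeholder for the benchmark command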
About DAX
• What is DAX?
  • "Direct Access"
  • Enables efficient Linux support for persistent memory
  • Allows file system requests to bypass the page cache allocated in DRAM and directly access NVRAM via loads and stores
• How does Ext4-DAX work?
  • DAX maps storage components directly into userspace (see the setup sketch below)
  • *True DAX is not yet supported in Linux – accesses still go through DRAM, i.e., pages are swapped directly between DRAM main memory and NVRAM storage
• Examples of file systems with DAX capability
  • Ext4-DAX, XFS-DAX, Btrfs-DAX → Fedora
  • Intel PMFS
  • NOVA
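For reference, a typical way to carve out the pmem region and create a DAX mount on a 4.4-era kernel is sketched below; the sizes mirror the configuration above, but the device name (/dev/pmem0) and mount point are illustrative and depend on the platform's memory map.

    # Kernel boot parameter: reserve 12GB starting at the 4GB physical
    # offset as emulated persistent memory (exposed as /dev/pmem0)
    memmap=12G!4G

    # Create an ext4 file system on the pmem device and mount it with DAX
    mkfs.ext4 /dev/pmem0
    mount -o dax /dev/pmem0 /mnt/pmem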
Current workloads
• Filebench (a widely used benchmark suite designed for evaluating file system performance)
  • Fileserver, Webproxy, Webserver, Varmail
• FFSB (Flexible Filesystem Benchmark)
  • Configurable read/write ratio and number of threads
• Bonnie
  • Measures file system performance by invoking putc() and getc()
• File compression/decompression: tar/untar, zip/unzip
• TPC-C running on MySQL
  • A database online transaction processing workload
  • Write intensive, with 63.7% writes
• In-house micro-benchmarks
*Applications are compiled with static linking and stored in the NVRAM (pmem) region
[Figure: workload throughput (operations per second) of Fileserver, Webproxy, Webserver, and Varmail under ext4, ext4-DAX, and NOVA (roughly 14,000–21,000 ops/s); execution time in nanoseconds for tar and untar; and TPC-C transactions per ten seconds, each compared across NOVA, EXT4-DAX, and EXT4.]
Correlation between system performance and hardware behavior
[Figure: correlation coefficients between system performance and the dTLB miss rate, iTLB miss rate, LLC load miss rate, LLC store miss rate, and page fault rate for Fileserver, Webproxy, Webserver, Varmail, Zip, Unzip, and FFSB. Several counters are highly correlated with performance (standard error within 8%).]
Throughput vs. Write Intensity
[Figure: FFSB throughput (transactions/s) under ext4, ext4-DAX, and NOVA as the read/write mix varies from R=100%/W=0% to R=0%/W=100%; and Bonnie normalized throughput (read:write = 1:1) across its phases, including putc(), block writes, create, getc(), block reads, random seeks, and rewrite.]
The impact of workload locality
• NVRAM devices may or may not have an on-chip buffer (a first-order model of this trade-off follows)
[Figure: transactions per second under ext4, ext4-DAX, and NOVA as the buffer hit rate in a revised NVRAM model varies from 50% to 90%, compared against a DRAM model and a classic NVM model; and as the buffer size in the revised NVRAM model varies from 4KB down to 256B.]
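As a back-of-the-envelope way to read these curves (my own first-order model, not from the slides): with on-chip buffer hit rate h, the average NVRAM access time is roughly

    t_avg = h * t_buffer + (1 - h) * t_NVRAM

so transaction throughput degrades approximately linearly as the hit rate falls, or as a smaller buffer lowers h, matching the downward trends in both sweeps.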
Our research – At the software/hardware boundary (outline revisited)
• Turning from workload characterization to system software and hardware support: efficient fault tolerance and data persistence mechanisms, storage accelerators, and redefining the boundary between software and hardware
Logging Acceleration (executive summary)
• Problem
  • Traditional software-based logging imposes substantial overhead in persistent memory
    • Even with undo or redo logging alone, let alone the undo+redo logging used in many modern database systems
  • Changes in the software interface add burden on programmers
• Solution
  • Hardware-based logging accelerators
  • Leverage existing hardware information that is otherwise largely wasted
• Results
  • 3.3x performance improvement
  • Simplified software interface
  • Low hardware overhead
Logging (Journaling) in Persistent Memory (Maintaining Atomicity)
[Figure: a tree rooted at Root in NVRAM with children A, B, C, D; updating C and D to C' and D' first writes log entries, then issues a memory barrier, then performs the in-place updates. Each log entry is the size of one store.]
Performance overhead of software logging
Zhao+, "Kiln: Closing the performance gap between systems with and without persistence support," MICRO 2013.
Software interface of software logging
• Memory barriers, strict ordering constraints, and cache flushing are all needed to ensure data persistence (sketch below)
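To make the overhead concrete, here is a minimal sketch of what a software undo-logging store typically looks like on x86, using the clflush/sfence intrinsics. The log layout, transaction ID, and buffer size are illustrative, not the interface of any particular system; the log buffer is assumed to live in the pmem region.

    #include <emmintrin.h>   /* _mm_clflush, _mm_sfence */
    #include <stdint.h>

    typedef struct { uint64_t tx_id; void *addr; uint64_t old_val; } log_entry_t;
    static log_entry_t log_buf[4096];   /* assumed to be allocated in NVRAM */
    static size_t log_tail;

    /* A transactional 64-bit store with software undo logging. */
    void tx_store(uint64_t tx_id, uint64_t *addr, uint64_t new_val) {
        /* 1. Record the old value in the log. */
        log_entry_t *e = &log_buf[log_tail++ % 4096];
        e->tx_id = tx_id; e->addr = addr; e->old_val = *addr;

        /* 2. Flush the log entry and order it before the data update. */
        _mm_clflush(e);
        _mm_sfence();

        /* 3. Only now is it safe to update the data in place... */
        *addr = new_val;

        /* 4. ...and flush it so the update reaches NVRAM. */
        _mm_clflush(addr);
        _mm_sfence();
    }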
Our software interface
• With hardware support for logging, the memory barriers, strict ordering constraints, and cache flushes otherwise needed to ensure data persistence are no longer exposed to the programmer (sketch below)
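By contrast, with the hardware logging accelerator the same update reduces to plain stores inside a transaction. tx_begin and tx_commit below are hypothetical names standing in for the proposed interface, not its exact API.

    #include <stdint.h>

    /* Hypothetical primitives provided by the hardware-logging runtime. */
    extern uint64_t tx_begin(void);
    extern void     tx_commit(uint64_t tx_id);

    void tx_update(uint64_t *c, uint64_t *d, uint64_t c_new, uint64_t d_new) {
        uint64_t tx = tx_begin();   /* open a hardware-logged transaction */
        *c = c_new;                 /* plain stores; no clflush/sfence needed */
        *d = d_new;                 /* hardware captures undo+redo log entries */
        tx_commit(tx);              /* hardware drains the log buffer */
    }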
How does it work? L1 cache hit – a hit gives us everything needed for the undo+redo log
• Writes to persistent memory automatically trigger a write to the log – a software-allocated circular buffer
• Log information includes the TxID, address, undo cache-line value, and redo cache-line value
• The cache hit/miss handling process is leveraged to update the log
• Log updates are buffered in the processor
[Figure: a store A1' hits in the L1 cache; the cached line supplies the undo value A1 and the store supplies the redo value. A log entry (TxID, address, undo and redo cache-line values) is queued in a FIFO log buffer at the memory controllers, bypassing the caches on its way to the circular log in NVRAM; tx_commit completes the transaction.]
How does it work? L1 cache miss – the write-allocate gives us everything needed
• Writes to persistent memory automatically trigger a write to the log – a software-allocated circular buffer
• Log information includes the TxID, address, undo cache-line value, and redo cache-line value (see the record layout sketch below)
• The cache hit/miss handling process is leveraged to update the log
• Log updates are buffered in the processor
[Figure: a store A1' misses in L1; the write-allocate hits in a lower-level cache, which supplies the undo value A1. As before, the log entry is queued in the FIFO log buffer, bypasses the caches, and is written to the circular log in NVRAM before tx_commit.]
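Putting the hit and miss cases together, each log record carries both the old and new images of the updated cache line. A plausible record layout is sketched below; the field widths are my assumptions, not the paper's exact encoding.

    #include <stdint.h>

    #define CACHE_LINE 64

    /* One undo+redo log record as captured by the hardware logging path.
     * The L1-hit path supplies `undo` from the cached line; the miss path
     * supplies it from a lower-level cache during write-allocate. */
    typedef struct {
        uint64_t tx_id;             /* transaction ID */
        uint64_t addr;              /* cache-line-aligned address */
        uint8_t  undo[CACHE_LINE];  /* old cache-line value */
        uint8_t  redo[CACHE_LINE];  /* new cache-line value */
    } hw_log_record_t;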
Force cache writeback when necessary
• The CPU caches must be flushed when
  • A log entry is about to be overwritten by new log updates
  • But the associated data still remains in the CPU caches (sketch below)
[Figure: circular log buffer with head and tail pointers; wrap-around triggers the writeback.]
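A minimal sketch of the wrap-around check, assuming a software-visible model of the circular log; the entry count, the dirty-tracking array, and force_cache_writeback are all illustrative stand-ins for the hardware mechanism.

    #include <stdbool.h>
    #include <stddef.h>

    #define LOG_ENTRIES 4096

    /* Hypothetical flush of the cache line protected by a log entry. */
    extern void force_cache_writeback(size_t entry_idx);

    typedef struct {
        size_t head;                        /* oldest live entry */
        size_t tail;                        /* next slot to write */
        bool   data_in_cache[LOG_ENTRIES];  /* is the logged line still dirty? */
    } circular_log_t;

    void log_append(circular_log_t *log) {
        size_t next = (log->tail + 1) % LOG_ENTRIES;
        if (next == log->head) {            /* about to overwrite the oldest entry */
            if (log->data_in_cache[log->head])
                force_cache_writeback(log->head);
            log->head = (log->head + 1) % LOG_ENTRIES;
        }
        /* ... fill slot log->tail with the new record ... */
        log->tail = next;
    }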
Results
• McSimA+ simulator running
  • Persistent memory micro-benchmarks
  • A real workload – a persistent version of memcached
• System throughput improved by 1.45x–1.60x on average
• Memcached throughput improved by 3.3x
• Memory traffic reduced by 2.36x–3.12x
• Dynamic memory energy improved by 1.53x–1.72x
• Hardware overhead
  • 17 bytes of flip-flops
  • 1 bit of cache tag information per cache line
  • Multiplexers