
Persistent Memory Architecture Research at UCSC – Workload Characterization and Hardware Support for Persistence (PowerPoint PPT Presentation)



  1. Persistent Memory Architecture Research at UCSC – Workload Characterization and Hardware Support for Persistence Jishen Zhao jishen.zhao@ucsc.edu Computer Engineering UC Santa Cruz July 12, 2016

  2. What is persistent memory? NVRAM
     • Persistent memory sits between memory and storage: accessed like memory (byte-addressable loads/stores), durable like storage

  3. NVRAM is here … STT-RAM, PCM, ReRAM, NVDIMM, 3D XPoint, etc. (as of 2016)

  4. Design Opportunities with NVRAM
     • Memory (CPU load/store): DRAM is not persistent; NVRAM is persistent memory
     • Storage (fopen(), fread(), fwrite(), …): disk/flash is persistent
     • Allow in-memory data structures to become permanent immediately
     • Demonstrated 32x speedup compared with using storage devices [Condit+ SOSP’09, Volos+ ASPLOS’11, Coburn+ ASPLOS’11, Venkataraman+ FAST’11]

  5. Executing Applications in Persistent Memory
     • Applications access persistent memory files through open() and mmap()
     • Jeff Moyer, “Persistent memory in Linux,” SNIA NVM Summit, 2016
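The open()/mmap() access path on this slide can be sketched as below. This is a minimal illustration, not the presentation's code: the file path in the usage note is hypothetical, and on a non-DAX file system the same code simply goes through the DRAM page cache instead of reaching NVRAM directly.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Open a file and map it for direct load/store access. On a DAX file
 * system (e.g., ext4-DAX on a pmem device) the returned pointer
 * addresses NVRAM directly, bypassing the page cache; on a regular
 * file system the same code still works via the page cache. */
char *map_pmem_file(const char *path, size_t len)
{
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return NULL;
    }
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}
```

Usage would be e.g. `char *p = map_pmem_file("/mnt/pmem/data", 4096);` followed by plain stores such as `p[0] = 'x';` — no fwrite() in the data path (the mount point is a stand-in).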

  6. Our research – at the software/hardware boundary
     • Workload characterization: exploring persistent memory use cases; identifying system bottlenecks; implications for software/hardware design
     • System software: efficient fault tolerance and data persistence mechanisms
     • Hardware: developing storage accelerators; redefining the boundary between software and hardware
     • (Stack diagram: Applications → System Software (VM, File System, Database System) → ISA → CPU → DRAM/NVRAM → SSD/HDD)

  7. Workload Characterization from a hardware perspective
     • Motivation: persistent memory is managed by both hardware and software, yet most prior work profiles only software statistics (e.g., system throughput)
     • Objectives: help system designers better understand performance bottlenecks; help application designers better utilize persistent memory hardware
     • Approach: profile hardware and software counter statistics; instrument application and system software to obtain insights at the micro-architecture level

  8. Hardware and software configurations
     • CPU: Intel Xeon CPU E5-2620 v3
     • Memory: 12GB of pmem + 4GB of main memory, partitioned on DRAM (memmap)
     • Operating system: Linux 4.4.0 kernel
     • Profiling tools: Linux perf for collecting software and hardware counter statistics; Intel Pin 3.0 instrumentation tool with in-house Pintools
     • File systems evaluated:
       • Ext4: journaling of metadata, running on RAMDisk
       • Ext4-DAX: journaling of metadata, bypassing the page cache with DAX
       • NOVA: nonvolatile memory accelerated log-structured file system [Li+ FAST’16]

  9. About DAX
     • What is DAX? “Direct Access” – enabling efficient Linux support for persistent memory by allowing file system requests to bypass the page cache allocated in DRAM and access NVRAM directly via loads and stores
     • How does Ext4-DAX work? DAX maps storage components directly into userspace
     • *True DAX is not supported in Linux yet – accesses still go through DRAM, i.e., pages are swapped directly between DRAM main memory and NVRAM storage
     • Example file systems with DAX capability: Ext4-DAX, XFS-DAX, Btrfs-DAX → Fedora; Intel PMFS; NOVA

  10. Current workloads
     • Filebench (a widely-used benchmark suite for evaluating file system performance): Fileserver, Webproxy, Webserver, Varmail
     • FFSB (Flexible Filesystem Benchmark): configurable read/write ratio and number of threads
     • Bonnie: measures file system performance by invoking putc() and getc()
     • File compression/decompression: tar/untar, zip/unzip
     • TPC-C running with MySQL: a database online transaction processing workload; write-intensive, with 63.7% writes
     • In-house micro-benchmarks
     • *Applications are compiled with static linking and stored in the NVRAM (pmem) region

  11. Workload throughput
     • (Chart: throughput in operations per second for ext4, ext4-DAX, and NOVA on Fileserver, Webproxy, Webserver, and Varmail)
     • (Chart: execution time in nanoseconds for UNTAR/TAR and transactions per ten seconds for TPC-C, comparing NOVA, EXT4-DAX, and EXT4)

  12. Correlation between system performance and hardware behavior
     • Metrics: dTLB miss rate, iTLB miss rate, LLC load miss rate, LLC store miss rate, page fault rate
     • (Chart: correlation coefficients for Fileserver, Webproxy, Webserver, Varmail, Zip, Unzip, and FFSB; highly correlated, with standard error within 8%)

  13. Throughput vs. Write Intensity
     • (Chart: FFSB throughput in transactions/s for ext4, ext4-DAX, and NOVA as the mix shifts from R=100%/W=0% through R=60%/W=40% to R=0%/W=100%)
     • (Chart: Bonnie normalized throughput, read:write = 1:1, for ext4-dax, ext4, and nova across putc(), efficient block writes, block create, getc(), effective block reads, random seek rate, and block rewrite)

  14. The impact of workload locality
     • NVRAM devices may or may not have an on-chip buffer
     • (Chart: transactions per second for ext4, ext4-DAX, and NOVA vs. buffer hit rate in the revised NVRAM model – DRAM, classic NVM model, and hit rates from 50% to 90%)
     • (Chart: transactions per second vs. buffer size in the revised NVRAM model – DRAM, 4KB, 2KB, 1KB, 512B, 256B)

  15. Our research – at the software/hardware boundary (outline repeated)
     • Workload characterization: exploring persistent memory use cases; identifying system bottlenecks; implications for software/hardware design
     • System software: efficient fault tolerance and data persistence mechanisms
     • Hardware: developing storage accelerators; redefining the boundary between software and hardware
     • (Stack diagram: Applications → System Software (VM, File System, Database System) → ISA → CPU → DRAM/NVRAM → SSD/HDD)

  16. Logging Acceleration (executive summary)
     • Problem: traditional software-based logging imposes substantial overhead in persistent memory – even with undo or redo logging alone, let alone undo+redo logging as used in many modern database systems; changes in the software interface also add burden on programmers
     • Solution: hardware-based logging accelerators that leverage existing hardware information (otherwise largely wasted)
     • Results: 3.3x performance improvement, simplified software interface, low hardware overhead

  17. Logging (Journaling) in Persistent Memory – Maintaining Atomicity
     • (Diagram: a tree in NVRAM rooted at Root with nodes A, B, C, D; updated values C’ and D’ are first appended to the log, a barrier enforces log-before-data ordering, and only then are the in-place updates applied; each log entry is the size of one store)
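The write-ahead ordering in the diagram – the log entry must be durable before the in-place update – can be sketched as follows. This is an illustrative sketch, not the presentation's code; persist() stands in for the real cache-line flush plus fence sequence, reduced here to a compiler/CPU fence so it compiles anywhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a durability barrier: on real hardware this would be
 * CLWB/CLFLUSHOPT on the lines covering [p, p+len) followed by SFENCE.
 * Here it is modeled as a full fence so the sketch is portable. */
static void persist(const void *p, size_t len)
{
    (void)p;
    (void)len;
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/* One undo-log record: enough to roll back a single store. */
struct log_entry {
    uint64_t addr;    /* which word is being overwritten */
    uint64_t old_val; /* value to restore on abort/crash */
};

/* Write-ahead update of one word: the log entry is persisted *before*
 * the in-place store, matching the barrier in the slide's diagram. */
void logged_store(struct log_entry *log, uint64_t *target, uint64_t new_val)
{
    log->addr = (uint64_t)(uintptr_t)target;
    log->old_val = *target;
    persist(log, sizeof *log); /* ordering point: log before data */
    *target = new_val;
    persist(target, sizeof *target);
}
```

If a crash hits between the two persist() calls, recovery can replay the log and restore old_val, which is exactly the atomicity property the slide's barrier provides.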

  18. Performance overhead of software logging
     • Zhao+, “Kiln: Closing the performance gap between systems with and without persistence support,” MICRO 2013.

  19. Software interface of software logging
     • Memory barriers, strict ordering constraints, and cache flushing are all needed to ensure data persistence
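Concretely, under software logging every store to persistent data must be followed by an explicit cache-line flush and a fence. A minimal sketch of that programmer-visible burden, using x86 intrinsics where available with a portable fence fallback (this is illustrative, not the interface from the presentation):

```c
#if defined(__x86_64__) || defined(__i386__)
#include <emmintrin.h> /* _mm_clflush, _mm_sfence */
#define FLUSH_LINE(p) _mm_clflush((const void *)(p))
#define ORDER_STORES() _mm_sfence()
#else
/* Non-x86 fallback: a full fence models the ordering requirement. */
#define FLUSH_LINE(p) __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define ORDER_STORES() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

/* What the programmer must write for every persistent store under
 * software logging: store, then flush the line toward NVRAM, then
 * fence so later stores cannot be reordered ahead of the flush. */
void persistent_store(volatile long *addr, long val)
{
    *addr = val;
    FLUSH_LINE(addr);
    ORDER_STORES();
}
```

The per-store flush/fence pair is precisely the overhead the hardware logging accelerator on the next slides aims to remove from the software interface.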

  20. Our software interface
     • With hardware support for logging, the memory barriers, strict ordering constraints, and cache flushes required for data persistence no longer have to be issued by the programmer

  21. How does it work? L1 cache hit – we get everything needed for the undo+redo log
     • Writes to persistent memory automatically trigger a write to the log – a software-allocated circular buffer in NVRAM
     • Log information includes TxID, address, undo cache-line value, and redo cache-line value
     • The cache hit/miss handling process is leveraged to update the log
     • Log updates are buffered in a FIFO log buffer in the processor and bypass the caches on their way to the memory controllers
     • (Diagram: on an L1 hit on line A, the old value A1 and new value A1’ are captured at cache-line granularity, tagged with TxID and addr(A), queued in the log buffer, and drained to the circular log in NVRAM; Tx_commit completes the transaction)
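A plausible layout for the log record described on this slide – TxID, address, undo value, and redo value at cache-line granularity – is sketched below. Field names and the capture() helper are illustrative, not taken from the actual hardware design.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64

/* One hardware-generated undo+redo log record, per the slide:
 * transaction ID, address, and both cache-line values. */
struct hw_log_entry {
    uint64_t txid;
    uint64_t addr;
    uint8_t undo[CACHE_LINE]; /* line value before the write */
    uint8_t redo[CACHE_LINE]; /* line value after the write  */
};

/* Model of what the cache hit/miss handling path captures: on a hit
 * the old line is already in L1; on a miss, write-allocate fetches it. */
void capture(struct hw_log_entry *e, uint64_t txid, uint64_t addr,
             const uint8_t *old_line, const uint8_t *new_line)
{
    e->txid = txid;
    e->addr = addr;
    memcpy(e->undo, old_line, CACHE_LINE);
    memcpy(e->redo, new_line, CACHE_LINE);
}
```

With both values in one record, recovery can roll a transaction back (undo) or forward (redo), which is why the slide stresses that the hit/miss handling already yields everything an undo+redo log needs.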

  22. How does it work? L1 cache miss – we get everything needed during “write-allocate”
     • Writes to persistent memory automatically trigger a write to the log – a software-allocated circular buffer in NVRAM
     • Log information includes TxID, address, undo cache-line value, and redo cache-line value
     • The cache hit/miss handling process is leveraged to update the log
     • Log updates are buffered in the processor
     • (Diagram: on an L1 miss on line A, write-allocate – here a hit in a lower-level cache – supplies the old value A1; the log update then proceeds as in the hit case)

  23. Force cache writeback when necessary
     • CPU caches must be flushed when a log entry is about to be overwritten by new log updates but the associated data still remains in the CPU caches
     • (Diagram: circular log buffer with head and tail pointers)
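The wraparound condition on this slide can be sketched with simple head/tail arithmetic: a writeback must be forced exactly when the next append would lap the oldest live entry. The constant and names below are illustrative, not from the actual design.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOG_ENTRIES 1024 /* illustrative capacity of the circular log */

/* Monotonic head/tail counters; (head - tail) is the number of live
 * entries, and an entry's slot is its counter modulo LOG_ENTRIES. */
struct circ_log {
    uint64_t head; /* next slot to append        */
    uint64_t tail; /* oldest entry not yet freed */
};

/* True when appending one more entry would overwrite the oldest live
 * entry – the point at which the associated dirty cache lines must be
 * forced back to NVRAM before the log slot can be reused. */
bool append_needs_writeback(const struct circ_log *log)
{
    return log->head - log->tail >= LOG_ENTRIES;
}

void append(struct circ_log *log)
{
    if (append_needs_writeback(log))
        log->tail++; /* oldest entry reclaimed after forced writeback */
    log->head++;
}
```

Using monotonic counters rather than wrapped indices keeps the full/empty distinction trivial: empty is head == tail, full is head - tail == LOG_ENTRIES.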

  24. Results
     • Methodology: McSimA+ simulator running persistent memory micro-benchmarks and a real workload – a persistent version of memcached
     • System throughput improved by 1.45x–1.60x on average; memcached throughput improved by 3.3x
     • Memory traffic reduced by 2.36x–3.12x; dynamic memory energy improved by 1.53x–1.72x
     • Hardware overhead: 17 bytes of flip-flops, 1 bit of cache tag information per cache line, multiplexers
