Logging in Persistent Memory: to Cache, or Not to Cache?
Mengjie Li, Matheus Ogleari, Jishen Zhao
Persistent Memory
• Memory (DRAM): accessed with load/store; not persistent
• Storage (disk/flash): accessed with fopen, fread, fwrite, …; persistent
• Persistent memory (NVRAM: STT-RAM, PCM, ReRAM, NVDIMM, battery-backed DRAM, etc.): accessed with load/store; persistent
These nonvolatile devices can retain data in a consistent state across a power loss.
Logging in Persistent Memory
Update persistent memory with transactions:
    Tx_begin
        do some reads
        do some computation
        Rlog( addr(C), new_val(C) )
        memory_barrier
        write C
    Tx_commit
[Figure: cores with private L1 caches and a shared LLC above NVRAM. The NVRAM holds a data structure (root, nodes A, B, C, D) and a log. Micro-ops over time: Log_C', then store C'1, store C'2, ...]
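The transaction outline above can be sketched in C as follows. This is a minimal illustration, not the authors' implementation: the types `log_entry_t` and `tx_log_t`, the capacity constant, and the `tx_*` helpers are hypothetical names, and the barrier is a plain compiler/CPU fence (GCC/Clang builtin). A real persistent-memory system would also need to make sure the log entry reaches NVRAM before the data write, which is outside this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log record: target address plus the new value (addr(C), new_val(C)). */
typedef struct {
    uint64_t *addr;
    uint64_t  new_val;
} log_entry_t;

#define LOG_CAPACITY 64

typedef struct {
    log_entry_t entries[LOG_CAPACITY];
    size_t      count;      /* entries appended in the current transaction */
    int         committed;  /* set to 1 by tx_commit */
} tx_log_t;

/* Orders earlier stores before later ones (assumption: GCC/Clang builtin). */
static inline void memory_barrier(void) {
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}

static void tx_begin(tx_log_t *log) {
    log->count = 0;
    log->committed = 0;
}

/* Append the log record first, fence, then update the data in place. */
static void tx_write(tx_log_t *log, uint64_t *addr, uint64_t new_val) {
    assert(log->count < LOG_CAPACITY);
    log->entries[log->count].addr = addr;
    log->entries[log->count].new_val = new_val;
    log->count++;
    memory_barrier();   /* log record is ordered before the data write */
    *addr = new_val;    /* "write C" */
}

static void tx_commit(tx_log_t *log) {
    memory_barrier();
    log->committed = 1;
}
```

A caller would wrap its updates as `tx_begin(&log); tx_write(&log, &c, new_c); tx_commit(&log);`.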
To cache, or not to cache? That is the question. [Mengjie Li+, Memsys 2017]
Experimental Setup
• Desktop: Dell OptiPlex 7040 Tower
  o CPU: 4-core 3.4GHz Intel Core-i7
  o Cache: 8 MB last-level cache
• Measurement tools: perf and rdtsc
• Micro-benchmarks: run 20 times; report the average performance, excluding initialization time
  o Various working set sizes
  o Various transaction sizes and write intensity
  o Various data structures: hashtable, rbtree, array, …
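The measurement procedure (repeat a run, exclude initialization, average the cycle counts) might look like the sketch below. It assumes GCC or Clang on x86, where `__rdtsc()` is available from `<x86intrin.h>`; `average_cycles`, `workload_t`, and the two workload functions are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(); assumption: GCC/Clang on x86 */

typedef void (*bench_fn)(void *arg);

/* Runs `body` `runs` times and returns the mean cycle count.
   `setup` (may be NULL) runs before each timed region and is not timed. */
static uint64_t average_cycles(bench_fn setup, bench_fn body, void *arg, int runs) {
    assert(runs > 0);
    uint64_t total = 0;
    for (int r = 0; r < runs; ++r) {
        if (setup) setup(arg);      /* initialization, excluded from timing */
        uint64_t start = __rdtsc();
        body(arg);
        uint64_t end = __rdtsc();
        total += end - start;
    }
    return total / (uint64_t)runs;
}

/* Hypothetical workload: fill a small array. */
typedef struct {
    int data[1024];
    int setups;   /* number of untimed setup calls */
    int runs;     /* number of timed body calls */
} workload_t;

static void workload_setup(void *arg) {
    ((workload_t *)arg)->setups++;
}

static void workload_body(void *arg) {
    workload_t *w = (workload_t *)arg;
    for (int i = 0; i < 1024; ++i) w->data[i] = i;
    w->runs++;
}
```

Calling `average_cycles(workload_setup, workload_body, &w, 20)` mirrors the 20-run average described on the slide.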
Microbenchmarks Example
// initialization
Create an array of strings

// Uncacheable log
for (i = 0; i < array_size; ++i) {
    value = random_string;
    key = i;
    // Log updates
    // Intrinsic functions to invoke movnti
    _mm_stream_si32(&log[2 * i], key);
    _mm_stream_si32(&log[2 * i + 1], value);
    asm volatile ("sfence");
    array[i] = value;
}

// Cacheable log
for (i = 0; i < array_size; ++i) {
    value = random_string;
    key = i;
    // Log updates
    log[2 * i] = key;
    log[2 * i + 1] = value;
    asm volatile ("sfence");
    array[i] = value;
}
Issue with Cacheable Log
• Cache pollution
[Figure: each core has L1i and L1d caches above a shared last-level cache. Log entries occupy the L1d caches and the last-level cache before crossing the memory bus to the log in NVM, next to DRAM.]
LLC Miss Rate and Execution Time
[Figure: LLC miss rate (axis 50% to 90%) and execution time in million cycles (axis 0.0 to 1.4) for the uncacheable and cacheable log.]
How about uncacheable log performance?
How do we make the log uncacheable?
Example: x86 processors provide uncacheable (non-temporal) write instructions (movnti, movntq, etc.)
Instructions can be invoked by
• Inline assembly (__asm__())
• Intrinsic functions (_mm_stream_si32)
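Both ways of issuing `movnti` are sketched below, assuming GCC or Clang on x86 with SSE2. `_mm_stream_si32` and `_mm_sfence` are the standard intrinsics from `<emmintrin.h>`; the wrapper names `nt_store_intrinsic`, `nt_store_asm`, and `log_update` are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence (SSE2) */

/* Non-temporal 32-bit store through the intrinsic. */
static inline void nt_store_intrinsic(int *p, int v) {
    _mm_stream_si32(p, v);
}

/* The same store through inline assembly (AT&T syntax: source, destination). */
static inline void nt_store_asm(int *p, int v) {
    __asm__ volatile ("movnti %1, %0" : "=m" (*p) : "r" (v));
}

/* Hypothetical helper: log a (key, value) pair with non-temporal stores,
   then fence so the stores are ordered before whatever follows. */
static inline void log_update(int *log, int i, int key, int value) {
    assert(log != 0);
    nt_store_intrinsic(&log[2 * i], key);
    nt_store_asm(&log[2 * i + 1], value);
    _mm_sfence();
}
```

The intrinsic is usually the easier choice because the compiler checks the operand types; the inline-assembly form is shown only because the slide lists it as an option.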
Write Combining Buffer (WCB)
• 4-6 cache lines
[Figure: each core has a WCB beside its L1 cache. The log passes through the WCB and over the memory bus to NVM, bypassing the L1 and last-level caches.]
Issues with Uncacheable Log
• Existing uncacheable writing schemes are sub-optimal
  o Partial writes in WCB
  o Overhead of uncacheable write instructions
  o Limited WCB size
Partial Writes in WCB
• Full write: 64B from the WCB to memory, 1 bus clock
• Partial write: < 64B from the WCB to memory, also 1 bus clock
Partial writes are inefficient because they underutilize the memory bus bandwidth.
Execution Time vs. Transaction Size: Partial Writes
• Partial writes: string size 4B, 2097152 iterations, 8MB total data; 1.28E09 cycles
• Full writes: string size 64B, 131072 iterations, 8MB total data; 1.15E08 cycles
[Figure: normalized execution time (0% to 100%) of the uncacheable and cacheable log for the partial-write and full-write cases.]
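The iteration counts and the bandwidth argument can be checked with a little arithmetic. The sketch below assumes a 64B transfer unit (the full-write size on the slides) and the worst case in which every partial write leaves the WCB on its own; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u

/* Number of writes needed to cover total_bytes with write_bytes per write. */
static uint64_t iterations_for(uint64_t total_bytes, uint64_t write_bytes) {
    assert(write_bytes > 0 && total_bytes % write_bytes == 0);
    return total_bytes / write_bytes;
}

/* Share of a 64B transfer that carries useful data, in hundredths of a
   percent (625 means 6.25%), if each write is sent on its own. */
static unsigned line_utilization_x100(unsigned write_bytes) {
    assert(write_bytes > 0 && write_bytes <= CACHE_LINE_BYTES);
    return write_bytes * 10000u / CACHE_LINE_BYTES;
}
```

With 8MB of data, 4B writes need 16 times as many iterations as 64B writes, and each 4B write fills only a sixteenth of a full transfer.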
Overhead of Uncacheable Write Instructions
// Uncacheable log
for (i = 0; i < array_size; ++i) {
    value = random_string;
    key = i;
    // Log updates
    // Intrinsic functions to invoke movnti
    _mm_stream_si32(&log[2 * i], key);
    _mm_stream_si32(&log[2 * i + 1], value);
    asm volatile ("sfence");
    array[i] = value;
}

// Cacheable log
for (i = 0; i < array_size; ++i) {
    value = random_string;
    key = i;
    // Log updates
    log[2 * i] = key;
    log[2 * i + 1] = value;
    asm volatile ("sfence");
    array[i] = value;
}
Overhead of Uncacheable Write Instructions
• More overhead to do type casting if the data written is not an integer
    void _mm_stream_si32 (int *p, int a)
    asm ("movnti %1, %0" : "=m" (*p) : "r" (v)); // int *p, int v;
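To illustrate the casting overhead, the sketch below logs a `float` through `_mm_stream_si32`, which only accepts `int`. It assumes a 4-byte `float` and an `int`-typed log; `nt_log_float` and `read_log_float` are hypothetical helpers. The extra `memcpy` steps are the conversion work that a plain cacheable store would not need.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence (SSE2) */
#include <string.h>

_Static_assert(sizeof(float) == sizeof(int), "sketch assumes 4-byte float and int");

/* Reinterpret the float's bits as an int, then issue the non-temporal store. */
static inline void nt_log_float(int *slot, float value) {
    int bits;
    memcpy(&bits, &value, sizeof bits);   /* type conversion step */
    _mm_stream_si32(slot, bits);
}

/* Read a logged float back out of an int-typed log slot. */
static inline float read_log_float(const int *slot) {
    float value;
    memcpy(&value, slot, sizeof value);
    return value;
}
```

Compilers typically reduce the `memcpy` to a register move, but wider or non-power-of-two types would have to be split into several 4-byte stores.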
Issues with Limited WCB Size
[Figure: log updates from the transactions issued by the program pass through the WCB onto the NVRAM bus.]
Inefficiencies of Uncacheable Log
String size (Bytes)    Iterations
4                      2097152
8                      1048576
16                     524288
32                     262144
64                     131072
128                    65536
256                    32768
[Figure: execution time in billion cycles (axis 0.0 to 3.5) of the uncacheable and cacheable log, and speedup (axis 1.0 to 1.6), versus string size (4 to 256 Bytes). Annotations: "Partial writes and sfence" and "WCB size limit".]
Summary
• Tradeoff between cacheable and uncacheable log
  o Issues with cacheable log: cache contamination
  o Issues with uncacheable log: sub-optimal design in
    • Uncacheable write instructions and programming interface
    • Hardware components, e.g., write-combining buffer design and the way it is used
• More results
  o Sensitivity study on read/write ratio in transactions
  o Sensitivity study on transaction size
  o Other data structures: hash table, rbtree, b+tree, etc.