PASTE: A Network Programming Interface for Non-Volatile Main Memory




slide-1
SLIDE 1

PASTE: A Network Programming Interface for Non-Volatile Main Memory

Michio Honda (NEC Laboratories Europe) Giuseppe Lettieri (Università di Pisa) Lars Eggert and Douglas Santry (NetApp) USENIX NSDI 2018

slide-2
SLIDE 2

Review: Memory Hierarchy

Slow, block-oriented persistence

  • CPU caches: 5-50 ns (byte access w/ load/store)
  • Main memory: 70 ns (byte access w/ load/store)
  • HDD / SSD: 100-1000s us (block access w/ system calls)

slide-3
SLIDE 3

Review: Memory Hierarchy

Fast, byte-addressable persistence

  • CPU caches: 5-50 ns (byte access w/ load/store)
  • Main memory: 70-1000s ns (byte access w/ load/store)
  • HDD / SSD: 100-1000s us (block access w/ system calls)

slide-4
SLIDE 4

Networking is faster than disks/SSDs

1.2KB durable write over TCP/HTTP

  • Client to server (cables, NICs, TCP/IP, socket API): 23 us
  • Server to SSD (syscall, PCIe bus, physical media): 1300 us

slide-5
SLIDE 5

Networking is slower than NVMM

1.2KB durable write over TCP/HTTP

  • Client to server (cables, NICs, TCP/IP, socket API): 23 us
  • Server to NVMM (memcpy, memory bus, physical media): 2 us
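As a rough worked comparison, assume the two figures on each slide are additive parts of one request's latency (the slides give only the two components, so this is an assumption). The network share of a 1.2KB durable write would then be:

```latex
\[
\frac{23}{23 + 1300} \approx 1.7\% \quad \text{(SSD)}, \qquad
\frac{23}{23 + 2} = 92\% \quad \text{(NVMM)}
\]
```

Under that assumption, networking moves from a negligible part of the cost to nearly all of it.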

slide-6
SLIDE 6

Networking is slower than NVMM

1.2KB durable write over TCP/HTTP

  • Clients to server (cables, NICs, TCP/IP, socket API)
  • Server to NVMM (memcpy, memory bus, physical media)

Server event loop:

    nevts = epoll_wait(fds);
    for (i = 0; i < nevts; i++) {
        read(fds[i], buf);
        ...
        memcpy(nvmm, buf);
        ...
        write(fds[i], reply);
    }
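The copy-based path in the loop above can be sketched for a single request as follows. This is only an illustration: `serve_one_copy`, the 2048-byte staging buffer, and the fixed "OK" reply are hypothetical, and `nvmm` can be any byte-addressable region (ordinary memory is enough to try it out).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: serve one request the conventional way.
 * Reads the request into a DRAM staging buffer, copies it into nvmm,
 * and writes a fixed reply. Returns bytes stored, or -1 on error. */
ssize_t serve_one_copy(int fd, char *nvmm, size_t nvmm_cap)
{
    static const char reply[] = "OK";
    char buf[2048];                          /* staging buffer in DRAM */
    ssize_t n = read(fd, buf, sizeof(buf));  /* copy 1: kernel to user */

    if (n <= 0 || (size_t)n > nvmm_cap)
        return -1;
    memcpy(nvmm, buf, (size_t)n);            /* copy 2: DRAM to NVMM */
    /* A real server would flush the copied cache lines before replying. */
    if (write(fd, reply, sizeof(reply) - 1) != (ssize_t)(sizeof(reply) - 1))
        return -1;
    return n;
}
```

Each request crosses two copies before it reaches NVMM, which is the data-movement cost the later slides set out to remove.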

slide-7
SLIDE 7

Innovations at both stacks

Network stack: MegaPipe [OSDI’12], Seastar, mTCP [NSDI’14], IX [OSDI’14], Stackmap [ATC’16]

Storage stack: NVTree [FAST’15], NVWal [ASPLOS’16], NOVA [FAST’16], Decibel [NSDI’17], LSNVMM [ATC’17]

slide-8
SLIDE 8

Stacks are isolated

Network stack: MegaPipe [OSDI’12], Seastar, mTCP [NSDI’14], IX [OSDI’14], Stackmap [ATC’16]

Storage stack: NVTree [FAST’15], NVWal [ASPLOS’16], NOVA [FAST’16], Decibel [NSDI’17], LSNVMM [ATC’17]

Costs of moving data between the stacks

slide-9
SLIDE 9

Bridging the gap

Network stack: MegaPipe [OSDI’12], Seastar, mTCP [NSDI’14], IX [OSDI’14], Stackmap [ATC’16]

Storage stack: NVTree [FAST’15], NVWal [ASPLOS’16], NOVA [FAST’16], Decibel [NSDI’17], LSNVMM [ATC’17]

PASTE (between the two stacks)

slide-10
SLIDE 10

PASTE Design Goals

  • Durable zero copy

○ DMA to NVMM

  • Selective persistence

○ Exploit modern NIC’s DMA to L3 cache

  • Persistent data structures

○ Indexed, named packet buffers backed by a file

  • Generality and safety

○ TCP/IP in the kernel and netmap API

  • Best practices from modern network stacks

○ Run-to-completion, blocking, busy-polling, batching etc

slide-11
SLIDE 11

PASTE in Action

[Diagram: app thread; Pring (slots [0]-[7]) with cur pointer; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots hold Pbufs 20-27.]

slide-12
SLIDE 12

PASTE in Action

[Diagram: app thread; Pring (slots [0]-[7]) with cur pointer; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots hold Pbufs 20-27.]

slide-13
SLIDE 13

PASTE in Action

  • poll() system call

  • 1. Run NIC I/O and TCP/IP

[Diagram: app thread; Pring (slots [0]-[7]) with cur pointer; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots hold Pbufs 20-27.]

slide-14
SLIDE 14

PASTE in Action

  • poll() system call

○ Got 6 in-order TCP segments

  • 1. Run NIC I/O and TCP/IP

[Diagram: app thread; Pring (slots [0]-[7]) with cur pointer; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots hold Pbufs 20-27.]

slide-15
SLIDE 15

PASTE in Action

  • poll() system call

○ The received segments are set to Pring slots

  • 1. Run NIC I/O and TCP/IP

[Diagram: app thread; Pring (slots [0]-[7]) with cur and tail pointers; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots now hold Pbufs 1 2 3 4 5 6 27.]

slide-16
SLIDE 16

PASTE in Action

  • Return from poll()

  • 1. Run NIC I/O and TCP/IP

[Diagram: app thread; Pring (slots [0]-[7]) with cur and tail pointers; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots hold Pbufs 1 2 3 4 5 6 27.]

slide-17
SLIDE 17

PASTE in Action

  • 1. Run NIC I/O and TCP/IP
  • 2. Read data on Pring

[Diagram: app thread; Pring (slots [0]-[7]) with cur and tail pointers; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots hold Pbufs 1 2 3 4 5 6 27.]

slide-18
SLIDE 18

PASTE in Action

  • Flush Pbuf data from CPU cache to DIMM

○ clflush(opt) instruction

  • 1. Run NIC I/O and TCP/IP
  • 2. Read data on Pring
  • 3. Flush Pbuf(s)

[Diagram: app thread; Pring (slots [0]-[7]) with cur and tail pointers; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots hold Pbufs 1 2 3 4 5 6 27.]
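Step 3 can be sketched roughly as below. This is a minimal illustration, not the PASTE implementation: `flush_range`, `persist_fence`, and the 64-byte line size are assumptions, and it uses the plain `_mm_clflush` intrinsic (SSE2) rather than `clflushopt` so that it builds without extra compiler flags. On non-x86-64 targets the flush is compiled out and only the line count remains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
#endif

#define CACHE_LINE 64    /* assumed cache-line size */

/* Flush every cache line overlapping [addr, addr + len).
 * Returns the number of lines visited. */
size_t flush_range(const void *addr, size_t len)
{
    uintptr_t p, end;
    size_t lines = 0;

    if (len == 0)
        return 0;
    p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);  /* align down */
    end = (uintptr_t)addr + len;
    for (; p < end; p += CACHE_LINE, lines++) {
#if defined(__x86_64__)
        _mm_clflush((const void *)p);
#endif
    }
    return lines;
}

/* Order later stores after the flushes issued so far. */
void persist_fence(void)
{
#if defined(__x86_64__)
    _mm_mfence();
#else
    __sync_synchronize();
#endif
}
```

A caller would flush the payload range of each Pbuf it wants to keep and then call `persist_fence()` once before relying on the data being durable.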

slide-19
SLIDE 19

PASTE in Action

  • Pbuf is a persistent data representation

○ Base address is static, i.e., a file (/mnt/pm/pp)

○ Buffers can be recovered after reboot

  • 1. Run NIC I/O and TCP/IP
  • 2. Read data on Pring
  • 3. Flush Pbuf(s)
  • 4. Flush Plog entry(ies)

[Diagram: app thread; Pring (slots [0]-[7]) with cur and tail pointers; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots hold Pbufs 1 2 3 4 5 6 27. The Plog now holds one entry with values 1, 120, 96.]
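One way to see why buffers can be recovered after a reboot: a log entry that stores a buffer index and offset, rather than a pointer, stays valid wherever the Ppool file is mapped next time. The sketch below is hypothetical throughout; the struct layout, the field widths, the 2048-byte `PBUF_SIZE`, and `plog_payload` are illustration only and are not taken from the slides beyond the three field names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PBUF_SIZE 2048u   /* assumed fixed buffer size (hypothetical) */

/* Hypothetical Plog entry using the three field names on the slide. */
struct plog_entry {
    uint32_t pbuf;  /* index of the buffer in the Ppool */
    uint16_t off;   /* offset of the payload inside the buffer */
    uint16_t len;   /* payload length in bytes */
};

/* Resolve an entry to a payload pointer, given wherever the Ppool file
 * is mapped in this process. Returns NULL if the entry is out of bounds. */
const uint8_t *plog_payload(const uint8_t *pool_base, size_t pool_bytes,
                            const struct plog_entry *e)
{
    size_t start = (size_t)e->pbuf * PBUF_SIZE + e->off;

    if ((size_t)e->off + e->len > PBUF_SIZE || start + e->len > pool_bytes)
        return NULL;
    return pool_base + start;
}
```

After a restart, the application would map the Ppool file again, pass the new base address to `plog_payload`, and reach the same bytes.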

slide-20
SLIDE 20

PASTE in Action

  • Prevent the kernel from recycling the buffer

  • 1. Run NIC I/O and TCP/IP
  • 2. Read data on Pring
  • 3. Flush Pbuf(s)
  • 4. Flush Plog entry(ies)
  • 5. Swap out Pbuf(s)

[Diagram: app thread; Pring (slots [0]-[7]) with cur and tail pointers; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots now hold Pbufs 8 2 3 4 5 6 27 (Pbuf 1 swapped for Pbuf 8). The Plog holds one entry with values 1, 120, 96.]

slide-21
SLIDE 21

PASTE in Action

  • Same for Pbuf 2 and 6

  • 1. Run NIC I/O and TCP/IP
  • 2. Read data on Pring
  • 3. Flush Pbuf(s)
  • 4. Flush Plog entry(ies)
  • 5. Swap out Pbuf(s)

[Diagram: app thread; Pring (slots [0]-[7]) with cur and tail pointers; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots now hold Pbufs 8 9 3 4 5 10 27. The Plog holds three entries with values 1, 120, 96; 2, 768, 96; and 6, 987, 96.]

slide-22
SLIDE 22

PASTE in Action

  • Advance cur

○ Return buffers in slot 0-6 to the kernel at next poll()

  • 1. Run NIC I/O and TCP/IP
  • 2. Read data on Pring
  • 3. Flush Pbuf(s)
  • 4. Flush Plog entry(ies)
  • 5. Swap out Pbuf(s)
  • 6. Update Pring

[Diagram: app thread; Pring (slots [0]-[7]) with cur and tail pointers; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots hold Pbufs 8 9 3 4 5 10 27. The Plog holds three entries with values 1, 120, 96; 2, 768, 96; and 6, 987, 96.]
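Steps 5 and 6 can be sketched on a toy ring as follows. The `struct pring` here is a hypothetical simplification (each slot holds only a buffer index), and `consume_ring`, the `keep_mask` bitmap, and the spare-buffer array are illustration only, not the netmap API. The index values in the usage note mirror the slides: buffers 1, 2, and 6 are kept and swapped for 8, 9, and 10.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 8

/* Hypothetical, simplified ring: each slot holds only a buffer index. */
struct pring {
    uint32_t buf_idx[RING_SLOTS];
    uint32_t cur;   /* first slot the app has not consumed yet */
    uint32_t tail;  /* first slot still owned by the kernel */
};

/* Consume slots [cur, tail). For each slot whose bit is set in keep_mask,
 * swap in a spare buffer so the kernel cannot recycle the kept one.
 * Kept indices are written to kept[]; returns how many were kept. */
unsigned consume_ring(struct pring *r, uint32_t keep_mask,
                      const uint32_t *spare, unsigned nspare, uint32_t *kept)
{
    unsigned nkept = 0;
    uint32_t i = r->cur;

    while (i != r->tail) {
        if (((keep_mask >> i) & 1u) && nkept < nspare) {
            kept[nkept] = r->buf_idx[i];   /* app now owns this buffer */
            r->buf_idx[i] = spare[nkept];  /* kernel gets a fresh one */
            nkept++;
        }
        i = (i + 1) % RING_SLOTS;
    }
    r->cur = i;  /* consumed slots go back to the kernel at the next poll() */
    return nkept;
}
```

Starting from slots `1 2 3 4 5 6 27`, with `cur` at 0, `tail` at 6, a keep mask selecting slots 0, 1, and 5, and spares `8 9 10`, the ring ends up as `8 9 3 4 5 10 27` with `cur` equal to `tail`.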

slide-23
SLIDE 23

PASTE in Action

  • 1. Run NIC I/O and TCP/IP
  • 2. Read data on Pring
  • 3. Flush Pbuf(s)
  • 4. Flush Plog entry(ies)
  • 5. Swap out Pbuf(s)
  • 6. Update Pring

[Diagram: app thread; Pring (slots [0]-[7]) with cur and tail pointers; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; Plog (/mnt/pm/plog) with fields len, off, pbuf, labeled "Write-Ahead Logs"; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots hold Pbufs 8 9 3 4 5 10 27. The Plog holds three entries with values 1, 120, 96; 2, 768, 96; and 6, 987, 96.]

slide-24
SLIDE 24

PASTE in Action

  • We can organize various data structures in Plog

  • 1. Run NIC I/O and TCP/IP
  • 2. Read data on Pring
  • 3. Flush Pbuf(s)
  • 4. Flush Plog entry(ies)
  • 5. Swap out Pbuf(s)
  • 6. Update Pring

[Diagram: app thread; Pring (slots [0]-[7]) with cur and tail pointers; Ppool (shared memory, /mnt/pm/pp) holding Pbufs; NIC, TCP/IP, and file system /mnt/pm; user/kernel boundary; zero copy. Pring slots hold Pbufs 8 9 3 4 5 10 27. The Plog (/mnt/pm/plog) is drawn as a B+tree with node keys 5, 3, 5, 7 and entries (1, 96, 120), (2, 96, 987), (6, 96, 512).]

slide-25
SLIDE 25

Evaluation

  • 1. How does PASTE outperform existing systems?
  • 2. Is PASTE applicable to existing applications?
  • 3. Is PASTE useful for systems other than file/DB storage?
slide-26
SLIDE 26

How does PASTE outperform existing systems?

[Charts: WAL and B+tree (all writes), each with 64B and 1280B requests]

What if we use more complex data structures?

slide-27
SLIDE 27

How does PASTE outperform existing systems?

[Charts: WAL and B+tree (all writes), each with 64B and 1280B requests]

slide-28
SLIDE 28

Is PASTE applicable to existing applications?

  • Redis

[Charts: YCSB (read mostly) and YCSB (update heavy)]

slide-29
SLIDE 29

Is PASTE useful for systems other than DB/file storage?

  • Packet logging prior to forwarding

○ Fault-tolerant middlebox [Sigcomm’15]

○ Traffic recording

  • Extend mSwitch [SOSR’15]

○ Scalable NFV backend switch

slide-30
SLIDE 30

Conclusion

  • PASTE is a network programming interface that:

○ Enables durable zero copy to NVMM

○ Helps apps organize persistent data structures on NVMM

○ Lets apps use TCP/IP and be protected

  • Offers a high-performance network stack even w/o NVMM

https://github.com/luigirizzo/netmap/tree/paste

micchie@sfc.wide.ad.jp or @michioh

slide-31
SLIDE 31

Multicore Scalability

  • WAL throughput
slide-32
SLIDE 32

Further Opportunity with Co-designed Stacks

  • What if we use NVMM with higher access latency?

○ e.g., 3D XPoint

  • Overlap flushes and processing with clflushopt and mfence before the system call (which triggers packet I/O)

○ See the paper for results

[Timeline: system call (receive new requests); examine request, clflushopt; examine request, clflushopt; mfence (wait for flushes to complete); system call (send responses)]

slide-33
SLIDE 33

Experiment Setup

  • Intel Xeon E5-2640v4 (2.4 GHz)
  • HPE 8GB NVDIMM (NVDIMM-N)
  • Intel X540 10 GbE NIC
  • Comparison

○ Linux and Stackmap [ATC’16] (current state of the art)

○ Fair to use the same kernel TCP/IP implementation