Virtual Machine Fault-tolerance, by Cheng Wang, Xusheng Chen, Weiwei Jia, et al. (PowerPoint PPT presentation)



SLIDE 1

PLOVER: Fast, Multi-core Scalable Virtual Machine Fault-tolerance

Cheng Wang, Xusheng Chen, Weiwei Jia, Boxuan Li, Haoran Qiu, Shixiong Zhao, and Heming Cui, The University of Hong Kong

SLIDE 2

Virtual machines are pervasive in datacenters

[Diagram: a physical machine running guest VMs on a VMM, hit by a hardware failure.]

VM fault tolerance is crucial!

SLIDE 3

Classic VM replication - primary/backup approach

Remus [NSDI’08]

[Diagram: a primary and a backup machine, each running a guest VM with the service on a VMM; the primary holds an output buffer and transfers memory pages to the backup, which returns an ACK before the client sees any output.]

Synchronize primary/backup every 25ms:

  • 1. Pause the primary VM (every 25ms) and transmit all changed state (e.g., memory pages) to the backup.
  • 2. The backup acknowledges to the primary when the complete state has been received.
  • 3. The primary’s buffered network output is released.
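The three steps above can be sketched as one checkpoint epoch, modeling memory pages and the output buffer as plain Python structures (the function name and data layout are illustrative, not Remus's actual code):

```python
def remus_epoch(dirty_pages, backup_store, output_buffer):
    """One Remus-style checkpoint epoch (normally triggered every 25 ms).

    dirty_pages:   {page_no: bytes} changed on the (paused) primary this epoch
    backup_store:  {page_no: bytes} mirroring the backup VM's memory
    output_buffer: network replies withheld from clients during the epoch
    Returns the replies that may now be released.
    """
    # 1. Pause the primary and transmit all changed state to the backup.
    backup_store.update(dirty_pages)
    # 2. The backup acknowledges once the complete state has been received.
    acked = all(backup_store.get(p) == data for p, data in dirty_pages.items())
    assert acked, "epoch cannot commit without the backup's ACK"
    # 3. Only now is the primary's buffered network output released, so
    #    clients never observe state the backup does not yet have.
    released, output_buffer[:] = list(output_buffer), []
    return released
```

Note that step 3 is what makes the scheme safe: a reply reaches the client only after the state that produced it is durable on the backup.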

SLIDE 16

Two limitations of primary/backup approach (1)

  • Too many memory pages have to be copied and transferred, which greatly balloons client-perceived latency

  # of concurrent clients    Page transfer size (MB)
  16                         20.9
  48                         68.4
  80                         110.5

[Figure: Redis latency (us) with a varied number of concurrent clients (16/48/80, 4 vCPUs per VM); series: unreplicated and Remus with a 25ms synchronization interval.]

SLIDE 17

Two limitations of primary/backup approach (2)

  • The split-brain problem

[Diagram: a network partition separates the primary and backup of a key-value store (KVS). The outdated primary keeps serving client1, which writes x=5, while the backup promotes itself to new primary and serves client2, which writes x=7; the two replicas now hold conflicting values for x.]
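The divergence in the diagram, and the standard majority-quorum remedy that SMR systems use, can be shown with a toy check (illustrative only, not from the talk's code):

```python
def can_commit(acks, cluster_size):
    """A write may commit only if a majority of replicas acknowledged it."""
    return acks > cluster_size // 2

# A partitioned 2-node primary/backup pair has no majority side: the
# outdated primary commits x=5 for client1 while the self-promoted new
# primary commits x=7 for client2, and the replicas silently diverge.
outdated_primary = {"x": 5}
new_primary = {"x": 7}
split_brain = outdated_primary["x"] != new_primary["x"]   # True

# With 3 replicas, a 1-vs-2 partition leaves only one side able to commit.
minority_ok = can_commit(acks=1, cluster_size=3)   # False
majority_ok = can_commit(acks=2, cluster_size=3)   # True
```

This is why PLOVER (later slides) runs three machines rather than two: a third replica gives one side of any partition a majority.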

SLIDE 23

State Machine Replication (SMR): Powerful

[Diagram: client1 and client2 send requests to a primary and two backups; each replica appends the requests to its consensus log before executing them in the service.]

  • SMR systems: Chubby, ZooKeeper, Raft [ATC’14], Consensus in a Box [NSDI’15], NOPaxos [OSDI’16], APUS [SoCC’17]
  • Ensure the same execution states on all replicas
  • Strong fault-tolerance guarantee without the split-brain problem
  • Need to handle non-determinism
    • Deterministic multithreading (e.g., CRANE [SOSP’15]): slow
    • Manually annotating service code to capture non-determinism (e.g., Eve [OSDI’12]): error-prone
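The SMR guarantee above can be sketched as replaying one totally ordered consensus log against a deterministic state machine (a toy key-value service here; the real systems listed use Paxos- or Raft-style consensus to agree on the log):

```python
def smr_apply(consensus_log, state):
    """Replay a totally ordered log of requests on a deterministic service.

    Every replica that applies the identical log to an identical initial
    state ends in the identical final state; no memory copying is needed.
    """
    for op, key, value in consensus_log:
        if op == "SET":
            state[key] = value
        elif op == "DEL":
            state.pop(key, None)
    return state

log = [("SET", "x", 5), ("SET", "x", 7), ("SET", "y", 1), ("DEL", "y", None)]
replica1 = smr_apply(log, {})
replica2 = smr_apply(log, {})
same = replica1 == replica2 == {"x": 7}   # same log, same final state
```

The caveat in the last bullets is visible here: if `smr_apply` did anything non-deterministic (time, random numbers, thread interleavings), the replicas would diverge despite applying the identical log.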
SLIDE 29

Making a choice

Primary/backup approach
Pros:
  • Automatically handles non-determinism
Cons:
  • Unsatisfactory performance due to transferring a large amount of state
  • Suffers from the split-brain problem

State machine replication
Pros:
  • Good performance by ensuring the same execution states
  • Solves the split-brain problem
Cons:
  • Hard to handle non-determinism
SLIDE 30

PLOVER: Combining SMR and primary/backup

  • Simple to achieve by carefully designing the consensus protocol
    • Step 1: Use Paxos to ensure the same total order of requests across replicas
    • Step 2: Invoke VM synchronization periodically and then release the replies
  • Combines the benefits of SMR and primary/backup
    • Step 1 makes the primary and backup have mostly the same memory (up to 97%), so PLOVER needs to copy and transfer only a small portion of it
    • Step 2 automatically addresses non-determinism and ensures external consistency
  • Challenges:
    • How to achieve consensus and synchronize VMs efficiently?
    • When to synchronize the primary and backup so as to maximize the number of identical memory pages?
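A sketch of how the two steps compose, with `paxos_order`, `execute`, and `vm_sync` as hypothetical stand-ins for PLOVER's real consensus, service-execution, and VM-synchronization machinery:

```python
def plover_loop(requests, paxos_order, execute, vm_sync, sync_every=3):
    """Order inputs first (step 1); periodically synchronize the VMs and
    only then release the buffered replies (step 2)."""
    output_buffer, released = [], []
    # Step 1: every replica sees the same total order of requests, so the
    # primary's and backup's memories stay mostly identical (up to 97%).
    for i, req in enumerate(paxos_order(requests), start=1):
        output_buffer.append(execute(req))   # replies stay buffered
        if i % sync_every == 0:
            # Step 2: synchronize the (already mostly identical) VMs;
            # only after the sync commits may replies reach clients.
            vm_sync()
            released.extend(output_buffer)
            output_buffer.clear()
    return released
```

Buffering replies until after the sync is what gives external consistency: a client can never observe output that the backup could not reproduce after a failover.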

SLIDE 31

PLOVER architecture

[Diagram: a client talks to three replica machines connected by RDMA (<10us): a primary and a backup, each running a guest VM with the service on a VMM that holds a consensus log, an output buffer, and a VM-sync module exchanging memory pages; a witness machine keeps only a consensus log.]

RDMA-based input consensus:

  • Primary: propose the request and execute it
  • Backup: agree on the request and execute it
  • Witness: agree on the request and ignore it

RDMA-based VM synchronization:

  • 1. Exchange and union the dirty-page bitmaps
  • 2. Compute a hash of each dirty page
  • 3. Compare the hashes
  • 4. Transfer the divergent pages
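The four synchronization steps can be sketched directly (SHA-256 stands in for PLOVER's page hashing, and dict updates stand in for the RDMA page transfers; names and layout are illustrative):

```python
import hashlib

def sync_vms(primary_mem, backup_mem, primary_dirty, backup_dirty):
    """Transfer only the pages on which primary and backup truly diverge.

    *_mem:   {page_no: bytes} guest memory of each VM
    *_dirty: set of page numbers written since the last synchronization
    Returns the page numbers that actually had to be transferred.
    """
    # 1. Exchange and union the dirty-page bitmaps.
    candidates = primary_dirty | backup_dirty
    # 2. Compute a hash of each dirty page on both sides.
    def page_hash(mem, p):
        return hashlib.sha256(mem.get(p, b"")).digest()
    # 3. Compare hashes: pages the two VMs dirtied identically need no transfer.
    divergent = {p for p in candidates
                 if page_hash(primary_mem, p) != page_hash(backup_mem, p)}
    # 4. Transfer only the divergent pages (the primary's copy wins).
    for p in divergent:
        backup_mem[p] = primary_mem.get(p, b"")
    return divergent
```

Because input consensus makes both VMs execute the same ordered requests, most dirty pages hash identically (up to 97% per the earlier slide), so step 4 moves only a small remainder.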

SLIDE 41

When to decide the VM synchronization period?

[Diagram: primary and backup machines, each running a guest VM with the service on a VMM with a VM-sync module, exchanging memory pages.]

Issue of not choosing the synchronization timing carefully:

  • A large amount of divergent state

Synchronize when processing is almost finished!

  • CPU and disk usage is almost zero when the service finishes processing
  • Non-intrusive scheme to monitor the service state
  • Invoke synchronization when CPU and disk usage is nearly zero
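The heuristic above amounts to a simple idle check (the 5% threshold is illustrative; the slide says only "nearly zero"):

```python
def should_sync(cpu_usage, disk_usage, idle_threshold=0.05):
    """Fire VM synchronization only when the service has almost finished
    processing, i.e. both CPU and disk usage are nearly zero; syncing
    mid-burst would capture a large amount of divergent state."""
    return cpu_usage < idle_threshold and disk_usage < idle_threshold

# Sampled (cpu, disk) utilization; only the near-idle sample triggers a sync.
samples = [(0.90, 0.40), (0.75, 0.10), (0.03, 0.01), (0.60, 0.20)]
sync_points = [i for i, (cpu, disk) in enumerate(samples)
               if should_sync(cpu, disk)]
```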

SLIDE 45

PLOVER addresses other practical challenges

  • Concurrent hash computation of dirty pages
  • Fast consensus without interrupting the VMM’s I/O event loop
  • Collecting the service’s running state from the VMM without hurting performance
  • Full integration with KVM-QEMU

SLIDE 46

Evaluation setup

  • Three replica machines
    • Dell R430 servers
    • Connected with a 40Gbps network
    • Each guest VM configured with 4 vCPUs and 16GB of memory
  • Metrics: measured both throughput and latency at the 95th percentile
  • Compared with three state-of-the-art VM fault-tolerance systems
    • Remus (NSDI’08): using its latest KVM-based implementation (developed by the KVM project)
    • STR (DSN’09) and COLO (SoCC’13): various optimizations of Remus; e.g., COLO skips synchronization if the network outputs from the two VMs are the same

SLIDE 47

  • Evaluated PLOVER on 12 programs, grouped into 8 services

  Service                        Program type               Benchmark    Workload
  Redis                          Key-value store            self         50% SET, 50% GET
  SSDB                           Key-value store            self         50% SET, 50% GET
  MediaTomb                      Multimedia storage server  ApacheBench  Transcoding videos
  pgSQL                          Database server            pgbench      TPC-B
  DjCMS (Nginx, Python, MySQL)   Content management system  ApacheBench  Web requests on a dashboard page
  Tomcat                         HTTP web server            ApacheBench  Web requests on a shopping store page
  lighttpd                       HTTP web server            ApacheBench  Watermark image with PHP
  Node.js                        HTTP web server            ApacheBench  Web requests on a messenger bot

SLIDE 48

Evaluation questions

  • How does PLOVER compare to an unreplicated VM and to state-of-the-art VM fault-tolerance systems?
  • How does PLOVER scale to multi-core?
  • What is PLOVER’s CPU footprint?
  • How robust is PLOVER to failures?
    • Handles network partitions, leader failures, etc., efficiently
  • How do PLOVER and the other three systems compare under different parameter settings?
    • PLOVER is still much faster than the three systems

SLIDE 49

Throughput on four services

SLIDE 50

Throughput on the other four services

SLIDE 51

Lighttpd+PHP performance analysis

PLOVER:
  Interval   Dirty pages   Same   Transfer
  86ms       33.9K         97%    2.8ms

Remus:
  Sync interval               Dirty pages   Transfer
  25ms (Remus-Xen default)    33.3K         53.5ms
  100ms (Remus-KVM default)   33.9K         55.7ms

Analysis: PLOVER needs to transfer only 33.9K * 3% = 1.0K pages, but Remus, STR, and COLO need to transfer all or most of the 33K dirty pages. E.g., since most network outputs from the two VMs differ, COLO has to do a synchronization for almost every output packet.

SLIDE 52

pgSQL performance analysis

PLOVER is slower than COLO on pgSQL

  • COLO safely skips synchronization because most network outputs from the two VMs are the same

SLIDE 53

Performance Summary (4 vCPUs per VM)

  • PLOVER’s throughput is 21% lower than unreplicated, 0.9X higher than Remus, 1.0X higher than COLO, and 1.4X higher than STR
  • 72%~97% of the dirty memory pages between PLOVER’s primary and backup are the same
  • PLOVER’s TCP implementation’s throughput is still 0.9X higher than the three systems on average

SLIDE 54

Multi-core Scalability (4 vCPUs to 16 vCPUs per VM)

  • Redis, DjCMS, pgSQL, and Node.js are not listed because they do not need many vCPUs per VM to improve throughput
    • E.g., Redis is single-threaded

SLIDE 55

CPU footprint

SLIDE 56

Conclusion and Ongoing Work

  • PLOVER: efficiently replicates VMs with strong fault tolerance
    • Low performance overhead, scalable to multi-core, robust to replica failures
  • Collaborating with Huawei for technology transfer
    • Funded by the Huawei Innovation Research Program 2017
    • Submitted a patent (Patent Cooperation Treaty ID: 85714660PCT01)
  • https://github.com/hku-systems/plover