Outline
★ Introduction
★ Fine-grained isolation
★ Checkpoint-based recovery
★ Evaluation and Conclusions

Unit of fault tolerance: Driver entry point
[Figure: a network driver with entry points probe, xmit, and config, attached to a network card. Successive builds contrast whole-driver isolation with FGFT isolation.]
★ Provide fault tolerance to specific driver entry points
★ Can be applied to untested code or code marked suspicious by static or runtime tools

Transactional support through code generation
[Figure labels: netdev, get_ringparam, network driver, SFI network driver, stubs, range table, result.]

Range Table
| Address | Access rights |
|---|---|
| 0xffffa000 | Read |
| 0xffffa008 | Write |
| 0xffffa00a | Read |

★ Detects and recovers from:
  ★ Memory errors like invalid pointer accesses
  ★ Structural errors like malformed structures
  ★ Processor exceptions like divide by zero, stack corruption

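The range table on this slide can be read as a per-call allow-list that the isolated code's memory accesses are checked against. The following is a minimal Python model of that idea, not FGFT's implementation: the `Range`, `RangeTable`, and `AccessFault` names are hypothetical, and the byte lengths are assumptions (the slide lists only start addresses; the first two lengths are taken from the gaps between them and the last is arbitrary).

```python
from dataclasses import dataclass

READ, WRITE = "read", "write"


@dataclass(frozen=True)
class Range:
    start: int          # first byte covered by this entry
    length: int         # bytes covered (assumed, not given on the slide)
    rights: frozenset   # access kinds granted to the isolated code


class AccessFault(Exception):
    """Raised when isolated code touches memory outside its granted rights."""


class RangeTable:
    def __init__(self, ranges):
        self.ranges = list(ranges)

    def check(self, addr, size, access):
        """Return normally only if [addr, addr + size) fits in one entry that grants `access`."""
        for r in self.ranges:
            inside = r.start <= addr and addr + size <= r.start + r.length
            if inside and access in r.rights:
                return
        raise AccessFault(f"{access} of {size} bytes at {addr:#x} denied")


# Entries from the slide; lengths are illustrative assumptions.
table = RangeTable([
    Range(0xffffa000, 8, frozenset({READ})),
    Range(0xffffa008, 2, frozenset({WRITE})),
    Range(0xffffa00a, 4, frozenset({READ})),
])
```

In this model a stray write to 0xffffa000 raises `AccessFault`, which is the point at which a recovery mechanism would abort the entry point instead of letting the kernel crash.
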
Outline
★ Introduction
★ Fine-grained isolation
★ Checkpoint-based recovery
★ Evaluation and Conclusions

Checkpointing drivers is hard
[Figure: a checkpoint taken of a network driver that controls a network card.]
★ Easy to capture memory state
★ Device state is not captured
  ★ Device configuration space
  ★ Internal device registers and counters
  ★ Memory buffer addresses used for DMA
  ★ Unique for every device
Intuition: Operating systems already capture device state during power management

Intuition with power management
★ Refactor power management code for device checkpoints
  ★ Correct: Developer captures unique device semantics
  ★ Fast: Avoids probe; latency is critical for applications
★ Ask developers to export checkpoint/restore in their drivers

Device checkpoint/restore from PM code

| Step | Suspend | Resume |
|---|---|---|
| 1 | Save config state | Restore config state |
| 2 | Save register state | Restore register state |
| 3 | Disable device | Restore or reset DMA state |
| 4 | Save DMA state | Re-attach/Enable device |
| 5 | Suspend device | Device ready |

Removing the steps that power the device down or bring it back up leaves:

| Step | Checkpoint | Restore |
|---|---|---|
| 1 | Save config state | Restore config state |
| 2 | Save register state | Restore register state |
| 3 | Save DMA state | Restore or reset DMA state |

Suspend/resume code provides device checkpoint functionality

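The refactoring on this slide can be sketched as follows. This is a toy Python model rather than driver code: `ToyDevice`, `ToyDriver`, and every field name are hypothetical, and the device is just a set of dictionaries. It shows `checkpoint`/`restore` reusing the same save and restore steps as `suspend`/`resume` while leaving the device enabled and powered.

```python
import copy


class ToyDevice:
    """Stand-in for hardware state that a memory-only checkpoint would miss."""

    def __init__(self):
        self.config_space = {"irq_line": 11, "bar0": 0xfebc0000}
        self.registers = {"rx_ctrl": 1, "tx_ctrl": 1}
        self.dma_rings = {"rx": [0x1000, 0x2000], "tx": [0x3000]}
        self.enabled = True
        self.powered = True


class ToyDriver:
    FIELDS = ("config_space", "registers", "dma_rings")

    def __init__(self, dev):
        self.dev = dev
        self.saved = {}

    def _save(self, field):
        self.saved[field] = copy.deepcopy(getattr(self.dev, field))

    def _restore(self, field):
        setattr(self.dev, field, copy.deepcopy(self.saved[field]))

    # Power management path, in the order listed on the slide.
    def suspend(self):
        self._save("config_space")
        self._save("registers")
        self.dev.enabled = False      # disable device
        self._save("dma_rings")
        self.dev.powered = False      # suspend device

    def resume(self):
        self.dev.powered = True
        for field in self.FIELDS:     # config, registers, then DMA state
            self._restore(field)
        self.dev.enabled = True       # re-attach/enable; device ready

    # Checkpoint path: the same save/restore steps, device stays running.
    def checkpoint(self):
        for field in self.FIELDS:
            self._save(field)

    def restore(self):
        for field in self.FIELDS:
            self._restore(field)
```

Because `restore` copies out of the saved snapshot instead of handing it back, one checkpoint can be restored more than once in this model.
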
Synergy of isolation and fast checkpoints
[Figure labels: C (checkpoint), netdev, xmit, get_ringparam, network driver, SFI network driver, stubs, range table (0xffffa000 Read, 0xffffa008 Write, 0xffffa00a Read), err, R (restore).]
FGFT provides transactional execution of driver entry points

How does this give us transactional execution?
★ Atomicity: All-or-nothing execution
  ★ Driver state: Run code in SFI module
  ★ Device state: Explicitly checkpoint/restore state
★ Isolation: Serialization to hide incomplete transactions
  ★ Re-use existing device locks to lock driver
  ★ Two-phase locking
★ Consistency: Only valid (kernel, driver and device) states
  ★ Higher-level mechanisms to roll back external actions
  ★ At-most-once device action guarantee to applications

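The three properties above can be illustrated with a small Python model. It is a sketch under simplifying assumptions, not FGFT's mechanism: a deep copy of a dictionary stands in for the SFI copy of driver state, a single lock held for the whole call stands in for two-phase locking over existing device locks, and `Device`, `run_as_transaction`, and the sample entry points are hypothetical names.

```python
import copy
import threading


class Device:
    """Toy device whose state must be checkpointed explicitly."""

    def __init__(self):
        self.regs = {"ring_size": 256}

    def checkpoint(self):
        return copy.deepcopy(self.regs)

    def restore(self, snapshot):
        self.regs = copy.deepcopy(snapshot)


def run_as_transaction(entry_point, driver_state, device, lock):
    """Run entry_point(shadow_state, device); commit on success, roll back on failure."""
    with lock:                                  # isolation: hide partial updates
        snapshot = device.checkpoint()          # device state: explicit checkpoint
        shadow = copy.deepcopy(driver_state)    # driver state: work on a copy
        try:
            result = entry_point(shadow, device)
        except Exception as err:
            device.restore(snapshot)            # undo device-side effects
            return False, err                   # driver_state was never touched
        driver_state.clear()
        driver_state.update(shadow)             # commit the copy
        return True, result


def set_ring_ok(state, dev):
    state["ring"] = 512
    dev.regs["ring_size"] = 512
    return 512


def set_ring_buggy(state, dev):
    state["ring"] = -1
    dev.regs["ring_size"] = -1
    return 1 // 0                               # models a divide-by-zero fault
```

Calling `run_as_transaction(set_ring_buggy, state, dev, threading.Lock())` returns a failure flag and the exception, with both `state` and `dev.regs` left as they were before the call.
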
Outline
★ Introduction
★ Fine-grained isolation
★ Checkpoint-based recovery
★ Evaluation and Conclusions

Evaluation platform
★ Criteria:
  ★ Latency of recovery: How fast is it?
  ★ Correctness of recovery: How well does it work?
  ★ Incremental effort: How much work is it?
  ★ Performance: How much does it cost?
★ Platform:
  ★ Implemented in Linux 2.6.29
  ★ 2.5 GHz Intel Core 2 Quad core w/ 4 GB DDR2 DRAM
  ★ Six drivers across three classes

| Driver | Class | Bus |
|---|---|---|
| 8139too | net | PCI |
| e1000 | net | PCI |
| r8169 | net | PCI |
| pegasus | net | USB |
| ens1371 | sound | PCI |
| psmouse | input | serio |

Recovery speedup
[Figure: bar chart of recovery times (0 to 2,000 ms) comparing restart recovery with FGFT recovery for 8139too, e1000, pegasus, r8169, ens1371, and psmouse. Restart recovery labels, in descending order: 1800, 1030, 680, 310, 150, 120 ms. FGFT recovery labels, in descending order: 410, 295, 115, 5, 0.07, 0.04 ms.]
FGFT provides significant speedup in driver recovery and improves system availability

Static and dynamic fault injection

| Driver | Injected faults | Native crashes | FGFT crashes |
|---|---|---|---|
| 8139too | 43 | 43 | NONE |
| e1000 | 47 | 47 | NONE |
| r8169 | 36 | 36 | NONE |
| pegasus | 34 | 33 | NONE |
| ens1371 | 22 | 21 | NONE |
| psmouse | 46 | 46 | NONE |
| TOTAL | 258 | 256 | NONE |

FGFT recovers from multiple failures: 1) restores non-class state and 2) does not affect other threads

Programming effort

| Driver | LOC | Isolation: driver annotations | Isolation: kernel annotations | Recovery: LOC moved | Recovery: LOC added |
|---|---|---|---|---|---|
| 8139too | 1,904 | 15 | 20 | 26 | 4 |
| e1000 | 13,973 | 32 | | 32 | 10 |
| r8169 | 2,993 | 10 | | 17 | 5 |
| pegasus | 1,541 | 26 | 12 | 22 | 5 |
| ens1371 | 2,110 | 23 | 66 | 16 | 6 |
| psmouse | 2,448 | 11 | 19 | 19 | 6 |

FGFT requires a loadable kernel module (1200 LOC) and 38 lines of kernel changes to trap processor exceptions

Throughput with isolation and recovery
[Figure: bar chart of throughput as a percentage of the 844 Mbps baseline for the e1000 network card, measured with netperf on Intel quad-core machines. Series: Native, FGFT-I/O-all, FGFT-off-I/O, FGFT-I/O-1/2. Bars in build order: 100% (CPU 2.4%), 93% (CPU 2.4%), 100% (CPU 3.4%), 96% (CPU 2.9%).]
FGFT can isolate and recover high bandwidth devices at low overhead without adding kernel subsystems

Summary
★ FGFT runs driver code as transactions
  ★ Provides fault tolerance at incremental performance cost and programmer effort
★ Introduced device checkpoints
  ★ Provides fast and complete recovery semantics
★ Fast device checkpoints should be explored in other domains like fast reboot, upgrade, etc.

Questions
Asim Kadav
★ http://cs.wisc.edu/~kadav
★ kadav@cs.wisc.edu
★ Graduating in spring!

Extra slides
★ Unlike suspend, devices continue to be accessed after a checkpoint
★ Rely on drivers following ACPI specifications for correctness

Latency for device checkpoint/restore

| Driver | Class | Bus | Checkpoint time | Restore time |
|---|---|---|---|---|
| 8139too | net | PCI | 33 μs | 62 μs |
| e1000 | net | PCI | 280 ms | 32 μs |
| r8169 | net | PCI | 26 μs | 30 μs |
| pegasus | net | USB | 0 μs | 4 ms |
| ens1371 | sound | PCI | 111 ms | 33 μs |
| psmouse | input | serio | 0 μs | 390 ms |

Fast checkpoint/restore using suspend/resume

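To compare drivers on one scale, the checkpoint and restore figures in the table above can be converted to microseconds and added. The short Python sketch below does only that arithmetic; the `LATENCY_US` and `round_trip_us` names are hypothetical, and treating the sum as the cost of one checkpoint followed by one restore is an assumption about how the two numbers combine.

```python
# (checkpoint, restore) latencies from the slide, converted to microseconds.
LATENCY_US = {
    "8139too": (33, 62),
    "e1000": (280_000, 32),
    "r8169": (26, 30),
    "pegasus": (0, 4_000),
    "ens1371": (111_000, 33),
    "psmouse": (0, 390_000),
}


def round_trip_us(driver):
    """Checkpoint time plus restore time for one driver, in microseconds."""
    checkpoint, restore = LATENCY_US[driver]
    return checkpoint + restore


# Drivers ordered from cheapest to most expensive checkpoint + restore.
ranked = sorted(LATENCY_US, key=round_trip_us)
```

The ordering makes the spread visible: the PCI network drivers other than e1000 stay under 100 μs, while e1000, ens1371, and psmouse are dominated by a single slow checkpoint or restore step.
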
Transforming drivers to run as FGFT
[Figure: driver source, shown as: if (c == 0) { print("Driver init"); }. Figure labels: Driver with annotations, Static modifications.]