
Instant OS Updates via Userspace Checkpoint-and-Restart



  1. Instant OS Updates via Userspace Checkpoint-and-Restart. Sanidhya Kashyap, Changwoo Min, Byoungyoung Lee, Taesoo Kim, Pavel Emelyanov

  2. OS updates are prevalent

  3. And OS updates are unavoidable ● Prevent known, state-of-the-art attacks – Security patches ● Adopt new features – New I/O scheduler features ● Improve performance – Performance patches

  4. Unfortunately, system updates come at a cost ● Unavoidable downtime ● Potential risk of system failure

  5. Unfortunately, system updates come at a cost ● Unavoidable downtime ● Potential risk of system failure [Callouts: $109k per minute; hidden costs (losing customers)]

  6. Example: memcached ● Facebook's memcached servers incur a downtime of 2-3 hours per machine – Warming cache (e.g., 120 GB) over the network

  7. Example: memcached ● Facebook's memcached servers incur a downtime of 2-3 hours per machine – Warming cache (e.g., 120 GB) over the network ● Our approach updates the OS in 3 seconds for 32 GB of data, from v3.18 to v3.19, for Ubuntu / Fedora releases

  8. Existing practices for OS updates ● Dynamic Kernel Patching (e.g., kpatch, ksplice) – Problem: only support minor patches ● Rolling Update (e.g., Google, Facebook, etc) – Problem: inevitable downtime and requires careful planning

  9. Existing practices for OS updates ● Dynamic Kernel Patching (e.g., kpatch, ksplice) – Problem: only support minor patches ● Rolling Update (e.g., Google, Facebook, etc) – Problem: inevitable downtime and requires careful planning ● Losing application state is inevitable → Restoring memcached takes 2-3 hours

  10. Existing practices for OS updates ● Dynamic Kernel Patching (e.g., kpatch, ksplice) – Problem: only support minor patches ● Rolling Update (e.g., Google, Facebook, etc) – Problem: inevitable downtime and requires careful planning ● Losing application state is inevitable → Restoring memcached takes 2-3 hours ● Goals of this work: – Support all types of patches – Least downtime to update new OS – No kernel source modification

  11. Problems of typical OS update [Diagram: memcached running on the OS; step labeled: Stop service]

  12. Problems of typical OS update [Diagram: memcached running on the OS; steps labeled: Stop service, Soft reboot; the machine comes up on the new OS]

  13. Problems of typical OS update [Diagram: memcached running on the OS; steps labeled: Stop service, Soft reboot, Start service; memcached runs again on the new OS]

  14. Problems of typical OS update [Diagram: steps labeled: Stop service, Soft reboot, Start service; memcached runs again on the new OS. Annotation: 2-3 hours of downtime]

  15. Problems of typical OS update [Diagram: steps labeled: Stop service, Soft reboot, Start service; memcached runs again on the new OS. Annotations: 2-3 hours of downtime; 2-10 minutes of downtime]

  16. Problems of typical OS update [Diagram: steps labeled: Stop service, Soft reboot, Start service; memcached runs again on the new OS. Annotations: 2-3 hours of downtime; 2-10 minutes of downtime] ● Is it possible to keep the application state?

  17. KUP: Kernel update with application checkpoint-and-restore (C/R) ● OS updates lose application states [Diagram: memcached running on the OS; steps labeled: Stop service, Soft reboot, Start service; memcached runs again on the new OS]

  18. KUP: Kernel update with application checkpoint-and-restore (C/R) ● OS updates lose application states [Diagram: steps labeled: Stop service, Checkpoint, Soft reboot, Start service; memcached is checkpointed on the old OS and runs again on the new OS]

  19. KUP: Kernel update with application checkpoint-and-restore (C/R) ● OS updates lose application states [Diagram: steps labeled: Stop service, Checkpoint, In-kernel switch / Soft reboot, Start service; memcached runs again on the new OS]

  20. KUP: Kernel update with application checkpoint-and-restore (C/R) ● OS updates lose application states [Diagram: steps labeled: Stop service, Checkpoint, In-kernel switch / Soft reboot, Start service, Restore; memcached runs again on the new OS]

  21. KUP: Kernel update with application checkpoint-and-restore (C/R) ● OS updates lose application states [Diagram: KUP's life cycle; steps labeled: Stop service, Checkpoint, In-kernel switch, Start service, Restore]

  22. KUP: Kernel update with application checkpoint-and-restore (C/R) ● OS updates lose application states [Diagram: KUP's life cycle; steps labeled: Stop service, Checkpoint, In-kernel switch, Start service, Restore. Annotation: 1-10 minutes of downtime]

  23. KUP: Kernel update with application checkpoint-and-restore (C/R) ● OS updates lose application states [Diagram: KUP's life cycle; steps labeled: Stop service, Checkpoint, In-kernel switch, Start service, Restore. Annotation: 1-10 minutes of downtime] ● Challenge: how to further decrease the potential downtime?

  24. Techniques to decrease the downtime 1) Incremental checkpoint [Diagram: Checkpoint, In-kernel switch, Restore]

  25. Techniques to decrease the downtime 1) Incremental checkpoint 2) On-demand restore [Diagram: Checkpoint, In-kernel switch, Restore]

  26. Techniques to decrease the downtime 1) Incremental checkpoint 2) On-demand restore 3) FOAM: a snapshot abstraction [Diagram: Checkpoint, In-kernel switch, Restore]

  27. Techniques to decrease the downtime 1) Incremental checkpoint 2) On-demand restore 3) FOAM: a snapshot abstraction 4) PPP: reuse memory without an explicit dump [Diagram: Checkpoint, In-kernel switch, Restore]

  28. Techniques to decrease the downtime 1) Incremental checkpoint 2) On-demand restore 3) FOAM: a snapshot abstraction 4) PPP: reuse memory without an explicit dump [Diagram: Checkpoint, In-kernel switch, Restore]

  29. Incremental checkpoint ● Reduces downtime (up to 83.5%) ● Problem: Multiple snapshots increase the restore time [Timeline figure. Legend: S i = snapshot instance, → = checkpoint downtime. Naive checkpoint: S 1]

  30. Incremental checkpoint ● Reduces downtime (up to 83.5%) ● Problem: Multiple snapshots increase the restore time [Timeline figure. Legend: S i = snapshot instance, → = checkpoint downtime. Naive checkpoint: S 1. Incremental checkpoint: S 1]

  31. Incremental checkpoint ● Reduces downtime (up to 83.5%) ● Problem: Multiple snapshots increase the restore time [Timeline figure. Legend: S i = snapshot instance, → = checkpoint downtime. Naive checkpoint: S 1. Incremental checkpoint: S 1, S 2]

  32. Incremental checkpoint ● Reduces downtime (up to 83.5%) ● Problem: Multiple snapshots increase the restore time [Timeline figure. Legend: S i = snapshot instance, → = checkpoint downtime. Naive checkpoint: S 1. Incremental checkpoint: S 1, S 2, S 3]

  33. Incremental checkpoint ● Reduces downtime (up to 83.5%) ● Problem: Multiple snapshots increase the restore time [Timeline figure. Legend: S i = snapshot instance, → = checkpoint downtime. Naive checkpoint: S 1. Incremental checkpoint: S 1, S 2, S 3, S 4, with the downtime marked]
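
The incremental scheme on slides 29-33 can be sketched as a small simulation. This is an illustration under stated assumptions, not the authors' implementation: memory is modeled as a Python dict, the pages dirtied between snapshots are handed in by the caller instead of being tracked by the kernel, and the names `incremental_checkpoint` and `restore` are hypothetical.

```python
def incremental_checkpoint(memory, dirty_rounds):
    """Simulate an incremental checkpoint.

    memory:       dict mapping page number -> page contents (assumed model)
    dirty_rounds: list of dicts, each holding the pages the application
                  writes between two snapshots (assumed to be known)
    Returns the list of snapshots S1, S2, ...
    """
    snapshots = [dict(memory)]  # S1: full dump of every page
    for writes in dirty_rounds:
        memory.update(writes)  # the application dirties some pages
        # Each later snapshot stores only the pages dirtied since the last one.
        snapshots.append({page: memory[page] for page in writes})
    return snapshots


def restore(snapshots):
    """Rebuild memory by replaying snapshots from oldest to newest."""
    memory = {}
    for snapshot in snapshots:
        memory.update(snapshot)  # newer copies of a page overwrite older ones
    return memory


memory = {1: b"a", 2: b"b", 3: b"c", 4: b"d"}
snapshots = incremental_checkpoint(memory, [{2: b"B", 4: b"D"}])
restored = restore(snapshots)
```

In this model only the last snapshot would need to be taken while the service is stopped, so its size (2 pages here instead of 4) bounds the checkpoint downtime. The loop in `restore` also shows the cost the slide names: every additional snapshot is another pass at restore time.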

  34. On-demand restore ● Rebind the memory once the application accesses it – Only map the memory region with snapshot and restart the application ● Decreases the downtime (up to 99.6%) ● Problem: Incompatible with incremental checkpoint
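
One way to picture on-demand restore is a memory-mapped snapshot file: the restore step only sets up the mapping, and page contents are brought in when they are first touched. The sketch below uses Python's standard `mmap` module as a stand-in for that idea. The file layout (one padded page after another) and the helper names `write_snapshot` and `on_demand_restore` are assumptions for illustration, not the mechanism described in the talk.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def write_snapshot(path, pages):
    """Write each page to the snapshot file, padded to the page size."""
    with open(path, "wb") as f:
        for data in pages:
            f.write(data.ljust(PAGE, b"\0"))


def on_demand_restore(path):
    """Map the snapshot instead of reading it into memory up front."""
    with open(path, "rb") as f:
        # ACCESS_COPY gives a private copy-on-write mapping: writes stay in
        # memory and never reach the snapshot file.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)


snapshot_path = os.path.join(tempfile.mkdtemp(), "snapshot.img")
write_snapshot(snapshot_path, [b"page-1", b"page-2", b"page-3"])

restored = on_demand_restore(snapshot_path)  # sets up the mapping only
second_page = restored[PAGE:PAGE + 6]        # contents are fetched on access
restored[0:6] = b"PAGE-1"                    # private write, file is untouched
```

The restore call returns as soon as the mapping exists, which is why the application can restart before its whole working set is back in memory.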

  35. Problem: both techniques together result in inefficient application C/R ● During restore, need to map each page individually – Individual lookups to find the relevant pages – Individual page mapping to enable on-demand restore ● An application has 4 pages as its working set size ● Incremental checkpoint has 2 iterations – 1st iteration → all 4 pages (1, 2, 3, 4) are dumped – 2nd iteration → 2 pages (2, 4) are dirtied ● Increases the restoration downtime (42.5%) [Figure: snapshot S 1 holding pages 1, 2, 3, 4]

  36. Problem: both techniques together result in inefficient application C/R ● During restore, need to map each page individually – Individual lookups to find the relevant pages – Individual page mapping to enable on-demand restore ● An application has 4 pages as its working set size ● Incremental checkpoint has 2 iterations – 1st iteration → all 4 pages (1, 2, 3, 4) are dumped – 2nd iteration → 2 pages (2, 4) are dirtied ● Increases the restoration downtime (42.5%) [Figure: snapshots S 1 and S 2; pages 2 and 4 shown with S 2, pages 1 and 3 with S 1]
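
The 4-page example on slides 35-36 can be worked through in code. The sketch below models each snapshot as a set of page numbers and computes, for every page, which snapshot holds its latest copy. `plan_mappings` is a hypothetical helper that only counts the lookups and mappings; it performs no real memory mapping.

```python
def plan_mappings(snapshots, num_pages):
    """Return one (page, snapshot_number) mapping per page.

    snapshots: list of sets of page numbers, oldest first (assumed model)
    """
    plan = []
    for page in range(1, num_pages + 1):
        # Search newest to oldest: the latest copy of a page wins.
        for number in range(len(snapshots), 0, -1):
            if page in snapshots[number - 1]:
                plan.append((page, number))
                break
    return plan


s1 = {1, 2, 3, 4}  # 1st iteration: all 4 pages dumped
s2 = {2, 4}        # 2nd iteration: pages 2 and 4 dirtied
plan = plan_mappings([s1, s2], num_pages=4)
print(plan)  # [(1, 1), (2, 2), (3, 1), (4, 2)]
```

Neighbouring pages alternate between S 1 and S 2, so no two of them can share a mapping: the restore needs four separate lookups and four separate map operations for a four-page working set.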

  37. New abstraction: file-offset based address mapping (FOAM) ● Flat address space representation for the snapshot – One-to-one mapping between the address space and the snapshot – No explicit lookups for the pages across the snapshots – A few map operations to map the entire snapshot with address space ● Use sparse file representation – Rely on the concept of holes supported by modern file systems ● Simplifies incremental checkpoint and on-demand restore
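
A rough sketch of the file-offset idea, assuming a single snapshot file in which every page lives at the offset equal to its address minus a base address. Incremental dumps overwrite pages in place, and restore is one map operation over the whole file. `BASE`, `foam_dump`, `foam_restore`, and `addr` are hypothetical names, and Python's `mmap` stands in for the real mapping machinery.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
BASE = 16 * PAGE  # hypothetical start address of the checkpointed region


def foam_dump(path, dirty_pages):
    """Write each dirty page at file offset (address - BASE)."""
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        for address, data in dirty_pages.items():
            f.seek(address - BASE)  # the offset is derived from the address
            f.write(data.ljust(PAGE, b"\0"))


def foam_restore(path):
    """Map the whole snapshot with a single private mapping."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)


def addr(page):
    """Address of page 1..n inside the hypothetical region."""
    return BASE + (page - 1) * PAGE


foam_path = os.path.join(tempfile.mkdtemp(), "foam.img")
foam_dump(foam_path, {addr(p): b"v1-page-%d" % p for p in (1, 2, 3, 4)})
foam_dump(foam_path, {addr(p): b"v2-page-%d" % p for p in (2, 4)})

image = foam_restore(foam_path)
contents = [image[i * PAGE:i * PAGE + 9] for i in range(4)]
```

Because the second dump lands on the same offsets as the first, the file always holds the latest copy of each page and no per-page lookup across snapshots is needed. Address ranges that are never dumped would simply never be written, which is where holes in a sparse file come in on file systems that support them.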

  38. Techniques to decrease the downtime 1) Incremental checkpoint 2) On-demand restore 3) FOAM: a snapshot abstraction 4) PPP: reuse memory without an explicit dump [Diagram: Checkpoint, In-kernel switch, Restore]

  39. Redundant data copy ● Application C/R copies data back and forth ● Not a good fit for applications with huge memory [Diagram: memcached on the OS with pages 1 2 3 4 in RAM; timeline: Running, Checkpoint, In-kernel switch, Restore, Running; current step: Running]

  40. Redundant data copy ● Application C/R copies data back and forth ● Not a good fit for applications with huge memory [Diagram: memcached on the OS; snapshot S 1 holding pages 1 2 3 4; timeline: Running, Checkpoint, In-kernel switch, Restore, Running; current step: Checkpoint]

  41. Redundant data copy ● Application C/R copies data back and forth ● Not a good fit for applications with huge memory [Diagram: memcached on the OS and on the new OS; snapshot S 1 holding pages 1 2 3 4; timeline: Running, Checkpoint, In-kernel switch, Restore, Running; current step: In-kernel switch]

  42. Redundant data copy ● Application C/R copies data back and forth ● Not a good fit for applications with huge memory [Diagram: memcached on the new OS with pages 1 2 3 4 in RAM; snapshot S 1 holding pages 1 2 3 4; timeline: Running, Checkpoint, In-kernel switch, Restore, Running; current step: Restore]
