
How to Keep Critical Applications Up and Running 24x7, Linda Wang


  1. “How to Keep Critical Applications Up and Running 24x7”
     Linda Wang, Red Hat, Inc.
     October 6, 2016
     LinuxConf Europe 2016 - How to keep application up 24x7

  2. Background
     ● The computer industry has been evolving
     ● Decades of improvement
     ● Various operating systems have claimed to achieve zero downtime for their users through various individual mechanisms:
       ● System monitoring
       ● Predictive Self Healing
     ● Without in-depth analysis of the fundamental causes of downtime, do these features really help?

  3. Today
     ● Open source community
     ● Ease of access to source
     ● Linux: a lot of research and development in research institutes
     ● Opens doors and paths to different approaches and allows experimentation
     ● Advanced kernel development

  4. How to Achieve 24x7 Uptime
     ● Analyze the reasons behind downtime
     ● Planned vs. unplanned
     ● With unplanned downtime, we want to proactively avoid it
     ● Predictable vs. unpredictable

  5. How to Achieve 24x7 Uptime
     ● Reasons behind downtime
     ● Two types of downtime: unplanned vs. planned
     ● Unplanned: predictable, unpredictable

     Columns: Unpredictable/Unplanned | Predictable/Planned | Proactive Planning
     Rows: Application Crash | Operating System Panic | Hardware Failure

  6. 24x7 Uptime
     ● Reasons behind downtime
     ● Two types of downtime: unplanned vs. planned
     ● Unplanned: predictable, unpredictable

     Columns: Unpredictable/Unplanned | Predictable/Planned | Proactive Planning
     Application Crash:
       Unpredictable/Unplanned: Diag. (gdb); Auto restart (systemd unit file)
     Operating System Panic:
       Unpredictable/Unplanned: Diagnostic tool (kdump/crash); Auto restart (NMI timeout)
     Hardware Failure:
       Unpredictable/Unplanned: Error detection (HERM)

  7. 24x7 Uptime
     ● Reasons behind downtime
     ● Two types of downtime: unplanned vs. planned
     ● Unplanned: predictable, unpredictable

     Columns: Unpredictable/Unplanned | Predictable/Planned | Proactive Planning
     Application Crash:
       Unpredictable/Unplanned: Diag. (gdb); Auto restart (systemd unit file)
       Predictable/Planned: Security updates
     Operating System Panic:
       Unpredictable/Unplanned: Diagnostic tool (kdump/crash); Auto restart (NMI timeout)
     Hardware Failure:
       Unpredictable/Unplanned: Error detection (HERM)

  8. 24x7 Uptime
     ● Reasons behind downtime
     ● Two types of downtime: unplanned vs. planned
     ● Unplanned: predictable, unpredictable

     Columns: Unpredictable/Unplanned | Predictable/Planned | Proactive Planning
     Application Crash:
       Unpredictable/Unplanned: Diag. (gdb); Auto restart (systemd unit file)
       Predictable/Planned: Security updates
     Operating System Panic:
       Unpredictable/Unplanned: Diagnostic tool (kdump/crash); Auto restart (NMI timeout)
       Predictable/Planned: Kernel security, bugfix updates
     Hardware Failure:
       Unpredictable/Unplanned: Error detection (HERM)

  9. 24x7 Uptime
     ● Reasons behind downtime
     ● Two types of downtime: unplanned vs. planned
     ● Unplanned: predictable, unpredictable

     Columns: Unpredictable/Unplanned | Predictable/Planned | Proactive Planning
     Application Crash:
       Unpredictable/Unplanned: Diag. (gdb); Auto restart (systemd unit file)
       Predictable/Planned: Security updates
     Operating System Panic:
       Unpredictable/Unplanned: Diagnostic tool (kdump/crash); Auto restart (NMI timeout)
       Predictable/Planned: Kernel security, bugfix updates
     Hardware Failure:
       Unpredictable/Unplanned: Error detection (HERM)
       Predictable/Planned: Hardware replacement

  10. 24x7 Uptime
      ● Reasons behind downtime
      ● Two types of downtime: unplanned vs. planned
      ● Unplanned: predictable, unpredictable

      Columns: Unpredictable/Unplanned | Predictable/Planned | Proactive Planning
      Application Crash:
        Unpredictable/Unplanned: Diag. (gdb); Auto restart (systemd unit file)
        Predictable/Planned: Security updates
        Proactive Planning: Live patching security fixes (systemtap)
      Operating System Panic:
        Unpredictable/Unplanned: Diagnostic tool (kdump/crash); Auto restart (NMI timeout)
        Predictable/Planned: Kernel security, bugfix updates
      Hardware Failure:
        Unpredictable/Unplanned: Error detection (HERM)
        Predictable/Planned: Hardware replacement

  11. 24x7 Uptime
      ● Reasons behind downtime
      ● Two types of downtime: unplanned vs. planned
      ● Unplanned: predictable, unpredictable

      Columns: Unpredictable/Unplanned | Predictable/Planned | Proactive Planning
      Application Crash:
        Unpredictable/Unplanned: Diag. (gdb); Auto restart (systemd unit file)
        Predictable/Planned: Security updates
        Proactive Planning: Live patching security fixes (systemtap)
      Operating System Panic:
        Unpredictable/Unplanned: Diagnostic tool (kdump/crash); Auto restart (NMI timeout)
        Predictable/Planned: Kernel security, bugfix updates
        Proactive Planning: Live patching known kernel issues (kpatch)
      Hardware Failure:
        Unpredictable/Unplanned: Error detection (HERM)
        Predictable/Planned: Hardware replacement

  12. 24x7 Uptime
      ● Reasons behind downtime
      ● Two types of downtime: unplanned vs. planned
      ● Unplanned: predictable, unpredictable

      Columns: Unplanned Downtime | Planned Downtime | Proactive Planning
      Application Crash:
        Unplanned Downtime: Diag. (gdb); Auto restart (systemd unit file)
        Planned Downtime: Security updates
        Proactive Planning: Live patching security fixes (systemtap)
      Operating System Panic:
        Unplanned Downtime: Diagnostic tool (kdump/crash); Auto restart (NMI timeout)
        Planned Downtime: Kernel security, bugfix updates
        Proactive Planning: Live patching known kernel issues (kpatch)
      Hardware Failure:
        Unplanned Downtime: Error detection (HERM)
        Planned Downtime: Hardware replacement
        Proactive Planning: Checkpoint/Restore (criu)
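
The slides list automatic restart through systemd as one response to an application crash, but they do not show a unit file. The following is a minimal sketch of what such a unit might contain, generated from Python so it can be checked. The description text, the binary path `/usr/local/bin/critical-app`, and the two-second delay are hypothetical and not taken from the talk; `Restart=on-failure` and `RestartSec=` are standard systemd service directives.

```python
def render_unit(description: str, exec_start: str, restart_sec: int = 2) -> str:
    """Return the text of a systemd service unit that restarts on failure.

    All argument values are illustrative; the talk does not specify them.
    """
    lines = [
        "[Unit]",
        f"Description={description}",
        "",
        "[Service]",
        f"ExecStart={exec_start}",
        # Restart the service when it exits with an error or is killed by a signal.
        "Restart=on-failure",
        # Wait this many seconds before each restart attempt.
        f"RestartSec={restart_sec}",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical application path, for illustration only.
    print(render_unit("Critical application (example)",
                      "/usr/local/bin/critical-app"))
```

Saving the printed text as a `.service` file under `/etc/systemd/system/` and enabling it would be the usual next step. Automatic restart shortens an outage after a crash; it does not prevent the crash, which is why the table pairs it with a diagnostic tool.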

  13. Prepare for Downtime Scenarios
      ● Preventive measures
      ● For security fixes and known issues, to avoid crashes:
        ● Live patches, for both kernel and userspace
      ● To avoid downtime due to hardware failure or regular maintenance:
        ● Containerize critical applications, and use live migration to move them to alternative systems while the original systems undergo maintenance
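
The migration idea on this slide rests on checkpoint/restore, which the talk associates with criu. As a rough sketch only, the Python below builds the two `criu` command lines for checkpointing a process tree and restoring it from the saved images. It is a process-level illustration, not the container workflow from the demo, which the slides do not describe. The PID, the image directory, and the helper names are hypothetical, and copying the image directory to the target host between the two steps is assumed to happen out of band.

```python
from pathlib import Path
from typing import List


def criu_dump_cmd(pid: int, images_dir: Path,
                  shell_job: bool = True,
                  leave_running: bool = False) -> List[str]:
    """Build a `criu dump` command line for the process tree rooted at pid."""
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", str(images_dir)]
    if shell_job:
        # Needed when the target process was started from an interactive shell.
        cmd.append("--shell-job")
    if leave_running:
        # Keep the original process alive after the checkpoint is taken.
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(images_dir: Path, shell_job: bool = True) -> List[str]:
    """Build a `criu restore` command line that reads images from images_dir."""
    cmd = ["criu", "restore", "--images-dir", str(images_dir)]
    if shell_job:
        cmd.append("--shell-job")
    return cmd


if __name__ == "__main__":
    images = Path("/tmp/checkpoint")  # hypothetical image directory
    # On the source system (PID 1234 is a placeholder):
    print(" ".join(criu_dump_cmd(1234, images)))
    # On the target system, after the image directory has been copied over:
    print(" ".join(criu_restore_cmd(images)))
```

The commands are only printed, not executed, because running `criu` needs root privileges and a kernel with checkpoint/restore support. Passing the lists to `subprocess.run` would be one way to execute them on a prepared system.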

  14. Kernel Live Patching Enhancements
      ● Demo

  15. User Space Live Patching
      ● Demo

  16. Container Migration
      ● Demo

  17. For more information...
      ● Kernel Live Patching:
        ■ http://rhelblog.redhat.com/?s=live+patching
        ■ questions: kpatch@redhat.com
      ● Checkpoint Restore/Live Migration:
        ■ http://rhelblog.redhat.com/?s=criu
        ■ questions: criu@redhat.com

  18. Thank you!
