Using Application-Driven Checkpointing for Hot Spare High Availability
Antti Kantee
Cubical Solutions Ltd.

The Target: 2n hotspare
Antti Kantee <pooka@cubical.fi>, 2004

- imagine some mission-critical service
- fact: all hardware will break some day
- for each server, install a spare server
- if something bad happens to the server, the spare will take over
  - machine crash
  - service crash
- service state will be migrated
  - not migrating service state: cold spare, easy ...
- adding support should not cripple the service

The Presentation

- problem
- solution
- boosting performance
- implementation
- adaption
- conclusions
- standing ovation (or the more likely rotten tomato scene)

The Problem

- how to preserve state?
- classic approach: checkpoint
  - usually below process level ==> transparent to the process
- problems in implementations of the classic approach:
  - rarely apply to networked services
  - checkpointing will take very long
  - checkpoint will be huge
  - feature support limited
    - no external communication allowed (fd's ...)
    - thread support usually nonexistent

The Solution

- Application-Driven Checkpointing
- instead of checkpointing being transparent to the process, do the opposite: leave checkpointing entirely up to the application
- good:
  - checkpoint exactly the right data
  - checkpoint at exactly the right time
  - possible to get extended feature support
- bad:
  - need to modify each application separately

What is process state?

- in other words: what do we want to capture
- memory:
  - for a C program, this is pretty much WYGIWYG
  - loads and stores are directly mapped to memory
  - might be more difficult for actual programming languages
- "other stuff":
  - file descriptors / sockets
  - threads
  - you name it ...

Application-Checkpointing: Naive Approach

- simply write out the pieces in the previous two sets (memory and "other stuff")
  - the application decides what gets stored, instead of deciding what does not get stored
- need to figure out some serialization form for the information
  - for memory this is pretty easy: (addr, len, content)
  - for "other stuff" equally easy, just more laborious
- we could just write out everything in the process context when checkpoint() is called
  - but that doesn't perform especially well

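The (addr, len, content) form for memory can be sketched as follows. This is only an illustration, not the framework's actual on-disk format: `struct region_hdr`, `region_write()` and `region_read()` are hypothetical names, and restoring to the recorded address assumes the restored process has the same memory layout as the original.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record header: one per checkpointed memory region. */
struct region_hdr {
    uint64_t addr;  /* address of the region in the checkpointed process */
    uint64_t len;   /* number of content bytes that follow the header */
};

/* Write one (addr, len, content) record. Returns 0 on success, -1 on error. */
int region_write(FILE *out, const void *addr, size_t len)
{
    struct region_hdr hdr = { (uint64_t)(uintptr_t)addr, (uint64_t)len };

    assert(out != NULL);
    if (fwrite(&hdr, sizeof(hdr), 1, out) != 1)
        return -1;
    if (len > 0 && fwrite(addr, 1, len, out) != len)
        return -1;
    return 0;
}

/*
 * Read one record and copy its content back to the recorded address.
 * Assumption: that address is valid and writable in the restoring process.
 */
int region_read(FILE *in)
{
    struct region_hdr hdr;
    void *dst;

    assert(in != NULL);
    if (fread(&hdr, sizeof(hdr), 1, in) != 1)
        return -1;
    dst = (void *)(uintptr_t)hdr.addr;
    if (hdr.len > 0 && fread(dst, 1, (size_t)hdr.len, in) != (size_t)hdr.len)
        return -1;
    return 0;
}
```

A caller would invoke `region_write()` once per region it cares about, and the restoring side would loop over `region_read()` until the stream is exhausted.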
Boosting Performance

- two common & cheap solutions
- asynchronous
  - do not checkpoint in process context
  - while the actual cost is still there, the application hopefully does not take such a heavy penalty
- incremental
  - write out deltas only
  - the more you checkpoint, the more you save
  - hmm, where have I heard that before?

Asynchronous checkpointing

- many employ fork()
  - get new execution context
  - memory "protected" by copy-on-write
- ok, that was easy

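A minimal sketch of the fork() approach, using only standard POSIX calls. The function names `checkpoint_async()` and `checkpoint_wait()` are made up for this example and are not part of the framework described in the slides. The child sees memory as it was at the moment of fork(), so the parent may keep modifying its state while the snapshot is being written.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Write len bytes at buf to path from a forked child.
 * Returns the child's pid to the caller, or -1 if fork() failed.
 */
pid_t checkpoint_async(const char *path, const void *buf, size_t len)
{
    pid_t pid;
    FILE *out;
    int ok;

    assert(path != NULL);
    pid = fork();
    if (pid != 0)
        return pid;  /* parent (pid > 0) or fork failure (-1) */

    /* Child: works on a copy-on-write snapshot taken at fork() time. */
    out = fopen(path, "wb");
    ok = out != NULL && fwrite(buf, 1, len, out) == len;
    if (out != NULL && fclose(out) != 0)
        ok = 0;
    _exit(ok ? 0 : 1);  /* _exit: do not flush the parent's stdio buffers */
}

/* Wait for a checkpoint child. Returns 0 if the snapshot was written fully. */
int checkpoint_wait(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The cost of writing the data is still paid, but by the child. The parent only pays for the fork() itself and for any copy-on-write faults it triggers afterwards.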
Incremental checkpointing

- many employ mprotect() and signal handlers
  - userspace solution
- MMU already tracks modification information
  - used by pagedaemon
  - wire pages, and pagedaemon no longer needs that info
- asking MMU perhaps not the best option, but it was easy to implement ;-)
  - some archs have soft "dirty" bit, not in MMU

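For comparison, the common userspace technique (mprotect() plus a signal handler) might look roughly like the sketch below. This is not the MMU-based kernel approach the slides go on to describe. All `track_*` names are invented for the example, the page table is a fixed-size array, and calling mprotect() from a signal handler is not guaranteed async-signal-safe by POSIX, although it works on common systems.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANON and sigaction under strict -std modes on glibc */
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define TRACK_MAX_PAGES 64

static char *track_base;      /* start of the tracked area */
static size_t track_npages;   /* number of pages in the tracked area */
static size_t track_pagesize;
static volatile sig_atomic_t track_dirty[TRACK_MAX_PAGES];

/* The first write to a protected page lands here: mark dirty, unprotect. */
static void track_fault(int sig, siginfo_t *si, void *ctx)
{
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)track_base;
    size_t page;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + track_npages * track_pagesize)
        _exit(1);  /* a genuine fault outside the tracked area */
    page = (addr - base) / track_pagesize;
    track_dirty[page] = 1;
    mprotect(track_base + page * track_pagesize, track_pagesize,
             PROT_READ | PROT_WRITE);
}

/* Allocate npages of tracked memory. Returns NULL on failure. */
void *track_alloc(size_t npages)
{
    struct sigaction sa;
    void *p;

    if (npages == 0 || npages > TRACK_MAX_PAGES)
        return NULL;
    track_pagesize = (size_t)sysconf(_SC_PAGESIZE);
    p = mmap(NULL, npages * track_pagesize, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    track_base = p;
    track_npages = npages;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = track_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    /* Some systems report protection faults as SIGBUS instead of SIGSEGV. */
    if (sigaction(SIGSEGV, &sa, NULL) != 0 || sigaction(SIGBUS, &sa, NULL) != 0)
        return NULL;
    return p;
}

/* Start a new interval: clear all dirty flags and write-protect the area. */
int track_reset(void)
{
    size_t i;

    assert(track_base != NULL);
    for (i = 0; i < track_npages; i++)
        track_dirty[i] = 0;
    return mprotect(track_base, track_npages * track_pagesize, PROT_READ);
}

/* Nonzero if the page was written since the last track_reset(). */
int track_is_dirty(size_t page)
{
    return page < track_npages && track_dirty[page];
}
```

After each checkpoint the application would write out only the pages for which `track_is_dirty()` is nonzero and then call `track_reset()` again. Every first write to a page costs a fault and a syscall, which is the overhead that reading the modification bits already kept by the MMU avoids.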
Pulling memory checkpointing together

- two new syscalls: cptctl() and cptfork()
- cptctl:
  - add/remove checkpoint areas monitored for deltas
  - query changes
- cptfork:
  - mostly same as fork()
  - check for modified pages

Additional State

- we cannot take file descriptors, sockets, signals, threads etc. from a memory dump
  - kernel state, including lots of structure linkage, so transfer as opaque data not possible
- use a syscall augmentation-style approach:
  - for most entities, it is possible to query the current state from the kernel
  - when restoring, use normal syscalls to "trick" the kernel
  - so basically handle this entirely in userspace
- unfortunately TCP is not supported :(

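As an illustration of the query-then-replay idea for one kind of entity, the sketch below captures and restores a descriptor for a regular file. `struct fd_state`, `fd_capture()` and `fd_restore()` are hypothetical and not the framework's interface. The path is assumed to be supplied by the application, since there is no portable way to ask the kernel for it.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical serialized form of one regular-file descriptor. */
struct fd_state {
    int   fd;         /* descriptor number the application expects */
    int   flags;      /* access mode and status flags from F_GETFL */
    off_t offset;     /* current file offset */
    char  path[256];  /* supplied by the application, not by the kernel */
};

/* Query the kernel for the descriptor's current state. Returns 0 or -1. */
int fd_capture(int fd, const char *path, struct fd_state *st)
{
    assert(path != NULL && st != NULL);
    if (strlen(path) >= sizeof(st->path))
        return -1;
    st->fd = fd;
    st->flags = fcntl(fd, F_GETFL);
    st->offset = lseek(fd, 0, SEEK_CUR);
    if (st->flags == -1 || st->offset == (off_t)-1)
        return -1;
    strcpy(st->path, path);
    return 0;
}

/* Rebuild the descriptor with ordinary syscalls. Returns 0 or -1. */
int fd_restore(const struct fd_state *st)
{
    int fd, r;

    assert(st != NULL);
    /* Reopen with the recorded flags, never creating or truncating. */
    fd = open(st->path, st->flags & ~(O_CREAT | O_EXCL | O_TRUNC));
    if (fd == -1)
        return -1;
    if (lseek(fd, st->offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    if (fd != st->fd) {
        /* Move the new descriptor to the number the application used. */
        r = dup2(fd, st->fd);
        close(fd);
        if (r == -1)
            return -1;
    }
    return 0;
}
```

The same pattern does not carry over to a connected TCP socket, because the connection state cannot be recreated with ordinary syscalls on the spare.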
Dealing with Multithreading

- do not record program counter, register values, etc.
- treat a thread like any other "additional state"
  - record "worker function" address and argument only
  - for each registered thread, at restore a thread is created and the worker function is called
- problem: locking

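The worker-function scheme could be sketched with POSIX threads as below. The table layout and the names `thread_register()`, `threads_start()` and `threads_join()` are assumptions made for this example and are not the framework's hsthreadreg() interface. Recording a function address also assumes the restored process runs the same binary at the same load address.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define THREADS_MAX 16

/* All that is recorded per thread: no program counter, no registers. */
struct thread_rec {
    void *(*worker)(void *);  /* worker function address */
    void *arg;                /* its single argument */
};

static struct thread_rec thread_tab[THREADS_MAX];
static pthread_t thread_ids[THREADS_MAX];
static size_t thread_count;

/* Record a thread's worker and argument. Returns 0, or -1 if the table is full. */
int thread_register(void *(*worker)(void *), void *arg)
{
    assert(worker != NULL);
    if (thread_count == THREADS_MAX)
        return -1;
    thread_tab[thread_count].worker = worker;
    thread_tab[thread_count].arg = arg;
    thread_count++;
    return 0;
}

/* Create one thread per record; used at first start and again at restore. */
int threads_start(void)
{
    size_t i;

    for (i = 0; i < thread_count; i++) {
        if (pthread_create(&thread_ids[i], NULL,
                           thread_tab[i].worker, thread_tab[i].arg) != 0)
            return -1;
    }
    return 0;
}

/* Wait for all threads created by the last threads_start(). */
int threads_join(void)
{
    size_t i;

    for (i = 0; i < thread_count; i++) {
        if (pthread_join(thread_ids[i], NULL) != 0)
            return -1;
    }
    return 0;
}

/* Example worker: increments the int its argument points to. */
void *count_worker(void *arg)
{
    *(int *)arg += 1;
    return NULL;
}
```

Because a restored thread starts again at the top of its worker function, the worker has to work out where to continue from checkpointed memory. Any lock that was held when the checkpoint was taken is not re-acquired for it, which is the locking problem noted above.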
Additional Support

- define spare machine(s)
- move snapshots of runtime state to spare machines
  - TCP/IP, IP/carrier pigeon, whatever suits you
- detect failures
  - leave that up to the application to define ;-)
  - provide a simple "ping" approach in the framework
- direct network traffic to "spare" after master has crashed and process has been rebuilt

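The slides do not spell out how the "ping" approach decides that the master is gone. One plausible reading, sketched here purely as an assumption, is that the spare declares failure after a configurable number of ping intervals pass without a reply. The `ping_*` names and the missed-ping threshold are invented for this example, and the network transport is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical failure detector kept on the spare. */
struct ping_monitor {
    time_t last_reply;  /* when the master last answered a ping */
    double interval;    /* seconds between pings */
    int    max_missed;  /* missed pings tolerated before takeover */
};

/* Start monitoring; the master is considered alive at time now. */
void ping_init(struct ping_monitor *m, time_t now, double interval,
               int max_missed)
{
    assert(m != NULL && interval > 0 && max_missed > 0);
    m->last_reply = now;
    m->interval = interval;
    m->max_missed = max_missed;
}

/* Call whenever a ping reply from the master arrives. */
void ping_reply(struct ping_monitor *m, time_t now)
{
    m->last_reply = now;
}

/* True once more than max_missed intervals have passed without a reply. */
bool ping_master_dead(const struct ping_monitor *m, time_t now)
{
    return difftime(now, m->last_reply) > m->interval * m->max_missed;
}
```

The spare would call `ping_master_dead()` periodically and, once it returns true, rebuild the process from the latest snapshot and take over the traffic. An application with better knowledge of its own failure modes can replace this check with its own.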
Application Interface to Framework

- Philosophy: everything that can be supported application-transparently should be, but it should not prevent any tricks the application might want to pull
- generally what needs to be done:
  - reserve checkpoint memory with hsmalloc()
    - group essential memory into e.g. structs
  - register some additional info: hsfdreg(), hsthreadreg()
  - sprinkle checkpoints into appropriate places: hscpt()
- restore handled in framework also

Adapting

- kernel portion should in theory be adaptable to other systems
  - Linux & FreeBSD & Chorus investigated
- userspace library should be portable code as-is
- adapting an application is an interesting question
  - most UNIX programs are stateless
  - state tied to TCP
  - persistent state dealt with by application-specific methods
  - tetris was easy to adapt
  - sqlite almost equally easy

Conclusions

- transparent checkpointing has problems
- application-driven checkpointing ties application semantics to the task of checkpointing
  - knowledge can be used in optimizing checkpoint time & place
  - kernel support provides additional boost
- state annoyingly tied to TCP
  - but at least with application-driven checkpointing we have a chance to deal with it
- adaption effort depends greatly on the application