Checkpointing and Rollback-Recovery

Next: Performance Study Up: Protocol Description Previous: Causal Logging

Checkpointing and Rollback-Recovery

Each process in the system periodically takes a checkpoint to reduce the amount of recomputation in case of a system failure. A checkpoint includes the intermediate state of the process, the current vector clock, the current content of the diff structure and other information required for memory consistency. Checkpointing activities among the related processes need not be performed in a coordinated way, however, if checkpointing is incorporated into the barrier operation or garbage collection, the overhead of checkpointing can be reduced.

For a process p_i to be recovered from a failure, a recovery process, say p_i', is first created and p_i' broadcasts the recovery message to all the other processes in the system. On the receipt of the recovery message, each process p_j first determines whether it is a dependent of p_i or not. If so, it replies with its write notice structure, which includes p_i's intervals and their preceding intervals. When p_i' collects the reply message from every process, it eliminates the duplicates of interval records and reconstructs its own write notice structure. The recovery process p_i' then restores the latest checkpoint of p_i and from the reconstructed state, p_i begins the recomputation as follows:

At the lock acquire time: p_i calculates the vector clock from the information saved in the write notices and sets its current vector clock as the calculated one. p_i also searches the write notice structure with the new vector clock, to select the write notices with which the new invalidation has to be performed.

At the data page access miss: p_i searches the write notice structure and sends the diff requests to the selected writer of the pages as before. Each writer process then replies with the diffs of the data page it has produced before the given vector clock. p_i arranges the received diffs in the timing order and applies them on the data page to create an up-to-date version, as performed during the normal execution. If the operation performed is the write, p_i creates a diff with its current vector clock.

At the lock release time or at the barrier: If p_i has performed any new write operation in this interval, it increments the i-th entry of its vector clock by one.

Since for each synchronization operation, the process can retrieve the vector clock same as the one created before the failure, it can retrieve the same diffs from other processes and creates the same data page for its read and write operations.

Theorem 2: The rollback-recovery under the proposed protocol is consistent.

Proof Sketch: If for every read/write event e_a, an event e_b dependent on e_a exists, Log(e_a) can be retrieved after a failure (by Theorem 1). As a result, the rollback-recovery of a process must be consistent.

Next: Performance Study Up: Protocol Description Previous: Causal Logging

Park TaeSoon
1999-11-30