Each process in the system periodically takes a checkpoint to reduce the amount
of recomputation in case of a system failure.
A checkpoint includes the intermediate state of the process, the current vector
clock, the current content of the diff structure and other information
required for memory consistency.
Checkpointing activities among the related processes need not be performed
in a coordinated way, however, if checkpointing is incorporated into the barrier
operation or garbage collection, the overhead of checkpointing can be reduced.
For a process pi to be recovered from a failure, a recovery process, say
pi', is first created and pi' broadcasts the recovery message to
all the other processes in the system.
On the receipt of the recovery message, each process pj first determines
whether it is a dependent of pi or not.
If so, it replies with its write notice structure, which includes pi's
intervals and their preceding intervals.
When pi' collects the reply message from every process, it eliminates
the duplicates of interval records and reconstructs its own write notice
structure.
The recovery process pi' then restores the latest checkpoint of pi and
from the reconstructed state, pi begins the recomputation as follows:
At the lock acquire time: pi calculates the vector clock from the
information saved in the write notices and sets its current vector
clock as the calculated one.
pi also searches the write notice structure with the new vector clock,
to select the write notices with which the new invalidation has to be performed.
At the data page access miss: pi searches the write notice structure
and sends the diff requests to the selected writer of the pages as before.
Each writer process then replies with the diffs of the data page
it has produced before the given vector clock.
pi arranges the received diffs in the timing order and applies them
on the data page to create an up-to-date version, as performed during
the normal execution.
If the operation performed is the write, pi creates a diff with its
current vector clock.
At the lock release time or at the barrier: If pi has performed
any new write operation in this interval, it increments the i-th
entry of its vector clock by one.
Since for each synchronization operation, the process can retrieve the vector
clock same as the one created before the failure, it can retrieve the same
diffs from other processes and creates the same data page for its read
and write operations.
Theorem 2: The rollback-recovery under the proposed protocol is
consistent.
Proof Sketch: If for every read/write event ea, an event eb dependent on ea exists, Log(ea) can be retrieved after a failure (by Theorem 1). As a result, the rollback-recovery of a process must be consistent.