Distributed shared memory(DSM) systems  provide a simple mean of programming for the networks of workstations, which are gaining popularity due to their cost-effective high computing power. However, as the number of workstations participating in a DSM system increases, the probability of failure also increases, which could render the system useless for long-running applications. For DSM systems to be of any practical use, it is important for the system to be recoverable so that the processes do not have to restart from the beginning when a failure occurs. One solution to provide fault-tolerance for DSM systems is to use message logging in addition to independent checkpointing , and many message logging schemes have been suggested.
In the reader-based logging schemes [8,11], each process logs the data items it has accessed into the stable storage in the access order, so that the process can retrieve the logged messages at the same computational points after the failure-recovery. However, these schemes could incur high logging overheads during the failure-free operations. The writer-based logging schemes [6,7] utilize the volatile log of the writer process for each data item, and the corresponding readers only log the access information into the stable storage. As a result, the logging overhead can be reduced, however, the overhead of stable logging is still non-negligible. One solution suggested in  uses only the volatile log of the reader to store the access information. This scheme removes the overhead of stable logging, however, concurrent failures of multiple processes cannot be tolerated.
Causal logging is one logging approach which is gaining a lot of attention for the message-passing based distributed computing systems . In the causal logging technique, the writer-based logging of data items is performed and the access information is logged at the volatile storage of the dependent processes. Since this scheme completely eliminates the needs for stable logging, logging overhead can significantly be reduced. Also, since the storage of the dependent processes are utilized, concurrent and multiple failures can be handled. However, in this scheme, the log of the access information has to be causally spread over the dependent processes, which may cause the non-negligible message overhead.
A causal logging scheme for the DSM system based on lazy release consistent(LRC) memory model has been suggested in . In this scheme, to reduce the message overhead, the data structures and operations supported by the LRC model, such as diff, write notices, and vector clocks, are utilized. Moreover, logging is performed only at the synchronization points instead of every message passing point. However, since this scheme utilizes the vector clock to trace the access information, a vector clock for each synchronization interval must be logged. As a result, the message overhead turns out to be non-negligible for the system with a large number of processes. Especially, for the barrier operations, the logging overhead can be significant, since each process has to perform the logging for all the other processes.
In this paper, we propose an efficient implementation technique for the causal logging. In the proposed scheme, instead of logging the vector clock for each synchronization operation, the sufficient and necessary information to recreate the corresponding vector clock is inserted into the existing write notice structures. As a result, the additional information carried by each message becomes less than two integers for each synchronization interval, and hence, fault-tolerance can be achieved with a very little overhead. Moreover, the size of the additional information is independent of the number of processes in the system, and hence, the new scheme can be very efficient for the large size systems.
The rest of this paper is organized as follows: In Section 2, the system model and the definition of consistent logging for the correct recovery are presented and in Section 3, the protocols for causal logging and recovery are presented with their correctness. The performance of the proposed protocol is discussed with the experimental results in Section 4 and Section 5 concludes the paper.