Techniques for structuring forward and backward error recovery in single-process systems are now quite well understood; unfortunately, this is not at all the case for systems composed of concurrent processes. This paper describes a general error recovery scheme for such systems. In particular, it proposes a general exception handling scheme for atomic actions involving concurrent interacting processes. According to this scheme, an exception raised by one of the processes of an atomic action implies a change from normal to abnormal activity for all processes of the action, so fault-tolerance measures involve every process participating in the action. If the abnormal activities succeed in handling the exception, the processes return to normal activity; otherwise, an atomic action failure is signaled.
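The control flow described above can be illustrated with a minimal sketch. It is a sequential simulation rather than a truly concurrent implementation, and the names `Participant`, `run_atomic_action`, and `AtomicActionFailure` are hypothetical, chosen for illustration; they are not the paper's notation. The sketch assumes that each participant supplies one handler (its abnormal activity) per exception type.

```python
class AtomicActionFailure(Exception):
    """Signaled to the enclosing context when recovery does not succeed."""


class Participant:
    """One process of the atomic action (hypothetical structure)."""

    def __init__(self, name, normal, handlers):
        self.name = name
        self.normal = normal      # callable: the normal activity
        self.handlers = handlers  # dict: exception type -> abnormal activity


def run_atomic_action(participants):
    """Run every participant and return a log of (name, event) pairs.

    Assumption: participants are run one after another to keep the
    sketch deterministic; a real system would run them concurrently.
    """
    log = []
    raised = None
    for p in participants:
        try:
            p.normal()
            log.append((p.name, "normal"))
        except Exception as exc:
            raised = exc
            log.append((p.name, "raised " + type(exc).__name__))
            break  # normal activity stops for the whole action

    if raised is None:
        return log

    # The exception switches ALL participants to abnormal activity,
    # including those whose normal activity completed without error.
    for p in participants:
        handler = p.handlers.get(type(raised))
        if handler is None:
            raise AtomicActionFailure(p.name + " has no handler") from raised
        try:
            handler(raised)
        except Exception as exc:
            raise AtomicActionFailure(p.name + " failed to recover") from exc
        log.append((p.name, "handled"))

    # Every abnormal activity succeeded: the action resumes normal activity.
    return log
```

If any participant lacks a handler for the raised exception, or its handler itself fails, the sketch signals `AtomicActionFailure` to the caller, mirroring the case in which the abnormal activities do not succeed.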
Rules are defined to resolve ambiguities in the choice of the abnormal activities that handle the particular exceptions currently raised by components of an atomic action. Some implementation issues are also discussed.
This paper provides an in-depth study of error recovery in single- or multiple-process systems and should be of interest to distributed system designers. However, more experience is clearly needed to fully evaluate the applicability of the proposed techniques to the structuring of real distributed systems. Complementary studies that precisely characterize the semantics of the proposed structure and, possibly, provide a methodological approach to the design of fault-tolerant software using the proposed scheme would be welcome. Such research is, of course, outside the scope of this paper, but it merits further effort.