Dynamic cluster-based recovery : pessimistic and optimistic schemes (preliminary version)
| Main Author: | |
|---|---|
| Format: | Book |
| Language: | English |
| Published: | College Station, Tex. : Texas A & M University, Computer Science Dept., [1993] |
| Series: | Technical report (Texas A & M University. Computer Science Department) ; 93-027. |
| Subjects: |
| Summary: | Abstract: "This report describes a novel checkpointing and recovery mechanism being incorporated in a software layer under development at the Texas A&M University. Primary goal of this software layer is to provide reliability to distributed applications running on a network of workstations. The distributed applications of interest here consist of multiple processes that communicate with each other via message passing. Recovery in a system of communicating processes requires that a consistent state of the system be recoverable after failure occurs. Two basic approaches which achieve this goal have been extensively studied in the literature: (i) coordinated checkpointing, and (ii) independent checkpointing and message logging. For some applications, it may be possible to choose one approach that will achieve desired performance; however, for many applications, we believe that a dynamic fault tolerance scheme, that modifies itself as the demands of the application change over time, is more suitable. This report presents a dynamic fault tolerance scheme in which processes are partitioned into clusters. Each cluster contains at least one process; in the extreme, all the processes may belong to just one cluster. Clusters are dynamic entities; membership of the processes to the clusters can change over time. A process can leave or join a cluster at run-time based on the dynamic communication patterns. The report proposes heuristics to dynamically assign processes to clusters by performing cluster-join and cluster-fork operations. In the proposed scheme, (i) the processes within each cluster coordinate their checkpoints to establish a consistent recovery line, (ii) the inter-cluster messages are logged for the purpose of recovery, and (iii) only the order information for intra-cluster messages is logged. Processes in different clusters do not coordinate their checkpoints. 
The motivation behind the proposed approach is to eliminate the disadvantages and retain the advantages of the two basic approaches. Message logging schemes may perform poorly if the number of messages to be logged is large. Coordinated checkpointing can result in large delays before an output may be committed. The proposed approach permits dynamic allocation of processes to clusters such that the number of inter-cluster messages is relatively small, while the number of intra-cluster messages may be large. As only inter-cluster messages result in message logging, such an allocation will result in fewer logged messages. Also, when the number of processes per cluster is kept small (compared to the total number of processes), the delay in committing an output is expected to be smaller. It is expected that, based on the dynamic communication patterns and input/output requirements, the clusters can be dynamically re-arranged to achieve the desired performance." |
| Item Description: | "May 5, 1993." |
| Physical Description: | 26 leaves ; 28 cm. |
| Bibliography: | Includes bibliographical references. |
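
The logging rule stated in the abstract (inter-cluster messages are logged in full, while only order information is kept for intra-cluster messages) can be illustrated with a minimal sketch. Everything named here is hypothetical: `ClusterMap`, `add_process`, `join`, `fork`, and `record` are illustrative names, and the semantics given to cluster-join and cluster-fork are assumptions, since the abstract does not define the operations or the heuristics that trigger them. Checkpoint coordination and recovery are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterMap:
    """Hypothetical bookkeeping for a dynamic partition of processes into clusters."""
    cluster_of: dict = field(default_factory=dict)  # process id -> cluster id
    full_log: list = field(default_factory=list)    # inter-cluster: (sender, receiver, payload)
    order_log: list = field(default_factory=list)   # intra-cluster: (sender, receiver) only
    next_id: int = 0                                # next unused cluster id

    def add_process(self, pid):
        """Start a process in a new singleton cluster."""
        self.cluster_of[pid] = self.next_id
        self.next_id += 1

    def join(self, pid, other):
        """Assumed cluster-join: move pid into the cluster that contains other."""
        self.cluster_of[pid] = self.cluster_of[other]

    def fork(self, pid):
        """Assumed cluster-fork: split pid off into a fresh cluster of its own."""
        self.cluster_of[pid] = self.next_id
        self.next_id += 1

    def record(self, sender, receiver, payload):
        """Log the full message across clusters, order information only within one."""
        if self.cluster_of[sender] == self.cluster_of[receiver]:
            self.order_log.append((sender, receiver))
            return "order-only"
        self.full_log.append((sender, receiver, payload))
        return "logged"
```

In this sketch, moving a process with `join` or `fork` changes how its later messages are classified, which is the effect the abstract attributes to re-arranging clusters: fewer inter-cluster messages mean fewer fully logged messages.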