| Summary: | Abstract: "A new approach for triplex-based checkpointing that achieves fault-tolerance in the presence of independent and correlated multiple faults is presented. Each task is executed on three processing modules, and their checkpoints are compared at the end of every checkpoint interval. Single faults occurring in one checkpoint interval can be corrected by majority voting, without any roll-back. Single faults occurring over consecutive checkpoint intervals can also be handled without any roll-back. In the presence of multiple faults in one interval, none of the checkpoints will match. Various recovery actions are possible depending on the choice of the Concurrent Depth. The proposed scheme is capable of handling multiple faults occurring in one checkpoint interval as well as in consecutive checkpoint intervals without losing any computation interval, in many cases. This capability to handle both correlated and independent faults without any loss of checkpoint intervals in many cases makes it suitable for hard real-time applications as well as applications requiring long execution times. Existing roll-forward strategies do not provide such fault-tolerance capabilities. Unlike the existing roll-forward strategies, the proposed strategy does not require any task to be migrated to a spare processor in the event of a fault. Network congestion is therefore eliminated, as is the delay to the task that experienced the fault. Also, the proposed scheme can be implemented in a stand-alone system with three processing modules, unlike other strategies that require access to a spare. The proposed scheme also offers the flexibility of dynamically choosing one of the various alternatives for roll-forward recovery depending on the amount of execution left, the amount of slack available and the fault patterns, so as to optimize the performance parameter(s) of interest.
The performance of the proposed strategy is compared against an existing strategy to demonstrate its effectiveness. Results from our simulations indicate that the proposed scheme outperforms the existing one in terms of parameters such as the probability of meeting a deadline, the average delay in completing a task, and the variance in the delays encountered by the task."
|