Fault tolerance in multiprocessor systems using N-modular redundancy on demand
| Main Author: | |
|---|---|
| Format: | Thesis Book |
| Language: | English |
| Published: | [Place of publication not identified] : [publisher not identified] ; 2000. |
| Subjects: | |
| Online Access: | http://proxy.library.tamu.edu/login?url=http://proquest.umi.com/pqdweb?did=731989941&sid=1&Fmt=2&clientId=2945&RQT=309&VName=PQD |
| Summary: | Idle processors naturally offer spare capacity in a multiprocessor under normal loading conditions. Many attempts have been made to utilize this redundancy to provide fault-tolerant computation. A popular approach has been duplexing user tasks. In this dissertation we study the dependability of duplexing using a typical policy we call NMR on demand (NMROD). NMROD relies on a fault model favoring response correctness instead of actual fault status and integrates dynamic repair to provide non-stop operation over an extended period. It works by scheduling two identical copies of a task on different processors. Upon completion, outcomes are compared. If they match, the task is released and the processors are made available. Otherwise, a third copy is executed on a different processor. If the three outcomes do not produce a majority, we run yet another copy and so on until a pair reaches agreement, at which point the disagreeing processors are removed for repair. We first present stochastic models to describe the dynamics of processor loss and gain, and to determine the performance penalty resulting from additional loading due to task replication. Next, we identify and address two weaknesses which plague NMROD-like techniques: deadlock and agreement under faulty outcomes (malicious agreement). The deadlock problem is formalized and deadlock conditions are presented together with an avoidance strategy. Then, a stochastic model to dynamically assess dependability in the presence of malicious agreement is developed. The model provides a tool for design and analysis of duplex systems with well-known and controllable malicious agreement risks. Finally, an implementation on the nCUBE 2 multiprocessor demonstrates the low cost of adding NMROD-like fault tolerance to existing multiprocessors. It also gauges performance in a real-world setting where various architecture implementation overheads are incorporated. |
| Item Description: | Vita. "Major Subject: Computer Science". |
| Physical Description: | x, 126 leaves : illustrations ; 28 cm. Issued also on microfiche from University Microfilm Inc. |
| Bibliography: | Includes bibliographical references (leaves 82-86). |
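
The NMROD policy described in the summary (run two copies, compare, add one copy at a time until a pair agrees, then send the disagreeing processors to repair) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the function name `nmrod_run`, the `execute` callback, and the choice to raise an error when no spare processor remains are all assumptions made for illustration.

```python
from collections import Counter


def nmrod_run(task, processors, execute):
    """Run one task under an NMR-on-demand style policy (illustrative only).

    task       -- opaque task description passed to `execute`
    processors -- list of idle processor ids (hypothetical identifiers)
    execute    -- assumed callback: execute(proc, task) -> hashable outcome

    Returns (agreed_outcome, processors_used, processors_to_repair).
    """
    if len(processors) < 2:
        raise ValueError("duplexing needs at least two processors")

    pool = list(processors)   # processors not yet used for this task
    outcomes = {}             # processor id -> outcome it produced

    # Start with two identical copies on different processors.
    for _ in range(2):
        proc = pool.pop(0)
        outcomes[proc] = execute(proc, task)

    while True:
        # Stop as soon as any two outcomes agree.
        value, votes = Counter(outcomes.values()).most_common(1)[0]
        if votes >= 2:
            to_repair = [p for p, v in outcomes.items() if v != value]
            return value, list(outcomes), to_repair

        # No agreeing pair yet: run one more copy on a different processor.
        if not pool:
            # Assumption of this sketch: simply give up. The summary notes
            # that deadlock needs a dedicated avoidance strategy.
            raise RuntimeError("no spare processor left to break the tie")
        proc = pool.pop(0)
        outcomes[proc] = execute(proc, task)
```

For example, with processors `[0, 1, 2, 3]` and an `execute` callback in which processor 1 returns a corrupted value, the sketch runs copies on processors 0, 1, and 2, accepts the outcome shared by 0 and 2, and marks processor 1 for repair. Because agreement is judged only on outcomes, two faulty processors that happen to return the same wrong value would also be accepted, which is the malicious agreement risk the summary mentions.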