Fault-Tolerant Computing


Moore’s law is popularly summarized as the claim that the computational power of processors doubles roughly every eighteen months; Moore’s original observation concerned the number of transistors on a chip, which has doubled about every two years. The law was derived from observed trends in a computing industry geared toward ever higher performance. However, no matter how powerful a single computer becomes, there are always reasons to use multiple computational units (Touzene, 2002). These units may be spread across different geographic regions, so long as there is a link through which the computers interact. Two needs drive the adoption of multiple computational units: sharing information, and raising the performance of computer systems so that they can tackle many tasks simultaneously. These two needs are the foundation of the distributed and parallel computing architectures (Um, 2010).

A number of problems arise from the use of both distributed and parallel computing. Notable ones include concurrency control, the efficiency of the algorithms implemented in the systems, and fault tolerance. Of these, fault tolerance is the problem most often raised by researchers and developers in the industry, because it underpins the reliability, availability, and efficiency of computer systems and applications. Fault tolerance techniques are put in place mainly to anticipate failures and to offer immediate remedies that limit the impact of errors, so that developers and users can give failures the required attention within the shortest time possible (Choi, Chung and Yu, 2013). This paper discusses some of the fault-tolerant techniques that have been adopted in distributed and parallel system architectures to sustain performance and avoid errors. The results presented in the paper illustrate that both distributed and parallel system architectures are able to deal with various system errors and complications that may arise.


A number of faults occur in parallel and distributed computing. Depending on the fault-tolerance policy, different fault tolerance techniques can be applied at either the workflow level or the task level. Fault tolerance is closely related to system dependability, that is, to the system delivering to its users what it was designed to deliver. If a system does not cover its full scope, both functional and non-functional, it is considered not fault tolerant (Hao et al., 2014). Dependability encompasses a number of requirements that apply to both distributed and parallel computing architectures. These include:

· Availability

· Reliability

· Maintainability

· Safety

Availability is the property that a system is ready to be used for the purposes it was designed to accomplish. Essentially, it is the probability that the system is operating correctly and is available to serve its users at any given instant. A highly available system is one that is readily accessible and works whenever users need it. Reliability is the property that a system can run continuously without failing. Unlike availability, which is defined at an instant, reliability is defined over an interval of time. A system that works continuously for a long time without any fault is highly reliable. The two properties can diverge. A system that goes down for only one millisecond every hour has an availability of more than 99.999%, yet it is highly unreliable because it fails every hour. Conversely, a system that never crashes but is shut down for two weeks every year is highly reliable, but its availability is only about 96% (Hao et al., 2014).
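The contrast between availability and reliability can be checked with a little arithmetic. The sketch below assumes one outage of one millisecond in every hour for the first system and a two-week shutdown in a simplified 52-week year for the second; the function and variable names are illustrative only.

```python
def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Fraction of total time during which the system is usable."""
    return uptime_seconds / (uptime_seconds + downtime_seconds)


HOUR = 3600.0
WEEK = 7 * 24 * HOUR
YEAR = 52 * WEEK  # simplified 52-week year (assumption)

# System A: down for 1 ms in every hour. It is almost always available,
# yet it fails 24 times a day, so it is not reliable.
system_a = availability(HOUR - 0.001, 0.001)

# System B: never crashes, but is switched off for two weeks each year.
# It is reliable while running, yet noticeably less available.
system_b = availability(YEAR - 2 * WEEK, 2 * WEEK)

print(f"System A availability: {system_a:.7%}")
print(f"System B availability: {system_b:.2%}")
```

Availability only measures the fraction of time the system is up, so it says nothing about how often the outages occur; that is what reliability captures.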

It is always hard to prevent system crashes; however, for a system to be safe, a temporary failure must not lead to catastrophic loss. Nuclear power control systems, for example, demand high levels of safety because their failure may be catastrophic: such systems should contain the damage even if they fail for only a millisecond. Finally, maintainability refers to how easily a system can be repaired or upgraded when the need arises. A highly maintainable system allows repairs to be carried out with ease (Hwang and Kesselman, 2003). Traditionally, fault tolerance has been measured with three metrics: mean time to failure, mean time to repair, and mean time between failures. They are related by the formula depicted below:


MTBF = MTTF + MTTR

Where MTTF – Mean time to failure

MTTR – Mean time to repair

MTBF – Mean time between failures
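As a worked example, the three metrics can be combined using the usual relation MTBF = MTTF + MTTR, together with the standard steady-state availability MTTF / (MTTF + MTTR). The figures below (999 hours to failure, 1 hour to repair) are made up for illustration and do not come from the paper.

```python
def mean_time_between_failures(mttf_hours: float, mttr_hours: float) -> float:
    """MTBF is one full failure cycle: time running plus time under repair."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of each failure cycle spent in the working state."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: runs 999 hours on average, then takes 1 hour to repair.
mtbf = mean_time_between_failures(999.0, 1.0)
avail = steady_state_availability(999.0, 1.0)

print(f"MTBF = {mtbf} hours, availability = {avail:.3f}")
```

Shortening the repair time raises availability without changing how often the system fails, which is why MTTF and MTTR are tracked separately.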

One caveat when using the above formula is that it depends on a precise definition of what counts as a failure, and identifying failures in real systems is not always straightforward. A system is said to have failed if it cannot serve the purpose it was designed for, that is, if it cannot provide the required services to its users. Failures can often be anticipated through the errors a system registers beforehand, which allows their impact to be limited. For example, when data is lost during a network transfer, the underlying cause is an error in the network. A fault-tolerant network has to recover from the loss and ensure that users still receive the required information (Hwang and Kesselman, 2003). If it does not, the system has failed to serve its users.

Research Results: Fault-Tolerant Computing

Fault-tolerant computing is the art and science of building computing systems that continue to operate properly despite the existence of faults. A fault-tolerant system is able to operate in the presence of one or more faults (Keidar, 2011). Faults in distributed and parallel computing architectures include:

· Network faults. These occur in a network because of partitioning, link failures, packet loss, or related errors.

· Media faults. These are caused by failures of storage media, such as disk head crashes.

· Service expiry faults. The service time of a resource may expire while the resource is still in use by an application, which leads to faults in the operation of the system.

· Processor faults. Software bugs and a lack of processor resources are among the causes of processor faults.

Faults may also be categorized by how long they persist:

· Permanent faults persist until the faulty component is repaired or replaced; they result from events such as power breakdowns or wire cuts.

· Intermittent faults occur only occasionally and are therefore often missed during system testing.

· Transient faults are short-lived faults caused by temporary conditions in the system; the best way to correct them is to roll the system back to an earlier state.
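The rollback idea mentioned for transient faults can be sketched as checkpoint-and-retry. This is a minimal illustration, not a method taken from the paper: `TransientFault`, `run_with_rollback`, and the bank-style `flaky_transfer` step are all hypothetical names, and the state is assumed to fit in a plain dictionary.

```python
import copy


class TransientFault(Exception):
    """Hypothetical exception standing in for a fault that clears on retry."""


def run_with_rollback(state: dict, step, max_retries: int = 3) -> dict:
    """Apply step(state) in place; on a transient fault, restore the checkpoint and retry."""
    checkpoint = copy.deepcopy(state)
    for _ in range(max_retries + 1):
        try:
            step(state)
            return state
        except TransientFault:
            # Roll back any partial update before trying again.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("fault persisted after rollback; treat it as permanent")


calls = {"n": 0}


def flaky_transfer(account: dict) -> None:
    """Fails halfway through on the first call only, like a transient glitch."""
    account["balance"] -= 10  # partial update happens before the fault
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientFault("glitch")
    account["transferred"] = account.get("transferred", 0) + 10


account = {"balance": 100}
run_with_rollback(account, flaky_transfer)
print(account)
```

Without the rollback, the retry would deduct the amount twice; restoring the checkpoint first is what makes the retry safe. A fault that survives every retry is no longer transient and is reported as permanent.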

A number of techniques have been devised to limit the impact of these faults. Below is an analysis of some of the approaches developed to build computer systems that tolerate hardware-related and software-related faults. Designers, however, usually put more emphasis on hardware-related faults and tend to overlook software-related errors (Piuri, 2001).


Hardware Fault Tolerance

Modern society has seen tremendous improvements in hardware development. More emphasis has been placed on developing fault-tolerant hardware components than on software components, although this does not mean that no effort is directed toward fault-tolerant software. The major approach to building fault-tolerant hardware is to partition a computer system into modules (Saha, 2004). These modules are meant to act as fault-containment regions. Redundancy is a key aspect of modern systems: if a single module fails, it is immediately replaced by another module that assumes the responsibilities of the failed one. Strategies such as fault masking and dynamic recovery have also been implemented (Saha, 2006).

Fault Masking

Fault masking is a structural redundancy technique that completely masks faults within a set of redundant modules. Several identical modules perform the same task, and their outputs are voted on to remove any error produced by a faulty module (Touzene, 2002). The most common form is Triple Modular Redundancy (TMR), in which the circuitry is triplicated and the results are voted on. A TMR system can still fail if two modules in the redundant triplet produce errors, because the vote is then no longer valid. Hybrid redundancy, which extends the triplicated modules with backup spares, was developed to address this problem: when one module fails, a spare takes over its tasks, so a single module failure does not interrupt the system. As a result, long-term reliability, efficiency, and availability are attained (Saha, 2004).
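The voting step in TMR can be illustrated in software. The sketch below is only a model of the idea (real TMR voters are hardware circuits); the module functions `good`, `stuck_at_zero`, and `stuck_at_minus_one` are hypothetical stand-ins for a healthy module and two faulty ones.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, else raise."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many modules disagree")
    return value


def tmr(modules, x):
    """Run the same input through three redundant modules and vote on the result."""
    return majority_vote([module(x) for module in modules])


def good(x):
    return x * x


def stuck_at_zero(x):
    return 0  # hypothetical faulty module


def stuck_at_minus_one(x):
    return -1  # a second, different hypothetical fault


# One faulty module is outvoted, so the fault is masked.
masked = tmr([good, good, stuck_at_zero], 7)

# Two faulty modules that agree outvote the healthy one: the vote is wrong.
outvoted = tmr([good, stuck_at_zero, stuck_at_zero], 7)

# Two faulty modules that disagree leave no majority at all.
try:
    tmr([good, stuck_at_zero, stuck_at_minus_one], 7)
    no_majority = False
except RuntimeError:
    no_majority = True

print(masked, outvoted, no_majority)
```

The second and third cases show why a plain triplet tolerates only a single faulty module, which is the gap that hybrid redundancy with spares is meant to close.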

Dynamic Recovery

Dynamic recovery is required when only a single copy of a computation is running at any given time. The system is partitioned into modules that are backed up by spares to increase the tolerance margin. Recovery involves several activities: a fault is detected, the faulty module is switched out for an available spare, and the operating system initiates the recovery procedures that allow the computation to continue (Um, 2010). Compared with voted systems, dynamic recovery uses hardware more efficiently, which makes it the better choice for resource-constrained systems. Its weakness is recovery speed: computation is delayed while a fault is being recovered. In addition, the technique may require specialized operating systems, and it tends to cover a narrower range of faults (Choi, Chung and Yu, 2013).
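A rough software model of dynamic recovery is standby sparing: one module is active, and a spare is switched in when the active module reports a fault. The sketch assumes faults are detected and signalled by a hypothetical `ModuleFault` exception; the class and function names are illustrative, not part of any real system.

```python
class ModuleFault(Exception):
    """Hypothetical signal that the active module detected its own failure."""


class SparedUnit:
    """One active module backed by a pool of spares (standby sparing)."""

    def __init__(self, modules):
        self.active = modules[0]
        self.spares = list(modules[1:])
        self.switchovers = 0

    def compute(self, x):
        while True:
            try:
                return self.active(x)
            except ModuleFault:
                if not self.spares:
                    raise RuntimeError("all spares exhausted")
                # Switch the faulty module out, switch a spare in, and re-run.
                self.active = self.spares.pop(0)
                self.switchovers += 1


def broken(x):
    raise ModuleFault("module failed")


def healthy(x):
    return x + 1


unit = SparedUnit([broken, healthy])
result = unit.compute(1)
print(result, unit.switchovers, len(unit.spares))
```

Only one module does useful work at a time, which is why this uses less hardware than voting; the price is the delay spent detecting the fault and re-running the computation on the spare.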

Software Fault Tolerance

Both static and dynamic approaches are used in software fault tolerance. A common static approach is N-version programming, which relies on static redundancy: several independently written programs perform the same function, and their outputs are voted on to produce an error-free result. The dynamic approach is based on the concept of recovery blocks. Programs are partitioned into blocks, and an acceptance test is run on the result of each block. If the test fails, a redundant code block is executed instead. In the long run, users are therefore able to keep using the services provided by the system without failure (Choi, Chung and Yu, 2013). The main objective of such approaches is to ensure that critical operations are not halted or interfered with.
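The recovery block scheme can be sketched as a loop over alternates guarded by an acceptance test. This is a simplified illustration: the alternates here are pure functions, so no state has to be restored between attempts, and `buggy_sqrt` is a deliberately wrong, hypothetical primary routine.

```python
import math


def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in order and return the first result that passes the test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def buggy_sqrt(x):
    return x / 2  # hypothetical faulty primary block


def square_root_is_plausible(x, result):
    """Acceptance test: squaring the result should give back the input."""
    return abs(result * result - x) < 1e-9


root = recovery_block([buggy_sqrt, math.sqrt], square_root_is_plausible, 9.0)
print(root)
```

The quality of the scheme depends on the acceptance test: for an input of 4.0 the faulty primary happens to return the right answer and passes, while for 9.0 it is rejected and the redundant block takes over.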

Another approach is design diversity, which combines hardware and software fault tolerance. A fault-tolerant computer system is built from redundant channels, each implemented with different hardware and software. Every channel is designed to provide the same function, and a dedicated method is used to detect whether any channel deviates from the acceptable limits. Owing to its cost, this technique is used mainly in critical aircraft control applications (Hao et al., 2014). The goal of the technique is to tolerate faults that originate in both the hardware and the software components of a system.


Numerous applications in modern society require computers to function properly, as in the case of spacecraft, nuclear plants, and ships. It is almost impossible to run these applications without computers, owing to the numerous activities and processes that are handled on the go. Additionally, these applications must operate for long periods without constant repairs, which would otherwise add to their cost. A typical requirement is a 95% probability that the computers will operate correctly for a period of 5-10 years. This means that efficiency is key to ensuring the proper functioning of the applications. Such applications are also constrained to low power, weight, and volume. On this premise, fault tolerance is a fundamental approach to enhancing efficiency in such applications, which also explains why bodies such as NASA were early sponsors of fault-tolerant computing (Hwang and Kesselman, 2003).

Ultra-dependable computers are those in which even a small error or delay can prove disastrous. They are mostly designed for applications in critical sectors such as nuclear plants. Other applications require high availability but can tolerate occasional errors or short delays while error recovery takes place. Systems designed for this class are less expensive than ultra-dependable computers, and they often use duplex designs. Examples include transaction processing and telephone switching systems. The major difference between the two classes lies in the error margin (Choi, Chung and Yu, 2013): ultra-dependable computers allow no error margin, whereas high-availability systems allow a small one.

One of the most difficult tasks in the design of fault-tolerant computing systems is verification, that is, determining whether the system meets its reliability requirements. Verification is difficult because it requires several models. The first is a model of the error and fault environment expected in the system (Piuri, 2001). Other models describe the structure and behavior of the adopted design. The effectiveness of the fault tolerance mechanisms must then be determined through fault simulations and analytic studies. The results, expressed as latencies, coverage, and error and fault rates, are used to predict reliability. A number of computer-aided tools based on Markov and semi-Markov processes have been developed to predict the reliability of fault-tolerant computers. To improve fault tolerance assessment, researchers have also turned to experimental testing that uses fault insertion to help assess reliability (Choi, Chung and Yu, 2013).
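To give a flavor of Markov-based reliability prediction, the sketch below models a triple modular redundant system as a three-state Markov chain (three modules working, two working, failed). It rests on simplifying assumptions that are not from the paper: a perfect voter, no repair, and a constant failure rate per module. The chain is integrated with simple Euler steps and compared with the closed-form result 3R² − 2R³, where R = e^(−λt) is the reliability of a single module.

```python
import math


def tmr_reliability_markov(failure_rate: float, t: float, steps: int = 10_000) -> float:
    """Integrate the Markov chain with Euler steps; reliability = P(3 up) + P(2 up)."""
    dt = t / steps
    p3, p2 = 1.0, 0.0  # start with all three modules working
    for _ in range(steps):
        leave_p3 = 3 * failure_rate * p3  # any of three modules may fail
        leave_p2 = 2 * failure_rate * p2  # a second failure defeats the vote
        p3 += -leave_p3 * dt
        p2 += (leave_p3 - leave_p2) * dt
    return p3 + p2


def tmr_reliability_closed_form(failure_rate: float, t: float) -> float:
    """Closed-form TMR reliability for independent modules and a perfect voter."""
    r = math.exp(-failure_rate * t)
    return 3 * r**2 - 2 * r**3


# Hypothetical figures: 1e-4 failures per hour per module, 1000-hour mission.
rate, mission = 1e-4, 1000.0
simplex = math.exp(-rate * mission)
numeric = tmr_reliability_markov(rate, mission)
exact = tmr_reliability_closed_form(rate, mission)

print(f"single module: {simplex:.4f}, TMR (Markov): {numeric:.4f}, TMR (closed form): {exact:.4f}")
```

For this short mission the triplicated system is predicted to be more reliable than a single module; real tools extend the same state-based idea with repair transitions, imperfect coverage, and non-exponential (semi-Markov) holding times.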


Fault tolerance is attained by applying a set of analysis and design techniques that improve system dependability. Systems are used to make processes and activities more efficient and effective, but they sometimes fail to meet the specifications desired by users because of errors. With the emergence of new technologies, there is a need to improve dependability by encouraging the development of fault-tolerant systems, in both distributed and parallel computing architectures. In the past, it was much easier to develop fault-tolerant computing systems from scratch; today, designers must work with new chips that contain complex and highly integrated functions. Additionally, hardware and software components must meet set standards in order to be economically viable. Currently, much research revolves around building fault-tolerant systems from commercial off-the-shelf (COTS) technology (Touzene, 2002).

Recent developments include the adaptation of existing fault-tolerance techniques to RAID disks, which combine them with data striping. Data is striped across several disks to improve bandwidth, and a redundant disk holds encoded information that acts as a backup whenever a failure occurs. When a disk fails, the lost data is reconstructed so that the user still receives the expected output. Another area of focus is application-based fault tolerance (Piuri, 2001), which is useful for detecting errors in high-performance parallel processors. Fault-tolerant computing plays a major role in process control, space, communication, transportation, and e-commerce.
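The reconstruction step in a parity-protected RAID array can be illustrated with byte-wise XOR, which is one common way to encode the redundant disk (the text above does not say which encoding is used, so XOR parity is an assumption here). The three small byte strings stand in for stripes on three data disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data disks holding one stripe each, plus a parity disk.
data_disks = [b"FAUL", b"T-TO", b"LERA"]
parity = xor_blocks(data_disks)

# Simulate losing disk 1, then rebuild it from the survivors and the parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)
```

Because XOR is its own inverse, combining the surviving stripes with the parity yields exactly the missing stripe; a single parity disk can therefore recover from the loss of any one disk, but not two.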
