

Showing posts from May, 2009

PyTorch Code for Simple Neural Networks for MNIST Dataset

Hardware Redundancy

Hardware redundancy is the use of additional hardware to compensate for failures. It can be applied in two ways.

The first is fault detection, correction, and masking. Multiple hardware units are assigned to do the same task in parallel and their results are compared. If one or more units are faulty, we can expect this to show up as a disagreement in the results.

The second is to replace the malfunctioning units.

Redundancy is expensive; duplicating or triplicating the hardware is justified only in the most critical applications.

Two methods of hardware redundancy are given below:

Static Pairing
N-Modular Redundancy (NMR)

Static Pairing

Processors are hardwired in pairs, and the entire pair is discarded if one of its processors fails. This is a very simple scheme. The pair runs identical software with identical inputs and should generate identical outputs. If the outputs are not identical, the pair is non-functional, so the entire pair is discarded. This approach is depicted in the following figure, and it will w
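The static pairing scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `StaticPair` and the functions `healthy` and `faulty` are hypothetical names, and ordinary Python callables stand in for the two hardwired processors.

```python
from typing import Any, Callable


class StaticPair:
    """Two units run the same computation; the pair is discarded on disagreement."""

    def __init__(self, unit_a: Callable[[Any], Any], unit_b: Callable[[Any], Any]):
        self.unit_a = unit_a
        self.unit_b = unit_b
        self.functional = True  # set to False permanently once outputs differ

    def run(self, value: Any) -> Any:
        if not self.functional:
            raise RuntimeError("pair has been discarded")
        out_a = self.unit_a(value)  # identical input to both units
        out_b = self.unit_b(value)
        if out_a != out_b:
            # The comparison cannot tell which unit failed, so drop the whole pair.
            self.functional = False
            raise RuntimeError("outputs disagree; discarding the pair")
        return out_a


def healthy(x):
    return x * 2


def faulty(x):
    return x * 2 + 1  # simulated fault (assumption for the demo)


good_pair = StaticPair(healthy, healthy)
bad_pair = StaticPair(healthy, faulty)
```

Calling `good_pair.run(3)` returns the agreed output, while `bad_pair.run(3)` raises and marks the pair as non-functional. Note that the comparison only detects a disagreement; it cannot identify which of the two units is at fault, which is why the whole pair is discarded.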

Fault and Error Containment

A fault in one part of the system can cause large voltage swings in other parts of the system, so it is necessary to prevent faults from spreading through the system. This is called containment. Containment can be divided into fault containment zones and error containment zones.

Fault Containment Zone (FCZ)

A failure of some part of the computer outside an FCZ cannot cause any element inside that FCZ to fail. Hardware inside the FCZ must be isolated from the outside system. Each FCZ should have an independent power supply and its own clock (which may be synchronized with the other clocks). Typically, an FCZ consists of a whole computer, which includes processors, memory, I/O, and control interfaces.

Error Containment Zone (ECZ)

An ECZ prevents errors from propagating across zone boundaries. This is achieved by voting on redundant outputs.

Hardware Redundancy
Software Redundancy
Time Redundancy
Information Redundancy
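Voting on redundant outputs, as mentioned for the error containment zone, might look like the following minimal sketch. The function name `vote` and the strict-majority rule are assumptions made for illustration; the post does not specify a particular voting scheme.

```python
from collections import Counter


def vote(outputs):
    """Return the value produced by a strict majority of units, or None if none exists."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # Only forward a value across the zone boundary if more than half the units agree.
    return value if count > len(outputs) / 2 else None
```

With three redundant units, `vote([5, 5, 7])` forwards the agreed value and masks the single erroneous output, while `vote([1, 2, 3])` returns `None` because no majority exists and nothing should be propagated.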

Introduction to Fault Tolerance

Fault Tolerance Techniques

Introduction

Hardware faults: occur due to a physical defect in the system, such as a broken wire or a logic value stuck at 0 in a gate.
Software faults: occur due to a bug introduced into the system, so that the software misbehaves for a given set of inputs.
Error: the manifestation of a fault. A fault may occur at any time, but it becomes visible only when it manifests as an error.
Fault latency: the time between the onset of a fault and its manifestation as an error.

Error Recovery

Forward error recovery: the error is masked without any computations having to be redone.
Backward error recovery: the system is rolled back to a moment in time before the error is believed to have occurred.

What Causes Failures?

There are three main causes of failures:

Errors in the specification or design

Mistakes in the specification and design are very difficult to guard against. Many hardware failures and all software failures occur due to such mistakes. It is difficult to ensure that the
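To make the backward error recovery idea from the Error Recovery definitions above more concrete, here is a small checkpoint-and-rollback sketch. Everything in it is hypothetical: `run_with_rollback`, the dictionary used as system state, and the validity check are stand-ins chosen for the demo, and a real system would typically re-execute the failed computation rather than simply skip it.

```python
import copy


def run_with_rollback(state, steps, is_valid):
    """Apply each step; if the result fails the check, restore the last checkpoint."""
    rollbacks = 0
    for step in steps:
        checkpoint = copy.deepcopy(state)  # save the state before running the step
        step(state)
        if not is_valid(state):
            state = checkpoint  # roll back to the moment before the error
            rollbacks += 1
    return state, rollbacks


def add_five(s):
    s["total"] += 5


def corrupt(s):
    s["total"] = -1  # simulated error (assumption for the demo)


def add_three(s):
    s["total"] += 3


final_state, rollback_count = run_with_rollback(
    {"total": 0},
    [add_five, corrupt, add_three],
    is_valid=lambda s: s["total"] >= 0,
)
```

Here the effect of the `corrupt` step is undone by restoring the checkpoint, and the remaining steps continue from the last known good state.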