PARALLEL AND DISTRIBUTED COMPUTING: FAULT-TOLERANT DISTRIBUTED COMPUTING
FAULT TOLERANCE
• The ability of a system to continue operating without interruption despite the failure of one or more of its components
• Describes how an operating system responds to, and tolerates, malfunctions and failures
• Aims to guarantee no break in service
• Recovers from failures completely and transparently
FAULT TOLERANCE
• Every gain in fault tolerance comes with a drawback somewhere else
• The system becomes slower, takes more disk space, uses more machines, and incurs other costs
• Therefore, fault tolerance is always a trade-off between cost and the degree of fault tolerance
FAILURE VS ERROR
• Failure: the system's behavior differs from the expected behavior
• A failure might involve the system being unreachable or producing incorrect output
• An error is an incorrect state of the system that may lead to a failure
• Errors do not necessarily cause failures; they can be detected in the system before they produce a failure
FAULT TOLERANCE
• Fault tolerance usually proceeds through several phases
• Error detection: an error has to be detected in order to avoid a failure
• Damage confinement: the error must be prevented from spreading to other components
• Error recovery: the error must be removed, otherwise the system will run into a failure
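The three phases can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Component` class, the rule that negative readings count as errors, and the recovery policy of falling back to the last good value are hypothetical, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """Hypothetical component that publishes readings to other components."""
    name: str
    last_good: int = 0
    isolated: bool = False
    published: List[int] = field(default_factory=list)


def is_error(value: int) -> bool:
    # Assumed validity check: negative readings are treated as errors.
    return value < 0


def process(component: Component, value: int) -> str:
    if is_error(value):                # 1. error detection
        component.isolated = True      # 2. damage confinement: the bad value is never published
        value = component.last_good    # 3. error recovery: fall back to the last good value
        component.isolated = False
        return "recovered"
    component.last_good = value
    component.published.append(value)
    return "ok"
```

The bad reading never reaches `published`, which is the point of confinement: other components only ever see values that passed the check.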
PROCESSOR FAULT
• Occurs when a processor behaves in an unexpected manner. Processor faults are classified into three kinds:
1. Fail-stop: the processor has failed completely and will never respond; neighboring processors can detect the failed processor
2. Slowdown: the processor might run in a degraded form or might fail completely
3. Byzantine: the processor can fail, run in a degraded fashion for some time, or execute at normal speed while trying to make the computation fail
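One common way for neighbors to detect a fail-stop processor is a heartbeat timeout. The sketch below is a simplified illustration under assumptions: the `HeartbeatDetector` class and its method names are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class HeartbeatDetector:
    """Suspects a node when no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of its last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        # Record that `node` was alive at time `now`.
        self.last_seen[node] = now

    def suspected(self, now: float) -> list:
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("p1", now=0.0)
detector.heartbeat("p2", now=0.0)
detector.heartbeat("p1", now=4.0)   # p2 stays silent
print(detector.suspected(now=5.0))  # ['p2']
```

A timeout like this suits fail-stop faults. A slowed-down processor may be suspected wrongly, and a Byzantine processor can keep sending heartbeats while corrupting the computation, so those kinds need other mechanisms.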
NETWORK FAULTS
• Occur when processors are prevented from communicating with each other. Link faults can cause new kinds of problems, such as:
• One-way links: one processor can send messages to another but cannot receive messages back from it
• Network partition: a portion of the network is completely isolated from the rest
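Both link problems can be illustrated on a small model of the network. In this sketch the network is assumed to be a set of directed `(sender, receiver)` links that currently work; the function names and the sample topology are hypothetical.

```python
def one_way_links(links: set) -> list:
    """Links that work in one direction only."""
    return sorted((a, b) for (a, b) in links if (b, a) not in links)


def partitions(nodes: set, links: set) -> list:
    """Group nodes that can reach each other over two-way links."""
    # Only links that work in both directions count as usable here.
    adjacent = {n: set() for n in nodes}
    for a, b in links:
        if (b, a) in links:
            adjacent[a].add(b)

    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:                      # depth-first search from `start`
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(adjacent[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


nodes = {"a", "b", "c", "d"}
links = {("a", "b"), ("b", "a"), ("c", "d"), ("d", "c"), ("b", "c")}
print(one_way_links(links))      # [('b', 'c')]
print(partitions(nodes, links))  # [['a', 'b'], ['c', 'd']]
```

More than one group in the result means the network is partitioned. Treating a one-way link as unusable is a design choice of this sketch, not something the slides prescribe.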
ATTRIBUTES OF A FAULT-TOLERANT SYSTEM
A fault-tolerant system is a dependable system, which requires the following attributes:
1. Availability: the system is in a ready state and able to deliver its functions. A highly available system is most likely working at any given instant in time.
2. Reliability: the ability of the system to run continuously without failure. It is defined over a time interval rather than at an instant. A reliable system works for a long time without interruption.
3. Safety: when the system fails to carry out its processes correctly and its operations are incorrect, nothing disastrous happens and other systems are not made faulty.
4. Maintainability: failures can be noticed and fixed easily.
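The difference between availability (an instant) and reliability (an interval) is easier to see with numbers. The sketch below uses the standard steady-state formula availability = MTTF / (MTTF + MTTR). For reliability it assumes a constant failure rate, which gives R(t) = exp(-t / MTTF). That exponential model is an assumption of this example, not something stated on the slides.

```python
import math


def availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time the system is ready (MTTF and MTTR in the same unit)."""
    return mttf / (mttf + mttr)


def reliability(t: float, mttf: float) -> float:
    """Probability of running for time t without failure, assuming a constant failure rate."""
    return math.exp(-t / mttf)


# A system that fails on average every 999 hours and takes 1 hour to repair:
print(availability(mttf=999.0, mttr=1.0))   # availability, about 0.999
print(reliability(t=24.0, mttf=999.0))      # chance of surviving one day without failure
print(reliability(t=720.0, mttf=999.0))     # chance of surviving thirty days without failure
```

A system can score well on one attribute and poorly on the other: very short but frequent outages leave availability high while making long failure-free intervals unlikely.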
TYPES OF FAILURE
CLASSIFICATION OF FAILURE
• Transient
• Intermittent
• Permanent
FAULT TOLERANCE MECHANISMS IN DISTRIBUTED SYSTEMS
• Replication-based fault tolerance technique
• Process-level redundancy technique
• Fusion-based redundancy technique
REPLICATION-BASED FAULT TOLERANCE TECHNIQUE
• Replicate the data on other machines, so that the failure of one machine does not cause the whole system to stop
• Replicate the data on different servers
Problems of replication
• Consistency: the major problem of replication, because any client may update the data. Consistency of the data is ensured by a consistency model, such as the sequential or causal memory consistency model.
• Degree of replication: a large number of replicas is needed in order to achieve high fault tolerance
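Both problems can be made concrete with a small sketch. It assumes replicas are plain in-memory dictionaries and that failures are independent. The majority read below is only an illustration of how replicas can mask one stale copy; it is not an implementation of the sequential or causal consistency models, and all names are hypothetical.

```python
from collections import Counter


def write_all(replicas: list, key: str, value) -> None:
    # Send the update to every replica.
    for replica in replicas:
        replica[key] = value


def majority_read(replicas: list, key: str):
    # Return the value that more than half of the replicas agree on.
    votes = Counter(replica.get(key) for replica in replicas)
    value, count = votes.most_common(1)[0]
    if count > len(replicas) // 2:
        return value
    raise RuntimeError("no majority among replicas")


def prob_all_fail(p: float, n: int) -> float:
    # Probability that all n replicas are down at once,
    # assuming each fails independently with probability p.
    return p ** n


replicas = [{}, {}, {}]
write_all(replicas, "x", 1)
replicas[2]["x"] = 0                  # one replica missed a later update or is faulty
print(majority_read(replicas, "x"))   # 1
print(prob_all_fail(0.1, 3))          # roughly 0.001
```

The last line shows why the degree of replication matters: under the independence assumption each extra replica multiplies the chance of total data loss by `p`, at the price of another machine to keep consistent.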
PROCESS-LEVEL REDUNDANCY TECHNIQUES
• Faults that disappear without any action being taken are called transient faults. Faults of this type are hard to identify.
• Software-based fault tolerance techniques are used to handle transient faults
• Process-level redundancy (PLR) compares the outputs of redundant processes to ensure correct execution
• Checkpointing and rollback are popular techniques in which the current state of the system is saved so that it can be restored later
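The two ideas on this slide, comparing redundant executions and checkpoint/rollback, can be sketched as follows. This is a simplified illustration under assumptions: real PLR runs redundant operating-system processes, while here the copies are plain function calls, and the `fault_in` parameter is a hypothetical hook that corrupts one copy to simulate a transient fault.

```python
import copy
from collections import Counter


def run_redundant(fn, arg, copies: int = 3, fault_in=None):
    """Run fn several times and return the output the majority agrees on."""
    outputs = []
    for i in range(copies):
        out = fn(arg)
        if i == fault_in:
            out = out + 1          # simulated transient fault in one copy
        outputs.append(out)
    value, count = Counter(outputs).most_common(1)[0]
    if count <= copies // 2:
        raise RuntimeError("redundant copies disagree, no majority")
    return value


class Checkpointed:
    """Holds a state that can be saved and later restored."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self) -> None:
        # Save a deep copy so later changes do not affect the checkpoint.
        self._saved = copy.deepcopy(self.state)

    def rollback(self) -> None:
        # Restore the state recorded at the last checkpoint.
        self.state = copy.deepcopy(self._saved)


print(run_redundant(lambda x: x * x, 4, fault_in=1))  # 16

job = Checkpointed({"step": 1})
job.state["step"] = 2
job.checkpoint()
job.state["step"] = 99      # state corrupted after the checkpoint
job.rollback()
print(job.state)            # {'step': 2}
```

Voting masks a fault while the computation runs, whereas rollback repairs the state after a fault has been detected, so the two techniques complement each other.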
FUSION-BASED TECHNIQUE
• The downside of replication is that multiple backups increase cost
• Fusion-based techniques address this problem because they require fewer backups
• Backup machines are fused to a given set of systems (NP problem)
• Fusion-based techniques have very high overhead during the recovery process, which is acceptable when the probability of a fault in the system is low
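The trade-off between fewer backups and costlier recovery can be illustrated with XOR parity. This is a deliberately simplified stand-in, assuming each primary's state is a single integer and at most one primary fails; actual fusion-based techniques use more elaborate constructions, and the function names here are hypothetical.

```python
from functools import reduce
from operator import xor


def fuse(states: list) -> int:
    """Build one fused backup for all primaries by XOR-ing their states."""
    return reduce(xor, states, 0)


def recover(fused: int, surviving: list) -> int:
    """Rebuild the single lost state from the fused backup and all survivors."""
    return reduce(xor, surviving, fused)


primaries = [3, 10, 7]
backup = fuse(primaries)          # one backup instead of three replicas
print(recover(backup, [3, 7]))    # 10, the state of the failed primary
```

Recovery has to read the state of every surviving primary, which mirrors the high recovery overhead noted on the slide, while normal operation needs only one backup for the whole group.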

Parallel and Distributed Computing Chapter 12
