Parallel and Distributed Computing Chapter 12

PARALLEL AND DISTRIBUTED COMPUTING
FAULT TOLERANT DISTRIBUTED COMPUTING

FAULT TOLERANCE
• The ability of a system to continue operating without interruption despite the failure of one or more of its components
• How an operating system responds to, and allows for, malfunctions and failures
• It guarantees no break in service
• The system recovers from failure completely and transparently

FAULT TOLERANCE
• Every gain in fault tolerance comes with a drawback somewhere else
• The system becomes slower, takes more disk space, uses more machines, and incurs other costs
• Therefore, fault tolerance is always a trade-off between cost and the degree of fault tolerance

FAILURE VS ERROR
• Failure: the system's behavior differs from its expected behavior
• A failure might involve the system being unreachable or producing incorrect output
• Error: an incorrect state of the system that may lead to a failure
• Errors do not necessarily cause failures; they can be detected in the system before they produce a failure

FAULT TOLERANCE
• Fault tolerance usually runs through several phases
• Error detection: an error has to be detected in order to avoid a failure
• Damage confinement: the error must be prevented from spreading to other components
• Error recovery: the error must be removed, otherwise the system will run into a failure
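The three phases can be illustrated with a minimal Python sketch. The `Component` class, its non-negative-value invariant, and `run_phases` are hypothetical names made up for this illustration; a real system would use its own error checks and recovery actions.

```python
class Component:
    """Hypothetical component whose value must stay non-negative."""

    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.last_good = value   # last state known to be correct
        self.isolated = False

    def detect_error(self):
        # Phase 1, error detection: check an invariant on the state.
        return self.value < 0

    def confine(self):
        # Phase 2, damage confinement: stop exposing state to other components.
        self.isolated = True

    def recover(self):
        # Phase 3, error recovery: restore the last known-good state and rejoin.
        self.value = self.last_good
        self.isolated = False


def run_phases(component):
    """Return True if an error was found and handled, False otherwise."""
    if not component.detect_error():
        component.last_good = component.value  # remember the correct state
        return False
    component.confine()
    component.recover()
    return True
```

Calling `run_phases` after every update keeps an erroneous state from being observed by other components, which is the point of detecting the error before it becomes a failure.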

PROCESSOR FAULT
• Occurs when a processor behaves in an unexpected manner. Processor faults may be classified into three kinds:
1. Fail stop: the processor has failed completely and will never respond; neighboring processors can detect the failed processor
2. Slowdown: the processor might run in a degraded form or might fail completely
3. Byzantine: the processor can fail, run in a degraded fashion for some time, or execute at normal speed while trying to make the computation fail
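One common way for neighbors to detect a fail-stop processor is a heartbeat timeout. The sketch below is only an illustration: `HeartbeatDetector`, its `timeout` parameter, and the explicit `now` timestamps are assumptions, not something the slides specify.

```python
class HeartbeatDetector:
    """Suspects a node when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        # Record that `node` was alive at time `now`.
        self.last_seen[node] = now

    def suspected(self, now):
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A timeout alone cannot tell a fail-stop processor from one that has merely slowed down, and it offers no protection against Byzantine behavior, which is why the three kinds are treated separately.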

NETWORK FAULTS
• Occur when processors are prevented from communicating with each other. Link faults can cause new kinds of problems, such as:
• One-way links: communication works in one direction only; one processor can send messages to the other but cannot receive messages from it
• Network partition: a portion of the network is completely isolated from the rest
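A network partition can be pictured as the network graph splitting into disconnected groups. The following sketch assumes a hypothetical `partitions` helper that receives the node names and the list of working links; it treats every link as bidirectional, so one-way links are not modeled.

```python
def partitions(nodes, links):
    """Group nodes into sets that can still reach each other."""
    neighbors = {n: set() for n in nodes}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        group, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(neighbors[n] - group)
        seen |= group
        groups.append(sorted(group))
    return groups
```

More than one group in the result means the network is partitioned.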

ATTRIBUTES OF FAULT TOLERANT SYSTEM
A fault tolerant system is a dependable system, which requires the following attributes:
1. Availability: the system is in a ready state and ready to deliver its functions. A highly available system is very likely to be working at any given instant in time.
2. Reliability: the ability of a computer system to run continuously without failure. It is defined over a time interval rather than at an instant. A reliable system works without interruption for a long period.
3. Safety: when the system fails to carry out its processes correctly, nothing disastrous happens and other systems are not made faulty.
4. Maintainability: how easily failures can be noticed and fixed.
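The difference between availability (an instant) and reliability (an interval) can be made concrete with two textbook formulas. These are not in the slides and are used here as assumptions: steady-state availability is MTTF / (MTTF + MTTR), and, if the failure rate is constant, the probability of running for time t without failure is exp(-t / MTTF).

```python
import math


def availability(mttf, mttr):
    """Steady-state availability from mean time to failure and to repair."""
    return mttf / (mttf + mttr)


def reliability(t, mttf):
    """Probability of no failure during an interval of length t,
    assuming a constant failure rate of 1 / mttf."""
    return math.exp(-t / mttf)
```

For instance, a system that is down for one millisecond every hour has an availability above 99.9999 percent, yet it is not reliable over a day, because it never runs for a day without failing.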

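CLASSIFICATION OF FAILURE
• Transient: occurs once and then disappears without anything being done
• Intermittent: appears, vanishes, and then reappears
• Permanent: persists until the faulty component is repaired or replaced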

FAULT TOLERANCE MECHANISM IN DISTRIBUTED SYSTEM
• Replication based fault tolerance technique
• Process level redundancy technique
• Fusion based redundancy technique

REPLICATION BASED FAULT TOLERANCE TECHNIQUE
• Replicate the data on another machine, so that the failure of one machine does not cause the whole system to stop
• Replicate the data on different servers

• Problems of replication
• Consistency: the major problem of replication, because any client may update a replica. Consistency of the data is ensured by a consistency model such as sequential or causal consistency
• Degree of replication: a large number of replicas is needed to achieve high fault tolerance
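One widely used way to keep replicas consistent while tolerating failures is a majority quorum with version numbers. The slides do not name this technique; the sketch below is an assumption chosen for illustration, and `Replica` and `QuorumStore` are hypothetical names. It is a single-threaded simulation and does not handle concurrent writers.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None
        self.up = True


class QuorumStore:
    """Reads and writes each touch a majority of replicas, so any read
    overlaps the latest write in at least one replica."""

    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.quorum = n // 2 + 1

    def _majority(self):
        live = [r for r in self.replicas if r.up]
        if len(live) < self.quorum:
            raise RuntimeError("not enough live replicas for a majority")
        return live[:self.quorum]

    def write(self, value):
        chosen = self._majority()
        version = max(r.version for r in chosen) + 1
        for r in chosen:
            r.version, r.value = version, value

    def read(self):
        chosen = self._majority()
        # The replica with the highest version holds the latest write.
        return max(chosen, key=lambda r: r.version).value
```

With n replicas this scheme keeps working while fewer than half of them are down, which shows why a higher degree of replication buys more fault tolerance at the price of more machines.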

PROCESS LEVEL REDUNDANCY TECHNIQUES
• Faults that disappear without anything being done are called transient faults. This type of fault is hard to identify
• Software-based fault tolerance techniques are used to handle transient faults
• Process level redundancy (PLR) compares the outputs of redundant processes to ensure correct execution
• Checkpoint and rollback is a popular technique in which the current state of the system is saved so that it can be restored after a fault
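Both ideas can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `plr_vote` compares the final outputs of redundant copies by majority vote (outputs must be hashable), and `Checkpointed` keeps a single in-memory checkpoint. Both names are hypothetical, and real PLR and checkpointing systems work at a much lower level.

```python
import copy
from collections import Counter


def plr_vote(outputs):
    """Return the majority output of redundant process copies."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among redundant outputs")
    return value


class Checkpointed:
    """Holds a state plus one saved copy that can be restored."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        # Save the current state.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and return to the last checkpoint.
        self.state = copy.deepcopy(self._saved)
```

If one of three copies is hit by a transient fault, `plr_vote([42, 42, 7])` still yields the value the two healthy copies agree on, and a process that detects corruption can call `rollback()` and re-execute from the saved state.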

FUSION BASED TECHNIQUE
• Replication: the downside is that multiple backups increase cost
• The fusion based technique addresses this problem because it requires fewer backups
• Backup machines are fused over a given set of machines (NP- Problem)
• The fusion based technique has very high overhead during recovery, so it is acceptable when the probability of a fault in the system is low
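The trade-off between fewer backups and costlier recovery can be illustrated with XOR parity, used here as a stand-in for a fused backup. This is an assumption for illustration only: actual fusion based schemes use more general codes, and `fuse` and `recover` are hypothetical names. Each primary's state is modeled as a non-negative integer.

```python
from functools import reduce


def fuse(primaries):
    """Build one fused backup as the XOR of all primary states."""
    return reduce(lambda a, b: a ^ b, primaries, 0)


def recover(fused, survivors):
    """Rebuild the single lost primary from the backup and all survivors."""
    return reduce(lambda a, b: a ^ b, survivors, fused)
```

One fused backup covers all primaries against a single crash, where plain replication would need one backup per primary. Recovery, however, has to read every surviving primary, which reflects the high recovery overhead noted above.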
