Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
Multiprocessor architecture
Taxonomy of Parallel Architectures
Parallel computing is a computing where the jobs are
broken into discrete parts that can be executed concurrently.
Each part is further broken down to a series of instructions.
Instructions from each part execute simultaneously on
different CPUs.
Parallel systems deal with the simultaneous use of multiple
computer resources that can include a single computer with
multiple processors, a number of computers connected by a
network to form a parallel processing cluster or a combination
of both.
Parallel systems are more difficult to program than computers
with a single processor because the architecture of parallel
computers varies accordingly and the processes of multiple
CPUs must be coordinated and synchronized.
The crux of parallel processing are CPUs. Based on the
number of instruction and data streams that can be
processed simultaneously, computing systems are classified
into four major categories:
 1|P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
Flynn’s classification –
 1. Single-instruction, single-data (SISD) systems –
 An SISD computing system is a uniprocessor machine
 which is capable of executing a single instruction,
 operating on a single data stream. In SISD, machine
 instructions are processed in a sequential manner and
 computers adopting this model are popularly called
 sequential computers. Most conventional computers
 have SISD architecture. All the instructions and data to
 be processed have to be stored in primary memory.
 2|P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 The speed of the processing element in the SISD
 model is limited(dependent) by the rate at which the
 computer can transfer information internally. Dominant
 representative SISD systems are IBM PC,
 workstations.
 2. Single-instruction, multiple-data (SIMD) systems
 –
 An SIMD system is a multiprocessor machine capable
 of executing the same instruction on all the CPUs but
 operating on different data streams. Machines based
 on an SIMD model are well suited to scientific
 computing since they involve lots of vector and matrix
 operations. So that the information can be passed to
 all the processing elements (PEs) organized data
 elements of vectors can be divided into multiple
 sets(N-sets for N PE systems) and each PE can process
 one data set.
 Dominant representative SIMD systems is Cray’s
 vector processing machine.
 3|P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 3. Multiple-instruction, single-data (MISD) systems
 –
 An MISD computing system is a multiprocessor
 machine capable of executing different instructions on
 different PEs but all of them operating on the same
 dataset .
 Example Z = sin(x)+cos(x)+tan(x)
 The system performs different operations on the same
 data set. Machines built using the MISD model are not
 useful in most of the application, a few machines are
 built, but none of them are available commercially.
 4. Multiple-instruction, multiple-data (MIMD)
 systems –
 An MIMD system is a multiprocessor machine which is
 capable of executing multiple instructions on multiple
 data sets. Each PE in the MIMD model has separate
 instruction and data streams; therefore machines built
 using this model are capable to any kind of application.
 Unlike SIMD and MISD machines, PEs in MIMD
 machines work asynchronously.
 4|P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 MIMD machines are broadly categorized into shared-
 memory MIMD and distributed-memory
 MIMD based on the way PEs are coupled to the main
 memory.
 In the shared memory MIMD model (tightly coupled
 multiprocessor systems), all the PEs are connected to
 a single global memory and they all have access to it.
 The communication between PEs in this model takes
 place through the shared memory, modification of the
 data stored in the global memory by one PE is visible
 to all other PEs. Dominant representative shared
 memory MIMD systems are Silicon Graphics machines
 and Sun/IBM’s SMP (Symmetric Multi-Processing).
 In Distributed memory MIMD machines (loosely
 coupled multiprocessor systems) all PEs have a local
 memory. The communication between PEs in this
 model takes place through the interconnection network
 (the inter process communication channel, or IPC). The
 network connecting PEs can be configured to tree,
 mesh or in accordance with the requirement.
 The shared-memory MIMD architecture is easier to
 program but is less tolerant to failures and harder to
 extend with respect to the distributed memory MIMD
 model. Failures in a shared-memory MIMD affect the
 entire system, whereas this is not the case of the
 distributed model, in which each of the PEs can be
 easily isolated. Moreover, shared memory MIMD
 architectures are less likely to scale because the
 addition of more PEs leads to memory contention. This
 5|P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 is a situation that does not happen in the case of
 distributed memory, in which each PE has its own
 memory. As a result of practical outcomes and user’s
 requirement , distributed memory MIMD architecture is
 superior to the other existing models.
Centralized shared memory architecture
 • The use of large multilevel caches can substantially
 reduce memory bandwidth demands of a processor.
 • This has made it possible for several
 (micro)processors to share the same memory
 through a shared bus.
 • Caching supports both private and shared data.
 o For private data, once cached, it's
 treatment is identical to that of a
 uniprocessor.
 o For shared data, the shared value may be
 replicated in many caches.
 Replication has several advantages:
 •
 • Reduced latency and memory bandwidth requirements.
 • Reduced contention for data items that are read by
 multiple processors simultaneously.
 • However, it also introduces a problem: Cache
 coherence .
 6|P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
Cache Coherence
 • With multiple caches, one CPU can modify
 memory at locations that other CPUs have cached.
 For example:
 •
 • CPU A reads location x, getting the value N .
 • Later, CPU B reads the same location, getting the
 value N .
 • Next, CPU A writes location x with the value N - 1 .
 • At this point, any reads from CPU B will get the
 value N , while reads from CPU A will get the value N -
 1.
 • This problem occurs both with write-
 through caches and (more seriously) with write-
 back caches.
 • Cache coherence : informal definition:
 o A memory system is coherent if any read
 of a data item returns the most recently
 written value of that data item.
 • Upon closer inspection, there are several aspects
 that need to be addressed.
Cache Coherence
 • Coherence defines what values can be returned by a
 read.
 7|P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 A memory system is coherent if:
 •
 • Read after write works for a single processor.
 o If CPU A writes N to location X, all
 future reads of location X will return N
 if no other processor writes location X
 after CPU A.
 • Other processors' writes eventually propagate.
 o If CPU A writes value N to location X,
 CPU B will eventually be able to read
 value N from location X.
 o Once it does so, it will continue to read
 value N until location X is written
 again.
 o This is our intuitive notion of a coherent
 view of memory.
Cache Coherence
 • Writes to a single location are serialized.
 o If CPUs A and B both write to location
 X, all processors see the same order of the
 writes.
 o This does not mean that all reads must
 return the same value.
 8|P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 ▪ If value N1 is written "first" to
 location X, followed closely by
 reads of X and a write of X with
 value N2, some reads may
 return N1 and some N2.
 ▪ However, a processor that reads
 N2 will return N2 for all future
 reads.
 • Consistency :
 o This indicates when a modification to
 memory is seen by other processors (i.e.
 will be returned by a read).
 o Clearly, this can NOT be
 "instantaneous" since it may be that the
 new value has not even left the processor
 when a read occurs.
Cache Coherence
 • Consistency :
 o The issue of when a written value MUST
 be seen by a reader is defined by a
 memory consistency model.
 o For now, let's assume that a write is not
 complete until all processors have "seen"
 the effect of the write.
 9|P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 o Also, assume that a processor may not
 reorder memory accesses to move reads
 before an outstanding write.
 ▪ Reads can be reordered,
 but reads and writes can not be
 interchanged.
 Coherent caches provide both:
 •
 • Replication of shared data items (reduces latency and
 contention).
 o Here, the purpose is to provide multiple
 copies of data so that several processors
 can access a single piece of memory
 without serialization.
 • Migration of data items (reduces latency).
 o Data items are moved from one processor
 to another as needed.
Cache-Coherence Protocols
 • Small-scale multiprocessor use hardware
 mechanisms to track the state of data blocks that
 are shared.
 Two classes of protocols:
 •
 • Directory based.
 10 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 o The sharing status of a block of physical
 memory is kept in one location (the
 directory).
 • Snooping.
 o The sharing status is distributed and
 kept with the block in each cache.
 o The caches are usually on a shared
 memory bus.
 ▪ The cache controllers snoop the
 bus to watch for transactions
 that occur on data blocks that
 they hold.
Bus Snooping Protocols
 • Write invalidate.
 o It is the most common protocol, both for
 snooping and for directory schemes.
 o The basic idea behind this protocol is that
 writes to a location invalidate other
 caches' copies of the block.
 ▪ Reads by other processors on
 invalidated data cause cache
 misses.
 11 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 ▪ If two processors write at the
 same time, one wins and obtains
 exclusive access.
 Conten Conten Conten
 Process Bus ts of ts of ts of
 or activit CPU CPU mem
 activity y A's B's locatio
 cache cache nX
 CPU A Cache
 0 0
 reads X miss
 CPU B Cache
 0 0 0
 reads X miss
 CPU A Invalida
 1 0
 writes 1 te
 CPU B Cache
 1 1 1
 reads X miss
 ▪ This example assumes a write-
 back cache.
Bus Snooping Protocols
 • Write broadcast (write update).
 o An alternative is to update all cached
 copies of the data item when it is written.
 o To reduce bandwidth requirements, this
 protocol keeps track of whether or not a
 word in the cache is shared.
 12 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 ▪ If not, no broadcast is
 necessary.
 Conten Conten Conten
 Process Bus ts of ts of ts of
 or activit CPU CPU mem
 activity y A's B's locatio
 cache cache nX
 CPU A Cache
 0 0
 reads X miss
 CPU B Cache
 0 0 0
 reads X miss
 CPU A Broadca
 1 1 1
 writes 1 st
 CPU B
 1 1 1
 reads X
 ▪ This example also assumes
 a write-back cache.
Performance Differences between Bus Snooping Protocols
 • Write invalidate is much more popular.
 This is due primarily to the performance
 •
 differences.
 • Multiple writes to the same word with no intervening
 reads require multiple broadcasts.
 • With multiword cache blocks, each word written requires
 a broadcast.
 13 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 For write invalidate, the first word
 o
 written invalidates.
 o Also write invalidate works on blocks ,
 while write broadcast must work on
 individual words or bytes.
 • The delay between writing by one processor and reading
 by another is lower in the write broadcast scheme.
 o For write invalidate, the read causes a
 miss.
 • Since bus and memory bandwidth are more
 important in a bus-based multiprocessor, write
 invalidation performs better.
 • Therefore, we focus on implementation of the write
 invalidate protocol.
Implementation of Write Invalidate Protocols
 • Write invalidate is simple in bus-based schemes.
 o Acquire the bus and broadcast the
 address to be invalidated.
 • Since all processors snoop the bus, they can check
 the address against items in their cache.
 • Bus acquisition also serializes write operations to
 the same memory location.
 o Writes to a shared data item cannot
 complete until the bus is acquired.
 14 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 • What about locating a data item when a cache miss
 occurs ?
 o For write-through , it's in memory.
 o For write-back , snooping can be used.
 ▪ If a processor finds that it has a
 dirty copy of the requested cache
 block, it provides the block
 instead of memory.
 • Note, write-back caches are greatly preferred in
 a multiprocessor environment since they reduce
 memory bandwidth.
Implementation of Write Invalidate Protocol on Write-Back caches.
 • Writes are the issue here.
 • We would like to know if any other caches contain
 the block to be written by a processor.
 o If there are none, then the write
 need not be placed on the bus.
 o This reduces the time to complete the
 write and reduces memory bandwidth.
 • This can be tracked by adding an extra state bit (in
 addition to the valid and dirty bits) that indicates if
 the block is shared.
 15 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 o If the bit is set (the block is shared), the
 cache generates an invalidation on the
 bus and marks the block as private.
 o If another processor later requests the
 block, the miss is snooped and the
 "owner" sets the state bit to shared.
Implementation of Write Invalidate Protocol on Write-Back caches.
 • Note that every bus transaction checks cache-
 address tags.
 o This could potentially interfere with
 CPU cache access.
 This interference can be reduced by:
 •
 • Duplicating the tags.
 o Bus access can proceed in parallel with
 CPU access.
 o On misses, the processor must arbitrate
 for and update both sets of tags.
 ▪ The same is true for the snoop
 (to perform an invalidate or to
 update the shared bit).
 o However, a snoop may require fetching a
 block.
 ▪ This is the only instance that
 may cause a stall.
 16 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
Implementation of Write Invalidate Protocol on Write-Back caches.
 • Employing a multilevel cache with inclusion.
 o Every entry in L1 is in L2.
 ▪ Therefore, snooping can be
 directed to L2, where there are
 fewer processor accesses.
 o If a snoop gets a hit, then it must
 arbitrate for L1 to update state and
 possibly retrieve data.
 ▪ This usually stalls the processor.
 o Since it is popular to use multi-level
 caches in multiprocessors (to reduce
 memory bandwidth), this solution is
 usually adopted.
 o It is also possible to duplicate the tags in
 L2 to further reduce contention.
An Example Centralized Shared-Memory Snooping Protocol
 • Implemented by incorporating a finite state
 controller in each node.
 • The controller responds to requests from the
 processor and bus:
 To simplify the controller, write hits and write
 misses to shared blocks are treated as write misses.
 17 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 Request Source Function
 Read hit Processor Read data in cache.
 Write hit Processor Write data in cache.
 Read Request data from cache or
 Bus
 miss memory.
 Write Request data from cache or memory
 Bus
 miss (perform any needed invalidates).
 o This causes processors with copies to
 invalidate them.
An Example Centralized Shared-Memory Snooping Protocol
 • Write invalidation and a write-back cache
 assumed:
An Example Centralized Shared-Memory Snooping Protocol
These state transitions have no analog in a uniprocessor cache
controller.
 18 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
An Example Centralized Shared-Memory Snooping Protocol
 Complications we have ignored:
 •
 • Assumes that operations are atomic .
 o In reality, a write miss is not atomic --
 just too much work to do.
 o Also, read misses on a split transaction
 bus are not atomic.
 • Nonatomic actions introduce the possibility that the
 protocol can deadlock .
 o See Appendix E for a fix.
 Two major simplifications:
 •
 • Real protocols distinguish between write hits and write
 misses.
 19 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 From the shared state, a write
 o
 miss would require the action shown
 previously.
 o However, a write hit does not require
 that the data be fetched since it is up-to-
 date.
 ▪ All that is needed is an
 invalidate operation.
 • Real protocols distinguish between shared and clean data
 in exactly one cache.
 o A "clean and private" state eliminates
 the need to generate a bus transaction on
 a write to a "clean and private" block.
Architecture of Distributed Shared
Memory(DSM)
Distributed Shared Memory (DSM) implements the distributed systems
shared memory model in a distributed system, that hasn’t any physically
shared memory. Shared model provides a virtual address area shared
between any or all nodes.
To beat the high forged of communication in distributed system. DSM
memo, model provides a virtual address area shared between all nodes.
systems move information to the placement of access. Information moves
between main memory and secondary memory (within a node) and
between main recollections of various nodes.
Every Greek deity object is in hand by a node. The initial owner is that the
node that created the object. possession will amendment as the object
moves from node to node. Once a method accesses information within the
shared address space, the mapping manager maps shared memory address
to physical memory (local or remote).
 20 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
DSM permits programs running on separate reasons to share information while
not the software engineer having to agitate causation message instead
underlying technology can send the messages to stay the DSM consistent
between compute. DSM permits programs that wont to treat constant laptop to
be simply tailored to control on separate reason. Programs access what seems to
them to be traditional memory.
Hence, programs that Pine Tree State DSM square measure sometimes shorter
and easier to grasp than programs that use message passing. But, DSM isn’t
appropriate for all things. Client-server systems square measure typically less
suited to DSM, however, a server is also wont to assist in providing DSM
practicality for information shared between purchasers.
Architecture of Distributed Shared Memory (DSM) :
Every node consists of 1 or additional CPU’s and a memory unit. High-speed
communication network is employed for connecting the nodes. A straightforward
 21 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
message passing system permits processes on completely different nodes to
exchange one another.
Memory mapping manager unit :
Memory mapping manager routine in every node maps the native memory onto
the shared computer storage. For mapping operation, the shared memory house
is divided into blocks.
Information caching may be a documented answer to deal with operation
latency. DMA uses information caching to scale back network latency. the most
memory of the individual nodes is employed to cache items of the shared
memory house.
Memory mapping manager of every node reads its native memory as an
enormous cache of the shared memory house for its associated processors. The
bass unit of caching may be a memory block. Systems that support DSM,
information moves between secondary memory and main memory also as
between main reminiscences of various nodes.
Communication Network Unit :
Once method access information within the shared address house mapping
manager maps the shared memory address to the physical memory. The mapped
layer of code enforced either within the operating kernel or as a runtime routine.
Physical memory on every node holds pages of shared virtual–address house.
Native pages area unit gift in some node’s memory. Remote pages in some other
node’s memory.
Interconnection Network
Interconnection networks are composed of switching elements. Topology
is the pattern to connect the individual switches to other elements, like
processors, memories and other switches. A network allows exchange of
data between processors in the parallel system.
 • Direct connection networks − Direct networks have point-to-point
 connections between neighboring nodes. These networks are
 static, which means that the point-to-point connections are fixed.
 Some examples of direct networks are rings, meshes and cubes.
 • Indirect connection networks − Indirect networks have no fixed
 neighbors. The communication topology can be changed
 dynamically based on the application demands. Indirect networks
 can be subdivided into three parts: bus networks, multistage
 networks and crossbar switches.
 o Bus networks − A bus network is composed of a number of
 bit lines onto which a number of resources are attached.
 When busses use the same physical lines for data and
 22 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 addresses, the data and the address lines are time
 multiplexed. When there are multiple bus-masters attached
 to the bus, an arbiter is required.
 o Multistage networks − A multistage network consists of
 multiple stages of switches. It is composed of ‘axb’ switches
 which are connected using a particular interstage connection
 pattern (ISC). Small 2x2 switch elements are a common
 choice for many multistage networks. The number of stages
 determine the delay of the network. By choosing different
 interstage connection patterns, various types of multistage
 network can be created.
 o Crossbar switches − A crossbar switch contains a matrix of
 simple switch elements that can switch on and off to create
 or break a connection. Turning on a switch element in the
 matrix, a connection between a processor and a memory can
 be made. Crossbar switches are non-blocking, that is all
 communication permutations can be performed without
 blocking.
Evaluating Design Trade-offs in Network
Topology
If the main concern is the routing distance, then the dimension has to be
maximized and a hypercube made. In store-and-forward routing,
assuming that the degree of the switch and the number of links were not
a significant cost factor, and the numbers of links or the switch degree
are the main costs, the dimension has to be minimized and a mesh built.
In worst case traffic pattern for each network, it is preferred to have high
dimensional networks where all the paths are short. In patterns where
each node is communicating with only one or two nearby neighbors, it is
preferred to have low dimensional networks, since only a few of the
dimensions are actually used.
Routing
The routing algorithm of a network determines which of the possible
paths from source to destination is used as routes and how the route
followed by each particular packet is determined. Dimension order
routing limits the set of legal paths so that there is exactly one route from
each source to each destination. The one obtained by first traveling the
correct distance in the high-order dimension, then the next dimension and
so on.
 23 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
Routing Mechanisms
Arithmetic, source-based port select, and table look-up are three
mechanisms that high-speed switches use to determine the output
channel from information in the packet header. All of these mechanisms
are simpler than the kind of general routing computations implemented
in traditional LAN and WAN routers. In parallel computer networks, the
switch needs to make the routing decision for all its inputs in every cycle,
so the mechanism needs to be simple and fast.
Deterministic Routing
A routing algorithm is deterministic if the route taken by a message is
determined exclusively by its source and destination, and not by other
traffic in the network. If a routing algorithm only selects shortest paths
toward the destination, it is minimal, otherwise it is non-minimal.
Deadlock Freedom
Deadlock can occur in a various situations. When two nodes attempt to
send data to each other and each begins sending before either receives,
a ‘head-on’ deadlock may occur. Another case of deadlock occurs, when
there are multiple messages competing for resources within the network.
The basic technique for proving a network is deadlock free, is to clear the
dependencies that can occur between channels as a result of messages
moving through the networks and to show that there are no cycles in the
overall channel dependency graph; hence there is no traffic patterns that
can lead to a deadlock. The common way of doing this is to number the
channel resources such that all routes follow a particular increasing or
decreasing sequences, so that no dependency cycles arise.
Switch Design
Design of a network depends on the design of the switch and how the
switches are wired together. The degree of the switch, its internal routing
mechanisms, and its internal buffering decides what topologies can be
supported and what routing algorithms can be implemented. Like any
other hardware component of a computer system, a network switch
contains data path, control, and storage.
Ports
The total number of pins is actually the total number of input and output
ports times the channel width. As the perimeter of the chip grows slowly
compared to the area, switches tend to be pin limited.
 24 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
Internal Datapath
The datapath is the connectivity between each of the set of input ports
and every output port. It is generally referred to as the internal cross-bar.
A non-blocking cross-bar is one where each input port can be connected
to a distinct output in any permutation simultaneously.
Channel Buffers
The organization of the buffer storage within the switch has an important
impact on the switch performance. Traditional routers and switches tend
to have large SRAM or DRAM buffers external to the switch fabric, while
in VLSI switches the buffering is internal to the switch and comes out of
the same silicon budget as the datapath and the control section. As the
chip size and density increases, more buffering is available and the
network designer has more options, but still the buffer real-estate comes
at a prime choice and its organization is important.
Flow Control
When multiple data flows in the network attempt to use the same shared
network resources at the same time, some action must be taken to
control these flows. If we don’t want to lose any data, some of the flows
must be blocked while others proceed.
The problem of flow control arises in all networks and at many levels. But
it is qualitatively different in parallel computer networks than in local and
wide area networks. In parallel computers, the network traffic needs to
be delivered about as accurately as traffic across a bus and there are a
very large number of parallel flows on very small-time scale.
Memory Consistency Model
 • A consistency model is contract between a distributed
 data store and processes, in which the processes agree to
 obey certain rules in contrast the store promises to work
 correctly.
 • A consistency model basically refers to the degree of
 consistency that should be maintained for the shared
 memory data.
 • If a system supports the stronger consistency model, then
 the weaker consistency model is automatically supported
 but the converse is not true.
 25 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 • The types of consistency models are Data-Centric and
 client centric consistency models.
1.Data-Centric Consistency Models
A data store may be physically distributed across multiple
machines. Each process that can access data from the store is
assumed to have a local or nearby copy available of the entire
store.
i.Strict Consistency model
 • Any read on a data item X returns a value corresponding
 to the result of the most recent write on X
 • This is the strongest form of memory coherence which has
 the most stringent consistency requirement.
 • Strict consistency is the ideal model but it is impossible to
 implement in a distributed system. It is based on absolute
 global time or a global agreement on commitment of
 changes.
ii.Sequential Consistency
 • Sequential consistency is an important data-centric
 consistency model which is a slightly weaker consistency
 model than strict consistency.
 • A data store is said to be sequentially consistent if the
 result of any execution is the same as if the (read and
 write) operations by all processes on the data store were
 executed in some sequential order and the operations of
 each individual process should appear in this sequence in
 a specified order.
 • Example: Assume three operations read(R1), write(W1),
 read(R2) performed in an order on a memory address.
 Then (R1,W1,R2),(R1,R2,W1),(W1,R1,R2)(R2,W1,R1) are
 acceptable provided all processes see the same ordering.
iii.Linearizability
 • It that is weaker than strict consistency, but stronger than
 sequential consistency.
 26 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 • A data store is said to be linearizable when each operation
 is timestamped and the result of any execution is the
 same as if the (read and write) operations by all
 processes on the data store were executed in some
 sequential order
 • The operations of each individual process appear in
 sequence order specified by its program.
 • If tsOP1(x)< tsOP2(y), then operation OP1(x) should
 precede OP2(y) in this sequence.
iv.Causal Consistency
 • It is a weaker model than sequential consistency.
 • In Casual Consistency all processes see only those
 memory reference operations in the correct order that are
 potentially causally related.
 • Memory reference operations which are not related may
 be seen by different processes in different order.
 • A memory reference operation is said to be casually
 related to another memory reference operation if the first
 operation is influenced by the second operation.
 • If a write(w2) operation is casually related to another
 write (w1) the acceptable order is (w1, w2).
v.FIFO Consistency
 • It is weaker than causal consistency.
 • This model ensures that all write operations performed by
 a single process are seen by all other processes in the
 order in which they were performed like a single process
 in a pipeline.
 • This model is simple and easy to implement having good
 performance because processes are ready in the pipeline.
 • Implementation is done by sequencing write operations
 performed at each node independently of the operations
 performed on other nodes.
 • Example: If (w11) and (w12) are write operations
 performed by p1 in that order and (w21),(w22) by p2. A
 process p3 can see them as [(w11,w12),(w21,w2)] while
 p4 can view them as [(w21,w2),(w11,w12)].
 27 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
vi.Weak consistency
 • The basic idea behind the weak consistency model is
 enforcing consistency on a group of memory reference
 operations rather than individual operations.
 • A Distributed Shared Memory system that supports the
 weak consistency model uses a special variable called a
 synchronization variable which is used to synchronize
 memory.
 • When a process accesses a synchronization variable, the
 entire memory is synchronized by making visible the
 changes made to the memory to all other processes.
vii.Release Consistency
 • Release consistency model tells whether a process is
 entering or exiting from a critical section so that the
 system performs either of the operations when a
 synchronization variable is accessed by a process.
 • Two synchronization variables acquire and release are
 used instead of single synchronization variable. Acquire is
 used when process enters critical section and release is
 when it exits a critical section.
 • Release consistency can be viewed as synchronization
 mechanism based on barriers instead of critical sections.
viii.Entry Consistency
 • In entry consistency every shared data item is associated
 with a synchronization variable.
 • In order to access consistent data, each synchronization
 variable must be explicitly acquired.
 • Release consistency affects all shared data but entry
 consistency affects only those shared data associated with
 a synchronization variable.
2.Client-Centric Consistency Models
 28 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 • Client-centric consistency models aim at providing a
 system wide view on a data store.
 • This model concentrates on consistency from the
 perspective of a single mobile client.
 • Client-centric consistency models are generally used for
 applications that lack simultaneous updates were most
 operations involve reading data.
i.Eventual Consistency
 • In Systems that tolerate high degree of inconsistency, if
 no updates take place for a long time all replicas will
 gradually and eventually become consistent. This form of
 consistency is called eventual consistency.
 • Eventual consistency only requires those updates that
 guarantee propagation to all replicas.
 • Eventual consistent data stores work fine as long as
 clients always access the same replica.
 • Write conflicts are often relatively easy to solve when
 assuming that only a small group of processes can
 perform updates. Eventual consistency is therefore often
 cheap to implement.
ii.Monotonic Reads Consistency
 • A data store is said to provide monotonic-read consistency
 if a process reads the value of a data item x, any
 successive read operation on x by that process will always
 return that same value or a more recent value.
 • A process has seen a value of x at time t, it will never see
 an older version of x at a later time.
 • Example: A user can read incoming mail while moving.
 Each time the user connects to a different e-mail server,
 that server fetches all the updates from the server that
 the user previously visited. Monotonic Reads guarantees
 that the user sees all updates, no matter from which
 server the automatic reading takes place.
iii.Monotonic Writes
 29 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 • A data store is said to be monotonic-write consistent if a
 write operation by a process on a data item x is completed
 before any successive write operation on X by the same
 process.
 • A write operation on a copy of data item x is performed
 only if that copy has been brought up to date by means of
 any preceding write operations, which may have taken
 place on other copies of x.
 • Example: Monotonic-write consistency guarantees that if
 an update is performed on a copy of Server S, all
 preceding updates will be performed first. The resulting
 server will then indeed become the most recent version
 and will include all updates that have led to previous
 versions of the server.
iv.Read Your Writes
 • A data store is said to provide read-your-writes
 consistency if the effect of a write operation by a process
 on data item x will always be a successive read operation
 on x by the same process.
 • A write operation is always completed before a successive
 read operation by the same process no matter where that
 read operation takes place.
 • Example: Updating a Web page and guaranteeing that the
 Web browser shows the newest version instead of its
 cached copy.
v.Writes Follow Reads
 • A data store is said to provide writes-follow-reads
 consistency if a process has write operation on a data
 item x following a previous read operation on x then it is
 guaranteed to take place on the same or a more recent
 value of x that was read.
 • Any successive write operation by a process on a data
 item x will be performed on a copy of x that is up to date
 with the value most recently read by that process.
 • Example: Suppose a user first reads an article A then
 posts a response B. By requiring writes-follow-reads
 30 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 consistency, B will be written to any copy only after A has
 been written.
Cluster Computers
Cluster is a set of loosely or tightly connected computers working together as
a unified computing resource that can create the illusion of being one
machine. Computer clusters have each node set to perform the same task,
controlled and produced by software.
The components of a clusters are usually connected to each other using fast
area networks, with each node running its own instance of an operating
system. In most circumstances, all the nodes uses same hardware and the
same operating system, although in some setups different hardware or
different operating system can be used.
Types of Clusters –
Computer Clusters are arranged together in such a way so as to support
different purpose from general purpose business needs such as web-service
support, to computation intensive scientific calculation. Basically there are
three types of Clusters, they are:
 • Load-Balancing Cluster – A cluster requires an effective capability
 for balancing the load among available computers. In this, cluster
 nodes share computational workload so as to enhance the overall
 performance. For example- a high-performance cluster used for
 31 | P a g e
Contact: 7008443534, 9090042626
Subject: Computer System Architecture
Created By: Asst. Prof. SK ABDUL ISRAR College: ABA, BLS
 scientific calculation would balance load from different algorithms
 from the web-server cluster, which may just use a round-robin
 method by assigning each new request to a different node.This type
 of cluster is used on farms of Web servers (web farm).
 • Fail-Over Clusters – The function of switching applications and
 data resources over from a failed system to an alternative system in
 the cluster is referred to as fail-over. These types are used to
 cluster database of critical mission, mail, file and application
 servers
 • High-Availability Clusters – These are also known as “HA
 clusters”. They offer a high probability that all the resources will be
 in service. If a failure does occur, such as a system goes down or a
 disk volume is lost, then the queries in progress are lost. Any lost
 query, if retried, will be serviced by different computer in the cluster.
 This type of cluster is widely used in web, email, news or FTP
 servers.
Benefits –
 • Absolute scalability – It is possible to create a large clusters that
 beats the power of even the largest standalone machines. A cluster
 can have dozens of multiprocessor machines.
 • Additional scalability – A cluster is configured in such a way that it
 is possible to add new systems to the cluster in small increment.
 Clusters have the ability to add systems horizontally. This means
 that more computers may be added to the clusters to improve its
 performance, redundancy and fault tolerance(the ability for a
 system to continue working with a malfunctioning of node).
 • High availability – As we know that each node in a cluster is a
 standalone computer, the failure of one node does not mean loss of
 service. A single node can be taken down for maintenance, while
 the rest of the clusters takes on the load of that individual node.
 • Preferable price/performance – Clusters are usually set up to
 improve performance and availability over single computers, while
 typically being much more cost effective than single computers of
 comparable speed or availability.
 32 | P a g e
Contact: 7008443534, 9090042626