01 - Introduction to Distributed Systems

Distributed Systems – An Introduction
CS4262 Distributed Systems
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides extracted from Dr. Srinath Perera & Dr. Rajkumar Buyya’s presentation deck

What is a Distributed System?
Source: http://people.reed.edu/~jimfix/442ds/

What is a Distributed System?
Source: www.cs.bham.ac.uk/~nagarajs/courses/DSS/

What is a Distributed System?
“A distributed system is one on which I cannot get any work done because some machine I have never heard of has crashed.” - Leslie Lamport

What is a Distributed System?
- “A system in which hardware or software components located at networked computers communicate & coordinate their actions only by message passing.” - [Coulouris]
- “A distributed system is a collection of independent computers that appear to the users of the system as a single coherent system.” - [Tanenbaum]

Key Characteristics
- Many independent computers
- Communicate only via message passing
- Coordinate/communicate to fulfill a common goal

Characteristics of Distributed Systems

Question
Which of the following is true?
a) Horizontal scaling is used in enterprise applications
b) Vertical scaling is used in cloud computing
c) Elasticity is the ability to scale in & out
d) Stateful servers/components are easier to scale

Why Distributed Systems?
- Some systems are inherently distributed
  - e.g., file sharing, mobile social networks, vehicles
- Some problems are too big for a single system
- For scalability
  - e.g., Google, Amazon, YouTube
- For better QoS/QoE
  - e.g., YouTube, Google
- For reliability
  - e.g., Google, Amazon
- For specific demographics
  - e.g., Amazon, Yahoo, Google, CircInfo
- Economic reasons
  - Resource sharing, Cloud
- System evolution
  - Systems become heterogeneous over time

Applications – Online Stores
- Many items
  - List of all items, with their specs
  - Index items by many dimensions & support search
- Many sellers
- Many buyers
  - Buyers' behaviour
- Support checkout, delivery tracking, returns, ratings, & complaints
- Supported by partitioning sellers/items across many nodes
- Business analytics
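The partitioning of sellers/items across nodes can be sketched with simple hash partitioning. This is a minimal illustration, not how any particular store does it: the node names, the keys, and the `node_for` helper are all hypothetical.

```python
import hashlib


def node_for(key: str, nodes: list[str]) -> str:
    """Map a key to one node using a stable hash (illustrative scheme only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then pick a node.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


nodes = ["node-0", "node-1", "node-2"]  # hypothetical node names
keys = ["item-1001", "item-1002", "seller-42", "seller-43"]  # hypothetical keys

# Every client that knows the node list computes the same placement,
# so no central directory lookup is needed to find an item or seller.
placement = {key: node_for(key, nodes) for key in keys}
for key, node in placement.items():
    print(f"{key} -> {node}")
```

With plain modulo placement, changing the number of nodes moves most keys to a different node. Schemes such as consistent hashing (used by DHTs) are designed to reduce that movement.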

Applications – Desktop Grids
- Many people volunteer their computing power
- Scientists submit computing jobs to the system
- Broker matches resources with jobs
- Resources run jobs & return results
- Handle failures
- Avoid free riding
- Biggest computers on earth
Source: www.seti.cl/aprendiendo-mas-sobre-boinc-y-la-computacion-voluntaria/

Applications – IoT
- Many sensors
  - Weather, travel, traffic, surveillance, stock exchange, smart grid, production lines
- Monitor, understand, & react to events
- Usually handled with Stream Processing, Complex Event Processing, or custom applications
Source: www.flickr.com/photos/imuttoo/4257813689/ by Ian Muttoo, www.flickr.com/photos/eastcapital/4554220770/, www.flickr.com/photos/patdavid/4619331472/ by Pat David, licensed CC

Applications – Mobile Crowdsourcing
- Modern mobile phones are like a weather centre
  - GPS, barometer, temperature, light, proximity, & moisture sensors
- Get volunteer phones to send sensor data (crowdsourcing)
  - Report on weather
  - Crop diseases (agriculture officials)
  - Epidemics (from hospitals, doctors)
- Use that data to predict weather, crop disease, & epidemic spread
Mobile: Solving the Last Mile Problem
Source: www.fotopedia.com/items/flickr-2548697541, www.geograph.org.uk/photo/1534209, and www.yourbdnews.com/2011/10/17/samsung-files-to-halt-iphone-4s-in-japan-australia/iphone-4s, Licensed CC

Applications – Data Storage & Provenance (Sky Server)
- Telescopes (Square Kilometer Array) keep collecting data from the sky (TBs per day)
- Sky Survey lets scientists see the sky at a given location, as seen at a given time
- Moving data takes a long time; 1 TB takes about
  - 100 Mbps network - 30 hrs
  - 1 Gbps network - 3 hrs
  - 10 Gbps network - 20 minutes
Source: www.geograph.org.uk/photo/103069, Licensed CC; http://cas.sdss.org
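The transfer times above can be sanity-checked with a short calculation. The sketch below computes the ideal lower bound, assuming a decimal terabyte (10^12 bytes) and a link that sustains its full nominal rate with no protocol overhead. The `transfer_seconds` helper is illustrative.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time: total bits divided by the link rate."""
    return size_bytes * 8 / link_bits_per_second


ONE_TB = 1e12  # assumption: decimal terabyte, 10**12 bytes
LINKS = {"100 Mbps": 100e6, "1 Gbps": 1e9, "10 Gbps": 10e9}

for name, rate in LINKS.items():
    seconds = transfer_seconds(ONE_TB, rate)
    print(f"{name}: {seconds / 3600:.1f} hours ({seconds / 60:.0f} minutes)")
# 100 Mbps: 22.2 hours (1333 minutes)
# 1 Gbps: 2.2 hours (133 minutes)
# 10 Gbps: 0.2 hours (13 minutes)
```

The figures quoted on the slide (30 hrs, 3 hrs, 20 minutes) are somewhat higher than these ideal values. That would be consistent with protocol overhead and links that do not sustain their nominal rate, although the slide does not state how its numbers were derived.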

Applications – Theoretical Computer Science
- Concerned with
  - Coordination algorithms
    - Leader election, multicast, distributed locks, barriers, snapshot algorithms
  - Impossibility results, upper & lower bounds
  - Distributed versions of some centralized algorithms
    - e.g., shortest path
- A lot of work done in the 70s laid the groundwork for Distributed Systems
Source: http://www.flickr.com/photos/lodz_na_nowo/5690492370/, http://xkcd.com/384/, http://www.flickr.com/photos/quinnanya/4990131194/sizes/z/in/photostream/, Licensed CC

Parallel vs. Distributed Systems
- Convergence of concepts of parallel & distributed systems
- Differentiation from parallel systems is blurring
  - New middleware
- Extensibility of clusters leads to heterogeneity
  - As new hardware is added

Parallel Systems      Distributed Systems
Tight coupling        Loose coupling
Physical proximity    Server room to global
Homogeneous           Heterogeneous
Threads & MPI         RPC, Web Services, REST

Distributed Systems Timeline/History

Period          Topics
1965-late 70s   Parallel Programming, Self Stabilization, Fault Tolerance, ER Model/Transactions, Time & Clocks
1980s           Consensus & impossibility, SQL, Distributed Snapshots, Replication, Group Communication
Early 90s       Linearizability, Parallel DB, Transactional Memory, RAID, MPI
Late 90s        Volunteer Computing, P2P file sharing, Complex Event Processing
Early 2000s     OceanStore, Web Services, Semantic Web, REST, DHT, Pub/Sub, Grid, Autonomic Computing, Google File System, Virtualization, SOA, MapReduce
2005-2010       Cloud, NoSQL, Mobile Apps, Data Provenance

Design Goals
- Transparency
  - Differences between computers & the way they communicate are hidden from users
- Single System View (SSV/SSI)
  - Users & applications interact with a distributed system in a consistent & uniform way regardless of where or when the interaction takes place
- Scalability
  - Relatively easy to expand or scale
- Availability
  - Continuously available even though certain parts may be temporarily out of order

Basic Design Issues
- Naming
  - Flat, hierarchical
- Communication
  - Dynamic, random, deterministic, unicast, anycast, multicast, broadcast, RPC, HTTP
- System architecture
  - Client-server, hierarchical, random, structured, hybrid
- Software structure
  - API, SOA, ROA (Resource Oriented Architecture), Micro-services
- Workload allocation
  - Static, dynamic
- Consistency maintenance
  - Soft vs. hard

Fallacies of Distributed Systems
- Network is reliable
- Latency is zero
- Bandwidth is infinite
- Transport cost is zero
- Network is secure
- Topology doesn't change
- There is one administrator
- Infrastructure is homogeneous
By Arnon Rotem-Gal-Oz

Distributed System Challenges
- Concurrency
- No global state
- Failures in different elements
- Transparency
- Fault tolerance
- Scalability
- Heterogeneity
- Security
- Openness

1. Concurrency Challenges
- Every software or hardware component
  - Is autonomous, enables resource sharing, & synchronizes & coordinates via message passing
- A & B are concurrent if either A can happen before B, or B can happen before A
- Typical problems
  - Deadlocks
  - Unreliable communication
- Provide & manage concurrent access to shared resources
- Preserve dependencies, e.g., distributed transactions
- Fair scheduling

2. Lack of Global State
- Absence of a global state
  - Typically no single process has knowledge of the current global state of the system
  - Due to concurrency & message-passing communication
- Absence of a global clock
  - Hard to say who’s first
  - There are limits on the precision with which processes in a distributed system can synchronize their clocks
  - However, the problem can now be addressed in some applications with GPS timestamps
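Without a global clock, one well-known way to order events is a logical clock in the style of Lamport. The slides do not describe this algorithm, so the sketch below is only an illustration of the idea; the `LamportClock` class and its method names are assumptions made for the example.

```python
class LamportClock:
    """Logical clock: a counter that respects 'sent before received'."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Jump past the sender's timestamp so the receive is ordered after the send."""
        self.time = max(self.time, message_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()                     # local event on A, A's clock becomes 1
stamp = a.send()             # A sends a message stamped 2
b.tick()                     # unrelated local event on B, B's clock becomes 1
received = b.receive(stamp)  # B's clock becomes max(1, 2) + 1 = 3
print(stamp, received)       # 2 3
```

The timestamps only guarantee that a send is ordered before its matching receive. Two events with no message path between them may get timestamps in either order, which reflects the "hard to say who's first" point above.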

3. Failures in Different Elements
- Failures are more common than in centralized systems
- Processes run autonomously, in isolation
  - Failures of individual processes may remain undetected
  - Individual processes may be unaware of failures in the system context
- Network failures isolate processes & partition the system
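A common way to notice a process that has failed silently is a heartbeat with a timeout. The sketch below is a minimal, single-machine illustration: the `HeartbeatMonitor` class, the process names, and the 5-second timeout are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Suspect a process if no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, process: str, now: float) -> None:
        """Record that `process` reported in at time `now`."""
        self.last_seen[process] = now

    def suspected(self, now: float) -> list[str]:
        """Return the processes whose last heartbeat is older than the timeout."""
        return sorted(
            process
            for process, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=5.0)  # hypothetical timeout
monitor.heartbeat("p1", now=0.0)
monitor.heartbeat("p2", now=0.0)
monitor.heartbeat("p1", now=4.0)         # p1 keeps reporting, p2 goes quiet
print(monitor.suspected(now=6.0))        # ['p2']
```

A suspected process is not necessarily dead. It may be slow, or cut off by a network partition, and the monitor cannot tell these cases apart from a crash.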

4. Transparency
- Present system to users & applications as a single computer system
- Hides the fact that resources are physically distributed across multiple computers
- There’s a trade-off between a high degree of transparency & system performance
  - Attempting to blindly hide all distribution aspects from users is not always a good idea
- Transparency can be applied to several aspects

Forms of Transparency
- Access transparency
  - Access to local or remote resources is identical
  - e.g., Network File System
- Location transparency
  - Access without knowledge of location
  - e.g., URLs, e-mail addresses
- Failure transparency
  - Tasks can be completed despite failures
  - e.g., retransmission of e-mails, failure of a Web server node should not bring down the website
- Replication transparency
  - Access to replicated resources as if there was just one
  - Provide enhanced reliability & performance

Forms of Transparency (Cont.)
- Migration (mobility/relocation) transparency
  - Movement of resources & clients within a system without affecting the operation of users or applications
  - e.g., switching from one name server to another, migration of a VM from one physical server to another
- Concurrency transparency
  - A process shouldn’t notice that there are others sharing the same resources
- Performance transparency
  - Allows system to be dynamically reconfigured to improve performance as loads vary
- Scaling transparency
  - Allows system & applications to expand in scale without change to system structure or application algorithms

Question(s)
- Migration transparency
  a) Allows access without knowledge of location
  b) Enables multiple instances of resources
  c) Enables the movement of resources
  d) Enables the concealment of faults
- A higher degree of transparency is always desirable
  True / False

5. Fault Tolerance
- Failure
  - When an offered service no longer complies with its specification
- Fault
  - Cause of a failure
- Fault tolerance
  - No failure despite faults
- Failures in distributed systems are partial
  - Some components fail while others continue to function
  - Therefore, handling failure is particularly difficult

Fault Tolerance Mechanisms
- Detecting failures
  - Checksums, heartbeat
- Masking failures
  - Retransmission of corrupt messages, redundancy
- Tolerating failures
  - Exception handling, timeouts, redundancy
- Recovering from failures
  - Snapshots, rollback mechanisms
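Two of the mechanisms above, detecting corruption with a checksum and masking it by retransmission, can be combined in a few lines. This is a sketch under stated assumptions: the framing format (payload followed by a 4-byte CRC-32), the simulated channel, and all function names are made up for the example.

```python
import zlib
from typing import Callable, Optional, Tuple


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 checksum to the payload (illustrative framing)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(data: bytes) -> Optional[bytes]:
    """Return the payload if the checksum matches, otherwise None."""
    payload, checksum = data[:-4], data[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None  # failure detected: the message was corrupted
    return payload


def make_flaky_channel(corrupt_first_n: int) -> Callable[[bytes], bytes]:
    """Simulated channel that corrupts the first N frames it carries."""
    remaining = corrupt_first_n

    def channel(data: bytes) -> bytes:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return bytes([data[0] ^ 0xFF]) + data[1:]  # flip bits in first byte
        return data

    return channel


def send_with_retry(
    payload: bytes, channel: Callable[[bytes], bytes], max_attempts: int = 3
) -> Tuple[bytes, int]:
    """Retransmit until a frame arrives intact; the caller never sees corruption."""
    for attempt in range(1, max_attempts + 1):
        received = unframe(channel(frame(payload)))
        if received is not None:
            return received, attempt
    raise ConnectionError(f"no intact frame after {max_attempts} attempts")


print(send_with_retry(b"hello", make_flaky_channel(corrupt_first_n=1)))
# (b'hello', 2)
```

Retransmission masks the fault only while the retry budget lasts. Once `max_attempts` is exhausted the failure is surfaced to the caller as an exception, which is where exception handling takes over.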

6. Scalability
- At many different scales
  - Number of applications, users, transactions, products to be sold, attributes of products
- Goal is to remain effective when there is a significant increase in the number of resources & users
- 3 different dimensions:
  - Size scalability
    - Limitations due to centralized services, centralized data, centralized algorithms
  - Geographic scalability
    - Unreliable communication, lack of performance guarantees
  - Administrative scalability
    - Conflicting policies for resource usage, security, etc.

Scaling Techniques
- Scalability problems typically appear as performance problems
- 3 basic scaling techniques
  - Hiding communication latencies
  - Distribution
  - Replication

Scalability Concerns
- Cost of physical resources
  - Cost should increase linearly with system size
- Performance loss
  - Finding things in large, distributed systems is hard
  - Look for algorithms with O(log n) cost, where n is the size of the data
- Preventing software resources from running out
  - Numbers used to represent resources, users, services, etc.
  - e.g., IPv4 to IPv6, Y2K problem, Year 2038 problem
- Avoiding performance bottlenecks
- Maintaining global state
  - Difficulty of maintaining up-to-date state
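The "numbers running out" examples above come down to fixed-width integers. The sketch below works through two of them: the Year 2038 problem (seconds since 1970 stored in a signed 32-bit integer) and the IPv4 vs. IPv6 address space. The `wrap_int32` helper is written for this example to imitate 32-bit wraparound.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2**31 - 1  # largest value of a signed 32-bit integer


def wrap_int32(value: int) -> int:
    """Imitate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


# Last second a signed 32-bit time counter can represent.
last_second = EPOCH + timedelta(seconds=MAX_INT32)
print(last_second.isoformat())     # 2038-01-19T03:14:07+00:00

# One second later the counter wraps to a large negative number.
wrapped = wrap_int32(MAX_INT32 + 1)
after_wrap = EPOCH + timedelta(seconds=wrapped)
print(wrapped)                     # -2147483648
print(after_wrap.isoformat())      # 1901-12-13T20:45:52+00:00

# Address space: 32-bit IPv4 addresses vs. 128-bit IPv6 addresses.
ipv4_addresses = 2**32
ipv6_addresses = 2**128
print(ipv4_addresses)              # 4294967296
print(ipv6_addresses // ipv4_addresses == 2**96)  # True
```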

7-9. Other Challenges
- Heterogeneity
  - Heterogeneous components must be able to interoperate
  - Applies to hardware, software, middleware, & protocols
- Security
  - System should only be used in the way intended
  - Distributed resources, networks, & users
  - Distributed authentication, authorization, integrity enforcement, non-repudiation, & accounting are hard
- Openness
  - Interfaces should be publicly available to ease adding new components

Summary
- We use them without realizing they are distributed
- Goals
  - Transparency, single system view, scalability, availability
- Challenges
  - Concurrency, no global state, failures, transparency, fault tolerance, scalability, heterogeneity, security, debugging
