System Design

This document provides a comprehensive overview of system design, covering its importance, types of systems, scalability, performance considerations, and various design patterns. It includes detailed sections on database design, caching strategies, networking, security, monitoring, and fault tolerance, along with real-world examples and interview preparation tips.

1. Introduction to System Design

- Importance and Fundamentals of System Design

- Types of Systems (Monolithic vs Microservices, etc.)

- Scalability and Performance Considerations

2. Designing Large-Scale Systems

- High-Level Design vs Low-Level Design

- Designing for Scalability

- Designing for Fault Tolerance

- Load Balancing and Caching

3. Key Concepts in System Design

- Latency vs Throughput

- CAP Theorem

- Consistency, Availability, Partition Tolerance

- Load Balancing Techniques

- Caching Strategies

- Distributed Systems Principles

4. Database Design

- SQL vs NoSQL Databases

- Database Normalization and Denormalization

- Partitioning and Sharding

- ACID vs BASE Properties

- Consistency Models (Eventual Consistency, Strong Consistency)

5. Design Patterns and Practices


- Microservices Architecture

- Monolithic Architecture

- Event-Driven Design

- CQRS (Command Query Responsibility Segregation)

- SAGA Pattern for Distributed Transactions

6. Data Storage and Management

- Relational Databases (SQL) Design

- NoSQL Databases (Document, Key-Value, Columnar, Graph Databases)

- Indexing and Query Optimization

- Data Replication and Sharding

- Event Sourcing and Data Streams

7. Caching

- Caching Principles

- Types of Caches (Memory Caches, Distributed Caches)

- Cache Invalidation Strategies

- Cache Consistency Models

- Content Delivery Networks (CDNs)

8. Networking and Communication

- HTTP vs WebSocket vs gRPC

- Message Queues (Kafka, RabbitMQ)

- Pub/Sub (Publish/Subscribe)

- REST vs GraphQL APIs

- Load Balancing Strategies


9. Security Considerations

- Authentication and Authorization

- OAuth and JWT Tokens

- Encryption at Rest and In Transit

- Secure APIs and Rate Limiting

10. Monitoring, Logging, and Alerting

- Metrics Collection and Analysis

- Distributed Tracing

- Logging Best Practices

- Alerting and Notifications

- Service-Level Agreements (SLAs)

11. Fault Tolerance and Reliability

- Redundancy and High Availability

- Replication Strategies

- Backup and Recovery Plans

- Graceful Degradation and Failover Mechanisms

12. Performance Optimization

- Bottleneck Analysis

- Load Testing and Stress Testing

- Optimizing Database Queries

- Network Optimization

13. Case Studies and Real-World Examples

- Design a URL Shortener (e.g., Bit.ly)


- Design a Social Media Platform (e.g., Facebook)

- Design a Messaging System (e.g., WhatsApp)

- Design a File Storage System (e.g., Dropbox)

- Design a Video Streaming Platform (e.g., YouTube)

14. Interview Preparation

- Frequently Asked System Design Questions

- Mock Design Exercises

- Evaluating Trade-Offs and Solutions

1. Introduction to System Design


1.1. Importance and Fundamentals of System Design
 What is System Design?
o System Design is the process of defining the architecture and
components of a system to meet given requirements, considering
factors like scalability, reliability, maintainability, and
performance. It involves making high-level decisions about how
the system should function, breaking it into smaller components,
and defining how those components interact.
 Why is System Design Important?
o It's essential for building scalable and reliable systems.
o Helps in designing systems that can handle high traffic, massive
datasets, and numerous users.
o It aids in managing system constraints and trade-offs, such as
latency, bandwidth, and storage requirements.
o Helps optimize resources, minimizing cost and improving user
experience.
 Key Concepts in System Design:
o Scalability: The ability of the system to handle a growing
amount of work or an increase in demand.
o Reliability: The system’s ability to operate without failure and
recover from failure quickly.
o Maintainability: The ease with which the system can be
modified and updated.
o Availability: The system’s ability to be operational and
accessible when needed.
o Performance: The system’s ability to handle a large number of
operations within a specified time frame.
1.2. Types of Systems
 Monolithic Systems
o A single, unified software system where all components are
tightly coupled and work together within a single environment.
o Example: Traditional monolithic web applications, where the front
end, back end, and database are all part of one codebase.
o Advantages: Easier to develop initially, simple deployment.
o Disadvantages: Harder to scale, maintain, and deploy as the
system grows.
 Microservices
o A distributed approach where a system is broken down into
smaller, independent services, each handling a specific business
functionality.
o Advantages: Scalable, independent deployment of services,
flexibility in choosing technology stacks.
o Disadvantages: Complexity in management, potential
communication overhead between services.
 Serverless
o An architecture where you don’t manage servers directly but
instead rely on cloud service providers to handle infrastructure
automatically.
o Advantages: Automatic scaling, reduced operational overhead,
cost-effective for intermittent workloads.
o Disadvantages: Limited control over underlying infrastructure,
possible cold start delays, vendor lock-in.
 Event-Driven Systems
o A system that responds to events and triggers other actions as a
result of those events. Useful for real-time data processing.
o Advantages: Asynchronous processing, real-time
responsiveness, highly scalable.
o Disadvantages: Complexity in managing events, event ordering
and idempotency issues.
1.3. Scalability and Performance Considerations
 Vertical Scaling (Scale-Up)
o Increasing the capacity of a single server, such as adding more
CPUs, memory, or storage.
o Limitations: There’s a limit to how much you can scale
vertically before you hit hardware constraints.
 Horizontal Scaling (Scale-Out)
o Adding more machines to a pool, distributing the workload across
multiple servers or instances.
o Benefits: Scalable to an almost unlimited degree; better fault
tolerance.
o Challenges: Requires load balancing, data distribution, and
replication strategies to maintain consistency and availability.
 Load Balancing
o Distributes incoming network traffic across multiple servers to
ensure no single server becomes a bottleneck.
o Types:
 Round Robin: Distributes requests in a circular manner.
 Least Connections: Routes requests to the server with
the fewest active connections.
 IP Hash: Routes requests based on the client’s IP address.
 Caching
o Storing copies of frequently accessed data in a fast-access
medium (like memory) to reduce the load on databases and
speed up response times.
o Examples: Redis, Memcached
o Cache strategies:
 Write-Through Cache: Data is written to both the cache
and the database simultaneously.
 Write-Back Cache: Data is written to the cache, and then
periodically written to the database.
 Database Partitioning
o Splitting large datasets into smaller, more manageable chunks
called partitions.
o Types:
 Horizontal Partitioning (Sharding): Data is distributed
across multiple servers based on some key (e.g., user ID).
 Vertical Partitioning: Data is split based on columns
(e.g., separate tables for user details and user logs).
1.4. System Design Process
1. Understanding Requirements
o Gather functional (what the system needs to do) and non-
functional requirements (how the system should perform, e.g.,
latency, availability).
2. High-Level Design
o Identify major components of the system and their interactions.
This includes defining services, databases, APIs, load balancers,
and communication methods.
3. Low-Level Design
o Break down high-level components into detailed designs,
defining algorithms, data structures, and class diagrams. This
step includes writing code for individual components.
4. Consider Trade-Offs
o Evaluate different solutions to requirements, weighing factors
like cost, complexity, scalability, and performance. For example,
choosing between SQL and NoSQL, or a monolithic vs
microservice architecture.
5. Validation and Testing
o Testing the system under simulated conditions, checking for
scalability and reliability, and making adjustments as needed.

Unit 2: Scalability and Load Balancing


2.1. Scalability in System Design
 What is Scalability?
Scalability refers to the system's ability to handle increased loads by adding resources,
either by scaling up (vertical scaling) or scaling out (horizontal scaling).
 Types of Scalability:
1. Vertical Scalability (Scale-Up):
Increasing the capacity of a single server (e.g., adding CPU, RAM, or disk space).
Advantages:
 Easier to implement and manage.
 No need for complex architecture changes.
 Disadvantages:
 Limited by hardware capacity.
 A single point of failure.
2. Horizontal Scalability (Scale-Out):
Adding more servers to distribute the load across multiple instances.
 Advantages:
 Near-unlimited scalability, since more machines can be added.
 Fault tolerance and high availability.
 Disadvantages:
 Complex architecture, requires load balancing and
data partitioning.
 When to Scale Vertically vs Horizontally?
o Scale Vertically when the application can work with a single
instance and the demand isn’t expected to increase dramatically.
o Scale Horizontally when the load is expected to grow
significantly, and multiple instances or servers are required.
2.2. Load Balancing
 What is Load Balancing?
Load balancing is the technique used to distribute incoming traffic across multiple servers
or resources to ensure no single server becomes a bottleneck.
 Types of Load Balancers:
1. Hardware Load Balancers:
Physical devices used to distribute traffic, commonly seen in large-scale data
centers. They are robust but expensive.
2. Software Load Balancers:
Managed by software, these load balancers are flexible, cost-effective, and
scalable. Examples include HAProxy, Nginx, and AWS Elastic Load Balancer
(ELB).
 Types of Load Balancing Algorithms:
1. Round Robin:
Distributes requests evenly to each server in the pool. Simple but does not
consider the server's current load.
2. Least Connections:
Directs traffic to the server with the fewest active connections. Useful when the
servers have different capacities or workloads.
3. Weighted Round Robin:
Similar to Round Robin but with weights assigned to each server. Servers with
higher weights receive more traffic.
4. IP Hash:
Requests are distributed based on the IP address of the client, ensuring a client is
always routed to the same server.
5. Random:
Traffic is distributed randomly to servers. Less predictable but easy to implement.
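The first two algorithms above can be sketched in a few lines of Python (server names are hypothetical):

import itertools

class RoundRobinBalancer:
    """Cycles through servers one by one."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Routes each request to the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1       # caller must call release() when the request finishes
        return server

    def release(self, server):
        self.active[server] -= 1

balancer = RoundRobinBalancer(["app1", "app2", "app3"])
print([balancer.pick() for _ in range(5)])   # ['app1', 'app2', 'app3', 'app1', 'app2']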
2.3. Achieving Scalability Through Load Balancing
 Stateless vs. Stateful Load Balancing:
1. Stateless Load Balancing:
Each request is independent, and no session data is stored on the server. It
simplifies scaling, as there is no need to maintain session information across
requests.
2. Stateful Load Balancing:
Requires the load balancer to route requests from the same client to the same
server to maintain session state (sticky sessions).
 Challenges:
 More complex to scale and manage.
 Need for session replication if the server goes down.
 Load Balancing with Microservices:
In a microservices architecture, load balancing can occur at different levels:
1. Service-Level Load Balancing:
Distributes traffic between different instances of the same microservice.
2. API Gateway Load Balancing:
The API gateway distributes requests to different microservices based on the
client’s needs.
2.4. Scaling Databases
 Vertical Database Scaling:
Increasing the capacity of a single database instance (e.g., adding more CPU, memory).
Works well for small systems but becomes limited as traffic grows.
 Horizontal Database Scaling (Sharding):
Distributing data across multiple servers. Each server holds a portion of the database, and
data is partitioned.
o Benefits:
 Improves scalability and fault tolerance.
o Challenges:
 Requires complex data distribution strategies.
 Involves dealing with consistency, partition tolerance, and
data distribution logic.
 Read and Write Splitting:
Distribute read and write operations across different database instances:
o Reads: Handled by replica databases.
o Writes: Handled by the primary database.
2.5. Handling Traffic Surges
 Auto-Scaling:
Automatically adding or removing servers based on the traffic load. Cloud providers
(AWS, GCP, Azure) offer auto-scaling solutions to scale applications dynamically.
 Caching:
Frequently requested data can be stored in memory, reducing the number of requests
hitting the database.
o Examples of Caching Solutions:
 Distributed Caching: Redis, Memcached.
 Content Delivery Networks (CDNs): Cloudflare, Akamai.
2.6. Fault Tolerance and High Availability
 Fault Tolerance:
The ability of a system to continue functioning despite failures in components. Achieved
by having backup resources or redundancy (multiple servers, data replication).
 High Availability:
A system's ability to remain operational and accessible with minimal downtime.
Techniques include:
1. Redundant Systems: Having backups of critical components.
2. Failover: Automatically switching to a standby server or system
if the primary one fails.
3. Geo-Replication: Distributing system replicas across different
data centers or geographic locations.
2.7. Examples of Scalable Systems:
1. Facebook:
Uses a combination of horizontal scaling and load balancing. Traffic is distributed across
millions of servers with a complex caching system to handle billions of active users.
2. Amazon:
Amazon scales its e-commerce platform by partitioning its services into smaller
microservices and using sophisticated load balancing and caching techniques.
3. Netflix:
Netflix uses horizontal scaling and microservices, with auto-scaling infrastructure to
handle the varying loads based on traffic demands. It also uses CDN to deliver content
efficiently.
Unit 3: Caching, Content Delivery, and Data Consistency
3.1. Introduction to Caching
 What is Caching?
o Caching is the process of storing copies of data in temporary
storage locations (called caches) to reduce access times for
frequently accessed data. The main aim of caching is to speed up
data retrieval by storing data closer to the client or server.
 Why is Caching Important?
o Performance: Reduces latency by storing frequently accessed
data in faster storage.
o Cost Efficiency: Minimizes redundant database queries,
reducing database load and infrastructure costs.
o Scalability: Allows systems to handle increased traffic without
overwhelming the backend databases.
3.2. Types of Caching
 1. Client-Side Caching:
o What is it?
Data is stored on the client (e.g., browser or mobile app) to
minimize redundant server requests.
o Example:
Storing static assets like images, JavaScript, and CSS files in a
web browser’s cache.
o Pros:
Faster access to previously fetched data, reducing load on
servers.
o Cons:
May cause data staleness if not invalidated properly.
 2. Server-Side Caching:
o What is it?
Data is cached on the server, reducing load on databases and
speeding up access to frequently requested data.
o Example:
Using an in-memory data store like Redis or Memcached to
store frequently queried data.
o Pros:
Helps in reducing the number of database calls.
o Cons:
Can cause issues if not synchronized with the main data source.
 3. Distributed Caching:
o What is it?
A distributed caching system is used when a single server is
insufficient, and you need multiple servers to handle large-scale
data. Common in cloud architectures.
o Example:
Redis Cluster or Amazon ElastiCache.
o Pros:
Scalable and fault-tolerant.
o Cons:
Requires careful management of cache consistency and
invalidation across nodes.
 4. Application-Level Caching:
o What is it?
Caching data that is specific to application logic, e.g., storing
intermediate results of computation.
o Example:
Caching the result of a computationally expensive operation that
doesn't change often.
o Pros:
Speeds up applications that require frequent processing of
similar data.
o Cons:
May cause memory issues if data is not cleared regularly.
3.3. Cache Invalidation
 What is Cache Invalidation?
o Cache invalidation is the process of removing or updating stale
data from the cache. Without proper cache invalidation, stale
data can cause inconsistencies in the system.
 Types of Cache Invalidation:
1. Time-Based Expiry (TTL - Time to Live):
 Data is automatically removed from the cache after a
specific time interval.
 Example: HTTP Cache-Control Headers (e.g., max-
age=600).
2. Manual Invalidations (Push-based):
 When data is modified in the backend, the cache is
manually cleared or updated.
 Example: If a user updates their profile, the related cache
is updated.
3. LRU (Least Recently Used) Eviction:
 Caches automatically evict the least recently used data
when the cache size limit is reached.
4. Cache Versioning:
 A new version of data is stored in the cache when the
backend data changes.
 Example: In case of schema changes or major updates, all
cached data may need to be replaced with a newer
version.
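A minimal single-process sketch of time-based expiry (TTL) from strategy 1 above:

import time

class TTLCache:
    """Entries expire automatically after ttl_seconds (time-based invalidation)."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}                      # key -> (value, expiry timestamp)

    def set(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None                      # cache miss
        value, expires_at = entry
        if time.time() >= expires_at:
            del self.store[key]              # stale: invalidate lazily on read
            return None
        return value

cache = TTLCache(ttl_seconds=600)            # analogous to Cache-Control: max-age=600
cache.set("user:42", {"name": "Alice"})
print(cache.get("user:42"))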
3.4. Caching Strategies
 1. Write-Through Caching:
o How it works:
When a write request is made, it first updates the cache, and
then the main database is updated.
o Advantages:
 Keeps the cache and database synchronized at all times.
o Disadvantages:
 Write operations can be slower due to two updates (cache
+ DB).
 2. Write-Behind (Write-Back) Caching:
o How it works:
Writes are first done to the cache, and the changes are later
propagated to the database asynchronously.
o Advantages:
 Reduces write load on the database and improves write
performance.
o Disadvantages:
 Can lead to data loss if there is a crash before the cache is
written back to the database.
 3. Read-Through Caching:
o How it works:
When a read request is made, the system checks the cache first.
If the data is present, it returns it from the cache; if not, it
fetches from the database and updates the cache.
o Advantages:
 Transparent to the application (no need to manage cache
explicitly).
o Disadvantages:
 May lead to delays if the cache miss rate is high.
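A minimal sketch of the read-through pattern above; load_from_db is a hypothetical stand-in for a real database query:

cache = {}

def load_from_db(key):
    return {"id": key, "name": "example"}    # hypothetical database fetch

def read_through(key):
    if key in cache:                         # cache hit: serve from the cache
        return cache[key]
    value = load_from_db(key)                # cache miss: fetch from the database...
    cache[key] = value                       # ...and populate the cache for next time
    return value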
3.5. Content Delivery Networks (CDNs)
 What is a CDN?
o A Content Delivery Network (CDN) is a distributed network of
servers designed to deliver content (e.g., images, videos, scripts)
quickly to users based on their geographical location.
 How CDNs Work:
o Edge Servers:
CDNs use edge servers located geographically closer to the user
to serve cached content, reducing latency and speeding up
content delivery.
o Caching:
Static assets (images, videos, scripts) are cached on CDN
servers, ensuring faster access.
 Benefits of CDNs:
1. Reduced Latency:
Content is served from servers closest to the user.
2. Scalability:
CDNs can handle sudden spikes in traffic by offloading traffic
from the origin server.
3. Global Reach:
Improves user experience globally by serving content from
various edge locations.
4. DDoS Mitigation:
Many CDNs provide built-in protection against Distributed Denial
of Service (DDoS) attacks.
3.6. Data Consistency in Distributed Systems
 What is Data Consistency?
o Data consistency refers to ensuring that all nodes in a distributed
system have the same data at any given time.
 CAP Theorem:
o The CAP Theorem states that a distributed database system
can guarantee at most two of the following three properties:
1. Consistency: All nodes have the same data at any given
time.
2. Availability: The system responds to requests, even if
some nodes are down.
3. Partition Tolerance: The system continues to function
even if network partitions occur between nodes.
 Consistency Models:
1. Strong Consistency:
Guarantees that all reads will return the most recent write.
 Example: MongoDB when reading from the primary.
2. Eventual Consistency:
Guarantees that all replicas will eventually converge to the same state, but there
may be a lag.
 Example: Amazon DynamoDB.
3. Causal Consistency:
Guarantees that if one operation causally precedes another, the system will respect
that order.
4. Session Consistency:
Ensures that within a session, the user will see their most recent write.
3.7. Techniques to Ensure Data Consistency
 1. Two-Phase Commit (2PC):
o A distributed protocol used to ensure all nodes in a system either
commit or roll back a transaction.
o Phases:
1. Prepare Phase:
The coordinator asks all nodes to prepare for a commit.
2. Commit Phase:
Once all nodes acknowledge, the transaction is committed.
 2. Quorum-Based Replication:
o Data is written to multiple replicas, but a majority (quorum) of
replicas must acknowledge before a write is considered
successful.
 3. Eventual Consistency with Conflict Resolution:
o Systems like Cassandra provide eventual consistency, with
conflict resolution mechanisms like Last-Write-Wins (LWW) or
vector clocks.
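A small sketch of the quorum rule above: with N replicas, choosing a write quorum W and a read quorum R such that W + R > N guarantees that every read quorum overlaps the most recent write quorum:

def quorums_overlap(n, w, r):
    """True if any read quorum must intersect any write quorum."""
    return w + r > n

# N=3 replicas: W=2, R=2 overlap (reads see the latest write);
# W=1, R=1 do not (only eventual consistency).
print(quorums_overlap(3, 2, 2))   # True
print(quorums_overlap(3, 1, 1))   # False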
3.8. Examples of Caching and Data Consistency Systems:
1. Redis:
A popular in-memory data store used for caching. It provides advanced features like
persistence, replication, and pub/sub systems, along with support for various data types.
2. CDNs like Cloudflare:
Serves static content globally with minimal latency, often used by websites to reduce
server load and improve speed.
3. Amazon DynamoDB:
A highly scalable database that uses eventually consistent replication to achieve high
availability and fault tolerance.
4. MongoDB:
A NoSQL database that offers configurable consistency levels and strong consistency
when required.

Unit 4: Load Balancing, Rate Limiting, and Message Queues

4.1. Load Balancing


 What is Load Balancing?
Distributing incoming network traffic across multiple servers to ensure no single server is
overwhelmed.
 Why is Load Balancing Important?
o Improves system scalability and availability.
o Prevents server overload and failure.
o Ensures better response time and reliability.
 Types of Load Balancers:
1. Hardware Load Balancers (e.g., F5, Cisco).
2. Software Load Balancers (e.g., HAProxy, NGINX).
3. Cloud Load Balancers (e.g., AWS ELB, Azure LB).
 Load Balancing Algorithms:
1. Round Robin:
Cycles through servers one by one.
2. Least Connections:
Sends traffic to the server with the fewest active connections.
3. IP Hash:
Hashes client IP to consistently send traffic to the same server.
4. Weighted Round Robin:
Servers get a weight, and traffic is distributed proportionally.
5. Consistent Hashing:
Minimizes changes in routing when servers are added/removed.
Used in caching and sharded databases.
 Layer 4 vs. Layer 7 Load Balancing:
o Layer 4 (Transport Layer): Uses IP and TCP/UDP info to route
traffic.
o Layer 7 (Application Layer): Uses HTTP headers, cookies, URL
paths to make routing decisions.
 Health Checks: Load balancers monitor the health of servers and automatically reroute
traffic away from failed ones.
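A minimal sketch of the Consistent Hashing algorithm listed above, using a sorted hash ring (real implementations add virtual nodes per server for smoother balancing):

import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers):
        self.ring = sorted((_hash(s), s) for s in servers)

    def pick(self, key):
        """Walk clockwise from the key's hash to the next server on the ring."""
        hashes = [point for point, _ in self.ring]
        idx = bisect.bisect(hashes, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache1", "cache2", "cache3"])
print(ring.pick("user:42"))   # a key always maps to the same server, and adding or
                              # removing a server remaps only the keys near it on the ring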

4.2. Rate Limiting


 What is Rate Limiting?
Controlling the number of requests a user or client can make to a service within a defined
time window.
 Why is Rate Limiting Needed?
o Prevents abuse and denial-of-service (DoS) attacks.
o Protects APIs from overload.
o Ensures fair resource usage among users.
 Rate Limiting Strategies:
1. Fixed Window Counter:
 Tracks requests in a fixed time window.
 Simple but may cause spikes at window boundaries.
2. Sliding Window Log:
 Logs timestamps of requests and counts those within the
sliding window.
 Accurate but consumes more memory.
3. Sliding Window Counter:
 Combines fixed windows and weights based on proximity
to current time.
 Reduces spikes and is memory-efficient.
4. Token Bucket Algorithm:
 Tokens are added at a fixed rate. Each request consumes a
token.
 Allows bursts as long as tokens are available.
5. Leaky Bucket Algorithm:
 Requests are queued and processed at a steady rate.
 Smooths traffic but drops excess requests.
 Distributed Rate Limiting:
o Requires coordination across multiple servers.
o Use centralized data stores like Redis for counter storage.
 Real-World Tools:
o Nginx Rate Limiting Modules
o Envoy, API Gateway, Cloudflare
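A minimal single-process sketch of the Token Bucket algorithm above; a distributed version would keep the counters in a shared store like Redis, as noted:

import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1              # each request consumes one token
            return True
        return False                      # over the limit: reject (HTTP 429) or queue

bucket = TokenBucket(rate=5, capacity=10)   # 5 requests/second, bursts up to 10
print(bucket.allow())                       # True while tokens remain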

4.3. Message Queues


 What is a Message Queue?
A messaging system that allows components to communicate asynchronously by sending
and receiving messages via a queue.
 Why Use Message Queues?
o Decouples services for asynchronous communication.
o Improves scalability, resilience, and throughput.
o Handles spikes in traffic by queuing tasks.
 Core Components:
1. Producer: Sends messages to the queue.
2. Consumer: Processes messages from the queue.
3. Broker: Manages queues and routes messages (e.g., RabbitMQ,
Kafka).
4. Queue: Temporary storage for pending messages.
 Use Cases:
o Background jobs (e.g., sending emails, processing uploads).
o Log aggregation.
o Order processing.
o Microservices communication.
 Popular Message Queue Systems:
1. RabbitMQ:
 Traditional message broker using AMQP.
 Supports acknowledgments, retries, routing.
2. Apache Kafka:
 Distributed event streaming platform.
 High throughput, durable, partitioned topics.
 Ideal for real-time data pipelines.
3. Amazon SQS:
 Fully managed cloud queue service.
 Supports standard and FIFO queues.
 Message Delivery Semantics:
1. At Most Once: Message may be lost, but never duplicated.
2. At Least Once: Message is retried until acknowledged. May
cause duplicates.
3. Exactly Once: Message is delivered only once. Most complex to
implement.
 Design Patterns Using Queues:
1. Fan-out: One message triggers multiple consumers (e.g.,
publish/subscribe).
2. Work Queue: Multiple consumers share the load (e.g., task
queue).
3. Delayed Processing: Delay queues allow tasks to be deferred.
 Challenges in Message Queues:
o Duplicate processing.
o Ordering of messages.
o Queue backpressure and overflow.
o Consumer failure and retries.
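A minimal in-process sketch of the Work Queue pattern above, using Python's queue module as a stand-in for a real broker like RabbitMQ:

import queue
import threading

q = queue.Queue()                          # the queue (an in-process stand-in for a broker)

def producer():
    for i in range(5):
        q.put(f"task-{i}")                 # producer sends messages to the queue

def consumer(name):
    while True:
        task = q.get()                     # multiple consumers share the load
        print(f"{name} processed {task}")
        q.task_done()                      # acknowledge, so the work isn't redelivered

for n in ("worker-1", "worker-2"):
    threading.Thread(target=consumer, args=(n,), daemon=True).start()

producer()
q.join()                                   # block until every task is acknowledged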

Unit 5: Database Scaling and Sharding

5.1. Database Scaling


 What is Database Scaling?
The process of increasing a database’s capacity to handle more load—queries, data, and
connections.
 Two Primary Approaches:
1. Vertical Scaling (Scaling Up):
 Add more resources (CPU, RAM, SSD) to a single server.
 Easy but has a hardware limit and downtime.
2. Horizontal Scaling (Scaling Out):
 Add more machines/nodes to share the load.
 Complex but scalable, reliable, and fault-tolerant.

5.2. Read and Write Scaling


 Read Scaling:
o Use Read Replicas:
 Data is replicated asynchronously from a master (primary)
to one or more replicas.
 Reads go to replicas, writes to master.
o Tools: MySQL Replication, PostgreSQL Streaming Replication,
MongoDB Replica Set.
 Write Scaling:
o Harder than read scaling due to consistency.
o Use sharding (see below).
o Alternatives: partitioning, CQRS (Command Query Responsibility
Segregation).

5.3. Caching Before Scaling


 Why Cache First?
o Avoid hitting DB for every request.
o Decreases read latency, improves performance.
 Caching Layers:
o Application-level: Local in-memory (e.g., LRU cache).
o Distributed cache: Redis, Memcached.
o Cache commonly queried data, e.g., user profile, product info.
 Write-Through vs Write-Back vs Write-Around Caching:
o Write-through: Write to cache + DB.
o Write-back: Write to cache only, sync to DB later.
o Write-around: Write to DB only, read populates cache later.

5.4. Database Partitioning


 What is Partitioning?
Dividing a large table into smaller, more manageable pieces (partitions). Partitions may stay
on the same server; once they are spread across separate servers, this becomes sharding.
 Types:
1. Horizontal Partitioning (Sharding):
Rows are split across different partitions (or DB instances, when sharded).
E.g., Users with id 1–1000 in DB1, 1001–2000 in DB2.
2. Vertical Partitioning:
Split table columns into different tables/DBs.
E.g., keep frequently accessed columns in one table and move
less-used fields elsewhere.

5.5. Sharding
 What is Sharding?
Horizontal partitioning across multiple databases or servers to spread write and read load.
 Why Shard?
o Overcome single node storage/performance limits.
o Support massive scale with acceptable performance.
 Sharding Strategies:
1. Range-Based Sharding:
 Based on value range (e.g., user_id 1–1000 in shard1).
 Easy, but risk of uneven load (hot shards).
2. Hash-Based Sharding:
 Apply a hash function to shard key (e.g., user_id % 4).
 More uniform, but harder to add/remove shards
dynamically.
3. Directory-Based Sharding:
 Maintain lookup table to track where each key lives.
 Flexible, but the lookup table is a single point of failure and
bottleneck.
 Shard Key Design:
o Choose a key with high cardinality and uniform distribution.
o Avoid sequential or low-distribution keys (e.g., timestamp).
 Challenges in Sharding:
o Rebalancing shards (when adding/removing nodes).
o Joins across shards are complex.
o Global consistency is hard—often favor eventual consistency.
o Cross-shard transactions require complex 2PC (Two-Phase
Commit).
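A minimal sketch of Hash-Based Sharding from strategy 2 above (the user_id % 4 idea); shard names are hypothetical, and the remapping pain when shards are added is exactly what consistent hashing mitigates:

SHARDS = ["db0", "db1", "db2", "db3"]

def shard_for(user_id):
    """Modulo hashing: uniform spread, but adding a shard remaps most keys."""
    return SHARDS[user_id % len(SHARDS)]

print(shard_for(1001))   # db1
print(shard_for(4096))   # db0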

5.6. CAP Theorem & Consistency Models


 CAP Theorem:
o A distributed system can guarantee only 2 out of 3:
 Consistency
 Availability
 Partition Tolerance
 Consistency Models:
o Strong Consistency: All clients see the same data instantly.
o Eventual Consistency: All replicas converge eventually.
o Causal Consistency: Preserves ordering of related operations.
 BASE vs ACID:
o ACID (traditional DBs): Atomicity, Consistency, Isolation,
Durability.
o BASE (distributed systems): Basically Available, Soft state,
Eventual consistency.
5.7. Tools and Technologies
 Relational Databases: PostgreSQL, MySQL (support partitioning,
replication).
 NoSQL Systems:
o MongoDB: Built-in sharding, replica sets.
o Cassandra: Peer-to-peer, partitioned.
o DynamoDB: Key-value store, auto-sharding.
 Cloud Services:
o Amazon Aurora: MySQL/PostgreSQL-compatible, read scaling.
o Google Cloud Spanner: Horizontal scaling with global
consistency.

Unit 6: CDN, DNS, and Caching Layers

6.1. Content Delivery Network (CDN)


 What is a CDN?
A geographically distributed network of proxy servers and data centers that delivers static
content (images, JS, CSS, videos) from the closest edge location to the user.
 Why Use a CDN?
o Reduces latency by serving from edge locations.
o Offloads traffic from origin servers.
o Improves content delivery speed, availability, and reliability.
 How It Works:
1. User requests content (e.g., image).
2. CDN checks cache at edge server.
 If hit, serves directly.
 If miss, fetches from origin, caches it, and serves.
 Popular CDNs:
o Cloudflare, AWS CloudFront, Akamai, Fastly.
 Use Cases:
o Static website assets, media streaming, software updates, global
apps.

6.2. Domain Name System (DNS)


 What is DNS?
o The Internet’s phonebook: maps human-readable domain
names (e.g., example.com) to IP addresses (e.g., 192.0.2.1).
 How DNS Resolution Works:
1. User types example.com
2. Resolver checks local cache
3. If miss → goes to Root DNS Server
4. Then → TLD server (.com, .org, etc.)
5. Then → Authoritative DNS (returns final IP)
6. Result is cached at each layer.
 DNS TTL (Time to Live):
o Duration for which a DNS record is cached.
o Lower TTL → changes propagate quickly, at the cost of more lookups.
o Higher TTL → better performance, but updates take longer to propagate.
 DNS Load Balancing:
o Multiple IPs returned in round-robin or geo-based fashion.
 DNS Failover:
o Automatically reroutes traffic to backup IPs if primary is down.
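DNS-based load balancing is visible from client code: resolving one name can return several IPs. A minimal Python sketch (example.com stands in for any multi-homed domain):

import socket

# Ask the resolver for all addresses behind a hostname; round-robin or
# geo-based DNS shows up here as multiple (or varying) IPs.
infos = socket.getaddrinfo("example.com", 443, proto=socket.IPPROTO_TCP)
print(sorted({info[4][0] for info in infos}))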

6.3. Caching Layers


Caching is essential to improve speed, reduce load, and increase scalability.

6.3.1. Client-side Caching (Browser)


 What: Web browsers store static resources locally.
 Mechanism: Controlled via HTTP headers.
 Key Headers:
o Cache-Control: max-age, no-cache
o ETag: validation of modified content
o Expires: timestamp when content becomes stale
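A minimal sketch of serving those headers with Python's standard http.server, so a browser caches the response and revalidates it with ETag:

import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<h1>hello</h1>"
ETAG = '"%s"' % hashlib.md5(BODY).hexdigest()         # fingerprint of the content

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("If-None-Match") == ETAG:
            self.send_response(304)                   # client's cached copy is still valid
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Cache-Control", "max-age=600")   # cache for 10 minutes
        self.send_header("ETag", ETAG)                     # token for later revalidation
        self.end_headers()
        self.wfile.write(BODY)

HTTPServer(("", 8000), Handler).serve_forever()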

6.3.2. CDN Caching


 What: CDNs cache static content near users.
 Benefits: Reduces hits to origin servers and cuts latency.
 Cache Invalidation:
o Manually or automatically remove stale objects (e.g., /purge).

6.3.3. Application-Level Caching


 What: Caching done inside application code.
 Tools:
o Local LRU caches
o In-process hashmaps
o Function memoization
 Examples:
o Store computed results of expensive functions.
o Reduce API/database calls for frequently accessed data.

6.3.4. Distributed Caching


 What: Shared cache accessed by multiple application servers.
 Technologies: Redis, Memcached
 Types of Cached Data:
o DB query results
o Session data
o User profile info
 Cache Consistency:
o Write-through: Write to cache + DB
o Write-around: Write to DB only; cache on read
o Write-back: Write to cache, sync to DB later

6.3.5. Database Query Caching


 Many databases (like MySQL, PostgreSQL) have internal query cache.
 Application-level query result caching often offers better control.

6.3.6. HTTP Reverse Proxy Caching


 What: Cache sits between client and server (e.g., Nginx, Varnish).
 How: Caches full HTML responses or partial views.
 Benefits: Reduces DB + app load for frequently accessed pages.

Best Practices for Caching


 Use consistent cache keys
 Invalidate caches on update/delete
 Monitor hit/miss ratios
 Use LRU or TTL policies to manage memory
 Avoid stale data and race conditions


UNIT 7: Rate Limiting, Message Queues & Reliability Patterns

1. Rate Limiting
Used to control the number of requests a user/client can make in a given time window (e.g., 100
requests per minute).
Common Algorithms:
 Fixed Window Counter: Simple counter resets every window.
Example: Map<user, count>, reset every 60s.
 Sliding Window Log: Store timestamps of requests; clean old entries.
 Sliding Window Counter: Buckets + smooth decaying counters.
 Token Bucket: Tokens are added at a rate. Requests consume tokens.
 Leaky Bucket: Requests are queued and served at a constant rate.
Use cases:
API throttling, login rate limiting, brute force protection.

2. Message Queues (MQ)


Decouple producers and consumers to handle different processing speeds or batch workloads.
Popular MQ Tools:
 Kafka, RabbitMQ, Redis Streams, AWS SQS
Why use MQ?
 Prevent overloading backend
 Enable async processing
 Retry on failure
 Improve system scalability
Common Patterns:
 Publish–Subscribe (Pub/Sub): Publisher sends to multiple
subscribers.
 Point-to-Point: Message sent to one receiver only.
 Dead Letter Queue (DLQ): Stores failed messages for later
inspection.
Use cases:
 Sending notifications
 Order processing
 Logging
 Event-driven architecture

3. Reliability Patterns
Ensuring the system handles failure gracefully.
a. Retry Pattern: Retry a failed operation after a delay.
Add exponential backoff + jitter to avoid retry storms.
b. Circuit Breaker: Breaks the call to a downstream service if it's continuously failing.
 States: Closed (normal) → Open (breaks) → Half-Open (test recovery)
c. Bulkhead Pattern: Isolate components/services to prevent cascading failures.
d. Timeout + Fallback: Time out after a threshold and use a default response or cached result.
e. Idempotency: Ensure the same operation repeated multiple times has the same effect.
→ Key for payment systems, updates.
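A minimal sketch of the Retry pattern (a) with exponential backoff and full jitter; operation is any callable that may fail transiently:

import random
import time

def retry(operation, max_attempts=5, base_delay=0.1):
    """Retry with exponential backoff + jitter to avoid retry storms."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                  # give up after the final attempt
            delay = base_delay * (2 ** attempt)        # 0.1s, 0.2s, 0.4s, ...
            time.sleep(delay * random.random())        # full jitter spreads out retries

# retry(lambda: call_payment_gateway())   # hypothetical usage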

4. Applying This in Real System Design


 Rate Limiting on login API, uploads
 Message Queues for async video processing, sending emails
 Reliability Patterns for microservices communicating with DB,
payment gateways
UNIT 8: Data Consistency, Caching, and Distributed Transactions

1. Data Consistency Models


Consistency defines how a system ensures the correctness and visibility of data when accessed or
modified across multiple nodes.
 Strong Consistency:
Every read returns the most recent write. It ensures data accuracy but increases latency.
Used in: banking systems, critical systems.
Example: Traditional RDBMS, Zookeeper.
 Eventual Consistency:
The system guarantees that, if no new writes occur, all replicas eventually converge and reads return the latest value.
Used in: highly available systems like social networks, DNS.
Example: DynamoDB, Cassandra, S3.
 Causal Consistency:
If operation A influences operation B, then every node must see A before B.
Used in: collaborative tools like Google Docs.
 Read-After-Write Consistency:
A system ensures that after you perform a write, you’ll immediately be able to read that
value.
 Monotonic Reads:
Ensures that if a process has seen a particular value of data, it will never see an older
value in future reads.

2. CAP Theorem
CAP Theorem states that a distributed system cannot simultaneously guarantee all three of:
 Consistency (C) – All nodes see the same data at the same time.
 Availability (A) – Every request receives a response (success/failure).
 Partition Tolerance (P) – The system continues to work despite
network partitions.
You can only have 2 out of 3; since network partitions are unavoidable in practice, the real choice during a partition is between consistency (CP) and availability (AP).

3. Caching
Caching improves performance by storing frequently accessed data closer to the user or the
service.
Types of Caching:
 Browser/Client-side cache: Static resources.
 CDN (Content Delivery Network): Distributed network delivering
static assets.
 Reverse Proxy Cache: NGINX, Varnish.
 Application-level cache: In-memory cache (e.g., using Python
dictionary or Java Map).
 Distributed Cache: Shared among nodes. E.g., Redis, Memcached.
Cache Strategies:
 Write-through: Data is written to cache and DB at the same time.
Ensures consistency.
 Write-around: Data written only to DB, and cache is populated on
read.
 Write-back: Writes only to cache, later persisted to DB. Fast but risky.
Eviction Policies:
 LRU (Least Recently Used) – Evicts the least recently accessed item.
 LFU (Least Frequently Used) – Evicts the least frequently accessed
item.
 FIFO – First in, first out.
 TTL (Time-To-Live) – Data expires after a certain period.
Cache Invalidation Problems:
 What to remove and when?
 Use versioning or manual busting.
 Use techniques like stale-while-revalidate or soft TTLs (serve stale data briefly while refreshing in the background).
Avoid Cache Stampede:
 Use locks (mutex), queuing, or exponential backoff to prevent DB
overload on cache miss.
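A minimal single-process sketch of stampede prevention with a per-key mutex, as suggested above (a distributed setup would use a shared lock in Redis instead):

import threading

cache = {}
locks = {}                          # one lock per cache key
locks_guard = threading.Lock()

def get_or_compute(key, compute):
    """Only one thread recomputes a missing key; the rest wait instead of hitting the DB."""
    value = cache.get(key)
    if value is not None:
        return value
    with locks_guard:
        lock = locks.setdefault(key, threading.Lock())
    with lock:
        value = cache.get(key)      # re-check: another thread may have filled it meanwhile
        if value is None:
            value = compute()       # a single expensive DB hit per miss
            cache[key] = value
        return value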

4. Distributed Transactions
In distributed systems, a transaction may span multiple microservices or databases.
Challenges:
 Ensuring atomicity, consistency, isolation, and durability (ACID) across
services.
Approaches:
 Two-Phase Commit (2PC):
1. Prepare phase – coordinator asks all participants if they can
commit.
2. Commit phase – if all say yes, coordinator instructs commit.
Drawback: Blocking protocol, slow, failure-prone.
 Three-Phase Commit (3PC):
Adds timeout and pre-commit step. More failure-tolerant than 2PC.
5. Saga Pattern (Asynchronous Transactions)
Breaks a transaction into a sequence of local transactions. Each has a compensating action for
rollback.
Types:
 Choreography-based – Each service emits events and listens to
other events.
 Orchestration-based – Central coordinator controls the flow.
Example:
 Book hotel → Book flight → Book cab. If cab fails, cancel flight & hotel.
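A minimal sketch of the orchestration-based variant for this example; the book_* and cancel_* functions are hypothetical local transactions and their compensating actions:

def book_hotel():    print("hotel booked")
def book_flight():   print("flight booked")
def book_cab():      raise RuntimeError("no cabs available")   # simulate a failure
def cancel_hotel():  print("hotel cancelled")
def cancel_flight(): print("flight cancelled")
def cancel_cab():    pass

def book_trip():
    """Orchestrator runs each local transaction; on failure, compensates in reverse."""
    completed = []                                    # compensations for finished steps
    for action, compensate in [(book_hotel, cancel_hotel),
                               (book_flight, cancel_flight),
                               (book_cab, cancel_cab)]:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):          # cab failed: cancel flight, then hotel
                undo()
            raise

try:
    book_trip()
except RuntimeError:
    print("trip rolled back")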

6. CQRS (Command Query Responsibility Segregation)


Separates read and write operations into different models.
 Command model: Handles writes, performs validations.
 Query model: Optimized for reads, denormalized structure.
Use Case: Read-heavy applications or systems with complex writes and high scalability needs.

UNIT 9: Load Balancing, Proxies, and CDNs

1. Load Balancers
Load balancers distribute network or application traffic across multiple servers to ensure
reliability, availability, and performance.
Types:
 Layer 4 (Transport Layer) – Based on IP address and TCP/UDP ports
(e.g., NGINX TCP LB, HAProxy).
 Layer 7 (Application Layer) – Based on HTTP headers, URL paths,
cookies (e.g., NGINX HTTP LB, Envoy).
Algorithms:
 Round Robin: Cycles through servers.
 Least Connections: Chooses server with fewest active connections.
 IP Hash: Uses client's IP for sticky sessions.
 Weighted Round Robin: Assigns weights based on server capacity.
 Consistent Hashing: Useful in caching and sharding.
Health Checks:
 Pings backend services periodically.
 Removes unhealthy servers from rotation.
Sticky Sessions:
 Ensures client’s session sticks to the same server (used in stateful
apps).

2. Horizontal Scaling with Load Balancing


 Add more servers behind a load balancer to scale out.
 Each instance is stateless (preferred) and handles part of the total
traffic.
 Load balancer becomes the entry point (reverse proxy).

3. Reverse Proxy vs Forward Proxy


Reverse Proxy:
 Acts on behalf of server.
 Hides internal structure.
 Caches responses, terminates SSL.
Forward Proxy:
 Acts on behalf of client.
 Used in offices, schools to filter/block traffic.

4. CDN (Content Delivery Network)


A CDN is a globally distributed network of edge servers that cache static content closer to users.
Benefits:
 Reduces latency.
 Offloads traffic from origin server.
 Increases download speed.
 Mitigates DDoS attacks.
How it Works:
 User requests image/video/script.
 CDN edge node serves cached version if available.
 If not available, fetches from origin and caches it.
Popular CDNs:
 Cloudflare, Akamai, Amazon CloudFront, Fastly.
Types of Data Cached:
 Static: JS, CSS, images, videos, HTML.
 Dynamic (partially): APIs with caching headers (e.g., max-age, ETag).
5. SSL Termination
Load balancers or reverse proxies can handle SSL encryption/decryption, reducing the load on
backend servers.
 Benefit: Offloads SSL computation from app servers.
 TLS Certificates are installed on the load balancer.

6. Global Load Balancing


Used to serve users from the closest geographical region (low latency).
Techniques:
 GeoDNS: Maps users to nearest region via DNS.
 Anycast: Same IP routes users to nearest datacenter.
 GSLB (Global Server Load Balancing): Smart traffic routing based
on health, location, load.

7. Auto Scaling + Load Balancing


 When server CPU/memory crosses threshold, auto-scaling launches
new instances.
 Load balancer automatically adds new instances to traffic pool.

8. DDoS Protection
 Load balancers + CDNs absorb and mitigate DDoS attacks.
 Tools: Cloudflare, AWS Shield, Google Cloud Armor.

1. Load Balancing
Advanced Concepts:
 Sticky Sessions: Also known as session persistence, this is when a
load balancer sends the client requests to the same server based on
session IDs (like cookies).
 Global Load Balancing: Distributes traffic across multiple data
centers globally, based on factors like server health, geographic
location, or capacity.
 Health Checks: Load balancers continuously monitor backend
servers' health by making periodic requests to ensure they are
operational. Servers deemed unhealthy are excluded from traffic
routing.
Tools:
 AWS Elastic Load Balancer (ELB): Scalable and managed load
balancing for various AWS resources.
 Kemp LoadMaster: Hardware and virtual load balancing solution.
When to Use Load Balancing:
 High availability, scalability, and resilience are required.
 Complex applications need to serve millions of users.

2. Reverse Proxy
Key Differences (Compared to Forward Proxy):
 Forward Proxy: Used by clients to access resources on the internet
(client-side).
 Reverse Proxy: Used by web servers to manage incoming traffic
(server-side).
Advanced Reverse Proxy Features:
 SSL Offloading: A reverse proxy can offload SSL decryption and
encryption work from the web server, reducing the computational load.
 Caching: Reduces server load by caching the content for frequent
requests. This improves response time and decreases backend load.
Real-World Use Cases:
 API Gateways: A reverse proxy acting as an entry point for various
APIs.
 Microservices: In a microservices architecture, reverse proxies help in
routing requests to different services.
Popular Tools:
 Traefik: A modern HTTP reverse proxy designed for microservices.
 Envoy: A high-performance proxy used in microservices environments
for routing and load balancing.

3. Content Delivery Networks (CDNs)


Key Features:
 Origin Pull vs Push CDN:
o Origin Pull: CDN servers pull content from the origin server
when requested by the user.
o Origin Push: Content is pushed from the origin server to the
CDN network proactively.
 Geo-replication: Content is replicated across multiple geographic regions, ensuring
quick access from any part of the world.
 Edge Computing: CDNs with edge computing capabilities run code closer to the user,
allowing for dynamic content manipulation and reduced latency.
How CDNs Optimize Performance:
 Compression: CDNs can compress images, videos, and other assets
to reduce file size and speed up transfer.
 Minification: JavaScript, CSS, and HTML files can be minified to reduce
unnecessary characters, further improving load times.
 Dynamic Content Delivery: While static content (images, files) is
cached, CDNs can also optimize dynamic content delivery based on
user-specific parameters.
Examples of Use Cases:
 Global E-Commerce Sites: To ensure fast page loads for users
around the world.
 Video Streaming: CDNs help with the efficient delivery of high-quality
video content across different regions.
 Gaming: Reducing latency for online multiplayer games using edge
servers closer to players.
Popular CDN Providers:
 Fastly: A high-performance CDN with real-time caching and instant
purging.
 KeyCDN: A reliable and cost-effective CDN solution.

Unit 10: Caching, Queues, and Messaging Systems

1. Caching
Key Concepts:
 Cache Miss: Occurs when the requested data is not found in the
cache, resulting in a fallback to the original source (e.g., database).
 Cache Hit: Happens when the requested data is found in the cache,
leading to faster retrieval and reduced load on the source.
 Eviction Policies:
o LRU (Least Recently Used): Removes the least recently
accessed data to free up space.
o LFU (Least Frequently Used): Removes data that is used the
least often.
o TTL (Time-To-Live): Cached items have a specific lifespan after
which they are removed.
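A minimal sketch of the LRU policy above using an ordered dict (Python's functools.lru_cache applies the same idea to function results):

from collections import OrderedDict

class LRUCache:
    """When full, evict the least recently accessed item."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                        # cache miss
        self.data.move_to_end(key)             # mark as most recently used
        return self.data[key]                  # cache hit

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)      # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                                 # "a" becomes most recent
cache.put("c", 3)                              # evicts "b"
print(cache.get("b"))                          # None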
Distributed Caching:
 Redis: An in-memory data structure store commonly used as a cache.
It's fast and supports a variety of data types (strings, hashes, lists,
sets).
 Memcached: A high-performance, distributed memory caching system
for speeding up dynamic web applications by reducing database load.
Caching in Distributed Systems:
 Cache Coherence: Ensuring that all copies of a cache are
synchronized and consistent.
 Cache Replication: Data is replicated across multiple nodes to
improve reliability and availability.
When to Use Caching:
 High-frequency read operations with low-frequency writes.
 Data that is not highly dynamic and doesn’t need real-time freshness.

2. Message Queues and Pub/Sub Systems


Message Queue (MQ):
 Use Case: Enables decoupling of application components. Producers
send messages to queues, and consumers retrieve messages
asynchronously.
 Advantages: Reduces dependencies between services, handles high-
volume traffic, and provides fault tolerance.
Popular MQ Systems:
 RabbitMQ: A message broker that implements the AMQP (Advanced
Message Queuing Protocol), ensuring reliable message delivery.
 Amazon SQS (Simple Queue Service): Fully managed message
queuing service in AWS, supporting decoupling and scaling of
microservices.
Message Queue Components:
 Producer: Application or service that sends messages.
 Consumer: Application or service that processes received messages.
 Queue: A buffer that temporarily holds messages until consumed.
Pub/Sub Systems:
 Pub/Sub (Publisher/Subscriber): A communication pattern where
publishers send messages to a topic, and subscribers receive those
messages in real-time.
 Advantages: Promotes loose coupling, scalability, and allows
broadcast messaging.
Popular Pub/Sub Systems:
 Google Cloud Pub/Sub: A globally distributed messaging system for
event-driven applications.
 Apache Kafka: A distributed streaming platform that supports
publishing and subscribing to streams of records.
When to Use MQ and Pub/Sub:
 Microservices Architecture: Communication between microservices
is often implemented with message queues.
 Asynchronous Communication: When tasks need to be processed
independently without blocking the main workflow.
 Real-time Notifications: Pub/Sub is ideal for delivering notifications
or updates to users in real-time.
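A minimal in-process sketch of the Pub/Sub pattern above; the topic name and handlers are hypothetical, and a real system would use a broker such as Kafka or SNS:

from collections import defaultdict

subscribers = defaultdict(list)              # topic -> list of subscriber callbacks

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, message):
    for handler in subscribers[topic]:       # every subscriber receives the message
        handler(message)

subscribe("order.placed", lambda m: print("email service:", m))
subscribe("order.placed", lambda m: print("billing service:", m))
publish("order.placed", {"order_id": 7})     # broadcast to both subscribers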

3. Event-Driven Architecture (EDA)


Core Concepts:
 Event: A significant change or update in the system (e.g., user
registration, order placed).
 Event Producer: Component that generates events and publishes
them to a message broker or event stream.
 Event Consumer: Component that listens for events and processes
them as they arrive.
Advantages of EDA:
 Loose Coupling: Producers and consumers don’t need to know about
each other, which makes systems more scalable and maintainable.
 Scalability: As the system grows, new event producers or consumers
can be added without affecting existing components.
 Real-Time Processing: Events can be processed as they happen,
enabling real-time decision-making.
Tools for Event-Driven Systems:
 Apache Kafka: A distributed event streaming platform for real-time
data pipelines and streaming applications.
 Amazon SNS (Simple Notification Service): A fully managed
service for pub/sub messaging.
 Azure Event Grid: A fully managed event routing service that allows
for the easy building of event-driven architectures.
When to Use Event-Driven Systems:
 Real-time data processing systems.
 Systems that require high scalability and responsiveness.
Unit 11: Security, Monitoring, and Logging

1. Security in Distributed Systems


Core Concepts:
 Authentication: Verifying the identity of users or systems. Common methods:
o Basic Authentication: Username and password.
o OAuth: Authorization protocol that allows third-party services to
access user resources without exposing credentials.
o JWT (JSON Web Token): A compact, URL-safe token used for
securely transmitting information between parties.
 Authorization: Granting access based on permissions.
o Role-Based Access Control (RBAC): A strategy where
permissions are granted based on user roles.
o Attribute-Based Access Control (ABAC): Access is granted
based on user attributes (e.g., department, job title).
 Encryption:
o Data Encryption at Rest: Encrypting stored data to prevent
unauthorized access.
o Data Encryption in Transit: Encrypting data during
transmission (e.g., TLS/SSL).
o End-to-End Encryption: Ensures that data remains encrypted
from source to destination, preventing any intermediate
decryption.
 Common Threats:
o Man-in-the-Middle (MITM): Attackers intercept communication
between parties.
o SQL Injection: Malicious queries are used to manipulate
databases.
o Cross-Site Scripting (XSS): Attackers inject malicious scripts
into web pages.
Best Practices:
 Use TLS/SSL for secure communication.
 Implement API rate-limiting to prevent denial-of-service (DoS)
attacks.
 Always hash passwords (e.g., bcrypt).
 Regular security audits and penetration testing.
 Use firewalls and security groups in cloud environments.
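A minimal sketch of the password-hashing practice above; bcrypt requires a third-party library, so this uses the standard library's scrypt with a per-user random salt:

import hashlib
import hmac
import secrets

def hash_password(password):
    """Slow, salted hash; bcrypt or argon2 via a library are equally good choices."""
    salt = secrets.token_bytes(16)                       # unique random salt per user
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest                                  # store these, never the password

def verify_password(password, salt, digest):
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)        # constant-time comparison

salt, digest = hash_password("s3cret")
print(verify_password("s3cret", salt, digest))   # True
print(verify_password("wrong", salt, digest))    # False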

2. Monitoring
Core Concepts:
 Monitoring: The process of tracking the health and performance of a
system, detecting anomalies, and ensuring availability.
Types of Monitoring:
 System Monitoring: Tracks the overall health of servers and
infrastructure (e.g., CPU usage, memory usage, disk I/O).
 Application Monitoring: Focuses on the performance of applications,
identifying bottlenecks or failures (e.g., response time, error rates).
 Infrastructure Monitoring: Monitors network components, load
balancers, and databases for performance and failures.
Key Metrics to Monitor:
 Availability: The percentage of time a system is operational.
 Latency: The time it takes for a request to travel from the source to
the destination and back.
 Throughput: The number of requests handled by a system in a given
time period.
 Error Rate: The rate at which errors occur within the system.
Monitoring Tools:
 Prometheus: Open-source tool for monitoring and alerting, with a
powerful query language (PromQL) and integrations with Grafana.
 Datadog: A cloud-based service that provides infrastructure
monitoring, application performance monitoring (APM), and log
management.
 Nagios: Open-source monitoring solution for monitoring servers,
networks, and applications.
 Zabbix: An open-source monitoring tool that offers flexibility in
tracking metrics for both infrastructure and applications.
Best Practices:
 Set up alerting thresholds for critical metrics (e.g., CPU usage >
80%).
 Use dashboards to visualize key metrics and trends.
 Regularly review and optimize the monitoring infrastructure for
cost-efficiency.

3. Logging
Core Concepts:
 Logging: Recording system activities and events to help debug,
analyze performance, and track operations.
o Logs should be structured, with timestamped entries, log levels
(e.g., INFO, ERROR, DEBUG), and clear message content.
Types of Logs:
 Application Logs: Generated by the application to record key events,
errors, and debug information.
 System Logs: Generated by the operating system or infrastructure to
capture system-level events (e.g., server crashes, service restarts).
 Access Logs: Track HTTP requests to web servers, including client IPs,
timestamps, and response codes.
Log Management:
 Centralized Logging: Aggregating logs from multiple sources into
one system for easier analysis (e.g., Elasticsearch, Logstash, and
Kibana stack – ELK stack).
 Log Rotation: Archiving old logs and starting new files on a schedule or size threshold, so log files stay manageable and do not fill the disk.
 Log Parsing: Analyzing log files to extract useful data and automate
response actions.
Popular Logging Tools:
 ELK Stack: A combination of Elasticsearch (for search), Logstash (for
log collection), and Kibana (for visualization). Often used for
centralized logging and analysis.
 Splunk: A powerful tool for searching, analyzing, and visualizing
machine-generated data, including logs from various sources.
 Fluentd: An open-source data collector for logs that allows for real-
time streaming and log aggregation.
Best Practices:
 Ensure logs are written in structured formats (JSON is a common format; see the sketch after this list).
 Implement log aggregation for a centralized view across
microservices.
 Monitor logs regularly to identify trends, performance issues, and
security breaches.
 Use log retention policies to balance the cost of storage and
compliance requirements.
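A minimal structured-logging sketch using only the Python standard library; the field names and logger name are illustrative, not a fixed convention:

    import json, logging

    class JsonFormatter(logging.Formatter):
        """Render each log record as one JSON object per line."""
        def format(self, record):
            return json.dumps({
                "ts": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("orders")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("order placed")
    # -> {"ts": "...", "level": "INFO", "logger": "orders", "message": "order placed"}

One JSON object per line is what aggregators like Logstash and Fluentd expect, which makes centralized parsing straightforward.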

4. Incident Management and Response


Core Concepts:
 Incident Management: The process of identifying, analyzing, and
resolving incidents that disrupt normal operations.
 Incident Response: The actions taken after an incident occurs to
mitigate damage and restore service.
Steps in Incident Management:
1. Detection: Identifying that an incident has occurred. This is typically
through monitoring or an external notification.
2. Assessment: Understanding the severity of the incident and potential
impact.
3. Response: Containing the incident and taking necessary actions to
minimize damage.
4. Recovery: Restoring systems and services to normal operation.
5. Post-Incident Review: Analyzing the incident to identify root causes
and improve future responses.
Incident Response Tools:
 PagerDuty: An incident management platform for automating alerts
and response workflows.
 Opsgenie: A service for managing on-call schedules, incident alerts,
and escalation procedures.
 VictorOps: Another platform for incident management that integrates
with monitoring and alerting tools.
Best Practices:
 Have an incident response plan that clearly defines roles,
responsibilities, and actions to take during incidents.
 Perform regular drills to ensure readiness for handling incidents.
 Continuously improve response processes based on lessons learned from past incidents.

5. Disaster Recovery and Backup


Core Concepts:
 Disaster Recovery (DR): The process of recovering from catastrophic
failures or outages to restore business operations.
 Backup: Storing copies of data to recover it in case of data loss or
corruption.
DR Strategies:
 Cold Site: A backup facility with minimal infrastructure; it must be provisioned after an incident, so recovery is slowest but standing cost is lowest.
 Warm Site: A partially equipped facility that needs some setup and data restoration before it can take over operations.
 Hot Site: A fully operational duplicate of production, ready for near-immediate use; fastest recovery at the highest cost.
Backup Strategies:
 Full Backup: A complete copy of all data.
 Incremental Backup: Only the data that has changed since the last
backup is copied.
 Differential Backup: Copies all data that has changed since the last full backup (a sketch contrasting incremental and differential selection follows this list).
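A toy sketch of the file-selection logic behind the two strategies, using only the standard library; the directory and timestamps are hypothetical, and real backup tools also handle deletions, permissions, and atomicity:

    from pathlib import Path

    def changed_since(root: Path, cutoff: float):
        """Files under root modified after the given epoch timestamp."""
        return [p for p in root.rglob("*") if p.is_file() and p.stat().st_mtime > cutoff]

    root = Path("/data")          # hypothetical directory to protect
    last_full = 1_700_000_000.0   # hypothetical time of the last full backup
    last_backup = 1_700_500_000.0 # hypothetical time of the most recent backup of any kind

    incremental = changed_since(root, last_backup)  # smallest copy, longest restore chain
    differential = changed_since(root, last_full)   # larger copy; restore = full + one differential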
Best Practices:
 Ensure backup redundancy (multiple locations).
 Regularly test disaster recovery plans to ensure recovery is
effective and fast.
 Automate backup processes to reduce human errors.

Unit 12: Fault Tolerance, High Availability, and Scalability

1. Fault Tolerance
Core Concepts:
 Fault Tolerance refers to the system's ability to continue operating
even in the event of partial system failures. This is achieved by
designing systems that can recover from failures without impacting
overall performance.
Key Techniques:
 Redundancy: Having backup systems or components to take over when one part fails.
Redundancy can be at the hardware, software, or network level.
o Hardware Redundancy: Using multiple machines, disks, or
power supplies to handle the load if one fails.
o Software Redundancy: Running multiple copies of critical
software services (e.g., microservices, replicated databases).
 Replication: Creating copies of data or services in different locations. In case of failure,
another replica can be used to ensure continuous service.
o Master-Slave Replication: One primary server handles writes,
and replicas handle reads.
o Multi-Master Replication: Multiple servers can handle both
reads and writes, with synchronization.
 Graceful Degradation: In case of failure, the system continues operating but with
reduced functionality, so users are still able to perform critical tasks.
 Error Detection and Recovery: Mechanisms for detecting failures (e.g., timeouts, health checks) and automatically recovering from them (e.g., automatic restart, failover); a failover sketch follows this list.
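A minimal sketch of timeout-based detection plus failover, assuming the requests library; the replica URLs are hypothetical:

    import requests  # pip install requests

    # Hypothetical replicas, tried in order; the first healthy response wins.
    REPLICAS = ["https://primary.internal/api", "https://standby.internal/api"]

    def fetch_with_failover(path: str) -> requests.Response:
        last_error = None
        for base in REPLICAS:
            try:
                resp = requests.get(base + path, timeout=2)  # timeout doubles as a failure detector
                resp.raise_for_status()                      # treat 5xx as a failure too
                return resp
            except requests.RequestException as exc:         # connection error, timeout, or HTTP error
                last_error = exc                             # fall through to the next replica
        raise RuntimeError("all replicas failed") from last_error

This is the active-passive pattern from the strategies below, collapsed into a single client-side loop; production systems usually move this logic into a load balancer or service mesh.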
Fault Tolerance Strategies:
 Active-Passive: One node is active, and others are passive. Upon
failure, a passive node becomes active.
 Active-Active: All nodes are active, and traffic is distributed among
them. If a node fails, others take over the load.
Best Practices:
 Monitor system health continuously to detect issues early.
 Use failover strategies to quickly switch to backup systems when
failure occurs.
 Test failover scenarios regularly to ensure system resilience.

2. High Availability (HA)


Core Concepts:
 High Availability ensures that a system is operational and accessible
with minimal downtime, even in the event of failures.
Key Principles:
 Uptime: The percentage of time the system is fully functional. The goal is typically 99.99% ("four nines") or better, which allows roughly 52 minutes of downtime per year.
 Failover: Switching to a backup system automatically when the
primary system fails.
 Load Balancing: Distributing requests across multiple servers or
resources to avoid overloading a single system and ensure high
availability.
Availability Levels:
 N+1 Availability: A configuration with one more instance than the minimum required, so the spare can absorb the failure of any single instance.
 N+M Availability: A configuration with multiple spare components to
ensure redundancy.
High Availability Strategies:
 Active-Passive Failover: One server is active, while others are
passive. The passive servers take over when the active one fails.
 Multi-Region Deployment: Distribute servers across different
geographic regions to ensure availability even in case of region-specific
failures.
HA in Databases:
 Clustered Database: Multiple database nodes that work together to
handle high loads and provide redundancy.
 Replication: Data is replicated across multiple nodes to avoid data
loss.
Best Practices:
 Ensure that there are multiple failover points within the system
architecture.
 Use load balancers to distribute traffic and reduce the risk of
overloading a single resource.
 Ensure that critical systems have backup power and network
connectivity.

3. Scalability
Core Concepts:
 Scalability refers to the system's ability to handle increased loads by
either adding more resources (vertical scaling) or adding more nodes
(horizontal scaling).
Types of Scalability:
 Vertical Scaling (Scaling Up): Adding more resources (CPU, RAM, disk space) to a
single machine to improve performance. It’s simpler but eventually has limitations.
o Example: Upgrading a web server's RAM from 16GB to 64GB.
 Horizontal Scaling (Scaling Out): Adding more machines to a system to handle
increased load. This is more complex but allows for greater flexibility and fault tolerance.
o Example: Adding more web servers behind a load balancer to
distribute traffic.
Scalability Techniques:
 Sharding: Splitting data across multiple machines or databases to handle more traffic and reduce load on individual systems. This is commonly used in databases (a routing sketch follows this list).
o Horizontal Sharding: Distribute data across multiple machines.
o Vertical Sharding: Split data based on types or categories (e.g.,
customers vs orders).
 Load Balancing: Distributing network traffic evenly across multiple servers to ensure
that no single server becomes overwhelmed.
 Caching: Storing frequently accessed data in memory (e.g., using Redis or Memcached)
to reduce the load on the database and improve response times.
 Auto-Scaling: Automatically adjusting the number of active resources (servers,
databases) based on load. This is often used in cloud environments (e.g., AWS Auto
Scaling, Azure Scale Sets).
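A minimal hash-based shard-routing sketch, standard library only; the shard count is hypothetical. Note that Python's built-in hash() is salted per process, so a stable hash like SHA-256 is used instead:

    import hashlib

    NUM_SHARDS = 4  # hypothetical shard count

    def shard_for(key: str) -> int:
        """Stable mapping from a partitioning key (e.g., a user ID) to a shard number."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    print(shard_for("user-42"))  # the same key always routes to the same shard

One caveat: with modulo sharding, changing NUM_SHARDS remaps most keys. Consistent hashing (sketched in Unit 13 under distributed caching) limits that reshuffling.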
Scalability Patterns:
 Microservices: Breaking down an application into smaller,
independently scalable services, allowing each part of the system to
scale independently.
 Event-Driven Architecture: Using asynchronous events to decouple
components and scale them independently.
Best Practices:
 Monitor the system load to determine when scaling is needed.
 Use containerization (e.g., Docker) for better resource isolation and
scaling.
 Optimize code and use efficient algorithms to avoid unnecessary scaling.
 Use content delivery networks (CDNs) to cache static content and
offload traffic.

4. Horizontal vs Vertical Scaling


Vertical Scaling:
 Pros:
o Simple to implement (just add more resources to a server).
o Useful for single-node applications or databases with heavy
read/write needs.
 Cons:
o There’s a limit to how much a single machine can scale.
o Can be costly because the hardware required for vertical scaling
is often expensive.
Horizontal Scaling:
 Pros:
o Scales much further than vertical scaling: capacity grows by adding nodes rather than hitting a single-machine ceiling, though coordination overhead grows with cluster size.
o Can handle massive traffic spikes by adding more nodes as
needed.
 Cons:
o More complex to manage and requires distributed systems
architecture.
o Can introduce problems such as data consistency,
partitioning, and load balancing challenges.

5. Managing Load
Core Concepts:
 Load Balancer: A device or software that distributes incoming traffic
across multiple servers to prevent any single server from becoming
overwhelmed.
Load-Balancing Algorithms:
 Round-Robin Load Balancing: Requests are sent to each server in turn, rotating through the pool (a selection sketch follows this list).
 Least Connections Load Balancing: Directs traffic to the server with
the least number of active connections.
 IP Hash Load Balancing: Routes requests based on a hash of the
client's IP address, ensuring that each client consistently connects to
the same server.
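A toy server-selection sketch for the first two algorithms; the server pool is hypothetical, and a real balancer would update the connection counts as connections open and close:

    import itertools

    SERVERS = ["app-1", "app-2", "app-3"]   # hypothetical backend pool
    active = {s: 0 for s in SERVERS}        # live connection count per server

    rr = itertools.cycle(SERVERS)           # round-robin: rotate through the pool
    def pick_round_robin() -> str:
        return next(rr)

    def pick_least_connections() -> str:    # least connections: pick the lightest-loaded server
        return min(active, key=active.get)

    print(pick_round_robin(), pick_least_connections())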
Cloud Load Balancers:
 Elastic Load Balancer (AWS): Automatically distributes incoming
application traffic across multiple targets (e.g., EC2 instances) to
ensure high availability and fault tolerance.
 Azure Load Balancer: A high-performance, low-latency load balancer
that distributes network traffic.

6. Data Consistency and Replication


Core Concepts:
 CAP Theorem: States that when a network partition occurs, a distributed system must choose between consistency and availability; it can guarantee at most two of the three properties below at any one time.
o Consistency: All nodes have the same data at the same time.
o Availability: Every request gets a response (either success or
failure).
o Partition Tolerance: The system can still function even if
there’s a network partition between nodes.
Replication Types:
 Master-Slave Replication: The master handles all writes, and slaves
replicate the data for read operations.
 Multi-Master Replication: Multiple nodes can handle reads and
writes, ensuring redundancy and fault tolerance.
Consistency Models:
 Eventual Consistency: Data is not immediately consistent but will
eventually become consistent over time.
 Strong Consistency: All nodes always reflect the same data at any
given time.

Unit 13: Caching, Indexing, and Data Stores

1. Caching
Core Concepts:
 Caching is the practice of storing data in a temporary storage (cache)
to reduce the time it takes to retrieve it and reduce the load on
underlying data sources like databases or APIs.
Types of Caching:
 In-Memory Caching: Data is stored in the memory of the application
server.
o Example: Redis, Memcached.
 Disk Caching: Data is cached on local disk; slower than in-memory caching but larger and cheaper.
o Example: on-disk HTTP caches, Varnish with file-backed storage.
 Distributed Caching: Cache is distributed across multiple servers or
locations.
o Example: Redis Cluster, Amazon ElastiCache.
Caching Strategies:
 Write-Through Cache: Data is written to both the cache and the database simultaneously, keeping the cache consistent with the underlying data store (a sketch combining write-through with LRU eviction follows this list).
 Write-Back Cache: Data is written to the cache first, and the data
store is updated later. This increases speed but can lead to consistency
issues if there are failures.
 Cache Eviction Policies:
o LRU (Least Recently Used): Removes the least recently used
data when the cache is full.
o LFU (Least Frequently Used): Removes the least frequently
accessed data.
o FIFO (First In First Out): Removes the oldest data first.
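A toy write-through cache with LRU eviction, standard library only; the plain dict stands in for a real database, and the capacity is deliberately tiny to show eviction:

    from collections import OrderedDict

    class WriteThroughLRUCache:
        """Toy write-through cache; `db` stands in for the real data store."""
        def __init__(self, db, capacity=2):
            self.db = db
            self.cache = OrderedDict()   # insertion order tracks recency
            self.capacity = capacity

        def _remember(self, key, value):
            self.cache[key] = value
            self.cache.move_to_end(key)              # mark as most recently used
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)       # LRU eviction: drop the oldest entry

        def put(self, key, value):
            self.db[key] = value                     # write-through: store is updated first,
            self._remember(key, value)               # then the cache, keeping both in sync

        def get(self, key):
            if key in self.cache:
                self.cache.move_to_end(key)          # a hit refreshes recency
                return self.cache[key]
            value = self.db[key]                     # miss: fall back to the store
            self._remember(key, value)
            return value

    store = {}
    cache = WriteThroughLRUCache(store)
    cache.put("a", 1); cache.put("b", 2); cache.put("c", 3)  # "a" is evicted (capacity 2)
    print(cache.get("b"), store)                             # store and cache agree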
Benefits of Caching:
 Speed: Faster data access by storing frequently requested data in a
faster, more accessible location.
 Reduced Load: Reduces the load on backend services (like
databases).
 Cost Efficiency: Reduces the need for repeated computation or data
fetching.
Challenges with Caching:
 Cache Invalidation: Ensuring that outdated data in the cache is
removed or updated when the underlying data changes.
 Cache Consistency: Ensuring that the cache is always in sync with
the data source.

2. Indexing
Core Concepts:
 Indexing improves the speed of data retrieval operations on a
database table at the cost of additional space and increased
maintenance on data updates.
Types of Indexes:
 B-Tree Indexes: Balanced tree structures that allow efficient searching, insertion, and deletion, including range scans.
o Used in: Relational databases like MySQL and PostgreSQL (a small demo follows this list).
 Hash Indexes: Use a hash function to map a key to its location. Best for equality searches but unsuitable for range queries.
o Used in: PostgreSQL hash indexes and many key-value stores; MongoDB supports hashed indexes mainly for hashed sharding.
 Full-Text Indexes: Allows for fast text-based searching by indexing words or phrases
within text fields.
o Used in: Elasticsearch, MySQL.
 Bitmap Indexes: Uses a bitmap to represent the data, allowing for fast queries in
columns with low cardinality (few distinct values).
 Composite Indexes: Indexes that involve multiple columns to speed up complex queries.
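A small standard-library demo of an index changing the query plan; the table and index names are hypothetical, and SQLite's plan wording varies slightly by version:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO users (email) VALUES (?)",
                     [(f"user{i}@example.com",) for i in range(10_000)])

    query = "SELECT id FROM users WHERE email = ?"
    plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("user5000@example.com",)).fetchall()
    print(plan)  # detail reads roughly "SCAN users": a full table scan

    conn.execute("CREATE INDEX idx_users_email ON users(email)")
    plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("user5000@example.com",)).fetchall()
    print(plan)  # now roughly "SEARCH users USING INDEX idx_users_email (email=?)"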
Benefits of Indexing:
 Faster Search: Indexing drastically speeds up search operations,
especially for large datasets.
 Optimized Queries: Complex queries involving joins and WHERE
clauses benefit greatly from indexing.
Challenges with Indexing:
 Storage Overhead: Indexes consume additional storage.
 Slow Insert/Update/Delete: Modifications to data (insert, update,
delete) can be slower since indexes need to be updated.

3. Types of Data Stores


Core Concepts:
 Data Stores refer to storage systems used to save and retrieve data.
Depending on the use case, different types of data stores may be used.
Types of Data Stores:
 Relational Databases (RDBMS):
o Data is stored in tables with predefined relationships (using SQL).
o Examples: MySQL, PostgreSQL, Oracle.
 NoSQL Databases:
o Schema-less databases that can handle unstructured or semi-
structured data.
o Types:
 Document Store: Stores data in documents, typically in
JSON format.
 Example: MongoDB, CouchDB.
 Key-Value Store: Stores data as key-value pairs, with fast
access to values by keys.
 Example: Redis, Amazon DynamoDB.
 Columnar (Wide-Column) Store: Organizes data by column families rather than rows; suited to write-heavy workloads and analytical scans.
 Example: Cassandra, HBase.
 Graph Database: Stores data in graphs (nodes and
edges), ideal for relationship-heavy data.
 Example: Neo4j, ArangoDB.
 Distributed Databases:
o Databases that store data across multiple machines to improve
availability and scalability.
o Example: Google Spanner, Cassandra.
 Object Storage:
o Data is stored as objects, each identified by a unique key, and is
typically used for large files (images, videos, backups).
o Example: Amazon S3, Google Cloud Storage.
 In-Memory Databases:
o Databases that store data in memory to reduce access times and
improve performance.
o Example: Redis, Memcached.
Choosing the Right Data Store:
 Transactional Workloads: Use relational databases (e.g., MySQL,
PostgreSQL) for systems that need ACID transactions.
 High Throughput and Low Latency: Use NoSQL systems (e.g.,
MongoDB, Redis) for applications that require high-speed reads and
writes.
 Complex Queries and Analytics: Use columnar stores or distributed
databases for systems with heavy analytical workloads (e.g.,
Cassandra, Google BigQuery).

4. Distributed Caching
Core Concepts:
 Distributed Caching involves using multiple cache servers or nodes that work together to store and manage data in a distributed manner. This scales cache storage across multiple servers and keeps data available even if a server fails (a consistent-hashing sketch closes this subsection).
Examples:
 Redis Cluster: A distributed version of Redis that allows for data
partitioning across multiple Redis nodes.
 Memcached: A high-performance, distributed memory caching system
that can be scaled horizontally.
Benefits of Distributed Caching:
 Scalability: Cache storage can grow as your data needs increase.
 Availability: Even if one cache node fails, other nodes can continue
serving requests.
Challenges:
 Consistency: Keeping data consistent across distributed cache nodes
can be complex.
 Network Latency: Communication between cache nodes introduces
overhead.
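A sketch of consistent hashing, the usual answer to distributing keys across cache nodes: when a node joins or leaves, only the keys near it on the ring move. The node names are hypothetical; real clients (and Redis Cluster's hash slots) use variations of this idea:

    import bisect, hashlib

    class HashRing:
        """Maps keys to cache nodes; adding or removing a node moves only nearby keys."""
        def __init__(self, nodes, vnodes=100):
            self.ring = []                    # sorted hash positions
            self.owner = {}                   # position -> node name
            for node in nodes:
                for i in range(vnodes):       # virtual nodes smooth the key distribution
                    pos = self._hash(f"{node}#{i}")
                    bisect.insort(self.ring, pos)
                    self.owner[pos] = node

        @staticmethod
        def _hash(key: str) -> int:
            return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

        def node_for(self, key: str) -> str:
            # First ring position clockwise from the key's hash owns the key.
            pos = self.ring[bisect.bisect(self.ring, self._hash(key)) % len(self.ring)]
            return self.owner[pos]

    ring = HashRing(["cache-1", "cache-2", "cache-3"])  # hypothetical node names
    print(ring.node_for("user-42"))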

5. Data Store Performance Optimization


Core Concepts:
 Optimizing data stores is critical to improving system performance,
scalability, and availability.
Key Techniques:
 Sharding: Distribute data across multiple servers based on some partitioning key (e.g.,
user ID or region).
 Replication: Create copies of data across multiple nodes or regions to increase
availability and reliability.
 Indexing: Use indexes for faster query execution. Choose the correct type of index based
on query patterns (e.g., B-tree, hash, full-text).
 Query Optimization: Write optimized queries by using efficient joins, avoiding
SELECT *, and utilizing indexing.
 Caching: Cache frequently accessed data in memory using in-memory caches (e.g.,
Redis, Memcached) to reduce load on the database.
 Data Denormalization: In cases where read performance is critical, denormalize your
data (duplicate data) to avoid complex joins and improve speed.
