Ravi Kant Shukla

Introduction to System Design for Interviews

System design is the process of defining a software system’s architecture, components, and interfaces to meet specific requirements. In tech interviews (especially at FAANG and similar companies), system design has become a crucial skill – top companies like Google and Amazon emphasize it, with roughly 40% of interviewers prioritizing system design expertise. A strong system design demonstrates that you can build robust, scalable systems that handle real-world demands. This introductory guide will cover fundamental concepts – from scalability and load balancing to caching and the CAP theorem – with analogies and real-world examples (Netflix, YouTube, WhatsApp, etc.) to illustrate each point. We’ll start with what system design entails and its core goals, then dive into key topics, and end with a summary of what’s next in this series.

What is System Design? Why It Matters

System Design refers to creating a high-level architecture that meets certain goals like performance, scalability, availability, reliability, and more. Unlike coding problems with one correct answer, system design is open-ended – you must define how different pieces (databases, services, APIs, etc.) work together to fulfill requirements. It’s important in interviews because it tests your ability to think big-picture and make trade-offs for complex systems (common in senior engineering roles).

Why do interviewers care? Good system design shows you can build systems that scale (grow to serve more users), stay available (up 24/7), remain reliable (few failures), and keep data consistent and latency low. These qualities are essential for large-scale products, such as social networks, e-commerce sites, or streaming services. As one resource notes, mastering system design helps in building robust, scalable solutions – and around 40% of tech recruiters prioritize it. Beyond interviews, these skills help you design systems right the first time in real jobs, preventing outages like early Twitter’s infamous “Fail Whale” (which happened because the system couldn’t cope with load it wasn’t designed for).

Key System Design Goals (Non-Functional Requirements):

Scalability: The ability of a system to handle increasing load (more users, data, traffic) without degrading performance. A scalable system can grow smoothly when demand grows. For example, a social app that goes from 1,000 to 1 million users should still perform well if designed to scale (perhaps by adding servers or optimizing code). Scalability comes in two flavors: vertical and horizontal scaling (discussed later).

Latency: The end-to-end response time of the system – how long it takes to fulfill a request. Low latency means the system responds quickly to user actions. (Contrast with throughput, which is how many requests can be handled per unit time.) For instance, when you tap a video on YouTube, latency refers to the delay before the video starts playing. Latency is critical for real-time services (gaming, chat). Techniques like caching and CDNs help reduce latency by serving data from closer locations. Throughput and latency often trade off (ultra-low latency modes might reduce overall throughput).

Availability: The proportion of time the system is operational and accessible. Often measured in “nines” (e.g., 99.99% availability means only ~52 minutes of downtime per year). High availability ensures users can use the service anytime, even if components fail. For example, WhatsApp is distributed across multiple data centers to stay available even if one site goes down. A highly available system returns some response for every request (it never simply crashes or hangs).

Reliability: The ability of the system to function correctly and consistently over time without failures. In simple terms, reliability is about correctness and continuity: the system does what it’s supposed to, day after day. For instance, a reliable storage system won’t lose or corrupt your data. Reliability is often improved by redundancy and thorough testing. (Note: Availability and reliability are related but distinct – availability is about uptime, reliability is about error-free operation. A system could be up (available) but returning incorrect results, which means it’s available but not reliable.)

Consistency: In system design, consistency usually refers to data consistency – ensuring that all users or nodes see the same data at the same time. In a consistent system, if you write (update) data and then read it, you will get the latest write every time. For example, if you update your profile picture, a consistent system ensures everyone who loads your profile sees the new picture immediately. Consistency is critical in domains like banking (your account balance must be correct across all systems). We’ll discuss consistency trade-offs more with the CAP theorem.

These goals often conflict with each other, so designing a system is about balancing trade-offs. For instance, achieving strong consistency might reduce availability (if you prefer to reject requests during updates), or maximizing scalability might increase latency (if you add network hops). In interviews, you’re expected to clarify which aspects are top priority for the given scenario (e.g., a banking system prioritizes consistency and reliability, while a social feed might favor availability and scalability).

Types of System Design: High-Level vs Low-Level Design

When approaching a design problem, engineers think in terms of High-Level Design (HLD) and Low-Level Design (LLD). These are complementary phases:

  • High-Level Design (HLD): the big-picture architecture of the system. HLD outlines the major components or modules, their interactions, and the overall flow of data/control. It’s analogous to an architect’s city map or initial sketch of a building, focusing on what the system comprises without too much detail. HLD documents might include system architecture diagrams, module descriptions, and how users will interact with the system. It considers both functional requirements (features the system must have) and high-level non-functional requirements (the “-ilities” like scalability, security, etc.). HLD is typically created in early stages by system architects or senior engineers. Example: In a web application, the HLD would specify there is a client app, a backend service, a database, perhaps a cache, and a load balancer, and how these pieces connect, but not the internal code of each.
  • Low-Level Design (LLD): the detailed design of individual components and modules. LLD dives into the implementation-level details – the exact algorithms, data structures, class designs, APIs, and interface definitions for each module. It’s like the engineer’s detailed blueprint or construction plan, filling in how each part will be built. LLD is usually done by developers after HLD is set and serves as a guide during coding. It ensures that the system’s components, as defined in HLD, can be implemented efficiently and correctly. Example: For the same web app, LLD might specify the database schema, the specific REST API endpoints and their request/response formats, the classes and functions in each service, and how caching logic works.

In simpler terms, HLD is the “architecture” (macro-level) – deciding which pieces are there and how they interact – while LLD is the “design of each module” (micro-level) – deciding how each piece works internally. Both are important: HLD ensures an overall coherent structure, and LLD ensures each part is well thought out.

For instance, think of planning a wedding as an analogy. The HLD of the wedding covers the overall plan – the venue, number of guests, high-level schedule of ceremony vs. reception, etc. The LLD of the wedding gets into specifics – the menu for dinner, the playlist for the DJ, the seating chart, floral arrangements, etc. The high-level plan guides the detailed prep, and the detailed decisions must still fit within the high-level plan.

When to use HLD vs LLD: HLD comes first – in initial project stages or system design interviews, you start with HLD to outline the solution. Once the high-level architecture is agreed upon, you move to LLD to work out the internals of each component before actual implementation. In an interview, if asked to design (say) YouTube, the interviewer expects mostly an HLD (how users upload videos, how videos are stored/streamed, overall components like web servers, databases, CDN, etc.). In a follow-up or a separate interview (often called “object-oriented design” or similar), you might be asked LLD questions, like designing the classes and methods for a particular module (e.g., the video recommendation algorithm or a messaging system’s class design).

Scalability Basics (Horizontal vs Vertical Scaling)

One of the first goals in system design is scalability – can your system handle growth in users or data? Scalability means the system can accommodate increasing load by adding resources, without a major drop in performance. To design scalable systems, it’s crucial to understand two strategies:

  • Vertical Scaling (Scale-Up): adding more power to a single server. This means upgrading the machine’s hardware – e.g., adding a faster CPU, more RAM, or more disk space to handle more load. It’s like upgrading a car’s engine to go faster. Vertical scaling is conceptually simple (you don’t change the architecture; you just run it on a bigger machine) and can be effective up to a point. Pros: simple to implement, no need to partition data, and your application complexity remains the same (only one node to manage). Cons: there are hard limits – you can only make a single machine so big (hardware has limits and gets exponentially expensive). Also, relying on one super-server means a single point of failure (if it crashes, the whole system is down). For example, an early-stage startup might vertically scale a database by moving from a 4-core server to a 16-core server when load grows. This improves capacity, but eventually they’ll hit a ceiling where one machine can’t handle more, or becomes too costly.

  • Horizontal Scaling (Scale-Out): adding more machines to distribute the load. This entails running the system across multiple servers and splitting the traffic or data among them. It’s like adding more delivery vans to a fleet to handle more packages, instead of using one giant truck. Horizontal scaling often requires a distributed architecture, where components like load balancers, caches, or database sharding split the work. Pros: In theory, horizontal scaling lets you grow without a near limit – you keep adding servers (cheap commodity machines) as load increases. It also improves fault tolerance: if one server fails, others can pick up the load, so the system can stay up (no single point of failure). Cons: it adds complexity – with many servers, you need mechanisms to route traffic (load balancers), keep data consistent across nodes, and handle partial failures gracefully. Managing 100 servers is much harder than 1 big server. There’s also network overhead – nodes must communicate, which can introduce latency.

Illustration: Imagine an e-commerce website experiencing traffic growth. Vertical scaling would mean upgrading its single database server with more CPU/RAM so it can handle more queries. Horizontal scaling would mean adding multiple database servers and splitting the customers/orders among them (this requires sharding or replication, which we’ll cover). In practice, modern systems use horizontal scaling for large-scale systems because it’s more cost-effective and resilient beyond a certain point. For example, Netflix handles millions of users by running on thousands of servers globally (horizontal), not on one huge mainframe.

A simple analogy is delivery vehicles: If one delivery van can’t handle all package deliveries in a day, you have two options – get a bigger, faster van (vertical scaling) or get multiple vans and drivers to split the deliveries (horizontal scaling). The first option might double capacity, but eventually you can’t find a van big enough; the second option allows scaling to an arbitrary number of packages by adding more vans.

Common scalability challenges: Scaling out a system introduces new issues. Coordination between servers becomes necessary (to keep data in sync, etc.). Network partitions or latency can affect consistency (we’ll see this with the CAP theorem). Cost can also rise (many servers + more complex software). There’s often a latency vs throughput trade-off when scaling: e.g., distributing a database across many nodes (sharding) can handle more queries overall, but an individual query might be slightly slower if it needs to gather results from multiple shards. Good system design mitigates these challenges with techniques like caching (to reduce work), load balancing, and careful choice of algorithms.

Load Balancing (and Role of Reverse Proxies)

When you have multiple servers (for horizontal scaling or redundancy), you need a way to distribute incoming requests so that no single server is overwhelmed. This is where a load balancer comes in. A load balancer is like a traffic cop sitting in front of your servers – it receives all incoming client requests and then routes each request to one of the backend servers, typically using some algorithm. By doing so, it ensures no one server gets too much traffic, improving overall throughput, reducing response times, and increasing reliability (if one server fails, the load balancer can send traffic to others).

Why load balancers are essential: Without a load balancer, clients might all hit a single server directly. That server could become a single point of failure and a bottleneck, while other servers are idle. A load balancer spreads the work out and can detect if a server is down, automatically redirecting traffic to healthy servers. This setup increases a system’s availability and fault tolerance. In cloud environments and large-scale systems (Netflix, Google, etc.), load balancers are everywhere – from user-facing edge layers (distributing traffic across data centers) down to internal service layers (balancing requests among microservices).

Common load balancing algorithms: Load balancers use various policies to decide which server gets the next request:

  • Round Robin: the simplest method – cycle through the server list, sending each new request to the next server in turn (Server1, then Server2, then 3, … and back to 1). This ensures a roughly equal number of requests to each server (assuming similar capacity). It’s easy to implement, but it doesn’t account for differences in server load or capacity.

  • Least Connections: a dynamic strategy that sends each new request to the server that currently has the fewest active connections (i.e., the least busy at that moment). This helps when some requests take longer than others – a server bogged down with slow requests will get less new traffic until it catches up. Most modern load balancers support least-connections because it balances load more evenly in real-time than round-robin.

  • IP Hash: This method uses a hash of the client’s IP address (and sometimes the request target) to consistently route the same client to the same server. This is useful for session stickiness – if a user’s session data is stored in memory on Server A, IP-hash ensures that the user always goes to Server A, so you don’t need to share session state between servers. It’s commonly used in cases where maintaining user state or cache locality is important.

  • Weighted Round Robin / Least Connections: variations that account for servers with different capacities. For example, if Server1 is twice as powerful as Server2, you assign weights so Server1 gets 2x the requests of Server2. This way, stronger servers do more work. Weighted least-connections similarly factor in server capacity when balancing load.

There are many more algorithms (random, shortest expected delay, etc.), but the above are the classics. In practice, reverse proxies or dedicated load balancer appliances implement these. For instance, Nginx or HAProxy (popular reverse proxy servers) can do round-robin or least-connection load balancing for web traffic. Cloud providers offer managed load balancers, which often use health checks and a combination of methods.
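
To make these policies concrete, here is a minimal Python sketch of round-robin, least-connections, and IP-hash selection. It is illustrative only – the server names are hypothetical, and in practice you would rely on Nginx, HAProxy, or a managed cloud load balancer rather than hand-rolled code.

```python
import itertools
import zlib

servers = ["server-a", "server-b", "server-c"]   # hypothetical backends

class RoundRobinBalancer:
    """Cycle through the server list, one request per server in turn."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Send each request to the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1      # a connection was opened on this server
        return server

    def release(self, server):
        self.active[server] -= 1      # call this when the connection closes

def ip_hash_pick(client_ip, servers):
    """Hash the client IP so the same client always lands on the same server."""
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]

rr = RoundRobinBalancer(servers)
print([rr.pick() for _ in range(6)])          # a, b, c, a, b, c

lc = LeastConnectionsBalancer(servers)
busy = lc.pick()                              # all idle, so any server
print(busy, lc.pick())                        # the next request avoids `busy`

print(ip_hash_pick("203.0.113.7", servers))   # stable mapping for this client
```

Weighted variants follow the same idea: for weighted round robin you can repeat a server in the rotation according to its weight, and for weighted least connections you compare active connections divided by weight.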

Reverse Proxy vs Load Balancer: A reverse proxy is a server that sits between external clients and your internal servers, forwarding client requests to the appropriate server. A load balancer is essentially a specialized reverse proxy focused on distributing load evenly. Many reverse proxy servers (like Nginx, HAProxy, Apache httpd) can act as load balancers. The reverse proxy’s role is not only load distribution; it can also handle common tasks like terminating SSL (handling HTTPS encryption), caching static content, and filtering requests. By doing so, it protects backend servers and can improve performance:

  • A reverse proxy can serve as a security barrier – clients only communicate with the proxy, not directly with backend servers. The proxy can filter out malicious traffic or block unwarranted requests (e.g., basic DDoS mitigation or IP whitelisting). This keeps the origin servers from exposure to the internet.

  • It can cache responses for static resources or frequent requests. For example, a reverse proxy might cache your site’s images, CSS, and JS files. When a client requests those, the proxy returns them immediately without bothering the backend, reducing load and latency.

  • It can perform content compression, logging, and other cross-cutting concerns. Essentially, a reverse proxy is an intermediary that can offload various tasks from the main servers.

In system design discussions, when we draw a load balancer, it’s effectively a type of reverse proxy. Services like Cloudflare, AWS Elastic Load Balancer, or Google Front End act as global reverse proxies to route and manage traffic into systems. For example, WhatsApp’s architecture uses a distributed network of data centers and likely employs load balancers so that each user’s connection is served from the nearest center, reducing latency and ensuring the system remains available even if one data center goes down.

With vs. without load balancing example: To visualize the impact of a load balancer, consider the diagrams below. The first diagram shows a scenario without a load balancer – all users send requests directly to a single server. That server becomes a bottleneck (overloaded, marked ❌), while the two other servers are not utilized at all. There’s also no failover: if that one server crashes, the service is down for everyone.

```
[User1]      [User2]      [User3]
   |            |            |
   +------------+------------+
                |
          [ Server A ]   (❌ Overloaded)
                |
         [ Service Down ]
```

Without a load balancer, all traffic goes to one server, which becomes overloaded (❌), while others sit idle.

In the second diagram, a load balancer is introduced in front of the servers. Now, user requests hit the load balancer first, and it distributes the requests across Server A, Server B, and Server C (each marked ✅, indicating healthy load levels). No single server is overwhelmed, and if one server fails, the load balancer can stop sending traffic to it and use the others – thus the application stays available.

```
[User1]      [User2]      [User3]
   |            |            |
   +------------+------------+
                |
        [ Load Balancer ]
                |
   +------------+------------+
   |            |            |
[ Server A ] [ Server B ] [ Server C ]
  (✅ OK)      (✅ OK)      (✅ OK)
```

With a load balancer, an intermediary node routes incoming requests evenly across multiple servers, preventing any single server from overload and improving redundancy.

In summary, load balancing is fundamental for scaling out and building highly available systems. Virtually every large-scale service (from Netflix streaming to Google Search) relies on tiers of load balancers to manage traffic. In design interviews, if you plan for multiple servers, you should mention a load balancer or reverse proxy to distribute requests.

Caching

Caching is a technique to speed up responses and reduce load by storing copies of frequently accessed data in a faster storage layer (like memory) closer to the user or application. The idea is simple: if you repeatedly need the same data, it’s inefficient to fetch it from a slow source (like a disk or a remote database) every time. Instead, keep a copy in a fast medium (RAM, or even CPU cache) so subsequent requests get it quickly.

Think of a library analogy: normally, retrieving a book from deep in the library stacks takes time. A clever librarian might keep a small cart at the front with the most popular books. If someone asks for a bestseller that’s on the cart, they get it immediately – no trip into the aisles. The cart is like a cache for the library. It’s limited in size, so the librarian only keeps recently borrowed or very popular books there. If a requested book isn’t on the cart (a cache miss), the librarian goes inside to get it (and maybe then adds it to the cart for next time). If it is in the cache (cache hit), the response is much faster.

In computer terms, caches exist at many levels: your CPU has small caches for RAM data, web browsers cache web pages, CDNs cache content at network edges, and applications use caches (in-memory stores like Redis or Memcached) to avoid repetitive database queries. The goal is always to trade off a bit of storage (and complexity of cache invalidation) in exchange for lower latency and lower load on the primary data source.

Where and why caching is used: Anywhere you have read-heavy workloads or expensive computations, caching can help. Examples:

  • Web caching: Browsers and CDNs save copies of images, HTML, and files so that repeat visits don’t fetch everything from the origin server. This dramatically reduces page load time and bandwidth. For instance, YouTube and Netflix use CDN servers worldwide to cache video content closer to users, so a popular show in Mumbai streams from a local server in Mumbai instead of from a US datacenter, reducing latency and load.

  • Database query caching: An application might store the results of expensive database queries in an in-memory cache. E.g., Twitter might cache the timeline for a user so that if that user refreshes again, the timeline is served from cache instead of being recomputed from many database lookups. When a tweet “goes viral,” Twitter would be hit with tons of requests for the same tweet data. To handle this efficiently, they serve subsequent requests from a cache (RAM) rather than hitting the database every time. The first request for that tweet is a cache miss (fetch from DB and store it in cache), and the next N requests are cache hits served quickly from memory.

  • Computations and HTML fragments: Sometimes expensive computations (like rendering a heavy part of a webpage or an expensive search query) are cached so that the result doesn’t have to be recomputed for a while.

How caching works (basics): When a client or application needs data, it first checks the cache:

  • If the data is present (cache hit) and still valid, return it immediately (this is fast, e.g., a memory lookup).
  • If it’s not present (cache miss), the application fetches from the original source (e.g., the database), then usually stores a copy in the cache before returning to the client. This way, the next request can be a hit.
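
As a concrete sketch of that read path, here is a minimal cache-aside implementation in Python. The in-memory dict, the 5-minute TTL, and the slow_db_lookup() helper are stand-ins invented for illustration; in a real system the cache would typically be Redis or Memcached.

```python
import time

CACHE = {}          # key -> (value, expires_at); stand-in for Redis/Memcached
TTL_SECONDS = 300   # e.g. cache a user profile for 5 minutes

def slow_db_lookup(user_id):
    """Placeholder for an expensive database query."""
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    entry = CACHE.get(user_id)
    if entry is not None:
        value, expires_at = entry
        if time.time() < expires_at:
            return value                   # cache hit: served from memory
        del CACHE[user_id]                 # entry expired: treat as a miss
    value = slow_db_lookup(user_id)        # cache miss: go to the source
    CACHE[user_id] = (value, time.time() + TTL_SECONDS)   # populate for next time
    return value

get_user(42)   # miss: hits the "database" and fills the cache
get_user(42)   # hit: returned straight from the in-memory cache
```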

To ensure the cache doesn’t serve stale or irrelevant data forever, we use strategies for updating or invalidating cache entries:

Caching strategies & write policies:

  • Cache Aside (Lazy loading): The application explicitly checks the cache first, and on a miss, loads from the DB and populates the cache. This is a common approach (used in the example above). The cache is just a passive store – the application is responsible for keeping it in sync. This strategy usually goes with eviction policies (discussed below) to eventually remove old data.

  • Write-Through Cache: On data update, write is done through the cache to the database – i.e., data is simultaneously written to the cache and the persistent store. This ensures the cache is always up-to-date with the source of truth (consistency), at the cost of slightly higher write latency (every write hits two places). In a write-through system, reads can be served directly from cache (which is always fresh). Example: updating a user’s profile info might immediately update the cache and the database in one transaction. This way, any subsequent read of that profile from the cache is correct.

  • Write-Back (Write-Behind) Cache: On update, write only to the cache initially, and defer writing to the database until later (asynchronously). This makes writes very fast (you’re just updating in-memory data), but introduces a risk: if the cache node dies before it has written to the DB, that write could be lost. Also, the database is temporarily inconsistent (lagging), but the cache has the latest. Write-back is useful for scenarios where you can tolerate a bit of eventual consistency and want to absorb write bursts. The cache will batch or periodically persist changes to the DB. Example: an analytics counter might use a write-back cache to count events in memory and flush to the DB every few seconds – this reduces DB load dramatically. (A small sketch contrasting write-through and write-back follows this list.)

  • Time-to-Live (TTL): This is a setting on cache entries to expire them after a certain time. For instance, you might cache a user’s profile for 5 minutes. After 5 minutes, the entry is considered stale and will be fetched fresh next time. TTLs ensure that even if you don’t explicitly invalidate cache on writes, the data won’t stay stale forever. Many caches (like Redis) let you set a TTL per key, after which the key is evicted. It’s a simple way to balance freshness vs performance: e.g., you might accept showing a slightly stale count of “likes” on a post for up to 10 seconds, so you cache it with a 10s TTL. After that, the next read will fetch the updated count from the DB. (In DNS, TTL is used so that records auto-refresh after some time – the concept is similar in app caching.) TTL is a time-based invalidation – one source describes it as “automatic expiration after a certain time”.

  • Explicit Invalidation: The application can also explicitly invalidate or update cache entries when the underlying data changes. For example, if a user changes their profile picture, the service could immediately invalidate the cache entry for that user’s profile so that subsequent reads will fetch the new picture from the database. This approach ensures freshness but adds complexity (you have to track and invalidate many possible cached pieces that might be affected by a single write).
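
Below is a minimal sketch contrasting the write-through and write-back policies described above. Everything in it is illustrative – plain dicts stand in for the cache and the database, and a real write-back cache would batch its flushes and handle node failures far more carefully.

```python
import queue
import threading
import time

db = {}       # stand-in for the persistent store (source of truth)
cache = {}    # stand-in for the in-memory cache

def write_through(key, value):
    """Write to the cache and the database together, so cached reads stay fresh."""
    cache[key] = value
    db[key] = value             # synchronous write -> slightly higher write latency

_dirty = queue.Queue()          # updates accepted but not yet persisted

def write_back(key, value):
    """Write only to the cache now; persist asynchronously later."""
    cache[key] = value
    _dirty.put((key, value))    # risk: lost if the cache node dies before the flush

def _flush_loop(interval=5):
    """Background worker that periodically drains dirty entries to the DB."""
    while True:
        time.sleep(interval)
        while not _dirty.empty():
            key, value = _dirty.get()
            db[key] = value

threading.Thread(target=_flush_loop, daemon=True).start()
```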

Eviction Policies: Since caches have limited size (you usually can’t cache everything), we need rules to evict (remove) some entries when the cache is full or when certain entries are not useful. Common eviction policies include:

  • Least Recently Used (LRU): Evict the entry that hasn’t been accessed for the longest time. This works on the assumption that if data hasn’t been used in a while, it’s less likely to be used soon (and if it is needed again, at least you freed space for more currently relevant data). LRU is very popular and is often the default policy in cache systems.

  • Least Frequently Used (LFU): Evict the item that has been used the least often (frequency count). This targets items that are rarely needed, even if one of them happens to have been accessed recently. It can better handle cases where some items are popular long-term but accessed just occasionally, though LFU has its own pitfalls: items that were popular in the past can linger with inflated counts, and newly added items may be evicted before they build up any frequency.

  • First-In First-Out (FIFO): Evict in the order items were added – the oldest entry (by insertion time) goes out first. This one doesn’t consider usage patterns; it’s simpler and rarely optimal except in specific scenarios.

Many real-world caches use a variant or combination (e.g., LRU-K or ARC algorithms). But as an interviewee, knowing LRU is usually enough to discuss.
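
As a quick illustration, here is a minimal LRU cache built on Python’s OrderedDict. It is a teaching sketch rather than production code – real systems usually lean on a library or on a cache server’s built-in eviction (for example, Redis’s allkeys-lru policy).

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: evicts the least recently used key when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                        # cache miss
        self._data.move_to_end(key)            # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)     # evict the least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # "a" is now the most recently used
cache.put("c", 3)       # evicts "b", the least recently used
print(cache.get("b"))   # None
```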

Real-world examples: Practically every large-scale system uses caching. For instance, Netflix caches content on its Open Connect CDN servers at ISP locations so that popular movies are delivered from local cache rather than across the world. WhatsApp uses in-memory caches (like Redis/Memcached clusters) to store frequently accessed metadata (user profiles, message states) to reduce hits to the primary database. This helps them serve billions of messages a day with low latency. In our library analogy, imagine the chaos if every book request required a trek into the stacks – caches (like the front-desk cart of books) alleviate that.

In system design interviews, you should identify which parts of the system would benefit from caching. For example, if designing YouTube, video metadata and thumbnails might be cached in memory, and a CDN will cache video files. If designing an online store, product catalog data might be cached, as well as the results of expensive search queries. Mention caching as a way to improve latency and throughput – as one source puts it, caching significantly decreases access time, lowers database load, and increases overall efficiency.

Of course, caching is not a silver bullet – you must consider cache coherence (stale data issues) and complexity of invalidation (as the saying goes: there are two hard things in CS – naming, cache invalidation, and off-by-one errors!). But effective use of caching is often the key to scaling read-heavy systems.

CAP Theorem (Consistency, Availability, Partition Tolerance)

No discussion of system design fundamentals is complete without the CAP theorem. It’s a cornerstone concept for distributed systems that states: In any distributed data system, you can only fully guarantee two out of the following three properties at the same time: Consistency, Availability, and Partition Tolerance. In other words: of C, A, and P, you get to pick two. This theorem (also known as Brewer’s Theorem) guides how we design databases and distributed services, especially when network failures occur.

Let’s clarify the terms in context:

  • Consistency (C): Every read receives the most recent write or an error. This means all nodes see the same data at the same time. If you write something and then read it (from any node in the distributed system), you will get that write – there is a single up-to-date value of the data. In a consistent system, clients never see out-of-date data. Another way to put it: a transaction or operation is atomic and fully replicated before it’s considered successful. (This is analogous to the strict consistency in databases, like ACID transactions, where once committed, everyone must see the commit.) For example, a strongly consistent banking system would never allow two different ATMs to show two different balances for your account – they will coordinate such that either both see the updated balance or one ATM will throw an error until it can synchronize.

  • Availability (A): Every request receives some (non-error) response, even under failures. That is, the system is always available to serve requests – it doesn’t hang or refuse to respond. In an available system, each operation eventually succeeds (it may return stale data, but it won’t fail). Concretely, this implies no “global wait” – every operational node will return a response. For example, an available DNS service will return a possibly cached (older) IP address rather than timeout if the authoritative server can’t be reached, because returning something is prioritized over absolute freshness. Availability is measured at the system level: even if parts of the system are down, the service as a whole continues to function for clients.

  • Partition Tolerance (P): The system continues to operate despite network partitions. A partition is a communication break between nodes – say, half your servers can’t talk to the other half due to a network failure. Partition tolerance means the system can tolerate this – it won’t crash or stop working entirely just because messages are dropped or delayed between components. In practice, network partitions are a fact of life in distributed systems (especially across data centers or regions). So, partition tolerance is usually non-negotiable: you must handle partitions if you want a reliable distributed system. That effectively means in a partition, you have to make a trade-off: do you sacrifice consistency or availability?

The CAP theorem says you cannot have 100% consistency, 100% availability, and 100% partition tolerance simultaneously in a distributed system. Intuitively, if a network partition occurs, you have two basic choices:

  1. Favor Consistency (sacrifice Availability): halt some operations or return errors on some nodes until the partition is resolved, to ensure nobody reads stale data. This yields a CP (consistent, partition-tolerant) system.
  2. Favor Availability (sacrifice Consistency): allow all nodes to continue serving requests (including possibly serving stale or conflicting data) despite the partition. This yields an AP (available, partition-tolerant) system.

(A third option, CA, would mean you try to have consistency and availability but not tolerate partitions – but in a practical distributed system, partitions will happen. A “CA” system is essentially a single-node system or assumes a perfect network, which isn’t realistic for distributed deployments. Relational databases on a single server can be CA – consistent and available when there is no partition, because it’s just one node – but across nodes you can’t avoid P.)

So designers of distributed databases and services explicitly choose between CP and AP when a partition happens. For example:

  • CP systems (Consistency + Partition tolerance) will refuse to respond (or throw an error) if they cannot be sure of up-to-date data. They prefer to be consistent and will tolerate downtime for some operations during network issues. MongoDB, in its default configuration, is often cited as a CP example. It uses a primary-secondary model; if a network partition isolates the primary or there’s any ambiguity, MongoDB will not process writes on the minority side – it would rather be unavailable to those writes than risk inconsistency. In effect, during a partition, some part of the cluster becomes read-only or offline until consistency can be guaranteed again. This is suitable for use cases where correctness is critical – e.g., finance: you’d rather reject a transaction than have two conflicting truths in different places.

  • AP systems (Availability + Partition tolerance) will serve requests even if the data might not be fully synchronized. They prefer to keep the service running 100% of the time, even if some reads might be stale or some writes might conflict (resolving them later). Apache Cassandra is a classic AP datastore. In Cassandra, any node can accept writes at any time (there’s no single master). If a partition happens, each side will keep accepting writes; they don’t shut anything off. This means two nodes might have inconsistent data for a key temporarily. Cassandra provides eventual consistency – after the partition heals, it has mechanisms (hinted handoff, read repair, etc.) to reconcile differences so all nodes converge to the latest state. During the partition, however, availability was maintained (no downtime), at the cost that a read on one side might not see a recent write that happened on the other side. Many large-scale web services choose AP designs for things like user feeds, product catalogs, etc., where it’s okay if a change takes a short time to propagate, but the service should rarely refuse requests. For example, Netflix uses Cassandra (AP) for certain features like viewing history or ratings – it’s better for Netflix that you can always stream videos (high availability) even if, say, your recently watched list might take a few seconds to update across all your devices.

To put it succinctly: CP means if you have a network failure, you stop some operations to stay consistent; AP means you keep going (serve all requests) but accept that some data may be stale or conflict until things sync up. Both approaches are valid depending on the application’s needs.

What about Consistency vs Availability in practice? It’s often a spectrum. Some systems allow tuning between strong and eventual consistency. For instance, Cassandra lets you configure read/write quorum settings to trade consistency for latency on a per-query basis (you can demand strong consistency by waiting for all replicas, or go for lower latency by accepting a response from just one replica, which might be slightly stale). This is sometimes known as “tunable consistency.” The CAP theorem, as originally stated, deals with a binary condition (system either chooses C or A during a partition), but in real life, engineers make nuanced choices and also consider factors outside CAP (like latency performance, which gave rise to the PACELC extension of CAP, beyond our scope here).
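
A small illustration of why quorum settings are “tunable” (this is just the arithmetic, not Cassandra’s actual API): with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to overlap on at least one up-to-date replica whenever R + W > N.

```python
def read_sees_latest_write(n_replicas: int, write_acks: int, read_acks: int) -> bool:
    """Quorum rule: the read set and write set must overlap on at least one replica."""
    return read_acks + write_acks > n_replicas

print(read_sees_latest_write(3, 2, 2))   # True  -> QUORUM writes + QUORUM reads: consistent
print(read_sees_latest_write(3, 1, 1))   # False -> ONE/ONE: lowest latency, reads may be stale
```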

Examples with MongoDB and Cassandra: As mentioned, MongoDB is generally CP – if the primary node is lost or a partition occurs, it will sacrifice availability (writes are halted) until a new primary is elected and data is consistent. This ensures you don’t have two primaries (split brain) accepting divergent writes – a conscious consistency choice. Cassandra is AP – it sacrifices immediate consistency for uptime: all nodes can accept writes, and the system will reconcile data later rather than make you wait or error out. These choices map to their use cases: MongoDB (CP by default) is often used when your data model requires document-level consistency and you can tolerate some waiting on failover; Cassandra (AP) is used when you need high write throughput, geographical distribution, and can tolerate eventual consistency (like analytics, big data, or feed data).

To highlight: In a network partition, you must give up either consistency or availability. If you try to maintain both (CA) during a partition, that violates the theorem – you’d have to magically sync data across broken links (impossible) or clone data instantly, etc. Therefore, understanding CAP is about understanding your trade-off under failure conditions. In normal operation (no partition), a well-designed system can be both consistent and available (like many databases achieve consistency and availability when the network is fine). But you plan for the worst: when (not if) a partition happens, what property do you prioritize?

Real systems use combinations: for example, a typical e-commerce site might use a strongly consistent database for orders (you don’t want to double-sell an item – so CP approach), but an eventually consistent system for product catalog and cache to ensure high availability of browsing. Recognizing which parts of a system need strong consistency vs which can be eventual is a key design skill. Many modern NoSQL databases explicitly discuss their CAP stance: Redis (in cluster mode) chooses availability (AP), Zookeeper chooses consistency (CP), etc. As another source notes, many NoSQL systems chose to relax consistency (providing eventual consistency) to remain highly available and partition-tolerant for distributed scale.

Database Sharding

When a single database can no longer handle the scale (too much data or too many queries), one key strategy is sharding. Database sharding means splitting a large database into smaller parts (shards) and distributing them across multiple database servers. Each shard holds a subset of the data, and collectively they make up the entire dataset. Sharding is a form of horizontal scaling for databases: instead of one huge DB server, you might have N smaller ones, each responsible for a portion of the data.

The classic analogy (as GeeksforGeeks humorously put it) is to think of your data as a pizza – instead of handing one enormous pizza to a single person, you cut it into slices and share among friends. Each slice (shard) is easier to handle. Similarly, if you have a user database with 100 million users, you could shard it into, e.g., 10 shards, each with 10 million users. Each shard is a full database in itself (same schema) but contains only the users in its partition. Queries for a particular user go to the shard that holds that user’s data, thereby spreading the load.

Sharding Types: There are a few ways to shard:

  • Horizontal Sharding (Range or Key-Based): This is the most common meaning of sharding – split rows by some key or range. Each shard holds a subset of the rows of a table, usually defined by a key’s value range or a hash. For example, imagine a user table sharded by the first letter of username: Shard 1 has users A–P, Shard 2 has Q–Z. Or you might hash the user ID modulo the number of shards to distribute uniformly. Horizontal sharding keeps the same schema on each shard (same columns), but different rows. It’s called horizontal because if you imagine a table, you’re cutting it horizontally into row chunks. Example: Instagram might shard its user data such that users are partitioned by user_id ranges – queries about a specific user go to one shard. Horizontal sharding is great for scaling out reads and writes when no single machine’s CPU/RAM can handle it. However, it needs a good shard key choice to avoid hotspots (if data isn’t evenly distributed, one shard could still become a bottleneck). Also, queries that need data from multiple shards (like a query for all users named John in A–Z) become more complex (the application might need to query all shards and combine results).

  • Vertical Sharding (Functional Partitioning): Splitting by feature or by columns. In vertical sharding, each shard does not have the same schema – instead, you break the database by tables or by modules. For instance, an application’s database might be split such that user profile info is in one database, and user posts or messages are in another. Each is a shard handling a portion of the overall functionality. Another form is splitting a wide table by columns into separate tables (though this is less common in practice compared to just separating tables). The idea is to isolate different data types or usage patterns. Example: On Twitter, one could shard by feature – store the user account info in one shard, the user’s followers list in another, and the user’s tweets in a third. Each shard can be scaled or optimized independently (the followers shard might need a very fast lookup, the tweets shard might be huge and need sharding within itself too). Vertical partitioning is often done for microservices: e.g., you create a User Service with its own DB, a Billing Service with its own DB, etc., effectively sharding by domain. Pros: simpler than horizontal in some ways (no single table is split across servers; each shard is a self-contained domain). Cons: each shard is still monolithic for its data – if one grows too large, you may still need horizontal sharding within it. Also, cross-shard (cross-domain) queries are like querying two different systems (you have to join at the application level).

  • Directory-Based Sharding: This is a more advanced approach where you keep a lookup table (directory) that maps each data item to the shard where it lives. For example, a service could maintain a mapping of user_id -> shard number in a small, fast database. Then, for any request, it first consults the directory to find the right shard. This adds an extra lookup, but allows flexibility (you can move data between shards and update the directory). Some systems use this to handle non-uniform distributions or to relocate “hot” keys to dedicated shards. The directory itself must be highly available and partition-tolerant, or you have a new bottleneck.

Most real systems start simple: pick a horizontal sharding key or do vertical splits, then add directory indirection if needed as they scale further.
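
Here is a minimal sketch of the two routing approaches just described – hash-based and directory-based – assuming a fixed shard count and a plain dict standing in for the directory service.

```python
import zlib

NUM_SHARDS = 10

def hash_shard(user_id: int) -> int:
    """Key-based (hash) sharding: deterministic, no lookup table needed."""
    return zlib.crc32(str(user_id).encode()) % NUM_SHARDS

# Directory-based sharding: an explicit mapping you can update when, say,
# a "hot" user is moved to a dedicated shard.
directory = {42: 3, 1001: 7}

def directory_shard(user_id: int) -> int:
    return directory.get(user_id, hash_shard(user_id))   # fall back to hashing

print(hash_shard(42), directory_shard(42))
```

Note the weakness of plain hash-modulo routing: changing the number of shards remaps almost every key, which is one reason consistent hashing (sketched later in this article) is popular.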

Advantages of Sharding:

  • Scalability: By spreading data across multiple servers, you can handle much larger datasets and higher throughput. Each shard handles a fraction of the load, so overall capacity increases linearly with the number of shards. If you need to handle more load, you can add more shards (though resharding isn’t trivial, it’s doable with planning). This makes scaling horizontally possible in the database tier, not just the app tier.
  • Performance: Queries can be faster because each shard is dealing with a smaller dataset. For example, a search through 10 million rows on one shard is faster than a search through 100 million rows on one big table. Also, shards can operate in parallel – 10 shards each handling 1k queries per second can collectively handle 10k QPS (if queries are mostly independent per shard). In other words, it increases throughput by parallelization. As long as each shard is on a separate host, you multiply your I/O and CPU capacity.

  • Reduced Single Point of Failure: If one shard goes down, the others are still operational (though part of your data is unavailable). This is better than one big database going down and taking out everything. With proper design, shard failures can be isolated and even recovered (with replication on each shard for reliability). For instance, if you shard by user ID, and shard #7 is temporarily down, only users mapped to #7 are affected; others continue normally – partial availability is better than full downtime.

  • Cost Efficiency: Instead of an extremely expensive high-end server, you can use multiple commodity servers (common in cloud setups). Many smaller boxes might be cheaper and more fault-tolerant than one giant machine. Plus, you can incrementally add shards as needed (pay as you grow).

However, sharding adds complexity and has downsides:

  • Increased Complexity in Application: The logic to route to the correct shard needs to reside in the application or a middleware. The app must know (or figure out) which shard to query for a given piece of data. This complicates development and testing. Developers must also handle merges of results from multiple shards if a query spans shards, which is non-trivial (like performing a join across shards, or an aggregate query across all shards).

  • Rebalancing/Data Movement: Over time, shards can become unbalanced – maybe one shard got a lot more users or data than others (especially if the shard key wasn’t perfectly uniform, or certain users generate way more data). You then have to reshard – split a shard or move some data from one shard to another. This is a challenging operation that can require downtime or complex live migration tools. It’s one of the trickiest operational aspects of sharding. For example, if shard #1 has 2x the data of the others, you might decide to create a new shard #11 and move half of shard #1’s data to #11. Doing that without downtime and without confusing the application requires careful coordination (and often the directory-based approach or hash-based sharding can mitigate the need for manual rebalancing).

  • Cross-shard queries and transactions: If your query needs data from multiple shards, the operation becomes slower and more complex. You might have to query all shards and combine results in memory (scatter-gather). Joins across shards are generally not possible directly (each shard only knows its data). Similarly, transactions that need to update data on multiple shards (distributed transactions) are much more complex to support (two-phase commit, etc., with performance hits). Often, designers try to choose a shard key such that most operations are localized to one shard (to avoid this issue). But some multi-shard operations are inevitable, and they will be less efficient than operations on a single monolithic DB.

  • Operational Overhead: Now you have, say, 10 databases instead of 1 – monitoring, backups, and schema changes all become more involved. If you need to alter a table schema, you must do it on all shards, potentially coordinating the timing. Each shard might need its own replication set-up for high availability, further multiplying components. Debugging issues can be harder (because the data is spread out).

Given these trade-offs, teams typically shard when they have no other option to scale. A common wisdom is “don’t shard prematurely.” Use easier scaling techniques first (caching, read replicas, vertical scaling). But once you reach a certain data size or traffic (for example, when vertical scaling and replication still can’t handle write load or dataset size), sharding is the way to go to continue scaling.

When to use sharding: When your single database can’t handle the volume of reads/writes or the data size is too large for one machine’s disk or memory, it’s time to shard. A rule of thumb might be that if your DB is into hundreds of gigabytes or more, and you have high sustained QPS, consider sharding. Companies like Facebook, Google, etc., shard extensively (often automatically managed by their infrastructure). As an example from our earlier notes, WhatsApp stores huge amounts of message data and user info – they use data partitioning (sharding) along with replication to achieve scalability and fault tolerance. By partitioning messages across multiple database nodes and replicating them, WhatsApp ensures it can handle billions of messages (shards provide scale) and not lose data (replicas provide reliability).

Another example: YouTube might shard its videos database by video ID or by uploader; Netflix might shard customer data by region, etc. In interviews, if asked to design something with a very large scale (millions of users, high write throughput), discussing sharding is often expected. You could say, “At X scale, we would implement database sharding. For instance, partition user data by user ID hash to distribute evenly across 10 shards, which reduces load per DB and allows scaling out. We’d also need a mechanism to map user IDs to shards – perhaps a consistent hashing scheme or a lookup service.”
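
A consistent hashing scheme, for instance, could look roughly like the sketch below (illustrative only; the shard names and virtual-node count are made up). Each shard owns several points (“virtual nodes”) on a hash ring, a key is routed to the first shard clockwise from its own hash, and adding or removing a shard only remaps the keys between neighbouring points instead of rehashing everything.

```python
import bisect
import hashlib

def _h(value: str) -> int:
    """Stable hash for placing shards and keys on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        # Each shard appears at `vnodes` points on the ring for smoother balance.
        self._ring = sorted((_h(f"{shard}#{i}"), shard)
                            for shard in shards for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect(self._points, _h(key)) % len(self._points)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
print(ring.shard_for("user:12345"))   # always the same shard for this key
```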

To summarize the pros/cons: Sharding enhances performance and scalability by parallelizing workload across servers and provides fault isolation (one shard’s failure is not total failure). However, it introduces complexity in management and query logic, and challenges in rebalancing and multi-shard operations. As one source put it, sharding is a great solution when a single database can’t handle the load, but it also adds complexity to your system – a clear trade-off.

Summary & Next Steps

In this introduction, we covered the foundational concepts of system design:

  • System design fundamentals and goals: Designing systems with scalability, low latency, high availability, reliability, and consistency in mind. We saw how these goals influence architecture choices (and sometimes conflict, requiring trade-offs).
  • High-Level vs Low-Level Design: HLD captures the overall system architecture (e.g., services, data flow, interactions), while LLD details the internal design of components. HLD is like an architect’s blueprint, and LLD is like the engineer’s implementation plan – both critical in system planning.
  • Scalability: Designing for growth via vertical scaling (scale-up hardware) vs horizontal scaling (scale-out with multiple nodes). Horizontal scaling is usually key for web-scale systems, but it demands strategies like load balancing and sharding to work.
  • Load Balancing: Using reverse proxies or dedicated load balancers to distribute traffic and avoid overload. We reviewed common balancing algorithms (round robin, least connections, IP hash) and saw how reverse proxies not only balance load but also provide caching and security benefits. A load-balanced architecture eliminates single-server bottlenecks and enables high availability by routing around failed nodes.
  • Caching: A powerful optimization to serve frequent requests fast and reduce database/backend load. By storing frequently used data in memory (or closer to users via CDNs), caching lowers latency dramatically. We discussed cache strategies (write-through vs write-back), TTL-based expiration, and eviction policies like LRU. Effective caching is often the difference between a system that scales easily and one that crumbles under read load.
  • CAP Theorem: In distributed systems, you cannot simultaneously have perfect consistency, availability, and partition-tolerance. We explained consistency and availability and why a network partition forces a system to choose CP or AP. This understanding guides the design of data storage systems – e.g., choosing a database like Cassandra (AP, eventual consistency) vs a database like MongoDB or a SQL DB in a failover setup (CP, strong consistency with potential downtime on partitions). The CAP trade-off depends on the application’s needs – e.g., financial transactions lean towards CP, social media feeds lean towards AP.
  • Database Sharding: Splitting a database into shards to scale horizontally. We saw horizontal sharding (splitting rows by key/range) and vertical partitioning (splitting by feature or table). Sharding can greatly increase capacity and performance by parallelizing across servers, but it adds complexity in terms of data distribution, query handling, and operations. It’s a crucial technique when a single database instance can no longer handle the workload (used by virtually all big platforms at some point).

Throughout, we used real-world analogies and examples: from librarians and delivery vans to how Netflix, WhatsApp, and others apply these concepts. For instance, Netflix uses CDNs (caches) and Cassandra (AP database) for an always-on streaming service, or WhatsApp uses a distributed architecture with load balancing across data centers for global availability.

What’s next? In subsequent parts of this System Design Interview Series, we will apply these fundamentals to specific system design problems. You can expect deep dives into designing real systems end-to-end – for example, how to design a URL shortening service, an Instagram-like social network, or the backend of a messaging app like WhatsApp. We’ll explore putting these concepts together: using load balancers and caches to serve billions of requests, choosing databases or combinations (SQL/NoSQL) and sharding strategies, ensuring reliability via replication and failover, and so on. We’ll also cover other advanced topics like microservices vs monoliths, message queues, event-driven design, security considerations, and designing for failure.

By mastering the fundamentals from this introduction – and understanding the reasoning behind each design decision (with the help of analogies and examples) – you’ll be well-prepared to tackle system design interview questions. In the next article, we’ll start with a practical example: tying these concepts together to design a simplified version of a real system. Stay tuned!

References & Further Reading

Disclaimer: Some concepts explained here are inspired by well-known system design resources and have been curated purely for educational purposes to help readers prepare for interviews.
