ZeeshanAli-0704

Single Points of Failure - Example Case Study

🏬 Avoiding SPOFs: Real-World Case Study (E-Commerce System Design Example)

“Understanding Single Points of Failure (SPOF) is easy in theory, but seeing it in action changes how you design systems forever.”


🧠 Why This Example?

Let’s apply the SPOF concept to a real, distributed system:
an E-Commerce Web Application similar to Flipkart, Amazon, or Shopify.

We’ll:

  • Identify Single Points of Failure in each layer
  • Understand how failures propagate
  • Learn how to design for resilience

🏗️ Step 1: Our E-Commerce Architecture

Here’s a simplified architecture to start with:

           ┌───────────────┐
           │     Users     │
           └───────┬───────┘
                   │
           [ Internet / DNS ]
                   │
           ┌───────▼───────┐
           │ Load Balancer │
           └───────┬───────┘
                   │
          ┌────────┴────────┐
          │                 │
  ┌───────▼──────┐   ┌──────▼───────┐
  │ Web Server 1 │   │ Web Server 2 │
  └───────┬──────┘   └──────┬───────┘
          │                 │
          └────────┬────────┘
                   │
           ┌───────▼───────┐
           │ App Logic/API │
           └───────┬───────┘
                   │
     ┌─────────────┼─────────────┐
     │             │             │
┌────▼─────┐  ┌────▼─────┐  ┌────▼─────┐
│ Database │  │  Redis   │  │ FileStore│
│ (Orders) │  │  Cache   │  │ (Images) │
└──────────┘  └──────────┘  └──────────┘

🧩 Step 2: Identify Single Points of Failure

Let’s walk through each layer and see where it can break.


1. Load Balancer - Traffic Entry Point

Problem:
Only one load balancer (LB) handles all incoming requests.
If it fails, no user can reach your app, even though the servers are fine.

Symptoms:

  • Users see “Site Unavailable”
  • CPU or network spike on LB affects all traffic

SPOF: ✅ Yes - Single LB Instance

Fix:

  • Deploy multiple LBs (Active-Passive or Active-Active).
  • Use DNS-level failover (e.g., Amazon Route 53 health checks).
  • Use a managed load balancer (e.g., AWS Elastic Load Balancing) in cloud environments for built-in redundancy.

Better Architecture:

Users → DNS → [ LB1 | LB2 ] → Web Servers
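
The Active-Passive option above can be sketched in a few lines. This is a minimal illustration, not how any particular DNS provider or failover tool implements it: the `LoadBalancer` class, its `healthy` flag, and `pick_active` are hypothetical names, and the flag stands in for a real HTTP or TCP health probe.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoadBalancer:
    name: str
    healthy: bool  # assumption: stand-in for the result of a real health probe


def pick_active(lbs: List[LoadBalancer]) -> Optional[LoadBalancer]:
    """Return the first healthy LB in priority order, or None if all are down."""
    for lb in lbs:
        if lb.healthy:
            return lb
    return None


lbs = [LoadBalancer("LB1", healthy=True), LoadBalancer("LB2", healthy=True)]
primary = pick_active(lbs)        # LB1 serves traffic while it is healthy

lbs[0].healthy = False            # simulate LB1 failing
after_failover = pick_active(lbs)  # traffic moves to LB2
```

In an Active-Active setup both LBs would receive traffic at the same time, and the health check would only remove the failed one from rotation.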

2. Web Server Layer

Problem:
One web server hosts your frontend and backend.
If it crashes (e.g., the Nginx process dies or the instance reboots), the website goes down.

SPOF: ✅ Yes - Single Web Node

Fix:

  • Run multiple web servers (3+ instances across availability zones, or AZs).
  • Use the load balancer to distribute requests.
  • Design web tier as stateless (no sessions or files stored locally).

Better Architecture:

[LB Cluster] → [Web1, Web2, Web3]
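
To make the "distribute requests" step concrete, here is a small round-robin sketch that skips servers marked as down. The server names come from the diagram above; the `round_robin` function and the shared `down` set are assumptions made for illustration, since a real load balancer tracks health through its own probes.

```python
from itertools import cycle
from typing import Iterator, List, Set


def round_robin(servers: List[str], down: Set[str]) -> Iterator[str]:
    """Yield servers in rotation, skipping any currently marked as down."""
    for server in cycle(servers):
        if down.issuperset(servers):
            raise RuntimeError("no healthy web servers")
        if server not in down:
            yield server


down: Set[str] = set()
picker = round_robin(["Web1", "Web2", "Web3"], down)

first_three = [next(picker) for _ in range(3)]  # one request to each server

down.add("Web2")                                # simulate Web2 crashing
next_four = [next(picker) for _ in range(4)]    # Web2 is skipped from now on
```

Because the web tier is stateless, any request can go to any surviving server, so losing Web2 only reduces capacity.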

Example:
An AWS Auto Scaling group running multiple EC2 instances, or an orchestrator running multiple container replicas.


3. Application Layer

Problem:
If your backend API runs on a single app instance (say, Spring Boot),
any crash or deployment causes full downtime.

SPOF: ✅ Yes - Single App Instance

Fix:

  • Containerize and deploy multiple replicas (app1, app2, app3).
  • Use Kubernetes or Docker Swarm for orchestration and automatic restart.
  • Maintain stateless behavior (e.g., store sessions in Redis).
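
The stateless point is easier to see in code. In this sketch, `SessionStore` is an in-memory stand-in for a shared store such as Redis, and `AppInstance` is a hypothetical app replica; neither name comes from a real framework.

```python
from typing import Dict, Optional


class SessionStore:
    """In-memory stand-in for a shared store such as Redis (assumption)."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def save(self, session_id: str, session: dict) -> None:
        self._data[session_id] = session

    def load(self, session_id: str) -> Optional[dict]:
        return self._data.get(session_id)


class AppInstance:
    """Keeps no session state of its own; everything goes through the store."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def login(self, session_id: str, user: str) -> None:
        self.store.save(session_id, {"user": user})

    def whoami(self, session_id: str) -> Optional[str]:
        session = self.store.load(session_id)
        return session["user"] if session else None


store = SessionStore()
app1 = AppInstance("app1", store)
app2 = AppInstance("app2", store)

app1.login("s-123", "alice")
del app1                                  # simulate app1 crashing or being redeployed
user_seen_by_app2 = app2.whoami("s-123")  # the session survives on another replica
```

The trade-off is that the session store itself now has to be redundant, which is the topic of the cache layer below.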

4. Database Layer

Problem:
You use one MySQL instance for all orders, users, and products.
If it crashes or storage fails, the entire platform is unavailable.

SPOF: ✅ Yes - Database

Fix:

  • Deploy a Primary-Replica (Master-Slave) setup.
  • Enable automatic failover (e.g., RDS Multi-AZ or Vitess for MySQL; Patroni if you run PostgreSQL).
  • Use read replicas for scaling reads.
  • Perform regular backups and test restoration.

Example Topology:

      Primary DB (Write)
         ↙          ↘
  Replica 1      Replica 2   (Read)

Outcome:
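Even if the primary DB fails, a replica is promoted and takes over automatically. With asynchronous replication, the last few writes may not have reached that replica yet.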
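
As a rough sketch of what a failover manager has to decide, the code below promotes the healthy replica that is furthest along in applying the primary's log. The `Replica` class, the `replicated_position` field, and the position numbers are all hypothetical; tools such as RDS Multi-AZ handle this internally and differ in the details.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    healthy: bool
    replicated_position: int  # hypothetical: last log position applied


def choose_new_primary(replicas: List[Replica]) -> Replica:
    """Promote the healthy replica that has applied the most of the primary's log."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available for promotion")
    return max(candidates, key=lambda r: r.replicated_position)


replicas = [
    Replica("replica-1", healthy=True, replicated_position=1040),
    Replica("replica-2", healthy=True, replicated_position=1057),
]
new_primary = choose_new_primary(replicas)

primary_last_position = 1060  # assumed last write the failed primary acknowledged
writes_at_risk = primary_last_position - new_primary.replicated_position
```

The `writes_at_risk` value is why backups and tested restores still matter even with automatic failover.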


5. Cache Layer (Redis or Memcached)

Problem:
All sessions and cached product data are stored in a single Redis node.
If Redis dies, users get logged out or the site slows down drastically.

SPOF: ✅ Yes - Single Cache Node

Fix:

  • Use Redis Cluster or Sentinel for auto-failover.
  • Deploy replicas across multiple AZs.
  • Enable AOF persistence (to recover data on restart).
  • Implement graceful fallback to the DB when the cache is unavailable.
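
The graceful fallback in the last bullet could look roughly like this. `FlakyCache` is an in-memory stand-in for Redis that can be switched off, and `CacheUnavailable`, `get_product`, and `load_from_db` are hypothetical names; a real client would raise its own connection errors.

```python
from typing import Callable, Dict, List, Optional


class CacheUnavailable(Exception):
    """Raised by the cache client when the cache node cannot be reached."""


class FlakyCache:
    """In-memory stand-in for Redis that can be switched off to simulate an outage."""

    def __init__(self) -> None:
        self.up = True
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if not self.up:
            raise CacheUnavailable(key)
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.up:
            raise CacheUnavailable(key)
        self._data[key] = value


def get_product(key: str, cache: FlakyCache, load_from_db: Callable[[str], str]) -> str:
    """Cache-aside read that degrades to a direct DB read if the cache is down."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        return load_from_db(key)  # slower, but the request still succeeds
    value = load_from_db(key)
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # cache went away between read and write; serve the value anyway
    return value


db_reads: List[str] = []


def load_from_db(key: str) -> str:
    db_reads.append(key)
    return f"product:{key}"


cache = FlakyCache()
first = get_product("42", cache, load_from_db)   # miss: reads the DB, fills the cache
second = get_product("42", cache, load_from_db)  # hit: no DB read
cache.up = False                                 # simulate Redis dying
third = get_product("42", cache, load_from_db)   # falls back to the DB
```

In production the fallback path usually needs some form of rate limiting, otherwise a cache outage can push the full read load onto the database at once.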

6. Payment Gateway

Problem:
Your checkout process relies solely on one provider (e.g., Stripe).
If the Stripe API is down, every checkout fails until it recovers.

SPOF: ✅ Yes - Single External Dependency

Fix:

  • Integrate multiple providers (Stripe + Razorpay + PayPal).
  • Implement retry + failover logic in your payment service.
  • Queue failed transactions for retry or reconciliation.

Example:

PaymentService → [ Stripe | Razorpay | PayPal ]
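
A minimal sketch of the failover and queueing logic, assuming each provider is wrapped in a function that raises on failure. The `ProviderDown` exception, `charge_with_failover`, and the two fake provider functions are hypothetical and do not reflect any real payment SDK.

```python
from typing import Callable, List, Tuple


class ProviderDown(Exception):
    """Raised by a provider wrapper when the provider cannot take the charge."""


def charge_with_failover(
    amount_cents: int,
    providers: List[Tuple[str, Callable[[int], str]]],
    retry_queue: List[int],
) -> str:
    """Try each provider in order; queue the charge if every provider fails."""
    for name, charge in providers:
        try:
            return f"{name}:{charge(amount_cents)}"
        except ProviderDown:
            continue  # move on to the next provider
    retry_queue.append(amount_cents)
    return "queued"


def stripe_down(amount_cents: int) -> str:
    raise ProviderDown("simulated outage")


def razorpay_ok(amount_cents: int) -> str:
    return f"charged {amount_cents}"


retry_queue: List[int] = []
result = charge_with_failover(
    1999, [("stripe", stripe_down), ("razorpay", razorpay_ok)], retry_queue
)
all_down = charge_with_failover(500, [("stripe", stripe_down)], retry_queue)
```

A real payment service would also need idempotency keys, so that a request that timed out on one provider and then succeeded on another does not charge the customer twice.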

7. File & Image Storage

Problem:
You store product images on one server (/uploads).
If it fails, all images disappear from the frontend.

SPOF: ✅ Yes - Local Disk Storage

Fix:

  • Use Object Storage (S3, GCS, Azure Blob).
  • Enable versioning and multi-region replication.
  • Cache via CDN (CloudFront, Cloudflare) for global availability.

8. DNS

Problem:
Your DNS is hosted by a single provider (say, Cloudflare).
If it faces downtime, users can’t resolve your domain.

SPOF: ✅ Yes - Single DNS Provider

Fix:

  • Use a multi-provider DNS setup (Cloudflare + AWS Route53).
  • Keep TTL low (e.g., 60 seconds).
  • Monitor DNS resolution health.
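
Why the low TTL matters can be shown with simple arithmetic. The sketch below assumes resolvers honor the TTL and that a failure is detected after a fixed number of failed health checks; the 30-second interval and the threshold of 3 are illustrative values, not recommendations from any provider.

```python
def worst_case_failover_seconds(
    ttl_seconds: int, check_interval_seconds: int, failures_to_trip: int
) -> int:
    """Upper bound on how long clients may keep using a dead record.

    Detection takes up to `failures_to_trip` consecutive failed checks, and a
    resolver that cached the old answer just before the switch keeps it for a
    full TTL.
    """
    detection = check_interval_seconds * failures_to_trip
    return detection + ttl_seconds


low_ttl = worst_case_failover_seconds(
    ttl_seconds=60, check_interval_seconds=30, failures_to_trip=3
)   # 90 s to detect + 60 s of cached answers
high_ttl = worst_case_failover_seconds(
    ttl_seconds=3600, check_interval_seconds=30, failures_to_trip=3
)   # 90 s to detect + an hour of cached answers
```

With a 60-second TTL the worst case stays in the range of minutes; with a one-hour TTL the same failure can leave some users stranded for about an hour.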

9. Monitoring and Alerting

Problem:
You use a single Prometheus or ELK stack.
If it fails, you lose visibility during incidents.

SPOF: ✅ Yes - Central Monitoring Node

Fix:

  • Use a federated Prometheus setup or multi-region observability.
  • Mirror logs to S3 or Kafka for durability.
  • Keep dashboards available independently from production.
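
One common way to notice that the monitoring stack itself has died is a heartbeat checked from outside it. The `Watchdog` class below is a hypothetical sketch of that idea; timestamps are passed in explicitly so the logic is easy to follow and test.

```python
from typing import Optional


class Watchdog:
    """Runs outside the monitoring stack and expects regular heartbeats from it."""

    def __init__(self, max_silence_seconds: float) -> None:
        self.max_silence_seconds = max_silence_seconds
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def monitoring_is_silent(self, now: float) -> bool:
        """True when no heartbeat has arrived within the allowed window."""
        if self.last_heartbeat is None:
            return True
        return now - self.last_heartbeat > self.max_silence_seconds


watchdog = Watchdog(max_silence_seconds=120)
watchdog.heartbeat(now=1000.0)

silent_at_1060 = watchdog.monitoring_is_silent(now=1060.0)  # 60 s since heartbeat
silent_at_1200 = watchdog.monitoring_is_silent(now=1200.0)  # 200 s since heartbeat
```

The watchdog only helps if it runs on separate infrastructure and alerts through a separate channel; otherwise it fails together with the system it watches.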

10. Human and Process Level

Problem:
Only one DevOps engineer can deploy or access production.
If they’re unavailable, you’re blocked during outages.

SPOF: ✅ Yes - Human Process SPOF

Fix:

  • Cross-train multiple engineers.
  • Maintain clear runbooks and automated deployment pipelines.
  • Implement RBAC (Role-Based Access Control) instead of one admin.
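
A quick audit can surface human SPOFs before an outage does. This sketch assumes you can export a mapping from each capability to the people who hold it; the capability names and people below are invented for illustration.

```python
from typing import Dict, List, Set


def single_owner_capabilities(access: Dict[str, Set[str]]) -> List[str]:
    """Return capabilities that fewer than two people can exercise."""
    return sorted(cap for cap, people in access.items() if len(people) < 2)


# Hypothetical access export: capability -> people who hold it
access = {
    "deploy-production": {"dana"},
    "database-admin": {"dana", "ravi"},
    "dns-console": {"ravi"},
}
at_risk = single_owner_capabilities(access)
```

Anything in `at_risk` is a candidate for cross-training or for an additional role assignment.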

💡 Visual Summary

| Layer | SPOF Example | Fix / Redundancy Strategy |
|---|---|---|
| Load Balancer | One instance | Multiple LBs + DNS failover |
| Web/App Servers | Single instance | Auto-scaling stateless replicas |
| Database | One DB | Replication + Auto Failover |
| Cache | One Redis | Redis Cluster / Sentinel |
| Payment | One gateway | Multi-provider fallback |
| Storage | Local disk | S3 + CDN |
| DNS | One provider | Multi-DNS setup |
| Monitoring | Single ELK | Federated, redundant setup |
| Identity | One IdP | Multi-region or cached tokens |
| Human | One admin | Cross-training + automation |

🧱 Step 3: From SPOF to High Availability (HA)

| Category | Without SPOF Fix | With SPOF Fix |
|---|---|---|
| Uptime (illustrative) | ~95% | >99.9% |
| Failure Impact | Total outage | Partial degradation |
| Recovery Time | Hours | Seconds to minutes |
| Complexity | Low | Medium to high |
| Resilience | Weak | Strong & Predictable |
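
The uptime row can be reproduced with basic availability arithmetic. The sketch assumes five layers in series, each 99% available on its own, and assumes failures are independent; both are simplifications chosen for illustration, and real systems share failure causes such as a common zone or a bad deploy.

```python
from math import prod
from typing import List

HOURS_PER_YEAR = 365 * 24  # 8760


def serial(availabilities: List[float]) -> float:
    """A chain of components is up only when every component is up."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """N independent copies are down only when all copies are down at once."""
    return 1 - (1 - availability) ** copies


def downtime_hours_per_year(availability: float) -> float:
    return (1 - availability) * HOURS_PER_YEAR


# Five layers in series (LB, web, app, DB, cache), each assumed 99% available
single_chain = serial([0.99] * 5)            # about 0.951, i.e. roughly 95%
ha_chain = serial([redundant(0.99, 2)] * 5)  # about 0.9995 with each layer doubled

hours_down_single = downtime_hours_per_year(0.95)   # about 438 hours per year
hours_down_ha = downtime_hours_per_year(0.999)      # about 8.76 hours per year
```

Under these assumptions, doubling every layer moves the system from weeks of downtime per year to hours, which is the gap the table summarizes.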

🧰 Step 4: Architecture Evolution

Initial (SPOF everywhere)

Users → LB → Web → DB → Redis

Improved (HA and Resilient)

        DNS (Multi-Provider)
                  ↓
       ┌─────────────────────┐
       │    LB1       LB2    │
       └─────┬─────────┬─────┘
             │         │
        ┌────▼───┐ ┌───▼────┐
        │  Web1  │ │  Web2  │
        └────┬───┘ └───┬────┘
             │         │
             └────┬────┘
                  │
          ┌───────▼───────┐
          │  App Cluster  │
          └───────┬───────┘
                  │
        ┌─────────┼─────────┐
        │         │         │
    ┌───▼───┐ ┌───▼───┐ ┌───▼───┐
    │  DB1  │ │  DB2  │ │ Redis │
    └───────┘ └───────┘ └───────┘

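Now, no single failure in the load balancer, web, app, or database layers kills the system; it may degrade but remains available. The Redis box would also run with replicas, as described in the cache layer section.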


⚙️ Step 5: Testing Your SPOF Fixes

Once redundancy is in place:

  1. Run chaos experiments: kill random pods/nodes.
  2. Simulate DB failover and check if API recovers.
  3. Stop one LB and confirm DNS reroutes properly.
  4. Monitor latency and error rates during failover.

Use tools like:

  • Chaos Monkey / LitmusChaos / AWS FIS
  • Synthetic traffic probes
  • Load test under partial failures
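
Before reaching for a chaos tool, the core idea can be tried as a small simulation: kill one random replica and check whether the service still answers. The replica names and helper functions here are hypothetical and model only the "at least one replica is up" rule, not a real cluster.

```python
import random
from typing import Dict


def service_available(replicas: Dict[str, bool]) -> bool:
    """The service answers as long as at least one replica is up."""
    return any(replicas.values())


def kill_random_replica(replicas: Dict[str, bool], rng: random.Random) -> str:
    """Mark one currently running replica as down and return its name."""
    running = sorted(name for name, up in replicas.items() if up)
    victim = rng.choice(running)
    replicas[victim] = False
    return victim


rng = random.Random(7)  # fixed seed so the experiment is repeatable
redundant_tier = {"app1": True, "app2": True, "app3": True}
single_tier = {"app1": True}

kill_random_replica(redundant_tier, rng)
kill_random_replica(single_tier, rng)

survived_redundant = service_available(redundant_tier)  # two replicas remain
survived_single = service_available(single_tier)        # nothing left to serve
```

Real chaos experiments add the parts a simulation cannot show, such as how long failover takes and what error rates users see in the meantime.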

🧭 Final Thoughts

Building a system without SPOFs means designing for:

  • Redundancy (no single node dependency)
  • Graceful degradation (service should still work partially)
  • Fast recovery (automated failover)
  • Observability (know what failed and why)

It’s not about being failure-free;
it’s about being failure-tolerant.


🚀 TL;DR - The Mindset Shift

Before:
“What happens if this fails?”

After:
“What continues to work when this fails?”

That’s the difference between a fragile system and a resilient one.


More Details:

Get all articles related to system design
Hashtag: SystemDesignWithZeeshanAli

Git: https://github.com/ZeeshanAli-0704/SystemDesignWithZeeshanAli
