Why Zero‑Downtime Matters
For a DevOps lead, every second of unavailability translates into lost revenue, eroded trust, and a damaged brand. Modern users expect services to stay online even when you push new features, security patches, or database migrations. Achieving zero downtime isn’t a magic trick; it’s a disciplined set of practices that combine container orchestration, smart proxying, and observability.
Prerequisites
Before you dive into the checklist, make sure you have the following in place:
- Docker Engine ≥ 20.10 on every host.
- Docker Compose (or a compatible orchestrator like Kubernetes) for multi‑service definitions.
- Nginx 1.21+ compiled with the `ngx_http_upstream_module`.
- A CI/CD runner (GitHub Actions, GitLab CI, or Jenkins) that can push images to a registry.
- Basic monitoring stack (Prometheus + Grafana or Loki + Grafana) for health checks.
Docker & Nginx Blueprint
Dockerfile (Node.js example)
```dockerfile
# Build stage: install all dependencies and compile the app
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: only production dependencies and the built output
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY package*.json ./
RUN npm ci --production
EXPOSE 3000
CMD ["node", "dist/index.js"]
```
The multi‑stage build keeps the final image lean (<50 MB) and isolates build‑time dependencies.
docker‑compose.yml
version: "3.9" services: api: build: . image: myorg/api:{{GIT_SHA}} restart: always environment: - NODE_ENV=production ports: - "3000" healthcheck: test: ["CMD", "curl", "-f", "http://localhost:3000/health"] interval: 10s timeout: 5s retries: 3 nginx: image: nginx:1.23-alpine ports: - "80:80" volumes: - ./nginx.conf:/etc/nginx/nginx.conf:ro depends_on: api: condition: service_healthy
The `depends_on` clause with `condition: service_healthy` ensures Nginx only starts routing traffic after the API passes its health check.
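If you want to confirm that health gate from the command line before relying on it in CI, Docker exposes the container’s health state directly. A minimal sketch, assuming Compose named the container `myproject-api-1` (check `docker compose ps` for the actual name):

```bash
# Poll the API container's health status until Docker reports it as "healthy".
# The container name below is an assumption; Compose derives it from the
# project directory, so verify it with `docker compose ps` first.
until [ "$(docker inspect --format '{{.State.Health.Status}}' myproject-api-1)" = "healthy" ]; do
  echo "waiting for api to pass its health check..."
  sleep 5
done
echo "api is healthy"
```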
Nginx Reverse‑Proxy Config
```nginx
worker_processes auto;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    upstream api_upstream {
        server api:3000 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://api_upstream;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
```
The upstream block lets Nginx gracefully handle a failing container by retrying other healthy instances.
CI/CD Pipeline Checklist
- Lint & Unit Test – Fail fast on code quality issues.
- Build Docker Image – Tag with commit SHA and `latest`.
- Run Integration Tests against a temporary compose stack.
- Push Image to Registry – Ensure the registry is reachable from production nodes.
- Deploy to Staging – Use the same compose file; run smoke tests.
- Blue‑Green Switch – Deploy the new version alongside the old one.
- Health‑Check Validation – Verify `/health` returns `200` before routing.
- Promote to Production – Swap Nginx upstream without downtime.
- Post‑Deploy Monitoring – Watch error rates for 5 minutes; auto‑rollback if thresholds are breached.
GitHub Actions Snippet
```yaml
name: CI/CD
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Log in to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DHUB_USER }}
          password: ${{ secrets.DHUB_PASS }}
      - name: Build & Push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: myorg/api:${{ github.sha }},myorg/api:latest
      - name: Deploy Blue-Green Stack
        run: |
          docker compose -f docker-compose.yml up -d --no-deps --scale api=2
          # health check loop omitted for brevity
```
The `--scale api=2` flag brings up the new container while keeping the old one alive, enabling a true blue‑green rollout.
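The health-check loop omitted above can live in the same workflow step. A minimal sketch, assuming the runner (or deploy host) can reach the API at `http://localhost:3000/health`:

```bash
#!/usr/bin/env bash
# Sketch of the omitted health-check loop: poll /health until it responds with
# a success status, or give up after 30 attempts. The URL is an assumption
# about where the freshly scaled-up container is reachable.
set -euo pipefail

HEALTH_URL="http://localhost:3000/health"

for attempt in $(seq 1 30); do
  if curl -fsS "$HEALTH_URL" > /dev/null; then
    echo "API healthy after ${attempt} attempt(s)"
    exit 0
  fi
  echo "Attempt ${attempt}: not healthy yet, retrying in 5s..."
  sleep 5
done

echo "API failed to become healthy; aborting deploy" >&2
exit 1
```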
Blue‑Green Deployment Workflow
- Spin Up New Version – `docker compose up -d api_new` creates a parallel service.
- Health Probe – Nginx only adds the new upstream after `/health` passes.
- Swap Traffic – Update the upstream block (or use DNS‑based service discovery) to point to the new container.
- Graceful Drain – Reduce `max_fails` on the old container; let existing connections finish.
- Verify Metrics – Check latency, error rate, and CPU usage for anomalies.
- Retire Old Containers – `docker compose rm -f api_old` once the new version is stable.
Because `nginx -s reload` starts fresh worker processes with the new upstream configuration and lets the old workers finish their in‑flight requests, you can swap upstreams without dropping connections.
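A minimal sketch of that swap‑and‑reload step, assuming the upstream line from the config above and an Nginx service named `nginx` in the compose file (`api_new` is a stand‑in for however you name the new container):

```bash
#!/usr/bin/env bash
# Sketch: point the upstream at the new container, then reload Nginx gracefully.
set -euo pipefail

# Rewrite the upstream target. Writing through `cat` preserves the file's inode,
# which matters because nginx.conf is bind-mounted into the container as a single file.
sed 's/server api:3000/server api_new:3000/' nginx.conf > nginx.conf.tmp
cat nginx.conf.tmp > nginx.conf && rm nginx.conf.tmp

# Validate the new config inside the running container, then reload.
# Old workers keep serving in-flight requests until they finish.
docker compose exec nginx nginx -t
docker compose exec nginx nginx -s reload
```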
Observability & Logging
- Prometheus – Scrape `/metrics` from both Nginx and the API.
- Grafana Dashboards – Visualize request latency, 5xx rates, and container restarts.
- Loki – Centralize Nginx access/error logs for quick grep.
- Alertmanager – Trigger a webhook to your CI system if error rate > 1% for 2 minutes (an example rule is sketched below).
Sample Prometheus target configuration:
```yaml
scrape_configs:
  - job_name: "nginx"
    static_configs:
      - targets: ["nginx:9113"]
  - job_name: "api"
    static_configs:
      - targets: ["api:3000"]
```
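The Alertmanager trigger mentioned above needs a matching alerting rule. A minimal sketch, assuming the API exposes a counter named `http_requests_total` labelled by status code (the exact metric name depends on your instrumentation library):

```yaml
groups:
  - name: deploy-guard
    rules:
      - alert: HighErrorRate
        # Fire when more than 1% of requests over the last 2 minutes were 5xx.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[2m]))
            / sum(rate(http_requests_total[2m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1% for 2 minutes"
```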
Rollback Strategy
- Keep the Previous Image – Tag releases as `v1.2.3` and never GC them immediately.
- Automated Rollback – If the health check fails three times in a row, run `docker compose down api_new && docker compose up -d api_old` (see the sketch after this list).
- Database Compatibility – Use backward‑compatible migrations or feature flags to avoid schema lock‑in.
- Post‑Mortem – Log the cause, update the checklist, and share findings with the team.
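A minimal sketch of that automated rollback, assuming the hypothetical `api_new`/`api_old` service names from the blue‑green workflow and a health endpoint reachable from the host:

```bash
#!/usr/bin/env bash
# Sketch: watch the new version for ~5 minutes and roll back after three
# consecutive failed health checks. Service names are assumptions.
set -euo pipefail

HEALTH_URL="http://localhost:3000/health"
FAILS=0

for i in $(seq 1 30); do    # 30 checks, 10s apart, roughly 5 minutes
  if curl -fsS "$HEALTH_URL" > /dev/null; then
    FAILS=0
  else
    FAILS=$((FAILS + 1))
    echo "Health check failed (${FAILS}/3)"
    if [ "$FAILS" -ge 3 ]; then
      echo "Rolling back to the previous version..." >&2
      docker compose stop api_new
      docker compose up -d api_old
      exit 1
    fi
  fi
  sleep 10
done

echo "New version stable after 5 minutes"
```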
Final Thoughts
Zero‑downtime deployments are achievable with a modest amount of automation, solid health checks, and disciplined observability. By following the checklist above, a DevOps lead can confidently ship new code without upsetting users or triggering costly incidents. If you need help shipping this, the team at RamerLabs can help.