Why Zero‑Downtime Matters
For a DevOps lead, every second of unavailability translates into lost revenue, eroded trust, and a damaged brand. Modern users expect services to stay online even when you push new features, security patches, or database migrations. Achieving zero downtime isn’t a magic trick; it’s a disciplined set of practices that combine container orchestration, smart proxying, and observability.
Prerequisites
Before you dive into the checklist, make sure you have the following in place:
- Docker Engine ≥ 20.10 on every host.
- Docker Compose (or a compatible orchestrator like Kubernetes) for multi‑service definitions.
- Nginx 1.21+ compiled with the `ngx_http_upstream_module`.
- A CI/CD runner (GitHub Actions, GitLab CI, or Jenkins) that can push images to a registry.
- Basic monitoring stack (Prometheus + Grafana or Loki + Grafana) for health checks.
Docker & Nginx Blueprint
Dockerfile (Node.js example)
```dockerfile
# Build stage: install all dependencies and compile the app
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: only production dependencies and the built output
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY package*.json ./
RUN npm ci --production
EXPOSE 3000
CMD ["node", "dist/index.js"]
```
The multi‑stage build keeps the final image lean (<50 MB) and isolates build‑time dependencies.
docker‑compose.yml
version: "3.9" services: api: build: . image: myorg/api:{{GIT_SHA}} restart: always environment: - NODE_ENV=production ports: - "3000" healthcheck: test: ["CMD", "curl", "-f", "http://localhost:3000/health"] interval: 10s timeout: 5s retries: 3 nginx: image: nginx:1.23-alpine ports: - "80:80" volumes: - ./nginx.conf:/etc/nginx/nginx.conf:ro depends_on: api: condition: service_healthy
The `depends_on` clause with `condition: service_healthy` ensures Nginx only starts routing traffic after the API passes its health check.
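If you want to confirm that health gate from the command line before relying on it in CI, Docker exposes the container’s health state directly. A minimal sketch, assuming Compose named the container `myproject-api-1` (check `docker compose ps` for the actual name):

```bash
# Poll the API container's health status until Docker reports it as "healthy".
# The container name below is an assumption; Compose derives it from the
# project directory, so verify it with `docker compose ps` first.
until [ "$(docker inspect --format '{{.State.Health.Status}}' myproject-api-1)" = "healthy" ]; do
  echo "waiting for api to pass its health check..."
  sleep 5
done
echo "api is healthy"
```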
Nginx Reverse‑Proxy Config
```nginx
worker_processes auto;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    upstream api_upstream {
        server api:3000 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://api_upstream;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
```
The upstream block lets Nginx gracefully handle a failing container by retrying other healthy instances.
CI/CD Pipeline Checklist
- Lint & Unit Test – Fail fast on code quality issues.
- Build Docker Image – Tag with commit SHA and `latest`.
- Run Integration Tests against a temporary compose stack.
- Push Image to Registry – Ensure the registry is reachable from production nodes.
- Deploy to Staging – Use the same compose file; run smoke tests.
- Blue‑Green Switch – Deploy the new version alongside the old one.
- Health‑Check Validation – Verify `/health` returns `200` before routing.
- Promote to Production – Swap Nginx upstream without downtime.
- Post‑Deploy Monitoring – Watch error rates for 5 minutes; auto‑rollback if thresholds are breached.
GitHub Actions Snippet
```yaml
name: CI/CD
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Log in to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DHUB_USER }}
          password: ${{ secrets.DHUB_PASS }}
      - name: Build & Push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: myorg/api:${{ github.sha }},myorg/api:latest
      - name: Deploy Blue-Green Stack
        run: |
          docker compose -f docker-compose.yml up -d --no-deps --scale api=2
          # health check loop omitted for brevity
```
The `--scale api=2` flag brings up the new container while keeping the old one alive, enabling a true blue‑green rollout.
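The health-check loop omitted above can live in the same workflow step. A minimal sketch, assuming the runner (or deploy host) can reach the API at `http://localhost:3000/health`:

```bash
#!/usr/bin/env bash
# Sketch of the omitted health-check loop: poll /health until it responds with
# a success status, or give up after 30 attempts. The URL is an assumption
# about where the freshly scaled-up container is reachable.
set -euo pipefail

HEALTH_URL="http://localhost:3000/health"

for attempt in $(seq 1 30); do
  if curl -fsS "$HEALTH_URL" > /dev/null; then
    echo "API healthy after ${attempt} attempt(s)"
    exit 0
  fi
  echo "Attempt ${attempt}: not healthy yet, retrying in 5s..."
  sleep 5
done

echo "API failed to become healthy; aborting deploy" >&2
exit 1
```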
Blue‑Green Deployment Workflow
- Spin Up New Version – `docker compose up -d api_new` creates a parallel service.
- Health Probe – Nginx only adds the new upstream after `/health` passes.
- Swap Traffic – Update the upstream block (or use DNS‑based service discovery) to point to the new container.
- Graceful Drain – Reduce `max_fails` on the old container; let existing connections finish.
- Verify Metrics – Check latency, error rate, and CPU usage for anomalies.
- Retire Old Containers – `docker compose rm -f api_old` once the new version is stable.
Because `nginx -s reload` starts fresh worker processes with the new upstream configuration and lets the old workers finish their in‑flight requests, you can swap upstreams without dropping connections.
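A minimal sketch of that swap‑and‑reload step, assuming the upstream line from the config above and an Nginx service named `nginx` in the compose file (`api_new` is a stand‑in for however you name the new container):

```bash
#!/usr/bin/env bash
# Sketch: point the upstream at the new container, then reload Nginx gracefully.
set -euo pipefail

# Rewrite the upstream target. Writing through `cat` preserves the file's inode,
# which matters because nginx.conf is bind-mounted into the container as a single file.
sed 's/server api:3000/server api_new:3000/' nginx.conf > nginx.conf.tmp
cat nginx.conf.tmp > nginx.conf && rm nginx.conf.tmp

# Validate the new config inside the running container, then reload.
# Old workers keep serving in-flight requests until they finish.
docker compose exec nginx nginx -t
docker compose exec nginx nginx -s reload
```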
Observability & Logging
- Prometheus – Scrape `/metrics` from both Nginx and the API.
- Grafana Dashboards – Visualize request latency, 5xx rates, and container restarts.
- Loki – Centralize Nginx access/error logs for quick grep.
- Alertmanager – Trigger a webhook to your CI system if error rate > 1% for 2 minutes (an example rule is sketched below).
Sample Prometheus target configuration:
```yaml
scrape_configs:
  - job_name: "nginx"
    static_configs:
      - targets: ["nginx:9113"]
  - job_name: "api"
    static_configs:
      - targets: ["api:3000"]
```
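The Alertmanager trigger mentioned above needs a matching alerting rule. A minimal sketch, assuming the API exposes a counter named `http_requests_total` labelled by status code (the exact metric name depends on your instrumentation library):

```yaml
groups:
  - name: deploy-guard
    rules:
      - alert: HighErrorRate
        # Fire when more than 1% of requests over the last 2 minutes were 5xx.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[2m]))
            / sum(rate(http_requests_total[2m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1% for 2 minutes"
```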
Rollback Strategy
- Keep the Previous Image – Tag releases as `v1.2.3` and never GC them immediately.
- Automated Rollback – If the health check fails three times in a row, run `docker compose down api_new && docker compose up -d api_old` (see the sketch after this list).
- Database Compatibility – Use backward‑compatible migrations or feature flags to avoid schema lock‑in.
- Post‑Mortem – Log the cause, update the checklist, and share findings with the team.
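A minimal sketch of that automated rollback, assuming the hypothetical `api_new`/`api_old` service names from the blue‑green workflow and a health endpoint reachable from the host:

```bash
#!/usr/bin/env bash
# Sketch: watch the new version for ~5 minutes and roll back after three
# consecutive failed health checks. Service names are assumptions.
set -euo pipefail

HEALTH_URL="http://localhost:3000/health"
FAILS=0

for i in $(seq 1 30); do    # 30 checks, 10s apart, roughly 5 minutes
  if curl -fsS "$HEALTH_URL" > /dev/null; then
    FAILS=0
  else
    FAILS=$((FAILS + 1))
    echo "Health check failed (${FAILS}/3)"
    if [ "$FAILS" -ge 3 ]; then
      echo "Rolling back to the previous version..." >&2
      docker compose stop api_new
      docker compose up -d api_old
      exit 1
    fi
  fi
  sleep 10
done

echo "New version stable after 5 minutes"
```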
Final Thoughts
Zero‑downtime deployments are achievable with a modest amount of automation, solid health checks, and disciplined observability. By following the checklist above, a DevOps lead can confidently ship new code without upsetting users or triggering costly incidents. If you need help shipping this, the team at RamerLabs can help.