
Zero-Downtime Kubernetes Deployments: A Production Playbook

After shipping hundreds of production systems over two decades, I've learned that the gap between "mostly up" and "always up" is where engineering excellence lives. Zero-downtime Kubernetes deployments aren't magic — they're the result of deliberate architecture decisions made before a single `kubectl apply` is run.

💡

What You'll Learn

Rolling updates, blue-green deployments, and canary releases — with automated rollback triggers and real production configurations you can use today.

The Fundamentals First

Most teams treat zero-downtime deployments as a feature request. It's not. It's a design constraint that shapes every decision from how you structure your containers to how you manage database migrations.

Before diving into the strategies, let's align on what "zero-downtime" actually means. It means no failed requests during deployment — not just "service is up somewhere." That distinction matters when you have load balancers, health checks, and connection draining to configure correctly.

🏗️
Kubernetes rolling update flow — old pods terminate only after new ones pass health checks

Strategy 1: Rolling Updates (The Default)

Rolling updates are Kubernetes' built-in mechanism and should be your baseline. The key is configuring them correctly — the defaults are dangerous in production.

YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # Add 2 extra pods before removing old ones
      maxUnavailable: 0  # Never take a serving pod away before its replacement is ready
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
      - name: api
        image: registry.example.com/my-api:v1  # replace with your image
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 2
⚠️

Critical: Liveness vs Readiness

Liveness determines if a pod should be restarted. Readiness determines if it should receive traffic. Never confuse them. A pod failing readiness during a database migration should not trigger a restart.

Graceful Shutdown

Rolling updates alone aren't enough. Your application must handle SIGTERM gracefully — completing in-flight requests before exiting. Add this to your Node.js apps:

Node.js
process.on('SIGTERM', () => {
  console.log('SIGTERM received. Closing server gracefully...');

  // Stop accepting new connections; callback fires once in-flight requests finish
  server.close(async () => {
    // Drain shared resources once the last connection is done
    await closeDbPool();
    await closeRedisClient();
    console.log('Shutdown complete.');
    process.exit(0);
  });

  // Idle keep-alive sockets would otherwise hold close() open indefinitely (Node 18.2+)
  server.closeIdleConnections?.();

  // Force exit after 30s if graceful shutdown stalls
  setTimeout(() => {
    console.error('Forced shutdown after timeout.');
    process.exit(1);
  }, 30000);
});
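The in-app handler is only half of graceful shutdown; the pod spec has to cooperate. A short preStop sleep keeps the pod serving while endpoint removal propagates to load balancers, and `terminationGracePeriodSeconds` must exceed the preStop sleep plus your in-app timeout. A sketch of the relevant pod-template excerpt (values here match the 30s timeout above):

```yaml
# Pod template excerpt — pairs with the 30s in-app shutdown timeout
spec:
  terminationGracePeriodSeconds: 45   # > preStop sleep + in-app 30s timeout
  containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          # Keep serving briefly while the endpoint is removed from load balancers
          command: ["sh", "-c", "sleep 5"]
```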

Strategy 2: Blue-Green Deployments

When you can't afford the risk of a rolling update (schema changes, major API version bumps), blue-green is your tool. You maintain two identical environments and switch traffic atomically.

The implementation in Kubernetes uses a Service selector swap — the fastest and most reliable approach:

bash
#!/bin/bash
# deploy-blue-green.sh
set -euo pipefail

CURRENT=$(kubectl get service my-api -o jsonpath='{.spec.selector.version}')
NEW_VERSION=${1:-"v2"}

echo "Current: $CURRENT → Deploying: $NEW_VERSION"

# Deploy new version (doesn't receive traffic yet)
kubectl apply -f "deployment-${NEW_VERSION}.yaml"

# Wait for new deployment to be ready; set -e aborts if it never is
kubectl rollout status "deployment/my-api-${NEW_VERSION}" --timeout=5m

# Run smoke tests against the new deployment
if ./run-smoke-tests.sh "${NEW_VERSION}"; then
  # Switch traffic to new version (atomic selector swap)
  kubectl patch service my-api \
    -p "{\"spec\":{\"selector\":{\"version\":\"${NEW_VERSION}\"}}}"
  echo "✅ Traffic switched to ${NEW_VERSION}"

  # Keep old version running for 10 minutes (easy rollback)
  sleep 600
  kubectl delete deployment "my-api-${CURRENT}"
else
  echo "❌ Smoke tests failed. Keeping $CURRENT live."
  kubectl delete deployment "my-api-${NEW_VERSION}"
  exit 1
fi
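The script assumes a Service whose selector pins a version label, so the patch is what actually cuts traffic over. A minimal sketch of that Service (names match the script, ports are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api
    version: v1        # the deploy script patches this field to cut over
  ports:
  - port: 80
    targetPort: 8080
```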

Strategy 3: Canary Releases

Canary releases let you test a new version with a small percentage of real traffic before committing fully. Combined with monitoring and automated rollback, this is the gold standard for high-risk changes.

Vanilla Kubernetes has no weighted traffic splitting: a Service spreads requests roughly evenly across every pod matching its selector, so you control the split through the replica ratio between the stable and canary Deployments (true percentage-based routing needs an ingress controller or service mesh). Combined with automatic rollback based on error rate metrics:

bash
# Route ~5% of traffic to the canary: the Service selects app=my-api,
# which matches both Deployments, so the split follows the replica ratio
# (19 stable : 1 canary ≈ 5%)
kubectl scale deployment my-api-canary --replicas=1
kubectl scale deployment my-api --replicas=19

# Monitor error rates for 15 minutes (30 checks, 30s apart)
for i in {1..30}; do
  ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=rate(http_requests_total{status=~'5.*',version='canary'}[2m]) / rate(http_requests_total{version='canary'}[2m]) * 100" \
    | jq -r '.data.result[0].value[1] // "0"')

  if (( $(echo "${ERROR_RATE} > 1.0" | bc -l) )); then
    echo "🚨 Error rate ${ERROR_RATE}% — rolling back!"
    kubectl scale deployment my-api-canary --replicas=0
    exit 1
  fi
  sleep 30
done

echo "✅ Canary healthy. Promoting to 100%."
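Because the split follows replica counts, sizing the canary is simple arithmetic. A small helper, assuming you hold total capacity constant while shifting pods between the two Deployments (the function name is mine):

```shell
# canary_replicas TOTAL PERCENT → number of canary pods, rounded up so
# even a small percentage still gets at least one pod
canary_replicas() {
  local total=$1 pct=$2
  echo $(( (total * pct + 99) / 100 ))
}

# 20 pods total, 5% canary → 1 canary pod and 19 stable
canary_replicas 20 5    # prints 1
canary_replicas 20 25   # prints 5
```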
🚨

Database Migrations

Zero-downtime database migrations require the "expand/contract" pattern. Never add NOT NULL columns without defaults, never drop columns immediately. Run migrations in backward-compatible phases over multiple deployments.
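As a concrete sketch of those phases, here is a hypothetical rename of `users.name` to `full_name` (table, column, and function names are all invented); each phase ships with its own deployment, never combined:

```shell
# migration_sql PHASE → the backward-compatible SQL for that phase
migration_sql() {
  case "$1" in
    1) # EXPAND (deploy N): add the new nullable column; old code ignores it
       echo "ALTER TABLE users ADD COLUMN full_name text;" ;;
    2) # MIGRATE (deploy N+1): backfill; app now writes both, reads the new one
       echo "UPDATE users SET full_name = name WHERE full_name IS NULL;" ;;
    3) # CONTRACT (deploy N+2): nothing reads the old column any more
       echo "ALTER TABLE users DROP COLUMN name;" ;;
  esac
}

migration_sql 1   # apply via e.g.: psql "$DATABASE_URL" -c "$(migration_sql 1)"
```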

Monitoring & Automated Rollback

The final piece is automated rollback. Human reaction time isn't fast enough for production incidents. Your deployment pipeline should continuously monitor SLOs and roll back automatically when they're breached.

I use a simple but effective approach: a Prometheus alert that fires when the error rate breaches its threshold in the minutes after a deployment (a P99 latency rule follows the same shape), wired into the pipeline's rollback step:

YAML
# prometheus-alerts.yaml
groups:
- name: deployment
  rules:
  - alert: DeploymentErrorRateHigh
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[2m])) /
      sum(rate(http_requests_total[2m])) > 0.01
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Error rate > 1% — auto-rollback triggered"
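The alert only signals; something in the pipeline still has to act on it. A minimal sketch of the acting side, with the polling loop reduced to a float threshold check (the function name and deployment name are mine):

```shell
# should_rollback RATE THRESHOLD → exit 0 when the observed error rate
# (in percent) breaches the threshold; awk handles the float comparison
should_rollback() {
  awk -v r="$1" -v t="${2:-1.0}" 'BEGIN { exit !(r > t) }'
}

# In a real pipeline this value comes from the Prometheus query above
OBSERVED=2.3
if should_rollback "$OBSERVED" 1.0; then
  echo "SLO breach at ${OBSERVED}%: rolling back"
  # kubectl rollout undo deployment/my-api   # uncomment in a real pipeline
fi
```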

The Checklist

Before every production deployment, run through this list:

  • maxUnavailable: 0 set in your Deployment spec
  • Liveness and readiness probes configured and tested
  • Graceful shutdown handling SIGTERM with connection draining
  • Database migrations applied in expand/contract phases
  • Smoke tests gate traffic switching in blue-green flows
  • Error rate monitoring with automated rollback triggers
  • Old version kept available for at least 10 minutes post-switch

Zero-downtime deployments aren't a magic trick — they're the compound result of disciplined engineering across every layer. Get these fundamentals right, and your deployments become a non-event. Which is exactly how it should be.