After shipping hundreds of production systems over two decades, I've learned that the gap between "mostly up" and "always up" is where engineering excellence lives. Zero-downtime Kubernetes deployments aren't magic — they're the result of deliberate architecture decisions made before a single `kubectl apply` is run.
What You'll Learn
Rolling updates, blue-green deployments, and canary releases — with automated rollback triggers and real production configurations you can use today.
The Fundamentals First
Most teams treat zero-downtime deployments as a feature request. It's not. It's a design constraint that shapes every decision from how you structure your containers to how you manage database migrations.
Before diving into the strategies, let's align on what "zero-downtime" actually means. It means no failed requests during deployment — not just "service is up somewhere." That distinction matters when you have load balancers, health checks, and connection draining to configure correctly.
Strategy 1: Rolling Updates (The Default)
Rolling updates are Kubernetes' built-in mechanism and should be your baseline. The key is configuring them correctly — the defaults are dangerous in production.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2          # Add 2 extra pods before removing old ones
      maxUnavailable: 0    # Never drop below the desired replica count
  template:
    spec:
      containers:
        - name: api
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 2
```
Critical: Liveness vs Readiness
Liveness determines if a pod should be restarted. Readiness determines if it should receive traffic. Never confuse them. A pod failing readiness during a database migration should not trigger a restart.
Graceful Shutdown
Rolling updates alone aren't enough. Your application must handle SIGTERM gracefully — completing in-flight requests before exiting. Add this to your Node.js apps:
```js
process.on('SIGTERM', async () => {
  console.log('SIGTERM received. Closing server gracefully...');
  // Stop accepting new connections; existing ones are allowed to finish
  server.close(async () => {
    // Connections drained — now release shared resources
    await closeDbPool();
    await closeRedisClient();
    console.log('Shutdown complete.');
    process.exit(0);
  });
  // Force exit after 30s if graceful shutdown stalls
  setTimeout(() => {
    console.error('Forced shutdown after timeout.');
    process.exit(1);
  }, 30000);
});
```
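One gotcha the handler above can't solve on its own: Kubernetes removes a terminating pod from Service endpoints asynchronously, so SIGTERM can arrive while a few requests are still being routed in. A short `preStop` sleep — a common convention, not an official requirement — gives endpoint removal time to propagate before your handler runs. Pair it with a `terminationGracePeriodSeconds` longer than the sleep plus your 30s shutdown timeout:

```yaml
spec:
  terminationGracePeriodSeconds: 45   # > preStop sleep + 30s app timeout
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            # Let kube-proxy and the load balancer stop routing new
            # requests here before SIGTERM is delivered to the app
            command: ["sh", "-c", "sleep 5"]
```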
Strategy 2: Blue-Green Deployments
When you can't afford the risk of a rolling update (schema changes, major API version bumps), blue-green is your tool. You maintain two identical environments and switch traffic atomically.
The implementation in Kubernetes uses a Service selector swap — the fastest and most reliable approach:
```bash
#!/bin/bash
# deploy-blue-green.sh
CURRENT=$(kubectl get service my-api -o jsonpath='{.spec.selector.version}')
NEW_VERSION=${1:-"v2"}

echo "Current: $CURRENT → Deploying: $NEW_VERSION"

# Deploy new version (doesn't receive traffic yet)
kubectl apply -f "deployment-${NEW_VERSION}.yaml"

# Wait for new deployment to be ready
kubectl rollout status "deployment/my-api-${NEW_VERSION}" --timeout=5m

# Run smoke tests against the new deployment
if ./run-smoke-tests.sh "${NEW_VERSION}"; then
  # Switch traffic to new version (atomic selector swap)
  kubectl patch service my-api \
    -p "{\"spec\":{\"selector\":{\"version\":\"${NEW_VERSION}\"}}}"
  echo "✅ Traffic switched to ${NEW_VERSION}"

  # Keep old version running for 10 minutes (easy rollback)
  sleep 600
  kubectl delete deployment "my-api-${CURRENT}"
else
  echo "❌ Smoke tests failed. Keeping $CURRENT live."
  kubectl delete deployment "my-api-${NEW_VERSION}"
  exit 1
fi
```
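For reference, the selector swap works because both Deployments label their pods with a `version`, and the Service selects on that label. A sketch of the manifests the script assumes (names and ports are illustrative):

```yaml
# service.yaml — traffic goes wherever spec.selector.version points
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api
    version: v1        # the deploy script patches this field
  ports:
    - port: 80
      targetPort: 8080
---
# deployment-v2.yaml — identical to v1 except labels and image tag
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api-v2
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-api
      version: v2
  template:
    metadata:
      labels:
        app: my-api
        version: v2
    spec:
      containers:
        - name: api
          image: my-api:v2
```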
Strategy 3: Canary Releases
Canary releases let you test a new version with a small percentage of real traffic before committing fully. Combined with monitoring and automated rollback, this is the gold standard for high-risk changes.
Plain Kubernetes Services have no weighted routing, so the split here comes from replica ratios: widening the Service selector to match both stable and canary pods sends each pod an equal share of traffic (one canary pod alongside nineteen stable pods ≈ 5%). Combined with a Prometheus error-rate check and automatic rollback:

```bash
# Widen the selector so it matches stable *and* canary pods.
# With 19 stable pods and 1 canary pod, the canary gets ~5% of traffic.
kubectl patch service my-api --type='json' \
  -p='[{"op":"replace","path":"/spec/selector","value":{"app":"my-api"}}]'

# Monitor the canary's error rate for 15 minutes (30 checks, 30s apart)
for i in {1..30}; do
  ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=rate(http_requests_total{status=~'5.*',version='canary'}[2m]) / rate(http_requests_total{version='canary'}[2m]) * 100" \
    | jq -r '.data.result[0].value[1] // "0"')

  if (( $(echo "$ERROR_RATE > 1.0" | bc -l) )); then
    echo "🚨 Error rate ${ERROR_RATE}% — rolling back!"
    # Scale the canary Deployment to zero; stable pods keep serving
    kubectl scale deployment my-api-canary --replicas=0
    exit 1
  fi
  sleep 30
done

echo "✅ Canary healthy. Promoting to 100%."
```
echo "✅ Canary healthy. Promoting to 100%."
Database Migrations
Zero-downtime database migrations require the "expand/contract" pattern. Never add NOT NULL columns without defaults, never drop columns immediately. Run migrations in backward-compatible phases over multiple deployments.
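As a sketch, here's what expand/contract looks like for a hypothetical column rename (`username` → `login`), spread across three releases so that every running app version always finds the columns it expects:

```sql
-- Release 1 (expand): add the new column, nullable; app writes to both
ALTER TABLE users ADD COLUMN login TEXT;

-- Release 2 (migrate): backfill, then ship app code that reads only login
UPDATE users SET login = username WHERE login IS NULL;

-- Release 3 (contract): only after no running version touches the old column
ALTER TABLE users DROP COLUMN username;
```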
Monitoring & Automated Rollback
The final piece is automated rollback. Human reaction time isn't fast enough for production incidents. Your deployment pipeline should continuously monitor SLOs and roll back automatically when they're breached.
I use a simple but effective approach — a Prometheus alert rule that fires when the error rate breaches its threshold shortly after a deployment, which the pipeline watches to trigger rollback (the same pattern extends to P99 latency):
```yaml
# prometheus-alerts.yaml
groups:
  - name: deployment
    rules:
      - alert: DeploymentErrorRateHigh
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[2m])) /
          sum(rate(http_requests_total[2m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 1% — auto-rollback triggered"
```
The Checklist
Before every production deployment, run through this list:
- `maxUnavailable: 0` set in your Deployment spec
- Liveness and readiness probes configured and tested
- Graceful shutdown handling `SIGTERM` with connection draining
- Database migrations applied in expand/contract phases
- Smoke tests gate traffic switching in blue-green flows
- Error rate monitoring with automated rollback triggers
- Old version kept available for at least 10 minutes post-switch
Zero-downtime deployments aren't a magic trick — they're the compound result of disciplined engineering across every layer. Get these fundamentals right, and your deployments become a non-event. Which is exactly how it should be.