Introduction
In the fast-paced world of software deployment, the ability to release new features safely and efficiently can make or break your application's reliability. Canary deployments have emerged as a critical strategy for minimizing risk while maintaining continuous delivery. In this comprehensive guide, we'll explore how to implement robust canary deployments using Flagger, a progressive delivery operator for Kubernetes.
What is Canary Deployment?
Canary deployment is a technique for rolling out new features or changes to a small subset of users before releasing the update to the entire system. Named after the "canary in a coal mine" practice, this approach allows you to detect issues early and roll back quickly if problems arise.
Instead of replacing your entire application at once, canary deployments gradually shift traffic from the stable version (primary) to the new version (canary), monitoring key metrics throughout the process. If the metrics indicate problems, the deployment automatically rolls back to the stable version.
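In Flagger's terms (the full configuration appears later in this guide), that behaviour comes down to a few analysis settings. The fragment below is purely illustrative; the values are examples:

```yaml
# Fragment of a Canary resource's spec
analysis:
  interval: 1m     # how often metrics are checked
  stepWeight: 10   # traffic percentage shifted to the canary at each step
  maxWeight: 50    # cap on canary traffic before promotion
  threshold: 5     # failed checks tolerated before automatic rollback
```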
Why Choose Flagger?
Flagger is a progressive delivery operator that automates the promotion or rollback of canary deployments based on metrics analysis. Here's why it stands out:
- Automated Traffic Management: Gradually shifts traffic between versions
- Metrics-Driven Decisions: Uses Prometheus metrics to determine deployment success
- Multiple Ingress Support: Works with NGINX, Istio, Linkerd, and more
- Webhook Integration: Supports custom testing and validation hooks
- HPA Integration: Seamlessly works with Horizontal Pod Autoscaler
Prerequisites and Setup
As noted above, Flagger supports multiple integrations; this guide uses the NGINX ingress controller for traffic routing and Prometheus for metrics.
Required Components
- NGINX Ingress Controller (v1.0.2 or newer)
- Horizontal Pod Autoscaler (HPA) enabled
- Prometheus for metrics collection and analysis
- Flagger deployed in your cluster
Verification Commands
```bash
# Check NGINX ingress controller
kubectl get service --all-namespaces | grep nginx

# Verify HPA is enabled
kubectl get hpa --all-namespaces

# Confirm Flagger installation
kubectl get all -n flagger
```
Step 1: Installing Flagger
Flagger can be deployed using Helm or ArgoCD.
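If you prefer a declarative install, here is a hedged sketch of an ArgoCD Application pointing at the Flagger Helm chart; the chart version constraint, target namespace, and Prometheus address are assumptions to adapt to your cluster:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: flagger
  namespace: argocd
spec:
  project: default
  source:
    # Official Flagger Helm repository and chart
    repoURL: https://flagger.app
    chart: flagger
    targetRevision: 1.x          # assumed version constraint
    helm:
      values: |
        meshProvider: nginx
        metricsServer: http://prometheus:9090   # assumed Prometheus address
  destination:
    server: https://kubernetes.default.svc
    namespace: flagger-system
  syncPolicy:
    automated: {}
    syncOptions:
      - CreateNamespace=true
```

The same chart can also be installed directly with Helm (helm repo add flagger https://flagger.app, then helm upgrade -i flagger flagger/flagger).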
Once installed, Flagger creates several Custom Resource Definitions (CRDs):

```bash
kubectl get crds | grep flagger

# Expected output:
# alertproviders.flagger.app
# canaries.flagger.app
# metrictemplates.flagger.app
```
Step 2: Understanding Flagger's Architecture
When you deploy a canary with Flagger, it automatically creates and manages several Kubernetes objects:
Original Objects (You Provide)
```
deployment.apps/your-app
horizontalpodautoscaler.autoscaling/your-app
ingresses.extensions/your-app
canary.flagger.app/your-app
```
Generated Objects (Flagger Creates)
```
deployment.apps/your-app-primary
horizontalpodautoscaler.autoscaling/your-app-primary
service/your-app
service/your-app-canary
service/your-app-primary
ingresses.extensions/your-app-canary
```
Step 3: Creating Your First Canary Configuration
Here's a comprehensive canary configuration example:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
  namespace: production
spec:
  provider: nginx
  # Reference to your deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  # Reference to your ingress
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: my-app
  # Optional HPA reference
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: my-app
  # Maximum time for canary to make progress before rollback
  progressDeadlineSeconds: 600
  service:
    port: 80
    targetPort: 8080
    portDiscovery: true
  analysis:
    # Analysis runs every minute
    interval: 1m
    # Maximum failed checks before rollback
    threshold: 5
    # Maximum traffic percentage to canary
    maxWeight: 50
    # Traffic increment step
    stepWeight: 10
    # Metrics to monitor
    metrics:
      - name: "error-rate"
        templateRef:
          name: error-rate
        thresholdRange:
          max: 0.02   # 2% error rate threshold
        interval: 1m
      - name: "latency"
        templateRef:
          name: latency
        thresholdRange:
          max: 500    # 500ms latency threshold
        interval: 1m
    # Optional webhooks for testing
    webhooks:
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 15s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://my-app-canary:8080/"
```
Step 4: Setting Up Service Monitors
For Prometheus to collect metrics from both primary and canary services, you need to create separate ServiceMonitor resources:
```yaml
# Canary ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-canary
spec:
  endpoints:
    - port: metrics
      path: /metrics
      interval: 5s
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app-canary
---
# Primary ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-primary
spec:
  endpoints:
    - port: metrics
      path: /metrics
      interval: 5s
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app-primary
```

At this point, both the canary and primary targets should appear in Prometheus's service discovery.
Step 5: Creating Custom Metric Templates
Flagger uses MetricTemplate resources to define how metrics are calculated. Here's an example for error rate comparison:
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    sum(
      rate(
        http_requests_total{
          service="my-app-canary",
          status=~"5.*"
        }[1m]
      ) or on() vector(0)
    )
    /
    sum(
      rate(
        http_requests_total{
          service="my-app-canary"
        }[1m]
      )
    )
    -
    sum(
      rate(
        http_requests_total{
          service="my-app-primary",
          status=~"5.*"
        }[1m]
      ) or on() vector(0)
    )
    /
    sum(
      rate(
        http_requests_total{
          service="my-app-primary"
        }[1m]
      )
    )
```

This query calculates the difference in error rates between the canary and primary versions. The `or on() vector(0)` ensures the query returns 0 when no metrics are available instead of failing.
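The canary configuration above also references a latency template. Here is a hedged sketch of one possible definition, assuming the application exposes a Prometheus histogram named http_request_duration_seconds (substitute whatever your app actually exports); it reports the canary's P99 latency in milliseconds so it can be compared against the 500 ms threshold:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: latency
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    # P99 request duration for the canary service, converted to milliseconds
    histogram_quantile(0.99,
      sum(
        rate(
          http_request_duration_seconds_bucket{
            service="my-app-canary"
          }[1m]
        )
      ) by (le)
    ) * 1000
```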
Understanding the Canary Analysis Process
The Promotion Flow
When Flagger detects a new deployment, it follows this process:
- Initialization: Scale up canary deployment alongside primary
- Pre-rollout Checks: Execute pre-rollout webhooks
- Traffic Shifting: Gradually increase traffic to canary (10% → 20% → 30% → 40% → 50%)
- Metrics Analysis: Check error rates, latency, and custom metrics at each step
- Promotion Decision: If all checks pass, promote canary to primary
- Cleanup: Update the primary with the canary spec, then scale the canary down to zero
Rollback Scenarios
Flagger automatically rolls back when:
- Error rate exceeds threshold
- Latency exceeds threshold
- Custom metric checks fail
- Webhook tests fail
- Failed checks counter reaches threshold
Monitoring Canary Progress
```bash
# Watch all canaries in real-time
watch kubectl get canaries --all-namespaces

# Get detailed canary status
kubectl describe canary/my-app -n production

# View Flagger logs
kubectl logs -f deployment/flagger -n flagger-system
```
Advanced Features
Webhooks for Enhanced Testing
Flagger supports multiple webhook types for comprehensive testing:
```yaml
webhooks:
  # Manual approval before rollout
  - name: "confirm-rollout"
    type: confirm-rollout
    url: http://approval-service/gate/approve
  # Pre-deployment testing
  - name: "integration-test"
    type: pre-rollout
    url: http://test-service/
    timeout: 5m
    metadata:
      type: bash
      cmd: "run-integration-tests.sh"
  # Load testing during rollout
  - name: "load-test"
    type: rollout
    url: http://loadtester/
    metadata:
      cmd: "hey -z 2m -q 10 -c 5 http://my-app-canary/"
  # Manual promotion approval
  - name: "confirm-promotion"
    type: confirm-promotion
    url: http://approval-service/gate/approve
  # Post-deployment notifications
  - name: "slack-notification"
    type: post-rollout
    url: http://notification-service/slack
```
HPA Integration
When using HPA with canary deployments, Flagger pauses traffic increases while scaling operations are in progress:
```yaml
autoscalerRef:
  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  # Reference the HPA you created for the target deployment;
  # Flagger generates the -primary HPA itself
  name: my-app
  primaryScalerReplicas:
    minReplicas: 2
    maxReplicas: 10
```
Alerting and Notifications
Configure alerts to be notified of canary deployment status. Flagger delivers alerts through an AlertProvider resource (one of the CRDs created at install time), which the canary's analysis section references by name.
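A hedged sketch of a Slack provider, with a placeholder channel and webhook address:

```yaml
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: slack-alert
  namespace: flagger-system
spec:
  type: slack
  channel: deployments                      # placeholder channel
  username: flagger
  # Placeholder webhook URL; in production, reference a Secret via secretRef instead
  address: https://hooks.slack.com/services/REPLACE_ME
```

The canary's analysis.alerts block then points at this provider: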
```yaml
analysis:
  alerts:
    - name: "canary-status"
      severity: info
      providerRef:
        name: slack-alert
        namespace: flagger-system
```
Production Considerations
Traffic Requirements
For effective canary analysis, you need sufficient traffic to generate meaningful metrics. If your production traffic is low:
- Consider using load testing webhooks
- Implement synthetic traffic generation
- Adjust analysis intervals and thresholds accordingly
Metrics Selection
Choose metrics that accurately reflect your application's health:
- Error Rate: Monitor 5xx responses
- Latency: Track P95 or P99 response times
- Custom Business Metrics: Application-specific indicators
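As an illustration of the last point, here is a hypothetical MetricTemplate built on an application counter; the metric name orders_total and its status label are assumptions, so substitute whatever your application actually exports. It tracks the fraction of failed orders served by the canary:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: order-failure-rate
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    # Share of failed orders on the canary over the last minute
    sum(
      rate(orders_total{service="my-app-canary", status="failed"}[1m])
      or on() vector(0)
    )
    /
    sum(
      rate(orders_total{service="my-app-canary"}[1m])
    )
```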
Deployment Timing
Calculate your deployment duration:
```
Minimum time  = interval × (maxWeight / stepWeight)
Rollback time = interval × threshold
```

For example, with interval=1m, maxWeight=50%, stepWeight=10%, threshold=5:
- Minimum deployment time: 1m × (50/10) = 5 minutes
- Rollback time: 1m × 5 = 5 minutes
Troubleshooting Common Issues
Missing Metrics
Problem: Canary fails due to missing metrics
Solution: Verify ServiceMonitor selectors match service labels
Webhook Failures
Problem: Load testing webhooks time out
Solution: Increase webhook timeout and verify load tester accessibility
HPA Conflicts
Problem: Scaling issues during canary deployment
Solution: Ensure HPA references are correctly configured for both primary and canary
Network Policies
Problem: Traffic routing issues
Solution: Verify network policies allow communication between services
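As a hedged sketch, a NetworkPolicy like the following admits traffic from the ingress controller's namespace to the application pods; the namespace name and pod labels are assumptions to match to your cluster:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-nginx
  namespace: production
spec:
  # Pods this policy applies to (assumed label)
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Namespace label set automatically by Kubernetes >= 1.21
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
```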
Best Practices
- Start Small: Begin with low traffic percentages and gradual increases
- Monitor Actively: Set up comprehensive alerting for canary deployments
- Test Thoroughly: Use webhooks for automated testing at each stage
- Plan for Rollback: Ensure your rollback process is well-tested
- Document Everything: Maintain clear documentation of your canary processes
Conclusion
Flagger provides a robust, automated solution for implementing canary deployments in Kubernetes environments. By gradually shifting traffic while monitoring key metrics, it enables safe deployments with automatic rollback capabilities.
The combination of metrics-driven analysis, webhook integration, and seamless traffic management makes Flagger an excellent choice for teams looking to implement progressive delivery practices. Start with simple configurations and gradually add more sophisticated monitoring and testing as your confidence grows.
Remember that successful canary deployments depend not just on the tooling, but also on having appropriate metrics, sufficient traffic, and well-defined success criteria. With proper implementation, Flagger can significantly reduce deployment risks while maintaining the agility your development teams need.
