Azure DevOps CD Self-Reverting AKS Failed Deployment

Recently, I was working on a migration from Azure Function Apps to Azure Kubernetes Service. While building the CI/CD pipeline, I decided to try a self-healing approach: what if the pipeline could detect failures early and automatically recover from them?

(If you just need the full template, scroll to the end.)

How can problems be detected and reverted in a Kubernetes deployment?

We use a Deployment to manage workloads in Kubernetes, and the power of a Deployment is that we can check its status and roll it back.

After applying it, we can check the status of the deployment to detect problems.

If we just run this command:

kubectl rollout status deployment/some-deployment 

It will exit with 0 if the deployment is ready. But if the deployment is not ready and keeps failing, the command will hang, waiting for the rollout to complete. To escape this behaviour and exit with 1 when the deployment is not ready in time, we need to add a timeout argument:

kubectl rollout status deployment/some-deployment --timeout=1m 

This argument makes the status check fail if the rollout does not succeed within the given timeframe.

Reverting changes.

To revert changes in Kubernetes we can use this command:

kubectl rollout undo deployment/some-deployment 

Before: [screenshot of the deployment state before the rollback]

After: [screenshot of the deployment state after the rollback]
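Putting the two commands together, a minimal shell sketch of the detect-and-revert flow could look like this (the deployment name is a placeholder):

# Check the rollout; if it does not become ready in time, revert it
if ! kubectl rollout status deployment/some-deployment --timeout=1m; then
  kubectl rollout history deployment/some-deployment   # inspect the recorded revisions
  kubectl rollout undo deployment/some-deployment      # revert to the previous revision
fi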

Rollback Concept & Prerequisites.

The rollback mechanism in Kubernetes is designed to revert a deployment to a previous state if the current deployment is unhealthy or fails to meet the desired criteria.

In this case, you should be aware that any change that is not reversible can lead to a situation where a rollback will not help.

A good example of such a change is a database schema change. Suppose that during deployment you apply schema changes together with new code that supports them, but something goes wrong: the new code has an error and the app cannot start.

In this case, if you roll back to the previous version, the app will not work as intended because of the schema changes.

To work successfully with automatic rollbacks, your team should always apply the concept of rolling migrations.

Example Scenario.

Consider an application that requires a database schema update. The rolling migration process would involve the following steps:

  • Deploy Code Supporting New Schema (Disabled): First, deploy the new application code that supports the new schema, but keep the new features disabled.
  • Deploy Infrastructure Changes: Apply the database schema changes.
  • Health Check: Introduce health checks to validate that the application can connect to the updated database schema.
  • Enable New Features: Once the health checks pass, enable the new features in the application.
  • Monitor Deployment: Continuously monitor the deployment to ensure everything is functioning correctly.
  • Rollback if Necessary: If any issues are detected, rollback to the previous version of the application.

In this example the code is compatible with the previous schema, so even in the case of a rollback the app will work as intended.
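One way to keep new code compatible with both schemas is to ship it with the new behaviour gated behind a feature flag, for example an environment variable in the Deployment manifest. All names and values below are hypothetical:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-deployment
spec:
  selector:
    matchLabels:
      app: some-app
  template:
    metadata:
      labels:
        app: some-app
    spec:
      containers:
        - name: app
          image: myregistry.azurecr.io/app:2.0.0   # new code, still compatible with the old schema
          env:
            - name: NEW_SCHEMA_FEATURES_ENABLED    # flip to "true" only after the schema migration is verified
              value: "false"

If the rollout fails and the Deployment is reverted, the previous image keeps working because the schema change was never coupled to it.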

Better Deployment Status.

A deployment can get stuck and never complete for reasons such as the following:

  • Insufficient quota
  • Readiness probe failures
  • Image pull errors
  • Insufficient permissions
  • Limit ranges
  • Application runtime misconfiguration
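When the status check times out, commands like these help identify which of the reasons above applies (the deployment name is a placeholder):

kubectl describe deployment/some-deployment
kubectl get events --sort-by=.lastTimestamp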

Most of those problems are caused by a misconfigured deployment or cluster, but one of them, the readiness probe, can be customised to make our check even more robust.

If we point the readiness probe at our health check endpoint and extend that endpoint to validate more scenarios, such as whether the application can connect to external services like databases with the required permissions, we can significantly improve our deployment status check.
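As a sketch, the probe section of such a container spec could look like this (the path and port are assumptions):

readinessProbe:
  httpGet:
    path: /healthz/ready   # endpoint that also verifies connectivity to databases and other dependencies
    port: 8080
  initialDelaySeconds: 5   # give the app time to start before the first check
  periodSeconds: 10        # probe every 10 seconds
  failureThreshold: 3      # mark the pod not ready after 3 consecutive failures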

Template Itself.

An Azure DevOps YAML pipeline template that validates an AKS deployment and rolls it back if needed.

parameters:
  - name: environment
    type: string
    displayName: Required. Represents the stage of deployment. Usually it is Development, Test, Acceptance, Production. It is needed so you can have multiple stages in the same pipeline.
  - name: azureServiceConnection
    displayName: Required. Azure service connection that will provide access to ARM.
  - name: aksName
    displayName: Required. Name of the AKS.
  - name: aksRg
    displayName: Required. Name of the resource group where AKS is.
  - name: aksSubscription
    displayName: Required. Azure subscription ID where AKS is.
  - name: namespace
    displayName: Required. Namespace of the k8s deployment to check.
  - name: deploymentName
    displayName: Required. Name of the k8s deployment to check.
  - name: timeout
    displayName: Optional. Timeout for the deployment check. The default is 1m. Example 1m, 2m, 10m
    default: 1m
  - name: dependsOn
    type: object
    displayName: Optional. Pass list of previous stages to depend on them.
    default: []

stages:
  - stage: AksDeploymentHealthAndRollback${{ parameters.environment }}
    displayName: ${{ parameters.environment }} — K8S Deployment Health and Rollback
    dependsOn: ${{ parameters.dependsOn }}
    jobs:
      - job: AksDeploymentHealthCheck
        displayName: ${{ parameters.environment }} — K8S Deployment Health
        steps:
          - checkout: self
            displayName: Checkout
            fetchTags: false
          - task: AzureCLI@2
            displayName: K8S Deployment Health
            inputs:
              azureSubscription: ${{ parameters.azureServiceConnection }}
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -e # fail when script is failing
                sudo az aks install-cli
                az aks get-credentials --name ${{ parameters.aksName }} \
                  --resource-group ${{ parameters.aksRg }} \
                  --subscription ${{ parameters.aksSubscription }} \
                  --overwrite-existing \
                  --file .kubeconfig-${{ parameters.aksName }}
                export KUBECONFIG=$(pwd)/.kubeconfig-${{ parameters.aksName }}

                # Set default namespace
                kubectl config set-context ${{ parameters.aksName }} --namespace=${{ parameters.namespace }}
                kubectl config get-contexts

                # Pass kubeconfig to kubelogin to access k8s API
                kubelogin convert-kubeconfig -l azurecli

                # Check the rollout status of the deployment
                kubectl rollout status deployment/${{ parameters.deploymentName }} --timeout=${{ parameters.timeout }}

      - job: ManualApprovalOfRollBack
        displayName: ${{ parameters.environment }} — Manual Approval Of Rollback
        dependsOn: AksDeploymentHealthCheck
        condition: failed()
        pool: server
        steps:
          - task: ManualValidation@0
            displayName: Approve Rollback
            timeoutInMinutes: 1440 # task times out in 1 day

      - job: Rollback
        displayName: ${{ parameters.environment }} — K8S Deployment Rollback
        dependsOn: ManualApprovalOfRollBack
        condition: eq(dependencies.ManualApprovalOfRollBack.result, 'Succeeded')
        steps:
          - checkout: self
            displayName: Checkout
            fetchTags: false
          - task: AzureCLI@2
            displayName: K8S Rollback
            inputs:
              azureSubscription: ${{ parameters.azureServiceConnection }}
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -e # fail when script is failing
                sudo az aks install-cli
                az aks get-credentials --name ${{ parameters.aksName }} \
                  --resource-group ${{ parameters.aksRg }} \
                  --subscription ${{ parameters.aksSubscription }} \
                  --overwrite-existing \
                  --file .kubeconfig-${{ parameters.aksName }}
                export KUBECONFIG=$(pwd)/.kubeconfig-${{ parameters.aksName }}

                # Set default namespace
                kubectl config set-context ${{ parameters.aksName }} --namespace=${{ parameters.namespace }}
                kubectl config get-contexts

                # Pass kubeconfig to kubelogin to access k8s API
                kubelogin convert-kubeconfig -l azurecli

                # Rollback the deployment
                kubectl rollout undo deployment/${{ parameters.deploymentName }}

      - job: AksRollbackHealthCheck
        displayName: ${{ parameters.environment }} — K8S Rollback Health
        dependsOn: Rollback
        condition: eq(dependencies.Rollback.result, 'Succeeded')
        steps:
          - checkout: self
            displayName: Checkout
            fetchTags: false
          - task: AzureCLI@2
            displayName: K8S Rollback Health
            inputs:
              azureSubscription: ${{ parameters.azureServiceConnection }}
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -e # fail when script is failing
                sudo az aks install-cli
                az aks get-credentials --name ${{ parameters.aksName }} \
                  --resource-group ${{ parameters.aksRg }} \
                  --subscription ${{ parameters.aksSubscription }} \
                  --overwrite-existing \
                  --file .kubeconfig-${{ parameters.aksName }}
                export KUBECONFIG=$(pwd)/.kubeconfig-${{ parameters.aksName }}

                # Set default namespace
                kubectl config set-context ${{ parameters.aksName }} --namespace=${{ parameters.namespace }}
                kubectl config get-contexts

                # Pass kubeconfig to kubelogin to access k8s API
                kubelogin convert-kubeconfig -l azurecli

                # Check the rollout status of the deployment
                kubectl rollout status deployment/${{ parameters.deploymentName }} --timeout=${{ parameters.timeout }}

      - job: Clean
        displayName: Clean Up
        dependsOn: [ AksDeploymentHealthCheck, ManualApprovalOfRollBack, Rollback ]
        condition: always()
        steps:
          - checkout: none
          - script: |
              rm -rf ~/.kube/config
            displayName: Remove kube config
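For completeness, here is how the template could be consumed from a pipeline. The file path and all parameter values below are hypothetical:

stages:
  - template: templates/aks-deployment-health-rollback.yml
    parameters:
      environment: Development
      azureServiceConnection: my-arm-service-connection
      aksName: my-aks-cluster
      aksRg: my-aks-resource-group
      aksSubscription: 00000000-0000-0000-0000-000000000000
      namespace: my-namespace
      deploymentName: some-deployment
      timeout: 2m
      dependsOn: [ DeployDevelopment ]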

Template behaviour:

[pipeline run screenshots]
