The Container Storage Interface (CSI) components include csi-plugin and csi-provisioner. These components allow you to dynamically create, attach, and detach volumes.
Component introduction
When you create a cluster, the following CSI components are installed by default:
| Component | Description | Deployment type |
| --- | --- | --- |
| csi-plugin | Mounts, unmounts, and formats volumes. | DaemonSet |
| csi-provisioner | Dynamically creates and expands volumes, and creates snapshots. By default, it supports the creation of Elastic Block Storage, NAS, and OSS volumes. | Deployment |
Note When you create a new cluster, the managed version of csi-provisioner is installed by default. Alibaba Cloud is responsible for the operations and maintenance (O&M) of managed components, so you cannot see the related pods in your cluster.
Upgrade csi-plugin and csi-provisioner
You can view the versions of the csi-plugin and csi-provisioner components and upgrade them in the console.
Important If you are using the csi-compatible-controller component to migrate FlexVolume to CSI and the migration is not complete, you cannot automatically upgrade the csi-plugin and csi-provisioner components. You must complete the migration before you upgrade the components. Alternatively, you can manually upgrade the CSI components during the migration. For more information, see Upgrade components.
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the one you want to manage and click its name. In the navigation pane on the left, click Add-ons.
Click the Storage tab. In the csi-plugin and csi-provisioner sections, check for available upgrades and then upgrade the components.
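If you prefer the command line, you can check the image version that is currently deployed. The following command assumes the default component name. Because the managed csi-provisioner pods are not visible in your cluster, this check applies only to csi-plugin:

kubectl get ds csi-plugin -n kube-system -o jsonpath='{.spec.template.spec.containers[*].image}'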
FAQ
Component issues
The CSI component fails to start, the image pull fails, and an exec /usr/bin/plugin.csi.alibabacloud.com: exec format error occurs
Symptom
The csi-plugin container in the pod of the csi-plugin component reports the following error: exec /usr/bin/plugin.csi.alibabacloud.com: exec format error.
Cause
By default, the csi-plugin image supports both the amd64 and arm64 architectures. On a node of either architecture, this error can occur when the image pull is incomplete: the image metadata exists on the node, but the image content was not fully downloaded, so the binary is invalid. An image pull can be interrupted if the node is forcibly shut down during the pull. You can check the ActionTrail logs of the ECS instance to determine whether a shutdown command was executed at that time.
Solution
Scale out the cluster by adding a new node, and then drain the current node.
If you do not want to add a new node, perform the following steps:
Drain the applications from the current node and then remove the node from the cluster.
Log on to the node and delete all containers, if any.
Delete all files in the /var/lib/containerd directory.
Add the node back to the cluster.
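The following is a minimal command-line sketch of these steps. The node name cn-hangzhou.192.168.0.1 is hypothetical, and the exact flags may need adjustment for your environment:

# Drain the applications from the node, then remove it from the cluster.
kubectl drain cn-hangzhou.192.168.0.1 --ignore-daemonsets --delete-emptydir-data
kubectl delete node cn-hangzhou.192.168.0.1
# On the node: remove all remaining containers, stop containerd, and clear its data directory.
crictl rm --force --all
systemctl stop containerd
rm -rf /var/lib/containerd/*
# Then add the node back to the cluster from the console.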
Out-of-memory (OOM) issues caused by storage components
The sidecar container in the pod of the csi-provisioner component caches information such as pods, persistent volumes (PVs), and PersistentVolumeClaims (PVCs). This can cause an out-of-memory (OOM) error as the cluster size increases.
If you use the managed version of the csi-provisioner component, submit a ticket for assistance.
If you use the self-managed version of the csi-provisioner component and an OOM error occurs, you must adjust the memory limit based on your cluster size. Perform the following steps:
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the one you want to manage and click its name. In the navigation pane on the left, click Add-ons.
On the Add-ons page, find the csi-provisioner component, click the more icon, and select View YAML.
Modify the component's YAML file to adjust the memory limit based on your cluster size.
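For example, you can raise the memory limit of the csi-provisioner container in the resources section of the YAML. The values below are illustrative; size them to your cluster:

# Fragment of the csi-provisioner container spec (illustrative values).
resources:
  limits:
    memory: 1Gi      # Increase this value as the cluster grows, for example from 512Mi to 1Gi.
  requests:
    memory: 128Mi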
High network traffic is observed for the csi-plugin pod
Symptom
Pod monitoring shows high network traffic for the csi-plugin pod.
Cause
The csi-plugin component is responsible for mounting NAS volumes on nodes, which generates network traffic. When a pod on a node uses a NAS volume, the NAS request traffic from that pod passes through the network namespace of the csi-plugin pod. The monitoring component attributes this traffic to the csi-plugin pod and therefore reports high network traffic for it.
Solution
No action is required. This traffic is only recorded by the monitoring component. It is not duplicated and does not consume extra network bandwidth.
The csi-provisioner component log shows a failed to renew lease xxx timed out waiting for the condition error
Symptom
When you run the kubectl logs csi-provisioner-xxxx -n kube-system command to view the CSI logs, you may find a failed to renew lease xxx timed out waiting for the condition error.
Cause
The csi-provisioner component is a high availability (HA) component that runs multiple replicas. The pods use the Kubernetes Lease object for leader election. During this process, a pod must access the API server to acquire a specific lease. The pod that acquires the lease becomes the leader and provides services to the cluster. This error occurs when a csi-provisioner pod fails to access the API server.
Solution
Check whether the cluster network and the API server are functioning normally. If the issue persists, submit a ticket for assistance.
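For a quick check, you can verify that the API server is reachable and inspect the leader-election lease. These are general diagnostic commands; the exact lease name may vary by component version:

# Check whether the API server health endpoint responds.
kubectl get --raw /healthz
# Inspect the lease objects used for leader election.
kubectl get lease -n kube-system | grep csi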
Component upgrade failures
The csi-plugin component pre-check fails
If your cluster does not use disk, NAS, or OSS volumes, or if the cluster is a test environment, you can manually update the image to upgrade the csi-plugin component.
The following is a sample command. Replace <image url> with the URL of the image for the new version. For more information, see csi-plugin.
kubectl set image -n kube-system daemonset/csi-plugin csi-plugin=<image url>
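After you update the image, you can watch the rollout to confirm that the DaemonSet pods restart successfully. This is a general verification step, not specific to ACK:

kubectl rollout status daemonset/csi-plugin -n kube-system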
If your cluster uses disk, NAS, or OSS volumes and contains critical business data, submit a ticket to request assistance with a manual upgrade.
The csi-plugin component pre-check passes, but the upgrade fails
The csi-plugin component is a DaemonSet. The upgrade fails if the cluster contains nodes in the NotReady state or csi-plugin pods in a state other than Running. You must manually fix the faulty nodes and then retry the upgrade.
If you cannot determine the cause, submit a ticket to request assistance with a manual upgrade.
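To locate faulty nodes and csi-plugin pods that may be blocking the upgrade, you can run checks such as the following. The app=csi-plugin label is an assumption based on the default deployment:

# List nodes and check for any that are not in the Ready state.
kubectl get nodes
# List csi-plugin pods that are not in the Running state.
kubectl get pods -n kube-system -l app=csi-plugin -o wide | grep -v Running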
The console shows the csi-plugin component but not the csi-provisioner component
Earlier versions of csi-provisioner (1.14 and earlier) were deployed as a StatefulSet. If a csi-provisioner StatefulSet exists in your cluster, run the kubectl delete sts csi-provisioner -n kube-system command to delete it. Then, reinstall the csi-provisioner component.
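To confirm whether the legacy StatefulSet exists before you delete it, run:

kubectl get sts csi-provisioner -n kube-system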
If an error occurs, submit a ticket to request assistance with a manual upgrade.
The csi-provisioner component pre-check fails
If your cluster does not use dynamically provisioned volumes of the disk, NAS, or OSS type that were created using a StorageClass, or if the cluster is a test environment, you can manually update the image to upgrade the csi-provisioner component.
The following is a sample command. Replace <image url> with the URL of the image for the new version. For more information, see csi-provisioner.
kubectl set image -n kube-system deployment/csi-provisioner csi-provisioner=<image url>
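As with csi-plugin, you can watch the rollout after updating the image to confirm that the Deployment pods restart successfully:

kubectl rollout status deployment/csi-provisioner -n kube-system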
If your cluster uses dynamically provisioned volumes of the disk, NAS, or OSS type that were created using a StorageClass and the cluster contains critical business data, submit a ticket to request assistance with a manual upgrade.
The csi-provisioner component pre-check passes, but the upgrade fails
Submit a ticket to request assistance with a manual upgrade.
The csi-provisioner component upgrade fails because the number of cluster nodes does not meet requirements
Symptoms
Symptom 1: The csi-provisioner component pre-check fails with an error indicating that the number of cluster nodes does not meet the requirements.
Symptom 2: The csi-provisioner component pre-check and upgrade succeed, but the component's pod enters the CrashLoopBackOff state. The log shows an error similar to the following 403 Forbidden message:
time="2023-08-05T13:54:00+08:00" level=info msg="Use node id : <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n <head>\n <title>403 - Forbidden</title>\n </head>\n <body>\n <h1>403 - Forbidden</h1>\n </body>\n</html>\n"
Causes
Cause of Symptom 1: To ensure high availability, csi-provisioner runs primary and secondary pods that must be deployed on different nodes. The component upgrade fails if the cluster has only one node.
Cause of Symptom 2: Security hardening is enabled on the node where the csi-provisioner pod is running. This mode blocks access to the metadata server on the node, which causes the error.
Solutions
Solution to Symptom 1: Add at least one more node to the cluster so that the primary and secondary csi-provisioner pods can be scheduled on different nodes, and then retry the upgrade.
Solution to Symptom 2: Allow access to the metadata server on the node where the csi-provisioner pod runs, for example by adjusting the security hardening settings, or schedule the pod to a node that permits metadata access. Then retry the upgrade.
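To verify from the node whether the metadata server is reachable, you can run a check such as the following. 100.100.100.200 is the metadata server address on Alibaba Cloud ECS; a 403 response indicates that access is blocked:

curl http://100.100.100.200/latest/meta-data/instance-id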
The csi-provisioner component upgrade fails due to changes in StorageClass properties
Symptom
The csi-provisioner component pre-check fails with an error indicating that the StorageClass properties do not meet expectations.
Cause
The properties of the default StorageClass were changed. This can happen if you delete and then recreate a StorageClass with the same name. StorageClass properties are immutable. If you change them, the component upgrade fails.
Solution
You must delete the default StorageClasses in the cluster, such as alicloud-disk-essd, alicloud-disk-available, alicloud-disk-efficiency, alicloud-disk-ssd, and alicloud-disk-topology. This deletion does not affect existing applications. After you delete the StorageClasses, try to reinstall the csi-provisioner component. After the installation is complete, the system automatically recreates the StorageClasses. No further action is required.
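For example, you can list and delete the default StorageClasses from the command line. The names below are the defaults listed above; verify them with kubectl get sc before you delete:

kubectl get sc
kubectl delete sc alicloud-disk-essd alicloud-disk-available alicloud-disk-efficiency alicloud-disk-ssd alicloud-disk-topology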
Important If you need a custom StorageClass, create a new one with a different name. Do not modify the default StorageClasses.