
Container Service for Kubernetes: Container Service alert management

Last Updated: Sep 08, 2025

You can enable the Container Service alert management feature to centrally manage container alerts. This feature monitors for anomalous events in container services, key metrics for basic cluster resources, and metrics for core cluster components and applications. You can also use CustomResourceDefinitions (CRDs) to modify the default alert rules in your cluster. This helps you promptly detect cluster anomalies.

Billing

The alert feature uses data from Simple Log Service (SLS), Managed Service for Prometheus, and CloudMonitor. Additional fees are charged for notifications, such as text messages and phone calls, that are sent when an alert is triggered. Before you enable the alert feature, check the data source for each alert item in the default alert rule templates and activate the required services.

| Alert Source | Configuration Requirements | Billing Details |
| --- | --- | --- |
| Simple Log Service (SLS) | Enable event monitoring. Event monitoring is enabled by default when you enable the alert management feature. | Pay-by-feature billing |
| Managed Service for Prometheus | Configure Prometheus monitoring for your cluster. | Free of charge |
| CloudMonitor | Enable the CloudMonitor feature for the Container Service for Kubernetes cluster. | Pay-as-you-go |

Enable the alert management feature

After you enable the alert management feature, you can set metric-based alerts for specified resources in your cluster. You automatically receive alert notifications when anomalies occur. This helps you manage and maintain your cluster more efficiently and ensures service stability. For more information about resource alerts, see Default alert rule templates.

ACK managed clusters

You can enable alert configuration for an existing cluster or when you create a new cluster.

Enable the feature for an existing cluster

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Operations > Alerts.

  3. On the Alerting page, follow the on-screen instructions to install or upgrade the components.

  4. After the installation or upgrade is complete, go to the Alerting page and configure the alert information on the following tabs.


    Alert Rule Management

    • Status: Turn on or off the target alert rule set.

    • Edit Notification Object: Set the contact group for alert notifications.

    Before you configure this, create contacts and groups, and add the contacts to the groups. You can only select contact groups as notification objects. To notify a single person, create a group that contains only that contact and select the group.

    Alert History

    You can view the latest 100 alert records from the last 24 hours.

    • Click the link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration.

    • Click Troubleshoot to quickly locate the resource where the anomaly occurred (anomalous event or metric).

    • Click Intelligent Analysis to use the AI assistant to help analyze the issue and provide troubleshooting guidance.

    Contact Management

    Manage contacts. You can create, edit, or delete contacts.

    Contact Methods:

    • Phone/Text Message: After you set a mobile number for a contact, the contact can receive alert notifications by phone and text message.

      Only verified mobile numbers can be used to receive phone call notifications. For more information about how to verify a mobile number, see Verify a mobile number.
    • Email: After you set an email address for a contact, the contact can receive alert notifications by email.

    • Robot: DingTalk Robot, WeCom Robot, and Lark Robot.

      For DingTalk robots, you must add the security keywords: Alerting, Dispatch.
    Before you configure email and robot notifications, go to the CloudMonitor console and choose Alert Service > Alert Contacts to verify the contact methods and confirm that alert notifications can be received.

    Contact Group Management

    Manage contact groups. You can create, edit, or delete contact groups. Only contact groups can be selected when you click Edit Notification Object.

    If no contact group exists, the console creates a default contact group based on your Alibaba Cloud account information.

Enable the feature when creating a cluster

On the Component Configurations page of the cluster creation wizard, select Configure Alerts Using The Default Alert Template for Alerting and select an Alert Notification Contact Group. For more information, see Create an ACK managed cluster.


After you enable alert configurations during cluster creation, the system enables the default alert rules and sends alert notifications to the default alert contact group. You can also modify the alert contacts or alert contact groups.

ACK dedicated clusters

For ACK dedicated clusters, you must first grant permissions to the worker RAM role and then enable the default alert rules.

Grant permissions to the worker RAM role

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the target cluster and click its name. In the navigation pane on the left, click Cluster Information.

  3. On the Cluster Information page, in the Cluster Resources section, copy the name of the Worker RAM Role, and then click the link to open the Resource Access Management (RAM) console and grant permissions to the role.

    1. Create a custom policy. For more information, see Create a custom policy on the JSON tab.

      { "Action": [ "log:*", "arms:*", "cms:*", "cs:UpdateContactGroup" ], "Resource": [ "*" ], "Effect": "Allow" }
      Note: This topic grants broad permissions for simplicity. In a production environment, we recommend that you follow the principle of least privilege and grant only the required permissions.

    2. On the Roles page, find the worker RAM role and grant the custom policy to it. For more information, see Method 1: Grant permissions to a RAM role on the RAM role page.

  4. Check the logs to verify that the access permissions for the alert feature are configured. A kubectl alternative is sketched after the following steps.

    1. In the navigation pane on the left of the cluster management page, choose Workloads > Stateless.

    2. Set Namespace to kube-system and click the Name of the alicloud-monitor-controller application in the list of stateless applications.

    3. Click the Logs tab and confirm that the pod logs indicate that the authorization was successful.
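
    You can also check the controller logs from the command line. A minimal sketch, assuming kubectl access to the cluster and the alicloud-monitor-controller Deployment shown above:

      # Print recent logs of the alert controller in the kube-system namespace.
      kubectl -n kube-system logs deployment/alicloud-monitor-controller --tail=100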

Enable default alert rules

  1. In the navigation pane on the left of the cluster management page, choose Operations > Alerts.

  2. On the Alerting page, configure the alert information. The tabs and settings (Alert Rule Management, Alert History, Contact Management, and Contact Group Management) are the same as those described for ACK managed clusters in Enable the feature for an existing cluster.

Configure alert rules

After you enable the alert configuration feature, an AckAlertRule CustomResourceDefinition (CRD) resource is created in the kube-system namespace. This resource contains the default alert rule templates. You can modify this CRD resource to customize the default alert rules and configure container service alerts that meet your requirements.
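
To inspect the generated resource before you modify it, you can print it with kubectl. This is a sketch; the resource name (default) and namespace (kube-system) are those described above:

    kubectl get ackalertrules default -n kube-system -o yaml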

Console

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Operations > Alerts.

  3. On the Alert Rule Management tab, click Edit Alert Configuration in the upper-right corner. Then, click YAML in the Actions column of the target rule to view the AckAlertRule resource configuration for the current cluster.

  4. Modify the YAML file as needed. For more information, see Default alert rule templates.

    The following code shows a sample YAML configuration for an alert rule:

    Alert Rule Configuration YAML

    apiVersion: alert.alibabacloud.com/v1beta1
    kind: AckAlertRule
    metadata:
      name: default
    spec:
      groups:
        # Sample configuration for a cluster event alert rule.
        - name: pod-exceptions                      # The name of the alert rule group. Corresponds to the Group_Name field in the alert template.
          rules:
            - name: pod-oom                         # The name of the alert rule.
              type: event                           # The type of the alert rule (Rule_Type). Valid values: event (event type), metric-cms (CloudMonitor metric), and metric-prometheus (Prometheus metric).
              expression: sls.app.ack.pod.oom       # The alert rule expression. For event rules, use the Rule_Expression_Id value from the default alert rule templates in this topic.
              enable: enable                        # The status of the alert rule. Valid values: enable and disable.
            - name: pod-failed
              type: event
              expression: sls.app.ack.pod.failed
              enable: enable
        # Sample configuration for a basic cluster resource alert rule.
        - name: res-exceptions                      # The name of the alert rule group. Corresponds to the Group_Name field in the alert template.
          rules:
            - name: node_cpu_util_high              # The name of the alert rule.
              type: metric-cms                      # The type of the alert rule (Rule_Type).
              expression: cms.host.cpu.utilization  # The alert rule expression. For metric-cms rules, use the Rule_Expression_Id value from the default alert rule templates in this topic.
              contactGroups:                        # The contact group configuration for the alert rule. Generated by the ACK console. Contacts are account-scoped and can be reused across clusters.
              enable: enable                        # The status of the alert rule. Valid values: enable and disable.
              thresholds:                           # The alert rule thresholds.
                - key: CMS_ESCALATIONS_CRITICAL_Threshold
                  unit: percent
                  value: '85'                       # The CPU utilization threshold. Default: 85%.
                - key: CMS_ESCALATIONS_CRITICAL_Times
                  value: '3'                        # The alert is triggered if the threshold is exceeded three consecutive times.
                - key: CMS_RULE_SILENCE_SEC         # The silence period after the first alert is reported.
                  value: '900'

    You can use rules.thresholds to customize the alert thresholds. The following table describes the parameters, and an example follows the table. The preceding configuration triggers an alert notification if the CPU utilization of a cluster node exceeds 85% in three consecutive checks and the previous alert was sent more than 900 seconds earlier.

    | Parameter | Required | Description | Default Value |
    | --- | --- | --- | --- |
    | CMS_ESCALATIONS_CRITICAL_Threshold | Required | The alert threshold. unit is the unit (percent, count, or qps) and value is the threshold value. If this parameter is not configured, the rule fails to sync and is disabled. | Depends on the default alert template configuration. |
    | CMS_ESCALATIONS_CRITICAL_Times | Optional | The number of consecutive times the threshold must be exceeded before CloudMonitor triggers the alert. If not configured, the default value is used. | 3 |
    | CMS_RULE_SILENCE_SEC | Optional | The silence period, in seconds, after the first alert is reported when CloudMonitor continuously triggers the rule due to an anomaly. This prevents alert fatigue. If not configured, the default value is used. | 900 |
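
    For example, to make the node CPU rule less sensitive, you could adjust the thresholds block of the node_cpu_util_high rule as follows. This is a sketch; the values shown (90%, five checks, 1,800 seconds) are illustrative, not recommendations:

      thresholds:
        - key: CMS_ESCALATIONS_CRITICAL_Threshold
          unit: percent
          value: '90'    # Illustrative: raise the CPU utilization threshold from 85% to 90%.
        - key: CMS_ESCALATIONS_CRITICAL_Times
          value: '5'     # Illustrative: require five consecutive breaches before alerting.
        - key: CMS_RULE_SILENCE_SEC
          value: '1800'  # Illustrative: extend the silence period to 30 minutes.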

kubectl

  1. Run the following command to edit the YAML file of the alert rule.

    kubectl edit ackalertrules default -n kube-system
  2. Modify the YAML file as needed, and then save and exit. For more information, see Default alert rule templates.

    The YAML structure and the threshold parameters are identical to those in the Console section above. See the Alert Rule Configuration YAML example and the parameter table in that section.
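
    If you script rule changes instead of editing them interactively, a minimal export-edit-apply flow also works. This is a sketch that reuses the resource name (default) and namespace (kube-system) from the command above:

      # Export the current alert rules to a local file.
      kubectl get ackalertrules default -n kube-system -o yaml > ack-alert-rules.yaml
      # Edit thresholds or enable/disable rules in the file, then apply the changes.
      kubectl apply -f ack-alert-rules.yaml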

Default alert rule templates

Alert rules are synced from Simple Log Service (SLS), Managed Service for Prometheus, and CloudMonitor. On the Alerting page, you can view the configuration of each alert rule by clicking Advanced Settings in the Alert Management column.

Error event set

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Error event | This alert is triggered by all Error-level anomalous events in the cluster. | Simple Log Service | event | error-event | sls.app.ack.error |

Warn event set

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Warn event | This alert is triggered by key Warn-level anomalous events in the cluster, excluding some ignorable events. | Simple Log Service | event | warn-event | sls.app.ack.warn |

Alert rule set for core component anomalies in ACK managed clusters

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Anomalous availability of the cluster API server | This alert is triggered when the API server becomes unavailable, which may limit cluster management features. | Managed Service for Prometheus | metric-prometheus | apiserver-unhealthy | prom.apiserver.notHealthy.down |
| Anomalous availability of cluster etcd | The unavailability of etcd affects the status of the entire cluster. | Managed Service for Prometheus | metric-prometheus | etcd-unhealthy | prom.etcd.notHealthy.down |
| Anomalous availability of cluster kube-scheduler | The scheduler is responsible for pod scheduling. If the scheduler is unavailable, new pods may fail to start. | Managed Service for Prometheus | metric-prometheus | scheduler-unhealthy | prom.scheduler.notHealthy.down |
| Anomalous availability of cluster KCM | Anomalies in the control loop affect the automatic repair and resource adjustment mechanisms of the cluster. | Managed Service for Prometheus | metric-prometheus | kcm-unhealthy | prom.kcm.notHealthy.down |
| Anomalous availability of cluster cloud-controller-manager | Anomalies in the lifecycle management of external cloud service components may affect the dynamic adjustment of services. | Managed Service for Prometheus | metric-prometheus | ccm-unhealthy | prom.ccm.notHealthy.down |
| Anomalous availability of cluster CoreDNS: Request drops to zero | CoreDNS is the DNS service for the cluster. Anomalies affect service discovery and domain name resolution. | Managed Service for Prometheus | metric-prometheus | coredns-unhealthy-requestdown | prom.coredns.notHealthy.requestdown |
| Anomalous availability of cluster CoreDNS: Panic error | This alert is triggered when a panic error occurs in CoreDNS. You must immediately analyze the logs for diagnosis. | Managed Service for Prometheus | metric-prometheus | coredns-unhealthy-panic | prom.coredns.notHealthy.panic |
| High error rate for cluster Ingress requests | A high error rate for HTTPS requests processed by the Ingress controller may affect service accessibility. | Managed Service for Prometheus | metric-prometheus | ingress-err-request | prom.ingress.request.errorRateHigh |
| Cluster Ingress controller certificate is about to expire | An expired SSL certificate causes HTTPS requests to fail. You must update the certificate in advance. | Managed Service for Prometheus | metric-prometheus | ingress-ssl-expire | prom.ingress.ssl.expire |
| Number of pending pods > 1,000 | If too many pods in the cluster remain in the Pending state, it may indicate insufficient resources or an unreasonable scheduling policy. | Managed Service for Prometheus | metric-prometheus | pod-pending-accumulate | prom.pod.pending.accumulate |
| High RT for cluster API server mutating admission webhook | A slow response from the mutating admission webhook affects the efficiency of resource creation and modification. | Managed Service for Prometheus | metric-prometheus | apiserver-admit-rt-high | prom.apiserver.mutating.webhook.rt.high |
| High RT for cluster API server validating admission webhook | A slow response from the validating admission webhook may cause delays in configuration changes. | Managed Service for Prometheus | metric-prometheus | apiserver-validate-rt-high | prom.apiserver.validation.webhook.rt.high |
| OOM occurs in a control plane component | An out-of-memory (OOM) error occurs in a core cluster component. You need to investigate the anomaly in detail to prevent service downtime. | Simple Log Service | event | ack-controlplane-oom | sls.app.ack.controlplane.pod.oom |

Alert rule set for node pool O&M events

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Node auto-healing fails | If the node auto-healing process fails, you must immediately identify the cause and fix the issue to ensure high availability. | Simple Log Service | event | node-repair_failed | sls.app.ack.rc.node_repair_failed |
| Node CVE fix fails | If an important CVE fix fails, cluster security may be affected. You must urgently assess and fix the issue. | Simple Log Service | event | nodepool-cve-fix-failed | sls.app.ack.rc.node_vulnerability_fix_failed |
| Node pool CVE fix succeeds | Successfully applying a CVE fix reduces the security risks of known vulnerabilities. | Simple Log Service | event | nodepool-cve-fix-succ | sls.app.ack.rc.node_vulnerability_fix_succeed |
| Node pool CVE auto-fix is skipped | The auto-fix is skipped, possibly due to compatibility issues or specific configurations. You must check whether the security policy is reasonable. | Simple Log Service | event | nodepool-cve-fix-skip | sls.app.ack.rc.node_vulnerability_fix_skipped |
| kubelet parameter configuration for the node pool fails | The kubelet configuration fails to update, which may affect node performance and resource scheduling. | Simple Log Service | event | nodepool-kubelet-cfg-failed | sls.app.ack.rc.node_kubelet_config_failed |
| kubelet parameter configuration for the node pool succeeds | After the new kubelet configuration is successfully applied, confirm that it takes effect and meets expectations. | Simple Log Service | event | nodepool-kubelet-config-succ | sls.app.ack.rc.node_kubelet_config_succeed |
| kubelet upgrade for the node pool fails | This may affect cluster stability and functionality. You must confirm the upgrade process and configuration. | Simple Log Service | event | nodepool-k-c-upgrade-failed | sls.app.ack.rc.node_kubelet_config_upgrade_failed |
| kubelet upgrade for the node pool succeeds | After confirming the successful upgrade, ensure that the kubelet version meets cluster and application requirements. | Simple Log Service | event | nodepool-k-c-upgrade-succ | sls.app.ack.rc.kubelet_upgrade_succeed |
| Runtime upgrade for the node pool succeeds | The container runtime in the node pool is successfully upgraded. | Simple Log Service | event | nodepool-runtime-upgrade-succ | sls.app.ack.rc.runtime_upgrade_succeed |
| Runtime upgrade for the node pool fails | The container runtime in the node pool fails to upgrade. | Simple Log Service | event | nodepool-runtime-upgrade-fail | sls.app.ack.rc.runtime_upgrade_failed |
| OS image upgrade for the node pool succeeds | The operating system image in the node pool is successfully upgraded. | Simple Log Service | event | nodepool-os-upgrade-succ | sls.app.ack.rc.os_image_upgrade_succeed |
| OS image upgrade for the node pool fails | The operating system image in the node pool fails to upgrade. | Simple Log Service | event | nodepool-os-upgrade-failed | sls.app.ack.rc.os_image_upgrade_failed |
| Configuration change for a Lingjun node pool succeeds | The configuration of the Lingjun node pool is successfully changed. | Simple Log Service | event | nodepool-lingjun-config-succ | sls.app.ack.rc.lingjun_configuration_apply_succeed |
| Configuration change for a Lingjun node pool fails | The configuration of the Lingjun node pool fails to change. | Simple Log Service | event | nodepool-lingjun-cfg-failed | sls.app.ack.rc.lingjun_configuration_apply_failed |

Alert rule set for node anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Anomalous Docker process on a cluster node | The Dockerd or containerd runtime on a cluster node is abnormal. | Simple Log Service | event | docker-hang | sls.app.ack.docker.hang |
| Cluster eviction event | An eviction event occurs in the cluster. | Simple Log Service | event | eviction-event | sls.app.ack.eviction |
| GPU XID error event in the cluster | An anomalous GPU XID event occurs in the cluster. | Simple Log Service | event | gpu-xid-error | sls.app.ack.gpu.xid_error |
| Cluster node goes offline | A node in the cluster goes offline. | Simple Log Service | event | node-down | sls.app.ack.node.down |
| Cluster node restarts | A node in the cluster restarts. | Simple Log Service | event | node-restart | sls.app.ack.node.restart |
| Anomalous time service on a cluster node | The time synchronization system service on a cluster node is abnormal. | Simple Log Service | event | node-ntp-down | sls.app.ack.ntp.down |
| Anomalous PLEG on a cluster node | The PLEG on a cluster node is abnormal. | Simple Log Service | event | node-pleg-error | sls.app.ack.node.pleg_error |
| Anomalous process on a cluster node | The number of processes on a cluster node is abnormal. | Simple Log Service | event | ps-hang | sls.app.ack.ps.hang |
| Too many file handles on a cluster node | The number of file handles on the node is too large. | Simple Log Service | event | node-fd-pressure | sls.app.ack.node.fd_pressure |
| Too many processes on a cluster node | The number of processes on the cluster node is too large. | Simple Log Service | event | node-pid-pressure | sls.app.ack.node.pid_pressure |
| Failed to delete a node | An event indicating that the cluster failed to delete a node. | Simple Log Service | event | node-del-err | sls.app.ack.ccm.del_node_failed |
| Failed to add a node | An event indicating that the cluster failed to add a node. | Simple Log Service | event | node-add-err | sls.app.ack.ccm.add_node_failed |
| Command execution fails in a managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-run-cmd-err | sls.app.ack.nlc.run_command_fail |
| No specific command is provided for the task in the managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-empty-cmd | sls.app.ack.nlc.empty_task_cmd |
| Unimplemented task mode occurs in the managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-url-m-unimp | sls.app.ack.nlc.url_mode_unimpl |
| Unknown repair operation occurs in the managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-opt-no-found | sls.app.ack.nlc.op_not_found |
| Error occurs when destroying a node in the managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-des-node-err | sls.app.ack.nlc.destroy_node_fail |
| Failed to drain a node in a managed node pool | An anomalous draining event in a managed node pool of the cluster. | Simple Log Service | event | nlc-drain-node-err | sls.app.ack.nlc.drain_node_fail |
| Restarted ECS instance in a managed node pool does not reach the desired state | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-restart-ecs-wait | sls.app.ack.nlc.restart_ecs_wait_fail |
| Failed to restart an ECS instance in a managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-restart-ecs-err | sls.app.ack.nlc.restart_ecs_fail |
| Failed to reset an ECS instance in a managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-reset-ecs-err | sls.app.ack.nlc.reset_ecs_fail |
| Self-healing task fails in a managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-sel-repair-err | sls.app.ack.nlc.repair_fail |

Alert rule set for resource anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Cluster node CPU utilization ≥ 85% | The CPU utilization of a node instance in the cluster exceeds the threshold (default: 85%). If less than 15% of CPU remains, the CPU reservation of the container engine layer may be exceeded (see Node resource reservation policy), which can cause frequent CPU throttling and severely slow process responses. Optimize CPU usage or adjust the threshold promptly; see Configure alert rules. | CloudMonitor | metric-cms | node_cpu_util_high | cms.host.cpu.utilization |
| Cluster node memory utilization ≥ 85% | The memory usage of a node instance in the cluster exceeds the threshold (default: 85%). If less than 15% of memory remains and usage keeps growing, the memory reservation of the container engine layer will be exceeded (see Node resource reservation policy) and the kubelet will perform forced evictions. Optimize memory usage or adjust the threshold promptly; see Configure alert rules. | CloudMonitor | metric-cms | node_mem_util_high | cms.host.memory.utilization |
| Cluster node disk usage ≥ 85% | The disk usage of a node instance in the cluster exceeds the threshold (default: 85%). To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | node_disk_util_high | cms.host.disk.utilization |
| Cluster node outbound Internet bandwidth usage ≥ 85% | The outbound Internet bandwidth usage of a node instance in the cluster exceeds the threshold (default: 85%). To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | node_public_net_util_high | cms.host.public.network.utilization |
| Cluster node inode usage ≥ 85% | The inode usage of a node instance in the cluster exceeds the threshold (default: 85%). To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | node_fs_inode_util_high | cms.host.fs.inode.utilization |
| Cluster resource: Layer 7 QPS usage of SLB instance ≥ 85% | The queries per second (QPS) of an SLB instance in the cluster exceeds the threshold (default: 85%). The SLB instances are those associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_qps_util_high | cms.slb.qps.utilization |
| Cluster resource: Outbound bandwidth usage of SLB instance ≥ 85% | The outbound bandwidth usage of an SLB instance in the cluster exceeds the threshold (default: 85%). The SLB instances are those associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_traff_tx_util_high | cms.slb.traffic.tx.utilization |
| Cluster resource: Maximum connection usage of SLB instance ≥ 85% | The maximum connection usage of an SLB instance in the cluster exceeds the threshold (default: 85%). The SLB instances are those associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_max_con_util_high | cms.slb.max.connection.utilization |
| Cluster resource: Dropped connections per second for SLB listener ≥ 1 | The number of dropped connections per second for an SLB instance in the cluster continuously exceeds the threshold (default: 1). The SLB instances are those associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_drop_con_high | cms.slb.drop.connection |
| Insufficient disk space on a cluster node | An anomalous event indicating insufficient disk space on a node in the cluster. | Simple Log Service | event | node-disk-pressure | sls.app.ack.node.disk_pressure |
| Insufficient scheduling resources on a cluster node | An anomalous event indicating no available scheduling resources in the cluster. | Simple Log Service | event | node-res-insufficient | sls.app.ack.resource.insufficient |
| Insufficient IP resources on a cluster node | An anomalous event indicating insufficient IP resources in the cluster. | Simple Log Service | event | node-ip-pressure | sls.app.ack.ip.not_enough |
| Disk usage exceeds the threshold | An anomaly where the cluster disk usage exceeds the threshold. Check the cluster disk usage. | Simple Log Service | event | disk_space_press | sls.app.ack.csi.no_enough_disk_space |

Alert rule set for ACK control plane O&M notifications

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| ACK cluster task notification | Records and informs the control plane of relevant plans and changes. | Simple Log Service | event | ack-system-event-info | sls.app.ack.system_events.task.info |
| ACK cluster task failure notification | When a cluster operation fails, you must investigate the cause promptly. | Simple Log Service | event | ack-system-event-error | sls.app.ack.system_events.task.error |

Alert rule set for cluster auto scaling

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Auto scaling: Scale-out | Nodes are automatically scaled out to handle increased load requests. | Simple Log Service | event | autoscaler-scaleup | sls.app.ack.autoscaler.scaleup_group |
| Auto scaling: Scale-in | When the load decreases, nodes are automatically scaled in to save resources. | Simple Log Service | event | autoscaler-scaledown | sls.app.ack.autoscaler.scaledown |
| Auto scaling: Scale-out timeout | A scale-out timeout may indicate insufficient resources or an improper policy. | Simple Log Service | event | autoscaler-scaleup-timeout | sls.app.ack.autoscaler.scaleup_timeout |
| Auto scaling: Scale-in of empty nodes | Inactive nodes are identified and cleaned up to optimize resource usage. | Simple Log Service | event | autoscaler-scaledown-empty | sls.app.ack.autoscaler.scaledown_empty |
| Auto scaling: Scale-out fails | If a scale-out fails, you must immediately analyze the cause and adjust the resource policy. | Simple Log Service | event | autoscaler-up-group-failed | sls.app.ack.autoscaler.scaleup_group_failed |
| Auto scaling: Unhealthy cluster | An unhealthy cluster status due to auto scaling must be handled promptly. | Simple Log Service | event | autoscaler-cluster-unhealthy | sls.app.ack.autoscaler.cluster_unhealthy |
| Auto scaling: Deletion of nodes that fail to start for a long time | Invalid nodes are cleaned up to reclaim resources. | Simple Log Service | event | autoscaler-del-started | sls.app.ack.autoscaler.delete_started_timeout |
| Auto scaling: Deletion of unregistered nodes | Redundant nodes are processed to optimize cluster resources. | Simple Log Service | event | autoscaler-del-unregistered | sls.app.ack.autoscaler.delete_unregistered |
| Auto scaling: Scale-in fails | A scale-in failure may lead to resource waste and uneven load distribution. | Simple Log Service | event | autoscaler-scale-down-failed | sls.app.ack.autoscaler.scaledown_failed |
| Auto scaling: Node is deleted before it is drained | When an auto scaling operation deletes a node, the pods running on the node fail to be evicted or migrated. | Simple Log Service | event | autoscaler-instance-expired | sls.app.ack.autoscaler.instance_expired |

Alert rule set for application workload anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Job fails to run | This alert is triggered when a Job fails during execution. | Managed Service for Prometheus | metric-prometheus | job-failed | prom.job.failed |
| Anomalous status of available replicas in a deployment | This alert is triggered when the number of available replicas in a deployment is insufficient, which may cause the service to be unavailable or partially unavailable. | Managed Service for Prometheus | metric-prometheus | deployment-rep-err | prom.deployment.replicaError |
| Anomalous replica status in a DaemonSet | This alert is triggered when some replicas in a DaemonSet are in an abnormal state, such as failing to start or crashing. This affects the expected behavior or service of the nodes. | Managed Service for Prometheus | metric-prometheus | daemonset-status-err | prom.daemonset.scheduledError |
| Anomalous replica scheduling in a DaemonSet | This alert is triggered when a DaemonSet fails to correctly schedule some or all nodes, possibly due to resource constraints or an improper scheduling policy. | Managed Service for Prometheus | metric-prometheus | daemonset-misscheduled | prom.daemonset.misscheduled |

Alert rule set for container replica anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| OOM occurs in a container replica in the cluster | An out-of-memory (OOM) error occurs in a pod or a process within it. | Simple Log Service | event | pod-oom | sls.app.ack.pod.oom |
| Container replica in the cluster fails to start | An event indicating that a pod in the cluster fails to start. | Simple Log Service | event | pod-failed | sls.app.ack.pod.failed |
| Anomalous pod status | This alert is triggered when a pod is in an unhealthy state, such as Pending, Failed, or Unknown. | Managed Service for Prometheus | metric-prometheus | pod-status-err | prom.pod.status.notHealthy |
| Pod fails to start | This alert is triggered when a pod frequently fails to start and enters the CrashLoopBackOff state or another failed state. | Managed Service for Prometheus | metric-prometheus | pod-crashloop | prom.pod.status.crashLooping |

Alert rule set for storage anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Disk capacity is less than the 20 GiB limit | Due to a fixed limitation, you cannot attach a disk smaller than 20 GiB. Check the capacity of the attached disk. | Simple Log Service | event | csi_invalid_size | sls.app.ack.csi.invalid_disk_size |
| Subscription disks are not supported for container volumes | Due to a fixed limitation, you cannot attach a subscription disk. Check the billing method of the attached disk. | Simple Log Service | event | csi_not_portable | sls.app.ack.csi.disk_not_portable |
| Failed to unmount the mount target because it is being used by a process | The resource has not been fully released, or an active process is accessing the mount target. | Simple Log Service | event | csi_device_busy | sls.app.ack.csi.deivce_busy |
| No available disks | An anomaly where no available disks can be attached to the cluster storage. | Simple Log Service | event | csi_no_ava_disk | sls.app.ack.csi.no_ava_disk |
| Disk IOHang | An IOHang anomaly occurs in the cluster. | Simple Log Service | event | csi_disk_iohang | sls.app.ack.csi.disk_iohang |
| Slow I/O occurs on the PVC bound to the disk | A slow I/O anomaly occurs on the PVC bound to the cluster disk. | Simple Log Service | event | csi_latency_high | sls.app.ack.csi.latency_too_high |
| Anomalous PersistentVolume status | An anomaly occurs on a PV in the cluster. | Managed Service for Prometheus | metric-prometheus | pv-failed | prom.pv.failed |

Alert rule set for network anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Multiple route tables exist in the VPC | This may complicate network configuration or cause route conflicts. You need to optimize the network structure promptly. | Simple Log Service | event | ccm-vpc-multi-route-err | sls.app.ack.ccm.describe_route_tables_failed |
| No available SLB instances | An event indicating that the cluster cannot create an SLB instance. | Simple Log Service | event | slb-no-ava | sls.app.ack.ccm.no_ava_slb |
| Failed to sync the SLB instance | An event indicating that the cluster failed to sync the created SLB instance. | Simple Log Service | event | slb-sync-err | sls.app.ack.ccm.sync_slb_failed |
| Failed to delete the SLB instance | An event indicating that the cluster failed to delete the SLB instance. | Simple Log Service | event | slb-del-err | sls.app.ack.ccm.del_slb_failed |
| Failed to create a route | An event indicating that the cluster failed to create a VPC network route. | Simple Log Service | event | route-create-err | sls.app.ack.ccm.create_route_failed |
| Failed to sync a route | An event indicating that the cluster failed to sync a VPC network route. | Simple Log Service | event | route-sync-err | sls.app.ack.ccm.sync_route_failed |
| Invalid Terway resource | An anomalous event indicating an invalid Terway network resource in the cluster. | Simple Log Service | event | terway-invalid-res | sls.app.ack.terway.invalid_resource |
| Terway fails to assign an IP address | An anomalous event indicating that the Terway network resource in the cluster fails to assign an IP address. | Simple Log Service | event | terway-alloc-ip-err | sls.app.ack.terway.alloc_ip_fail |
| Failed to parse the Ingress bandwidth configuration | An anomalous event indicating a configuration parsing error for the cluster Ingress network. | Simple Log Service | event | terway-parse-err | sls.app.ack.terway.parse_fail |
| Terway fails to allocate network resources | An anomalous event indicating that the Terway network resource in the cluster fails to be allocated. | Simple Log Service | event | terway-alloc-res-err | sls.app.ack.terway.allocate_failure |
| Terway fails to reclaim network resources | An anomalous event indicating that the Terway network resource in the cluster fails to be reclaimed. | Simple Log Service | event | terway-dispose-err | sls.app.ack.terway.dispose_failure |
| Terway virtual mode changes | An event indicating a change in the virtual mode of the cluster Terway network. | Simple Log Service | event | terway-virt-mod-err | sls.app.ack.terway.virtual_mode_change |
| Terway triggers a pod IP configuration check | An event indicating that the cluster Terway network triggers a pod IP configuration check. | Simple Log Service | event | terway-ip-check | sls.app.ack.terway.config_check |
| Failed to reload the Ingress configuration | An anomalous event indicating that the cluster Ingress network configuration fails to reload. Check whether the Ingress configuration is correct. | Simple Log Service | event | ingress-reload-err | sls.app.ack.ingress.err_reload_nginx |

Alert rule set for important audit operations

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| A user logs on to a container or executes a command in the cluster | This may be a maintenance or anomalous activity. Audit operations can be used for tracking and security detection. | Simple Log Service | event | audit-at-command | sls.app.k8s.audit.at.command |
| The scheduling status of a cluster node changes | This affects service efficiency and resource load. You must promptly follow up on the intent of the change and verify the effect. | Simple Log Service | event | audit-cordon-switch | sls.app.k8s.audit.at.cordon.uncordon |
| A resource is deleted from the cluster | Resource deletion may be a planned or anomalous behavior. We recommend that you audit the operation to prevent risks. | Simple Log Service | event | audit-resource-delete | sls.app.k8s.audit.at.delete |
| A node is drained or an eviction occurs in the cluster | This reflects node load pressure or policy execution. You must confirm its necessity and impact. | Simple Log Service | event | audit-drain-eviction | sls.app.k8s.audit.at.drain.eviction |
| A user logs on to the cluster from the Internet | Logging on from the Internet may pose security risks. You must confirm the logon and access permission configurations. | Simple Log Service | event | audit-internet-login | sls.app.k8s.audit.at.internet.login |
| A node label is updated in the cluster | Label updates are used to distinguish and manage node resources. Correctness affects O&M efficiency. | Simple Log Service | event | audit-node-label-update | sls.app.k8s.audit.at.label |
| A node taint is updated in the cluster | Changes in node taint configuration affect scheduling policies and toleration mechanisms. You must correctly execute and review the configuration. | Simple Log Service | event | audit-node-taint-update | sls.app.k8s.audit.at.taint |
| A resource is modified in the cluster | Real-time modification of resource configurations may indicate adjustments to application policies. You must verify whether it aligns with business objectives. | Simple Log Service | event | audit-resource-update | sls.app.k8s.audit.at.update |

Alert rule set for security anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Security inspection finds a high-risk configuration | An event indicating that a cluster security inspection has found a high-risk configuration. | Simple Log Service | event | si-c-a-risk | sls.app.ack.si.config_audit_high_risk |

Alert rule set for cluster inspection anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Cluster inspection finds an anomaly | The automatic inspection mechanism captures a potential anomaly. You need to analyze the specific issue and daily maintenance policies. | Simple Log Service | event | cis-sched-failed | sls.app.ack.cis.schedule_task_failed |

Troubleshooting guide for alerts

Pod eviction triggered by node disk usage reaching the threshold

Alert message

(combined from similar events): Failed to garbage collect required amount of images. Attempted to free XXXX bytes, but only found 0 bytes eligible to free

Symptoms

The pod status is Evicted, and the node experiences disk pressure (The node had condition: [DiskPressure]).

Cause

When the node disk usage reaches the eviction threshold (the default is 85%), the kubelet performs pressure-based eviction and garbage collection to reclaim unused image files. This process causes the pod to be evicted. You can log on to the target node and run the df -h command to view the disk usage.
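
For example, the check mentioned above (run on the node; output varies by OS):

    # View the disk usage of each filesystem on the node.
    df -h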

Solution

  1. Log on to the target node (containerd runtime environment) and run the following command to delete unused container images and release disk space.

    crictl rmi --prune
  2. Clean up logs or resize the node disk.

  3. Adjust the relevant thresholds.

    • Adjust the kubelet image garbage collection threshold as needed to reduce pod evictions caused by high disk usage on nodes. For more information, see Customize kubelet configurations for a node pool.

    • When the node disk usage reaches or exceeds 85%, you receive an alert. You can modify the alert threshold of the node_disk_util_high alert rule in the YAML configuration based on your business needs, as sketched below. For more information, see Configure alert rules.
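
    A minimal sketch of that change in the AckAlertRule resource. The rule name and expression come from the resource anomaly table in this topic; the 90% value is only an example:

      - name: node_disk_util_high
        type: metric-cms
        expression: cms.host.disk.utilization
        enable: enable
        thresholds:
          - key: CMS_ESCALATIONS_CRITICAL_Threshold
            unit: percent
            value: '90'  # Example: raise the disk usage alert threshold from the default 85% to 90%.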

Recommendations and preventive measures

  • For nodes that frequently encounter this issue, we recommend that you assess the actual storage needs of your applications and properly plan resource requests and node disk capacity.

  • We recommend that you regularly monitor your storage usage to promptly identify and address potential threats. For more information, see the Node Storage Dashboard.

Pod OOMKilling

Alert message

pod was OOM killed. node:xxx pod:xxx namespace:xxx uuid:xxx

Symptoms

The pod status is abnormal, and the event details contain PodOOMKilling.

Solution

An Out of Memory (OOM) event can be triggered at the node level or the container cgroup level.

  • Causes:

    • Container cgroup-level OOM: The actual memory usage of a pod exceeds its memory limits. The kernel OOM killer then forcibly terminates a process in the pod based on the cgroup memory limit.

    • Node-level OOM: This usually occurs when too many pods without resource limits (requests/limits) are running on a node, or when some processes (which may not be managed by Kubernetes) consume a large amount of memory.

  • Method: Log on to the target node and run the dmesg -T | grep -i "memory" command (see the sketch after this list). If the output contains out_of_memory, an OOM event has occurred. If the log output also contains Memory cgroup, the event is a container cgroup-level OOM. Otherwise, the event is a node-level OOM.

  • Suggestions: For more information about the causes of OOM events and their solutions, see Causes and solutions for OOM Killer.
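
A minimal sketch of the check described above, run on the affected node:

    # Search the kernel log for OOM records.
    dmesg -T | grep -i "memory"
    # out_of_memory in the output confirms an OOM event; a "Memory cgroup" line
    # indicates a container cgroup-level OOM rather than a node-level OOM.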

Pod status is CrashLoopBackOff

When a process in a pod exits unexpectedly, ACK tries to restart the pod. If the pod fails to reach the desired state after multiple restarts, its status changes to CrashLoopBackOff. Follow these steps to troubleshoot in the console; a kubectl alternative is sketched after the steps:

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Workloads > Pods.

  3. Find the abnormal pod in the list and click Details in the Actions column.

  4. Check the Events of the pod and analyze the description of the abnormal event.

  5. View the Logs of the pod, which may record the cause of the abnormal process.

    Note

    If the pod has been restarted, select Show the log of the last container exit to view the logs of the previous container instance.

    The console displays a maximum of 500 of the most recent log entries. To view more historical logs, we recommend that you set up a log persistence solution for unified collection and storage.
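
If you prefer the command line, the same checks can be approximated with kubectl. This is a sketch; <pod-name> and <namespace> are placeholders:

    # Inspect the pod's events for the failure reason.
    kubectl describe pod <pod-name> -n <namespace>
    # View the logs of the last exited container.
    kubectl logs <pod-name> -n <namespace> --previous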