
Container Service for Kubernetes: Container Service alert management

Last Updated: Sep 08, 2025

You can enable the Container Service alert management feature to centrally manage container alerts. This feature monitors for anomalous events in container services, key metrics for basic cluster resources, and metrics for core cluster components and applications. You can also use CustomResourceDefinitions (CRDs) to modify the default alert rules in your cluster. This helps you promptly detect cluster anomalies.

Billing

The alert feature uses data from Simple Log Service (SLS), Managed Service for Prometheus, and CloudMonitor. Additional fees are charged for notifications, such as text messages and phone calls, that are sent when an alert is triggered. Before you enable the alert feature, check the data source for each alert item in the default alert rule templates and activate the required services.

| Alert Source | Configuration Requirements | Billing Details |
| --- | --- | --- |
| Simple Log Service (SLS) | Enable event monitoring. Event monitoring is enabled by default when you enable the alert management feature. | Pay-by-feature billing |
| Managed Service for Prometheus | Configure Prometheus monitoring for your cluster. | Free of charge |
| CloudMonitor | Enable the CloudMonitor feature for the Container Service for Kubernetes cluster. | Pay-as-you-go |

Enable the alert management feature

After you enable the alert management feature, you can set metric-based alerts for specified resources in your cluster. You automatically receive alert notifications when anomalies occur. This helps you manage and maintain your cluster more efficiently and ensures service stability. For more information about resource alerts, see Default alert rule templates.

ACK managed clusters

You can enable alert configuration for an existing cluster or when you create a new cluster.

Enable the feature for an existing cluster

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Operations > Alerts.

  3. On the Alerting page, follow the on-screen instructions to install or upgrade the components.

  4. After the installation or upgrade is complete, go to the Alerting page and configure the alert information on the following tabs.


    Alert Rule Management

    • Status: Turn on or off the target alert rule set.

    • Edit Notification Object: Set the contact group for alert notifications.

    Before you configure this, create contacts and groups, and add the contacts to the groups. You can only select contact groups as notification objects. To notify a single person, create a group that contains only that contact and select the group.

    Alert History

    You can view the latest 100 alert records from the last 24 hours.

    • Click the link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration.

    • Click Troubleshoot to quickly locate the resource where the anomaly occurred (anomalous event or metric).

    • Click Intelligent Analysis to use the AI assistant to help analyze the issue and provide troubleshooting guidance.

    Contact Management

    Manage contacts. You can create, edit, or delete contacts.

    Contact Methods:

    • Phone/Text Message: After you set a mobile number for a contact, the contact can receive alert notifications by phone and text message.

      Only verified mobile numbers can be used to receive phone call notifications. For more information about how to verify a mobile number, see Verify a mobile number.
    • Email: After you set an email address for a contact, the contact can receive alert notifications by email.

    • Robot: DingTalk Robot, WeCom Robot, and Lark Robot.

      For DingTalk robots, you must add the security keywords: Alerting, Dispatch.
    Before you configure email and robot notifications, go to the CloudMonitor console and choose Alert Service > Alert Contacts to verify the contact methods and confirm that alert notifications can be received.

    Contact Group Management

    Manage contact groups. You can create, edit, or delete contact groups. Only contact groups can be selected when you click Edit Notification Object.

    If no contact group exists, the console creates a default contact group based on your Alibaba Cloud account information.

Enable the feature when creating a cluster

On the Component Configurations page of the cluster creation wizard, select Configure Alerts Using The Default Alert Template for Alerting and select an Alert Notification Contact Group. For more information, see Create an ACK managed cluster.


After you enable alert configurations during cluster creation, the system enables the default alert rules and sends alert notifications to the default alert contact group. You can also modify the alert contacts or alert contact groups.

ACK dedicated clusters

For ACK dedicated clusters, you must first grant permissions to the worker RAM role and then enable the default alert rules.

Grant permissions to the worker RAM role

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the target cluster and click its name. In the navigation pane on the left, click Cluster Information.

  3. On the Cluster Information page, in the Cluster Resources section, copy the name of the Worker RAM Role, and then click the link to open the Resource Access Management (RAM) console and grant permissions to the role.

    1. Create a custom policy. For more information, see Create a custom policy on the JSON tab.

      { "Action": [ "log:*", "arms:*", "cms:*", "cs:UpdateContactGroup" ], "Resource": [ "*" ], "Effect": "Allow" }
      Note: This topic grants broad permissions for simplicity. In a production environment, we recommend that you follow the principle of least privilege and grant only the required permissions.

    2. On the Roles page, find the worker RAM role and grant the custom policy to it. For more information, see Method 1: Grant permissions to a RAM role on the RAM role page.

  4. Check the logs to verify that the access permissions for the alert feature are configured. A kubectl alternative is sketched after the following steps.

    1. In the navigation pane on the left of the cluster management page, choose Workloads > Stateless.

    2. Set Namespace to kube-system and click the Name of the alicloud-monitor-controller application in the list of stateless applications.

    3. Click the Logs tab and confirm that the pod logs indicate that the authorization was successful.
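
    You can also check the controller logs from the command line. A minimal sketch, assuming kubectl access to the cluster and the alicloud-monitor-controller Deployment shown above:

      # Print recent logs of the alert controller in the kube-system namespace.
      kubectl -n kube-system logs deployment/alicloud-monitor-controller --tail=100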

Enable default alert rules

  1. In the navigation pane on the left of the cluster management page, choose Operations > Alerts.

  2. On the Alerting page, configure the alert information. The tabs and settings (Alert Rule Management, Alert History, Contact Management, and Contact Group Management) are the same as those described for ACK managed clusters in Enable the feature for an existing cluster.

Configure alert rules

After you enable the alert configuration feature, an AckAlertRule CustomResourceDefinition (CRD) resource is created in the kube-system namespace. This resource contains the default alert rule templates. You can modify this CRD resource to customize the default alert rules and configure container service alerts that meet your requirements.
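
To inspect the generated resource before you modify it, you can print it with kubectl. This is a sketch; the resource name (default) and namespace (kube-system) are those described above:

    kubectl get ackalertrules default -n kube-system -o yaml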

Console

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Operations > Alerts.

  3. On the Alert Rule Management tab, click Edit Alert Configuration in the upper-right corner. Then, click YAML in the Actions column of the target rule to view the AckAlertRule resource configuration for the current cluster.

  4. Modify the YAML file as needed. For more information, see Default alert rule templates.

    The following code shows a sample YAML configuration for an alert rule:

    Alert Rule Configuration YAML

    apiVersion: alert.alibabacloud.com/v1beta1
    kind: AckAlertRule
    metadata:
      name: default
    spec:
      groups:
        # Sample configuration for a cluster event alert rule.
        - name: pod-exceptions                      # The name of the alert rule group. Corresponds to the Group_Name field in the alert template.
          rules:
            - name: pod-oom                         # The name of the alert rule.
              type: event                           # The type of the alert rule (Rule_Type). Valid values: event (event type), metric-cms (CloudMonitor metric), and metric-prometheus (Prometheus metric).
              expression: sls.app.ack.pod.oom       # The alert rule expression. For event rules, use the Rule_Expression_Id value from the default alert rule templates in this topic.
              enable: enable                        # The status of the alert rule. Valid values: enable and disable.
            - name: pod-failed
              type: event
              expression: sls.app.ack.pod.failed
              enable: enable
        # Sample configuration for a basic cluster resource alert rule.
        - name: res-exceptions                      # The name of the alert rule group. Corresponds to the Group_Name field in the alert template.
          rules:
            - name: node_cpu_util_high              # The name of the alert rule.
              type: metric-cms                      # The type of the alert rule (Rule_Type).
              expression: cms.host.cpu.utilization  # The alert rule expression. For metric-cms rules, use the Rule_Expression_Id value from the default alert rule templates in this topic.
              contactGroups:                        # The contact group configuration for the alert rule. Generated by the ACK console. Contacts are account-scoped and can be reused across clusters.
              enable: enable                        # The status of the alert rule. Valid values: enable and disable.
              thresholds:                           # The alert rule thresholds.
                - key: CMS_ESCALATIONS_CRITICAL_Threshold
                  unit: percent
                  value: '85'                       # The CPU utilization threshold. Default: 85%.
                - key: CMS_ESCALATIONS_CRITICAL_Times
                  value: '3'                        # The alert is triggered if the threshold is exceeded three consecutive times.
                - key: CMS_RULE_SILENCE_SEC         # The silence period after the first alert is reported.
                  value: '900'

    You can use rules.thresholds to customize the alert thresholds. The following table describes the parameters, and an example follows the table. The preceding configuration triggers an alert notification if the CPU utilization of a cluster node exceeds 85% in three consecutive checks and the previous alert was sent more than 900 seconds earlier.

    | Parameter | Required | Description | Default Value |
    | --- | --- | --- | --- |
    | CMS_ESCALATIONS_CRITICAL_Threshold | Required | The alert threshold. unit is the unit (percent, count, or qps) and value is the threshold value. If this parameter is not configured, the rule fails to sync and is disabled. | Depends on the default alert template configuration. |
    | CMS_ESCALATIONS_CRITICAL_Times | Optional | The number of consecutive times the threshold must be exceeded before CloudMonitor triggers the alert. If not configured, the default value is used. | 3 |
    | CMS_RULE_SILENCE_SEC | Optional | The silence period, in seconds, after the first alert is reported when CloudMonitor continuously triggers the rule due to an anomaly. This prevents alert fatigue. If not configured, the default value is used. | 900 |
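
    For example, to make the node CPU rule less sensitive, you could adjust the thresholds block of the node_cpu_util_high rule as follows. This is a sketch; the values shown (90%, five checks, 1,800 seconds) are illustrative, not recommendations:

      thresholds:
        - key: CMS_ESCALATIONS_CRITICAL_Threshold
          unit: percent
          value: '90'    # Illustrative: raise the CPU utilization threshold from 85% to 90%.
        - key: CMS_ESCALATIONS_CRITICAL_Times
          value: '5'     # Illustrative: require five consecutive breaches before alerting.
        - key: CMS_RULE_SILENCE_SEC
          value: '1800'  # Illustrative: extend the silence period to 30 minutes.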

kubectl

  1. Run the following command to edit the YAML file of the alert rule.

    kubectl edit ackalertrules default -n kube-system
  2. Modify the YAML file as needed, and then save and exit. For more information, see Default alert rule templates.

    The YAML structure and the threshold parameters are identical to those in the Console section above. See the Alert Rule Configuration YAML example and the parameter table in that section.
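
    If you script rule changes instead of editing them interactively, a minimal export-edit-apply flow also works. This is a sketch that reuses the resource name (default) and namespace (kube-system) from the command above:

      # Export the current alert rules to a local file.
      kubectl get ackalertrules default -n kube-system -o yaml > ack-alert-rules.yaml
      # Edit thresholds or enable/disable rules in the file, then apply the changes.
      kubectl apply -f ack-alert-rules.yaml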

Default alert rule templates

Alert rules are synced from Simple Log Service (SLS), Managed Service for Prometheus, and CloudMonitor. On the Alerting page, you can view the configuration of each alert rule by clicking Advanced Settings in the Alert Management column.

Error event set

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Error event | This alert is triggered by all Error-level anomalous events in the cluster. | Simple Log Service | event | error-event | sls.app.ack.error |

Warn event set

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Warn event | This alert is triggered by key Warn-level anomalous events in the cluster, excluding some ignorable events. | Simple Log Service | event | warn-event | sls.app.ack.warn |

Alert rule set for core component anomalies in ACK managed clusters

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Anomalous availability of the cluster API server | This alert is triggered when the API server becomes unavailable, which may limit cluster management features. | Managed Service for Prometheus | metric-prometheus | apiserver-unhealthy | prom.apiserver.notHealthy.down |
| Anomalous availability of cluster etcd | The unavailability of etcd affects the status of the entire cluster. | Managed Service for Prometheus | metric-prometheus | etcd-unhealthy | prom.etcd.notHealthy.down |
| Anomalous availability of cluster kube-scheduler | The scheduler is responsible for pod scheduling. If the scheduler is unavailable, new pods may fail to start. | Managed Service for Prometheus | metric-prometheus | scheduler-unhealthy | prom.scheduler.notHealthy.down |
| Anomalous availability of cluster KCM | Anomalies in the control loop affect the automatic repair and resource adjustment mechanisms of the cluster. | Managed Service for Prometheus | metric-prometheus | kcm-unhealthy | prom.kcm.notHealthy.down |
| Anomalous availability of cluster cloud-controller-manager | Anomalies in the lifecycle management of external cloud service components may affect the dynamic adjustment of services. | Managed Service for Prometheus | metric-prometheus | ccm-unhealthy | prom.ccm.notHealthy.down |
| Anomalous availability of cluster CoreDNS: Request drops to zero | CoreDNS is the DNS service for the cluster. Anomalies affect service discovery and domain name resolution. | Managed Service for Prometheus | metric-prometheus | coredns-unhealthy-requestdown | prom.coredns.notHealthy.requestdown |
| Anomalous availability of cluster CoreDNS: Panic error | This alert is triggered when a panic error occurs in CoreDNS. You must immediately analyze the logs for diagnosis. | Managed Service for Prometheus | metric-prometheus | coredns-unhealthy-panic | prom.coredns.notHealthy.panic |
| High error rate for cluster Ingress requests | A high error rate for HTTPS requests processed by the Ingress controller may affect service accessibility. | Managed Service for Prometheus | metric-prometheus | ingress-err-request | prom.ingress.request.errorRateHigh |
| Cluster Ingress controller certificate is about to expire | An expired SSL certificate causes HTTPS requests to fail. You must update the certificate in advance. | Managed Service for Prometheus | metric-prometheus | ingress-ssl-expire | prom.ingress.ssl.expire |
| Number of pending pods > 1,000 | If too many pods in the cluster remain in the Pending state, it may indicate insufficient resources or an unreasonable scheduling policy. | Managed Service for Prometheus | metric-prometheus | pod-pending-accumulate | prom.pod.pending.accumulate |
| High RT for cluster API server mutating admission webhook | A slow response from the mutating admission webhook affects the efficiency of resource creation and modification. | Managed Service for Prometheus | metric-prometheus | apiserver-admit-rt-high | prom.apiserver.mutating.webhook.rt.high |
| High RT for cluster API server validating admission webhook | A slow response from the validating admission webhook may cause delays in configuration changes. | Managed Service for Prometheus | metric-prometheus | apiserver-validate-rt-high | prom.apiserver.validation.webhook.rt.high |
| OOM occurs in a control plane component | An out-of-memory (OOM) error occurs in a core cluster component. You need to investigate the anomaly in detail to prevent service downtime. | Simple Log Service | event | ack-controlplane-oom | sls.app.ack.controlplane.pod.oom |

Alert rule set for node pool O&M events

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Node auto-healing fails | If the node auto-healing process fails, you must immediately identify the cause and fix the issue to ensure high availability. | Simple Log Service | event | node-repair_failed | sls.app.ack.rc.node_repair_failed |
| Node CVE fix fails | If an important CVE fix fails, cluster security may be affected. You must urgently assess and fix the issue. | Simple Log Service | event | nodepool-cve-fix-failed | sls.app.ack.rc.node_vulnerability_fix_failed |
| Node pool CVE fix succeeds | Successfully applying a CVE fix reduces the security risks of known vulnerabilities. | Simple Log Service | event | nodepool-cve-fix-succ | sls.app.ack.rc.node_vulnerability_fix_succeed |
| Node pool CVE auto-fix is skipped | The auto-fix is skipped, possibly due to compatibility issues or specific configurations. You must check whether the security policy is reasonable. | Simple Log Service | event | nodepool-cve-fix-skip | sls.app.ack.rc.node_vulnerability_fix_skipped |
| kubelet parameter configuration for the node pool fails | The kubelet configuration fails to update, which may affect node performance and resource scheduling. | Simple Log Service | event | nodepool-kubelet-cfg-failed | sls.app.ack.rc.node_kubelet_config_failed |
| kubelet parameter configuration for the node pool succeeds | After the new kubelet configuration is successfully applied, confirm that it takes effect and meets expectations. | Simple Log Service | event | nodepool-kubelet-config-succ | sls.app.ack.rc.node_kubelet_config_succeed |
| kubelet upgrade for the node pool fails | This may affect cluster stability and functionality. You must confirm the upgrade process and configuration. | Simple Log Service | event | nodepool-k-c-upgrade-failed | sls.app.ack.rc.node_kubelet_config_upgrade_failed |
| kubelet upgrade for the node pool succeeds | After confirming the successful upgrade, ensure that the kubelet version meets cluster and application requirements. | Simple Log Service | event | nodepool-k-c-upgrade-succ | sls.app.ack.rc.kubelet_upgrade_succeed |
| Runtime upgrade for the node pool succeeds | The container runtime in the node pool is successfully upgraded. | Simple Log Service | event | nodepool-runtime-upgrade-succ | sls.app.ack.rc.runtime_upgrade_succeed |
| Runtime upgrade for the node pool fails | The container runtime in the node pool fails to upgrade. | Simple Log Service | event | nodepool-runtime-upgrade-fail | sls.app.ack.rc.runtime_upgrade_failed |
| OS image upgrade for the node pool succeeds | The operating system image in the node pool is successfully upgraded. | Simple Log Service | event | nodepool-os-upgrade-succ | sls.app.ack.rc.os_image_upgrade_succeed |
| OS image upgrade for the node pool fails | The operating system image in the node pool fails to upgrade. | Simple Log Service | event | nodepool-os-upgrade-failed | sls.app.ack.rc.os_image_upgrade_failed |
| Configuration change for a Lingjun node pool succeeds | The configuration of the Lingjun node pool is successfully changed. | Simple Log Service | event | nodepool-lingjun-config-succ | sls.app.ack.rc.lingjun_configuration_apply_succeed |
| Configuration change for a Lingjun node pool fails | The configuration of the Lingjun node pool fails to change. | Simple Log Service | event | nodepool-lingjun-cfg-failed | sls.app.ack.rc.lingjun_configuration_apply_failed |

Alert rule set for node anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Anomalous Docker process on a cluster node | The Dockerd or containerd runtime on a cluster node is abnormal. | Simple Log Service | event | docker-hang | sls.app.ack.docker.hang |
| Cluster eviction event | An eviction event occurs in the cluster. | Simple Log Service | event | eviction-event | sls.app.ack.eviction |
| GPU XID error event in the cluster | An anomalous GPU XID event occurs in the cluster. | Simple Log Service | event | gpu-xid-error | sls.app.ack.gpu.xid_error |
| Cluster node goes offline | A node in the cluster goes offline. | Simple Log Service | event | node-down | sls.app.ack.node.down |
| Cluster node restarts | A node in the cluster restarts. | Simple Log Service | event | node-restart | sls.app.ack.node.restart |
| Anomalous time service on a cluster node | The time synchronization system service on a cluster node is abnormal. | Simple Log Service | event | node-ntp-down | sls.app.ack.ntp.down |
| Anomalous PLEG on a cluster node | The PLEG on a cluster node is abnormal. | Simple Log Service | event | node-pleg-error | sls.app.ack.node.pleg_error |
| Anomalous process on a cluster node | The number of processes on a cluster node is abnormal. | Simple Log Service | event | ps-hang | sls.app.ack.ps.hang |
| Too many file handles on a cluster node | The number of file handles on the node is too large. | Simple Log Service | event | node-fd-pressure | sls.app.ack.node.fd_pressure |
| Too many processes on a cluster node | The number of processes on the cluster node is too large. | Simple Log Service | event | node-pid-pressure | sls.app.ack.node.pid_pressure |
| Failed to delete a node | An event indicating that the cluster failed to delete a node. | Simple Log Service | event | node-del-err | sls.app.ack.ccm.del_node_failed |
| Failed to add a node | An event indicating that the cluster failed to add a node. | Simple Log Service | event | node-add-err | sls.app.ack.ccm.add_node_failed |
| Command execution fails in a managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-run-cmd-err | sls.app.ack.nlc.run_command_fail |
| No specific command is provided for the task in the managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-empty-cmd | sls.app.ack.nlc.empty_task_cmd |
| Unimplemented task mode occurs in the managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-url-m-unimp | sls.app.ack.nlc.url_mode_unimpl |
| Unknown repair operation occurs in the managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-opt-no-found | sls.app.ack.nlc.op_not_found |
| Error occurs when destroying a node in the managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-des-node-err | sls.app.ack.nlc.destroy_node_fail |
| Failed to drain a node in a managed node pool | An anomalous draining event in a managed node pool of the cluster. | Simple Log Service | event | nlc-drain-node-err | sls.app.ack.nlc.drain_node_fail |
| Restarted ECS instance in a managed node pool does not reach the desired state | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-restart-ecs-wait | sls.app.ack.nlc.restart_ecs_wait_fail |
| Failed to restart an ECS instance in a managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-restart-ecs-err | sls.app.ack.nlc.restart_ecs_fail |
| Failed to reset an ECS instance in a managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-reset-ecs-err | sls.app.ack.nlc.reset_ecs_fail |
| Self-healing task fails in a managed node pool | An anomalous event in a managed node pool of the cluster. | Simple Log Service | event | nlc-sel-repair-err | sls.app.ack.nlc.repair_fail |

Alert rule set for resource anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Cluster node CPU utilization ≥ 85% | The CPU utilization of a node instance in the cluster exceeds the threshold (default: 85%). If less than 15% of CPU remains, the CPU reservation of the container engine layer may be exceeded (see Node resource reservation policy), which can cause frequent CPU throttling and severely slow process responses. Optimize CPU usage or adjust the threshold promptly; see Configure alert rules. | CloudMonitor | metric-cms | node_cpu_util_high | cms.host.cpu.utilization |
| Cluster node memory utilization ≥ 85% | The memory usage of a node instance in the cluster exceeds the threshold (default: 85%). If less than 15% of memory remains and usage keeps growing, the memory reservation of the container engine layer will be exceeded (see Node resource reservation policy) and the kubelet will perform forced evictions. Optimize memory usage or adjust the threshold promptly; see Configure alert rules. | CloudMonitor | metric-cms | node_mem_util_high | cms.host.memory.utilization |
| Cluster node disk usage ≥ 85% | The disk usage of a node instance in the cluster exceeds the threshold (default: 85%). To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | node_disk_util_high | cms.host.disk.utilization |
| Cluster node outbound Internet bandwidth usage ≥ 85% | The outbound Internet bandwidth usage of a node instance in the cluster exceeds the threshold (default: 85%). To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | node_public_net_util_high | cms.host.public.network.utilization |
| Cluster node inode usage ≥ 85% | The inode usage of a node instance in the cluster exceeds the threshold (default: 85%). To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | node_fs_inode_util_high | cms.host.fs.inode.utilization |
| Cluster resource: Layer 7 QPS usage of SLB instance ≥ 85% | The queries per second (QPS) of an SLB instance in the cluster exceeds the threshold (default: 85%). The SLB instances are those associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_qps_util_high | cms.slb.qps.utilization |
| Cluster resource: Outbound bandwidth usage of SLB instance ≥ 85% | The outbound bandwidth usage of an SLB instance in the cluster exceeds the threshold (default: 85%). The SLB instances are those associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_traff_tx_util_high | cms.slb.traffic.tx.utilization |
| Cluster resource: Maximum connection usage of SLB instance ≥ 85% | The maximum connection usage of an SLB instance in the cluster exceeds the threshold (default: 85%). The SLB instances are those associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_max_con_util_high | cms.slb.max.connection.utilization |
| Cluster resource: Dropped connections per second for SLB listener ≥ 1 | The number of dropped connections per second for an SLB instance in the cluster continuously exceeds the threshold (default: 1). The SLB instances are those associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_drop_con_high | cms.slb.drop.connection |
| Insufficient disk space on a cluster node | An anomalous event indicating insufficient disk space on a node in the cluster. | Simple Log Service | event | node-disk-pressure | sls.app.ack.node.disk_pressure |
| Insufficient scheduling resources on a cluster node | An anomalous event indicating no available scheduling resources in the cluster. | Simple Log Service | event | node-res-insufficient | sls.app.ack.resource.insufficient |
| Insufficient IP resources on a cluster node | An anomalous event indicating insufficient IP resources in the cluster. | Simple Log Service | event | node-ip-pressure | sls.app.ack.ip.not_enough |
| Disk usage exceeds the threshold | An anomaly where the cluster disk usage exceeds the threshold. Check the cluster disk usage. | Simple Log Service | event | disk_space_press | sls.app.ack.csi.no_enough_disk_space |

Alert rule set for ACK control plane O&M notifications

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| ACK cluster task notification | Records and informs the control plane of relevant plans and changes. | Simple Log Service | event | ack-system-event-info | sls.app.ack.system_events.task.info |
| ACK cluster task failure notification | When a cluster operation fails, you must investigate the cause promptly. | Simple Log Service | event | ack-system-event-error | sls.app.ack.system_events.task.error |

Alert rule set for cluster auto scaling

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Auto scaling: Scale-out | Nodes are automatically scaled out to handle increased load requests. | Simple Log Service | event | autoscaler-scaleup | sls.app.ack.autoscaler.scaleup_group |
| Auto scaling: Scale-in | When the load decreases, nodes are automatically scaled in to save resources. | Simple Log Service | event | autoscaler-scaledown | sls.app.ack.autoscaler.scaledown |
| Auto scaling: Scale-out timeout | A scale-out timeout may indicate insufficient resources or an improper policy. | Simple Log Service | event | autoscaler-scaleup-timeout | sls.app.ack.autoscaler.scaleup_timeout |
| Auto scaling: Scale-in of empty nodes | Inactive nodes are identified and cleaned up to optimize resource usage. | Simple Log Service | event | autoscaler-scaledown-empty | sls.app.ack.autoscaler.scaledown_empty |
| Auto scaling: Scale-out fails | If a scale-out fails, you must immediately analyze the cause and adjust the resource policy. | Simple Log Service | event | autoscaler-up-group-failed | sls.app.ack.autoscaler.scaleup_group_failed |
| Auto scaling: Unhealthy cluster | An unhealthy cluster status due to auto scaling must be handled promptly. | Simple Log Service | event | autoscaler-cluster-unhealthy | sls.app.ack.autoscaler.cluster_unhealthy |
| Auto scaling: Deletion of nodes that fail to start for a long time | Invalid nodes are cleaned up to reclaim resources. | Simple Log Service | event | autoscaler-del-started | sls.app.ack.autoscaler.delete_started_timeout |
| Auto scaling: Deletion of unregistered nodes | Redundant nodes are processed to optimize cluster resources. | Simple Log Service | event | autoscaler-del-unregistered | sls.app.ack.autoscaler.delete_unregistered |
| Auto scaling: Scale-in fails | A scale-in failure may lead to resource waste and uneven load distribution. | Simple Log Service | event | autoscaler-scale-down-failed | sls.app.ack.autoscaler.scaledown_failed |
| Auto scaling: Node is deleted before it is drained | When an auto scaling operation deletes a node, the pods running on the node fail to be evicted or migrated. | Simple Log Service | event | autoscaler-instance-expired | sls.app.ack.autoscaler.instance_expired |

Alert rule set for application workload anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Job fails to run | This alert is triggered when a Job fails during execution. | Managed Service for Prometheus | metric-prometheus | job-failed | prom.job.failed |
| Anomalous status of available replicas in a deployment | This alert is triggered when the number of available replicas in a deployment is insufficient, which may cause the service to be unavailable or partially unavailable. | Managed Service for Prometheus | metric-prometheus | deployment-rep-err | prom.deployment.replicaError |
| Anomalous replica status in a DaemonSet | This alert is triggered when some replicas in a DaemonSet are in an abnormal state, such as failing to start or crashing. This affects the expected behavior or service of the nodes. | Managed Service for Prometheus | metric-prometheus | daemonset-status-err | prom.daemonset.scheduledError |
| Anomalous replica scheduling in a DaemonSet | This alert is triggered when a DaemonSet fails to correctly schedule some or all nodes, possibly due to resource constraints or an improper scheduling policy. | Managed Service for Prometheus | metric-prometheus | daemonset-misscheduled | prom.daemonset.misscheduled |

Alert rule set for container replica anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| OOM occurs in a container replica in the cluster | An out-of-memory (OOM) error occurs in a pod or a process within it. | Simple Log Service | event | pod-oom | sls.app.ack.pod.oom |
| Container replica in the cluster fails to start | An event indicating that a pod in the cluster fails to start. | Simple Log Service | event | pod-failed | sls.app.ack.pod.failed |
| Anomalous pod status | This alert is triggered when a pod is in an unhealthy state, such as Pending, Failed, or Unknown. | Managed Service for Prometheus | metric-prometheus | pod-status-err | prom.pod.status.notHealthy |
| Pod fails to start | This alert is triggered when a pod frequently fails to start and enters the CrashLoopBackOff state or another failed state. | Managed Service for Prometheus | metric-prometheus | pod-crashloop | prom.pod.status.crashLooping |

Alert rule set for storage anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Disk capacity is less than the 20 GiB limit | Due to a fixed limitation, you cannot attach a disk smaller than 20 GiB. Check the capacity of the attached disk. | Simple Log Service | event | csi_invalid_size | sls.app.ack.csi.invalid_disk_size |
| Subscription disks are not supported for container volumes | Due to a fixed limitation, you cannot attach a subscription disk. Check the billing method of the attached disk. | Simple Log Service | event | csi_not_portable | sls.app.ack.csi.disk_not_portable |
| Failed to unmount the mount target because it is being used by a process | The resource has not been fully released, or an active process is accessing the mount target. | Simple Log Service | event | csi_device_busy | sls.app.ack.csi.deivce_busy |
| No available disks | An anomaly where no available disks can be attached to the cluster storage. | Simple Log Service | event | csi_no_ava_disk | sls.app.ack.csi.no_ava_disk |
| Disk IOHang | An IOHang anomaly occurs in the cluster. | Simple Log Service | event | csi_disk_iohang | sls.app.ack.csi.disk_iohang |
| Slow I/O occurs on the PVC bound to the disk | A slow I/O anomaly occurs on the PVC bound to the cluster disk. | Simple Log Service | event | csi_latency_high | sls.app.ack.csi.latency_too_high |
| Anomalous PersistentVolume status | An anomaly occurs on a PV in the cluster. | Managed Service for Prometheus | metric-prometheus | pv-failed | prom.pv.failed |

Alert rule set for network anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Multiple route tables exist in the VPC | This may complicate network configuration or cause route conflicts. You need to optimize the network structure promptly. | Simple Log Service | event | ccm-vpc-multi-route-err | sls.app.ack.ccm.describe_route_tables_failed |
| No available SLB instances | An event indicating that the cluster cannot create an SLB instance. | Simple Log Service | event | slb-no-ava | sls.app.ack.ccm.no_ava_slb |
| Failed to sync the SLB instance | An event indicating that the cluster failed to sync the created SLB instance. | Simple Log Service | event | slb-sync-err | sls.app.ack.ccm.sync_slb_failed |
| Failed to delete the SLB instance | An event indicating that the cluster failed to delete the SLB instance. | Simple Log Service | event | slb-del-err | sls.app.ack.ccm.del_slb_failed |
| Failed to create a route | An event indicating that the cluster failed to create a VPC network route. | Simple Log Service | event | route-create-err | sls.app.ack.ccm.create_route_failed |
| Failed to sync a route | An event indicating that the cluster failed to sync a VPC network route. | Simple Log Service | event | route-sync-err | sls.app.ack.ccm.sync_route_failed |
| Invalid Terway resource | An anomalous event indicating an invalid Terway network resource in the cluster. | Simple Log Service | event | terway-invalid-res | sls.app.ack.terway.invalid_resource |
| Terway fails to assign an IP address | An anomalous event indicating that the Terway network resource in the cluster fails to assign an IP address. | Simple Log Service | event | terway-alloc-ip-err | sls.app.ack.terway.alloc_ip_fail |
| Failed to parse the Ingress bandwidth configuration | An anomalous event indicating a configuration parsing error for the cluster Ingress network. | Simple Log Service | event | terway-parse-err | sls.app.ack.terway.parse_fail |
| Terway fails to allocate network resources | An anomalous event indicating that the Terway network resource in the cluster fails to be allocated. | Simple Log Service | event | terway-alloc-res-err | sls.app.ack.terway.allocate_failure |
| Terway fails to reclaim network resources | An anomalous event indicating that the Terway network resource in the cluster fails to be reclaimed. | Simple Log Service | event | terway-dispose-err | sls.app.ack.terway.dispose_failure |
| Terway virtual mode changes | An event indicating a change in the virtual mode of the cluster Terway network. | Simple Log Service | event | terway-virt-mod-err | sls.app.ack.terway.virtual_mode_change |
| Terway triggers a pod IP configuration check | An event indicating that the cluster Terway network triggers a pod IP configuration check. | Simple Log Service | event | terway-ip-check | sls.app.ack.terway.config_check |
| Failed to reload the Ingress configuration | An anomalous event indicating that the cluster Ingress network configuration fails to reload. Check whether the Ingress configuration is correct. | Simple Log Service | event | ingress-reload-err | sls.app.ack.ingress.err_reload_nginx |

Alert rule set for important audit operations

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| A user logs on to a container or executes a command in the cluster | This may be a maintenance or anomalous activity. Audit operations can be used for tracking and security detection. | Simple Log Service | event | audit-at-command | sls.app.k8s.audit.at.command |
| The scheduling status of a cluster node changes | This affects service efficiency and resource load. You must promptly follow up on the intent of the change and verify the effect. | Simple Log Service | event | audit-cordon-switch | sls.app.k8s.audit.at.cordon.uncordon |
| A resource is deleted from the cluster | Resource deletion may be a planned or anomalous behavior. We recommend that you audit the operation to prevent risks. | Simple Log Service | event | audit-resource-delete | sls.app.k8s.audit.at.delete |
| A node is drained or an eviction occurs in the cluster | This reflects node load pressure or policy execution. You must confirm its necessity and impact. | Simple Log Service | event | audit-drain-eviction | sls.app.k8s.audit.at.drain.eviction |
| A user logs on to the cluster from the Internet | Logging on from the Internet may pose security risks. You must confirm the logon and access permission configurations. | Simple Log Service | event | audit-internet-login | sls.app.k8s.audit.at.internet.login |
| A node label is updated in the cluster | Label updates are used to distinguish and manage node resources. Correctness affects O&M efficiency. | Simple Log Service | event | audit-node-label-update | sls.app.k8s.audit.at.label |
| A node taint is updated in the cluster | Changes in node taint configuration affect scheduling policies and toleration mechanisms. You must correctly execute and review the configuration. | Simple Log Service | event | audit-node-taint-update | sls.app.k8s.audit.at.taint |
| A resource is modified in the cluster | Real-time modification of resource configurations may indicate adjustments to application policies. You must verify whether it aligns with business objectives. | Simple Log Service | event | audit-resource-update | sls.app.k8s.audit.at.update |

Alert rule set for security anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Security inspection finds a high-risk configuration | An event indicating that a cluster security inspection has found a high-risk configuration. | Simple Log Service | event | si-c-a-risk | sls.app.ack.si.config_audit_high_risk |

Alert rule set for cluster inspection anomalies

| Alert Item | Rule Description | Alert Source | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
| --- | --- | --- | --- | --- | --- |
| Cluster inspection finds an anomaly | The automatic inspection mechanism captures a potential anomaly. You need to analyze the specific issue and daily maintenance policies. | Simple Log Service | event | cis-sched-failed | sls.app.ack.cis.schedule_task_failed |

Troubleshooting guide for alerts

Pod eviction triggered by node disk usage reaching the threshold

Alert message

(combined from similar events): Failed to garbage collect required amount of images. Attempted to free XXXX bytes, but only found 0 bytes eligible to free

Symptoms

The pod status is Evicted, and the node experiences disk pressure (The node had condition: [DiskPressure]).

Cause

When the node disk usage reaches the eviction threshold (the default is 85%), the kubelet performs pressure-based eviction and garbage collection to reclaim unused image files. This process causes the pod to be evicted. You can log on to the target node and run the df -h command to view the disk usage.
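
For example, the check mentioned above (run on the node; output varies by OS):

    # View the disk usage of each filesystem on the node.
    df -h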

Solution

  1. Log on to the target node (containerd runtime environment) and run the following command to delete unused container images and release disk space.

    crictl rmi --prune
  2. Clean up logs or resize the node disk.

  3. Adjust the relevant thresholds.

    • Adjust the kubelet image garbage collection threshold as needed to reduce pod evictions caused by high disk usage on nodes. For more information, see Customize kubelet configurations for a node pool.

    • When the node disk usage reaches or exceeds 85%, you receive an alert. You can modify the alert threshold of the node_disk_util_high alert rule in the YAML configuration based on your business needs, as sketched below. For more information, see Configure alert rules.
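
    A minimal sketch of that change in the AckAlertRule resource. The rule name and expression come from the resource anomaly table in this topic; the 90% value is only an example:

      - name: node_disk_util_high
        type: metric-cms
        expression: cms.host.disk.utilization
        enable: enable
        thresholds:
          - key: CMS_ESCALATIONS_CRITICAL_Threshold
            unit: percent
            value: '90'  # Example: raise the disk usage alert threshold from the default 85% to 90%.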

Recommendations and preventive measures

  • For nodes that frequently encounter this issue, we recommend that you assess the actual storage needs of your applications and properly plan resource requests and node disk capacity.

  • We recommend that you regularly monitor your storage usage to promptly identify and address potential threats. For more information, see the Node Storage Dashboard.

Pod OOMKilling

Alert message

pod was OOM killed. node:xxx pod:xxx namespace:xxx uuid:xxx

Symptoms

The pod status is abnormal, and the event details contain PodOOMKilling.

Solution

An Out of Memory (OOM) event can be triggered at the node level or the container cgroup level.

  • Causes:

    • Container cgroup-level OOM: The actual memory usage of a pod exceeds its memory limits. The kernel OOM killer then forcibly terminates a process in the pod based on the cgroup memory limit.

    • Node-level OOM: This usually occurs when too many pods without resource limits (requests/limits) are running on a node, or when some processes (which may not be managed by Kubernetes) consume a large amount of memory.

  • Method: Log on to the target node and run the dmesg -T | grep -i "memory" command (see the sketch after this list). If the output contains out_of_memory, an OOM event has occurred. If the log output also contains Memory cgroup, the event is a container cgroup-level OOM. Otherwise, the event is a node-level OOM.

  • Suggestions: For more information about the causes of OOM events and their solutions, see Causes and solutions for OOM Killer.
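
A minimal sketch of the check described above, run on the affected node:

    # Search the kernel log for OOM records.
    dmesg -T | grep -i "memory"
    # out_of_memory in the output confirms an OOM event; a "Memory cgroup" line
    # indicates a container cgroup-level OOM rather than a node-level OOM.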

Pod status is CrashLoopBackOff

When a process in a pod exits unexpectedly, ACK tries to restart the pod. If the pod fails to reach the desired state after multiple restarts, its status changes to CrashLoopBackOff. Follow these steps to troubleshoot in the console; a kubectl alternative is sketched after the steps:

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Workloads > Pods.

  3. Find the abnormal pod in the list and click Details in the Actions column.

  4. Check the Events of the pod and analyze the description of the abnormal event.

  5. View the Logs of the pod, which may record the cause of the abnormal process.

    Note

    If the pod has been restarted, select Show the log of the last container exit to view the logs of the previous container instance.

    The console displays a maximum of 500 of the most recent log entries. To view more historical logs, we recommend that you set up a log persistence solution for unified collection and storage.
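
If you prefer the command line, the same checks can be approximated with kubectl. This is a sketch; <pod-name> and <namespace> are placeholders:

    # Inspect the pod's events for the failure reason.
    kubectl describe pod <pod-name> -n <namespace>
    # View the logs of the last exited container.
    kubectl logs <pod-name> -n <namespace> --previous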