You can enable the Container Service alert management feature to centrally manage container alerts. This feature monitors for anomalous events in container services, key metrics for basic cluster resources, and metrics for core cluster components and applications. You can also use CustomResourceDefinitions (CRDs) to modify the default alert rules in your cluster. This helps you promptly detect cluster anomalies.
Billing
The alert feature uses data from Simple Log Service (SLS), Managed Service for Prometheus, and CloudMonitor. Additional fees are charged for notifications, such as text messages and phone calls, that are sent when an alert is triggered. Before you enable the alert feature, check the data source for each alert item in the default alert rule templates and activate the required services.
| Alert Source | Configuration Requirements | Billing Details |
| --- | --- | --- |
| Simple Log Service (SLS) | Enable event monitoring. Event monitoring is enabled by default when you enable the alert management feature. | |
| Managed Service for Prometheus | Configure Prometheus monitoring for your cluster. | Free of charge |
| CloudMonitor | For the cluster: Enable the CloudMonitor feature for a Container Service for Kubernetes cluster. | |
Enable the alert management feature
After you enable the alert management feature, you can set metric-based alerts for specified resources in your cluster. You automatically receive alert notifications when anomalies occur. This helps you manage and maintain your cluster more efficiently and ensures service stability. For more information about resource alerts, see Default alert rule templates.
ACK managed clusters
You can enable alert configuration for an existing cluster or when you create a new cluster.
Enable the feature for an existing cluster
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose O&M > Alerting.
On the Alerting page, follow the on-screen instructions to install or upgrade the components.
After the installation or upgrade is complete, go to the Alerting page to configure alert information.
The Alerting page contains the following tabs:
Alert Rule Management
Status: Turn on or off the target alert rule set.
Edit Notification Object: Set the contact group for alert notifications.
Before you configure this, create contacts and groups, and add the contacts to the groups. You can only select contact groups as notification objects. To notify a single person, create a group that contains only that contact and select the group.
Alert History
You can view the latest 100 alert records from the last 24 hours.
Click the link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration.
Click Troubleshoot to quickly locate the resource where the anomaly occurred (anomalous event or metric).
Click Intelligent Analysis to use the AI assistant to help analyze the issue and provide troubleshooting guidance.
Contact Management
Manage contacts. You can create, edit, or delete contacts.
Contact Methods:
Phone/Text Message: After you set a mobile number for a contact, the contact can receive alert notifications by phone and text message.
Only verified mobile numbers can be used to receive phone call notifications. For more information about how to verify a mobile number, see Verify a mobile number.
Email: After you set an email address for a contact, the contact can receive alert notifications by email.
Robot: DingTalk Robot, WeCom Robot, and Lark Robot.
For DingTalk robots, you must add the security keywords: Alerting, Dispatch.
Before you configure email and robot notifications, verify them in the CloudMonitor console to ensure that you can receive alert information.
Contact Group Management
Manage contact groups. You can create, edit, or delete contact groups. You can only select contact groups when you Edit Notification Object.
If no contact group exists, the console creates a default contact group based on your Alibaba Cloud account information.
Enable the feature when creating a cluster
On the Component Configurations page of the cluster creation wizard, select Configure Alerts Using The Default Alert Template for Alerting and select an Alert Notification Contact Group. For more information, see Create an ACK managed cluster.
After you enable alert configurations during cluster creation, the system enables the default alert rules and sends alert notifications to the default alert contact group. You can also modify the alert contacts or alert contact groups.
ACK dedicated clusters
For ACK dedicated clusters, you must first grant permissions to the worker RAM role and then enable the default alert rules.
Grant permissions to the worker RAM role
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the target cluster and click its name. In the navigation pane on the left, click Cluster Information.
On the Cluster Information page, in the Cluster Resources section, copy the name of the Worker RAM Role and click the link to open the Resource Access Management (RAM) console and grant permissions to the role.
Create a custom policy. For more information, see Create a custom policy on the JSON tab.
{ "Action": [ "log:*", "arms:*", "cms:*", "cs:UpdateContactGroup" ], "Resource": [ "*" ], "Effect": "Allow" }
On the Roles page, find the worker RAM role and grant the custom policy to it. For more information, see Method 1: Grant permissions to a RAM role on the RAM role page.
Note: This document grants broad permissions for simplicity. In a production environment, we recommend that you follow the principle of least privilege and grant only the required permissions.
Check the logs to verify that the access permissions for the alert feature are configured.
In the navigation pane on the left of the cluster management page, choose Workloads > Deployments.
Set Namespace to kube-system and click the Name of the alicloud-monitor-controller application in the list of stateless applications.
Click the Logs tab and confirm from the pod logs that the authorization was successful.
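Alternatively, you can check the same logs from the command line. A minimal sketch, assuming the component runs as a Deployment named alicloud-monitor-controller in the kube-system namespace, as shown in the console:

# Print the most recent log lines of the alert controller and confirm that authorization succeeded.
kubectl -n kube-system logs deployment/alicloud-monitor-controller --tail=50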
Enable default alert rules
In the navigation pane on the left of the cluster management page, choose O&M > Alerting.
On the Alerting page, configure the following alert information.
The following tabs are available:
Alert Rule Management
Status: Turn on or off the target alert rule set.
Edit Notification Object: Set the contact group for alert notifications.
Before you configure this, create contacts and groups, and add the contacts to the groups. You can only select contact groups as notification objects. To notify a single person, create a group that contains only that contact and select the group.
Alert History
You can view the latest 100 alert records from the last 24 hours.
Click the link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration.
Click Troubleshoot to quickly locate the resource where the anomaly occurred (anomalous event or metric).
Click Intelligent Analysis to use the AI assistant to help analyze the issue and provide troubleshooting guidance.
Contact Management
Manage contacts. You can create, edit, or delete contacts.
Contact Methods:
Phone/Text Message: After you set a mobile number for a contact, the contact can receive alert notifications by phone and text message.
Only verified mobile numbers can be used to receive phone call notifications. For more information about how to verify a mobile number, see Verify a mobile number.
Email: After you set an email address for a contact, the contact can receive alert notifications by email.
Robot: DingTalk Robot, WeCom Robot, and Lark Robot.
For DingTalk robots, you must add the security keywords: Alerting, Dispatch.
Before you configure email and robot notifications, verify them in the CloudMonitor console to ensure that you can receive alert information.
Contact Group Management
Manage contact groups. You can create, edit, or delete contact groups. You can only select contact groups when you Edit Notification Object.
If no contact group exists, the console creates a default contact group based on your Alibaba Cloud account information.
Configure alert rules
After you enable the alert configuration feature, an AckAlertRule CustomResourceDefinition (CRD) resource is created in the kube-system namespace. This resource contains the default alert rule templates. You can modify this CRD resource to customize the default alert rules and configure container service alerts that meet your requirements.
Console
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose O&M > Alerting.
On the Alert Rule Management tab, click Edit Alert Configuration in the upper-right corner. Then, click YAML in the Actions column of the target rule to view the AckAlertRule resource configuration for the current cluster.
Modify the YAML file as needed. For more information, see Default alert rule templates.
The following sample shows the general shape of an AckAlertRule configuration with a customized threshold. The apiVersion, group name, and rule name in the sample are illustrative and may differ from the default template in your cluster; the threshold keys are described in the parameter table that follows.
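apiVersion: alert.alibabacloud.com/v1beta1   # illustrative; keep the apiVersion from the default template in your cluster
kind: AckAlertRule
metadata:
  name: default
  namespace: kube-system
spec:
  groups:
    # The group and rule names are illustrative; reuse the names from the default template.
    - name: node-cpu-err
      rules:
        - name: node_cpu_util_high
          enable: enable
          type: metric-cms
          thresholds:
            # Alert when node CPU utilization exceeds 85 percent.
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: "85"
            # Require three consecutive out-of-threshold checks.
            - key: CMS_ESCALATIONS_CRITICAL_Times
              value: "3"
            # Wait at least 900 seconds before repeating the alert.
            - key: CMS_RULE_SILENCE_SEC
              value: "900"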
You can use rules.thresholds to customize the alert threshold. For more information about the parameters, see the following table. For example, the preceding configuration triggers an alert notification if the CPU utilization of a cluster node exceeds 85% for three consecutive checks and the previous alert was triggered more than 900 seconds ago.

| Parameter | Required | Description | Default Value |
| --- | --- | --- | --- |
| CMS_ESCALATIONS_CRITICAL_Threshold | Required | The alert threshold. If this parameter is not configured, the rule fails to sync and is disabled. unit: the unit, which can be percent, count, or qps. value: the threshold value. | Depends on the default alert template configuration. |
| CMS_ESCALATIONS_CRITICAL_Times | Optional | The number of retries for the CloudMonitor rule. If this is not configured, the default value is used. | 3 |
| CMS_RULE_SILENCE_SEC | Optional | The silence period, in seconds, after the first alert is reported when CloudMonitor continuously triggers the rule due to an anomaly. This prevents alert fatigue. If this is not configured, the default value is used. | 900 |
kubectl
Run the following command to edit the YAML file of the alert rule.
kubectl edit ackalertrules default -n kube-system
Modify the YAML file as needed, and then save and exit. For more information, see Default alert rule templates.
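For reference, a customized rule inside spec.groups[].rules[] might look like the following excerpt. The rule name is illustrative; the threshold keys are described in the table that follows:

# Excerpt of a single rule; the rule name is illustrative.
- name: node_cpu_util_high
  thresholds:
    # Alert threshold: 85 percent CPU utilization.
    - key: CMS_ESCALATIONS_CRITICAL_Threshold
      unit: percent
      value: "85"
    # Three consecutive out-of-threshold checks before an alert fires.
    - key: CMS_ESCALATIONS_CRITICAL_Times
      value: "3"
    # Wait at least 900 seconds before repeating the alert.
    - key: CMS_RULE_SILENCE_SEC
      value: "900"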
You can use rules.thresholds to customize the alert threshold. For example, the preceding excerpt triggers an alert notification if the CPU utilization of a cluster node exceeds 85% for three consecutive checks and the previous alert was triggered more than 900 seconds ago.

| Parameter | Required | Description | Default Value |
| --- | --- | --- | --- |
| CMS_ESCALATIONS_CRITICAL_Threshold | Required | The alert threshold. If this parameter is not configured, the rule fails to sync and is disabled. unit: the unit, which can be percent, count, or qps. value: the threshold value. | Depends on the default alert template configuration. |
| CMS_ESCALATIONS_CRITICAL_Times | Optional | The number of retries for the CloudMonitor rule. If this is not configured, the default value is used. | 3 |
| CMS_RULE_SILENCE_SEC | Optional | The silence period, in seconds, after the first alert is reported when CloudMonitor continuously triggers the rule due to an anomaly. This prevents alert fatigue. If this is not configured, the default value is used. | 900 |
Default alert rule templates
Alert rules are synced from Simple Log Service (SLS), Managed Service for Prometheus, and CloudMonitor. On the Alerting page, you can view the configuration of each alert rule by clicking Advanced Settings in the Alert Management column.
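If you prefer the command line, you can also inspect the synced rules through the AckAlertRule resource described in Configure alert rules. A minimal sketch that uses the default resource name from the kubectl example in that section:

# Print the full set of alert rules that are currently synced to the cluster.
kubectl get ackalertrules default -n kube-system -o yaml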
Troubleshooting guide for alerts
Pod eviction triggered by node disk usage reaching the threshold
Alert message (combined from similar events)
Failed to garbage collect required amount of images. Attempted to free XXXX bytes, but only found 0 bytes eligible to free
Symptoms
The pod status is Evicted, and the node reports disk pressure (The node had condition: [DiskPressure].).
Cause
When the node disk usage reaches the eviction threshold (the default is 85%), the kubelet performs pressure-based eviction and garbage collection to reclaim unused image files. This process causes the pod to be evicted. You can log on to the target node and run the df -h command to view the disk usage.
Solution
Log on to the target node (containerd runtime environment) and run the following command to delete unused container images and release disk space.
crictl rmi --prune
Clean up logs or resize the node disk.
Create a snapshot backup of the data disk or system disk for the target node. After the backup is complete, delete files or folders that are no longer needed. For more information, see Resolve full disk space issues on Linux instances.
Scale out the system disk or data disk of the target node online to increase its storage capacity. For more information, see Scale out the system disk or data disk of a node.
Adjust the relevant thresholds.
Adjust the kubelet image garbage collection threshold as needed to reduce pod evictions caused by high disk usage on nodes. A sketch of the relevant kubelet fields is shown after this list. For more information, see Customize kubelet configurations for a node pool.
When the node disk usage reaches or exceeds 85%, you receive an alert. You can modify the alert threshold in the node_disk_util_high alert rule in the YAML configuration based on your business needs, as shown in the example after this list. For more information, see Configure alert rules.
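The following sketches illustrate both adjustments. The kubelet fields are the standard image garbage collection thresholds, assumed here to be configurable through the node pool's custom kubelet configuration, and the alert-rule excerpt reuses the threshold key from the parameter table in Configure alert rules; the values are examples only.

# Kubelet image garbage collection thresholds (standard KubeletConfiguration fields; example values).
imageGCHighThresholdPercent: 85   # start image garbage collection when disk usage reaches 85%
imageGCLowThresholdPercent: 80    # stop garbage collection once disk usage drops to 80%

# Excerpt of the node_disk_util_high rule in the AckAlertRule resource; 90 percent is an example value.
- name: node_disk_util_high
  thresholds:
    - key: CMS_ESCALATIONS_CRITICAL_Threshold
      unit: percent
      value: "90"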
Recommendations and preventive measures
For nodes that frequently encounter this issue, we recommend that you assess the actual storage needs of your applications and properly plan resource requests and node disk capacity.
We recommend that you regularly monitor your storage usage to promptly identify and address potential threats. For more information, see the Node Storage Dashboard.
Pod OOMKilling
Alert message
pod was OOM killed. node:xxx pod:xxx namespace:xxx uuid:xxx
Symptoms
The pod status is abnormal, and the event details contain PodOOMKilling.
Solution
An Out of Memory (OOM) event can be triggered at the node level or the container cgroup level.
Causes:
Container cgroup-level OOM: The actual memory usage of a pod exceeds its memory limits. The process is then forcibly terminated by the kernel OOM killer within the container's cgroup.
Node-level OOM: This usually occurs when too many pods without resource limits (requests/limits) are running on a node, or when some processes (which may not be managed by Kubernetes) consume a large amount of memory.
Method: Log on to the target node and run the dmesg -T | grep -i "memory" command. If the output contains out_of_memory, an OOM event has occurred. If the log output also contains Memory cgroup, the event is a container cgroup-level OOM. Otherwise, the event is a node-level OOM.
Suggestions:
For container cgroup-level OOM:
Increase the pod's memory limits as needed. The actual usage should not exceed 80% of the specified limits. For more information, see Manage pods and Upgrade or downgrade node resources.
Enable resource profiling to obtain recommended configurations for container requests and limits.
For node-level OOM:
Scale out the memory resources of the node or distribute workloads to more nodes. For more information, see Upgrade or downgrade node resources and Schedule applications to specific nodes.
Identify pods with high memory usage on the node and set reasonable memory limits for them.
For more information about the causes of OOM events and their solutions, see Causes and solutions for OOM Killer.
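The checks described above can be run as follows. This is a minimal sketch that assumes you can log on to the node and that the metrics-server component is available for kubectl top:

# On the affected node: determine whether the OOM occurred at the container cgroup level or the node level.
dmesg -T | grep -i "out_of_memory"
dmesg -T | grep -i "memory cgroup"

# From a machine with cluster access: list the pods with the highest memory usage (requires metrics-server).
kubectl top pods -A --sort-by=memory | head -n 20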
Pod status is CrashLoopBackOff
When a process in a pod exits unexpectedly, ACK tries to restart the pod. If the pod fails to reach the desired state after multiple restarts, its status changes to CrashLoopBackOff. Follow these steps to troubleshoot:
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Workloads > Pods.
Find the abnormal pod in the list and click Details in the Actions column.
Check the Events of the pod and analyze the description of the abnormal event.
View the Logs of the pod, which may record the cause of the abnormal process.
Note: If the pod has been restarted, select Show the log of the last container exit to view the logs of the previous container instance.
The console displays a maximum of 500 of the most recent log entries. To view more historical logs, we recommend that you set up a log persistence solution for unified collection and storage.
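If you have kubectl access to the cluster, you can gather the same information from the command line. The pod and namespace names below are placeholders:

# Show recent events and state transitions for the pod.
kubectl -n <namespace> describe pod <pod-name>

# View the logs of the current container, and then the logs of the previous (crashed) container instance.
kubectl -n <namespace> logs <pod-name>
kubectl -n <namespace> logs <pod-name> --previous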