Commit 2c341eb

Add runbook for alert KubeAggregatedAPIErrors
The AggregatedAPIErrors alert was recently renamed[1] to KubeAggregatedAPIErrors. This commit adds a runbook for the KubeAggregatedAPIErrors alert and keeps the old runbook as is for backward compatibility.

[1] openshift/cluster-monitoring-operator@1b85b55

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>
1 parent aa25cab commit 2c341eb

1 file changed: +79 -0 lines changed

@@ -0,0 +1,79 @@
# KubeAggregatedAPIErrors

## Meaning

[This alert][KubeAggregatedAPIErrors] is triggered when multiple calls to the
aggregated API of OpenShift fail over a certain period.

## Impact

Errors on the aggregated API can result in the unavailability of some OpenShift
services.

## Diagnosis

The alert should contain information about the affected API and the scope of the
impact.

```text
- alertname = KubeAggregatedAPIErrors
- name = v1.packages.operators.coreos.com
- namespace = default
...
- message = Kubernetes aggregated API v1.packages.operators.coreos.com/default has reported errors. It has appeared unavailable 5 times averaged over the past 10m.
```

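To see why the API named in the alert is considered unavailable, its `Available` status condition can be inspected. The following is a sketch only: the helper name `apiservice_available_condition` is hypothetical, and it assumes the `name` label of the alert matches the name of the `APIService` object.

```shell
# Print the status, reason and message of the "Available" condition of an
# APIService, separated by tabs.
# Usage (requires cluster access):
#   apiservice_available_condition v1.packages.operators.coreos.com
apiservice_available_condition() {
    oc get apiservice "$1" \
        -o jsonpath='{range .status.conditions[?(@.type=="Available")]}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

The reason and message fields usually narrow the problem down to the backing service or its endpoints before any pods are inspected.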
## Mitigation

### Check that the status of every API is `True`

Currently, an OpenShift cluster has at least four aggregated APIs: the API in
the `openshift-apiserver` namespace, the prometheus-adapter in the
`openshift-monitoring` namespace, the package-server service in the
`openshift-operator-lifecycle-manager` namespace, and the API in the
`openshift-oauth-apiserver` namespace. However, it makes sense to check the
availability of all APIs.

To get a list of `APIServices` and their backing aggregated APIs, use the
following command:

```console
$ oc get apiservice
```

The `SERVICE` column shows the name of the aggregated API. The availability
status of every listed API should be `True`. A status of `False` means that
requests for that API service, its API server pods, or resources belonging to
that API group failed many times during the last few minutes.

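On a cluster with many `APIServices`, the unavailable ones are easy to miss in the full listing. A minimal sketch of a filter, assuming the default table output of `oc get apiservice` with `AVAILABLE` as the third column (the function name `unavailable_apiservices` is hypothetical):

```shell
# Keep the header row and every row whose AVAILABLE column (third field)
# is not exactly "True", for example "False (MissingEndpoints)".
unavailable_apiservices() {
    awk 'NR == 1 || $3 != "True"'
}

# Usage (requires cluster access):
#   oc get apiservice | unavailable_apiservices
```

If only the header row is printed, every listed API currently reports as available.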
Fetch the pods that serve the unavailable API. For example, for
`openshift-apiserver/api`, use the following command:

```console
$ oc get pods -n openshift-apiserver
```

When their status is not `Running`, check the logs for more details. As these
pods are controlled by a deployment, they can be restarted when they no longer
answer requests.

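The log check and restart described above could look like the following sketch. The function names are hypothetical, and the namespace and pod name are whatever the previous `oc get pods` command returned. Deleting a pod relies on its deployment to create a replacement.

```shell
# Print the last 100 log lines of every container in a pod.
pod_logs() {
    oc logs -n "$1" "$2" --all-containers --tail=100
}

# Delete a pod so that its controlling deployment recreates it.
recreate_pod() {
    oc delete pod -n "$1" "$2"
}

# Usage (requires cluster access; <pod-name> is a placeholder):
#   pod_logs openshift-apiserver <pod-name>
#   recreate_pod openshift-apiserver <pod-name>
```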
### Check the authentication certificates of the aggregated API

Make sure the certificates are up to date and still valid. Use:

```console
$ oc get configmaps -n kube-system extension-apiserver-authentication
```

You can save those certificates into a file and use the following command to
check the end dates:

```console
$ openssl x509 -noout -enddate -in {myfile_with_certs.crt}
```

The aggregated APIs use these certificates to validate requests. If the
certificates have expired, see [here][cert] for how to add a new one.

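Note that `openssl x509` reads only the first certificate in a file, so a bundle with several certificates needs to be split first. A minimal sketch under stated assumptions: the function name `cert_end_dates` is hypothetical, and the config map key `client-ca-file` used in the usage comment is assumed to hold a PEM bundle.

```shell
# Print the subject and end date of every certificate in a PEM bundle.
cert_end_dates() {
    ced_tmpdir="$(mktemp -d)" || return 1
    # Write each certificate, starting at its BEGIN line, to its own file.
    awk -v dir="$ced_tmpdir" '
        /-----BEGIN CERTIFICATE-----/ { n++ }
        n { print > (dir "/cert-" n ".pem") }
    ' "$1"
    for ced_cert in "$ced_tmpdir"/cert-*.pem; do
        [ -e "$ced_cert" ] || continue
        openssl x509 -noout -subject -enddate -in "$ced_cert"
    done
    rm -rf "$ced_tmpdir"
}

# Usage (requires cluster access; the key name is an assumption):
#   oc get configmap -n kube-system extension-apiserver-authentication \
#       -o jsonpath='{.data.client-ca-file}' > client-ca.crt
#   cert_end_dates client-ca.crt
```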
[cert]: https://docs.openshift.com/container-platform/latest/security/certificates/api-server.html
[KubeAggregatedAPIErrors]: https://github.com/openshift/cluster-monitoring-operator/blob/1824f9c9a39f54734298dd10e5d20d42c8247995/assets/control-plane/prometheus-rule.yaml#L399-L408
