| 1 | +# KubeAggregatedAPIErrors |
| 2 | + |
| 3 | +## Meaning |
| 4 | + |
| 5 | +[This alert][KubeAggregatedAPIErrors] is triggered when multiple calls to the |
| 6 | +aggregated API of OpenShift fail over a certain period. |
| 7 | + |
| 8 | +## Impact |
| 9 | + |
| 10 | +Errors on the aggregated API can result in the unavailability of some OpenShift |
| 11 | +services. |
| 12 | + |
| 13 | +## Diagnosis |
| 14 | + |
| 15 | +The alert should contain information about the affected API and the scope of the |
| 16 | +impact. |
| 17 | + |
| 18 | +```text |
| 19 | + - alertname = KubeAggregatedAPIErrors |
| 20 | + - name = v1.packages.operators.coreos.com |
| 21 | + - namespace = default |
| 22 | +... |
| 23 | + - message = Kubernetes aggregated API v1.packages.operators.coreos.com/default has reported errors. It has appeared unavailable 5 times averaged over the past 10m. |
| 24 | +``` |
| 25 | + |
| 26 | +## Mitigation |
| 27 | + |
| 28 | +### Check that the API status checks report True |
| 29 | + |
| 30 | +Currently, an OpenShift cluster has at least four aggregated APIs: the API in |
| 31 | +the `openshift-apiserver` namespace, the prometheus-adapter in the |
| 32 | +`openshift-monitoring` namespace, the package-server service in the |
| 33 | +`openshift-operator-lifecycle-manager` namespace, and the API in the |
| 34 | +`openshift-oauth-apiserver` namespace. Even so, it is worth checking the |
| 35 | +availability of all APIs. |
| 36 | + |
| 37 | +To get a list of `APIServices` and their backing aggregated APIs, use the |
| 38 | +following command: |
| 39 | + |
| 40 | +```console |
| 41 | +$ oc get apiservice |
| 42 | +``` |
| 43 | + |
| 44 | +The `SERVICE` column shows the aggregated API name. The `AVAILABLE` column |
| 45 | +should be `True` for every listed API. A `False` means that requests for that |
| 46 | +API service, API server pods, or resources belonging to that apiGroup failed |
| 47 | +repeatedly over the last few minutes. |
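
On a cluster with many `APIServices`, the unavailable ones are easy to miss. The following is a minimal sketch of a filter for them, assuming the default column layout of `oc get apiservice` (`NAME`, `SERVICE`, `AVAILABLE`, `AGE`). The function name `unavailable_apiservices` is hypothetical and not part of `oc`.

```shell
# Hypothetical helper: keep the header row plus every row whose third
# column (AVAILABLE) does not start with "True".
# Assumes the default column order NAME, SERVICE, AVAILABLE, AGE.
unavailable_apiservices() {
  awk 'NR == 1 || $3 !~ /^True/'
}

# Usage (needs a logged-in `oc` session):
#   oc get apiservice | unavailable_apiservices
```

If only the header row is printed, every `APIService` currently reports as available.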
| 48 | + |
| 49 | +Fetch the pods that serve the unavailable API. For example, for |
| 50 | +`openshift-apiserver/api`, use the following command: |
| 51 | + |
| 52 | +```console |
| 53 | +$ oc get pods -n openshift-apiserver |
| 54 | +``` |
| 55 | + |
| 56 | +When their status is not `Running`, check the logs for more details. As these |
| 57 | +pods are controlled by a deployment, they can be restarted when they no longer |
| 58 | +respond to requests. |
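
The pod check above can be narrowed to the pods that need attention. This is a rough sketch, assuming the default `oc get pods` columns (`NAME`, `READY`, `STATUS`, `RESTARTS`, `AGE`). The helper name `not_running_pods` is made up for illustration.

```shell
# Hypothetical helper: list the pods in a namespace whose STATUS column
# is anything other than "Running". The header row is kept for context.
# $1 is the namespace, e.g. openshift-apiserver.
not_running_pods() {
  oc get pods -n "$1" | awk 'NR == 1 || $3 != "Running"'
}

# Usage (needs a logged-in `oc` session):
#   not_running_pods openshift-apiserver
#   oc logs -n openshift-apiserver <pod-name> --tail=50
```

`<pod-name>` is a placeholder for one of the pods the helper prints. A pod can also report `Running` while some of its containers are not ready, so the `READY` column is worth a look as well.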
| 59 | + |
| 60 | +### Check the authentication certificates of the aggregated API |
| 61 | + |
| 62 | +Make sure the certificates are up to date and still valid. Use: |
| 63 | + |
| 64 | +```console |
| 65 | +$ oc get configmaps -n kube-system extension-apiserver-authentication |
| 66 | +``` |
| 67 | + |
| 68 | +You can save those certificates into a file and use the following command to |
| 69 | +check the end dates: |
| 70 | + |
| 71 | +```console |
| 72 | +$ openssl x509 -noout -enddate -in {myfile_with_certs.crt} |
| 73 | +``` |
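
Note that `openssl x509` reads only the first certificate in a file, while a CA bundle can hold several. The sketch below is one possible way to print the end date of every certificate in a bundle. It assumes the bundle is PEM text on standard input. The function name `cert_end_dates` is hypothetical, and the key `client-ca-file` in the usage comment is an assumption about the config map's contents, so confirm it against the output of the `oc get configmaps` command above.

```shell
# Hypothetical helper: print subject and end date for each certificate
# in a PEM bundle read from stdin. Runs in a subshell so its variables
# do not leak into the calling shell.
cert_end_dates() (
  tmpdir=$(mktemp -d) || exit 1
  # Split the bundle: every BEGIN line starts a new numbered file.
  awk -v dir="$tmpdir" '
    /-----BEGIN CERTIFICATE-----/ { n++ }
    n { print > (dir "/cert-" n ".pem") }
  '
  for cert in "$tmpdir"/cert-*.pem; do
    [ -e "$cert" ] || continue   # nothing was split out
    openssl x509 -noout -subject -enddate -in "$cert"
  done
  rm -rf "$tmpdir"
)

# Usage (key name is an assumption, see the note above):
#   oc get configmaps -n kube-system extension-apiserver-authentication \
#     -o jsonpath='{.data.client-ca-file}' | cert_end_dates
```

Splitting into temporary files keeps the sketch portable across `openssl` versions, at the cost of a few short-lived files.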
| 74 | + |
| 75 | +Those certificates are used by the aggregated APIs to validate requests. If |
| 76 | +they have expired, check [here][cert] how to add a new one. |
| 77 | + |
| 78 | +[cert]: https://docs.openshift.com/container-platform/latest/security/certificates/api-server.html |
| 79 | +[KubeAggregatedAPIErrors]: https://github.com/openshift/cluster-monitoring-operator/blob/1824f9c9a39f54734298dd10e5d20d42c8247995/assets/control-plane/prometheus-rule.yaml#L399-L408 |