
We have created a GKE cluster and we are getting errors from gke-metrics-agent. The errors show up roughly every 30 minutes, and it's always the same 62 errors.

All the errors have label k8s-pod/k8s-app: "gke-metrics-agent".

First error is:

error exporterhelper/queued_retry.go:245 Exporting failed. Try enabling retry_on_failure config option. {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."}

This error is followed by this stack trace, in order:

  • go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
  • /go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/queued_retry.go:245
  • go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
  • /go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/metrics.go:120

There are roughly 40 errors like this. Two errors that stand out are:

  • error exporterhelper/queued_retry.go:175 Exporting failed. Dropping data. Try enabling sending_queue to survive temporary failures. {"kind": "exporter", "name": "googlecloud", "dropped_items": 19}
  • warn batchprocessor/batch_processor.go:184 Sender failed {"kind": "processor", "name": "batch", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."}

I tried to search for those errors on Google but could not find anything. I can't even find any documentation for gke-metrics-agent.

Things I tried:

  • check quotas
  • update GKE to a newer version (current version is 1.21.3-gke.2001)
  • update nodes
  • disable all firewall rules
  • give all permissions to k8s nodes

I can provide more information about our Kubernetes cluster, but I don't know which information may be important for solving this issue.
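In case it helps, this is roughly how I have been pulling the agent logs (standard kubectl commands; the pod name is a placeholder):

    # list the gke-metrics-agent pods and the nodes they run on
    kubectl get pods -n kube-system -l k8s-app=gke-metrics-agent -o wide

    # tail the logs of one pod to see the DeadlineExceeded errors
    kubectl logs -n kube-system <gke-metrics-agent-pod-name> --since=1h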

  • “Deadline exceeded” is a known issue. Starting from Kubernetes 1.16, metrics are sent to Cloud Monitoring via the GKE Metrics agent, which is built on top of OpenTelemetry. Can you provide details about the OpenCensus exporter version you are using, check whether updating the OpenCensus exporter version (which increases the timeout) helps, and let me know whether it works? Commented Oct 17, 2021 at 6:49
  • Thanks for the response. It seems that I don't know how to update the OpenCensus exporter. I found the gke-metrics-agent pod in Kubernetes and tried to change the annotation components.gke.io/component-version: 0.6.0 to 0.13.6. This restarted the pods but the error is still present. I also tried to change monitoring to OpenTelemetry but I don't know how. Is it possible to set this using Terraform? I found only the monitoring_service setting, which is set to monitoring.googleapis.com/kubernetes by default. Commented Oct 17, 2021 at 9:33
  • Can you check this link for the OpenCensus exporter update and for OpenTelemetry operations on Google Cloud? Commented Oct 18, 2021 at 10:58
  • How did it end? I observe the same behaviour with 1.20.10-gke.301. Commented Oct 25, 2021 at 4:48
  • I still have no idea what to do. I checked the link to OpenCensus and I can see that there is a new version, but I still have no idea how to update it. Maybe I should delete the default exporter and create a custom exporter with the new version? Commented Oct 28, 2021 at 16:33

2 Answers


“Deadline exceeded” is a known issue; metrics are sent to Cloud Monitoring via the GKE Metrics agent, which is built on top of OpenTelemetry. Currently there are two workarounds to resolve the issue:

1. Update the timeout.

The new release includes a change that increases the default timeout from 5 to 12 seconds, so you might need to rebuild and redeploy the workload with the new version, which could fix this RPC error.

2. Use a higher GKE version. This issue has a fix in versions 1.18.6-gke.6400+, 1.19.3-gke.600+ and 1.20.0-gke.600+.
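For the second workaround, a rough sketch of the upgrade with gcloud (cluster name, zone and target version are placeholders; pick a version from the list your project offers):

    # check which GKE versions are currently available in this zone
    gcloud container get-server-config --zone us-central1-a

    # check the cluster's current control plane and node versions
    gcloud container clusters describe my-cluster --zone us-central1-a \
        --format="value(currentMasterVersion,currentNodeVersion)"

    # upgrade the control plane first, then the node pool
    gcloud container clusters upgrade my-cluster --zone us-central1-a \
        --master --cluster-version <target-version>
    gcloud container clusters upgrade my-cluster --zone us-central1-a \
        --node-pool default-pool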

  • @Melchy, if you think that the above answer helped you, please consider accepting it (✔️). Commented Nov 16, 2021 at 14:13

If you are still seeing those errors, please have a look at your metrics, mainly the kubernetes.io/container/... metrics for containers running on the same node as the gke-metrics-agent instance logging the errors. Do you see gaps in the metrics that shouldn't be there?
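One quick way to check for gaps outside of the Metrics Explorer UI is to pull the raw time series from the Cloud Monitoring API, for example (PROJECT_ID and the time window are placeholders; you can narrow the filter with resource labels such as pod_name):

    # list recent points of a GKE system metric; gaps in the returned points
    # would line up with intervals where the agent failed to export
    curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
        --data-urlencode 'filter=metric.type="kubernetes.io/container/cpu/core_usage_time"' \
        --data-urlencode 'interval.startTime=2021-11-01T00:00:00Z' \
        --data-urlencode 'interval.endTime=2021-11-01T01:00:00Z'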

The DeadlineExceeded errors can happen once in a while, but should not occur in huge amounts. It may be a networking issue or just occasional blips. Do you have any network policies or firewall rules that may prevent gke-metrics-agent from talking to Cloud Monitoring?
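For example, something along these lines should surface obvious blockers (the test pod name and curl image are arbitrary):

    # any NetworkPolicies that could block egress from kube-system?
    kubectl get networkpolicy --all-namespaces

    # any egress firewall rules in the VPC that might affect TCP 443?
    gcloud compute firewall-rules list --filter="direction=EGRESS"

    # quick connectivity check to the Monitoring endpoint from inside the cluster
    kubectl run conn-test --rm -it --restart=Never --image=curlimages/curl -- \
        curl -sS -o /dev/null -w '%{http_code}\n' https://monitoring.googleapis.com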

Sadly, you can't update the OpenTelemetry collector inside gke-metrics-agent yourself. A newer cluster version can help too, as it updates the agent, so try upgrading your cluster if possible. If the issue impacts your metrics, reach out to support.

  • Hi, thanks for the response. I don't see the errors anymore. After updating the k8s cluster and waiting roughly one week, the errors suddenly disappeared. I have no idea why. Commented Mar 7, 2022 at 7:44
  • Then you might have received a new version of gke-metrics-agent with a fix. Commented Mar 8, 2022 at 13:43
