This repository was archived by the owner on Mar 18, 2025. It is now read-only.

Conversation

codebien
Contributor

Running the output with the race detector enabled (via xk6), a data race is caught:

==================
WARNING: DATA RACE
Read at 0x00c00011fd20 by goroutine 71:
  go.k6.io/k6/metrics.(*CounterSink).Add()
      go.k6.io/k6@v0.38.0/metrics/sink.go:50 +0x5a
  github.com/grafana/xk6-output-prometheus-remote/pkg/remotewrite.(*metricsStorage).update()
      github.com/grafana/xk6-output-prometheus-remote@v0.0.0-00010101000000-000000000000/pkg/remotewrite/metrics.go:43 +0x54f
  github.com/grafana/xk6-output-prometheus-remote/pkg/remotewrite.(*PrometheusMapping).MapCounter()
      github.com/grafana/xk6-output-prometheus-remote@v0.0.0-00010101000000-000000000000/pkg/remotewrite/prometheus.go:16 +0xe4
  github.com/grafana/xk6-output-prometheus-remote/pkg/remotewrite.(*metricsStorage).transform()
      github.com/grafana/xk6-output-prometheus-remote@v0.0.0-00010101000000-000000000000/pkg/remotewrite/metrics.go:55 +0x218
  github.com/grafana/xk6-output-prometheus-remote/pkg/remotewrite.(*Output).convertToTimeSeries()
      github.com/grafana/xk6-output-prometheus-remote@v0.0.0-00010101000000-000000000000/pkg/remotewrite/remotewrite.go:149 +0x4c8
  github.com/grafana/xk6-output-prometheus-remote/pkg/remotewrite.(*Output).flush()
      github.com/grafana/xk6-output-prometheus-remote@v0.0.0-00010101000000-000000000000/pkg/remotewrite/remotewrite.go:112 +0x104
  github.com/grafana/xk6-output-prometheus-remote/pkg/remotewrite.(*Output).flush-fm()
      github.com/grafana/xk6-output-prometheus-remote@v0.0.0-00010101000000-000000000000/pkg/remotewrite/remotewrite.go:84 +0x39
  go.k6.io/k6/output.(*PeriodicFlusher).run()
      go.k6.io/k6@v0.38.0/output/helpers.go:89 +0xe7
  go.k6.io/k6/output.NewPeriodicFlusher·dwrap·5()
      go.k6.io/k6@v0.38.0/output/helpers.go:122 +0x39

Previous write at 0x00c00011fd20 by goroutine 72:
  go.k6.io/k6/metrics.(*CounterSink).Add()
      go.k6.io/k6@v0.38.0/metrics/sink.go:50 +0x77
  go.k6.io/k6/metrics/engine.(*outputIngester).flushMetrics()
      go.k6.io/k6@v0.38.0/metrics/engine/ingester.go:80 +0x3a1
  go.k6.io/k6/metrics/engine.(*outputIngester).flushMetrics-fm()
      go.k6.io/k6@v0.38.0/metrics/engine/ingester.go:52 +0x39
  go.k6.io/k6/output.(*PeriodicFlusher).run()
      go.k6.io/k6@v0.38.0/output/helpers.go:89 +0xe7
  go.k6.io/k6/output.NewPeriodicFlusher·dwrap·5()
      go.k6.io/k6@v0.38.0/output/helpers.go:122 +0x39

The global metrics referenced in the samples are managed by k6's ingester output, which already executes the sink operation on them. Executing the sink concurrently on the same metrics from this output therefore generates a data race.

The fix uses new dedicated metrics in the metricsStorage so the sink is executed locally in the extension.
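For illustration, a minimal sketch of the pattern in Go (Sample, CounterSink, and the field layout here are simplified stand-ins for the types in go.k6.io/k6/metrics and the extension's actual metricsStorage, not the real code):

package remotewrite

import "sync"

// Sample and CounterSink are simplified stand-ins for k6's metric types;
// they only model what this sketch needs.
type Sample struct {
	MetricName string
	Value      float64
}

type CounterSink struct{ Sum float64 }

func (c *CounterSink) Add(s Sample) { c.Sum += s.Value }

// metricsStorage owns its sinks exclusively, so flushing from the output's
// goroutine never touches the global sinks that k6's ingester is updating.
type metricsStorage struct {
	mu    sync.Mutex
	sinks map[string]*CounterSink
}

func (ms *metricsStorage) update(s Sample) *CounterSink {
	ms.mu.Lock()
	defer ms.mu.Unlock()
	sink, ok := ms.sinks[s.MetricName]
	if !ok {
		sink = &CounterSink{}
		ms.sinks[s.MetricName] = sink
	}
	sink.Add(s) // local, dedicated sink: no data race with the ingester
	return sink
}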

@codebien codebien requested review from mstoykov and yorugac May 31, 2022 09:31
@codebien codebien marked this pull request as draft May 31, 2022 09:39
The global metric is controlled by k6's ingester output, so the sink operation on the global metric in the sample is already executed there. Use a local metric in the metricsStorage to execute a local, dedicated sink in the extension. This fixes the data race where both the extension and the k6 ingester call Add and sink the same metrics.
@codebien codebien marked this pull request as ready for review May 31, 2022 10:08
@codebien codebien self-assigned this May 31, 2022
Collaborator

@yorugac yorugac left a comment


LGTM!

Comment on lines 54 to 58
// TODO: this is just avoiding duplicates with the previous
// Implement a better and complete solution
// maybe discard any timestamp < latest?
//
//current.Time = sample.Time // to avoid duplicates in timestamps
Collaborator


Just a note regarding solutions: can k6 guarantee to always send samples with timestamps in strictly increasing order?

current.Time = sample.Time // to avoid duplicates in timestamps

This assignment was mostly meant to avoid encountering an error on write (and subsequently losing data) and the Prometheus bug mentioned below, where such an error is thrown seemingly when it shouldn't be. See also #11 for further investigation.
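As a side note, one hypothetical shape for the "discard any timestamp < latest" idea from the TODO above (shouldKeep and lastSeen are illustrative names, not code from this PR):

// shouldKeep drops samples whose timestamp does not strictly advance for a
// given metric; Prometheus remote write rejects duplicate timestamps for the
// same series. lastSeen maps metric name to the last accepted timestamp (ms).
func shouldKeep(lastSeen map[string]int64, metric string, tsMillis int64) bool {
	if last, ok := lastSeen[metric]; ok && tsMillis <= last {
		return false // duplicate or out-of-order sample: discard
	}
	lastSeen[metric] = tsMillis
	return true
}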

Contributor Author

@codebien codebien Jun 2, 2022


I only just found the time to check this comment. The Prometheus timestamp conversion function that we are using to generate the timestamp,

Timestamp: timestamp.FromTime(sample.Time),

has millisecond precision, which could be too low if we consider our highly concurrent world. If remote write supports higher precision we could just use sample.Time.UnixNano(); if not, then we need to discard duplicates or wait for the new data modelling (hopefully available soon).

WDYT? Could it explain the issue?
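To make the precision concern concrete, a small standalone sketch (fromTime here mirrors the documented behaviour of Prometheus' timestamp.FromTime, i.e. Unix milliseconds; the sample times are made up):

package main

import (
	"fmt"
	"time"
)

// fromTime converts to Unix milliseconds, like Prometheus' timestamp.FromTime.
func fromTime(t time.Time) int64 {
	return t.Unix()*1000 + int64(t.Nanosecond())/int64(time.Millisecond)
}

func main() {
	t1 := time.UnixMilli(1654000000000)  // an arbitrary instant
	t2 := t1.Add(200 * time.Microsecond) // a second sample in the same millisecond
	fmt.Println(fromTime(t1) == fromTime(t2))   // true: identical ms timestamps collide
	fmt.Println(t1.UnixNano() == t2.UnixNano()) // false: ns precision still distinguishes them
}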

Collaborator


Yes, I agree that improving precision could help with this particular problem, but even if it's possible, I'm not sure about the drawbacks it would bring. E.g. do people actually need precision higher than ms? What else would it impact during processing? Etc. Also, I believe the precision question is valid for both the Prometheus and k6 implementations, IIRC.
So IMO, precision is a multi-level problem here.

This change feels like a bit of an experiment to me 🙂 Definitely 👍 for making more experiments and figuring out when exactly duplicate timestamps happen.

Contributor Author


I removed all the commented lines and I'm going to merge this PR so we can fix the data race. We need to find the root cause of the duplicates and fix it in a dedicated PR. I expect to address it by implementing the k6 metrics refactoring.

Contributor

@javaducky javaducky left a comment


LGTM!

The specific hack is not required with the new local model. Regarding the original issue, we need to identify the real problem and apply a complete solution. We expect to fix it with a PR for #11.