Conversation

@volodk85
Contributor

@volodk85 volodk85 commented Jun 7, 2023

What: Expose snapshot/restore throttling time in stats APIs
Change: Added a repository section to the _nodes/stats API response. It shows per-node throttling stats for each registered repository.

Sample:
Setup: 3-node cluster with an FS repository; index 1,000 documents of 8 KB each, take a snapshot, then observe the throttling stats
Request: http://localhost:9200/_nodes/stats/repository?pretty
Response (truncated to one node):

{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "stateless",
  "nodes" : {
    "nGCCGizzRoSaUNeB5VTdtQ" : {
      "timestamp" : 1686171059819,
      "name" : "es01",
      "transport_address" : "172.18.0.2:9300",
      "host" : "172.18.0.2",
      "ip" : "172.18.0.2:9300",
      "roles" : [ "index", "master" ],
      "attributes" : {
        "stateless.memory" : "8232894464",
        "xpack.installed" : "true"
      },
      "repository" : {
        "my_repository" : {
          "total_read_throttled_nanos" : 0,
          "total_write_throttled_nanos" : 178347835
        }
      }
    },

Closes #89385

@volodk85 volodk85 added >feature :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Jun 7, 2023
@volodk85 volodk85 requested review from DaveCTurner and fcofdez June 7, 2023 22:40
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added v8.9.0 Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Jun 7, 2023
@elasticsearchmachine
Collaborator

Hi @volodk85, I've created a changelog YAML for you.

Contributor

@fcofdez fcofdez left a comment

This looks mostly good. We need to handle the mixed-version serialization logic, and I would prefer to have an integration test where a repository is configured with a low rate limit and we see that the added stats reflect the throttling.
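To make the expectation behind such a test concrete, here is a toy model of how a low rate limit turns into accumulated throttle time. It is only a sketch: `ThrottledWriterSketch` and `onWrite` are hypothetical names, the pause is computed rather than slept, and none of this is the Elasticsearch rate limiter or test framework.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a rate-limited repository writer; not Elasticsearch code. */
class ThrottledWriterSketch {
    private final long bytesPerSecond; // configured rate limit (assumed > 0)
    private final AtomicLong totalWriteThrottledNanos = new AtomicLong();

    ThrottledWriterSketch(long bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("rate limit must be positive");
        }
        this.bytesPerSecond = bytesPerSecond;
    }

    /**
     * Computes the pause needed so that writing {@code bytes} over {@code elapsedNanos}
     * stays within the limit, and adds it to the running total instead of sleeping.
     */
    long onWrite(long bytes, long elapsedNanos) {
        long minNanosForBytes = Math.multiplyExact(bytes, 1_000_000_000L) / bytesPerSecond;
        long pauseNanos = Math.max(0L, minNanosForBytes - elapsedNanos);
        totalWriteThrottledNanos.addAndGet(pauseNanos);
        return pauseNanos;
    }

    long totalWriteThrottledNanos() {
        return totalWriteThrottledNanos.get();
    }

    public static void main(String[] args) {
        // 1 MiB/s limit, 1,000 writes of 8 KiB each with no time elapsing between them.
        ThrottledWriterSketch writer = new ThrottledWriterSketch(1024L * 1024L);
        for (int i = 0; i < 1000; i++) {
            writer.onWrite(8L * 1024L, 0L);
        }
        System.out.println("total_write_throttled_nanos = " + writer.totalWriteThrottledNanos());
    }
}
```

An integration test along these lines would configure a deliberately low limit, take a snapshot, and assert only that the reported write-throttled counter is greater than zero, since the exact value depends on timing.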


import static java.util.stream.Collectors.toMap;

public class RepositoriesThrottlingStats implements Writeable, ToXContentFragment {
Contributor

Maybe we can be a bit more general here and name this class RepositoriesStats?

ingestStats = in.readOptionalWriteable(IngestStats::read);
adaptiveSelectionStats = in.readOptionalWriteable(AdaptiveSelectionStats::new);
indexingPressureStats = in.readOptionalWriteable(IndexingPressureStats::new);
repositoriesThrottlingStats = in.readOptionalWriteable(RepositoriesThrottlingStats::new);
Contributor

We should handle the case where this node is communicating with a node on a different version. See the Javadocs in org.elasticsearch.TransportVersion for more information.

Contributor Author

OK
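For readers unfamiliar with the mixed-version concern, the idea can be sketched with plain `java.io` streams. Everything here is an assumption for illustration: `VersionGatedStatsSketch`, `RepoStats`, the integer `wireVersion`, and the constant `VERSION_WITH_REPO_STATS` are stand-ins, not the `StreamInput`/`StreamOutput`/`TransportVersion` API. The point is only that both the writer and the reader skip the new optional field when the peer is older than the version that introduced it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical illustration of version-gated serialization; not the Elasticsearch stream API. */
class VersionGatedStatsSketch {
    // Assumed wire version that first carries repository stats.
    static final int VERSION_WITH_REPO_STATS = 2;

    record RepoStats(long totalReadThrottledNanos, long totalWriteThrottledNanos) {}

    static void writeNodeStats(DataOutputStream out, int wireVersion, String nodeName, RepoStats stats)
            throws IOException {
        out.writeUTF(nodeName);
        // Only send the new field to peers that know how to read it.
        if (wireVersion >= VERSION_WITH_REPO_STATS) {
            out.writeBoolean(stats != null); // presence flag for the optional value
            if (stats != null) {
                out.writeLong(stats.totalReadThrottledNanos());
                out.writeLong(stats.totalWriteThrottledNanos());
            }
        }
    }

    static RepoStats readNodeStats(DataInputStream in, int wireVersion) throws IOException {
        in.readUTF(); // node name, ignored in this sketch
        // Older peers never wrote the field, so do not try to read it.
        if (wireVersion >= VERSION_WITH_REPO_STATS && in.readBoolean()) {
            return new RepoStats(in.readLong(), in.readLong());
        }
        return null;
    }

    /** Serializes and deserializes at the given wire version. */
    static RepoStats roundTrip(int wireVersion, RepoStats stats) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeNodeStats(new DataOutputStream(bytes), wireVersion, "es01", stats);
            return readNodeStats(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())), wireVersion);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RepoStats stats = new RepoStats(0L, 178347835L);
        System.out.println("new peer: " + roundTrip(VERSION_WITH_REPO_STATS, stats));
        System.out.println("old peer: " + roundTrip(VERSION_WITH_REPO_STATS - 1, stats));
    }
}
```

With an older peer the stats simply come back as absent, which is why the consuming code has to tolerate a null value.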

return Collections.unmodifiableMap(repositoryToThrottlingStats);
}

public record RepositoryThrottling(String repositoryName, long totalReadThrottledNanos, long totalWriteThrottledNanos) {}
Contributor

Can't we use this class instead of ThrottlingStats? And, as I stated above, maybe we should name this RepositoryStats?

Contributor Author

Simplified

@volodk85 volodk85 requested a review from fcofdez June 8, 2023 23:33
Contributor

@fcofdez fcofdez left a comment

Thanks for tackling the review comments! This looks mostly good but we still need to handle the mixed-cluster serialization in the NodeStats class and add a new transport version for this change.

ingestStats = in.readOptionalWriteable(IngestStats::read);
adaptiveSelectionStats = in.readOptionalWriteable(AdaptiveSelectionStats::new);
indexingPressureStats = in.readOptionalWriteable(IndexingPressureStats::new);
repositoriesStats = in.readOptionalWriteable(RepositoriesStats::new);
Contributor

We should handle the mixed-node serialization here too. That's the reason why the bwc tests are failing.

private final Map<String, ThrottlingStats> repositoryThrottlingStats;

public RepositoriesStats(StreamInput in) throws IOException {
if (in.getTransportVersion().onOrAfter(TransportVersion.V_8_9_0)) {
Contributor

We should add a new transport version. This changed recently to handle a new scenario where versions are deployed more often in serverless. I can share some documentation about this.

@volodk85 volodk85 requested a review from fcofdez June 9, 2023 22:13
Contributor

@DaveCTurner DaveCTurner left a comment

Looks great. I left a few comments & questions, all superficial.

Comment on lines 100 to 101
builder.field("total_read_throttled_nanos", totalReadThrottledNanos);
builder.field("total_write_throttled_nanos", totalWriteThrottledNanos);
Contributor

Could we also have these fields represented in a human-readable fashion if the ?human flag is specified? IIRC that's a little awkward with nanoseconds: if you use org.elasticsearch.xcontent.XContentBuilder#humanReadableField with a TimeValue then you get the raw output in milliseconds (because of this). Still, we can just check the human flag directly (like we do e.g. here).

Contributor Author

Those values are just long counters (not wrapped in TimeValue) and I think those transformers won't affect them, right?

Contributor

That's correct.

Contributor

I think this is the only remaining point now.

Contributor Author

Following ...

  1. http://localhost:9200/_nodes/rVt20tYyS3yN5LNYHjPoSw/stats/repository?pretty
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "stateless",
  "nodes" : {
    "rVt20tYyS3yN5LNYHjPoSw" : {
      "timestamp" : 1686604771491,
      "name" : "es01",
      "transport_address" : "172.18.0.4:9300",
      "host" : "172.18.0.4",
      "ip" : "172.18.0.4:9300",
      "roles" : [ "index", "master" ],
      "attributes" : {
        "stateless.memory" : "8232894464",
        "xpack.installed" : "true"
      },
      "repositories" : {
        "my_repository" : {
          "total_read_throttled_time_nanos" : 0,
          "total_write_throttled_time_nanos" : 551110710
        }
      }
    }
  }
}
  2. http://localhost:9200/_nodes/rVt20tYyS3yN5LNYHjPoSw/stats/repository?pretty&human
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "stateless",
  "nodes" : {
    "rVt20tYyS3yN5LNYHjPoSw" : {
      "timestamp" : 1686604850299,
      "name" : "es01",
      "transport_address" : "172.18.0.4:9300",
      "host" : "172.18.0.4",
      "ip" : "172.18.0.4:9300",
      "roles" : [ "index", "master" ],
      "attributes" : {
        "stateless.memory" : "8232894464",
        "xpack.installed" : "true"
      },
      "repositories" : {
        "my_repository" : {
          "total_read_throttled_time" : "0s",
          "total_write_throttled_time" : "551.1ms"
        }
      }
    }
  }
}

Note that the field names differ between human and non-human mode. The asciidoc was updated.

Contributor

I think this needs to follow the same pattern/convention as in AdaptiveSelectionStats: we always output the machine-readable field, plus the human-readable field if requested.
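A minimal sketch of that convention, for illustration only: it builds the field map by hand instead of using `XContentBuilder`, and `HumanReadableFieldSketch`, `toFields`, and `formatNanos` are hypothetical names. The formatter is a rough approximation chosen to reproduce the two sample values quoted above, not the `TimeValue` formatting rules. The field names are the ones shown in the sample responses.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of "raw field always, human field on request"; not Elasticsearch code. */
class HumanReadableFieldSketch {

    /** Rough human formatting of a nanosecond duration (an approximation for this sketch). */
    static String formatNanos(long nanos) {
        if (nanos == 0) {
            return "0s";
        }
        if (nanos >= 1_000_000_000L) {
            return String.format(Locale.ROOT, "%.1fs", nanos / 1e9);
        }
        if (nanos >= 1_000_000L) {
            return String.format(Locale.ROOT, "%.1fms", nanos / 1e6);
        }
        if (nanos >= 1_000L) {
            return String.format(Locale.ROOT, "%.1fmicros", nanos / 1e3);
        }
        return nanos + "nanos";
    }

    /** The machine-readable fields are always present; the human ones are added only if requested. */
    static Map<String, Object> toFields(long readNanos, long writeNanos, boolean human) {
        Map<String, Object> fields = new LinkedHashMap<>();
        if (human) {
            fields.put("total_read_throttled_time", formatNanos(readNanos));
        }
        fields.put("total_read_throttled_time_nanos", readNanos);
        if (human) {
            fields.put("total_write_throttled_time", formatNanos(writeNanos));
        }
        fields.put("total_write_throttled_time_nanos", writeNanos);
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(toFields(0L, 551110710L, false));
        System.out.println(toFields(0L, 551110710L, true));
    }
}
```

Keeping the raw `_nanos` fields in both modes means clients that parse the numeric value keep working whether or not `?human` is passed.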

Contributor

@fcofdez fcofdez left a comment

LGTM. Thanks for all the iterations 👍. I agree with David's comments, but once these are tackled this is good to get merged.

Contributor

@DaveCTurner DaveCTurner left a comment

I'd like to give this another review before a final LGTM, so I'm marking it as "request changes" for now.

@volodk85 volodk85 requested review from DaveCTurner and fcofdez June 12, 2023 19:29
Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@volodk85
Contributor Author

Thank you @DaveCTurner @fcofdez for reviewing this

@volodk85 volodk85 merged commit 7abe8cb into elastic:main Jun 13, 2023
@volodk85 volodk85 deleted the issues/89385-throttle-stats branch June 13, 2023 16:04

Labels

:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >feature Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.9.0

5 participants