Add repo throttle metrics to node stats api response #96678
| Pinging @elastic/es-distributed (Team:Distributed) |
| Hi @volodk85, I've created a changelog YAML for you. |
This looks mostly good. We need to handle the mixed-version serialization logic, and I would prefer to have an integration test where a repository is configured with a low rate limit and we verify that the added stats reflect the throttling.
| | ||
| import static java.util.stream.Collectors.toMap; | ||
| | ||
| public class RepositoriesThrottlingStats implements Writeable, ToXContentFragment { |
Maybe we can be a bit more general here and name this class RepositoriesStats?
| ingestStats = in.readOptionalWriteable(IngestStats::read); | ||
| adaptiveSelectionStats = in.readOptionalWriteable(AdaptiveSelectionStats::new); | ||
| indexingPressureStats = in.readOptionalWriteable(IndexingPressureStats::new); | ||
| repositoriesThrottlingStats = in.readOptionalWriteable(RepositoriesThrottlingStats::new); |
We should handle the case where the node is communicating with a node running a different version. See the javadocs in org.elasticsearch.TransportVersion for more information.
OK
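For readers unfamiliar with the request above, here is a minimal, self-contained sketch of version-gated serialization. It does not use the Elasticsearch `StreamInput`/`StreamOutput` or `TransportVersion` classes; the class `VersionGatedStats`, the constant `REPO_STATS_ADDED`, and the plain `int` wire version are all hypothetical stand-ins chosen for illustration. The idea is that the new optional section is written and read only when the peer's version is known to understand it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class VersionGatedStats {
    // Hypothetical wire-version id at which the new stats section was introduced.
    static final int REPO_STATS_ADDED = 2;

    record RepoStats(long readThrottledNanos, long writeThrottledNanos) {}

    record NodeStats(long timestamp, RepoStats repoStats) {}

    // Serialize for a peer speaking peerVersion; older peers never see the new section.
    static byte[] write(NodeStats stats, int peerVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(stats.timestamp());
            if (peerVersion >= REPO_STATS_ADDED) {
                // Presence flag first, mirroring an "optional writeable" layout.
                out.writeBoolean(stats.repoStats() != null);
                if (stats.repoStats() != null) {
                    out.writeLong(stats.repoStats().readThrottledNanos());
                    out.writeLong(stats.repoStats().writeThrottledNanos());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialize bytes produced by a peer speaking peerVersion.
    static NodeStats read(byte[] data, int peerVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            long timestamp = in.readLong();
            RepoStats repoStats = null;
            // Only attempt to read the new section if the sender could have written it.
            if (peerVersion >= REPO_STATS_ADDED && in.readBoolean()) {
                repoStats = new RepoStats(in.readLong(), in.readLong());
            }
            return new NodeStats(timestamp, repoStats);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        NodeStats stats = new NodeStats(1686604771491L, new RepoStats(0L, 551110710L));
        System.out.println(read(write(stats, 2), 2)); // new peer: repo stats survive the round trip
        System.out.println(read(write(stats, 1), 1)); // old peer: repo stats are omitted
    }
}
```

Without the guard on the read side, a node receiving a message from an older node would try to consume bytes that were never written, which is the kind of failure that backwards-compatibility tests are meant to catch.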
| return Collections.unmodifiableMap(repositoryToThrottlingStats); | ||
| } | ||
| | ||
| public record RepositoryThrottling(String repositoryName, long totalReadThrottledNanos, long totalWriteThrottledNanos) {} |
Can't we use this class instead of ThrottlingStats? And, as I stated above, maybe we should name this RepositoryStats?
Simplified
Thanks for tackling the review comments! This looks mostly good but we still need to handle the mixed-cluster serialization in the NodeStats class and add a new transport version for this change.
| ingestStats = in.readOptionalWriteable(IngestStats::read); | ||
| adaptiveSelectionStats = in.readOptionalWriteable(AdaptiveSelectionStats::new); | ||
| indexingPressureStats = in.readOptionalWriteable(IndexingPressureStats::new); | ||
| repositoriesStats = in.readOptionalWriteable(RepositoriesStats::new); |
We should handle the mixed-node serialization here too. That's the reason why the bwc tests are failing.
| private final Map<String, ThrottlingStats> repositoryThrottlingStats; | ||
| | ||
| public RepositoriesStats(StreamInput in) throws IOException { | ||
| if (in.getTransportVersion().onOrAfter(TransportVersion.V_8_9_0)) { |
We should add a new transport version. This changed recently to handle a new scenario where versions are deployed more often in serverless. I can share some documentation about this.
Looks great. I left a few comments & questions, all superficial.
| builder.field("total_read_throttled_nanos", totalReadThrottledNanos); | ||
| builder.field("total_write_throttled_nanos", totalWriteThrottledNanos); |
Could we also have these fields represented in a human-readable fashion if the ?human flag is specified? IIRC that's a little awkward with nanoseconds: if you use org.elasticsearch.xcontent.XContentBuilder#humanReadableField with a TimeValue, you get the raw output in milliseconds (because of this). Still, we can just check the human flag directly (like we do e.g. here).
Those values are just long counters (not wrapped in TimeValue), so I think those transformers won't affect them, right?
That's correct.
I think this is the only remaining point now.
Following ...
{
  "_nodes" : { "total" : 1, "successful" : 1, "failed" : 0 },
  "cluster_name" : "stateless",
  "nodes" : {
    "rVt20tYyS3yN5LNYHjPoSw" : {
      "timestamp" : 1686604771491,
      "name" : "es01",
      "transport_address" : "172.18.0.4:9300",
      "host" : "172.18.0.4",
      "ip" : "172.18.0.4:9300",
      "roles" : [ "index", "master" ],
      "attributes" : { "stateless.memory" : "8232894464", "xpack.installed" : "true" },
      "repositories" : {
        "my_repository" : {
          "total_read_throttled_time_nanos" : 0,
          "total_write_throttled_time_nanos" : 551110710
        }
      }
    }
  }
}

{
  "_nodes" : { "total" : 1, "successful" : 1, "failed" : 0 },
  "cluster_name" : "stateless",
  "nodes" : {
    "rVt20tYyS3yN5LNYHjPoSw" : {
      "timestamp" : 1686604850299,
      "name" : "es01",
      "transport_address" : "172.18.0.4:9300",
      "host" : "172.18.0.4",
      "ip" : "172.18.0.4:9300",
      "roles" : [ "index", "master" ],
      "attributes" : { "stateless.memory" : "8232894464", "xpack.installed" : "true" },
      "repositories" : {
        "my_repository" : {
          "total_read_throttled_time" : "0s",
          "total_write_throttled_time" : "551.1ms"
        }
      }
    }
  }
}

Note that the field names differ between human and non-human mode. The asciidoc was updated.
I think this needs to follow the same pattern/convention as in AdaptiveSelectionStats: we always output the machine-readable field, plus the human-readable field if requested.
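The convention described in this comment can be sketched as follows. This is an illustrative, self-contained example that builds a plain `Map` rather than using `XContentBuilder`; the class `ThrottleStatsRenderer` and the simplified `humanTime` formatter are hypothetical and are not the Elasticsearch `TimeValue` implementation. The field names are taken from the sample responses in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class ThrottleStatsRenderer {
    // Simplified, hypothetical duration formatter used only for this sketch.
    static String humanTime(long nanos) {
        if (nanos == 0) {
            return "0s";
        }
        if (nanos < 1_000L) {
            return nanos + "nanos";
        }
        if (nanos < 1_000_000L) {
            return String.format(Locale.ROOT, "%.1fmicros", nanos / 1e3);
        }
        if (nanos < 1_000_000_000L) {
            return String.format(Locale.ROOT, "%.1fms", nanos / 1e6);
        }
        return String.format(Locale.ROOT, "%.1fs", nanos / 1e9);
    }

    // The machine-readable field is always emitted; the human-readable
    // field is added alongside it only when the human flag is set.
    static Map<String, Object> render(long readNanos, long writeNanos, boolean human) {
        Map<String, Object> fields = new LinkedHashMap<>();
        if (human) {
            fields.put("total_read_throttled_time", humanTime(readNanos));
        }
        fields.put("total_read_throttled_time_nanos", readNanos);
        if (human) {
            fields.put("total_write_throttled_time", humanTime(writeNanos));
        }
        fields.put("total_write_throttled_time_nanos", writeNanos);
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(render(0L, 551110710L, false));
        System.out.println(render(0L, 551110710L, true));
    }
}
```

With this shape, clients that parse the `_nanos` fields keep working regardless of whether the human flag is passed, since the flag only adds fields and never renames or removes them.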
LGTM. Thanks for all the iterations 👍. I agree with David's comments, but once these are tackled this is good to get merged.
I'd like to give this another review before a final LGTM, so I'm marking it as "request changes" for now.
LGTM
| Thank you @DaveCTurner @fcofdez for reviewing this |
What: Expose snapshot/restore throttling time in stats APIs
Change: Added a repository section to the _nodes/stats API response. It shows, per node, the throttling stats for each registered repository.
Sample:
Setup: 3-node cluster with an FS repository; index 1000 documents of 8 KB each, take a snapshot, and observe the throttling stats
Request: http://localhost:9200/_nodes/stats/repository?pretty
Response (truncated to one node):
Closes #89385