Skip to content

Conversation

@nicktindall
Copy link
Contributor

@nicktindall nicktindall commented Nov 26, 2024

We discovered that the "request counts" for blob store operations were inconsistently reported between Azure and S3 implementations. S3 counted the requests (including retries) and Azure counted operations (not including retries).

This PR changes the BlobStore implementations to report separate operations and requests counts, so we can use operations counts for places where we're interested in reporting costs (e.g. the metering endpoint, also changed in this PR) and we can use request counts for reporting behaviour/performance (e.g. benchmarking, that might be interested in an increase in throttled requests).

The stats for each BlobStore operation will represented by an EndpointStats object which includes two values:

  • requests, the request count
  • operations, the operation count

Relates ES-9767
Fixes #104443

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Nov 27, 2024
@nicktindall nicktindall added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Nov 28, 2024
@nicktindall nicktindall changed the title Track requests and operations separately to better serve metering/stats endpoints Track/report requests and operations separately to better serve metering/stats endpoints Nov 28, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

@nicktindall nicktindall requested a review from ywangd November 28, 2024 03:21
@nicktindall nicktindall marked this pull request as ready for review November 28, 2024 03:32
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Nov 28, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

…ng_stats_endpoints # Conflicts: #	server/src/main/java/org/elasticsearch/TransportVersions.java
@ywangd
Copy link
Member

ywangd commented Nov 28, 2024

It's not immediately obvious to me. Does this PR change any user visible behaviour, i.e. GetRepositoriesMetering API? If the impact is only to the internal API, we should not label it as >bug but instead >non-issue.

@nicktindall
Copy link
Contributor Author

nicktindall commented Nov 28, 2024

It's not immediately obvious to me. Does this PR change any user visible behaviour, i.e. GetRepositoriesMetering API? If the impact is only to the internal API, we should not label it as >bug but instead >non-issue.

It does, because now S3 request_count will be operations not requests, but the difference between those counts is miniscule. So I would say effectively there is no visible change.

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great. I only have some minor comments.

Comment on lines 751 to 755
EndpointStats addTo(EndpointStats other) {
long ops = operations.sum() + other.operations();
long reqs = requests.sum() + other.requests();
return new EndpointStats(ops, reqs);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Why do we need this method given we already have EndpointStats#add which also uses Math.addExact for overflow detection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah it was just slightly lighter-weight, so rather than doing

return new EndpointStats(operations.sum(), requests.sum()).add(other);

which would create two EndpointStats just to do the addition (seemed excessive to just add some longs), I figured this would just create a single new one with the new values. It may be premature optimisation. At very least I should be doing addExact here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 9824022

Comment on lines 51 to 58
Map<String, EndpointStats> toMap() {
final Map<String, EndpointStats> results = new HashMap<>();
final long getOperations = getCount.get();
results.put("GetObject", new EndpointStats(getOperations, getOperations));
final long listOperations = listCount.get();
results.put("ListObjects", new EndpointStats(listOperations, listOperations));
final long insertOperations = postCount.get() + putCount.get();
results.put("InsertObject", new EndpointStats(insertOperations, insertOperations));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is temporary and needs to be fixed later to differentiate between requests and operations? If so, might worth to raise a ticket and add a TODO with the ticket number here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, will do. You're right it's just to meet the requirements of the interface, at the moment GCS doesn't track those numbers separately so we fudge it by returning the single value we do have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 8143b65

* @param operations The number of calls
* @param requests The number of calls (including retries)
*/
public record EndpointStats(long operations, long requests) implements Writeable, ToXContentObject {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about ActionStats? We have similar classes such as TransportActionStats. Maybe this could be BlobStoreActionStats if we don't mind verbosity. Endpoint feels a bit too generic and it does not align with the fact that we categarize based on purpose as well (in addition to operation/url).

}

public EndpointStats {
assert operations >= 0 && requests >= 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can also assert requests >= operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 50a9551

Comment on lines 41 to 42
out.writeLong(operations);
out.writeLong(requests);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these are always non-negative, we can use writeVLong to save a few bytes

Suggested change
out.writeLong(operations);
out.writeLong(requests);
out.writeVLong(operations);
out.writeVLong(requests);

Similar change will be needed on the read side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 0234b14

Comment on lines 53 to 64
@Override
public boolean equals(Object o) {
if (o instanceof EndpointStats other) {
return requests == other.requests && operations == other.operations;
}
return false;
}

@Override
public int hashCode() {
return Objects.hash(operations, requests);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these given it is a record?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good pickup, no.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 0143979

public static final RepositoryStats EMPTY_STATS = new RepositoryStats(Collections.emptyMap());

public final Map<String, Long> requestCounts;
public final Map<String, EndpointStats> requestCounts;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this field accordingly? It gets a bit confusing reading through other parts of the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 30968f5

Comment on lines +58 to +63
assertEquals(statsMapString, 1L, stats.get(statsKey(purpose, AzureBlobStore.Operation.PUT_BLOB)).operations());
assertEquals(statsMapString, 1L, stats.get(statsKey(purpose, AzureBlobStore.Operation.LIST_BLOBS)).operations());
assertEquals(statsMapString, 1L, stats.get(statsKey(purpose, AzureBlobStore.Operation.GET_BLOB_PROPERTIES)).operations());
assertEquals(statsMapString, 1L, stats.get(statsKey(purpose, AzureBlobStore.Operation.PUT_BLOCK)).operations());
assertEquals(statsMapString, 1L, stats.get(statsKey(purpose, AzureBlobStore.Operation.PUT_BLOCK_LIST)).operations());
assertEquals(statsMapString, 1L, stats.get(statsKey(purpose, AzureBlobStore.Operation.BLOB_BATCH)).operations());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a test for .requests() as well to maybe show how they behave when there are retries? Or is it already covered somewhere else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind-of implicitly tested elsewhere because we use requests() to compare against the counts maintained by org.elasticsearch.repositories.azure.AzureBlobStoreRepositoryTests.AzureHTTPStatsCollectorHandler, but I added an explicit test in 90e7061

@ywangd
Copy link
Member

ywangd commented Dec 2, 2024

It does, because now S3 request_count will be operations not requests, but the difference between those counts is miniscule. So I would say effectively there is no visible change.

Thanks for explanation. Since there are technically visible changes but trivial enough, maybe label it as >enhancement is a better choice and we can say the request counting accuracy is improved in the release notes.

@elasticsearchmachine
Copy link
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

private void assertRequestCountersAccountedForReads(RepositoryStatsSnapshot statsSnapshot) {
RepositoryStats repositoryStats = statsSnapshot.getRepositoryStats();
Map<String, Long> requestCounts = repositoryStats.requestCounts;
Map<String, BlobStoreActionStats> requestCounts = repositoryStats.actionStats;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we also rename the requestCounts here and other places in this test class to something like actionStats?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in ca36b82

@nicktindall nicktindall merged commit 7d43d8a into elastic:main Dec 3, 2024
16 checks passed
@nicktindall nicktindall deleted the ES-9767_update_metering_stats_endpoints branch September 10, 2025 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement serverless-linked Added by automation, don't add manually Team:Distributed Coordination Meta label for Distributed Coordination team v9.0.0

3 participants