Conversation

ywangd
Member

@ywangd ywangd commented May 2, 2025

With project secrets in place, we can now create and manage per-project repository clients in addition to the cluster-level repository clients. This PR adds a manager class for that. Note that the logic is not yet fully wired because it needs the per-project repository/object_store, which will be added in separate PRs (as part of MP snapshots and the MP objectStoreService). As such, the tests are currently unit tests.

Relates: #126584
Resolves: ES-11713

@ywangd ywangd added the >non-issue and :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) labels May 5, 2025
@ywangd ywangd marked this pull request as ready for review May 5, 2025 01:40
@ywangd ywangd requested a review from a team as a code owner May 5, 2025 01:40
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (Meta label for Distributed Coordination team) label May 5, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Member Author

ywangd commented May 5, 2025

@elasticmachine update branch

@ywangd ywangd requested a review from a team as a code owner May 9, 2025 03:17
@ywangd ywangd force-pushed the per-project-s3-clients branch from aef9dde to fdc9c1d May 9, 2025 03:18
@ywangd ywangd removed request for a team May 9, 2025 03:18
}

@Override
public void applyClusterState(ClusterChangedEvent event) {
Member Author

Change this to be an applier instead of a listener so that the repository (also created in an applier) is fully functional in any listeners of the same cluster state. This is necessary so that the repository can be used to bootstrap a new project.
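
For context, a minimal sketch of the ordering guarantee this relies on (the handler bodies are illustrative, not this PR's actual code): all appliers for a given cluster state complete before any listeners are notified of that state.

import org.elasticsearch.cluster.service.ClusterService;

class ApplierVsListenerSketch {
    static void register(ClusterService clusterService) {
        // appliers run first for each cluster state update...
        clusterService.addStateApplier(event -> {
            // ...so per-project repository clients created here...
        });
        // ...are fully functional by the time listeners observe the same state
        clusterService.addListener(event -> {
            // e.g. bootstrap a new project against the now-usable repository
        });
    }
}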

@ywangd ywangd requested a review from pxsalehi May 9, 2025 03:20
@ywangd
Member Author

ywangd commented May 20, 2025

@DaveCTurner Could you please review this PR when convenient? It's based on the discussion we had on #126584, with a reduced scope covering just S3 first. Thanks!

Contributor

@DaveCTurner DaveCTurner left a comment


I would rather we treated a single-project cluster as a special case of a multi-project cluster that only has one project, rather than introducing a whole different way to manage clients for this case. I could perhaps be convinced otherwise, but today this all just looks like it duplicates everything (client lifecycles etc) and any differences in behaviour are lost.

Comment on lines 130 to 131
@Nullable // if multi-project is disabled
private final S3PerProjectClientManager s3PerProjectClientManager;
Contributor

Do we really need this to be @Nullable? I'd rather we didn't have to think about the null case, nor to have such different code-paths. Does something break if we use the per-project manager even in the single-project case?

Member Author

@ywangd ywangd May 20, 2025

There are a few reasons I chose to manage them separately:

  1. Existing clients are initialized via the ReloadablePlugin interface, while project clients are initialized in a cluster state applier. I didn't want to mix them, to avoid contention.
  2. Existing clients allow overriding client settings with repository settings, which makes the settings management more complex. We agreed not to support such overrides for project clients. This keeps the new code simpler when separated, and retrieving a client can be done simply by name.
  3. We have a cluster-level blobstore in the MP setup. I feel it is better for its client to be managed on its own by the existing code.

That said, I can give it a go and see whether we can always use the client manager for all clients. On the above points, I think:

  1. I used to have the synchronized keyword on the applier method. This is no longer the case, so the contention should be acceptable with just CHM operations. Also, the applier method and the ReloadablePlugin interface should never operate on the same entry.
  2. We can probably manage this complexity in the manager and the ClientsHolder, while still not supporting overrides for project clients.
  3. We can store the cluster-level clients against the default project ID. Functionally, clients for the default project store and clients for the cluster are identical (see the sketch below).

Does it make sense to you?
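
A rough sketch of what point 3 above could look like, with a single manager owning every client; all names here are hypothetical, not this PR's actual code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: clients keyed by project id, with the cluster-level
// clients stored under the default project id so both share one lifecycle.
class S3ClientsManagerSketch {
    static final String DEFAULT_PROJECT_ID = "default";

    // project id -> (client name -> client); mutated by the cluster state
    // applier, read from anywhere, hence the ConcurrentHashMap.
    private final Map<String, Map<String, AutoCloseable>> clientsByProject = new ConcurrentHashMap<>();

    AutoCloseable client(String projectId, String clientName) {
        final String key = projectId != null ? projectId : DEFAULT_PROJECT_ID;
        final Map<String, AutoCloseable> projectClients = clientsByProject.get(key);
        if (projectClients == null || projectClients.containsKey(clientName) == false) {
            throw new IllegalArgumentException("unknown s3 client [" + clientName + "] for project [" + key + "]");
        }
        return projectClients.get(clientName);
    }
}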

Contributor

Yeah, I see that there are some differences; I'm just surprised it needs such divergent code paths. I'd rather we had an S3ClientManager that looked after all the clients, both project-level and cluster-level. In particular, I agree that forbidding repository-level overrides makes things a little simpler to reason about, but that shouldn't come at the cost of this much extra code complexity/duplication.

Member Author

The PR has been updated so that the S3ClientsManager takes care of all clients.

threadPool,
DiscoveryNodeUtils.create("node", "node"),
providedSettings,
new ClusterSettings(Settings.EMPTY, ClusterSettings.BUILT_IN_CLUSTER_SETTINGS)
Contributor

Seems a bit odd to use Settings.EMPTY here rather than providedSettings although this is what happens in other createClusterService overrides.

Member Author

Yeah, these utility methods are a bit odd. Even if we used providedSettings here, it would still not be accurate, since the final settings have more added in the delegate method. That's why I chose Settings.EMPTY: it makes it obvious that the settings are empty if this ever triggers a test failure.

ResourceWatcherService resourceWatcherService,
Supplier<Region> defaultRegionSupplier
) {
final Settings settings = clusterService.getSettings();
Contributor

nit: I liked the name nodeSettings in this context; it can otherwise be confusing whether we're talking about node-level settings or repository settings from the RepositoryMetadata.


}

private void closeClientsAsync(List<ClientsHolder> clientsHoldersToClose, ActionListener<Void> listener) {
executor.execute(() -> {
Contributor

Nit: please use an AbstractRunnable (here ActionRunnable.run) rather than executing a bare Runnable, so that exceptions (e.g. a rejection) still complete the listener.
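
The suggestion amounts to roughly the following, reusing the surrounding method's executor, listener, and clientsHoldersToClose (the loop body is illustrative): ActionRunnable.run completes the listener exceptionally on a rejection or a thrown exception instead of leaving it forever pending.

executor.execute(ActionRunnable.run(listener, () -> {
    for (ClientsHolder holder : clientsHoldersToClose) {
        holder.close(); // illustrative; the real close logic lives in ClientsHolder
    }
}));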

Member Author

I should have learnt this by now. Thanks for spotting it. See 0f1bdd5

final CountDownLatch latch = new CountDownLatch(1);
currentClientsCloseListener.addListener(ActionListener.running(latch::countDown));
try {
if (latch.await(1, TimeUnit.MINUTES) == false) {
Contributor

Are you sure we need to wait here? With S3Service today we just decref the clients and move on, allowing any ongoing uses to finish and shut down separately. Is that not ok here?

If so, why 1 minute? This needs to be configurable in case we need to change it. And if it's ok to proceed after a minute why wait at all?

Member Author

The wait here is to ensure S3Service does not close until all asynchronously closed clients (from the applier threads) have finished closing, per your comment. But maybe I misunderstood what you were suggesting?

The 1 minute is meant as a best effort to close things cleanly, similar to the 1-minute wait for IndicesService to close in Stateless.java. It's unlikely to time out. If a close call somehow does stall, it's not fatal and should not prevent progress (decRef can still trigger closeInternal and close both the S3Client and SdkHttpClient inline, which might take a bit longer?). I can make it configurable if necessary.

Contributor

Hm, I now wonder what I meant there. Though at the time I wrote that, these things had a somewhat weird lifecycle, fixed in the upgrade to SDKv2, so maybe that's what I was worried about. As things stand now it should be safe to just decref each client and move on.

Member Author

it should be safe to just decref each client and move on

The async closing is to avoid having synchronized method calls on the applier thread; I think this point still stands. Do you mean we no longer need to track the close listeners, on the assumption that close calls complete fast enough that there is no need to wait for them to finish when the node shuts down?

Contributor

Yes, +1 to async closing still; I just don't see a need to block the shutdown here. We're not assuming that close calls happen fast enough; we're relying on the refcounting and other lifecycle management to do the right thing even for clients that other subsystems happen to be using while we're shutting down.
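
The agreed shape, then, is roughly the following sketch; a ClientsHolder built on AbstractRefCounted is an assumption here, not necessarily this PR's exact structure. Shutdown simply releases its reference and moves on, and whichever party releases the last reference performs the actual close.

// hypothetical holder: in-flight users keep the clients alive via incRef/decRef
class ClientsHolder extends org.elasticsearch.core.AbstractRefCounted {
    @Override
    protected void closeInternal() {
        // close the S3Client and SdkHttpClient here; runs exactly once, on
        // whichever thread happens to release the last reference
    }
}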

@ywangd
Member Author

ywangd commented May 29, 2025

@elasticmachine update branch

@ywangd
Member Author

ywangd commented May 30, 2025

@elasticmachine update branch

Contributor

@DaveCTurner DaveCTurner left a comment


LGTM

@ywangd ywangd added the auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) label May 30, 2025
@ywangd
Copy link
Member Author

ywangd commented May 30, 2025

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit 6311bb3 into elastic:main May 30, 2025
18 checks passed
@ywangd ywangd deleted the per-project-s3-clients branch May 30, 2025 12:54
joshua-adams-1 pushed a commit to joshua-adams-1/elasticsearch that referenced this pull request Jun 3, 2025
With project secrets in place, we can now create and manage per-project repository clients in addition to the cluster-level repository clients. This PR adds a manager class for that. Note that the logic is not yet fully wired because it needs the per-project repository/object_store, which will be added in separate PRs (as part of MP snapshots and the MP objectStoreService). As such, the tests are currently unit tests. Relates: elastic#126584 Resolves: ES-11713
Samiul-TheSoccerFan pushed a commit to Samiul-TheSoccerFan/elasticsearch that referenced this pull request Jun 5, 2025
With project secrets in place, we can now create and manage per-project repository clients in addition to the cluster-level repository clients. This PR adds a manager class for that. Note that the logic is not yet fully wired because it needs the per-project repository/object_store, which will be added in separate PRs (as part of MP snapshots and the MP objectStoreService). As such, the tests are currently unit tests. Relates: elastic#126584 Resolves: ES-11713
elasticsearchmachine pushed a commit that referenced this pull request Jun 19, 2025
Enables creating different s3 clients for different projects Relates: #127631
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jun 19, 2025
The stateful side tests generally do not exercise multi-project. Therefore we need to use the default project-id instead of random one. Relates: elastic#127631
elasticsearchmachine pushed a commit that referenced this pull request Jun 19, 2025
The stateful side tests generally do not exercise multi-project. Therefore we need to use the default project-id instead of random one. Relates: #127631
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Jun 23, 2025
Enables creating different s3 clients for different projects Relates: elastic#127631
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Jun 23, 2025
The stateful side tests generally do not exercise multi-project. Therefore we need to use the default project-id instead of random one. Relates: elastic#127631
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 25, 2025
Enables creating different s3 clients for different projects Relates: elastic#127631
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 25, 2025
The stateful side tests generally do not exercise multi-project. Therefore we need to use the default project-id instead of random one. Relates: elastic#127631
elasticsearchmachine pushed a commit that referenced this pull request Aug 4, 2025
Similar to #127631, this PR adds per project client support for GCS repository.
elasticsearchmachine pushed a commit that referenced this pull request Aug 13, 2025
This PR adds per project client support for Azure repository similar to #131899 (GCS) and #127631 (s3). Resolves: ES-11714

Labels

auto-merge-without-approval: Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!)
:Distributed Coordination/Snapshot/Restore: Anything directly related to the `_snapshot/*` APIs
>non-issue
Team:Distributed Coordination: Meta label for Distributed Coordination team
v9.1.0

4 participants