Conversation

nielsbauman
Contributor

Instead of processing cluster state updates on the cluster state applier thread, we fork to a different thread so that the time ILM spends processing a cluster state update does not affect the speed at which the cluster can apply new cluster states. That does not mean we don't need to optimize ILM's cluster state processing, as the overall amount of processing is generally unaffected by this fork approach (unless we skip some cluster states), but it does mean we're saving a significant amount of processing on the critical cluster state applier thread.

Additionally, by running ILM's state processing asynchronously, we allow ILM to skip some cluster states if the management thread pool is saturated or ILM's processing is taking too long.
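To make the approach concrete, here's a minimal sketch of the fork-and-skip pattern (the class and helper names are illustrative rather than the exact ones in this PR; lastSeenState matches the field discussed in the review below):

import java.util.concurrent.atomic.AtomicReference;

import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateListener;
import org.elasticsearch.threadpool.ThreadPool;

// Illustrative listener: the applier thread only records the newest state and, if no
// processing run is pending, forks the actual ILM work to the management thread pool.
class ForkingIlmListener implements ClusterStateListener {

    private final ThreadPool threadPool;
    private final AtomicReference<ClusterState> lastSeenState = new AtomicReference<>();

    ForkingIlmListener(ThreadPool threadPool) {
        this.threadPool = threadPool;
    }

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        // Cheap on the applier thread: just remember the latest state.
        ClusterState previous = lastSeenState.getAndSet(event.state());
        if (previous == null) {
            // No processing run in flight, so schedule one. If a run is already pending,
            // it will pick up the newer state, and intermediate states are skipped.
            processClusterState();
        }
    }

    private void processClusterState() {
        threadPool.executor(ThreadPool.Names.MANAGEMENT).execute(() -> {
            ClusterState currentState = lastSeenState.get();
            if (currentState == null) {
                return;
            }
            triggerPolicies(currentState, true);
            // Clear the marker only if no newer state arrived while we were processing;
            // otherwise kick off another run that will pick up the newer state.
            if (lastSeenState.compareAndSet(currentState, null) == false) {
                processClusterState();
            }
        });
    }

    private void triggerPolicies(ClusterState state, boolean fromClusterStateChange) {
        // ILM's actual per-state processing would run here.
    }
}

Failure and rejection handling is deliberately omitted from this sketch; that gap is exactly what the review comments below address.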

Here's a flamegraph taken while the many-shards benchmark was at 40% of the initial-indices phase (i.e. when it's adding the 50,000 indices as part of the startup), running on commit e6eb8ef (i.e. the last commit on main before multi-project was merged), with a profiler duration of 10 minutes.
[flamegraph: baseline run on commit e6eb8ef]

And here's one with the same profiling setup, but with these changes -- I ran this benchmark with my changes rebased on the aforementioned commit instead of the current main, as I didn't want the multi-project changes to influence my results.
[flamegraph: run with these changes]

The roughly 5% of the total processing time that ILM takes up is moved off the cluster applier thread and onto a different one.

For completeness, that processing does indeed still show up in the second flamegraph here:
[flamegraph: the same ILM processing visible on another thread]

@nielsbauman nielsbauman added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team v9.1.0 labels Feb 28, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @nielsbauman, I've created a changelog YAML for you.

@dakrone dakrone changed the title Process ILM cluster state updates on anoter thread Process ILM cluster state updates on another thread Mar 5, 2025
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
@nielsbauman
Contributor Author

@elasticmachine update branch

if (currentState == null) {
    return;
}
triggerPolicies(currentState, true);
Member

What happens if this throws? I'm worried that it could leave lastSeenState set while processing is not actually happening, meaning that we'll never trigger another thread to do future processing. Maybe we should do something with either try-catch or try-finally here?

Member

Also, are we okay with the fact that exceptions won't get propagated back to the ClusterApplierService? It looks like that would just log and discard the exception anyway, so I guess it's okay. But maybe we should also explicitly do the equivalent log and discard behaviour here, rather than relying on whatever the executor is going to do with unhandled exceptions?

Contributor

++ Let's make this an AbstractRunnable and also make it force execute. Otherwise if we run into a rejection we'll never do the work for a cluster state update and implicitly rely on the fact that there will be another update to get us out of the broken state.

Member

TIL about AbstractRunnable, thanks @original-brownbear!

Contributor Author

Good catch, @PeteGillinElastic! And thanks both! I changed the runnable into an AbstractRunnable to make it force execute and to take care of failures.
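A force-executing AbstractRunnable along those lines could look roughly like this (a sketch reusing the illustrative lastSeenState and triggerPolicies names from the description above; the logger field is assumed, and this is not necessarily the exact code that was merged):

threadPool.executor(ThreadPool.Names.MANAGEMENT).execute(new AbstractRunnable() {
    @Override
    protected void doRun() {
        ClusterState currentState = lastSeenState.get();
        if (currentState == null) {
            return;
        }
        triggerPolicies(currentState, true);
    }

    @Override
    public void onFailure(Exception e) {
        // Log and discard, mirroring what the ClusterApplierService would have done,
        // so an exception is at least visible instead of being swallowed by the executor.
        logger.warn("unexpected failure while processing cluster state for ILM", e);
    }

    @Override
    public boolean isForceExecution() {
        // Force execution so a saturated management pool cannot reject the task, which
        // would otherwise leave lastSeenState set with no processing run in flight.
        return true;
    }
});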

}
// If the last seen cluster state changed while this thread was running, it means a new cluster state came in and we need to
// process it. We do that by kicking off a new thread, which will pick up the new cluster state when the thread gets executed.
processClusterState();
Member

The use of recursion here is making me slightly nervous. I was initially worried about stack overflow scenarios, but I think we're safe here because the recursive call will just schedule the work on another thread and return immediately, right? (Assuming we never call this with a direct executor, or at least never outside tests!) Then I started worrying about things like deadlock, but again it's probably safe since nothing is blocking, the execute call should return immediately (unless we've managed to fill the thread pool's work queue, in which case it'll throw some rejected task exception, and in that case the system is in bad trouble anyway). So maybe this is all okay?

An alternative would be to have a loop here instead of recursing. That would mean that we'd start the next update on the same thread immediately, rather than putting it to the back of the pool's work queue. I don't know whether that matters, it depends what other work the pool is used for. Given that you've already done a bunch of benchmarking with the current approach, maybe it's better to stick with that.

Sorry, I realize that I'm kind of thinking aloud in this comment! I figured it's the kind of thing that's worth working through carefully...

Contributor Author

Yeah I understand what you're saying. I intentionally decided not to use a loop, as I wanted to avoid hoarding a thread in the management pool for potentially long times. I preferred requesting a new thread at the end of the queue (as we're ok with skipping some cluster state updates in ILM). I tried to avoid deadlocks etc. by doing things in the right order, but I get that the recursion can seem confusing and that it's more prone to bugs, so I'm open to other suggestions.

Member

as we're ok with skipping some cluster state updates in ILM

Just for the record, I think it would be possible to do this skipping with a loop. I wasn't thinking of putting the updates in a queue, I was thinking of having a do-while or something which gets the current state at the end and loops if it doesn't match the state it just processed.

However, I think that this:

I intentionally decided not to use a loop, as I wanted to avoid hoarding a thread in the management pool for potentially long times

is a reasonable point. And I think I've convinced myself that your recursion approach is safe. So let's go with that.
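
For the record, the loop-based alternative discussed here might look roughly like the following sketch (again using the hypothetical lastSeenState and triggerPolicies names); it is not what was merged:

// Illustrative do-while alternative: keep processing on the same management thread
// until no newer cluster state has arrived, instead of re-scheduling a new task.
private void processClusterStateInLoop() {
    ClusterState currentState;
    do {
        currentState = lastSeenState.get();
        if (currentState == null) {
            return;
        }
        triggerPolicies(currentState, true);
        // Loop again only if a newer state arrived while we were processing; clearing
        // the reference on success lets the listener schedule future runs normally.
    } while (lastSeenState.compareAndSet(currentState, null) == false);
}

The trade-off, as noted above, is that this holds on to a management-pool thread for as long as new states keep arriving, which is why the recursive re-scheduling approach was kept.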

Contributor

@original-brownbear original-brownbear left a comment

Some small comments :) This looks alright to me (assuming we fix the possible rejection of the management pool task that is :)) but someone from the actual area should ✅ this as well obviously. Thanks!

Member

@PeteGillinElastic PeteGillinElastic left a comment

LGTM with the new AbstractRunnable stuff, I have just one small optional suggestion. Thanks for the cool PR!

@nielsbauman nielsbauman merged commit e68587a into elastic:main Mar 24, 2025
17 checks passed
@nielsbauman nielsbauman deleted the ilm-state-optimization branch March 24, 2025 16:13
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
