Refactor enrich maintenance coordination logic #90931

jbaiera · 2022-10-14T21:13:45Z

Enrich maintenance runs on a schedule on the master node. It cleans up unused enrich indices, but only while no policy executions are in progress. This causes problems in environments that execute enrich policies frequently on a tight schedule, because the maintenance service will refuse to clean up indices. Additionally, the maintenance task relies on some dubious concurrency code to detect concurrent policy executions. If this detection logic reads incorrect state, the enrich maintenance service can remove enrich indices for policies that are currently being executed.

This refactoring changes a few core aspects of the enrich policy locking code:

First, the policy maintenance task collects indices to delete from the cluster state instead of using the async get indices action. The cluster state is available to the maintenance service and no features from the get action are required for this step.

Second, the internal policy execution logic has been refactored to generate the new enrich index's name earlier in the process so that it can keep track of enrich indices that are actively being constructed. The maintenance task uses this set of indices to exclude them from cleanup.

Third, since the enrich indices to keep are now explicitly marked for the maintenance service, the global locking and clumsy state checking can be removed from the policy locks object. The locking code now relies on concurrency primitives the way they are meant to be used.

Finally, a number of changes have been made to improve the concurrency hygiene of the policy coordination. A process can only unlock a policy if it holds the policy's active semaphore. Previously, any thread could release a policy any number of times, even if it did not currently hold the right to do so. Some critical sections have been updated to make sure that policy locks are always released in case of an uncaught exception.
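The ownership rule in the last point can be illustrated with a small standalone sketch. This is not the code from this PR: `PolicyLocksSketch`, `HeldLock`, `lockPolicy`, and `runPolicy` are hypothetical names, and plain `java.util.concurrent` types stand in for the Elasticsearch classes. It assumes one single-permit semaphore per policy, and shows that only the handle returned by a successful lock can release that semaphore, at most once, and that the release still happens when the work throws.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names and structure are assumptions, not the PR's classes.
final class PolicyLocksSketch {
    // One single-permit semaphore per policy name.
    private final ConcurrentHashMap<String, Semaphore> policyLocks = new ConcurrentHashMap<>();

    // Handle given to the caller that won the lock. Closing it is the only way to unlock.
    final class HeldLock implements AutoCloseable {
        private final Semaphore held;
        private final AtomicBoolean released = new AtomicBoolean(false);

        private HeldLock(Semaphore held) {
            this.held = held;
        }

        @Override
        public void close() {
            // Release at most once, so a repeated close() cannot add extra permits.
            if (released.compareAndSet(false, true)) {
                held.release();
            }
        }
    }

    // Fails fast instead of blocking when the policy is already being executed.
    HeldLock lockPolicy(String policyName) {
        Semaphore semaphore = policyLocks.computeIfAbsent(policyName, name -> new Semaphore(1));
        if (semaphore.tryAcquire() == false) {
            throw new IllegalStateException("policy [" + policyName + "] is already executing");
        }
        return new HeldLock(semaphore);
    }

    // try-with-resources releases the lock even if the work throws.
    void runPolicy(String policyName, Runnable work) {
        try (HeldLock ignored = lockPolicy(policyName)) {
            work.run();
        }
    }
}
```

Because code that never acquired the lock has no `HeldLock` to close, it has no way to release a policy it does not hold.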

elasticsearchmachine · 2022-10-14T21:14:09Z

Hi @jbaiera, I've created a changelog YAML for you.

elasticsearchmachine · 2022-10-14T21:14:15Z

Pinging @elastic/es-data-management (Team:Data Management)

jbaiera · 2022-10-14T21:15:00Z

Relates to, and hopefully fixes, #89835 and #85221

dakrone

I left some comments, but nothing major. Thanks for taking a look at this, Jimmy!

dakrone · 2022-11-10T23:08:22Z

x-pack/plugin/enrich/src/main/java/org/elasticsearch/xpack/enrich/EnrichPolicyExecutor.java

 }

- private void releasePolicy(String policyName) {
+ private void releasePolicy(EnrichPolicyLock policyLock) {


What's the difference between releasePolicy(policyLock) and policyLock.release()? Is there a way we can differentiate the two a bit better? It looks like the main difference is just the execution permits?

I could update the name to make it a bit more clear, but yes, the cleanup logic here needs us to release a permit as well as unlock the policy.

I changed the policy lock object to implement Releasable, and now I simply have the method wrap the Releasable so that it also releases the execution permits. Everything just calls close() now instead of needing to remember to run releasePolicy(policyLock).
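A rough sketch of that wrapping, for readers following the thread. It is an illustration under assumptions, not the executor code from this PR: `ReleasableSketch` is a local stand-in for Elasticsearch's `Releasable` interface, and `ExecutorSketch`, `acquirePermit`, and `availablePermits` are hypothetical names. Releasing the policy lock when no permit is available is also an assumption about the desired behavior.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// Local stand-in for Elasticsearch's Releasable: close() declares no checked exception.
interface ReleasableSketch extends AutoCloseable {
    @Override
    void close();
}

// Hypothetical executor that limits how many policies run at once.
final class ExecutorSketch {
    private final Semaphore executionPermits;

    ExecutorSketch(int maxConcurrentExecutions) {
        this.executionPermits = new Semaphore(maxConcurrentExecutions);
    }

    // Takes an execution permit and returns one handle that gives back
    // both the policy lock and the permit when closed.
    ReleasableSketch acquirePermit(ReleasableSketch policyLock) {
        if (executionPermits.tryAcquire() == false) {
            // Assumption: do not leak the policy lock when the executor is saturated.
            policyLock.close();
            throw new IllegalStateException("too many concurrent policy executions");
        }
        AtomicBoolean closed = new AtomicBoolean(false);
        return () -> {
            if (closed.compareAndSet(false, true)) {
                try {
                    policyLock.close();
                } finally {
                    // The permit is returned even if unlocking the policy throws.
                    executionPermits.release();
                }
            }
        };
    }

    int availablePermits() {
        return executionPermits.availablePermits();
    }
}
```

With this shape, callers only ever see a single closeable handle, so there is no second cleanup method to forget.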

...ugin/enrich/src/main/java/org/elasticsearch/xpack/enrich/EnrichPolicyMaintenanceService.java

x-pack/plugin/enrich/src/main/java/org/elasticsearch/xpack/enrich/EnrichPolicyRunner.java

dakrone · 2022-11-10T23:18:52Z

x-pack/plugin/enrich/src/main/java/org/elasticsearch/xpack/enrich/EnrichPolicyLocks.java

+ * cleaned up by the maintenance task.
 */
- private final AtomicLong policyRunCounter = new AtomicLong(0L);
+ private final ConcurrentHashMap<String, Semaphore> workingIndices = new ConcurrentHashMap<>();


Do we ever actually call workingIndices.get(...) anywhere? I can't find anywhere, which makes me think maybe this could just be a Set instead of a Map?

I'm using a map here mostly to ensure we do not remove the working index unless we are the holder of the semaphore it is mapped to. When calling map.remove() in the unlock method we pass the actively held semaphore in. This makes it so that the entry is only removed if it is mapped to the same semaphore.
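The conditional removal described here relies on the two-argument `ConcurrentHashMap.remove(key, value)`, which removes the entry only if the key is currently mapped to the given value. A minimal sketch of that behavior follows; the class and method names are hypothetical and only the `workingIndices` field mirrors the diff above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the map-based bookkeeping; not the PR's EnrichPolicyLocks class.
final class WorkingIndicesSketch {
    private final ConcurrentHashMap<String, Semaphore> workingIndices = new ConcurrentHashMap<>();

    // Record that the holder of this semaphore is building the given enrich index.
    void markWorking(String enrichIndexName, Semaphore heldSemaphore) {
        workingIndices.put(enrichIndexName, heldSemaphore);
    }

    // Removes the entry only when it is mapped to the semaphore the caller holds.
    // Semaphore does not override equals(), so the comparison is by identity.
    boolean unmarkWorking(String enrichIndexName, Semaphore heldSemaphore) {
        return workingIndices.remove(enrichIndexName, heldSemaphore);
    }

    boolean isWorking(String enrichIndexName) {
        return workingIndices.containsKey(enrichIndexName);
    }
}
```

A plain `Set` could answer the "is this index being built?" question, but it could not reject a removal attempted by a caller holding a different semaphore.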

...test/java/org/elasticsearch/xpack/enrich/action/InternalExecutePolicyActionRequestTests.java

Wrap the Releasable in the executor to release any held permits.

jbaiera · 2022-11-15T00:24:18Z

@elasticmachine update branch

masseyke

Looks good to me, and looks much better than what it replaces. For what it's worth (maybe for someone coming to this in the future), here are my notes:

Yes -- Can I run any policy right now?
    Yes
        put index name in working indices
        create index
        run policy
        remove from working indices
    No -- exception
No -- exception

Can I delete this index (out of all enrich indices in the cluster state)?
    Is it in working indices?
        Yes -- then no
        No -- other checks unrelated to this PR
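The working-indices part of these notes could be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: `WorkingIndexTracker`, `runPolicy`, and `deletionCandidates` are hypothetical names, a concurrent `Set` stands in for the semaphore-valued map used in the PR, and both the lock checks that precede execution and the "other checks unrelated to this PR" are left out.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical sketch of the run side and the maintenance side of the notes above.
final class WorkingIndexTracker {
    // Names of enrich indices that an in-flight execution is still building.
    private final Set<String> workingIndices = ConcurrentHashMap.newKeySet();

    // Run side: register the index name before any work, always unregister afterwards.
    void runPolicy(String enrichIndexName, Runnable createIndexAndRunPolicy) {
        workingIndices.add(enrichIndexName);
        try {
            createIndexAndRunPolicy.run();
        } finally {
            workingIndices.remove(enrichIndexName);
        }
    }

    // Maintenance side: an index still under construction is never a deletion candidate.
    List<String> deletionCandidates(Collection<String> enrichIndicesInClusterState) {
        return enrichIndicesInClusterState.stream()
            .filter(name -> workingIndices.contains(name) == false)
            .collect(Collectors.toList());
    }
}
```

The index name has to be known before the index is created for this to work, which is why the PR resolves it early in the execution.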

elasticsearchmachine · 2022-11-16T21:05:42Z

💔 Backport failed

Status	Branch	Result
❌	7.17	Commit could not be cherrypicked due to conflicts
✅	8.5

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 90931

This PR refactors the locking logic for enrich policies: enrich index names are now resolved early so that they can be explicitly protected from maintenance tasks on the master node. The maintenance service has been changed to allow concurrent removal of old enrich indices while policies are executing. Further concurrency changes improve the thread safety of the system, such as removing the double-checked locking in maintenance and the ability to unlock policies from code that does not hold the lock.

jbaiera · 2022-12-07T18:12:18Z

💚 All backports created successfully

Status	Branch	Result
✅	8.6

Questions ?

Please refer to the Backport tool documentation

This PR refactors the locking logic for enrich policies: enrich index names are now resolved early so that they can be explicitly protected from maintenance tasks on the master node. The maintenance service has been changed to allow concurrent removal of old enrich indices while policies are executing. Further concurrency changes improve the thread safety of the system, such as removing the double-checked locking in maintenance and the ability to unlock policies from code that does not hold the lock.

(cherry picked from commit 998520e)

jbaiera · 2023-03-28T17:10:03Z

💚 All backports created successfully

Status	Branch	Result
✅	7.17

Questions ?

Please refer to the Backport tool documentation

This PR refactors the locking logic for enrich policies: enrich index names are now resolved early so that they can be explicitly protected from maintenance tasks on the master node. The maintenance service has been changed to allow concurrent removal of old enrich indices while policies are executing. Further concurrency changes improve the thread safety of the system, such as removing the double-checked locking in maintenance and the ability to unlock policies from code that does not hold the lock.

(cherry picked from commit 998520e)

# Conflicts:
# x-pack/plugin/enrich/src/main/java/org/elasticsearch/xpack/enrich/EnrichPolicyMaintenanceService.java
# x-pack/plugin/enrich/src/main/java/org/elasticsearch/xpack/enrich/action/InternalExecutePolicyAction.java
# x-pack/plugin/enrich/src/test/java/org/elasticsearch/xpack/enrich/EnrichPolicyRunnerTests.java

jbaiera added 3 commits October 14, 2022 17:00

Refactor inflight enrich policy locks.

f50f68a

Style checks

ceaac99

Fix test bug

20290d9

jbaiera added the >enhancement, :Data Management/Ingest Node, v8.5.1, v8.6.0, and v7.17.8 labels Oct 14, 2022

jbaiera mentioned this pull request Oct 14, 2022

Run enrich maintenance tasks using current cluster state #90487

Closed

Update docs/changelog/90931.yaml

08dd0fb

elasticsearchmachine added the Team:Data Management label Oct 14, 2022

jbaiera requested a review from dakrone November 10, 2022 18:26

dakrone reviewed Nov 10, 2022

jbaiera added 4 commits November 14, 2022 18:55

Rename targetIndex -> enrichIndex where applicable

b8117e1

Refactor EnrichPolicyLock to be Releasable

c452802

Wrap the Releasable in the executor to release any held permits.

Rename index filter method for maintenance task

4670ccb

Implement mutate instance on InternalExecutePolicyActionRequestTests

46839e2

jbaiera requested a review from dakrone November 14, 2022 23:58

Merge branch 'main' into enrich-maintenance-lock-refactor

c67d7f5

tmgordeeva added v8.5.2 and removed v8.5.1 labels Nov 15, 2022

kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022

masseyke approved these changes Nov 16, 2022

jbaiera added the auto-backport label Nov 16, 2022

jbaiera merged commit 998520e into elastic:main Nov 16, 2022

jbaiera deleted the enrich-maintenance-lock-refactor branch November 16, 2022 21:04

jbaiera mentioned this pull request Nov 16, 2022

[8.5] Refactor enrich maintenance coordination logic (#90931) #91633

Merged

elasticsearchmachine added the backport pending label Nov 16, 2022

jbaiera removed the v7.17.8 label Nov 17, 2022

bpintea added v8.5.3 and removed v8.5.2 labels Nov 22, 2022

jbaiera mentioned this pull request Dec 7, 2022

[8.6] Refactor enrich maintenance coordination logic (#90931) #92210

Merged

This was referenced Dec 7, 2022

Enrich index getting deleted while policy is executing #85221

Closed

Enrich index cleaner appears to sometimes remove active indices for a policy #89835

Closed

jbaiera added the v8.6.1 label Dec 7, 2022

jbaiera mentioned this pull request Mar 28, 2023

[7.17] Refactor enrich maintenance coordination logic (#90931) #94840

Merged

jbaiera added the v7.17.10 label Mar 28, 2023

joegallo removed the backport pending label Jun 26, 2023
