Track & log when there is insufficient disk space available to execute merges #131711

albertzaharovits · 2025-07-22T16:41:38Z

The goal of this PR is to record information about the available disk space at the time that merge tasks are scheduled (the info will show up in heap dumps).
This should aid troubleshooting in cases where merge tasks are running even though the available disk space on the node is below the watermark. In this case we should increase the threshold for scheduling merge tasks (`indices.merge.disk.watermark.high`), and possibly consider aborting already-executing merges too.

Relates #88606 (comment)

elasticsearchmachine · 2025-07-22T16:42:17Z

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

elasticsearchmachine · 2025-07-22T16:42:17Z

Hi @albertzaharovits, I've created a changelog YAML for you.

albertzaharovits · 2025-07-22T16:46:38Z

Here is a sample heap dump:

It shows the merge tasks that are still running (and therefore still taking up budget, i.e. disk space) and what the available budget was when each merge task was dequeued for scheduling.
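For readers without the screenshot, here is a minimal sketch of the idea: keep, next to each running task, the budget that was available when it was dequeued, so the value is reachable from a heap dump. All names below (`RunningMergeBudgetSketch`, `BudgetSnapshot`, `tryStart`, `finish`) are hypothetical and the budget accounting is deliberately simplified; this is not the code in `ThreadPoolMergeExecutorService`.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch; class, record, and method names are illustrative, not Elasticsearch APIs.
final class RunningMergeBudgetSketch {
    /** What is retained per running task, and hence what a heap dump would show. */
    record BudgetSnapshot(long taskBytes, long availableBytesWhenScheduled) {}

    private final Map<Object, BudgetSnapshot> running = new IdentityHashMap<>();
    private long availableBudget;

    RunningMergeBudgetSketch(long initialAvailableBudget) {
        this.availableBudget = initialAvailableBudget;
    }

    /** Starts the task only if it fits, remembering the budget that was available at that moment. */
    synchronized boolean tryStart(Object task, long taskBytes) {
        if (taskBytes > availableBudget) {
            return false; // stays enqueued: not enough budget
        }
        running.put(task, new BudgetSnapshot(taskBytes, availableBudget));
        availableBudget -= taskBytes;
        return true;
    }

    /** Releases the budget held by a finished task. */
    synchronized void finish(Object task) {
        BudgetSnapshot snapshot = running.remove(task);
        if (snapshot != null) {
            availableBudget += snapshot.taskBytes();
        }
    }

    synchronized BudgetSnapshot snapshotOf(Object task) {
        return running.get(task);
    }

    synchronized long availableBudget() {
        return availableBudget;
    }
}
```

In this simplified model the snapshot lives exactly as long as the task is running, which is why only still-running tasks would show up in a dump.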

albertzaharovits · 2025-07-22T16:53:33Z

server/src/main/java/org/elasticsearch/index/engine/ThreadPoolMergeExecutorService.java

+ assert super.lock.isHeldByCurrentThread();
+ Tuple<MergeTask, Long> head = enqueuedByBudget.peek();
+ if (head != null && head.v2() > availableBudget) {
+ LOGGER.warn("There are merge tasks enqueued but there's insufficient disk space available to execute them");


These log messages will be printed every `indices.merge.disk.check_interval` (5 sec). That sounds OK to you, right?

Would it be useful to include the availableBudget and head.v2() values in the log message too?

Agree, thanks for suggesting! Pushed: d334433
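As an illustration of the suggestion, a warning that carries both values could look roughly like the sketch below. The `Entry` record stands in for `Tuple<MergeTask, Long>`, and the class name, method name, and exact message wording are assumptions; the wording actually pushed in d334433 is not reproduced here.

```java
import java.util.Comparator;
import java.util.Locale;
import java.util.PriorityQueue;

// Hypothetical stand-in for the budget-ordered queue in the hunk above; not the actual Elasticsearch class.
final class BudgetQueueSketch {
    /** A task name paired with the bytes it is estimated to need (plays the role of Tuple<MergeTask, Long>). */
    record Entry(String taskName, long requiredBytes) {}

    // Smallest required budget first, so peek() returns the task that is easiest to satisfy.
    private final PriorityQueue<Entry> enqueuedByBudget =
        new PriorityQueue<>(Comparator.comparingLong(Entry::requiredBytes));

    void enqueue(String taskName, long requiredBytes) {
        enqueuedByBudget.add(new Entry(taskName, requiredBytes));
    }

    /** Returns a warning when even the smallest enqueued task does not fit, otherwise null. */
    String insufficientBudgetMessage(long availableBudget) {
        Entry head = enqueuedByBudget.peek();
        if (head != null && head.requiredBytes() > availableBudget) {
            // Message wording is illustrative only.
            return String.format(
                Locale.ROOT,
                "There are merge tasks enqueued but there's insufficient disk space available to execute them "
                    + "(the smallest merge task requires [%d] bytes, but the available disk space is only [%d] bytes)",
                head.requiredBytes(),
                availableBudget
            );
        }
        return null;
    }
}
```

Because the queue is ordered by required budget, checking only the head is enough: if the smallest task does not fit, none of the others do.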

DaveCTurner

Seems reasonable to me but I'm probably not best-placed to review this properly. Could you try someone from the distrib/indexing team instead?

albertzaharovits · 2025-07-23T10:30:22Z

Thanks for looking, David!

tlrx

LGTM, I left minor comments

tlrx · 2025-07-23T10:51:49Z

server/src/main/java/org/elasticsearch/index/engine/ThreadPoolMergeExecutorService.java

+ assert super.lock.isHeldByCurrentThread();
+ Tuple<MergeTask, Long> head = enqueuedByBudget.peek();
+ if (head != null && head.v2() > availableBudget) {
+ LOGGER.warn("There are merge tasks enqueued but there's insufficient disk space available to execute them");


Would it be useful to include the availableBudget and head.v2() values in the log message too?

tlrx · 2025-07-23T10:53:10Z

server/src/main/java/org/elasticsearch/index/engine/ThreadPoolMergeExecutorService.java

+ Locale.ROOT,
+ "There are no merge tasks currently running, "
+ + "but there are [%d] enqueued ones that are blocked because of insufficient disk space",
+ enqueuedByBudget.size()


Similarly, would it be useful to have the availableBudget and head.v2() values in the log message here?

tlrx · 2025-07-23T10:53:22Z

server/src/main/java/org/elasticsearch/index/engine/ThreadPoolMergeExecutorService.java

+ messageBuilder.append(
+ "], and there are ["
+ + enqueuedByBudget.size()
+ + "] additional enqueued ones that are blocked because of insufficient disk space"


And here too?

tlrx · 2025-07-23T10:55:35Z

server/src/main/java/org/elasticsearch/index/engine/ThreadPoolMergeExecutorService.java

 }
 }

+ void postBudgetUpdate() {};


I would assert super.lock.isHeldByCurrentThread(); here too
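A minimal sketch of what that could look like, assuming the hook is only ever invoked from a budget update that already holds the lock. `LockedBudgetSketch` and `updateBudget` are hypothetical names; only `ReentrantLock.isHeldByCurrentThread()` is a real JDK call.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a budget holder whose post-update hook must run under the lock.
final class LockedBudgetSketch {
    final ReentrantLock lock = new ReentrantLock();
    private long availableBudget;
    private int hookInvocations;

    /** Updates the budget and runs the hook while still holding the lock. */
    void updateBudget(long newBudget) {
        lock.lock();
        try {
            availableBudget = newBudget;
            postBudgetUpdate();
        } finally {
            lock.unlock();
        }
    }

    /** Hook invoked after every budget update. */
    void postBudgetUpdate() {
        // Only evaluated when the JVM runs with assertions enabled (-ea).
        assert lock.isHeldByCurrentThread() : "postBudgetUpdate must be called while holding the lock";
        hookInvocations++;
    }

    long availableBudget() {
        lock.lock();
        try {
            return availableBudget;
        } finally {
            lock.unlock();
        }
    }

    int hookInvocations() {
        lock.lock();
        try {
            return hookInvocations;
        } finally {
            lock.unlock();
        }
    }
}
```

The assertion costs nothing in production (assertions are disabled by default) but catches callers that invoke the hook without the lock in tests.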

…e merges (elastic#131711) The goal of this PR is to record information about the available disk space at the time that merge tasks are scheduled (the info will show up in heap dumps). This should aid troubleshooting in cases where merge tasks are running even though the available disk space on the node is below the watermark. In this case we should increase the threshold for scheduling merge tasks (`indices.merge.disk.watermark.high`), and possibly consider aborting already-executing merges too. Relates elastic#88606 (comment)

…e merges (#131711) (#131772)

…e merges (#131711) (#131773)

…e merges (#131711) (#131774)

Track & log when there is insufficient disk space available to execute enqueued merge tasks

5b72bb0

albertzaharovits requested a review from DaveCTurner July 22, 2025 16:41

albertzaharovits self-assigned this Jul 22, 2025

albertzaharovits added the >enhancement, :Distributed Indexing/Engine, v9.2.0, v9.1.1, v8.19.1, and v9.0.5 labels Jul 22, 2025

elasticsearchmachine added the Team:Distributed Indexing label Jul 22, 2025

Update docs/changelog/131711.yaml

d0cac29

albertzaharovits mentioned this pull request Jul 22, 2025

Pause writing data to disk for merges when disk almost full #88606

Closed

[CI] Auto commit changes from spotless

7fc3e4d

albertzaharovits commented Jul 22, 2025

Merge branch 'main' into improve-unavailable-disk-space-diag

d941f2c

DaveCTurner reviewed Jul 23, 2025

tlrx approved these changes Jul 23, 2025

albertzaharovits added 4 commits July 23, 2025 14:11

Merge branch 'main' into improve-unavailable-disk-space-diag

88d5f0d

include the available and smallest merge task bytes

d334433

Actually log the message

7d33a29

Merge branch 'main' into improve-unavailable-disk-space-diag

91901dc

albertzaharovits added the auto-merge-without-approval and auto-backport labels Jul 23, 2025

albertzaharovits added 2 commits July 23, 2025 15:29

Merge branch 'main' into improve-unavailable-disk-space-diag

5352c0a

Merge branch 'main' into improve-unavailable-disk-space-diag

595ad45

elasticsearchmachine merged commit 9caa861 into elastic:main Jul 23, 2025
33 checks passed

albertzaharovits deleted the improve-unavailable-disk-space-diag branch July 23, 2025 15:17
