Description
This was discovered in cases where `masterService#updateTask` had work enqueued but no worker to process it.

The root cause of the issue is a bug in `EsExecutors`. When the pool core size is set to 0 and the max pool size to 1 (this is also possible with a higher max pool size, but less likely), `EsThreadPoolExecutor` sometimes fails to add another worker to execute the task because the pool is already at its max size (expected). However, in rare cases, the single worker thread (or threads) can time out, based on their `keepAliveTime`, at about the same time as the new task is being queued via `ForceQueuePolicy` (triggered by the initial rejection after failing to add a worker). Unless more tasks are submitted later (which is not the case for `masterService#updateTask`), this task will starve in the queue without any worker to process it.
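To make the configuration concrete, here is a minimal sketch built only on `java.util.concurrent`, assuming a setup that approximates the one described: core size 0, max size 1, a queue whose `offer` only succeeds if an idle worker can take the task immediately, and a rejection handler that force-queues the task. `ScalingQueue`, `ForceQueue`, `newScaling` and `runAndCount` are hypothetical names for illustration, not the actual `EsExecutors` API. The sketch does not reproduce the race deterministically; the comments mark where the window lies.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ForceQueueSketch {

    // Hypothetical scaling queue: offer() succeeds only if an idle worker is
    // already blocked waiting for work, so ThreadPoolExecutor#execute falls
    // through to adding a new worker (up to the max pool size) instead.
    static final class ScalingQueue<E> extends LinkedTransferQueue<E> {
        @Override
        public boolean offer(E e) {
            return tryTransfer(e);
        }

        // Unconditional enqueue used by the rejection handler; calls the
        // superclass offer() directly to bypass the override above.
        boolean forceOffer(E e) {
            return super.offer(e);
        }
    }

    // Hypothetical force-queue policy: a rejection here means "no idle worker
    // and already at max pool size", so the task is queued anyway.
    static final class ForceQueue implements RejectedExecutionHandler {
        @Override
        public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
            // Race window: the last worker may already have seen an empty queue
            // after its keepAlive timeout and be on its way out. If it exits
            // after this enqueue, nothing starts a new worker for `task`.
            ((ScalingQueue<Runnable>) executor.getQueue()).forceOffer(task);
        }
    }

    // Core size 0 and max size 1, as in the scenario described above.
    static ThreadPoolExecutor newScaling(long keepAliveMillis) {
        return new ThreadPoolExecutor(0, 1, keepAliveMillis, TimeUnit.MILLISECONDS,
                new ScalingQueue<>(), new ForceQueue());
    }

    // Submits `tasks` no-op tasks and returns how many completed within 5 s.
    static int runAndCount(int tasks) {
        ThreadPoolExecutor executor = newScaling(1_000);
        AtomicInteger done = new AtomicInteger();
        CountDownLatch latch = new CountDownLatch(tasks);
        try {
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> {
                    done.incrementAndGet();
                    latch.countDown();
                });
            }
            // Bounded wait: a starved task would show up as a short count
            // instead of hanging forever.
            latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdown();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("completed " + runAndCount(100) + " of 100 tasks");
    }
}
```

Under normal conditions every task completes: tasks submitted while the worker is busy are rejected, force-queued, and picked up by the worker's next timed poll. The problematic interleaving requires the only worker to be exiting on its keepAlive timeout at the exact moment a task is force-queued, which a quick loop like this is not expected to hit.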
The respective code in `EsExecutors` is old and unchanged. We were able to reproduce the bug on `main` using Java 21, 22 and 23, as well as on `8.0` using Java 17. The same is likely possible for older versions of ES.
It looks as if the bug is triggered more frequently with more recent JDK versions, but this might just be observation bias, as we hadn't been aware of this bug earlier.