Enrich policy executions are incorrectly unlocked when wait_for_completion is false #94690

@jbaiera

Description

When executing enrich policies, the master node locks the policy so that multiple policy executions cannot race each other. This locking mechanism also reserves the enrich index name so that it is not considered for cleanup by the background enrich maintenance task. Policy execution is coordinated on the master node, but it is executed on a randomly chosen remote node.

When executing normally, the client waits for the policy execution to complete, after which the policy is unlocked. However, if the client specifies wait_for_completion=false, the remote node returns immediately with a task id that can be used to check progress asynchronously. In this case, the master node still needs to be notified when the task completes so that it can unlock the policy. It does this by submitting a Get Task action with wait_for_completion=true and unlocking the policy once that call returns.

The bug is that the Get Task action is submitted without an explicit timeout, so the default timeout of 30 seconds applies. After those 30 seconds, the call completes with a timeout exception. The exception is ignored and the policy is unlocked even though the execution is still running. Policies that take longer than 30 seconds to execute are therefore at risk of having their new indices removed by the background enrich maintenance task, since those indices are no longer reserved for an active execution. Depending on timing, the chance that the maintenance task deletes the index increases as the policy execution time approaches the maintenance period (default 20 minutes).
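The failure mode can be reproduced in a toy model. The sketch below is not Elasticsearch code: `PolicyLocks`, `wait_then_unlock`, and `wait_then_unlock_fixed` are hypothetical names, a `threading.Event` stands in for the remote task, and the timeouts are shortened from 30 seconds to fractions of a second. The "fixed" variant is only one possible approach under these assumptions, not necessarily how the project resolves the issue.

```python
import threading


class PolicyLocks:
    """Toy stand-in for the master node's policy lock registry (hypothetical)."""

    def __init__(self):
        self._locked = set()
        self._mutex = threading.Lock()

    def lock(self, policy):
        with self._mutex:
            if policy in self._locked:
                raise RuntimeError(f"policy {policy} is already executing")
            self._locked.add(policy)

    def unlock(self, policy):
        with self._mutex:
            self._locked.discard(policy)

    def is_locked(self, policy):
        with self._mutex:
            return policy in self._locked


def wait_then_unlock(locks, policy, task_done, timeout):
    """Models the bug: wait with a timeout, then unlock regardless of the result."""
    finished = task_done.wait(timeout)  # returns False if the wait timed out
    # The timeout outcome is ignored here, so the lock is released too early.
    locks.unlock(policy)
    return finished


def wait_then_unlock_fixed(locks, policy, task_done, timeout):
    """One possible alternative: keep the lock until the task really completes."""
    while not task_done.wait(timeout):
        pass  # timed out but the task is still running: wait again
    locks.unlock(policy)
    return True


locks = PolicyLocks()
locks.lock("my-policy")
task_done = threading.Event()  # never set: the "task" is still running
finished = wait_then_unlock(locks, "my-policy", task_done, timeout=0.05)
print(finished, locks.is_locked("my-policy"))  # prints: False False
```

In the buggy variant the policy ends up unlocked while the task is still running, which is the state in which a maintenance pass could consider the new index unused.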

One workaround is to dramatically increase the maintenance period to give enrich policy executions time to complete. This does not eliminate the chance of deleting the new enrich index; it only widens the window in which the policy can finish executing. Setting the period arbitrarily long is also not advisable, as it can keep the cluster from cleaning up old indices. The maintenance period setting is not dynamic, so changing it also requires a node restart to take effect.

Another workaround is to avoid wait_for_completion=false. Instead, use a connection with a long client-side timeout along with wait_for_completion=true (the default). This keeps the policy locked for the entire execution and keeps the underlying enrich index from being deleted prematurely.
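A minimal sketch of this second workaround, using only the Python standard library. The base URL, policy name, and one-hour timeout are assumptions for illustration, and the helper names `build_execute_request` and `execute_policy_blocking` are hypothetical. Authentication and TLS are omitted. Note that `urlopen`'s `timeout` applies to blocking socket operations, so it should be set comfortably above the expected execution time.

```python
import urllib.parse
import urllib.request


def build_execute_request(base_url, policy_name):
    """Build a blocking execute-policy request (wait_for_completion=true)."""
    path = f"/_enrich/policy/{urllib.parse.quote(policy_name, safe='')}/_execute"
    # true is already the default; it is spelled out here to make the intent explicit.
    query = urllib.parse.urlencode({"wait_for_completion": "true"})
    url = f"{base_url.rstrip('/')}{path}?{query}"
    return urllib.request.Request(url, method="PUT")


def execute_policy_blocking(base_url, policy_name, timeout_seconds=3600):
    """Send the request and block until it returns or the client times out."""
    request = build_execute_request(base_url, policy_name)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8")


# Only the request is built here; nothing is sent over the network.
request = build_execute_request("http://localhost:9200", "my-policy")
print(request.get_method(), request.full_url)
# prints: PUT http://localhost:9200/_enrich/policy/my-policy/_execute?wait_for_completion=true
```

Calling `execute_policy_blocking("http://localhost:9200", "my-policy")` against a reachable cluster would hold the connection open for the whole execution, which is what keeps the policy locked.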
