Skip to content

Conversation

@przemekwitek
Copy link
Contributor

@przemekwitek przemekwitek commented Apr 28, 2023

This PR fixes 2 places that may yield issues with status reporting when the transform finishes:

  1. it makes the TransformTask.shutdown method wait until the persistent task disappears.
  2. It makes the _stats action report stopping rather than stopped state if there is no persistent task seen by the TaskManager but there still is a task in the ClusterState.

Additionally, this PR adds a test (TransformRobustnessIT.testCreateAndDeleteTransformInALoop) which shows that every time we are about to delete the transform, the persistent task is already gone.

Relates #95327

@przemekwitek przemekwitek force-pushed the debug_failing_test branch 9 times, most recently from 9623618 to a0b53e0 Compare May 4, 2023 12:43
@przemekwitek przemekwitek force-pushed the debug_failing_test branch from a0b53e0 to 2beb76a Compare May 4, 2023 13:27
@przemekwitek przemekwitek changed the title [Transform] Fix race condition in TransformIndexer [Transform] Improve reporting status of the transform that is about to finish May 5, 2023
@przemekwitek przemekwitek removed the WIP label May 5, 2023
@przemekwitek przemekwitek marked this pull request as ready for review May 5, 2023 08:00
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label May 5, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Copy link
Collaborator

Hi @przemekwitek, I've created a changelog YAML for you.

Copy link

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although I think it would be clearer if you renamed one variable.

@przemekwitek przemekwitek merged commit d9c169f into elastic:main May 5, 2023
@przemekwitek przemekwitek deleted the debug_failing_test branch May 5, 2023 11:54
@hendrikmuhs
Copy link

2. It makes the `_stats` action report `stopping` rather than `stopped` state if there is no persistent task seen by the `TaskManager` but there still is a task in the `ClusterState`. 

This is great.

1. it makes the `TransformTask.shutdown` method wait until the persistent task disappears. 

The stop API has a wait_completion query parameter, the added code is a duplication of that in a different place. You find it in the transport implementation of _stop.

The change breaks wait_completion=false.

I therefore think we should revert that part. It doesn't solve the fundamental problem that if the indexer is running _stop does not unregister the persistent task. That means that with a hanging indexer thread it is not possible to remove the p-task using _stop.

@przemekwitek
Copy link
Contributor Author

2. It makes the `_stats` action report `stopping` rather than `stopped` state if there is no persistent task seen by the `TaskManager` but there still is a task in the `ClusterState`. 

This is great.

1. it makes the `TransformTask.shutdown` method wait until the persistent task disappears. 

The stop API has a wait_completion query parameter, the added code is a duplication of that in a different place. You find it in the transport implementation of _stop.

The change breaks wait_completion=false.

I therefore think we should revert that part. It doesn't solve the fundamental problem that if the indexer is running _stop does not unregister the persistent task. That means that with a hanging indexer thread it is not possible to remove the p-task using _stop.

Right, removing the p-task using stop is a separate problem that I'd like to tackle.
I've verified (using the TransformRobustnessIT.testCreateAndDeleteTransformInALoop test case) that this change was indeed unnecessary.

The partial revert PR is ready for review: #96060

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :ml/Transform Transform Team:ML Meta label for the ML team v8.9.0

4 participants