@mergify
Contributor

@mergify mergify bot commented Sep 16, 2025

What does this PR do?

This PR fixes a regression introduced by #9634 where the coordinator’s overrideState could remain set if an upgrade attempt failed early (e.g. agent not upgradeable, capability check denied, or pre-upgrade callback returned an error).

Specifically, this PR:

  • Adds calls to ClearOverrideState() before returning from all early failure branches inside Coordinator.Upgrade.
  • Extends the coordinator test suite to assert that overrideState is reset to nil after a failing preUpgradeCallback, preventing stale state from leaking into subsequent upgrade attempts.

Why is it important?

Without this change, failed upgrades could leave the coordinator in a state that incorrectly reflects an ongoing upgrade. This blocks future upgrade attempts until the Elastic Agent is restarted, which is disruptive and operationally undesirable.

By clearing the override state on failure, we ensure the coordinator always returns to a clean state, enabling subsequent upgrade attempts to proceed without requiring a restart.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • I have made corresponding changes to the default configuration files.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have added an entry in ./changelog/fragments using the changelog tool.
  • I have added an integration test or an E2E test.

Disruptive User Impact

Previously, users would need to manually restart the Elastic Agent after a failed upgrade attempt in order to retry an upgrade.
With this fix, the agent automatically clears the override state, removing the need for manual intervention.

How to test this PR locally

Run mage unitTest and confirm that all tests pass, including the updated coordinator tests.

Related issues


This is an automatic backport of pull request #9992 done by [Mergify](https://mergify.com).
…ade of coordinator (#9992) * fix: always clear the coordinator overridden state on err inside upgrade of coordinator * doc: add changelog fragment (cherry picked from commit a77d4cd)
@mergify mergify bot added the backport label Sep 16, 2025
@mergify mergify bot requested a review from a team as a code owner September 16, 2025 20:15
@mergify mergify bot requested review from michalpristas and removed request for a team September 16, 2025 20:15
@mergify mergify bot requested a review from pchila September 16, 2025 20:15
@github-actions github-actions bot added the bug and Team:Elastic-Agent-Control-Plane labels Sep 16, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis enabled auto-merge (squash) September 16, 2025 20:18
@pkoutsovasilis pkoutsovasilis merged commit f8a8f0e into 9.1 Sep 16, 2025
22 of 23 checks passed
@pkoutsovasilis pkoutsovasilis deleted the mergify/bp/9.1/pr-9992 branch September 16, 2025 22:24
@elasticmachine
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

cc @pkoutsovasilis


Labels

backport, bug, Team:Elastic-Agent-Control-Plane

3 participants