Skip to content

Conversation

pkoutsovasilis
Copy link
Contributor

@pkoutsovasilis pkoutsovasilis commented Aug 28, 2025

What does this PR do?

This PR refactors the Coordinator.Upgrade API to use functional options (UpgradeOpt) instead of long argument lists, and introduces a preUpgradeCallback.

Specifically:

  • Moves the action proxying of Endpoint upgrades into a callback passed to the coordinator. This ensures that action dispatch only happens after the initial upgrade checks succeed (already upgrading, agent upgradeable, agent upgrade capability allowed, etc.), which is a more logical sequence.
  • Reorders checks inside Coordinator.Upgrade to make the flow more consistent (e.g. first verifying that the coordinator is not already upgrading before continuing).
  • Updates tests and server code to use the new options-based API.

All business logic changes are captured here bcdba7b (+90 -29 lines changed)

Why is it important?

Currently, upgrade actions can remain stuck in bkgActions when tamper protection is enabled (see #9629). This occurs because errors returned from the Endpoint action proxying are not correctly handled, leaving stale entries behind.

By moving this logic into a preUpgradeCallback, any error from proxying is now surfaced through the coordinator’s upgrade flow, which ensures that bkgActions are cleaned up correctly.

I am not fully satisfied with these changes — they are a pragmatic fix until we perform a larger rewrite to centralize upgrade action handling in a single place. Still, they are necessary to resolve the immediate issue of stuck upgrade actions.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

None expected - this change only affects the internal proxying of upgrade actions to Endpoint and does not alter the upgrade request API.

How to test this PR locally

mage unitTest 

Related issues

@pkoutsovasilis pkoutsovasilis self-assigned this Aug 28, 2025
@pkoutsovasilis pkoutsovasilis added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team skip-changelog backport-active-all Automated backport with mergify to all the active branches labels Aug 28, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the fix/endpoint_upgrade_action_error branch 2 times, most recently from deac63b to ad27fc6 Compare September 5, 2025 10:36
@pkoutsovasilis pkoutsovasilis force-pushed the fix/endpoint_upgrade_action_error branch from ad27fc6 to eb2f6a5 Compare September 5, 2025 13:47
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review September 8, 2025 05:51
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner September 8, 2025 05:51
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me as a way to fix this without totally rewriting everything. A couple of small comments.

@cmacknz
Copy link
Member

cmacknz commented Sep 10, 2025

After your clarifications my only remaining comments are suggestions for readability or documentation, which I think are necessary but if you do these and get approval from someone else go ahead and merge and don't feel the need to wait for me to review again.

@elasticmachine
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

History

cc @pkoutsovasilis

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow up!

@pkoutsovasilis pkoutsovasilis merged commit bb98f2a into elastic:main Sep 10, 2025
23 checks passed
Copy link
Contributor

@Mergifyio backport 8.18 8.19 9.0 9.1

Copy link
Contributor

mergify bot commented Sep 10, 2025

backport 8.18 8.19 9.0 9.1

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Sep 10, 2025
* fix: refactor coordinator Upgrade to use opts and introduce pre-upgrade callback to properly handle errs and upgrade details * ci: add unit-tests * doc: add changelog fragment * fix: rename notifyUnitsOfProxiedAction to notifyUnitsOfProxiedActionFn * doc: add comment for notifyUnitsOfProxiedActionFn unit-test assertion (cherry picked from commit bb98f2a) # Conflicts: #	internal/pkg/agent/application/coordinator/coordinator.go
mergify bot pushed a commit that referenced this pull request Sep 10, 2025
* fix: refactor coordinator Upgrade to use opts and introduce pre-upgrade callback to properly handle errs and upgrade details * ci: add unit-tests * doc: add changelog fragment * fix: rename notifyUnitsOfProxiedAction to notifyUnitsOfProxiedActionFn * doc: add comment for notifyUnitsOfProxiedActionFn unit-test assertion (cherry picked from commit bb98f2a) # Conflicts: #	internal/pkg/agent/application/coordinator/coordinator.go
mergify bot pushed a commit that referenced this pull request Sep 10, 2025
* fix: refactor coordinator Upgrade to use opts and introduce pre-upgrade callback to properly handle errs and upgrade details * ci: add unit-tests * doc: add changelog fragment * fix: rename notifyUnitsOfProxiedAction to notifyUnitsOfProxiedActionFn * doc: add comment for notifyUnitsOfProxiedActionFn unit-test assertion (cherry picked from commit bb98f2a) # Conflicts: #	internal/pkg/agent/application/coordinator/coordinator.go
mergify bot pushed a commit that referenced this pull request Sep 10, 2025
* fix: refactor coordinator Upgrade to use opts and introduce pre-upgrade callback to properly handle errs and upgrade details * ci: add unit-tests * doc: add changelog fragment * fix: rename notifyUnitsOfProxiedAction to notifyUnitsOfProxiedActionFn * doc: add comment for notifyUnitsOfProxiedActionFn unit-test assertion (cherry picked from commit bb98f2a) # Conflicts: #	internal/pkg/agent/application/coordinator/coordinator.go
pkoutsovasilis added a commit that referenced this pull request Sep 10, 2025
…ng it to bkgActions (#9859) * fix: send upgrade action to units before adding it to bkgActions (#9634) * fix: refactor coordinator Upgrade to use opts and introduce pre-upgrade callback to properly handle errs and upgrade details * ci: add unit-tests * doc: add changelog fragment * fix: rename notifyUnitsOfProxiedAction to notifyUnitsOfProxiedActionFn * doc: add comment for notifyUnitsOfProxiedActionFn unit-test assertion (cherry picked from commit bb98f2a) # Conflicts: #	internal/pkg/agent/application/coordinator/coordinator.go * fix: resolve conflicts --------- Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
pkoutsovasilis added a commit that referenced this pull request Sep 10, 2025
…g it to bkgActions (#9861) * fix: send upgrade action to units before adding it to bkgActions (#9634) * fix: refactor coordinator Upgrade to use opts and introduce pre-upgrade callback to properly handle errs and upgrade details * ci: add unit-tests * doc: add changelog fragment * fix: rename notifyUnitsOfProxiedAction to notifyUnitsOfProxiedActionFn * doc: add comment for notifyUnitsOfProxiedActionFn unit-test assertion (cherry picked from commit bb98f2a) # Conflicts: #	internal/pkg/agent/application/coordinator/coordinator.go * fix: resolve conflicts --------- Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
pkoutsovasilis added a commit that referenced this pull request Sep 10, 2025
…ng it to bkgActions (#9860) * fix: send upgrade action to units before adding it to bkgActions (#9634) * fix: refactor coordinator Upgrade to use opts and introduce pre-upgrade callback to properly handle errs and upgrade details * ci: add unit-tests * doc: add changelog fragment * fix: rename notifyUnitsOfProxiedAction to notifyUnitsOfProxiedActionFn * doc: add comment for notifyUnitsOfProxiedActionFn unit-test assertion (cherry picked from commit bb98f2a) # Conflicts: #	internal/pkg/agent/application/coordinator/coordinator.go * fix: resolve conflicts --------- Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
pkoutsovasilis added a commit that referenced this pull request Sep 10, 2025
…g it to bkgActions (#9862) * fix: send upgrade action to units before adding it to bkgActions (#9634) * fix: refactor coordinator Upgrade to use opts and introduce pre-upgrade callback to properly handle errs and upgrade details * ci: add unit-tests * doc: add changelog fragment * fix: rename notifyUnitsOfProxiedAction to notifyUnitsOfProxiedActionFn * doc: add comment for notifyUnitsOfProxiedActionFn unit-test assertion (cherry picked from commit bb98f2a) # Conflicts: #	internal/pkg/agent/application/coordinator/coordinator.go * fix: resolve conflicts --------- Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
intxgo pushed a commit to intxgo/elastic-agent that referenced this pull request Sep 24, 2025
…stic#9634) * fix: refactor coordinator Upgrade to use opts and introduce pre-upgrade callback to properly handle errs and upgrade details * ci: add unit-tests * doc: add changelog fragment * fix: rename notifyUnitsOfProxiedAction to notifyUnitsOfProxiedActionFn * doc: add comment for notifyUnitsOfProxiedActionFn unit-test assertion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-active-all Automated backport with mergify to all the active branches bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

3 participants