Skip to content

Conversation

@mergify
Copy link
Contributor

@mergify mergify bot commented Sep 10, 2025

What does this PR do?

This PR refactors the Coordinator.Upgrade API to use functional options (UpgradeOpt) instead of long argument lists, and introduces a preUpgradeCallback.

Specifically:

  • Moves the action proxying of Endpoint upgrades into a callback passed to the coordinator. This ensures that action dispatch only happens after the initial upgrade checks succeed (already upgrading, agent upgradeable, agent upgrade capability allowed, etc.), which is a more logical sequence.
  • Reorders checks inside Coordinator.Upgrade to make the flow more consistent (e.g. first verifying that the coordinator is not already upgrading before continuing).
  • Updates tests and server code to use the new options-based API.

All business logic changes are captured here bcdba7b (+90 -29 lines changed)

Why is it important?

Currently, upgrade actions can remain stuck in bkgActions when tamper protection is enabled (see #9629). This occurs because errors returned from the Endpoint action proxying are not correctly handled, leaving stale entries behind.

By moving this logic into a preUpgradeCallback, any error from proxying is now surfaced through the coordinator’s upgrade flow, which ensures that bkgActions are cleaned up correctly.

I am not fully satisfied with these changes — they are a pragmatic fix until we perform a larger rewrite to centralize upgrade action handling in a single place. Still, they are necessary to resolve the immediate issue of stuck upgrade actions.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

None expected - this change only affects the internal proxying of upgrade actions to Endpoint and does not alter the upgrade request API.

How to test this PR locally

mage unitTest 

Related issues

* fix: refactor coordinator Upgrade to use opts and introduce pre-upgrade callback to properly handle errs and upgrade details * ci: add unit-tests * doc: add changelog fragment * fix: rename notifyUnitsOfProxiedAction to notifyUnitsOfProxiedActionFn * doc: add comment for notifyUnitsOfProxiedActionFn unit-test assertion (cherry picked from commit bb98f2a) # Conflicts: #	internal/pkg/agent/application/coordinator/coordinator.go
@mergify mergify bot requested a review from a team as a code owner September 10, 2025 14:04
@mergify mergify bot added backport conflicts There is a conflict in the backported pull request labels Sep 10, 2025
@mergify mergify bot requested review from blakerouse and pkoutsovasilis and removed request for a team September 10, 2025 14:04
@mergify
Copy link
Contributor Author

mergify bot commented Sep 10, 2025

Cherry-pick of bb98f2a has failed:

On branch mergify/bp/9.0/pr-9634 Your branch is up to date with 'origin/9.0'. You are currently cherry-picking commit bb98f2a12. (fix conflicts and run "git cherry-pick --continue") (use "git cherry-pick --skip" to skip this patch) (use "git cherry-pick --abort" to cancel the cherry-pick operation) Changes to be committed:	modified: .mockery.yaml	new file: changelog/fragments/1757079091-fix-stale-upgrade-actions.yaml	modified: internal/pkg/agent/application/actions/handlers/handler_action_upgrade.go	modified: internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go	modified: internal/pkg/agent/application/actions/handlers/handler_helpers.go	modified: internal/pkg/agent/application/coordinator/coordinator_test.go	modified: internal/pkg/agent/application/coordinator/coordinator_unit_test.go	modified: pkg/control/v2/server/server.go	new file: testing/mocks/internal_/pkg/agent/application/actions/handlers/upgrade_coordinator_mock.go Unmerged paths: (use "git add <file>..." to mark resolution)	both modified: internal/pkg/agent/application/coordinator/coordinator.go 

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@github-actions github-actions bot added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Sep 10, 2025
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis removed the conflicts There is a conflict in the backported pull request label Sep 10, 2025
@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

cc @pkoutsovasilis

@pkoutsovasilis pkoutsovasilis merged commit 08341bb into 9.0 Sep 10, 2025
20 checks passed
@pkoutsovasilis pkoutsovasilis deleted the mergify/bp/9.0/pr-9634 branch September 10, 2025 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

3 participants