Skip to content

Conversation

@jillguyonnet
Copy link
Contributor

@jillguyonnet jillguyonnet commented Feb 28, 2025

What is the problem this PR solves?

elastic/kibana#212744 adds retry logic to the task that automatically ugprades agents. Agents that were upgraded through this task have their new upgrade_attempts property populated. It is missing a way to clear this property when the upgrade completes successfully.

How does this PR solve the problem?

The change in this PR clears upgrade_attempts when the upgrade details of the agent get into UPG_WATCHING state and are processed in handleCheckin.

How to test this PR locally

This should be tested alongside elastic/kibana#212744 (or after it is merged - this is fine, since automatic upgrades are currently behind the enableAutomaticAgentUpgrades feature flag). With this change, agents upgraded through the automatic upgrade task should have their upgrade_attempts property set to null when the upgrade is successful.

Testing should also validate that upgrade_attempts stays set if the upgrade failed, e.g. after requesting an upgrade to an invalid version.

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Relates https://github.com/elastic/ingest-dev/issues/4720

@jillguyonnet jillguyonnet added the enhancement New feature or request label Feb 28, 2025
@jillguyonnet jillguyonnet self-assigned this Feb 28, 2025
@mergify
Copy link
Contributor

mergify bot commented Feb 28, 2025

This pull request does not have a backport label. Could you fix it @jillguyonnet? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.
@jillguyonnet jillguyonnet added the backport-skip Skip notification from the automated backport with mergify label Feb 28, 2025
@cmacknz cmacknz added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Feb 28, 2025
Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks sensible, holding off the approval until we have a green CI run (it seems we are getting some errors from ECH when creating a stack and 503s when trying to clean it up (not sure if it's related to the failed creation)

michel-laterman
michel-laterman previously approved these changes Mar 3, 2025
Copy link
Contributor

@michel-laterman michel-laterman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, merge when CI is green

@cmacknz
Copy link
Member

cmacknz commented Mar 3, 2025

There is a test you should update to check that this field is properly reset:

name: "agent has details checkin details are nil",
agent: &model.Agent{ESDocument: esd, Agent: &model.AgentMetadata{ID: "test-agent"}, UpgradeDetails: &model.UpgradeDetails{}},
details: nil,
bulk: func() *ftesting.MockBulk {
mBulk := ftesting.NewMockBulk()
mBulk.On("Update", mock.Anything, dl.FleetAgents, "doc-ID", mock.MatchedBy(func(p []byte) bool {
doc := struct {
Doc map[string]interface{} `json:"doc"`
}{}
if err := json.Unmarshal(p, &doc); err != nil {
t.Logf("bulk match unmarshal error: %v", err)
return false
}
return doc.Doc[dl.FieldUpgradeDetails] == nil && doc.Doc[dl.FieldUpgradeStartedAt] == nil && doc.Doc[dl.FieldUpgradeStatus] == nil && doc.Doc[dl.FieldUpgradedAt] != ""
}), mock.Anything, mock.Anything).Return(nil)
return mBulk
},
cache: func() *testcache.MockCache {
return testcache.NewMockCache()
},
err: nil,

@jillguyonnet
Copy link
Contributor Author

Upon more testing with elastic/kibana#212744, this is not behaving as expected as upgrade_attempts is cleared after failed upgrades. 👀

@jillguyonnet
Copy link
Contributor Author

jillguyonnet commented Mar 5, 2025

OK, my original approach didn't do what is expected, i.e. only reset upgrade_attempts if the upgrade is complete and successful. This is because in handleAck the upgrade is considered complete when done retrying, even if the outcome was failure. I was also under the impression that upgrade_details were updated as part of it, but I see that's actually handled in handleCheckin, which makes sense.

I have changed the approach to reset upgrade_attempts when the agent reports an upgrade_details.state value of UPG_WATCHING in handleCheckin. I could be missing some edge cases, but it seems to meet the primary expectation. My testing so far looks good.

@michel-laterman Would you please be able to review this approach? If it looks OK, I will update tests.

Edit: it seems we need to support agents with no upgrade details as well, so this won't be sufficient nevermind

jillguyonnet added a commit to elastic/kibana that referenced this pull request Mar 6, 2025
## Summary Relates elastic/ingest-dev#4720 This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in #211019. Complementary fleet-server change which sets the agent's `upgrade_attempts` to `null` once the upgrade is complete.: elastic/fleet-server#4528 ### Approach - A new `upgrade_attempts` property is added to agents and stored in the agent doc (ES mapping update in elastic/elasticsearch#123256). - When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' `upgrade_attempts`. - The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h', '24h']` and can be overridden with the new `xpack.fleet.autoUpgrades.retryDelays` setting. - On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. elastic/ingest-dev#4720 (comment)). - Once an agent has completed and failed the max retries defined by the retry delays array, it is no longer retried. ### Testing The ES query for fetching agents with existing `upgrade_attempts` needs the updated mappings, so it might be necessary to pull the latest `main` in the `elasticsearch` repo and run `yarn es source` instead of `yarn es snapshot` (requires an up-to-date Java environment, currently 23). In order to test that `upgrade_attempts` is set to `null` when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528. ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) ### Identify risks Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the `enableAutomaticAgentUpgrades` feature flag. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
michel-laterman
michel-laterman previously approved these changes Mar 7, 2025
Copy link
Contributor

@michel-laterman michel-laterman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@juliaElastic
Copy link
Contributor

@michel-laterman could you approve again? had to fix the sonarqube failure

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@juliaElastic juliaElastic merged commit 2b40416 into elastic:main Mar 10, 2025
9 checks passed
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Mar 22, 2025
## Summary Relates elastic/ingest-dev#4720 This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in elastic#211019. Complementary fleet-server change which sets the agent's `upgrade_attempts` to `null` once the upgrade is complete.: elastic/fleet-server#4528 ### Approach - A new `upgrade_attempts` property is added to agents and stored in the agent doc (ES mapping update in elastic/elasticsearch#123256). - When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' `upgrade_attempts`. - The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h', '24h']` and can be overridden with the new `xpack.fleet.autoUpgrades.retryDelays` setting. - On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. elastic/ingest-dev#4720 (comment)). - Once an agent has completed and failed the max retries defined by the retry delays array, it is no longer retried. ### Testing The ES query for fetching agents with existing `upgrade_attempts` needs the updated mappings, so it might be necessary to pull the latest `main` in the `elasticsearch` repo and run `yarn es source` instead of `yarn es snapshot` (requires an up-to-date Java environment, currently 23). In order to test that `upgrade_attempts` is set to `null` when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528. ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) ### Identify risks Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the `enableAutomaticAgentUpgrades` feature flag. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
juliaElastic added a commit to juliaElastic/kibana that referenced this pull request Apr 8, 2025
Relates elastic/ingest-dev#4720 This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in elastic#211019. Complementary fleet-server change which sets the agent's `upgrade_attempts` to `null` once the upgrade is complete.: elastic/fleet-server#4528 - A new `upgrade_attempts` property is added to agents and stored in the agent doc (ES mapping update in elastic/elasticsearch#123256). - When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' `upgrade_attempts`. - The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h', '24h']` and can be overridden with the new `xpack.fleet.autoUpgrades.retryDelays` setting. - On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. elastic/ingest-dev#4720 (comment)). - Once an agent has completed and failed the max retries defined by the retry delays array, it is no longer retried. The ES query for fetching agents with existing `upgrade_attempts` needs the updated mappings, so it might be necessary to pull the latest `main` in the `elasticsearch` repo and run `yarn es source` instead of `yarn es snapshot` (requires an up-to-date Java environment, currently 23). In order to test that `upgrade_attempts` is set to `null` when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528. - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the `enableAutomaticAgentUpgrades` feature flag. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@juliaElastic juliaElastic added backport-8.x Automated backport to the 8.x branch with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Apr 8, 2025
mergify bot pushed a commit that referenced this pull request Apr 8, 2025
* Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416)
juliaElastic pushed a commit that referenced this pull request Apr 8, 2025
* Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416) Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>
juliaElastic pushed a commit that referenced this pull request Apr 8, 2025
* Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416) Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>
juliaElastic added a commit that referenced this pull request Apr 9, 2025
* Clear upgrade_attempts on handleAck (#4762) * clear upgrade_attempts on handleAck * clear upgrade_attempts if upgrade_details is missing * added unit test (cherry picked from commit fb093cc) * Clear agent.upgrade_attempts on upgrade complete (#4528) (#4777) * Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416) Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co> --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-8.x Automated backport to the 8.x branch with mergify enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

6 participants