- Notifications
You must be signed in to change notification settings - Fork 100
Clear agent.upgrade_attempts on upgrade complete #4528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clear agent.upgrade_attempts on upgrade complete #4528
Conversation
| This pull request does not have a backport label. Could you fix it @jillguyonnet? 🙏
|
pchila left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks sensible, holding off the approval until we have a green CI run (it seems we are getting some errors from ECH when creating a stack and 503s when trying to clean it up (not sure if it's related to the failed creation)
michel-laterman left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, merge when CI is green
| There is a test you should update to check that this field is properly reset: fleet-server/internal/pkg/api/handleCheckin_test.go Lines 334 to 354 in ba68a24
|
changelog/fragments/1740760469-Clear-agent-upgrade-attempts-when-upgrade-complete.yaml Outdated Show resolved Hide resolved
| Upon more testing with elastic/kibana#212744, this is not behaving as expected as |
| OK, my original approach didn't do what is expected, i.e. only reset I have changed the approach to reset @michel-laterman Would you please be able to review this approach? If it looks OK, I will update tests. Edit: |
changelog/fragments/1740760469-Clear-agent-upgrade-attempts-when-upgrade-complete.yaml Outdated Show resolved Hide resolved
## Summary Relates elastic/ingest-dev#4720 This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in #211019. Complementary fleet-server change which sets the agent's `upgrade_attempts` to `null` once the upgrade is complete.: elastic/fleet-server#4528 ### Approach - A new `upgrade_attempts` property is added to agents and stored in the agent doc (ES mapping update in elastic/elasticsearch#123256). - When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' `upgrade_attempts`. - The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h', '24h']` and can be overridden with the new `xpack.fleet.autoUpgrades.retryDelays` setting. - On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. elastic/ingest-dev#4720 (comment)). - Once an agent has completed and failed the max retries defined by the retry delays array, it is no longer retried. ### Testing The ES query for fetching agents with existing `upgrade_attempts` needs the updated mappings, so it might be necessary to pull the latest `main` in the `elasticsearch` repo and run `yarn es source` instead of `yarn es snapshot` (requires an up-to-date Java environment, currently 23). In order to test that `upgrade_attempts` is set to `null` when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528. ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) ### Identify risks Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the `enableAutomaticAgentUpgrades` feature flag. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
michel-laterman left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
|
| @michel-laterman could you approve again? had to fix the sonarqube failure |
blakerouse left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
## Summary Relates elastic/ingest-dev#4720 This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in elastic#211019. Complementary fleet-server change which sets the agent's `upgrade_attempts` to `null` once the upgrade is complete.: elastic/fleet-server#4528 ### Approach - A new `upgrade_attempts` property is added to agents and stored in the agent doc (ES mapping update in elastic/elasticsearch#123256). - When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' `upgrade_attempts`. - The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h', '24h']` and can be overridden with the new `xpack.fleet.autoUpgrades.retryDelays` setting. - On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. elastic/ingest-dev#4720 (comment)). - Once an agent has completed and failed the max retries defined by the retry delays array, it is no longer retried. ### Testing The ES query for fetching agents with existing `upgrade_attempts` needs the updated mappings, so it might be necessary to pull the latest `main` in the `elasticsearch` repo and run `yarn es source` instead of `yarn es snapshot` (requires an up-to-date Java environment, currently 23). In order to test that `upgrade_attempts` is set to `null` when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528. ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) ### Identify risks Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the `enableAutomaticAgentUpgrades` feature flag. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Relates elastic/ingest-dev#4720 This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in elastic#211019. Complementary fleet-server change which sets the agent's `upgrade_attempts` to `null` once the upgrade is complete.: elastic/fleet-server#4528 - A new `upgrade_attempts` property is added to agents and stored in the agent doc (ES mapping update in elastic/elasticsearch#123256). - When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' `upgrade_attempts`. - The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h', '24h']` and can be overridden with the new `xpack.fleet.autoUpgrades.retryDelays` setting. - On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. elastic/ingest-dev#4720 (comment)). - Once an agent has completed and failed the max retries defined by the retry delays array, it is no longer retried. The ES query for fetching agents with existing `upgrade_attempts` needs the updated mappings, so it might be necessary to pull the latest `main` in the `elasticsearch` repo and run `yarn es source` instead of `yarn es snapshot` (requires an up-to-date Java environment, currently 23). In order to test that `upgrade_attempts` is set to `null` when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528. - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the `enableAutomaticAgentUpgrades` feature flag. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
* Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416)
* Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416) Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>
* Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416) Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>
* Clear upgrade_attempts on handleAck (#4762) * clear upgrade_attempts on handleAck * clear upgrade_attempts if upgrade_details is missing * added unit test (cherry picked from commit fb093cc) * Clear agent.upgrade_attempts on upgrade complete (#4528) (#4777) * Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416) Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co> --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>




What is the problem this PR solves?
elastic/kibana#212744 adds retry logic to the task that automatically ugprades agents. Agents that were upgraded through this task have their new
upgrade_attemptsproperty populated. It is missing a way to clear this property when the upgrade completes successfully.How does this PR solve the problem?
The change in this PR clears
upgrade_attemptswhen the upgrade details of the agent get intoUPG_WATCHINGstate and are processed inhandleCheckin.How to test this PR locally
This should be tested alongside elastic/kibana#212744 (or after it is merged - this is fine, since automatic upgrades are currently behind the
enableAutomaticAgentUpgradesfeature flag). With this change, agents upgraded through the automatic upgrade task should have theirupgrade_attemptsproperty set tonullwhen the upgrade is successful.Testing should also validate that
upgrade_attemptsstays set if the upgrade failed, e.g. after requesting an upgrade to an invalid version.Design Checklist
Checklist
./changelog/fragmentsusing the changelog toolRelated issues
Relates https://github.com/elastic/ingest-dev/issues/4720