@michel-laterman
Contributor

@michel-laterman michel-laterman commented May 1, 2025

What does this PR do?

Retry enrollment requests when an error is returned, until a timeout is reached.
Add the --enroll-timeout flag and the FLEET_ENROLL_TIMEOUT environment variable to control the timeout duration; the default is 10m. A negative value disables the timeout.
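
For readers skimming this thread, the retry-until-timeout behavior described above can be sketched roughly as below. This is an illustrative sketch only, not the code from this PR: retryWithTimeout, simulate, the fixed 10ms wait, and the plain time.Sleep are all assumptions made for the example, and the fake endpoint stands in for a real enrollment request. The only details taken from the description are that failed requests are retried until a timeout elapses and that a negative timeout disables the deadline.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithTimeout (hypothetical helper) calls attempt until it succeeds or
// the timeout has elapsed. A negative timeout disables the deadline, so the
// loop keeps retrying until attempt succeeds.
func retryWithTimeout(timeout, wait time.Duration, attempt func() error) (int, error) {
	start := time.Now()
	tries := 0
	for {
		tries++
		err := attempt()
		if err == nil {
			return tries, nil
		}
		// Give up if another wait would pass the deadline.
		if timeout >= 0 && time.Since(start)+wait > timeout {
			return tries, fmt.Errorf("enrollment failed after %d attempts: %w", tries, err)
		}
		// Assumption: a fixed, non-interruptible wait keeps the sketch short.
		time.Sleep(wait)
	}
}

// simulate (hypothetical) runs the retry loop against a fake endpoint that
// fails the first `failures` calls and then succeeds.
func simulate(failures int, timeout time.Duration) (int, error) {
	calls := 0
	flaky := func() error {
		calls++
		if calls <= failures {
			return errors.New("connection refused")
		}
		return nil
	}
	return retryWithTimeout(timeout, 10*time.Millisecond, flaky)
}

func main() {
	tries, err := simulate(2, time.Second)
	fmt.Println(tries, err) // prints: 3 <nil>
}
```

Calling simulate with a short timeout and an endpoint that never recovers returns the last error wrapped with the attempt count, which mirrors the "retry for 10m instead of indefinitely" impact noted further down.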

Why is it important?

Increase reliability of enrollments.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

Agents run in a container, or from the command line without the delay enrollment option, will retry enrollment for 10m by default instead of indefinitely.

@michel-laterman michel-laterman added the bug, Team:Elastic-Agent-Control-Plane, and backport-active-all labels May 1, 2025
@michel-laterman michel-laterman requested a review from a team as a code owner May 1, 2025 15:41
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@michel-laterman michel-laterman requested a review from blakerouse May 1, 2025 16:38
@michel-laterman michel-laterman requested a review from blakerouse May 1, 2025 18:11
@michel-laterman michel-laterman changed the title from "Retry enrollment requests when a generic error is returned" to "Retry enrollment requests when a network error is returned" May 2, 2025
@michel-laterman michel-laterman changed the title from "Retry enrollment requests when a network error is returned" to "Retry enrollment requests when an error is returned, add enrollment timeout" May 2, 2025
@michel-laterman michel-laterman added the enhancement and backport-8.19 labels and removed the bug and backport-active-all labels May 2, 2025
Contributor

@swiatekm swiatekm left a comment

Could you resolve the lint failures? Otherwise the changes look good to me, although I'd feel more comfortable with an added test or two.

blakerouse previously approved these changes May 7, 2025
Contributor

@blakerouse blakerouse left a comment

This change itself now looks good. I am happy with how this turned out.

@swiatekm swiatekm self-requested a review May 7, 2025 14:09
swiatekm previously approved these changes May 7, 2025
Contributor

@swiatekm swiatekm left a comment

Looks good. We should address cancellation for the backoff in a follow-up though.

@michel-laterman
Contributor Author

Opened #8105 to track context cancellation.
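
As a rough illustration of what cancellation-aware backoff could look like, here is a minimal sketch. It is not taken from this PR or from #8105: waitOrCancel and demo are hypothetical names, and the one-minute wait and 20ms cancellation delay are arbitrary values chosen for the example. The idea is only that the backoff wait selects on the context as well as the timer, so a cancelled context ends the wait immediately.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitOrCancel (hypothetical helper) blocks for d, but returns early with
// ctx.Err() if the context is cancelled before the timer fires.
func waitOrCancel(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-timer.C:
		return nil
	}
}

// demo (hypothetical) cancels the context after 20ms while a one-minute
// backoff wait is pending. It reports whether the wait ended with
// context.Canceled and whether it returned well before the full minute.
func demo() (cancelled bool, quick bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(20*time.Millisecond, cancel)

	start := time.Now()
	err := waitOrCancel(ctx, time.Minute)
	return errors.Is(err, context.Canceled), time.Since(start) < 5*time.Second
}

func main() {
	cancelled, quick := demo()
	fmt.Println(cancelled, quick)
}
```

Swapping a helper like this in for a plain sleep lets a retry loop stop as soon as its context is cancelled rather than after the current backoff interval finishes.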

@michel-laterman michel-laterman dismissed stale reviews from blakerouse and swiatekm via aa74ece May 7, 2025 14:40
@elastic-sonarqube

Quality Gate failed

Failed conditions
38.7% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@elasticmachine
Contributor

💛 Build succeeded, but was flaky

cc @michel-laterman

@michel-laterman michel-laterman merged commit b201e16 into elastic:main May 7, 2025
11 of 12 checks passed
@michel-laterman michel-laterman deleted the rety-enroll-generic-error branch May 7, 2025 19:45
mergify bot pushed a commit that referenced this pull request May 7, 2025
…imeout (#8056) Retry enrollment requests when an error is returned until a timeout is reached. Add --enroll-timeout and FLEET_ENROLL_TIMEOUT to control how long the timeout is; default 10m. A negative value disables the timeout. (cherry picked from commit b201e16)
michel-laterman added a commit that referenced this pull request May 7, 2025
…imeout (#8056) (#8108) Retry enrollment requests when an error is returned until a timeout is reached. Add --enroll-timeout and FLEET_ENROLL_TIMEOUT to control how long the timeout is; default 10m. A negative value disables the timeout. (cherry picked from commit b201e16) Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
v1v added a commit to v1v/elastic-agent that referenced this pull request May 8, 2025
* upstream/main:
  Guard against `nil` pointer dereference (elastic#8107)
  Generate NOTICE.txt with only modules used by binaries (elastic#8053)
  Retry enrollment requests when an error is returned, add enrollment timeout (elastic#8056)
  Changelog for 8.17.6 version (elastic#8062) (elastic#8106)
  [main][Automation] Update versions (elastic#8098)
  Allow using beats receivers for self-monitoring (elastic#8031)
  Adding new configuration setting: `agent.upgrade.rollback.window` (elastic#8065)
  [Integration Testing] Allow tests to declare themselves as needing a FIPS environment (elastic#8083)
  fix(agentless): overcome SIGPIPE in agentless promotion pipeline (elastic#8094)
  ksm autosharing integration configuration update (elastic#8086)
Labels

backport-8.19 (Automated backport to the 8.19 branch)
enhancement (New feature or request)
Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

5 participants