Skip to content

Conversation

@ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Aug 1, 2025

What does this PR do?

This PR retries saving the Agent information file, fleet.enc, on Windows. The retries are done every 50ms for up to 2 seconds.

Why is it important?

Saving the Agent information file involves renaming/moving a file to its final destination. However, on Windows, it is sometimes not possible to rename/move a file to its destination file because the destination file is locked by another process (e.g. antivirus software). For such situations, we now retry the save operation on Windows.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

None.

Related issues

@mergify
Copy link
Contributor

mergify bot commented Aug 1, 2025

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.
@ycombinator
Copy link
Contributor Author

ycombinator commented Aug 1, 2025

@pchila @swiatekm Following the feedback on elastic/elastic-agent-libs#333, I'm replacing that PR with this one. At the moment, I've sketched out two alternate implementations in this PR — one that specifically retries saving the fleet.enc file when the policy is being handled from Fleet and the other that's a bit more generic / at a lower-level where it retries saving any file using the storage package. In both implementations, the retries are only done if the OS is Windows and error from the save/rotate operation is ACCESS_DENIED.

Let me know which of these approaches you prefer and I'll get rid of the other one + add tests. Or if you prefer a totally different approach to solving this bug, that's fine too — just let me know. Thanks!

Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would start by touching only the handler_action_policy_change with a retry and see how that goes (ideally we would need a test that reproduces the issue but as I understand we don't have one yet)

Please have a look at my comment below

@ycombinator ycombinator requested a review from pchila August 4, 2025 22:40
@ycombinator ycombinator added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team backport-active-all Automated backport with mergify to all the active branches labels Aug 4, 2025
@ycombinator ycombinator changed the title Retry fleet.enc file rotation on Windows Retry saving Agent information file (fleet.enc) on Windows Aug 4, 2025
@ycombinator ycombinator marked this pull request as ready for review August 4, 2025 22:41
@ycombinator ycombinator requested a review from a team as a code owner August 4, 2025 22:41
@ycombinator ycombinator requested a review from blakerouse August 4, 2025 22:41
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@ycombinator ycombinator force-pushed the fleet-enc-save-retries branch from d835648 to fd1c7dc Compare August 4, 2025 22:41
Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can start with retrying only at the handler_policy_change level (without the additional retry loop at EncryptedStore.
There are also a couple of comments about the usage of the backoff library

@ycombinator
Copy link
Contributor Author

ycombinator commented Aug 5, 2025

I think we can start with retrying only at the handler_policy_change level (without the additional retry loop at EncryptedStore

Can you clarify what you mean by this? In other words, what would you like to see done differently than the current implementation in this PR?

@ycombinator ycombinator requested a review from pchila August 5, 2025 17:11
@pchila
Copy link
Member

pchila commented Aug 6, 2025

I think we can start with retrying only at the handler_policy_change level (without the additional retry loop at EncryptedStore

Can you clarify what you mean by this? In other words, what would you like to see done differently than the current implementation in this PR?

I would do without the encrypted store retry (I am referring to this part that duplicate the original implementation of the handler retry).

The bug we are trying to fix is relative to not being able to write a new fleet.enc on windows because of Access is denied error so I would keep the fix focused on that operation (unless there is concrete evidence that the bug originates in the *diskstore implementation).

@ycombinator ycombinator force-pushed the fleet-enc-save-retries branch from 0d1c73f to 866d457 Compare August 7, 2025 20:32
@ycombinator
Copy link
Contributor Author

I would do without the encrypted store retry (I am referring to this part that duplicate the original implementation of the handler retry).

The bug we are trying to fix is relative to not being able to write a new fleet.enc on windows because of Access is denied error so I would keep the fix focused on that operation (unless there is concrete evidence that the bug originates in the *diskstore implementation).

@pchila My bad, I just forgot to remove the previous implementation (the one you linked to) which is why I was unsure as to what I was missing 🤦. Thanks for the clarification. Removed the previous implementation in 866d457.

@ycombinator
Copy link
Contributor Author

ycombinator commented Aug 8, 2025

The TestSaveConfigToStore unit test introduced in this PR is failing.

PS C:\Users\Shaunak Kashyap\development\github\elastic-agent> go test -v ./internal/pkg/agent/application/actions/handlers/... -run TestSaveConfigToStore -count 10 === RUN TestSaveConfigToStore --- PASS: TestSaveConfigToStore (1.55s) === RUN TestSaveConfigToStore --- PASS: TestSaveConfigToStore (1.55s) === RUN TestSaveConfigToStore --- PASS: TestSaveConfigToStore (1.51s) === RUN TestSaveConfigToStore --- PASS: TestSaveConfigToStore (1.51s) === RUN TestSaveConfigToStore --- PASS: TestSaveConfigToStore (1.51s) === RUN TestSaveConfigToStore --- PASS: TestSaveConfigToStore (1.55s) === RUN TestSaveConfigToStore --- PASS: TestSaveConfigToStore (1.56s) === RUN TestSaveConfigToStore --- PASS: TestSaveConfigToStore (1.52s) === RUN TestSaveConfigToStore --- PASS: TestSaveConfigToStore (1.55s) === RUN TestSaveConfigToStore --- PASS: TestSaveConfigToStore (1.51s) PASS ok github.com/elastic/elastic-agent/internal/pkg/agent/application/actions/handlers 15.461s PS C:\Users\Shaunak Kashyap\development\github\elastic-agent> 

Moving PR to draft while I investigate this issue.

@ycombinator ycombinator marked this pull request as draft August 8, 2025 02:50
@ycombinator ycombinator force-pushed the fleet-enc-save-retries branch from e52e8c3 to 7d79815 Compare August 11, 2025 21:35
@ycombinator ycombinator marked this pull request as ready for review August 11, 2025 21:37
pchila
pchila previously approved these changes Aug 12, 2025
Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of nitpicks, nothing blocking. Please have a look.

pchila
pchila previously approved these changes Aug 12, 2025
Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small change in the debug logs, not blocking

@ycombinator ycombinator force-pushed the fleet-enc-save-retries branch from c2e79a5 to 553678b Compare August 20, 2025 21:57
@ycombinator ycombinator enabled auto-merge (squash) August 20, 2025 22:53
@ycombinator ycombinator merged commit ead9249 into elastic:main Aug 21, 2025
19 checks passed
@github-actions
Copy link
Contributor

@Mergifyio backport 8.17 8.18 8.19 9.0 9.1

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @ycombinator

@mergify
Copy link
Contributor

mergify bot commented Aug 21, 2025

mergify bot pushed a commit that referenced this pull request Aug 21, 2025
* Retry saving Agent information file on Windows * Implement retries on saving fleet.enc file * Remove previous implementation * Adding CHANGELOG file * Running mage fmt * Remove unnecessary changes (cherry picked from commit ead9249)
mergify bot pushed a commit that referenced this pull request Aug 21, 2025
* Retry saving Agent information file on Windows * Implement retries on saving fleet.enc file * Remove previous implementation * Adding CHANGELOG file * Running mage fmt * Remove unnecessary changes (cherry picked from commit ead9249)
mergify bot pushed a commit that referenced this pull request Aug 21, 2025
* Retry saving Agent information file on Windows * Implement retries on saving fleet.enc file * Remove previous implementation * Adding CHANGELOG file * Running mage fmt * Remove unnecessary changes (cherry picked from commit ead9249)
mergify bot pushed a commit that referenced this pull request Aug 21, 2025
* Retry saving Agent information file on Windows * Implement retries on saving fleet.enc file * Remove previous implementation * Adding CHANGELOG file * Running mage fmt * Remove unnecessary changes (cherry picked from commit ead9249)
mergify bot pushed a commit that referenced this pull request Aug 21, 2025
* Retry saving Agent information file on Windows * Implement retries on saving fleet.enc file * Remove previous implementation * Adding CHANGELOG file * Running mage fmt * Remove unnecessary changes (cherry picked from commit ead9249)
@ycombinator ycombinator deleted the fleet-enc-save-retries branch August 21, 2025 00:11
ycombinator added a commit that referenced this pull request Aug 21, 2025
…9507) * Retry saving Agent information file on Windows * Implement retries on saving fleet.enc file * Remove previous implementation * Adding CHANGELOG file * Running mage fmt * Remove unnecessary changes (cherry picked from commit ead9249) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
ycombinator added a commit that referenced this pull request Aug 21, 2025
…9510) * Retry saving Agent information file on Windows * Implement retries on saving fleet.enc file * Remove previous implementation * Adding CHANGELOG file * Running mage fmt * Remove unnecessary changes (cherry picked from commit ead9249) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
ycombinator added a commit that referenced this pull request Aug 21, 2025
…9508) * Retry saving Agent information file on Windows * Implement retries on saving fleet.enc file * Remove previous implementation * Adding CHANGELOG file * Running mage fmt * Remove unnecessary changes (cherry picked from commit ead9249) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
ycombinator added a commit that referenced this pull request Aug 21, 2025
…9511) * Retry saving Agent information file on Windows * Implement retries on saving fleet.enc file * Remove previous implementation * Adding CHANGELOG file * Running mage fmt * Remove unnecessary changes (cherry picked from commit ead9249) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
ycombinator added a commit that referenced this pull request Aug 21, 2025
…9509) * Retry saving Agent information file on Windows * Implement retries on saving fleet.enc file * Remove previous implementation * Adding CHANGELOG file * Running mage fmt * Remove unnecessary changes (cherry picked from commit ead9249) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
kaanyalti pushed a commit to kaanyalti/elastic-agent that referenced this pull request Sep 4, 2025
…#9224) * Retry saving Agent information file on Windows * Implement retries on saving fleet.enc file * Remove previous implementation * Adding CHANGELOG file * Running mage fmt * Remove unnecessary changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-active-all Automated backport with mergify to all the active branches Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

5 participants