
Conversation

@ycombinator
Contributor

@ycombinator ycombinator commented Jul 29, 2025

What does this PR do?

This PR enhances the SafeFileRotate implementation to take a variadic opts argument. This argument can be used to configure the function to retry its final rename step at a specific interval and for up to a specified duration. On Windows, the default behavior is to retry.
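
For illustration, a minimal sketch of what the variadic opts API could look like (the option and field names here are assumptions, not necessarily the identifiers used in this PR):

```go
package file

import "time"

// rotateOptions collects the knobs the variadic arguments can set.
type rotateOptions struct {
	retryInterval time.Duration // how often to retry the final rename
	retryDuration time.Duration // how long to keep retrying overall
}

// Opt configures SafeFileRotate.
type Opt func(*rotateOptions)

// WithRetries asks SafeFileRotate to retry its final rename at the given
// interval, for at most the given total duration.
func WithRetries(interval, total time.Duration) Opt {
	return func(o *rotateOptions) {
		o.retryInterval = interval
		o.retryDuration = total
	}
}
```

A caller could then opt in with something like `file.SafeFileRotate(path, tmp, file.WithRetries(500*time.Millisecond, 2*time.Second))`.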

This is similar to a fix made for Agent uninstallation.

Why is it important?

Windows does not permit renaming files that are currently held open by other processes, e.g. antivirus scanners. Retrying the rename increases the likelihood that one of the attempts will succeed once the file is no longer held open by another process.
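
A minimal sketch of the retry idea (not this PR's actual Windows implementation; the interval and timeout are parameters here rather than the PR's defaults):

```go
package file

import (
	"fmt"
	"os"
	"time"
)

// renameWithRetry keeps attempting the rename until it succeeds or the
// timeout elapses, sleeping between attempts so that e.g. an antivirus
// scanner has a chance to release its handle on the file.
func renameWithRetry(oldpath, newpath string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := os.Rename(oldpath, newpath)
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("rotating %s to %s: %w", oldpath, newpath, err)
		}
		time.Sleep(interval)
	}
}
```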

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works

How to test this PR locally

  1. On a Windows machine, reproduce the failing scenario.

    1. Revert the commit with the fix:
       git checkout a711c5aa00e500d7136d95e2b658db15e974261a
    2. Run the unit test introduced in this PR and ensure that it fails with the Access denied error:
       go test -v ./file/... -run ^TestSafeFileRotate$

  2. On a Windows machine, re-run the same test with the fix.

    1. Check out this PR's branch.
    2. Run the unit test introduced in this PR multiple times and ensure that it no longer fails with the Access denied error:
       go test -v ./file/... -run ^TestSafeFileRotate$ -count 10

Related issues

@ycombinator ycombinator requested a review from a team as a code owner July 29, 2025 21:21
@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane label Jul 29, 2025
@ycombinator ycombinator requested review from a team, khushijain21, michel-laterman, pchila and rdner and removed request for a team July 29, 2025 21:21
@pchila
Member

pchila commented Jul 30, 2025

@ycombinator could you please take a look at the test failures?

@swiatekm

I feel like this should either be a new function, or we should change the signature and make the retry timings configurable. A file operation potentially blocking for two seconds sounds like it might be unexpected for users, even if the alternative is to get an error on Windows.

@ycombinator
Contributor Author

> I feel like this should either be a new function, or we should change the signature and make the retry timings configurable. A file operation potentially blocking for two seconds sounds like it might be unexpected for users, even if the alternative is to get an error on Windows.

I'm torn about this. I see the retry as an implementation detail specifically for Windows. Perhaps we could make the retry frequency more aggressive, e.g. every 50ms or a random duration between 25-75ms? I know that still means the operation could potentially block for two seconds, but it at least drops the lower bound to something much lower than the current 500ms?
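
For concreteness, the jittered wait could be something like the sketch below (the 25-75ms range is just the one floated above, not a settled value):

```go
package file

import (
	"math/rand"
	"time"
)

// jitteredInterval returns a random wait between 25ms and 75ms, making the
// retries more aggressive than a fixed 500ms interval.
func jitteredInterval() time.Duration {
	return 25*time.Millisecond + time.Duration(rand.Int63n(int64(50*time.Millisecond)))
}
```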

@swiatekm

> I feel like this should either be a new function, or we should change the signature and make the retry timings configurable. A file operation potentially blocking for two seconds sounds like it might be unexpected for users, even if the alternative is to get an error on Windows.
>
> I'm torn about this. I see the retry as an implementation detail specifically for Windows. Perhaps we could make the retry frequency more aggressive, e.g. every 50ms or a random duration between 25-75ms? I know that still means the operation could potentially block for two seconds, but it at least drops the lower bound to something much lower than the current 500ms?

I feel like the wait time here depends on the use case and shouldn't be hardcoded. Sometimes you're ok waiting potentially indefinitely, other times you just want to skip the operation if it runs into a conflict. Is the intent here to retry if the file happened to be held temporarily by an antivirus or similar tool?

@ycombinator
Contributor Author

ycombinator commented Jul 30, 2025

> Is the intent here to retry if the file happened to be held temporarily by an antivirus or similar tool?

Yes, that's exactly right. The retry mechanism is not really meant to be controllable by callers but rather something we need only in the Windows implementation. The same function's implementation for other OSes does not have retries.

FWIW, we do something similar during Agent uninstallation: https://github.com/elastic/elastic-agent/blob/8fbdaaa979db0976e33612be647bb69ffaac70f8/internal/pkg/agent/install/uninstall.go#L277-L284
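
As a rough sketch, keeping this a Windows-only detail could look like a build-tagged default (the file name, constant names, and values here are illustrative, based on the numbers discussed in this thread):

```go
// safe_file_rotate_windows.go (hypothetical file name)

//go:build windows

package file

import "time"

// On Windows, retry the final rename by default; a non-Windows counterpart
// of this file would declare retryByDefault = false.
const (
	retryByDefault       = true
	defaultRetryInterval = 500 * time.Millisecond
	defaultRetryTimeout  = 2 * time.Second
)
```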

@ycombinator
Contributor Author

> @ycombinator could you please take a look at the test failures?

Test failure looks unrelated to the changes in this PR:

=== RUN   Test_diagError/Server_requires_client_auth
2025/07/29 21:47:50 http: TLS handshake error from 127.0.0.1:50387: tls: client didn't provide a certificate
    diag_test.go:249:
        Error Trace:    C:/buildkite-agent/builds/bk-agent-prod-gcp-1753825353569298220/elastic/elastic-agent-libs/transport/httpcommon/diag_test.go:249
        Error:          "Get \"https://127.0.0.1:50386\": read tcp 127.0.0.1:50387->127.0.0.1:50386: wsarecv: An established connection was aborted by the software in your host machine." does not contain "caused by missing mTLS client cert."
        Test:           Test_diagError/Server_requires_client_auth

Moreover, the same test passes locally (on Windows):

[screenshot: local Windows test run passing, 2025-07-30]

Retrying in CI...

@pchila
Member

pchila commented Jul 31, 2025

> FWIW, we do something similar during Agent uninstallation: https://github.com/elastic/elastic-agent/blob/8fbdaaa979db0976e33612be647bb69ffaac70f8/internal/pkg/agent/install/uninstall.go#L277-L284

To be fair, during uninstall we don't care about the final contents of a file; we just want them gone. Here we are writing file content and the conflict may arise from some other concurrent writes, so the final result may be unexpected.
Retrying like this may overwrite updates happening on the same file. I would rather give control of retries to the caller than silently brute-force our way through a file lock for 2 seconds.

@ycombinator
Contributor Author

> Here we are writing file content and the conflict may arise from some other concurrent writes, so the final result may be unexpected.
> Retrying like this may overwrite updates happening on the same file. I would rather give control of retries to the caller than silently brute-force our way through a file lock for 2 seconds.

I'm not sure I understand how a conflict may arise. If another process is writing to this file, this function would not be able to rename it until the other process closes its handle, no?

@pchila
Member

pchila commented Aug 1, 2025

> Here we are writing file content and the conflict may arise from some other concurrent writes, so the final result may be unexpected.
> Retrying like this may overwrite updates happening on the same file. I would rather give control of retries to the caller than silently brute-force our way through a file lock for 2 seconds.
>
> I'm not sure I understand how a conflict may arise. If another process is writing to this file, this function would not be able to rename it until the other process closes its handle, no?

Yes, but depending on the use case the caller of this function may want to handle it differently.
With this PR there is no chance to handle anything; we would always retry for 2 seconds, and I am not sure that is always OK.

@ycombinator
Contributor Author

ycombinator commented Aug 1, 2025

Alright, I'm not really convinced that the retrying shouldn't just be an implementation detail for Windows, but I see I'm in the minority here. So I'll concede and make a new function that explicitly gives the caller more control over the rename.

How much control do we want to give the caller here? Is it sufficient to let the caller provide the retry frequency and duration (as @swiatekm suggested earlier)? Or are you thinking of something more than that, like a callback that is called when the rename fails, allowing the caller to decide completely what the next action should be (e.g. retry immediately, retry after some time, do nothing, do something else first and then retry, etc.)?

[EDIT] Come to think of the callback idea: that sounds like overkill, as the caller could simply use the function as it is today and implement any additional logic around it on its end. So I would not go with this idea. Either we allow the caller to control the retry frequency and duration, or we leave this function as it is today (before this PR) and implement the retry logic entirely on the caller's end around the existing function call. WDYT?
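
To make the caller-side alternative concrete, something along these lines could wrap the existing function at the call site (the helper name is hypothetical; this assumes the current SafeFileRotate(path, tempfile string) error signature):

```go
package example

import (
	"time"

	"github.com/elastic/elastic-agent-libs/file"
)

// rotateWithRetry retries the whole rotation at the call site, leaving
// SafeFileRotate itself untouched.
func rotateWithRetry(path, tempfile string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := file.SafeFileRotate(path, tempfile)
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return err
		}
		time.Sleep(interval)
	}
}
```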

@pchila
Member

pchila commented Aug 1, 2025

> [EDIT] Come to think of the callback idea: that sounds like overkill, as the caller could simply use the function as it is today and implement any additional logic around it on its end. So I would not go with this idea. Either we allow the caller to control the retry frequency and duration, or we leave this function as it is today (before this PR) and implement the retry logic entirely on the caller's end around the existing function call. WDYT?

I would be in favor of handling the retries at the point where new content is saved to fleet.enc by calling this function.

@ycombinator
Contributor Author

> [EDIT] Come to think of the callback idea: that sounds like overkill, as the caller could simply use the function as it is today and implement any additional logic around it on its end. So I would not go with this idea. Either we allow the caller to control the retry frequency and duration, or we leave this function as it is today (before this PR) and implement the retry logic entirely on the caller's end around the existing function call. WDYT?
>
> I would be in favor of handling the retries at the point where new content is saved to fleet.enc by calling this function.

Got it, thanks. I'll close this PR then to leave the implementation of SafeFileRotate as it is today and open a new PR against elastic-agent to add the retry logic where fleet.enc is being saved by calling SafeFileRotate.

@ycombinator ycombinator closed this Aug 1, 2025
@ycombinator ycombinator deleted the win-safe-rotate branch August 1, 2025 12:46
@ycombinator ycombinator restored the win-safe-rotate branch August 20, 2025 13:58
@ycombinator
Contributor Author

Bringing back the implementation in this PR in place of elastic/elastic-agent#9224. See elastic/elastic-agent#9224 (comment) for details.

@ycombinator ycombinator reopened this Aug 20, 2025
pkoutsovasilis
pkoutsovasilis previously approved these changes Aug 20, 2025
Contributor

@pkoutsovasilis pkoutsovasilis left a comment

LGTM; with the variadic opts approach I think all sides of the discussion above are addressed.

Contributor

@blakerouse blakerouse left a comment

Thanks for making Windows the default.

@ycombinator ycombinator enabled auto-merge (squash) August 20, 2025 15:58
@elasticmachine
Collaborator

💚 Build Succeeded


@ycombinator ycombinator merged commit 0a783a4 into elastic:main Aug 20, 2025
5 checks passed
@ycombinator ycombinator deleted the win-safe-rotate branch August 20, 2025 16:04
