Conversation

mergify[bot]
Contributor

@mergify mergify bot commented May 23, 2025

What does this PR do?

Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when filepath.WalkDir reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive.

To handle this, fs.ErrNotExist errors are ignored when attempting to add log files.

Why is it important?

This race can happen at any time, so any user could be affected. The problem is worse if debug logging is enabled and logs are being rotated quickly.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Screenshots

This is the user reported error:

diag-error

via OCR, it says:

Error generating file: unable to open log file: open /opt/Elastic/Agent/data/elastic-agent-8.17.0-96f2b9/logs/elastic-agent-20250220-42457.ndjson: no such file or directory

Disruptive User Impact

How to test this PR locally

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

This is an automatic backport of pull request #8215 done by [Mergify](https://mergify.com).
Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when `filepath.WalkDir` reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive. To handle this, `fs.ErrNotExist` errors are ignored when attempting to add log files. Also, address an unrelated linter warning by using `fmt.Fprintf`. (cherry picked from commit 751acc1)
@mergify mergify bot requested a review from a team as a code owner May 23, 2025 15:43
@mergify mergify bot added the backport label May 23, 2025
@mergify mergify bot requested review from michel-laterman and pchila and removed request for a team May 23, 2025 15:43
@github-actions github-actions bot added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels May 23, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@elasticmachine
Collaborator

💚 Build Succeeded

cc @andrewkroh

@andrewkroh andrewkroh enabled auto-merge (squash) May 23, 2025 18:58
Contributor Author

mergify bot commented May 26, 2025

This pull request has not been merged yet. Could you please review and merge it @andrewkroh? 🙏

@andrewkroh andrewkroh merged commit 22367da into 8.19 May 26, 2025
14 checks passed
@andrewkroh andrewkroh deleted the mergify/bp/8.19/pr-8215 branch May 26, 2025 06:02
Labels

backport bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

3 participants