Skip to content

Conversation

andrewkroh
Copy link
Member

@andrewkroh andrewkroh commented May 22, 2025

What does this PR do?

Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when filepath.WalkDir reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive.

To handle this, fs.ErrNotExist errors are ignored when attempting to add log files.

Why is it important?

This race can happen at any time, so any user could be affected. The problem is worse if debug logging is enabled and logs are being rotated quickly.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Screenshots

This is the user reported error:

diag-error

via OCR, it says:

Error generating file: unable to open log file: open /opt/Elastic/Agent/ data/elastic-agent-8.17.0-96f2b9/ logs/elastic-agent-20250220-42457.ndjson: no such file or directory

Disruptive User Impact

How to test this PR locally

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...
@andrewkroh andrewkroh added the bug Something isn't working label May 22, 2025
Copy link
Contributor

mergify bot commented May 22, 2025

This pull request does not have a backport label. Could you fix it @andrewkroh? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.
Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when `filepath.WalkDir` reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive. To handle this, `fs.ErrNotExist` errors are ignored when attempting to add log files.
@andrewkroh andrewkroh force-pushed the fix/diag-zip-log-rotatation-race branch from d48294a to f1188bc Compare May 22, 2025 18:38
@andrewkroh andrewkroh marked this pull request as ready for review May 23, 2025 00:51
@andrewkroh andrewkroh requested a review from a team as a code owner May 23, 2025 00:51
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label May 23, 2025
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis added the backport-active-all Automated backport with mergify to all the active branches label May 23, 2025
pkoutsovasilis
pkoutsovasilis previously approved these changes May 23, 2025
Copy link
Contributor

@pkoutsovasilis pkoutsovasilis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside a nit about non-relevant to this PR linter issue this LGTM, thanks for handling it @andrewkroh 🙂 Also since this is a bug-fix I added the backport to all active branches

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @andrewkroh

Copy link
Contributor

@pkoutsovasilis pkoutsovasilis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🙂

@andrewkroh andrewkroh merged commit 751acc1 into elastic:main May 23, 2025
12 checks passed
Copy link
Contributor

@Mergifyio backport 8.17 8.18 8.19 9.0

Copy link
Contributor

mergify bot commented May 23, 2025

backport 8.17 8.18 8.19 9.0

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request May 23, 2025
Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when `filepath.WalkDir` reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive. To handle this, `fs.ErrNotExist` errors are ignored when attempting to add log files. Also, address and unrelated linter warning by using fmt.Fprintf. (cherry picked from commit 751acc1)
mergify bot pushed a commit that referenced this pull request May 23, 2025
Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when `filepath.WalkDir` reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive. To handle this, `fs.ErrNotExist` errors are ignored when attempting to add log files. Also, address and unrelated linter warning by using fmt.Fprintf. (cherry picked from commit 751acc1)
mergify bot pushed a commit that referenced this pull request May 23, 2025
Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when `filepath.WalkDir` reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive. To handle this, `fs.ErrNotExist` errors are ignored when attempting to add log files. Also, address and unrelated linter warning by using fmt.Fprintf. (cherry picked from commit 751acc1)
mergify bot pushed a commit that referenced this pull request May 23, 2025
Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when `filepath.WalkDir` reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive. To handle this, `fs.ErrNotExist` errors are ignored when attempting to add log files. Also, address and unrelated linter warning by using fmt.Fprintf. (cherry picked from commit 751acc1)
andrewkroh added a commit that referenced this pull request May 26, 2025
Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when `filepath.WalkDir` reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive. To handle this, `fs.ErrNotExist` errors are ignored when attempting to add log files. Also, address and unrelated linter warning by using fmt.Fprintf. (cherry picked from commit 751acc1) Co-authored-by: Andrew Kroh <andrew.kroh@elastic.co>
andrewkroh added a commit that referenced this pull request May 26, 2025
Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when `filepath.WalkDir` reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive. To handle this, `fs.ErrNotExist` errors are ignored when attempting to add log files. Also, address and unrelated linter warning by using fmt.Fprintf. (cherry picked from commit 751acc1) Co-authored-by: Andrew Kroh <andrew.kroh@elastic.co>
andrewkroh added a commit that referenced this pull request May 26, 2025
Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when `filepath.WalkDir` reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive. To handle this, `fs.ErrNotExist` errors are ignored when attempting to add log files. Also, address and unrelated linter warning by using fmt.Fprintf. (cherry picked from commit 751acc1) Co-authored-by: Andrew Kroh <andrew.kroh@elastic.co>
andrewkroh added a commit that referenced this pull request May 26, 2025
Handle a race condition that can occur in Agent diagnostics if log rotation happens while logs are being zipped. There is a time window between when `filepath.WalkDir` reads the directory contents and when log files are opened for reading. During this time window, log rotation can happen, resulting in files that no longer exist and cannot be added to the diagnostic archive. To handle this, `fs.ErrNotExist` errors are ignored when attempting to add log files. Also, address and unrelated linter warning by using fmt.Fprintf. (cherry picked from commit 751acc1) Co-authored-by: Andrew Kroh <andrew.kroh@elastic.co>
v1v added a commit to v1v/elastic-agent that referenced this pull request May 27, 2025
* main: chore: Update to elastic/beats@c51fee3a852e (elastic#8246) chore: Update to elastic/beats@82f47f73acbd (elastic#8239) Fix minor typo in beats update job (elastic#8233) fix(diagnostics): handle log rotation races (elastic#8215) Fix some minor issues with the beats version update job (elastic#8224) [main][Automation] Update elastic/beats to 9382cc20546b (elastic#8222) Automatically update beats module versions (elastic#8174) chore: deps(ironbank): Bump ubi version to 9.6 (elastic#8218)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-active-all Automated backport with mergify to all the active branches bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

4 participants