Skip to content

Conversation

@ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Aug 12, 2025

What does this PR do?

This PR allows services such as Endpoint to define a list of non-fatal exit codes for any of their operations. If the service runtime detects such a non-fatal exit code when it attempts to execute the operation, it will log a warning and continue to run the service.

Why is it important?

Exit code 28 from Endpoint is non-fatal because the Endpoint service is actually still running and able to check-in with Elastic Agent. This exit code is returned by the Endpoint check and install operations when Agent is unable to (re)install Endpoint while being tamper protected.

The fix in this PR is a temporary fix for a corner case bug that arises when the following scenario unfolds:

  • Agent is running with a policy that has Endpoint in it and has tamper protection enabled.
  • Agent is upgraded.
  • During the upgrade, while Agent is in the UPG_WATCHING state and Endpoint has been upgraded, the Upgrade Watcher encounters a situation that causes it to roll back Agent, e.g. the Agent service is killed.
  • Agent rolls back successfully but is unable to rollback Endpoint as well. This is because Agent's Service Runtime tries to invoke Endpoint's install operation, which attempts to rollback Endpoint but isn't able to due to tamper protection. The operation exits with a 28 exit code, which Agent interprets as a fatal error and reports the Endpoint component as FAILED.
  • Further, Agent shuts down the connection info server so even though Endpoint is actually still running (albeit the wrong version compared to Agent), Endpoint is unable to check-in with Agent and eventually reports itself as being ORPHANED.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

None; in the corner case described above, Agents that were reporting as Orphaned will now instead report as Healthy. Agent logs will contain a warning about continuing to run the Endpoint service after receiving a 28 non-fatal exit code.

How to test this PR locally

Ensure Agent remains healthy through upgrade rollback

  1. Build Agent with this PR and enroll it in Fleet. You may want to build the Agent with AGENT_PACKAGE_VERSION lower than the latest so you can easily upgrade the Agent later.
  2. Add the Elastic Defend integration to the Agent's policy.
  3. Enable tamper protection for the Agent.
  4. Upgrade the Agent.
  5. On the Agent's host, run sudo /opt/Elastic/Endpoint/elastic-endpoint version until you notice that the Endpoint version has been upgraded.
  6. Kill the Elastic Agent service. On Linux: sudo systemctl stop elastic-agent.
  7. Wait for Agent to be downgraded.
  8. Check the Fleet UI for the next few minutes and ensure that Agent keeps reporting itself as Healthy.

Ensure Agent diagnostics show that Endpoint has upgraded while Agent has rolled back

  1. Follow all the steps from the previous step.
  2. Request diagnostics and expand the diagnostics archive.
  3. Check the Agent version in the diagnostics: cat version.txt. Ensure that this is the pre-upgrade (rolled back) version.
  4. Check the Endpoint version in the diagnostics: cat components/endpoint-default/version.txt. Ensure that this is the post-upgrade version.

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...
@mergify
Copy link
Contributor

mergify bot commented Aug 12, 2025

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.
@cmacknz
Copy link
Member

cmacknz commented Aug 12, 2025

This exit code is returned by the Endpoint install operation when Agent is unable to (re)install Endpoint while being tamper protected.

It comes from the check command during upgrades as well, since the endpoint-security verify command is what upgrades endpoint.

@cmacknz
Copy link
Member

cmacknz commented Aug 12, 2025

There is another ongoing support case where we suspect a spurious error caused us to get into this situation, I think because any error causes us to stop the connection info server and not restart it again.

This makes me think we should have us periodically re-attempt starting endpoint on failure at a low rate (could just be a constant 30s). I still think the non fatal error approach is best for the known case of invalid uninstall token on rollback, but there could be other cases where things are left in a broken state otherwise.

@ycombinator
Copy link
Contributor Author

This makes me think we should have us periodically re-attempt starting endpoint on failure at a low rate (could just be a constant 30s). I still think the non fatal error approach is best for the known case of invalid uninstall token on rollback, but there could be other cases where things are left in a broken state otherwise.

I'll keep this PR here about the non-fatal exit code. Here's a separate PR for restarting endpoint: #9313. It's missing the 30s delay there; I'll add it.

@ycombinator ycombinator force-pushed the endpoint-28-not-fatal branch from 4c6a087 to 350cd7d Compare August 13, 2025 00:26
@ycombinator ycombinator marked this pull request as ready for review August 13, 2025 00:53
@ycombinator ycombinator requested a review from a team as a code owner August 13, 2025 00:53
@ycombinator ycombinator force-pushed the endpoint-28-not-fatal branch from 350cd7d to 94d220e Compare August 13, 2025 00:54
@ycombinator ycombinator requested a review from blakerouse August 13, 2025 00:59
@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Aug 13, 2025
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@ycombinator ycombinator added the backport-active-all Automated backport with mergify to all the active branches label Aug 13, 2025
@pkoutsovasilis
Copy link
Contributor

@ycombinator 👋 this PR should help with the CI failures in these PRs right?

@ycombinator
Copy link
Contributor Author

@ycombinator 👋 this PR should help with the CI failures in these PRs right?

Ummm, no, because in these cases it's Endpoint's uninstall command that's returning the 28 error code. This PR here only accepts that error as non-fatal when the service runtime is trying to start Endpoint, which calls its check and install commands.

Also, this PR here was not made in response to those CI failures. It was more in response to SDHs. Something must've changed recently; I don't recall those tests failing consistently in CI like this.

@ycombinator
Copy link
Contributor Author

Unit test introduced in this PR, TestCISKeepsRunningOnNonFatalExitCodeFromStart is failing on Windows. Moving PR back to draft while I investigate.

@ycombinator ycombinator marked this pull request as draft August 13, 2025 17:18
@ycombinator ycombinator marked this pull request as ready for review August 14, 2025 20:58
mergify bot pushed a commit that referenced this pull request Aug 15, 2025
…non-fatal (#9320) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d)
ycombinator added a commit that referenced this pull request Aug 15, 2025
…non-fatal (#9320) (#9389) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
@ycombinator ycombinator deleted the endpoint-28-not-fatal branch August 15, 2025 19:05
ycombinator added a commit that referenced this pull request Aug 15, 2025
…non-fatal (#9320) (#9387) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
ycombinator added a commit that referenced this pull request Aug 15, 2025
…non-fatal (#9320) (#9388) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
ycombinator added a commit that referenced this pull request Aug 15, 2025
…non-fatal (#9320) (#9390) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
ycombinator added a commit that referenced this pull request Aug 15, 2025
…non-fatal (#9320) (#9391) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
@kevinr-security
Copy link

@cmacknz which version is this planned to ship in & when, if you don't mind me asking?

@cmacknz
Copy link
Member

cmacknz commented Aug 22, 2025

It'll be in the next patch for each version. There will be both a 9.0.6 and a 9.1.3 very soon that will contain this fix.

8.18.6 and 8.19.3 will come a little bit after that with the fix in that series.

kaanyalti pushed a commit to kaanyalti/elastic-agent that referenced this pull request Sep 4, 2025
…non-fatal (elastic#9320) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard
@kevinr-security
Copy link

image

This issue does not appear to be fixed in 9.1.3 - The Orphaned problem that is.

@cmacknz
Copy link
Member

cmacknz commented Sep 9, 2025

On Windows we need #9445 will be in the upcoming 9.1.4.

@cmacknz
Copy link
Member

cmacknz commented Sep 9, 2025

There is also another active support case related to the orphaned status that is possibly a different issue, but there are no conclusions yet. That investigation is with the Elastic Defend engineers right now.

@cmacknz
Copy link
Member

cmacknz commented Sep 10, 2025

Coming back with the outcome of that other support case, there is an issue in endpoint as well that will also be fixed in 9.1.4.

Quoting one of the Defend engineers:

We're already aware of a regression problem introduced in:
v9.1.1, v9.1.2, v9.1.3, v9.0.5, v9.0.6, v8.19.1, v8.19.2, v8.19.3, v8.18.5, v8.18.6

With the mentioned release we changed behavior of verify command to check if Endpoint service is really running, start if it's not running, otherwise report error to Agent to reinstall it.

We've pinpointed the problem to a tiny race condition in comms routine, which we fixed and will release with next patch releases: v9.1.4, v9.0.7, v8.19.4, v8.18.7

The problem happens only on Windows.

In the log the relevant messages are:

error: Main.cpp:652 Failed to start endpoint services: Unreachable: exit status 30, try install" error: Util.cpp:525 Setting up minifilter registry keys failed.: exit status 284 However bear in mind that the customer mentioned problems with v9.0.4 are not related to this. 

The workaround is to restart the Agent service whilst the Endpoint service is already running.

The orphaned status reporting has let us a detect a lot of previously invisible problems we are working through fixing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-active-all Automated backport with mergify to all the active branches Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

6 participants