Allow Endpoint service to keep running if its `install` exit code is non-fatal #9320

ycombinator · 2025-08-12T04:02:51Z

What does this PR do?

This PR allows services such as Endpoint to define a list of non-fatal exit codes for any of their operations. If the service runtime detects such a non-fatal exit code when it attempts to execute the operation, it will log a warning and continue to run the service.
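
As a rough illustration of the idea (not the code in this PR), the check can be sketched as a per-operation lookup. The table name `nonFatalExitCodes`, the function `isNonFatal`, and the log wording are all hypothetical; the real spec format in `specs/endpoint-security.spec.yml` and the service runtime code may look different.

```go
package main

import (
	"fmt"
	"slices"
)

// nonFatalExitCodes is a hypothetical per-operation table of exit codes
// that should not stop the service. Only check and install are listed,
// mirroring the operations discussed in this PR.
var nonFatalExitCodes = map[string][]int{
	"check":   {28},
	"install": {28},
}

// isNonFatal reports whether exitCode, returned by the given operation,
// should be tolerated (logged as a warning) instead of failing the service.
func isNonFatal(operation string, exitCode int) bool {
	return slices.Contains(nonFatalExitCodes[operation], exitCode)
}

func main() {
	for _, op := range []string{"check", "install", "uninstall"} {
		if isNonFatal(op, 28) {
			fmt.Printf("WARN: %s exited with non-fatal code 28; service keeps running\n", op)
		} else {
			fmt.Printf("ERROR: %s exited with code 28; treating as failure\n", op)
		}
	}
}
```

Keeping the list per operation means the same exit code can be tolerated for one operation and still be fatal for another.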

Why is it important?

Exit code 28 from Endpoint is non-fatal because the Endpoint service is actually still running and able to check in with Elastic Agent. The Endpoint check and install operations return this exit code when Agent is unable to (re)install Endpoint while tamper protection is enabled.

The fix in this PR is a temporary fix for a corner case bug that arises when the following scenario unfolds:

1. Agent is running with a policy that has Endpoint in it and has tamper protection enabled.
2. Agent is upgraded.
3. During the upgrade, while Agent is in the UPG_WATCHING state and Endpoint has been upgraded, the Upgrade Watcher encounters a situation that causes it to roll back Agent, e.g. the Agent service is killed.
4. Agent rolls back successfully but is unable to roll back Endpoint as well. This is because Agent's Service Runtime tries to invoke Endpoint's install operation, which attempts to roll back Endpoint but cannot because of tamper protection. The operation exits with exit code 28, which Agent interprets as a fatal error, so it reports the Endpoint component as FAILED.
5. Further, Agent shuts down the connection info server, so even though Endpoint is actually still running (albeit at the wrong version compared to Agent), Endpoint is unable to check in with Agent and eventually reports itself as ORPHANED.

Checklist

I have read and understood the pull request guidelines of this project.
My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~I have made corresponding changes to the documentation~~
~~I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool
~~I have added an integration test or an E2E test~~

Disruptive User Impact

None; in the corner case described above, Agents that were reporting as Orphaned will now instead report as Healthy. Agent logs will contain a warning about continuing to run the Endpoint service after receiving a 28 non-fatal exit code.

How to test this PR locally

Ensure Agent remains healthy through upgrade rollback

1. Build Agent with this PR and enroll it in Fleet. You may want to build the Agent with `AGENT_PACKAGE_VERSION` lower than the latest so you can easily upgrade the Agent later.
2. Add the Elastic Defend integration to the Agent's policy.
3. Enable tamper protection for the Agent.
4. Upgrade the Agent.
5. On the Agent's host, run `sudo /opt/Elastic/Endpoint/elastic-endpoint version` until you notice that the Endpoint version has been upgraded.
6. Kill the Elastic Agent service. On Linux: `sudo systemctl stop elastic-agent`.
7. Wait for Agent to be downgraded.
8. Check the Fleet UI for the next few minutes and ensure that Agent keeps reporting itself as Healthy.

Ensure Agent diagnostics show that Endpoint has upgraded while Agent has rolled back

1. Follow all the steps from the previous section.
2. Request diagnostics and expand the diagnostics archive.
3. Check the Agent version in the diagnostics: `cat version.txt`. Ensure that this is the pre-upgrade (rolled back) version.
4. Check the Endpoint version in the diagnostics: `cat components/endpoint-default/version.txt`. Ensure that this is the post-upgrade version.

Related issues

Questions to ask yourself

How are we going to support this in production?
How are we going to measure its adoption?
How are we going to debug this?
What are the metrics I should take care of?
...

mergify · 2025-08-12T04:03:30Z

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
backport-active-all is the label that automatically backports to all active branches.
backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

cmacknz · 2025-08-12T16:32:03Z

> This exit code is returned by the Endpoint install operation when Agent is unable to (re)install Endpoint while being tamper protected.

It comes from the check command during upgrades as well, since the endpoint-security verify command is what upgrades endpoint.

cmacknz · 2025-08-12T16:56:03Z

There is another ongoing support case where we suspect a spurious error caused us to get into this situation, I think because any error causes us to stop the connection info server and not restart it again.

This makes me think we should periodically re-attempt starting endpoint on failure at a low rate (could just be a constant 30s). I still think the non-fatal error approach is best for the known case of an invalid uninstall token on rollback, but there could be other cases where things are otherwise left in a broken state.

ycombinator · 2025-08-12T17:19:06Z

> This makes me think we should periodically re-attempt starting endpoint on failure at a low rate (could just be a constant 30s). I still think the non-fatal error approach is best for the known case of an invalid uninstall token on rollback, but there could be other cases where things are otherwise left in a broken state.

I'll keep this PR here about the non-fatal exit code. Here's a separate PR for restarting endpoint: #9313. It's missing the 30s delay there; I'll add it.

elasticmachine · 2025-08-13T00:59:42Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

pkoutsovasilis · 2025-08-13T04:57:49Z

@ycombinator 👋 this PR should help with the CI failures in these PRs right?

[main][Automation] Update versions #9339
[8.19][Automation] Update versions #9335
[9.1][Automation] Update versions #9334 (the Debian and Ubuntu failures)

ycombinator · 2025-08-13T05:07:12Z

> @ycombinator 👋 this PR should help with the CI failures in these PRs right?
>
> * [main][Automation] Update versions #9339
> * [8.19][Automation] Update versions #9335
> * [9.1][Automation] Update versions #9334 (the Debian and Ubuntu failures)

Ummm, no, because in these cases it's Endpoint's uninstall command that's returning the 28 error code. This PR here only accepts that error as non-fatal when the service runtime is trying to start Endpoint, which calls its check and install commands.

Also, this PR here was not made in response to those CI failures. It was more in response to SDHs. Something must've changed recently; I don't recall those tests failing consistently in CI like this.

ycombinator · 2025-08-13T17:17:55Z

The unit test introduced in this PR, `TestCISKeepsRunningOnNonFatalExitCodeFromStart`, is failing on Windows. Moving the PR back to draft while I investigate.

…non-fatal (#9320)

* Allow services to keep running if install exit code is non-fatal
* Improve logging
* Adding a mock binary
* Adding test
* Adding missing headers
* Adding CHANGELOG entry
* Add error assertions
* Use npipe.Dial if OS is Windows
* Add .exe extension to mock binary file name
* Log mock binary filename in test
* Add nil guard

(cherry picked from commit 958814d)

The same cherry-picked commit is also referenced from #9387, #9388, #9389, #9390, and #9391 (Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>).

specs/endpoint-security.spec.yml

kevinr-security · 2025-08-22T15:18:58Z

@cmacknz which version is this planned to ship in & when, if you don't mind me asking?

cmacknz · 2025-08-22T15:23:45Z

It'll be in the next patch for each version. There will be both a 9.0.6 and a 9.1.3 very soon that will contain this fix.

8.18.6 and 8.19.3 will come a little bit after that with the fix in that series.


kevinr-security · 2025-09-09T12:30:28Z

This issue does not appear to be fixed in 9.1.3 - The Orphaned problem that is.

cmacknz · 2025-09-09T17:25:25Z

On Windows we also need #9445, which will be in the upcoming 9.1.4.

cmacknz · 2025-09-09T17:45:47Z

There is also another active support case related to the orphaned status that is possibly a different issue, but there are no conclusions yet. That investigation is with the Elastic Defend engineers right now.

cmacknz · 2025-09-10T19:33:41Z

Coming back with the outcome of that other support case, there is an issue in endpoint as well that will also be fixed in 9.1.4.

Quoting one of the Defend engineers:

> We're already aware of a regression introduced in:
> v9.1.1, v9.1.2, v9.1.3, v9.0.5, v9.0.6, v8.19.1, v8.19.2, v8.19.3, v8.18.5, v8.18.6
>
> With the mentioned releases we changed the behavior of the verify command to check whether the Endpoint service is really running, start it if it's not running, and otherwise report an error to Agent to reinstall it.
>
> We've pinpointed the problem to a tiny race condition in the comms routine, which we fixed and will release with the next patch releases: v9.1.4, v9.0.7, v8.19.4, v8.18.7
>
> The problem happens only on Windows.
>
> In the log the relevant messages are:
>
> * `error: Main.cpp:652 Failed to start endpoint services: Unreachable: exit status 30, try install"`
> * `error: Util.cpp:525 Setting up minifilter registry keys failed.: exit status 284`
>
> However, bear in mind that the problems the customer mentioned with v9.0.4 are not related to this.
>
> The workaround is to restart the Agent service whilst the Endpoint service is already running.

The orphaned status reporting has let us detect a lot of previously invisible problems that we are working through fixing.

mergify bot assigned ycombinator Aug 12, 2025

ycombinator force-pushed the endpoint-28-not-fatal branch from 4c6a087 to 350cd7d Compare August 13, 2025 00:26

ycombinator marked this pull request as ready for review August 13, 2025 00:53

ycombinator requested a review from a team as a code owner August 13, 2025 00:53

ycombinator requested review from michalpristas and pchila August 13, 2025 00:53

ycombinator added 4 commits August 12, 2025 17:54

Allow services to keep running if install exit code is non-fatal

801b20e

Improve logging

2efedfe

Adding a mock binary

282d62f

Adding test

94d220e

ycombinator force-pushed the endpoint-28-not-fatal branch from 350cd7d to 94d220e Compare August 13, 2025 00:54

ycombinator requested a review from blakerouse August 13, 2025 00:59

ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Aug 13, 2025

ycombinator added the backport-active-all Automated backport with mergify to all the active branches label Aug 13, 2025

ycombinator added 2 commits August 12, 2025 21:35

Adding missing headers

f9b81af

Adding CHANGELOG entry

1a2cd97

Add error assertions

7b9161f

ycombinator marked this pull request as draft August 13, 2025 17:18

ycombinator added 3 commits August 13, 2025 16:54

Use npipe.Dial if OS is Windows

18d53ee

Add .exe extension to mock binary file name

996658b

Log mock binary filename in test

0c11de8

ycombinator marked this pull request as ready for review August 14, 2025 20:58

This was referenced Aug 15, 2025

[9.0] (backport #9320) Allow Endpoint service to keep running if its install exit code is non-fatal #9390

Merged

[9.1] (backport #9320) Allow Endpoint service to keep running if its install exit code is non-fatal #9391

Merged

ycombinator deleted the endpoint-28-not-fatal branch August 15, 2025 19:05

cmacknz reviewed Aug 15, 2025

specs/endpoint-security.spec.yml (review thread resolved)

ycombinator mentioned this pull request Aug 16, 2025

Add non-fatal exit code 28 to Endpoint check command #9401

Merged

cmacknz mentioned this pull request Aug 22, 2025

[windows] move service startup to beginning of run function #4971

Open

intxgo mentioned this pull request Sep 2, 2025

add non-fatal endpoint security exit code #9687

Merged
