- Notifications
You must be signed in to change notification settings - Fork 204
Allow Endpoint service to keep running if its install exit code is non-fatal #9320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
|
It comes from the check command during upgrades as well, since the |
| There is another ongoing support case where we suspect a spurious error caused us to get into this situation, I think because any error causes us to stop the connection info server and not restart it again. This makes me think we should have us periodically re-attempt starting endpoint on failure at a low rate (could just be a constant 30s). I still think the non fatal error approach is best for the known case of invalid uninstall token on rollback, but there could be other cases where things are left in a broken state otherwise. |
I'll keep this PR here about the non-fatal exit code. Here's a separate PR for restarting endpoint: #9313. It's missing the 30s delay there; I'll add it. |
4c6a087 to 350cd7d Compare 350cd7d to 94d220e Compare | Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
| @ycombinator 👋 this PR should help with the CI failures in these PRs right?
|
Ummm, no, because in these cases it's Endpoint's Also, this PR here was not made in response to those CI failures. It was more in response to SDHs. Something must've changed recently; I don't recall those tests failing consistently in CI like this. |
| Unit test introduced in this PR, |
…non-fatal (#9320) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d)
…non-fatal (#9320) (#9389) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
…non-fatal (#9320) (#9387) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
…non-fatal (#9320) (#9388) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
…non-fatal (#9320) (#9390) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
…non-fatal (#9320) (#9391) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard (cherry picked from commit 958814d) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
| @cmacknz which version is this planned to ship in & when, if you don't mind me asking? |
| It'll be in the next patch for each version. There will be both a 9.0.6 and a 9.1.3 very soon that will contain this fix. 8.18.6 and 8.19.3 will come a little bit after that with the fix in that series. |
…non-fatal (elastic#9320) * Allow services to keep running if install exit code is non-fatal * Improve logging * Adding a mock binary * Adding test * Adding missing headers * Adding CHANGELOG entry * Add error assertions * Use npipe.Dial if OS is Windows * Add .exe extension to mock binary file name * Log mock binary filename in test * Add nil guard
| On Windows we need #9445 will be in the upcoming 9.1.4. |
| There is also another active support case related to the orphaned status that is possibly a different issue, but there are no conclusions yet. That investigation is with the Elastic Defend engineers right now. |
| Coming back with the outcome of that other support case, there is an issue in endpoint as well that will also be fixed in 9.1.4. Quoting one of the Defend engineers:
The orphaned status reporting has let us a detect a lot of previously invisible problems we are working through fixing. |

What does this PR do?
This PR allows services such as Endpoint to define a list of non-fatal exit codes for any of their operations. If the service runtime detects such a non-fatal exit code when it attempts to execute the operation, it will log a warning and continue to run the service.
Why is it important?
Exit code 28 from Endpoint is non-fatal because the Endpoint service is actually still running and able to check-in with Elastic Agent. This exit code is returned by the Endpoint
checkandinstalloperations when Agent is unable to (re)install Endpoint while being tamper protected.The fix in this PR is a temporary fix for a corner case bug that arises when the following scenario unfolds:
UPG_WATCHINGstate and Endpoint has been upgraded, the Upgrade Watcher encounters a situation that causes it to roll back Agent, e.g. the Agent service is killed.installoperation, which attempts to rollback Endpoint but isn't able to due to tamper protection. The operation exits with a28exit code, which Agent interprets as a fatal error and reports the Endpoint component asFAILED.ORPHANED.Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration files./changelog/fragmentsusing the changelog toolI have added an integration test or an E2E testDisruptive User Impact
None; in the corner case described above, Agents that were reporting as Orphaned will now instead report as Healthy. Agent logs will contain a warning about continuing to run the Endpoint service after receiving a
28non-fatal exit code.How to test this PR locally
Ensure Agent remains healthy through upgrade rollback
AGENT_PACKAGE_VERSIONlower than the latest so you can easily upgrade the Agent later.sudo /opt/Elastic/Endpoint/elastic-endpoint versionuntil you notice that the Endpoint version has been upgraded.sudo systemctl stop elastic-agent.Ensure Agent diagnostics show that Endpoint has upgraded while Agent has rolled back
cat version.txt. Ensure that this is the pre-upgrade (rolled back) version.cat components/endpoint-default/version.txt. Ensure that this is the post-upgrade version.Related issues
Questions to ask yourself