- Notifications
You must be signed in to change notification settings - Fork 197
6875 do not kill process on windows #7624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
6875 do not kill process on windows #7624
Conversation
|
Calling proc.Kill causes the process to exit immediately, which prevents Beats from executing some clean up functions. Now windows.GenerateConsoleCtrlEvent to send a CTRL+C signal that is haled as a SIGINT, thus allowing the Beat to execute all necessary clean up.
05ab75d
to de315e9
Compare This pull request does not have a backport label. Could you fix it @belimawr? 🙏
|
@belimawr I believe you are going to have to update the fake component code to handle that signal on Windows. I don't believe it does currently and that is why you are getting those unit test failures. |
The fake component does handle the signals: elastic-agent/pkg/component/fake/component/main.go Lines 46 to 57 in 8441cbe
I didn't get to do any test with the fake component, but as far as I can see it would handle it well. Some log entries would help to inspect its behaviour from the outside. However I had other problems to get the Elastic-Agent running with the fake component, so I gave up on it and I'm just using Filebeat. |
It does look to me like Go's OS package automatically converts |
Also, to note the outcome of #6875 is not "Beats get SIGINT" it is "Beat graceful shutdown logic executes" which is slightly different. This is the only change necessary if graceful shutdown on windows is correctly hooked up to SIGINT and I wouldn't trust that without explicitly testing it. There is another bug report about lack of graceful shutdown in APM server on Linux that is a hint to me that even if you got SIGINT, you might not be done fixing the problems. |
Also on Linux, we use SIGTERM: https://github.com/elastic/elastic-agent/blob/6b4b0dc69bca88fbdd4e5dca869645c05755e734/pkg/core/process/cmd_linux.go#L55-L58 So using SIGINT on Windows creates inconsistency, looking at the Go source we'd want one of these three signals to mimic SIGINT on Windows. case _CTRL_CLOSE_EVENT, _CTRL_LOGOFF_EVENT, _CTRL_SHUTDOWN_EVENT: s = _SIGTERM If we for some reason want to switch to SIGINT, we should do it everywhere in case the behavior is different across the different clients. There are non-Go clients that might not handle SIGINT the way beats do. |
Yes, I'm well aware of it. I tested your suggestion of sending
Once again, yes, that is the case. All Beats handle SIGINT correctly, we even log it as the cause for the shutdown. I'm writing integration tests that ensures the Beats (Filebeat) shutdown process works correctly, so not only the Elastic-Agent is sending the correct signal, but also ensure Beats is handling it as expected. |
Awesome thanks for confirming! Then my main comment is to prefer keeping to SIGTERM for consistency. We have several agent clients that are not Beats or even written in Go and introducing SIGINT instead introduces inconsistency that might not be handled properly everywhere. |
The Elastic-Agent code already uses SIGTERM to all OSes but Windows, I'll look on how to send one of the Windows signals Go converts to SIGTERM on Windows. Looking at the Go doc:
That might not be exactly what we want. It is a bit unclear to me what process will still get terminated exactly means for the process (is it a grace period?), anyways I'll do some more research and testing to find a reliable solution. |
2a54f06
to 76f9b90
Compare Quick update on this PR: There are two problems preventing the graceful shutdown of Beats:
I have a fix for |
Fix 2 in a separate PR or just defer it to a separate issue, this will need to be a net new feature that can be configured. Not all components are Beats or well behaved, and in the case of Defend has a lifetime separate from agent. The only case I would consider tackling here is uninstall where we have a nice place to hook into (only handles defend right now) and the configuration for wait time can be handled with a command line parameter. This still feels like a candidate for a separate PR. elastic-agent/internal/pkg/agent/install/uninstall.go Lines 124 to 134 in 3229be3
Solving this sounds like it will require multiple steps I'd rather treat the original issue more like a parent issue to the sub-issue steps then try to get it all right at once. |
For the change in 315c33f, make sure you have no defeated the job hierarchy that let's the sub-processes terminate when the agent as parent process terminates. You want to be able to signal the child processes independent of the parent, but if the parent is terminated the child processes need to terminate as well. |
|
@cmacknz I've been research about it and doing some testing, and even without setting any flag (a plain The good news is that when Elastic-Agent runs and exits, it does a good job signalling Beats to exit and Beats do a good job setting the correct signal handlers and responding to them. The Elastic-Agent even launches a goroutine that kills the child process if it does not exit in a pre-defined timeout (default is 30s). If anything, the Elastic-Agent process is doing a pretty good job at ensure all of its child process stop running once it exits. Hence, Beats not running their teardown. I haven't dug deep enough to find exactly the point in the code the Beats get killed during the shutdown. Tomorrow I'll look a little more into it. |
I finally found it! The TL;DR: (I'll create an issue describing the problem in details) |
I created a detailed issue: #7756 |
Yes, the JobObject is like the Windows version of a Linux cgroup. |
What does this PR do?
Calling proc.Kill causes the process to exit immediately, which
prevents Beats from executing some clean up functions.
Now windows.GenerateConsoleCtrlEvent to send a CTRL+C signal that is
haled as a SIGINT, thus allowing the Beat to execute all necessary
clean up.
Why is it important?
It allows Beats or any subprocess from the Elastic-Agent to gracefully terminate
Checklist
I have made corresponding change to the default configuration files./changelog/fragments
using the changelog tool## Disruptive User ImpactHow to test this PR locally
TODO
Related issues
Questions to ask yourself