Skip to content

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Mar 28, 2025

What does this PR do?

Calling proc.Kill causes the process to exit immediately, which
prevents Beats from executing some clean up functions.

Now windows.GenerateConsoleCtrlEvent to send a CTRL+C signal that is
haled as a SIGINT, thus allowing the Beat to execute all necessary
clean up.

Why is it important?

It allows Beats or any subprocess from the Elastic-Agent to gracefully terminate

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

## Disruptive User Impact

How to test this PR locally

TODO

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...
@belimawr belimawr added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Mar 28, 2025
@belimawr belimawr self-assigned this Mar 28, 2025
Copy link
Contributor

mergify bot commented Mar 28, 2025

⚠️ The sha of the head commit of this PR conflicts with #7604. Mergify cannot evaluate rules on this PR. ⚠️

Calling proc.Kill causes the process to exit immediately, which prevents Beats from executing some clean up functions. Now windows.GenerateConsoleCtrlEvent to send a CTRL+C signal that is haled as a SIGINT, thus allowing the Beat to execute all necessary clean up.
@belimawr belimawr force-pushed the 6875-do-not-kill-process-on-windows branch from 05ab75d to de315e9 Compare March 28, 2025 19:53
Copy link
Contributor

mergify bot commented Mar 28, 2025

This pull request does not have a backport label. Could you fix it @belimawr? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.
@blakerouse
Copy link
Contributor

@belimawr I believe you are going to have to update the fake component code to handle that signal on Windows. I don't believe it does currently and that is why you are getting those unit test failures.

@belimawr
Copy link
Contributor Author

@belimawr I believe you are going to have to update the fake component code to handle that signal on Windows. I don't believe it does currently and that is why you are getting those unit test failures.

The fake component does handle the signals:

signal.Notify(n, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)
defer func() {
signal.Stop(n)
cancel()
}()
go func() {
select {
case <-n:
cancel()
case <-ctx.Done():
}
}()

I didn't get to do any test with the fake component, but as far as I can see it would handle it well. Some log entries would help to inspect its behaviour from the outside.

However I had other problems to get the Elastic-Agent running with the fake component, so I gave up on it and I'm just using Filebeat.

@cmacknz
Copy link
Member

cmacknz commented Mar 31, 2025

It does look to me like Go's OS package automatically converts CTRL_BREAK_EVENT to SIGINT for us so that the Beats and other client's won't need any other changes to handle this. I didn't expect this when I suggested CTRL_BREAK_EVENT, I thought we'd need to implement something on the client side as well.

https://github.com/golang/go/blob/b9cbb65384f6bebd58f7a8354759b8c7b1bbb13f/src/runtime/os_windows.go#L1097-L1102

@cmacknz
Copy link
Member

cmacknz commented Mar 31, 2025

Also, to note the outcome of #6875 is not "Beats get SIGINT" it is "Beat graceful shutdown logic executes" which is slightly different.

This is the only change necessary if graceful shutdown on windows is correctly hooked up to SIGINT and I wouldn't trust that without explicitly testing it. There is another bug report about lack of graceful shutdown in APM server on Linux that is a hint to me that even if you got SIGINT, you might not be done fixing the problems.

@cmacknz
Copy link
Member

cmacknz commented Mar 31, 2025

Also on Linux, we use SIGTERM: https://github.com/elastic/elastic-agent/blob/6b4b0dc69bca88fbdd4e5dca869645c05755e734/pkg/core/process/cmd_linux.go#L55-L58

So using SIGINT on Windows creates inconsistency, looking at the Go source we'd want one of these three signals to mimic SIGINT on Windows.

https://github.com/golang/go/blob/b9cbb65384f6bebd58f7a8354759b8c7b1bbb13f/src/runtime/os_windows.go#L1097-L1102

case _CTRL_CLOSE_EVENT, _CTRL_LOGOFF_EVENT, _CTRL_SHUTDOWN_EVENT: s = _SIGTERM

If we for some reason want to switch to SIGINT, we should do it everywhere in case the behavior is different across the different clients. There are non-Go clients that might not handle SIGINT the way beats do.

@belimawr
Copy link
Contributor Author

belimawr commented Apr 1, 2025

Also, to note the outcome of #6875 is not "Beats get SIGINT" it is "Beat graceful shutdown logic executes" which is slightly different.

Yes, I'm well aware of it. I tested your suggestion of sending CTRL_C_EVENT and it worked perfectly with the current small change on this PR. I tested with Filebeat and the whole shutdown process works well, I even tested with the EWT integration as described by #6875, and the ETW trace is removed as expected, as well as I can see all the shutdown logs from Filebeat.

This is the only change necessary if graceful shutdown on windows is correctly hooked up to SIGINT and I wouldn't trust that without explicitly testing it. There is another bug report about lack of graceful shutdown in APM server on Linux that is a hint to me that even if you got SIGINT, you might not be done fixing the problems.

Once again, yes, that is the case. All Beats handle SIGINT correctly, we even log it as the cause for the shutdown. I'm writing integration tests that ensures the Beats (Filebeat) shutdown process works correctly, so not only the Elastic-Agent is sending the correct signal, but also ensure Beats is handling it as expected.

@cmacknz
Copy link
Member

cmacknz commented Apr 1, 2025

Awesome thanks for confirming!

Then my main comment is to prefer keeping to SIGTERM for consistency. We have several agent clients that are not Beats or even written in Go and introducing SIGINT instead introduces inconsistency that might not be handled properly everywhere.

@belimawr
Copy link
Contributor Author

belimawr commented Apr 1, 2025

Awesome thanks for confirming!

Then my main comment is to prefer keeping to SIGTERM for consistency. We have several agent clients that are not Beats or even written in Go and introducing SIGINT instead introduces inconsistency that might not be handled properly everywhere.

The Elastic-Agent code already uses SIGTERM to all OSes but Windows, I'll look on how to send one of the Windows signals Go converts to SIGTERM on Windows.

Looking at the Go doc:

Additionally, if Notify is called, and Windows sends CTRL_CLOSE_EVENT, CTRL_LOGOFF_EVENT or CTRL_SHUTDOWN_EVENT to the process, Notify will return syscall.SIGTERM. Unlike Control-C and Control-Break, Notify does not change process behavior when either CTRL_CLOSE_EVENT, CTRL_LOGOFF_EVENT or CTRL_SHUTDOWN_EVENT is received - the process will still get terminated unless it exits. But receiving syscall.SIGTERM will give the process an opportunity to clean up before termination.

That might not be exactly what we want. It is a bit unclear to me what process will still get terminated exactly means for the process (is it a grace period?), anyways I'll do some more research and testing to find a reliable solution.

@belimawr belimawr force-pushed the 6875-do-not-kill-process-on-windows branch from 2a54f06 to 76f9b90 Compare April 3, 2025 14:52
@belimawr
Copy link
Contributor Author

belimawr commented Apr 4, 2025

Quick update on this PR: There are two problems preventing the graceful shutdown of Beats:

  1. The signal was not correctly sent on Windows, we were killing the Beat instead of signalling for a graceful shutdown.
  2. Even if we send the correct signal, Elastic-Agent does not wait for the components to gracefully shutdown and exit when the agent itself is shutting down. Which makes the Elastic-Agent process to terminate before the components/Beats have time to actually gracefully exit.

I have a fix for #1 (315c33f) and a lot of print debug (1f6834d) investigating #2

@elasticmachine
Copy link
Collaborator

elasticmachine commented Apr 4, 2025

💔 Build Failed

Failed CI Steps

History

cc @belimawr

@cmacknz
Copy link
Member

cmacknz commented Apr 4, 2025

Even if we send the correct signal, Elastic-Agent does not wait for the components to gracefully shutdown and exit when the agent itself is shutting down. Which makes the Elastic-Agent process to terminate before the components/Beats have time to actually gracefully exit.

Fix 2 in a separate PR or just defer it to a separate issue, this will need to be a net new feature that can be configured. Not all components are Beats or well behaved, and in the case of Defend has a lifetime separate from agent.

The only case I would consider tackling here is uninstall where we have a nice place to hook into (only handles defend right now) and the configuration for wait time can be handled with a command line parameter. This still feels like a candidate for a separate PR.

if err := uninstallComponents(ctx, cfgFile, uninstallToken, log, pt, unprivileged); err != nil {
// If service status was running it was stopped to uninstall the components.
// If the components uninstall failed start the service again
if status == service.StatusRunning {
if startErr := StartService(topPath); startErr != nil {
// context for the error already provided in the StartService function
return err
}
}
return fmt.Errorf("error uninstalling components: %w", err)
}

Solving this sounds like it will require multiple steps I'd rather treat the original issue more like a parent issue to the sub-issue steps then try to get it all right at once.

@cmacknz
Copy link
Member

cmacknz commented Apr 4, 2025

For the change in 315c33f, make sure you have no defeated the job hierarchy that let's the sub-processes terminate when the agent as parent process terminates.

You want to be able to signal the child processes independent of the parent, but if the parent is terminated the child processes need to terminate as well.

Copy link

Quality Gate failed Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@belimawr
Copy link
Contributor Author

belimawr commented Apr 7, 2025

For the change in 315c33f, make sure you have no defeated the job hierarchy that let's the sub-processes terminate when the agent as parent process terminates.

You want to be able to signal the child processes independent of the parent, but if the parent is terminated the child processes need to terminate as well.

@cmacknz I've been research about it and doing some testing, and even without setting any flag (a plain exec.Command), on Windows, if the parent exits, the child stays executing. Setting CREATE_NEW_PROCESS_GROUP does not directly affect the behaviour of the child once the parent process exits.

The good news is that when Elastic-Agent runs and exits, it does a good job signalling Beats to exit and Beats do a good job setting the correct signal handlers and responding to them.

The Elastic-Agent even launches a goroutine that kills the child process if it does not exit in a pre-defined timeout (default is 30s).

If anything, the Elastic-Agent process is doing a pretty good job at ensure all of its child process stop running once it exits. Hence, Beats not running their teardown.

I haven't dug deep enough to find exactly the point in the code the Beats get killed during the shutdown. Tomorrow I'll look a little more into it.

@belimawr
Copy link
Contributor Author

belimawr commented Apr 8, 2025

I finally found it! The TL;DR: (I'll create an issue describing the problem in details)
At startup we create a "JobObject", then whenever a child process is created, we assign it to this JobObject, once the process holding the JobObject (the Elastic-Agent process) exits, all other process associated with it are killed.

@belimawr
Copy link
Contributor Author

belimawr commented Apr 8, 2025

I created a detailed issue: #7756

@belimawr
Copy link
Contributor Author

belimawr commented Apr 8, 2025

Closing this PR in favour of #7738 and the follow up issue #7756

@belimawr belimawr closed this Apr 8, 2025
@cmacknz
Copy link
Member

cmacknz commented Apr 8, 2025

At startup we create a "JobObject", then whenever a child process is created, we assign it to this JobObject, once the process holding the JobObject (the Elastic-Agent process) exits, all other process associated with it are killed.

Yes, the JobObject is like the Windows version of a Linux cgroup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

4 participants