@pkoutsovasilis pkoutsovasilis commented Jul 8, 2025

What does this PR do?

This PR fixes multiple deadlock conditions in the runtime communicator logic and improves overall lifecycle handling. Specifically, it:

  • Handles potential deadlock when sending the initial observed message to the runtime if the communicator is destroyed.
  • Handles deadlock when the runtime calls CheckinExpected (init expected check-in) and the communicator is already destroyed.
  • Handles deadlock when the runtime calls CheckinExpected (init expected check-in) and the server has been disconnected.
  • Although highly unlikely given the existing synchronisation primitives, no longer blocks on additional init check-in messages that arrive after the first one has completed the init check-in process.
  • Removes a redundant goroutine in the checkin method, which also allows accurate gRPC status codes to be returned to the client. (PS: the same optimisation could also be applied to the actions handling.)
  • Adds comprehensive unit tests covering all the above scenarios to prevent regressions.

Why is it important?

These fixes ensure the runtime communicator behaves predictably and safely across lifecycle boundaries like shutdown, reconnection, and concurrent access. Without these changes, users could experience hangs or lost check-in signals in rare but critical failure modes.

You can see that the previous implementation fails to complete these scenarios under test in CI here

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

This is a bug-fix for internal lifecycle logic and is not expected to introduce changes in behaviour or configuration requirements for end users.

How to test this PR locally

mage unitTest 

Related issues

@pkoutsovasilis pkoutsovasilis self-assigned this Jul 8, 2025
@pkoutsovasilis pkoutsovasilis added the bug, Team:Elastic-Agent-Control-Plane, and backport-active-all labels Jul 8, 2025
@pkoutsovasilis pkoutsovasilis requested a review from blakerouse July 8, 2025 10:54
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review July 8, 2025 10:54
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner July 8, 2025 10:54
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis requested a review from cmacknz July 8, 2025 10:54

@blakerouse blakerouse left a comment

Looks really good. I like all the additional testing. That will hopefully keep this from deadlocking in the future as well.

I do have one comment that is inline. See if it makes sense.

@elasticmachine

💛 Build succeeded, but was flaky

cc @pkoutsovasilis

@pkoutsovasilis pkoutsovasilis requested a review from blakerouse July 9, 2025 12:50

@blakerouse blakerouse left a comment

Thanks. With the latest change, this looks good.

@pkoutsovasilis pkoutsovasilis merged commit 0d316ce into elastic:main Jul 11, 2025
19 checks passed
@pkoutsovasilis pkoutsovasilis deleted the fix/runtime_comm_init_checkin branch July 11, 2025 03:11
@github-actions

@Mergifyio backport 8.17 8.18 8.19 9.0 9.1

@mergify

mergify bot commented Jul 11, 2025

mergify bot pushed a commit that referenced this pull request Jul 11, 2025
* ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce)
pkoutsovasilis added a commit that referenced this pull request Jul 11, 2025
* ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce) Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
pkoutsovasilis added a commit that referenced this pull request Jul 11, 2025
…8945) * fix: blocking issues of runtime communicator (#8881) * ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce) * fix: G115 linter error --------- Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
pkoutsovasilis added a commit that referenced this pull request Jul 11, 2025
…8944) * fix: blocking issues of runtime communicator (#8881) * ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce) * fix: G115 linter error --------- Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
Labels

  • backport-active-all: Automated backport with mergify to all the active branches
  • bug: Something isn't working
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team

3 participants