- Notifications
You must be signed in to change notification settings - Fork 198
fix: blocking issues of runtime communicator #8881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: blocking issues of runtime communicator #8881
Conversation
| Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good. I like all the additional testing. That will hopefully keep this from dead locks in the future as well.
I do have one comment that is inline. See if it make sense.
|
💛 Build succeeded, but was flaky
Failed CI StepsHistory
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. With the latest change, this looks good.
| @Mergifyio backport 8.17 8.18 8.19 9.0 9.1 |
✅ Backports have been created
|
* ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce)
* ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce)
* ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce)
* ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce)
* ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce)
* ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce) Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
…8945) * fix: blocking issues of runtime communicator (#8881) * ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce) * fix: G115 linter error --------- Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
* ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce) Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
…8944) * fix: blocking issues of runtime communicator (#8881) * ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce) * fix: G115 linter error --------- Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
* ci: write unit-tests for runtime_comm.go * fix: blocking issues in runtime_comm.go * fix: QF1004 use strings.ReplaceAll * fix: guard closing c.runtimeCheckinDone with a local variable (cherry picked from commit 0d316ce) Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>




What does this PR do?
This PR fixes multiple deadlock conditions in the runtime communicator logic and improves overall lifecycle handling. Specifically, it:
CheckinExpected(init expected check-in) and the communicator is already destroyed.CheckinExpected(init expected check-in) and the server has been disconnected.checkinmethod, which also allows returning accurate gRPC status codes back to the client. (PS: this optimisation can be also applied for the actions handling)Why is it important?
These fixes ensure the runtime communicator behaves predictably and safely across lifecycle boundaries like shutdown, reconnection, and concurrent access. Without these changes, users could experience hangs or lost check-in signals in rare but critical failure modes.
You can see that the previous implementation fails to complete these scenarios under test in CI here
Checklist
./changelog/fragmentsusing the changelog toolDisruptive User Impact
This is a bug-fix for internal lifecycle logic and is not expected to introduce changes in behaviour or configuration requirements for end users.
How to test this PR locally
Related issues