Skip to content

CLOUDP-324440 Save rs member ids #206

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 3, 2025
Merged

Conversation

lucian-tosa
Copy link
Contributor

@lucian-tosa lucian-tosa commented Jun 20, 2025

Summary

This PR fixes the behaviour of changing the project in a MongoDBMultiCluster resource. Multicluster resources have a different method of computing the member ids of each process in a replicaset (replicasets field in AC). Because of that, it is possible to get non-sequential ids by doing scaling operations. For example, we start from this

  • mongodb-0-0: id 0
  • mongodb-0-1: id 1
  • mongodb-1-0: id 2

If we now scale the first cluster up we end up with:

  • mongodb-0-0: id 0
  • mongodb-0-1: id 1
  • mongodb-1-0: id 2
  • mongodb-0-2: id 3

If we now change the project, the operator will find an empty AC and will recalculate the ids it of the replicaset members.

  • mongodb-0-0: id 0
  • mongodb-0-1: id 1
  • mongodb-0-2: id 2
  • mongodb-1-0: id 3

But now, member 0-2 has id 2 instead of 3, and member 1-0 has id 3 instead of 2. This will instruct the agents to change the members of the replicaset, but this change will be rejected, leaving the deployment stuck. The operator can't retrieve the previous member ids.

In this PR we will now save the member ids we achieved in an annotation, and whenever there is an empty AC in OM, we try to read the annotation in case the deployment is being migrated to a new project.

Other considerations

  • The id of a replica set member has to be an integer, so we can't use the hostnames instead.
  • We can bring a future improvement to the id computation method by using a hash between cluster index and pod index. This way, any id the operator computes will be consistent and do not require reading the AC. The issue is that changing the id of a replicaset member is a convoluted process and migrating the old pseudo-sequential ids to hashed ids is extremely complex. This means that while there still are deployments that use these kind of id's we need to back these id's somewhere in case the deployment is moved to a different project. This idea will be discussed further.
  • If the MongoDBMultiCluster is backed up with annotations and then re-applied to a new project, the annotation will be used to compute the member ids. This is not an issue, there is no restrictions on these ids (apart from being an integer and unique). After the initial deployment, any scaling operations will simply compute new sequential ids.

Proof of Work

Added unit tests to check that annotations are created and updated successfully, as well as using the annotation when building the replicaset.
Also added an E2E test to ensure the resource will become ready.

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you checked for release_note changes?

Reminder (Please remove this when merging)

  • Please try to Approve or Reject Changes the PR, keep PRs in review as short as possible
  • Our Short Guide for PRs: Link
  • Remember the following Communication Standards - use comment prefixes for clarity:
    • blocking: Must be addressed before approval.
    • follow-up: Can be addressed in a later PR or ticket.
    • q: Clarifying question.
    • nit: Non-blocking suggestions.
    • note: Side-note, non-actionable. Example: Praise
    • --> no prefix is considered a question
@lucian-tosa lucian-tosa force-pushed the CLOUDP-324440-fix-rs-ids branch from add6a32 to 8ae690c Compare June 30, 2025 15:48
@lucian-tosa lucian-tosa marked this pull request as ready for review July 1, 2025 07:07
@lucian-tosa lucian-tosa requested a review from a team as a code owner July 1, 2025 07:07
@lucian-tosa lucian-tosa requested review from nammn and removed request for Julien-Ben July 1, 2025 08:39
@lucian-tosa lucian-tosa force-pushed the CLOUDP-324440-fix-rs-ids branch from f76fc3a to 1da9a93 Compare July 1, 2025 10:57
@lucian-tosa lucian-tosa requested a review from vinilage July 1, 2025 11:14
@nammn nammn requested a review from Copilot July 1, 2025 11:19
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes an issue with migrating a MongoDBMultiCluster resource to a new project by saving and restoring replica set member IDs in an annotation. It also updates test cases and refactors various functions to use the new GetReplicaSets naming convention.

  • Added a new constant for the replica set member IDs annotation.
  • Updated reconciliation logic to save and use member IDs during project changes.
  • Refactored tests and Om client methods for consistent API usage.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pkg/util/constants.go Introduces a new constant for saving RS member IDs.
docker/mongodb-kubernetes-tests/tests/multicluster/multi_cluster_scale_up_cluster.py Updates test scenarios to account for scaled member counts and project changes.
controllers/operator/mongodbmultireplicaset_controller.go Adds logic to retrieve and save RS member IDs from OM and updates the signature for saving the last achieved spec.
controllers/operator/mongodbmultireplicaset_controller_test.go Adds new test cases to verify annotation persistence and proper member ID usage.
controllers/om/* Adds new methods (e.g., MemberIds and GetReplicaSetMemberIds) and updates references to use the new GetReplicaSets API.
RELEASE_NOTES.md Documents the bug fix related to project migration for MongoDBMultiCluster resources.
Comments suppressed due to low confidence (1)

controllers/operator/mongodbmultireplicaset_controller.go:815

  • [nitpick] The inner variable 'processIds' is a nested map; consider renaming it (for example to 'allMemberIds') to more clearly reflect its structure and improve code readability.
func getReplicaSetProcessIdsFromAnnotation(mrs mdbmultiv1.MongoDBMultiCluster) map[string]int { 
if err != nil {
return err
}
if string(rsMemberIdsBytes) != "null" {
Copy link
Preview

Copilot AI Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider checking if rsMemberIds is nil or if the marshalled byte slice is empty instead of comparing to the string "null", as this would improve clarity and robustness if the serialization behavior changes.

Suggested change
if string(rsMemberIdsBytes) != "null" {
if rsMemberIds != nil && len(rsMemberIdsBytes) > 0 {

Copilot uses AI. Check for mistakes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: oh - yeah definitely please don't check for a null string. I think in other code places we have a better way of doing that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -699,6 +713,10 @@ func (r *ReconcileMongoDbMultiReplicaSet) updateOmDeploymentRs(ctx context.Conte
}

processIds := getReplicaSetProcessIdsFromReplicaSets(mrs.Name, existingDeployment)
// If there is no replicaset configuration saved in OM, it might be a new project, so we check the ids saved in annotation
if len(processIds) == 0 {
processIds = getReplicaSetProcessIdsFromAnnotation(mrs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use conn.GetReplicaSetMemberIds() here instead? Couldn't that save us the whole annotation work?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed offline - its the migration case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lucian-tosa Can you please explain the "migration case" that requires use of annotation?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: let's add a comment here on why this can happen (the migration case) for clarity

Copy link
Contributor Author

@lucian-tosa lucian-tosa Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the migration case is detailed in the PR description. The steps that lead to a broken deployment is:

  1. scale the deployment in a way that results in non-sequential ids
  2. move the deployment to a new project.

I also added some other considerations and edge cases to the PR description.
@anandsyncs let me know if I should explain something further

Copy link
Contributor

@lsierant lsierant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally - awesome fix! 👏
But I've left one blocking comment regarding not adding another method to the OM's connection interface. Consider the rest optional.

if err != nil {
return err
}
if string(rsMemberIdsBytes) != "null" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion.

processIds := make(map[string]map[string]int)
if processIdsStr, ok := mrs.Annotations[util.LastAchievedRsMemberIds]; ok {
if err := json.Unmarshal([]byte(processIdsStr), &processIds); err != nil {
return map[string]int{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider making this a fatal error? If there is some garbage in the annotation, isn't it too risky to ignore its content?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not - we should at least log this very clearly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only case where a broken annotation would be fatal is if the deployment is being migrated to a new OM project, and the annotation was manually changed at the same time.
I can't think of any other reason for why the annotation can have garbage, I wouldn't make this fatal. If there is garbage in the annotation, the operator will rewrite anyway.
Wdyt?

Copy link
Contributor

@lsierant lsierant Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is garbage in annotation that means someone or something changed it and in this case we should probably error the reconciliation as the annotation is there, but we don't have correct information from it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's like, check spec.version and find garbage in it - should we ignore it and continue reconciling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, done

@@ -97,7 +131,6 @@ def test_ops_manager_has_been_updated_correctly_before_scaling():

@pytest.mark.e2e_multi_cluster_scale_up_cluster
def test_scale_mongodb_multi(mongodb_multi: MongoDBMulti, member_cluster_clients: List[MultiClusterClient]):
mongodb_multi.load()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to not need load() you must change mongodb_multi's fixture scope to "function" to load it every time it's injected into function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -139,3 +159,59 @@ def test_ops_manager_has_been_updated_correctly_after_scaling():
def test_replica_set_is_reachable(mongodb_multi: MongoDBMulti, ca_path: str):
tester = mongodb_multi.tester()
tester.assert_connectivity(opts=[with_tls(use_tls=True, ca_path=ca_path)])


# From here on, the tests are for verifying that we can change the project of the MongoDBMulti resource even with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I find this comment a bit unnecessary, but you don't need to remove it.
It would be better to wrap such a set of test functions into a class and name it properly, e.g. TestNonSequentialMemberIdsInReplicaSet or sth like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

assert oldRsMembers == newRsMembers


def sts_are_ready(mdb: MongoDBMulti, member_cluster_clients: List[MultiClusterClient], members: List[int]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks to me like something we must already have somewhere in our tests. Or is it doing something different than just waiting for sts to be ready? If yes, then pls name it so we can see in the name why it's different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't already have it. But I will add it as part of the MongoDBMulti class so we can reuse it at least for that resource.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -141,6 +141,22 @@ func (oc *MockedOmConnection) ConfigureProject(project *Project) {
oc.context.OrgID = project.OrgID
}

func (oc *MockedOmConnection) GetReplicaSetMemberIds() (map[string]map[string]int, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR: Given that we are just copy pasting the the whole functionality, we can simply wrap the omclient with this mock client and only override what is necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove this method entirely following Lukasz's suggestion

@@ -699,6 +713,10 @@ func (r *ReconcileMongoDbMultiReplicaSet) updateOmDeploymentRs(ctx context.Conte
}

processIds := getReplicaSetProcessIdsFromReplicaSets(mrs.Name, existingDeployment)
// If there is no replicaset configuration saved in OM, it might be a new project, so we check the ids saved in annotation
if len(processIds) == 0 {
processIds = getReplicaSetProcessIdsFromAnnotation(mrs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lucian-tosa Can you please explain the "migration case" that requires use of annotation?

Copy link
Collaborator

@nammn nammn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome and simple code! Great tests and straight-forward. 2 blocking things and we are good to go

if err != nil {
return err
}
if string(rsMemberIdsBytes) != "null" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: oh - yeah definitely please don't check for a null string. I think in other code places we have a better way of doing that

@@ -699,6 +713,10 @@ func (r *ReconcileMongoDbMultiReplicaSet) updateOmDeploymentRs(ctx context.Conte
}

processIds := getReplicaSetProcessIdsFromReplicaSets(mrs.Name, existingDeployment)
// If there is no replicaset configuration saved in OM, it might be a new project, so we check the ids saved in annotation
if len(processIds) == 0 {
processIds = getReplicaSetProcessIdsFromAnnotation(mrs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: let's add a comment here on why this can happen (the migration case) for clarity

processIds := make(map[string]map[string]int)
if processIdsStr, ok := mrs.Annotations[util.LastAchievedRsMemberIds]; ok {
if err := json.Unmarshal([]byte(processIdsStr), &processIds); err != nil {
return map[string]int{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not - we should at least log this very clearly

Copy link
Contributor

@lsierant lsierant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@lucian-tosa lucian-tosa merged commit 36f1085 into master Jul 3, 2025
35 checks passed
@lucian-tosa lucian-tosa deleted the CLOUDP-324440-fix-rs-ids branch July 3, 2025 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
5 participants