Conversation

@xumengpanda
Contributor

When a storage server has misconfigured locality information, e.g., a missing data_hall or zone_id entry, the current data distribution algorithm may get stuck and cannot recover by itself. This may leave a cluster in an unrecoverable state.

This change aims to make DD resilient to such misconfiguration.

A server or machine has a valid locality only if it sets the correct locality entries. Team building should only use servers or machines with valid localities.
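(Editorial sketch, not code from this PR.) The validity rule above can be illustrated as follows, assuming a simplified LocalityData (a string-to-string map) and a set of locality keys required by the replication policy (e.g. "zoneid", "data_hall"). The names LocalityData, isValidLocality, and requiredKeys are simplified assumptions, not FoundationDB's actual API.

```cpp
#include <map>
#include <set>
#include <string>

// Simplified stand-in for FoundationDB's locality metadata.
using LocalityData = std::map<std::string, std::string>;

// A server is usable for team building only if every locality key required by
// the replication policy is present and non-empty.
bool isValidLocality(const std::set<std::string>& requiredKeys, const LocalityData& locality) {
	for (const auto& key : requiredKeys) {
		auto it = locality.find(key);
		if (it == locality.end() || it->second.empty()) {
			return false; // missing or empty entry => misconfigured locality
		}
	}
	return true;
}
```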
@xumengpanda xumengpanda changed the title Defend DD from misconfigured locality of servers DD:Defend DD from misconfigured locality of servers Sep 13, 2019
When we add a simulation test that misconfigures a cluster by not setting some locality entries, we should always set DD_VALIDATE_LOCALITY to true. Otherwise, simulation tests may fail.
@ajbeamon
Contributor

This looks good, let me now test how it behaves in our reproduction sequence.

@ajbeamon
Contributor

It seems to work ok, but it's now repeatedly recruiting a new storage server on the process that is bad.

@xumengpanda
Contributor Author

xumengpanda commented Sep 14, 2019

It seems to work ok, but it's now repeatedly recruiting a new storage server on the process that is bad.

The reason is that the worker with the invalid locality is viewed as an idle storage worker. The storageRecruiter keeps recruiting an SS on it, which will later be marked as failed and removed.

The new commit should fix it.

For now, if a worker has an incorrect locality, it will not participate as a storage worker, even after its locality is corrected, unless DD is restarted. <-- This is not an ideal fix. A better fix will involve a bit more change.
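(Editorial sketch, not code from this PR.) The recruitment-side behavior described above can be sketched like this, building on the LocalityData/isValidLocality definitions from the earlier sketch; the Worker struct and eligibleStorageCandidates helper are illustrative assumptions, not the PR's actual code.

```cpp
#include <set>
#include <string>
#include <vector>

struct Worker {
	std::string address;
	LocalityData locality;      // same simplified type as the earlier sketch
	bool hasStorageRole = false;
};

// Workers with invalid localities never become recruitment candidates, so the
// recruiter stops repeatedly starting (and then failing) storage servers on them.
std::vector<Worker> eligibleStorageCandidates(const std::vector<Worker>& workers,
                                              const std::set<std::string>& requiredKeys) {
	std::vector<Worker> candidates;
	for (const auto& w : workers) {
		if (w.hasStorageRole)
			continue; // already runs a storage server
		if (!isValidLocality(requiredKeys, w.locality))
			continue; // misconfigured locality: exclude from recruitment
		candidates.push_back(w);
	}
	return candidates;
}
```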

@ajbeamon
Contributor

That seems to solve that problem, though I've now noticed that the storage role doesn't come back if you fix its locality parameters. It does come back if I bounce the cluster (presumably from bouncing the DD process?).

@xumengpanda
Contributor Author

That seems to solve that problem, though I've now noticed that the storage role doesn't come back if you fix its locality parameters. It does come back if I bounce the cluster (presumably from bouncing the DD process?).

Yes, that's expected as I described in the above comment last night.

Late last night I had an idea for solving this problem without needing to bounce DD, but I didn't have a chance to implement it.

I'm implementing it this morning and will update the PR soon.

When a worker has an incorrect locality, the worker will be excluded from storage recruitment. When the worker has its locality corrected by system operators, the worker will be re-included for storage recruitment.
@xumengpanda
Contributor Author

That seems to solve that problem, though I've now noticed that the storage role doesn't come back if you fix its locality parameters. It does come back if I bounce the cluster (presumably from bouncing the DD process?).

@ajbeamon

The updated PR should have fixed this behavior.
Once DD finds that there are workers with invalid localities, it will periodically check those addresses. Once system operators correct the localities for those workers, DD will remove their addresses from the excluded list.

If all addresses are configured with correct localities, DD will not invoke the above operation, so there is no performance overhead.
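(Editorial sketch, not code from this PR.) The re-inclusion mechanism described above might look roughly like the following; the real change is an actor inside DD, and the ExclusionTracker type and fetchLocality callback here are hypothetical. The point is that the excluded set is only re-checked while it is non-empty, so a correctly configured cluster pays no extra cost.

```cpp
#include <functional>
#include <set>
#include <string>

struct ExclusionTracker {
	std::set<std::string> excludedAddresses; // workers excluded for invalid locality

	// Called periodically, and only while excludedAddresses is non-empty.
	// fetchLocality is a stand-in for asking the cluster for a worker's current
	// locality; isValidLocality() is from the earlier sketch.
	void recheck(const std::set<std::string>& requiredKeys,
	             const std::function<LocalityData(const std::string&)>& fetchLocality) {
		for (auto it = excludedAddresses.begin(); it != excludedAddresses.end();) {
			if (isValidLocality(requiredKeys, fetchLocality(*it))) {
				it = excludedAddresses.erase(it); // operator fixed the locality: re-include
			} else {
				++it;
			}
		}
	}
};
```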

@xumengpanda
Contributor Author

The correctness test shows no errors out of 200K random tests.

I also added the release note for 6.2.4.

When a storage process is rebooted in simulation, we randomly misconfigure its locality to test whether DD can survive the locality misconfiguration. We later reboot the same process with the correct locality to ensure the simulation proceeds and does not report false positives due to the misconfiguration.
Remove the unneeded comment as well.
Later configure the correct locality for the machine and reboot it so that the cluster is in a correct configuration.
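(Editorial sketch, not code from this PR.) The simulation behavior described in these commits could be sketched as below, reusing the simplified LocalityData type from the earlier sketch: when rebooting a storage process, one locality entry may be dropped at random to emulate operator misconfiguration, and the caller keeps the original locality so a later reboot can restore it. maybeMisconfigure is illustrative only, not the actual simulation workload code.

```cpp
#include <cstdlib>
#include <iterator>
#include <map>
#include <string>

using LocalityData = std::map<std::string, std::string>; // as in the earlier sketch

// Returns a copy of the locality with one entry removed about half the time,
// e.g. dropping zoneid or data_hall; the original is kept for the later reboot.
LocalityData maybeMisconfigure(const LocalityData& original) {
	LocalityData locality = original;
	if (!locality.empty() && std::rand() % 2 == 0) {
		auto it = locality.begin();
		std::advance(it, std::rand() % locality.size());
		locality.erase(it);
	}
	return locality;
}
```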
@xumengpanda
Contributor Author

I ran the simulation test with the misconfigured localities.
Good news: the tests did find bugs triggered by misconfigured localities.
Bad news: tLog locations are also affected by misconfigured localities. Figuring out the detailed reason and fixing it will take one or two days.

@xumengpanda
Contributor Author

The problem caused by misconfigured localities turns out to be harder to solve.
The solution is broken into multiple smaller solutions, as documented in Issue #2100.
As a result, part 1 will just fix DD.

Because this PR includes the simulation test that broke the tLogs' locality-related code, I created a new PR #2120 without the simulation test for review and merge.

I will leave this PR as it is for now.

@xumengpanda xumengpanda changed the title DD:Defend DD from misconfigured locality of servers DD:Defend DD from misconfigured locality of servers -- Part 2 Sep 19, 2019
```cpp
}

	int size() const { return _data.size(); }
	std::pair<Standalone<StringRef>, Optional<Standalone<StringRef>>> getItem(int i) {
```
Contributor


I had a comment about this on the other PR, but I think using a begin and end would actually be simpler code-wise and wouldn't imply some property (i.e. efficient random access) that doesn't hold.
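(Editorial sketch, not code from this PR.) The reviewer's suggestion could look like the following, assuming the container wraps an ordered map so integer indexing is not constant-time; std::string and std::optional stand in for the Standalone<StringRef> and Optional types in the real code. Exposing begin()/end() lets callers iterate with a range-for and avoids implying efficient random access.

```cpp
#include <map>
#include <optional>
#include <string>

class KeyValueList {
public:
	using Map = std::map<std::string, std::optional<std::string>>;
	using const_iterator = Map::const_iterator;

	// Iteration interface instead of size()/getItem(int).
	const_iterator begin() const { return _data.begin(); }
	const_iterator end() const { return _data.end(); }
	int size() const { return static_cast<int>(_data.size()); }

private:
	Map _data;
};

// Usage: for (const auto& [key, value] : kvList) { ... }
```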

@ajbeamon
Contributor

ajbeamon commented Dec 2, 2019

What is the state of this PR? Do we still want the changes here in some form?

@xumengpanda
Contributor Author

What is the state of this PR? Do we still want the changes here in some form?

I'm not working on this PR right now. I think it found some assertion failures in the tLog team-building process and fixed them (?). But I haven't verified that the "fix" addresses the root cause.

Do we want this PR to be revamped and merged soon?

@ajbeamon
Contributor

ajbeamon commented Dec 2, 2019

Do we want this PR to be revamped and merged soon?

That's probably a question for @etschannen

@alexmiller-apple
Contributor

Evan says not for 7.0, so I'll close this PR, and file an issue to track picking the work back up at a later time.
