DD:Defend DD from misconfigured locality of servers -- Part 2 #2110
Conversation
A server or machine has a valid locality only if it sets the required locality entries correctly. Team building should only use servers or machines with a valid locality.
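For concreteness, here is a minimal sketch of what such a validity check might look like, using a plain std::map as a stand-in for a process's locality entries (the real code works on FoundationDB's LocalityData; the required key names below are illustrative):

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Locality = std::map<std::string, std::string>;

// A locality is "valid" only when every required entry is present and non-empty.
bool isValidLocality(const Locality& locality,
                     const std::vector<std::string>& requiredKeys) {
    for (const auto& key : requiredKeys) {
        auto it = locality.find(key);
        if (it == locality.end() || it->second.empty())
            return false;
    }
    return true;
}

int main() {
    const std::vector<std::string> required = { "machineid", "zoneid", "data_hall" };

    Locality good = { { "machineid", "m1" }, { "zoneid", "z1" }, { "data_hall", "h1" } };
    Locality bad  = { { "machineid", "m2" }, { "zoneid", "z2" } }; // data_hall missing

    std::cout << isValidLocality(good, required) << "\n"; // 1
    std::cout << isValidLocality(bad, required)  << "\n"; // 0
    return 0;
}
```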
When we add a simulation test that misconfigures a cluster by not setting some locality entries, we should always set DD_VALIDATE_LOCALITY to true. Otherwise, simulation tests may fail.
This looks good, let me now test how it behaves in our reproduction sequence.
It seems to work ok, but it's now repeatedly recruiting a new storage server on the process that is bad.
The reason is that the worker with the invalid locality is viewed as an idle storage worker. The new commit should fix it. However, with this fix, if a worker has an incorrect locality and its locality is later corrected, the worker will still not participate as a storage worker unless DD is restarted. <-- This is not an ideal fix. The better fix will involve a bit more change.
That seems to solve that problem, though I've now noticed that the storage role doesn't come back if you fix its locality parameters. It does come back if I bounce the cluster (presumably from bouncing the DD process?).
Yes, that's expected, as I described in the above comment last night. I had an idea for solving this problem without needing to bounce DD late last night but didn't have a chance to implement it. I'm implementing it this morning and will update the PR soon.
When a worker has an incorrect locality, it will be excluded from storage recruitment. When the worker's locality is corrected by system operators, the worker will be re-included in storage recruitment.
The updated PR should implement this behavior. If all addresses are configured with correct localities, DD will not invoke the above operation, so there is no performance overhead.
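A minimal sketch of how that exclusion/re-inclusion can work: validity is re-checked on every recruitment pass instead of being cached, so a worker whose locality is fixed becomes eligible again without restarting DD. All names and types here are illustrative stand-ins, not the actual DD code; the validateLocality flag stands in for the DD_VALIDATE_LOCALITY knob.

```cpp
#include <map>
#include <string>
#include <vector>

using Locality = std::map<std::string, std::string>;

struct Worker {
    std::string address;
    Locality locality; // may be updated by operators between recruitment passes
};

bool isValidLocality(const Locality& l) {
    static const std::vector<std::string> required = { "machineid", "zoneid", "data_hall" };
    for (const auto& key : required) {
        auto it = l.find(key);
        if (it == l.end() || it->second.empty())
            return false;
    }
    return true;
}

// Called on every recruitment pass; validity is never cached, so a worker that
// was skipped earlier is reconsidered as soon as its locality is corrected.
std::vector<const Worker*> eligibleStorageCandidates(const std::vector<Worker>& workers,
                                                     bool validateLocality) {
    std::vector<const Worker*> candidates;
    for (const auto& w : workers) {
        if (validateLocality && !isValidLocality(w.locality))
            continue; // excluded until the operator corrects the locality
        candidates.push_back(&w);
    }
    return candidates;
}
```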
The correctness test shows no errors out of 200K random tests. I also added the release note for 6.2.4.
When a storage process is rebooted in simulation, we randomly misconfigure its locality to test whether DD can survive the locality misconfiguration. We later reboot the same process with the correct locality to ensure the simulation proceeds and does not report a false positive due to the misconfiguration.
Remove the unneeded comment as well.
Later configure the correct locality for the machine and reboot it so that the cluster is in a correct configuration.
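The simulation step described above might look roughly like this sketch; rebootProcess and the RNG usage are stand-ins for the simulator's actual hooks, not the real test code.

```cpp
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <map>
#include <string>

using Locality = std::map<std::string, std::string>;

// Stand-in for the simulator's reboot hook.
void rebootProcess(const std::string& address, const Locality& locality) {
    std::cout << "reboot " << address << " with " << locality.size() << " locality entries\n";
}

void rebootWithMisconfiguredLocality(const std::string& address, const Locality& correct) {
    // First reboot: drop one randomly chosen entry to emulate an operator
    // misconfiguring the process's locality.
    Locality broken = correct;
    if (!broken.empty()) {
        auto it = broken.begin();
        std::advance(it, std::rand() % broken.size());
        broken.erase(it);
    }
    rebootProcess(address, broken);

    // ... let DD run against the misconfigured process for a while ...

    // Second reboot: restore the correct locality so the cluster ends up
    // correctly configured and the simulation does not report a false positive.
    rebootProcess(address, correct);
}
```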
I ran the simulation test with the misconfigured localities.
The problem caused by misconfigured localities turns out to be harder to solve. Because this PR includes the simulation test that broke the tLogs' locality-related code, I created a new PR #2120 without the simulation test for review and merge. I am leaving this PR as it is for now.
Change getWorkers param from txn handler to db
}

int size() const { return _data.size(); }
std::pair<Standalone<StringRef>, Optional<Standalone<StringRef>>> getItem(int i) {
I had a comment about this on the other PR, but I think using begin and end iterators would actually be simpler code-wise and wouldn't imply a property (i.e., efficient random access) that doesn't hold.
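For illustration, a sketch of the interface the reviewer is suggesting: expose begin()/end() rather than size()/getItem(int), so callers iterate without relying on efficient random access. The class name here is hypothetical, and std::string/std::optional stand in for Standalone<StringRef>/Optional<Standalone<StringRef>> to keep the example self-contained.

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

class KeyValueBatch {
public:
    using Item = std::pair<std::string, std::optional<std::string>>;
    using const_iterator = std::vector<Item>::const_iterator;

    // Iteration-only interface: no size()/getItem(int), so callers cannot
    // assume cheap random access into the batch.
    const_iterator begin() const { return _data.begin(); }
    const_iterator end() const { return _data.end(); }

private:
    std::vector<Item> _data;
};

// Callers simply range-for over the batch:
//   for (const auto& [key, value] : batch) { ... }
```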
What is the state of this PR? Do we still want the changes here in some form?
I'm not working on this PR right now. I think it found some assertion failures in the tLog team-building process and fixed them (?), but I haven't verified that the "fix" addresses the root cause. Do we want this PR to be revamped and merged soon?
That's probably a question for @etschannen
Evan says not for 7.0, so I'll close this PR, and file an issue to track picking the work back up at a later time.
When a storage server has misconfigured locality information, e.g., a missing data_hall or zone_id entry, the current data distribution algorithm may get stuck and cannot recover by itself. This may leave a cluster in an unrecoverable state. This change aims to make DD resilient to such misconfiguration.
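As a simplified illustration (not the actual DD algorithm) of why a missing entry can wedge data distribution: if team building needs replicas in distinct zones and one server never reports its zone_id, the candidate pool may never contain enough distinct zones, so DD keeps waiting instead of making progress.

```cpp
#include <optional>
#include <set>
#include <string>
#include <vector>

struct StorageServer {
    std::string id;
    std::optional<std::string> zoneId; // empty when the operator misconfigured the process
};

// Simplified model: a team of `replicas` servers can only be built from
// servers in `replicas` distinct zones.
bool canBuildTeam(const std::vector<StorageServer>& servers, size_t replicas) {
    std::set<std::string> zones;
    for (const auto& s : servers) {
        if (!s.zoneId)
            continue; // the server's fault domain is unknown, so it cannot be placed safely
        zones.insert(*s.zoneId);
    }
    return zones.size() >= replicas;
}

// With three servers but one missing its zoneid, canBuildTeam(servers, 3)
// stays false forever, matching the "stuck and cannot recover by itself"
// symptom described above.
```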