Conversation

@xumengpanda
Contributor

When a storage server has misconfigured locality information, e.g., a missing data_hall or zone_id entry, the current data distribution algorithm may get stuck and cannot recover by itself. This may leave a cluster in an unrecoverable state.

This change aims to make DD resilient to such misconfiguration.

A server or machine has a valid locality only if it sets the correct locality entries. Team building should only use servers or machines with valid localities.
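(Editorial sketch, not code from this PR.) The validity rule above can be illustrated as follows, assuming a simplified LocalityData (a string-to-string map) and a set of locality keys required by the replication policy (e.g. "zoneid", "data_hall"). The names LocalityData, isValidLocality, and requiredKeys are simplified assumptions, not FoundationDB's actual API.

```cpp
#include <map>
#include <set>
#include <string>

// Simplified stand-in for FoundationDB's locality metadata.
using LocalityData = std::map<std::string, std::string>;

// A server is usable for team building only if every locality key required by
// the replication policy is present and non-empty.
bool isValidLocality(const std::set<std::string>& requiredKeys, const LocalityData& locality) {
	for (const auto& key : requiredKeys) {
		auto it = locality.find(key);
		if (it == locality.end() || it->second.empty()) {
			return false; // missing or empty entry => misconfigured locality
		}
	}
	return true;
}
```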
@xumengpanda xumengpanda changed the title Defend DD from misconfigured locality of servers DD:Defend DD from misconfigured locality of servers Sep 13, 2019
When we add a simulation test that misconfigures a cluster by not setting some locality entries, we should always set DD_VALIDATE_LOCALITY to true. Otherwise, simulation tests may fail.
@ajbeamon
Contributor

This looks good, let me now test how it behaves in our reproduction sequence.

@ajbeamon
Contributor

It seems to work ok, but it's now repeatedly recruiting a new storage server on the process that is bad.

@xumengpanda
Contributor Author

xumengpanda commented Sep 14, 2019

It seems to work ok, but it's now repeatedly recruiting a new storage server on the process that is bad.

The reason is that the worker with the invalid locality is viewed as an idle storage worker. The storageRecruiter keeps recruiting an SS on it, which will later be marked as failed and removed.

The new commit should fix it.

For now, if a worker has an incorrect locality, it will not participate as a storage worker, even after its locality is corrected, unless DD is restarted. <-- This is not an ideal fix. A better fix will involve a bit more change.
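(Editorial sketch, not code from this PR.) The recruitment-side behavior described above can be sketched like this, building on the LocalityData/isValidLocality definitions from the earlier sketch; the Worker struct and eligibleStorageCandidates helper are illustrative assumptions, not the PR's actual code.

```cpp
#include <set>
#include <string>
#include <vector>

struct Worker {
	std::string address;
	LocalityData locality;      // same simplified type as the earlier sketch
	bool hasStorageRole = false;
};

// Workers with invalid localities never become recruitment candidates, so the
// recruiter stops repeatedly starting (and then failing) storage servers on them.
std::vector<Worker> eligibleStorageCandidates(const std::vector<Worker>& workers,
                                              const std::set<std::string>& requiredKeys) {
	std::vector<Worker> candidates;
	for (const auto& w : workers) {
		if (w.hasStorageRole)
			continue; // already runs a storage server
		if (!isValidLocality(requiredKeys, w.locality))
			continue; // misconfigured locality: exclude from recruitment
		candidates.push_back(w);
	}
	return candidates;
}
```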

@ajbeamon
Contributor

That seems to solve that problem, though I've now noticed that the storage role doesn't come back if you fix its locality parameters. It does come back if I bounce the cluster (presumably from bouncing the DD process?).

@xumengpanda
Contributor Author

That seems to solve that problem, though I've now noticed that the storage role doesn't come back if you fix its locality parameters. It does come back if I bounce the cluster (presumably from bouncing the DD process?).

Yes, that's expected as I described in the above comment last night.

Late last night I had an idea for solving this problem without needing to bounce DD, but I didn't have a chance to implement it.

I'm implementing it this morning and will update the PR soon.

When a worker has an incorrect locality, the worker will be excluded from storage recruitment. When the worker has its locality corrected by system operators, the worker will be re-included for storage recruitment.
@xumengpanda
Contributor Author

That seems to solve that problem, though I've now noticed that the storage role doesn't come back if you fix its locality parameters. It does come back if I bounce the cluster (presumably from bouncing the DD process?).

@ajbeamon

The updated PR should have fixed this behavior.
Once DD finds that there are workers with invalid localities, it will periodically check those addresses. Once system operators correct the localities for those workers, DD will remove their addresses from the excluded list.

If all addresses are configured with correct localities, DD will not invoke the above operation, so there is no performance overhead.
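(Editorial sketch, not code from this PR.) The re-inclusion mechanism described above might look roughly like the following; the real change is an actor inside DD, and the ExclusionTracker type and fetchLocality callback here are hypothetical. The point is that the excluded set is only re-checked while it is non-empty, so a correctly configured cluster pays no extra cost.

```cpp
#include <functional>
#include <set>
#include <string>

struct ExclusionTracker {
	std::set<std::string> excludedAddresses; // workers excluded for invalid locality

	// Called periodically, and only while excludedAddresses is non-empty.
	// fetchLocality is a stand-in for asking the cluster for a worker's current
	// locality; isValidLocality() is from the earlier sketch.
	void recheck(const std::set<std::string>& requiredKeys,
	             const std::function<LocalityData(const std::string&)>& fetchLocality) {
		for (auto it = excludedAddresses.begin(); it != excludedAddresses.end();) {
			if (isValidLocality(requiredKeys, fetchLocality(*it))) {
				it = excludedAddresses.erase(it); // operator fixed the locality: re-include
			} else {
				++it;
			}
		}
	}
};
```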

@xumengpanda
Contributor Author

The correctness test shows no errors out of 200K random tests.

I also added the release note for 6.2.4.

When a storage process is rebooted in simulation, we randomly misconfigure its locality to test whether DD can survive the locality misconfiguration. We later reboot the same process with the correct locality to ensure the simulation proceeds and does not report false positives due to the misconfiguration.
Remove the unneeded comment as well.
Later configure the correct locality for the machine and reboot it so that the cluster is in a correct configuration.
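(Editorial sketch, not code from this PR.) The simulation behavior described in these commits could be sketched as below, reusing the simplified LocalityData type from the earlier sketch: when rebooting a storage process, one locality entry may be dropped at random to emulate operator misconfiguration, and the caller keeps the original locality so a later reboot can restore it. maybeMisconfigure is illustrative only, not the actual simulation workload code.

```cpp
#include <cstdlib>
#include <iterator>
#include <map>
#include <string>

using LocalityData = std::map<std::string, std::string>; // as in the earlier sketch

// Returns a copy of the locality with one entry removed about half the time,
// e.g. dropping zoneid or data_hall; the original is kept for the later reboot.
LocalityData maybeMisconfigure(const LocalityData& original) {
	LocalityData locality = original;
	if (!locality.empty() && std::rand() % 2 == 0) {
		auto it = locality.begin();
		std::advance(it, std::rand() % locality.size());
		locality.erase(it);
	}
	return locality;
}
```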
@xumengpanda
Contributor Author

I ran the simulation test with the misconfigured localities.
Good news: the tests did find bugs triggered by misconfigured localities.
Bad news: tLog locations are also affected by misconfigured localities. Figuring out the detailed reason and fixing it will take one or two days.

@xumengpanda
Contributor Author

The problem caused by misconfigured localities turns out to be harder to solve.
The solution is broken into multiple smaller solutions, as documented in Issue #2100.
As a result, part 1 will just fix DD.

Because this PR includes the simulation test that broke the tLogs' locality-related code, I created a new PR #2120 without the simulation test for review and merge.

I will leave this PR as it is for now.

@xumengpanda xumengpanda changed the title DD:Defend DD from misconfigured locality of servers DD:Defend DD from misconfigured locality of servers -- Part 2 Sep 19, 2019
```cpp
}

	int size() const { return _data.size(); }
	std::pair<Standalone<StringRef>, Optional<Standalone<StringRef>>> getItem(int i) {
```
Contributor


I had a comment about this on the other PR, but I think using a begin and end would actually be simpler code-wise and wouldn't imply some property (i.e. efficient random access) that doesn't hold.
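(Editorial sketch, not code from this PR.) The reviewer's suggestion could look like the following, assuming the container wraps an ordered map so integer indexing is not constant-time; std::string and std::optional stand in for the Standalone<StringRef> and Optional types in the real code. Exposing begin()/end() lets callers iterate with a range-for and avoids implying efficient random access.

```cpp
#include <map>
#include <optional>
#include <string>

class KeyValueList {
public:
	using Map = std::map<std::string, std::optional<std::string>>;
	using const_iterator = Map::const_iterator;

	// Iteration interface instead of size()/getItem(int).
	const_iterator begin() const { return _data.begin(); }
	const_iterator end() const { return _data.end(); }
	int size() const { return static_cast<int>(_data.size()); }

private:
	Map _data;
};

// Usage: for (const auto& [key, value] : kvList) { ... }
```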

@ajbeamon
Contributor

ajbeamon commented Dec 2, 2019

What is the state of this PR? Do we still want the changes here in some form?

@xumengpanda
Contributor Author

What is the state of this PR? Do we still want the changes here in some form?

I'm not working on this PR right now. I think it found some assertion failures in the tLog team-building process and fixed them (?). But I haven't verified that the "fix" addresses the root cause.

Do we want this PR to be revamped and merged soon?

@ajbeamon
Contributor

ajbeamon commented Dec 2, 2019

Do we want this PR to be revamped and merged soon?

That's probably a question for @etschannen

@alexmiller-apple
Contributor

Evan says not for 7.0, so I'll close this PR, and file an issue to track picking the work back up at a later time.
