dapengzhang0
Contributor

@dapengzhang0 dapengzhang0 commented Sep 1, 2021

Fixes #8397.
#8397 is caused by mistakenly removing a map entry right after that entry has been recreated following a GC. The bug is reproduced in a regression test.

(SharedCallCounterMap is hard to read and can easily be a source of bugs. It's impossible to test the class sufficiently with unit tests, because GC can happen at any time, concurrently with the method being tested. I'm not 100% confident about the correctness of the fix. If possible, I would avoid using WeakReference in the first place.)
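To make the failure mode above concrete, here is a minimal sketch, not the actual SharedCallCounterMap code. It assumes a single-level map, and the names StaleEntryCleanupSketch, KeyedRef, cleanQueueByKey and cleanQueueByIdentity are hypothetical. It contrasts a cleanup that removes by key (which can delete a freshly recreated entry) with one that removes only when the map still holds the exact stale reference. Whether the merged fix uses this exact guard is not stated in this conversation.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified stand-in for the real class: one key level only.
final class StaleEntryCleanupSketch {
  // Weak reference that remembers which map key it was stored under.
  static final class KeyedRef extends WeakReference<AtomicLong> {
    final String key;

    KeyedRef(AtomicLong referent, ReferenceQueue<AtomicLong> queue, String key) {
      super(referent, queue);
      this.key = key;
    }
  }

  final ConcurrentMap<String, KeyedRef> counters = new ConcurrentHashMap<>();
  final ReferenceQueue<AtomicLong> refQueue = new ReferenceQueue<>();

  // Problematic cleanup: removes by key only, so a stale reference that is
  // dequeued late also removes the entry that was recreated in the meantime.
  void cleanQueueByKey() {
    KeyedRef ref;
    while ((ref = (KeyedRef) refQueue.poll()) != null) {
      counters.remove(ref.key);
    }
  }

  // Guarded cleanup: removes the entry only if it still maps to this exact
  // reference (WeakReference does not override equals, so this is identity).
  void cleanQueueByIdentity() {
    KeyedRef ref;
    while ((ref = (KeyedRef) refQueue.poll()) != null) {
      counters.remove(ref.key, ref);
    }
  }
}
```

With the by-key variant, a stale reference dequeued after the entry was recreated leaves the map empty; with the identity variant, the recreated entry survives.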

@dapengzhang0 dapengzhang0 requested a review from ejona86 September 1, 2021 15:25
void cleanQueue() {
CounterReference ref;
while ((ref = (CounterReference) refQueue.poll()) != null) {
Map<String, CounterReference> clusterCounter = counters.get(ref.cluster);
Contributor Author

@dapengzhang0 dapengzhang0 Sep 2, 2021

I'm guessing clusterCounter shouldn't be null, because refs should be enqueued in the same order in which the garbage collector clears their referents. But I did not see the Javadoc state that explicitly.

Is there any risk of an NPE in an extreme race like the following?
ref1.referent cleared by GC => ref2 created and put in the counters map => ref2.referent cleared by GC => ref2 enqueued => ref1 enqueued.

Member

With C1 it doesn't seem too far-fetched, especially if enqueuing is a separate stage of the process from clearing. I doubt it would actually happen, but it seems fair to consider.

A simple solution for that is to call ref.enqueue() if ref.get() == null, before replacing the reference.
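A rough sketch of that suggestion, under simplifying assumptions: a single-level, single-threaded map, with the hypothetical names EnqueueBeforeReplaceSketch and KeyedRef (the real class is not shown in full here). The idea is that the stale reference is put on the queue before its replacement exists, so it cannot be dequeued after the replacement. Reference.enqueue() simply returns false if the collector already enqueued the reference, so calling it again is harmless.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, single-threaded simplification; not the real implementation.
final class EnqueueBeforeReplaceSketch {
  // Weak reference that remembers which map key it was stored under.
  static final class KeyedRef extends WeakReference<AtomicLong> {
    final String key;

    KeyedRef(AtomicLong referent, ReferenceQueue<AtomicLong> queue, String key) {
      super(referent, queue);
      this.key = key;
    }
  }

  final Map<String, KeyedRef> counters = new HashMap<>();
  final ReferenceQueue<AtomicLong> refQueue = new ReferenceQueue<>();

  AtomicLong getOrCreate(String key) {
    KeyedRef ref = counters.get(key);
    AtomicLong counter = (ref == null) ? null : ref.get();
    if (counter == null) {
      if (ref != null) {
        // The referent is already gone. Enqueue the stale reference now, so it
        // is guaranteed to sit in the queue ahead of the replacement below.
        ref.enqueue();
      }
      counter = new AtomicLong();
      counters.put(key, new KeyedRef(counter, refQueue, key));
    }
    return counter;
  }
}
```

With this ordering, the ref1/ref2 interleaving described above cannot end with ref1 being enqueued after ref2.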

Contributor Author

Done. TIL thanks.

public void gcAndRecreate() {
@SuppressWarnings("UnusedVariable") // assign to null for GC only
AtomicLong counter = map.getOrCreate(CLUSTER, EDS_SERVICE_NAME);
final CounterReference ref = counters.get(CLUSTER).get(EDS_SERVICE_NAME);
Member

FYI, you can call ref.clear() and ref.enqueue() manually instead of relying on GC.
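For illustration, a small sketch of that tip using only the standard java.lang.ref API. The helper name simulateCollection and the class ManualEnqueueSketch are hypothetical; Reference.clear() and Reference.enqueue() are the real methods. This makes a test deterministic, because it no longer depends on when, or whether, System.gc() actually collects the referent.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

// Hypothetical test helper; only clear() and enqueue() come from the JDK.
final class ManualEnqueueSketch {
  private ManualEnqueueSketch() {}

  // Puts a weak reference into the state it would have after a real collection:
  // get() returns null and the reference is on its ReferenceQueue (if any).
  static <T> void simulateCollection(WeakReference<T> ref) {
    ref.clear();
    ref.enqueue();
  }
}
```

A test can then create a WeakReference with a ReferenceQueue, call simulateCollection on it, and poll the queue immediately, while still holding a strong reference to the referent.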

@dapengzhang0 dapengzhang0 merged commit 07747c5 into grpc:master Sep 2, 2021
@dapengzhang0 dapengzhang0 deleted the fix-weakref-bug branch September 2, 2021 17:25
dapengzhang0 added a commit to dapengzhang0/grpc-java that referenced this pull request Sep 2, 2021
Fixes grpc#8397. grpc#8397 is caused by mistakenly clearing up a map entry right after the entry is recreated after gc. Reproduced in regression test.
dapengzhang0 added a commit that referenced this pull request Sep 2, 2021
Fixes #8397. #8397 is caused by mistakenly clearing up a map entry right after the entry is recreated after gc. Reproduced in regression test.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 2, 2021