Optimize KeyedLock and related concurrency primitives #96372
Conversation
Pinging @elastic/es-core-infra (Team:Core/Infra)
libs/core/src/main/java/org/elasticsearch/core/AbstractRefCounted.java (outdated, resolved)
This class was quite hot in recent benchmarks of shared-cache based searches. The keyed lock did far too many redundant map lookups; I folded them into a single compute, which also removes the needless looping. Those same benchmarks showed a lot of visible time spent dealing with ref counts as well. I removed one layer of indirection in the atomics used by both the release-once and the abstract ref count, which should also save a little in CPU caches.
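The single-compute idea can be sketched roughly like this (a simplified stand-in, not the actual KeyedLock code; class and method names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch: one compute() call either finds-and-increments an existing
// entry or creates a new one, replacing a get/putIfAbsent retry loop that could
// need several lookups of the same bucket.
class SingleComputeKeyedLock<K> {
    static final class Entry {
        int refs = 1; // mutated only inside compute(), i.e. under the bucket lock for this key
    }

    private final ConcurrentHashMap<K, Entry> map = new ConcurrentHashMap<>();

    Entry acquire(K key) {
        return map.compute(key, (k, existing) -> {
            if (existing == null) {
                return new Entry();   // first holder
            }
            existing.refs++;          // additional holder
            return existing;
        });
    }

    void release(K key) {
        map.compute(key, (k, existing) -> {
            if (existing == null) {
                throw new IllegalStateException("lock for " + k + " is not held");
            }
            return --existing.refs == 0 ? null : existing; // returning null removes the mapping
        });
    }

    int heldKeys() {
        return map.size();
    }
}
```

Both acquire and release touch the relevant map bucket exactly once, where a loop-based variant can need a fresh lookup per retry.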
Force-pushed f9f27b6 to ff36302.

    private static class ReleaseOnce extends AtomicReference<Releasable> implements Releasable {
This seems ok because the caller can't see the AtomicReference bit. Maybe do the same thing to ActionListener#notifyOnce and RunOnce too?
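For reference, the pattern under discussion, sketched with a stand-in Releasable interface (the real one lives in org.elasticsearch.core): extending AtomicReference directly means the once-guard needs no separate wrapped field, saving one allocation and one pointer chase.

```java
import java.util.concurrent.atomic.AtomicReference;

// Stand-in for org.elasticsearch.core.Releasable, just for this sketch.
interface Releasable {
    void close();
}

// The object *is* its own atomic slot: no wrapped AtomicReference field.
// Callers only ever see the Releasable interface, so the inherited
// AtomicReference API stays effectively out of sight.
final class ReleaseOnceSketch extends AtomicReference<Releasable> implements Releasable {
    ReleaseOnceSketch(Releasable delegate) {
        super(delegate);
    }

    @Override
    public void close() {
        Releasable delegate = getAndSet(null); // exactly one caller wins the race
        if (delegate != null) {
            delegate.close();
        }
    }
}
```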
Ah, for those I don't think I can do it without making things visible to the caller. At least not unless I move more things around. Maybe do that later if we see it matter anywhere?
I wonder if we need KeyedLock even in SharedBlobCacheService. It seems doable to avoid it (but might be too big a change to do just now). I could at least imagine hashing the CacheKey (and region) and using a segmented-lock approach instead, with a fixed array of locks (perhaps 16x the CPU count), to avoid all the map compute/removes that happen in KeyedLock.
The Releasables change is obviously good on its own, as is using the releasableLock method instead of the ReleasableLock class.
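The segmented-lock idea, sketched (sizing and names here are illustrative guesses, not an actual implementation): hash the key once into a fixed, power-of-two array of locks, so acquiring never touches a concurrent map at all.

```java
import java.util.concurrent.locks.ReentrantLock;

// Fixed array of lock stripes indexed by key hash. Distinct keys can share a
// stripe, which trades a little false contention for doing zero map work.
final class StripedLockSketch {
    private final ReentrantLock[] stripes;

    StripedLockSketch(int requestedStripes) {
        // round up to a power of two so we can mask instead of mod
        int size = 1 << (32 - Integer.numberOfLeadingZeros(Math.max(2, requestedStripes) - 1));
        stripes = new ReentrantLock[size];
        for (int i = 0; i < size; i++) {
            stripes[i] = new ReentrantLock();
        }
    }

    ReentrantLock lockFor(Object key) {
        int h = key.hashCode();
        h ^= h >>> 16; // spread high bits down, as ConcurrentHashMap does
        return stripes[h & (stripes.length - 1)];
    }

    int stripeCount() {
        return stripes.length;
    }
}
```

A cache could size this at, say, 16 * Runtime.getRuntime().availableProcessors(), per the suggestion above; the right multiplier would need measuring.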
libs/core/src/main/java/org/elasticsearch/core/AbstractRefCounted.java (outdated, resolved)
    KeyLock perNodeLock = map.compute(key, computeLock(fair));
I am not convinced that this will be faster. compute has a stronger guarantee that the remapping function is called only once, which leads to a bit of locking and to storing twice (in the not-yet-present case). And even if the remapping results in the same value, I think it stores the value back. That last part might not matter much, since we probably expect to be uncontended in most cases, and the contended case would likely be dominated by the lock.
Have you done benchmarks to validate this part of the change?
I wonder if optimistically doing tryCreateNewLock before any map lookup would be faster, since in the uncontended case that would go through, and in the contended case it does not matter?
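The optimistic alternative being suggested could look roughly like this (names are made up for the sketch; tryCreateNewLock here becomes a speculative putIfAbsent):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the optimistic fast path: speculatively install a
// fresh entry with putIfAbsent, so the uncontended case does one map
// operation and no remapping-function machinery.
final class OptimisticKeyedCount {
    static final class KeyLock {
        final AtomicInteger refs = new AtomicInteger(1);
    }

    private final ConcurrentHashMap<String, KeyLock> map = new ConcurrentHashMap<>();

    KeyLock acquire(String key) {
        KeyLock fresh = new KeyLock(); // speculative allocation, wasted only on collision
        while (true) {
            KeyLock existing = map.putIfAbsent(key, fresh);
            if (existing == null) {
                return fresh; // uncontended fast path: a single map operation
            }
            int refs = existing.refs.get();
            // increment, but never resurrect an entry that already dropped to zero
            if (refs > 0 && existing.refs.compareAndSet(refs, refs + 1)) {
                return existing;
            }
            // entry is concurrently being removed: retry, possibly installing ours
        }
    }

    void release(String key, KeyLock lock) {
        if (lock.refs.decrementAndGet() == 0) {
            map.remove(key, lock); // remove only if still mapped to this exact lock
        }
    }

    int heldKeys() {
        return map.size();
    }
}
```

The trade-off matches the discussion: the uncontended path is one putIfAbsent, while the contended path pays an extra allocation and a retry loop.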
> I wonder if optimistically doing tryCreateNewLock before any map lookup would not be faster, since in the uncontended case that would go through and in the contended case, it does not matter?
I guess that may be true. I'm just a little worried about how badly the contended case will behave then. It seems like in the real world, most of the time gained by my change comes from looking up the right bucket in the map only once.

In the end it's really hard to measure the effect of this change in the contended scenario, because it depends massively on the specific keys used, etc. For the uncontended case, this solution wins hands down in some quick benchmarks, just by doing the map lookup only once and inlining much better. For the contended case I wasn't able to find a good benchmark approximation. I would hope, though, that on average spinning on a synchronized block (the lock will rarely be contended to the point where it isn't effectively a spin-lock) is still faster than spinning by doing object creation and repeated map lookups.
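A rough single-threaded sketch of the uncontended comparison discussed here (trustworthy numbers would need a proper harness such as JMH; all names are illustrative): one variant does a single compute() per step, the other does a get followed by a second map operation, mimicking the lookup-twice shape.

```java
import java.util.concurrent.ConcurrentHashMap;

// Single-threaded timing sketch only; viaTwoLookups is not a correct
// concurrent protocol as written, it just reproduces the access pattern.
final class UncontendedSketch {
    static long viaCompute(ConcurrentHashMap<String, Integer> map, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            map.compute("key", (k, v) -> v == null ? 1 : v + 1);              // acquire: one lookup
            map.compute("key", (k, v) -> v == null || v <= 1 ? null : v - 1); // release: one lookup
        }
        return System.nanoTime() - start;
    }

    static long viaTwoLookups(ConcurrentHashMap<String, Integer> map, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            Integer v = map.get("key");        // lookup #1
            if (v == null) {
                map.putIfAbsent("key", 1);     // lookup #2
            } else {
                map.replace("key", v, v + 1);
            }
            Integer w = map.get("key");
            if (w != null && w <= 1) {
                map.remove("key");
            } else if (w != null) {
                map.replace("key", w, w - 1);
            }
        }
        return System.nanoTime() - start;
    }
}
```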
Can you share (perhaps privately) the benchmarks done with and without the change and the results?
If we have benchmarks showing this is faster in the uncontended case, then I am good with this.
Avoid the use of KeyedLock, which has a high overhead for uncontended locks. Reduce granularity of lock during #get to the actual region. Relates elastic#96372
Regardless of #96399 I think I still like this change in isolation for other uses of the keyed lock and the other improvements in here :)
@henningandersen as discussed on Slack, I reran those benchmarks on x86 after getting home to my workstation. You are mostly right here, the
LGTM.
Thanks for the extra efforts on benchmarking this.
Thanks Henning!
As a neat side effect, this should also reduce some contention on the live versions map.