Optimize KeyedLock and related concurrency primitives #96372
Conversation
Pinging @elastic/es-core-infra (Team:Core/Infra)
libs/core/src/main/java/org/elasticsearch/core/AbstractRefCounted.java (outdated, resolved)
This class was quite hot in recent benchmarks of shared-cache based searches. The keyed lock did far too many redundant map lookups; I folded them into a single compute, which also removes the needless looping. Those same benchmarks showed a lot of visible time spent dealing with ref counts as well. I removed one layer of indirection in the atomics used by both the release-once and the abstract ref count, which should also save a little in CPU caches.
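The single-compute idea can be sketched roughly like this (a simplified stand-in, not the actual KeyedLock code; class and method names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch: one compute() call either finds-and-increments an existing
// entry or creates a new one, replacing a get/putIfAbsent retry loop that could
// need several lookups of the same bucket.
class SingleComputeKeyedLock<K> {
    static final class Entry {
        int refs = 1; // mutated only inside compute(), i.e. under the bucket lock for this key
    }

    private final ConcurrentHashMap<K, Entry> map = new ConcurrentHashMap<>();

    Entry acquire(K key) {
        return map.compute(key, (k, existing) -> {
            if (existing == null) {
                return new Entry();   // first holder
            }
            existing.refs++;          // additional holder
            return existing;
        });
    }

    void release(K key) {
        map.compute(key, (k, existing) -> {
            if (existing == null) {
                throw new IllegalStateException("lock for " + k + " is not held");
            }
            return --existing.refs == 0 ? null : existing; // returning null removes the mapping
        });
    }

    int heldKeys() {
        return map.size();
    }
}
```

Both acquire and release touch the relevant map bucket exactly once, where a loop-based variant can need a fresh lookup per retry.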
Force-pushed f9f27b6 to ff36302.

    private static class ReleaseOnce extends AtomicReference<Releasable> implements Releasable {
This seems ok because the caller can't see the AtomicReference bit. Maybe do the same thing to ActionListener#notifyOnce and RunOnce too?
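For reference, the pattern under discussion, sketched with a stand-in Releasable interface (the real one lives in org.elasticsearch.core): extending AtomicReference directly means the once-guard needs no separate wrapped field, saving one allocation and one pointer chase.

```java
import java.util.concurrent.atomic.AtomicReference;

// Stand-in for org.elasticsearch.core.Releasable, just for this sketch.
interface Releasable {
    void close();
}

// The object *is* its own atomic slot: no wrapped AtomicReference field.
// Callers only ever see the Releasable interface, so the inherited
// AtomicReference API stays effectively out of sight.
final class ReleaseOnceSketch extends AtomicReference<Releasable> implements Releasable {
    ReleaseOnceSketch(Releasable delegate) {
        super(delegate);
    }

    @Override
    public void close() {
        Releasable delegate = getAndSet(null); // exactly one caller wins the race
        if (delegate != null) {
            delegate.close();
        }
    }
}
```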
Ah, for those I don't think I can do it without making things visible to the caller. At least not unless I move more things around. Maybe do that later if we see it matter anywhere?
I wonder if we need KeyedLock even in SharedBlobCacheService. It seems doable to avoid it (but might be too big a change to do just now). I could at least imagine hashing the CacheKey (and region) and using a segmented-lock approach instead, with a fixed array of locks (perhaps 16x the CPU count), to avoid all the map compute/removes that happen in KeyedLock.
The Releasables change is obviously good on its own, as is using the releasableLock method instead of the ReleasableLock class.
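The segmented-lock idea, sketched (sizing and names here are illustrative guesses, not an actual implementation): hash the key once into a fixed, power-of-two array of locks, so acquiring never touches a concurrent map at all.

```java
import java.util.concurrent.locks.ReentrantLock;

// Fixed array of lock stripes indexed by key hash. Distinct keys can share a
// stripe, which trades a little false contention for doing zero map work.
final class StripedLockSketch {
    private final ReentrantLock[] stripes;

    StripedLockSketch(int requestedStripes) {
        // round up to a power of two so we can mask instead of mod
        int size = 1 << (32 - Integer.numberOfLeadingZeros(Math.max(2, requestedStripes) - 1));
        stripes = new ReentrantLock[size];
        for (int i = 0; i < size; i++) {
            stripes[i] = new ReentrantLock();
        }
    }

    ReentrantLock lockFor(Object key) {
        int h = key.hashCode();
        h ^= h >>> 16; // spread high bits down, as ConcurrentHashMap does
        return stripes[h & (stripes.length - 1)];
    }

    int stripeCount() {
        return stripes.length;
    }
}
```

A cache could size this at, say, 16 * Runtime.getRuntime().availableProcessors(), per the suggestion above; the right multiplier would need measuring.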
libs/core/src/main/java/org/elasticsearch/core/AbstractRefCounted.java (outdated, resolved)
    KeyLock perNodeLock = map.compute(key, computeLock(fair));
I am not convinced that this will be faster. compute has a stronger guarantee that the remapping function is called only once, which leads to a bit of locking and to storing twice (in the not-yet-present case). And even if the remapping results in the same value, I think it stores the value back. That last part might not matter much, since we probably expect to be uncontended in most cases, and the contended case would likely be dominated by the lock.
Have you done benchmarks to validate this part of the change?
I wonder if optimistically doing tryCreateNewLock before any map lookup would be faster, since in the uncontended case that would go through, and in the contended case it does not matter?
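The optimistic alternative being suggested could look roughly like this (names are made up for the sketch; tryCreateNewLock here becomes a speculative putIfAbsent):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the optimistic fast path: speculatively install a
// fresh entry with putIfAbsent, so the uncontended case does one map
// operation and no remapping-function machinery.
final class OptimisticKeyedCount {
    static final class KeyLock {
        final AtomicInteger refs = new AtomicInteger(1);
    }

    private final ConcurrentHashMap<String, KeyLock> map = new ConcurrentHashMap<>();

    KeyLock acquire(String key) {
        KeyLock fresh = new KeyLock(); // speculative allocation, wasted only on collision
        while (true) {
            KeyLock existing = map.putIfAbsent(key, fresh);
            if (existing == null) {
                return fresh; // uncontended fast path: a single map operation
            }
            int refs = existing.refs.get();
            // increment, but never resurrect an entry that already dropped to zero
            if (refs > 0 && existing.refs.compareAndSet(refs, refs + 1)) {
                return existing;
            }
            // entry is concurrently being removed: retry, possibly installing ours
        }
    }

    void release(String key, KeyLock lock) {
        if (lock.refs.decrementAndGet() == 0) {
            map.remove(key, lock); // remove only if still mapped to this exact lock
        }
    }

    int heldKeys() {
        return map.size();
    }
}
```

The trade-off matches the discussion: the uncontended path is one putIfAbsent, while the contended path pays an extra allocation and a retry loop.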
> I wonder if optimistically doing tryCreateNewLock before any map lookup would not be faster, since in the uncontended case that would go through and in the contended case, it does not matter?
I guess that may be true. I'm just a little worried about how badly the contended case will behave then. It seems like in the real world, most of the time gained by my change comes from looking up the right bucket in the map only once.

In the end it's really hard to measure the effect of this change in the contended scenario, because it depends massively on the specific keys used, etc. For the uncontended case, this solution wins hands down in some quick benchmarks, just by doing the map lookup only once and inlining much better. For the contended case I wasn't able to find a good benchmark approximation. I would hope, though, that on average spinning on a synchronized block (the lock will rarely be contended to the point where it isn't effectively a spin-lock) is still faster than spinning by doing object creation and repeated map lookups.
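A rough single-threaded sketch of the uncontended comparison discussed here (trustworthy numbers would need a proper harness such as JMH; all names are illustrative): one variant does a single compute() per step, the other does a get followed by a second map operation, mimicking the lookup-twice shape.

```java
import java.util.concurrent.ConcurrentHashMap;

// Single-threaded timing sketch only; viaTwoLookups is not a correct
// concurrent protocol as written, it just reproduces the access pattern.
final class UncontendedSketch {
    static long viaCompute(ConcurrentHashMap<String, Integer> map, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            map.compute("key", (k, v) -> v == null ? 1 : v + 1);              // acquire: one lookup
            map.compute("key", (k, v) -> v == null || v <= 1 ? null : v - 1); // release: one lookup
        }
        return System.nanoTime() - start;
    }

    static long viaTwoLookups(ConcurrentHashMap<String, Integer> map, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            Integer v = map.get("key");        // lookup #1
            if (v == null) {
                map.putIfAbsent("key", 1);     // lookup #2
            } else {
                map.replace("key", v, v + 1);
            }
            Integer w = map.get("key");
            if (w != null && w <= 1) {
                map.remove("key");
            } else if (w != null) {
                map.replace("key", w, w - 1);
            }
        }
        return System.nanoTime() - start;
    }
}
```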
Can you share (perhaps privately) the benchmarks done with and without the change and the results?
If we have benchmarks showing this is faster in the uncontended case, then I am good with this.
Avoid the use of KeyedLock, which has a high overhead for uncontended locks. Reduce granularity of lock during #get to the actual region. Relates elastic#96372
Regardless of #96399 I think I still like this change in isolation for other uses of the keyed lock and the other improvements in here :)
@henningandersen as discussed on Slack, I reran those benchmarks on x86 after getting home to my workstation. You are mostly right here, the
LGTM.
Thanks for the extra efforts on benchmarking this.
Thanks Henning!
As a neat side effect, this should also reduce some contention on the live versions map.