
Conversation

@lejeunerenard
Contributor

This PR is currently only the first step of moving the bitfield into the replicator, since the bitfield is only required by the replicator for checking local vs remote blocks.
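
As a rough illustration of that check (not the actual replicator code), the decision to request a block boils down to comparing the two bitfields:

```js
// Illustrative only: the replicator wants a block when the remote peer has it
// and the local bitfield says we don't have it yet.
function wantsBlock (localBitfield, remoteBitfield, index) {
  return remoteBitfield.get(index) && !localBitfield.get(index)
}
```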

This refactor will require updating the following known locations:

lib/core.js Outdated

```js
this.replicator.onupgrade()
this.replicator.onhave(start, length, drop)
this.replicator.onupgrade()
```
Contributor

Should be before `onhave` so it doesn't signal out-of-bounds blocks
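
A minimal sketch of the ordering suggested above (surrounding code omitted, names taken from the snippet):

```js
// Announce the length upgrade first so the peer's view of the core is updated
// before any have-ranges are signalled for blocks beyond the old length.
this.replicator.onupgrade()
this.replicator.onhave(start, length, drop)
```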

mafintosh and others added 20 commits October 7, 2025 13:32
Everywhere that updates or reads the bitfield needs to be async, since it now goes to storage. As a result, many internal functions inherited the bitfield's asynchrony.
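
A minimal sketch of that propagation, with an illustrative helper name rather than the real internals:

```js
// Illustrative only: a bitfield read that used to be synchronous now resolves
// from storage, so the caller has to await it and becomes async itself.
async function hasBlock (core, index) {
  return await core.bitfield.get(index)
}
```
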
Awaiting these calls reduces potential timing issues when reading from the bitfield.
`_updateNonPrimary()` reads from the bitfield to clamp the range request via `clampRange()`. This clamping can hit a race condition where the bitfield is updated via `onhave`'s `_setBitfieldRanges()` mid-read, which causes the clamp to resolve the requests before the "primary" can respond.
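
A hypothetical sketch of the clamping step, just to show where the race can bite (the real `clampRange()` will differ):

```js
// Hypothetical: trim a requested range down to the blocks still missing
// locally. If `_setBitfieldRanges()` mutates the bitfield between these
// awaits, the clamp can collapse the range and resolve the request too early.
async function clampRange (bitfield, range) {
  while (range.start < range.end && await bitfield.get(range.start)) range.start++
  while (range.end > range.start && await bitfield.get(range.end - 1)) range.end--
  return range
}
```
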
With the bitfield now asynchronous, the implicit batching of reads & writes that the previous synchronous implementation provided no longer guarantees that writes can't conflict with in-progress reads. To keep operations sequential, an internal lock was added so that operations called without awaiting, or called between event loops because of an external message etc., still won't execute simultaneously. To protect against read & write operations interleaving when they are intended to be sequential, an external lock was added to claim the bitfield roughly per protocol message. Because the external lock theoretically covers the internal lock scenario by keeping access to a single chain of async calls, it might be possible to remove the internal lock in the future. Ideally neither lock would be necessary, but currently they solve the above issues.
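
A rough sketch of the locking idea, assuming a simple promise-chain mutex rather than whatever primitive the PR actually uses:

```js
// Assumed promise-chain mutex: acquire() resolves once the previous holder
// releases, and hands back a release function for the new holder.
class Lock {
  constructor () {
    this._tail = Promise.resolve()
  }

  async acquire () {
    const prev = this._tail
    let release
    this._tail = new Promise(resolve => { release = resolve })
    await prev
    return release
  }
}

// The external lock claims the bitfield for roughly one protocol message
// (e.g. one ondata handler), so that message's reads and writes can't
// interleave with another message's.
async function withBitfieldLock (lock, fn) {
  const release = await lock.acquire()
  try {
    return await fn()
  } finally {
    release()
  }
}
```
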
Enabling locks here fixes the `bigger download range` test in `test/replicate.js`. This test would flake when the download event triggered after the request was already resolved. That happened because `_updateNonPrimary()` resolved the range request before the primary processing could emit, due to a race condition where the bitfield was updated while being read, yielding a false clamped range. The bitfield was primarily updated in `core.verify()`, and setting a lock around the entire `ondata` call chain lets other `data` messages queue up instead of verifying blocks and updating the bitfield before previous requests can respond.
Used for bitfield locks.
The `_request*` methods are assumed to be synchronous in the replication state machine. To avoid converting the entire state machine to async, the pages are loaded preemptively and checked synchronously. This currently happens in `_requestSeek()` & `_requestRangeBlock()`.
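
A sketch of the preloading idea under assumed helper names (`getPage` and `pageSize` are placeholders, not the real bitfield API):

```js
// Placeholder API: warm the bitfield pages a request may touch so the
// synchronous _request* checks can read them without awaiting storage.
async function preloadBitfieldPages (bitfield, start, end) {
  for (let index = start; index < end; index += bitfield.pageSize) {
    await bitfield.getPage(index) // loads the page into the in-memory cache
  }
}
```
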
Ensures the bitfield remains unchanged while iterating through the want call.
Part of the previous commit to await all `replicator.onupgrade()` calls.
Without this lock, the `_update()` call in the `onsync` event doesn't know about the remote bitfield update. While `onsync` doesn't access the `localBitfield`, it does rely on updates to the `remoteBitfield`, which happen alongside the `localBitfield` updates. Fixes the 'restore after cancelled block request' test, which could fail because the sync after the new append wouldn't cause a `block` request, since the `b` peer assumed the `a` peer didn't have the block. The test only waits for an append event (which doesn't guarantee the `upload` event on `a` has been triggered) and the connection is destroyed afterwards.
Because `.broadcastRange()` called during `onopen` is now async, the peer can be added to the replicator after a synchronous close is called on the protomux channel. `onclose` assumed synchronous calls, so it expects the peer to already be added before it is closed. With the peer not yet added, it isn't removed from the replicator and `replicator.destroy()` will loop forever. Also added a destroy method to the bitfield for destroying the locks. Not required for the fix, but it is reasonable to destroy them regardless.
Now that bitfield operations go to disk and not just to memory, they are slower and need more time for larger numbers of blocks.
Caused timing errors when the bitfield attempts to read from storage after it has been closed.
Now that checking the bitfield is async, iterating through ranges is async too, so the `_ranges` array can be mutated elsewhere while awaiting. This means the current index of a range request can be stale when the request is resolved. To prevent this, the request's index is looked up synchronously when the request is resolved, so the index is accurate and another request isn't clobbered by the popped head. Since this logic already exists to unref the request (for gc'ing & cancelling), `_unref()` is reused. A success boolean is needed to update the index in the `_updateNonPrimary()` ranges loop, so all request `_unref()`s are updated to return a success bool.
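
An illustrative version of the swap-remove described here (the real request bookkeeping is more involved):

```js
// Illustrative only: look the request up at resolve time, remove it by
// swapping in the popped tail, and report whether anything was removed so the
// caller can keep its loop index consistent.
function unrefRange (ranges, req) {
  const i = ranges.indexOf(req)
  if (i === -1) return false // already removed elsewhere

  const head = ranges.pop()
  if (i < ranges.length) ranges[i] = head // req was not the tail; fill its slot

  return true
}
```
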
This prevents the `session id reuse does not stall` test from failing randomly, usually with only `99` download events instead of the expected `100`. Processing `ondata` requests now requires more event loop iterations to update / check the bitfield in storage, which allows the close promise for a session to resolve while a request is still being processed, meaning the test checks the number of downloads before all events can be emitted.