
Conversation

@lejeunerenard
Contributor

This PR is currently only the first step of moving the bitfield into the replicator, since the bitfield is only required by the replicator for checking local vs remote blocks.
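
As a rough illustration of that check (not the actual replicator code), the decision to request a block boils down to comparing the two bitfields:

```js
// Illustrative only: the replicator wants a block when the remote peer has it
// and the local bitfield says we don't have it yet.
function wantsBlock (localBitfield, remoteBitfield, index) {
  return remoteBitfield.get(index) && !localBitfield.get(index)
}
```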

This refactor will require updating the following known locations:

lib/core.js Outdated

```js
this.replicator.onupgrade()
this.replicator.onhave(start, length, drop)
this.replicator.onupgrade()
```
Contributor

Should be before `onhave` so it doesn't signal out-of-bounds blocks
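
A minimal sketch of the ordering suggested above (surrounding code omitted, names taken from the snippet):

```js
// Announce the length upgrade first so the peer's view of the core is updated
// before any have-ranges are signalled for blocks beyond the old length.
this.replicator.onupgrade()
this.replicator.onhave(start, length, drop)
```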

mafintosh and others added 20 commits October 7, 2025 13:32
Everywhere that updates or reads the bitfield needs to be async, since it now goes to storage. As a result, many internal functions inherited the bitfield's asynchrony.
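
A minimal sketch of that propagation, with an illustrative helper name rather than the real internals:

```js
// Illustrative only: a bitfield read that used to be synchronous now resolves
// from storage, so the caller has to await it and becomes async itself.
async function hasBlock (core, index) {
  return await core.bitfield.get(index)
}
```
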
Awaiting these calls reduces potential timing issues when reading from the bitfield.
`_updateNonPrimary()` reads from the bitfield to clamp the range request via `clampRange()`. This clamping can hit a race condition where the bitfield is updated via `onhave`'s `_setBitfieldRanges()` mid-read, which causes the clamp to resolve the requests before the "primary" can respond.
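
A hypothetical sketch of the clamping step, just to show where the race can bite (the real `clampRange()` will differ):

```js
// Hypothetical: trim a requested range down to the blocks still missing
// locally. If `_setBitfieldRanges()` mutates the bitfield between these
// awaits, the clamp can collapse the range and resolve the request too early.
async function clampRange (bitfield, range) {
  while (range.start < range.end && await bitfield.get(range.start)) range.start++
  while (range.end > range.start && await bitfield.get(range.end - 1)) range.end--
  return range
}
```
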
With the bitfield now asynchronous, the implicit batching of reads & writes that the previous synchronous implementation provided no longer guarantees that writes can't conflict with in-progress reads. To keep operations sequential, an internal lock was added so that operations called without awaiting, or called between event loops because of an external message etc., still won't execute simultaneously. To protect against read & write operations interleaving when they are intended to be sequential, an external lock was added to claim the bitfield roughly per protocol message. Because the external lock theoretically covers the internal lock scenario by keeping access to a single chain of async calls, it might be possible to remove the internal lock in the future. Ideally neither lock would be necessary, but currently they solve the above issues.
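
A rough sketch of the locking idea, assuming a simple promise-chain mutex rather than whatever primitive the PR actually uses:

```js
// Assumed promise-chain mutex: acquire() resolves once the previous holder
// releases, and hands back a release function for the new holder.
class Lock {
  constructor () {
    this._tail = Promise.resolve()
  }

  async acquire () {
    const prev = this._tail
    let release
    this._tail = new Promise(resolve => { release = resolve })
    await prev
    return release
  }
}

// The external lock claims the bitfield for roughly one protocol message
// (e.g. one ondata handler), so that message's reads and writes can't
// interleave with another message's.
async function withBitfieldLock (lock, fn) {
  const release = await lock.acquire()
  try {
    return await fn()
  } finally {
    release()
  }
}
```
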
Enabling locks here fixes the `bigger download range` test in `test/replicate.js`. This test would flake when the download event triggered after the request was already resolved. That happened because `_updateNonPrimary()` resolved the range request before the primary processing could emit, due to a race condition where the bitfield was updated while being read, yielding a false clamped range. The bitfield was primarily updated in `core.verify()`, and setting a lock around the entire `ondata` call chain lets other `data` messages queue up instead of verifying blocks and updating the bitfield before previous requests can respond.
Used for bitfield locks.
The `_request*` methods are assumed to be synchronous in the replication state machine. To avoid converting the entire state machine to async, the pages are loaded preemptively and checked synchronously. This currently happens in `_requestSeek()` & `_requestRangeBlock()`.
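
A sketch of the preloading idea under assumed helper names (`getPage` and `pageSize` are placeholders, not the real bitfield API):

```js
// Placeholder API: warm the bitfield pages a request may touch so the
// synchronous _request* checks can read them without awaiting storage.
async function preloadBitfieldPages (bitfield, start, end) {
  for (let index = start; index < end; index += bitfield.pageSize) {
    await bitfield.getPage(index) // loads the page into the in-memory cache
  }
}
```
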
Ensures the bitfield remains unchanged while iterating through the want call.
Part of the previous commit to await all `replicator.onupgrade()` calls.
Without this lock, the `_update()` call in the `onsync` event doesn't know about the remote bitfield update. While `onsync` doesn't access the `localBitfield`, it does rely on updates to the `remoteBitfield`, which happen alongside the `localBitfield` updates. Fixes the 'restore after cancelled block request' test, which could fail because the sync after the new append wouldn't cause a `block` request, since the `b` peer assumed the `a` peer didn't have the block. The test only waits for an append event (which doesn't guarantee the `upload` event on `a` has been triggered) and the connection is destroyed afterwards.
Because `.broadcastRange()` called during `onopen` is now async, the peer can be added to the replicator after a synchronous close is called on the protomux channel. `onclose` assumed synchronous calls, so it expects the peer to already be added before it is closed. With the peer not yet added, it isn't removed from the replicator and `replicator.destroy()` will loop forever. Also added a destroy method to the bitfield for destroying the locks. Not required for the fix, but it is reasonable to destroy them regardless.
Now that bitfield operations go to disk and not just to memory, they are slower and need more time for larger numbers of blocks.
Caused timing errors when the bitfield attempts to read from storage after it has been closed.
Now that checking the bitfield is async, iterating through ranges is async too, so the `_ranges` array can be mutated elsewhere while awaiting. This means the current index of a range request can be stale when the request is resolved. To prevent this, the request's index is looked up synchronously when the request is resolved, so the index is accurate and another request isn't clobbered by the popped head. Since this logic already exists to unref the request (for gc'ing & cancelling), `_unref()` is reused. A success boolean is needed to update the index in the `_updateNonPrimary()` ranges loop, so all request `_unref()`s are updated to return a success bool.
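
An illustrative version of the swap-remove described here (the real request bookkeeping is more involved):

```js
// Illustrative only: look the request up at resolve time, remove it by
// swapping in the popped tail, and report whether anything was removed so the
// caller can keep its loop index consistent.
function unrefRange (ranges, req) {
  const i = ranges.indexOf(req)
  if (i === -1) return false // already removed elsewhere

  const head = ranges.pop()
  if (i < ranges.length) ranges[i] = head // req was not the tail; fill its slot

  return true
}
```
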
This prevents the `session id reuse does not stall` test from failing randomly, usually with only `99` download events instead of the expected `100`. Processing `ondata` requests now requires more event loop iterations to update / check the bitfield in storage, which allows the close promise for a session to resolve while a request is still being processed, meaning the test checks the number of downloads before all events can be emitted.