🚀 Feature
A core feature of Narwhal is that transmissions may be included in multiple proposals.
The current setup of snarkOS leans too heavily on safety: clients and validators propagate every valid transmission they see to all of their peers, which costs compute and bandwidth and adds considerable latency. The graph below shows that certificate generation slows down significantly under load. Previous measurements have shown that this slowdown is in turn caused by nodes waiting on transmission fetching.
We shouldn't get rid of propagation entirely either. In a load test with 6000 transmissions on reference hardware and no propagation at all, all transmissions sometimes landed within a few rounds, but more often only around 5750 landed. This indicates that some older certificates get left behind under stress, so we need at least some propagation.
To improve network throughput, I propose the following:
- We add a `propagate: bool` field to `struct UnconfirmedTransaction`.
- When clients receive a transaction via `/broadcast/transaction`, they broadcast to all peers with `propagate: true`.
- When clients receive a transaction via the P2P network, they broadcast to all peers with `propagate: false`, but only if `propagate: true` on the incoming message.
- When validators receive a transaction via `/broadcast/transaction`, they broadcast to all validators with `propagate: false`.
- When validators receive a transaction via the P2P network, they broadcast to all peers with `propagate: false`, but only if `propagate: true` on the incoming message. Receiving validators should immediately add the transmission into `cache_transmissions` so they don't have to fetch it.
- In order to ensure all transmissions land, validators periodically, say every `PRIMARY_PING_IN_MS`, include transmissions from `cache_transmissions` which they have not seen in any proposal/certificate/ledger yet. We may want to tackle this last point only after [Feature] Add metric for subdag width #3961 is done. This can also be done based on the validator index.
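
To make the forwarding rules above easier to review, here is a minimal sketch of them as a single decision function. It is not snarkOS code: `NodeKind`, `Source`, `Audience`, `Rebroadcast`, and `rebroadcast_rule` are hypothetical names invented for illustration, and the sketch leaves out the `cache_transmissions` insertion on the validator side.

```rust
/// Hypothetical: the role of the node that received the transaction.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NodeKind {
    Client,
    Validator,
}

/// Hypothetical: how the transaction reached this node.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Source {
    /// Received via the `/broadcast/transaction` REST endpoint.
    Rest,
    /// Received via the P2P network, carrying the proposed `propagate` flag.
    P2p { propagate: bool },
}

/// Hypothetical: which peers the node forwards to.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Audience {
    AllPeers,
    Validators,
}

/// Hypothetical: the outgoing broadcast, including the flag set on it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Rebroadcast {
    audience: Audience,
    propagate: bool,
}

/// Returns the broadcast a node should perform, or `None` if it should not forward.
fn rebroadcast_rule(node: NodeKind, source: Source) -> Option<Rebroadcast> {
    match (node, source) {
        // Clients seeing a fresh submission allow one more hop.
        (NodeKind::Client, Source::Rest) => {
            Some(Rebroadcast { audience: Audience::AllPeers, propagate: true })
        }
        // Validators seeing a fresh submission forward to validators only, with no further hop.
        (NodeKind::Validator, Source::Rest) => {
            Some(Rebroadcast { audience: Audience::Validators, propagate: false })
        }
        // Any node receiving `propagate: true` over P2P forwards once more, then the chain stops.
        (_, Source::P2p { propagate: true }) => {
            Some(Rebroadcast { audience: Audience::AllPeers, propagate: false })
        }
        // `propagate: false` over P2P is never forwarded.
        (_, Source::P2p { propagate: false }) => None,
    }
}
```

Under these rules a transaction submitted to a client travels at most two P2P hops, and one submitted to a validator travels at most one, which is what bounds the broadcast cost.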
The above is also applicable to solutions.
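
The periodic re-inclusion step could be sketched roughly as follows. This is one possible reading of "based on the validator index", not an agreed design: each unseen transmission is assigned to exactly one validator by taking its ID modulo the committee size. The function `select_unseen` is hypothetical, and plain `u64` values stand in for real transmission IDs.

```rust
use std::collections::HashSet;

/// Hypothetical helper: picks the cached transmissions this validator should
/// re-include in its next proposal.
///
/// - `cache`: IDs currently held in the local transmission cache.
/// - `seen`: IDs already observed in any proposal, certificate, or the ledger.
/// - `validator_index` / `num_validators`: this validator's position in the
///   committee, used to split the work (an assumption of this sketch).
fn select_unseen(
    cache: &[u64],
    seen: &HashSet<u64>,
    validator_index: u64,
    num_validators: u64,
) -> Vec<u64> {
    assert!(num_validators > 0, "committee must not be empty");
    cache
        .iter()
        .copied()
        // Skip anything that has already landed somewhere.
        .filter(|id| !seen.contains(id))
        // Keep only the share assigned to this validator.
        .filter(|id| id % num_validators == validator_index)
        .collect()
}
```

Splitting by index avoids every validator re-proposing the same leftovers at once, at the cost of a transmission waiting longer if its assigned validator is offline. A real implementation would probably need a fallback for that case.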
Note that the above approach to optimistic broadcast only works when validators' routers are well-connected. With large networks and peer limits of 21, that may not be the case, so we may want to also move transmission broadcasts to the Gateway.
