Enable client-side timeouts and replace retry logic with `reqwest`'s #39

tnull · 2025-08-19T09:45:30Z

Based on #38.

We enable reqwest client-level timeouts:

While the `RetryPolicy` has a `MaxTotalDelayRetryPolicy`, the retry `loop` would only check this configured delay once the operation future actually returns a value. However, without client-side timeouts, we're not super sure the operation is actually guaranteed to return anything (even an error, IIUC). So here, we enable some coarse client-side default timeouts to ensure the polled futures eventually return either the response *or* an error we can handle via our retry logic.

~~Additionally, we here rip out our ~broken retry logic and replace it by utilizing reqwest's retry logic that shipped in the recent v0.12.23 release.~~

ldk-reviews-bot · 2025-08-19T09:45:33Z

👋 Thanks for assigning @tankyleo as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

tankyleo · 2025-08-19T20:44:05Z

src/client.rs

+.timeout(DEFAULT_TIMEOUT)
+.connect_timeout(DEFAULT_TIMEOUT)
+.read_timeout(DEFAULT_TIMEOUT)


Thank you, seems like we could just do with the single global timeout here, no need for connect and read?

But we can leave it as is and potentially tweak the inner timeouts later.

Right, no strong opinion here.

tankyleo · 2025-08-19T20:50:52Z

One aspect that might be debatable here is whether we should drop MaxTotalDelayRetryPolicy given it would interact with the client-side default delay. Hence also pinging @jkczyz who reviewed the original retry PR.

For reference original PR is #20. On first impression, I'd be in favor of the drop myself.

jkczyz · 2025-08-20T00:13:31Z

> One aspect that might be debatable here is whether we should drop MaxTotalDelayRetryPolicy given it would interact with the client-side default delay. Hence also pinging @jkczyz who reviewed the original retry PR.
>
> For reference original PR is #20. On first impression, I'd be in favor of the drop myself.

Do the added timeouts apply to a single operation? If it is never exceeded, wouldn't we still want MaxTotalDelayRetryPolicy to allow limiting retries to a maximum amount of time?

What is meant by client-side default delay?

tnull · 2025-08-20T07:00:31Z

> Do the added timeouts apply to a single operation?

Yes, they apply to a single read, but also to connecting / detecting dropped connections AFAIU.

> If it is never exceeded, wouldn't we still want MaxTotalDelayRetryPolicy to allow limiting retries to a maximum amount of time?

Yes, it could be useful, but of course it's somewhat redundant if we set a client-side timeout and limit the number of retries. It could therefore be a bit confusing if somebody configures the MaxTotalDelayRetryPolicy, but the other total delay still applies if it's lesser (i.e., number of retries times timeout).

> What is meant by client-side default delay?

Ah, sorry, that was a typo I only corrected in the PR title: should have said default timeout, not delay.

jkczyz · 2025-08-20T14:35:56Z

> Yes, it could be useful, but of course it's somewhat redundant if we set a client-side timeout and limit the number of retries.

Do you mean MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>>?

> It could therefore be a bit confusing if somebody configures the MaxTotalDelayRetryPolicy, but the other total delay still applies if it's lesser (i.e., number of retries times timeout).

Isn't this already the case when configured as I mentioned above? Number of attempts takes priority over total delay given the way MaxTotalDelayRetryPolicy is written.

Maybe I'm confused about what is lesser in that example.

tnull · 2025-08-21T08:11:50Z

> Do you mean MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>>?

Yes, if each client call is also limited by a timeout, then we'd have either timeout*MaxAttemptsRetryPolicy or MaxTotalDelayRetryPolicy being the limiting factor.

> Maybe I'm confused about what is lesser in that example.

Say you configure MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>> with 5 retries and a total delay of 100 seconds (for the sake of this example). Then you'd expect the client to return either in case of success or after it tried 5 times or after 100s, whichever comes first. Now with the client-side timeouts we also have each retry time out after 10 seconds, so it might already be done after 50s.

Or, maybe even a bit more confusing would be if the user configured a MaxTotalDelayRetryPolicy of less than 10s, say 5s. They would expect a client call to definitely return after those 5s. But, if we have a client-side timeout of 10s (or none before this PR), even the first attempt could take way longer than the configured total delay (since we don't use a select but rather a loop, we always await the first/current operation to return).
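To make the two scenarios above concrete, here is a small, simplified model of a retry loop that only consults the total-delay budget after each attempt has returned. `worst_case_elapsed` and its parameters are hypothetical names for illustration; the model ignores backoff delays between attempts and is not the crate's actual `RetryPolicy` implementation.

```rust
use std::time::Duration;

/// Worst-case wall-clock time of a retry loop that awaits every attempt to
/// completion and only checks the total-delay budget afterwards.
/// Hypothetical helper: backoff delays between attempts are ignored.
fn worst_case_elapsed(
    per_attempt_timeout: Duration,
    max_attempts: u32,
    max_total_delay: Duration,
) -> Duration {
    let mut elapsed = Duration::ZERO;
    for _ in 0..max_attempts {
        // In the worst case an attempt consumes its full client-side timeout
        // before the loop gets a chance to look at the policy.
        elapsed += per_attempt_timeout;
        if elapsed >= max_total_delay {
            break;
        }
    }
    elapsed
}

fn main() {
    let timeout = Duration::from_secs(10);

    // 5 attempts with a 100s budget: the attempt count is the limiting factor.
    let bound = worst_case_elapsed(timeout, 5, Duration::from_secs(100));
    assert_eq!(bound, Duration::from_secs(50));

    // A 5s budget: the first attempt alone may still run for the full 10s.
    let bound = worst_case_elapsed(timeout, 5, Duration::from_secs(5));
    assert_eq!(bound, Duration::from_secs(10));
}
```

Under these assumptions the configured total delay is never a hard upper bound; it only decides whether another attempt is started.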

ldk-reviews-bot · 2025-08-21T09:46:32Z

🔔 1st Reminder

Hey @jkczyz! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

jkczyz · 2025-08-21T14:40:51Z

> Say you configure MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>> with 5 retries and a total delay of 100 seconds (for the sake of this example). Then you'd expect the client to return either in case of success or after it tried 5 times or after 100s, whichever comes first. Now with the client-side timeouts we also have each retry time out after 10 seconds, so it might already be done after 50s.

Isn't that expected? "done after 50s" is really "done after 5 attempts".

> Or, maybe even a bit more confusing would be if the user configured a MaxTotalDelayRetryPolicy of less than 10s, say 5s. They would expect a client call to definitely return after those 5s. But, if we have a client-side timeout of 10s (or none before this PR), even the first attempt could take way longer than the configured total delay (since we don't use a select but rather a loop, we always await the first/current operation to return).

Yeah, though isn't that a good argument to use select? Or does the current design not allow that given policy timeout is built into the type rather than the calling site being aware of it?

tnull · 2025-08-21T15:59:45Z

> Yeah, though isn't that a good argument to use select? Or does the current design not allow that given policy timeout is built into the type rather than the calling site being aware of it?

True, seems like we should? And given that we already use tokio with the time feature, it should be straightforward. I think I'll add a commit to this PR.
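The difference between awaiting an attempt in a loop and racing it against a deadline can be sketched without an async runtime. The example below deliberately swaps in a standard-library thread plus `mpsc::Receiver::recv_timeout` for a `select`-style race in tokio; `run_with_deadline` and `Outcome` are hypothetical names, not part of this crate.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Outcome<T> {
    Completed(T),
    DeadlineExceeded,
}

/// Runs `op` on a worker thread and stops waiting once `deadline` has passed,
/// instead of blocking until the operation itself returns.
fn run_with_deadline<T, F>(op: F, deadline: Duration) -> Outcome<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails if the caller already gave up; that is fine here.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(deadline) {
        Ok(value) => Outcome::Completed(value),
        // Covers both an expired deadline and a worker that died before sending.
        Err(_) => Outcome::DeadlineExceeded,
    }
}

fn main() {
    // A fast operation finishes well within the deadline.
    let fast = run_with_deadline(|| 42, Duration::from_secs(5));
    assert_eq!(fast, Outcome::Completed(42));

    // A slow operation is abandoned once the deadline fires.
    let slow = run_with_deadline(
        || {
            thread::sleep(Duration::from_millis(500));
            42
        },
        Duration::from_millis(20),
    );
    assert_eq!(slow, Outcome::DeadlineExceeded);
}
```

Note that the abandoned worker keeps running in this sketch, whereas dropping a future in an async `select` cancels it; the point is only that the caller regains control at the deadline.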

tnull · 2025-08-22T12:53:39Z

> Yeah, though isn't that a good argument to use select? Or does the current design not allow that given policy timeout is built into the type rather than the calling site being aware of it?
>
> True, seems like we should? And given that we already use tokio with the time feature, it should be straightforward. I think I'll add a commit to this PR.

Argh, after looking into it for a bit I have to eat my words: it's actually not trivial, as currently RetryPolicy::next_delay requires us to supply the returned error, i.e., we can only calculate the next delay based on the error type (as we use it in FilteredRetryPolicy).

And more generally, with a generic error type, we don't know what error we'd return in case the timeout happens before the operation future resolves.

jkczyz · 2025-08-22T16:13:23Z

> And more generally, with a generic error type, we don't know what error we'd return in case the timeout happens before the operation future resolves.

Would it help defining an enum parameterized by the error E where one variant is for a timeout and the other for wrapping E?
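One possible shape for such an enum, as a minimal sketch: `OperationError`, its variants, `next_delay`, and the fixed 100 ms delay are all hypothetical and not part of the crate's API, and treating a timeout as always retryable is an assumption made for illustration.

```rust
use std::time::Duration;

/// Hypothetical wrapper: either the attempt timed out, or it resolved with
/// the operation's own error type `E`.
#[derive(Debug, PartialEq)]
enum OperationError<E> {
    Timeout,
    Inner(E),
}

/// Hypothetical stand-in for a policy's delay computation: `Some(delay)`
/// means "retry after `delay`", `None` means "give up".
fn next_delay<E>(
    err: &OperationError<E>,
    retry_inner: impl Fn(&E) -> bool,
) -> Option<Duration> {
    let base = Duration::from_millis(100);
    match err {
        // Assumption: a timed-out attempt is always worth retrying.
        OperationError::Timeout => Some(base),
        // Defer to the inner filter for errors returned by the operation.
        OperationError::Inner(e) if retry_inner(e) => Some(base),
        OperationError::Inner(_) => None,
    }
}

/// Example filter over a string error type: conflicts are not retried.
fn retry_unless_conflict(e: &&str) -> bool {
    *e != "conflict"
}

fn main() {
    let conflict: OperationError<&str> = OperationError::Inner("conflict");
    let timeout: OperationError<&str> = OperationError::Timeout;

    assert_eq!(next_delay(&conflict, retry_unless_conflict), None);
    assert_eq!(
        next_delay(&timeout, retry_unless_conflict),
        Some(Duration::from_millis(100))
    );
}
```

With a wrapper like this, the timeout case no longer needs a value of the generic type `E`, which is the obstacle described above.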

jkczyz

Waiting on @tnull (no rush!). Just want to silence the review bot.

tnull · 2025-11-05T13:18:43Z

Now also tacked-on a commit that corrected the expected HTTP status codes in our mocked test service. Relatedly, I discovered that the Rust vss-server didn't use the correct codes either (a regression from the Java version), fixed in lightningdevkit/vss-server#65

tnull · 2025-11-05T13:26:36Z

> Now also tacked-on a commit that corrected the expected HTTP status codes in our mocked test service. Relatedly, I discovered that the Rust vss-server didn't use the correct codes either (a regression from the Java version), fixed in lightningdevkit/vss-server#65

.. and finally it turns out that we can drop our RetryPolicy in this PR as, once corrected, we can deal with all the cases through HTTP status codes, no need to parse the payload.

tnull · 2025-11-05T14:24:52Z

Rebased to resolve minor conflicts.

ldk-reviews-bot · 2025-11-06T13:24:27Z

🔔 1st Reminder

Hey @jkczyz @tankyleo! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

jkczyz · 2025-11-06T15:19:54Z

src/client.rs

+fn build_client() -> Client {
+Client::builder()
+.timeout(DEFAULT_TIMEOUT)
+.connect_timeout(DEFAULT_TIMEOUT)
+.read_timeout(DEFAULT_TIMEOUT)
+.build()
+.unwrap()
+}


Could probably inline this in from_client (does that need to be pub?) and pass header_provider from each new method.

> does that need to be pub?

Yes, as we want to allow users to override any specific reqwest properties. In fact, in LDK Node we currently make use of from_client_with_headers introduced here to be able to override the default timeouts/max retries.

src/client.rs

src/error.rs

src/util/retry.rs

ldk-reviews-bot · 2025-11-08T13:24:53Z

🔔 2nd Reminder

Hey @tankyleo! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

ldk-reviews-bot · 2025-11-10T13:25:52Z

🔔 3rd Reminder

Hey @tankyleo! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

tnull · 2025-11-10T15:21:31Z

So it turns out that not all error cases were retryable, e.g., in LDK Node's VssStore we ran into cases where the block_on in the sync KVStoreSync::write case resulted in threads being blocked and the putObject task never returning, only timing out via reqwest's total timeout. While we might be able to revisit dropping our RetryPolicy once LDK Node has fully switched to async persistence, for now we still need to lean on it. I've now reverted the corresponding changes here, excuse the back-and-forth.

tankyleo · 2025-11-10T23:10:31Z

tests/tests.rs

 };
 let mock_server = mockito::mock("POST", GET_OBJECT_ENDPOINT)
-.with_status(409)
+.with_status(404)


This is not asserted in the test; want to make sure this is intentional? I understand why we would not assert that the server responds with a consistent error code - error response map.

I'm not quite sure I follow what you're asking?

I.e., I change the status code and the test does not fail :)

> I.e., I change the status code and the test does not fail :)

Well, we currently determine the error based on the decoded payload. But, we should still mirror the actual server behavior. FWIW, if we wanted to assert something, it would be that the mock service mirrors what we do in production, but if we did so (running a service in CI) it would mitigate the idea of using a mock service in the first place. Not sure if other projects have better practices around mocking..

> but if we did so (running a service in CI) it would mitigate the idea of using a mock service in the first place.

Thank you, can you help me understand this part?

> Well, we currently determine the error based on the decoded payload. But, we should still mirror the actual server behavior. FWIW, if we wanted to assert something, it would be that the mock service mirrors what we do in production

How about some kind of debug_assert client-side that asserts that the VssError we got from the server matches the expected StatusCode?

This is what I had in mind. But again maybe not worth asserting here...

diff --git a/src/error.rs b/src/error.rs
index 5955e6a..497a588 100644
--- a/src/error.rs
+++ b/src/error.rs
@@ -34,7 +34,7 @@ impl VssError {
 /// Create new instance of `VssError`
 pub fn new(status: StatusCode, payload: Bytes) -> VssError {
 match ErrorResponse::decode(&payload[..]) {
-Ok(error_response) => VssError::from(error_response),
+Ok(error_response) => VssError::from((status, error_response)),
 Err(e) => {
 let message = format!(
 "Unable to decode ErrorResponse from server, HttpStatusCode: {}, DecodeErr: {}",
@@ -73,22 +73,35 @@ impl Display for VssError {
 impl Error for VssError {}

-impl From<ErrorResponse> for VssError {
-fn from(error_response: ErrorResponse) -> Self {
+impl From<(StatusCode, ErrorResponse)> for VssError {
+fn from((status, error_response): (StatusCode, ErrorResponse)) -> Self {
 match error_response.error_code() {
-ErrorCode::NoSuchKeyException => VssError::NoSuchKeyError(error_response.message),
+ErrorCode::NoSuchKeyException => {
+debug_assert_eq!(status, StatusCode::NOT_FOUND);
+VssError::NoSuchKeyError(error_response.message)
+},
 ErrorCode::InvalidRequestException => {
+debug_assert_eq!(status, StatusCode::BAD_REQUEST);
 VssError::InvalidRequestError(error_response.message)
 },
-ErrorCode::ConflictException => VssError::ConflictError(error_response.message),
-ErrorCode::AuthException => VssError::AuthError(error_response.message),
+ErrorCode::ConflictException => {
+debug_assert_eq!(status, StatusCode::CONFLICT);
+VssError::ConflictError(error_response.message)
+},
+ErrorCode::AuthException => {
+debug_assert_eq!(status, StatusCode::UNAUTHORIZED);
+VssError::AuthError(error_response.message)
+},
 ErrorCode::InternalServerException => {
+debug_assert_eq!(status, StatusCode::INTERNAL_SERVER_ERROR);
 VssError::InternalServerError(error_response.message)
 },
-_ => VssError::InternalError(format!(
-"VSS responded with an unknown error code: {}, message: {}",
-error_response.error_code, error_response.message
-)),
+ErrorCode::Unknown => {
+VssError::InternalError(format!(
+"VSS responded with an unknown error code: {}, message: {}",
+error_response.error_code, error_response.message
+))
+},
 }
 }
 }

> Thank you, can you help me understand this part?

We use a mock service explicitly to avoid having to run the full service in CI. If we now add consistency checks between the mock service and the actual implementation, we might as well just run the tests against the service. That said, it's maybe not the worst idea to begin with, as IMO the mocking always just has the potential to be wrong (as just proven), and given we're otherwise not super conservative about running stuff in CI, I'm not quite sure what we gain with the mocking approach exactly.

> This is what I had in mind. But again maybe not worth asserting here...

Hmm, well, it's not only not worth asserting, it's also wrong as the bug would be on the service side. So, following Postel's law, we should be 'liberal in what we accept' and any debug_asserts would need to be added service-side to ensure correctness. However, the protocol seems underspecified there, as nowhere it's actually defined that you should also make use of HTTP status codes AFAIU.
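If such a check were to live on the service side, one option is to derive the HTTP status from the error code in a single place so the two cannot drift apart. The sketch below mirrors the status codes used in the diff earlier in this thread (404, 400, 409, 401, 500); the `ErrorCode` enum and both functions are hypothetical illustrations, not vss-server's actual API.

```rust
/// Hypothetical, simplified stand-in for the protocol's error codes.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorCode {
    NoSuchKey,
    InvalidRequest,
    Conflict,
    Auth,
    InternalServer,
}

/// Single source of truth for the HTTP status that accompanies each error code.
fn expected_status(code: ErrorCode) -> u16 {
    match code {
        ErrorCode::NoSuchKey => 404,
        ErrorCode::InvalidRequest => 400,
        ErrorCode::Conflict => 409,
        ErrorCode::Auth => 401,
        ErrorCode::InternalServer => 500,
    }
}

/// Builds a (status, body) pair; the status is always derived from the code,
/// so a mismatching pair cannot be constructed through this helper.
fn error_response(code: ErrorCode, message: &str) -> (u16, String) {
    let status = expected_status(code);
    // Sanity check: error responses must carry a 4xx or 5xx status.
    debug_assert!((400..600).contains(&status));
    (status, format!("{:?}: {}", code, message))
}

fn main() {
    let (status, body) = error_response(ErrorCode::NoSuchKey, "key not found");
    assert_eq!(status, 404);
    assert_eq!(body, "NoSuchKey: key not found");
}
```

A client following Postel's law would still decode the payload and tolerate a mismatching status, as it does today.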

ok I think we can leave as is for now, will continue to consider replacing mock with the actual service now that a project like Fedimint wants to use the full VSS service in their CI

tnull requested review from jkczyz and tankyleo August 19, 2025 09:45

tnull changed the title ~~2025 08 enable client side delays~~ Enable client-side delays Aug 19, 2025

tnull changed the title ~~Enable client-side delays~~ Enable client-side timeouts Aug 19, 2025

tankyleo reviewed Aug 19, 2025

jkczyz reviewed Sep 8, 2025

tnull force-pushed the 2025-08-enable-client-side-delays branch from 5e83c2b to 5670866 Compare November 4, 2025 13:22

tnull changed the title ~~Enable client-side timeouts~~ Enable client-side timeouts and replace retry logic with reqwest's Nov 4, 2025

tnull requested review from jkczyz and tankyleo November 4, 2025 13:23

tnull force-pushed the 2025-08-enable-client-side-delays branch from 5670866 to 4e360cf Compare November 4, 2025 13:24

tnull force-pushed the 2025-08-enable-client-side-delays branch from 9c94320 to 9c3148f Compare November 5, 2025 12:24

tnull mentioned this pull request Nov 5, 2025

Drop custom retry policy #47

Closed

tnull force-pushed the 2025-08-enable-client-side-delays branch 4 times, most recently from f926914 to 8b8910d Compare November 5, 2025 13:16

tnull force-pushed the 2025-08-enable-client-side-delays branch from 6bc7bce to b258cec Compare November 5, 2025 14:04

tnull added 3 commits November 5, 2025 15:22

DRY up Client building

c2c7059

Re-export the reqwest crate

a25ed48

.. as some types are part of our API.

tnull force-pushed the 2025-08-enable-client-side-delays branch from b258cec to e6d8e57 Compare November 5, 2025 14:24

jkczyz reviewed Nov 6, 2025

tnull force-pushed the 2025-08-enable-client-side-delays branch from 7cf661b to 817db29 Compare November 7, 2025 12:53

tnull added 2 commits November 10, 2025 14:39

Expect NOT_FOUND / 404 status code in NoSuchKey error response

6bdf104

Add VssClient::from_client_and_headers constructor

23097bf

Previously, we'd allow to either re-use a `reqwest::Client` or supply a header provider. Here we add a new constructor that allows us to do both at the same time.

tnull force-pushed the 2025-08-enable-client-side-delays branch from 817db29 to 23097bf Compare November 10, 2025 15:17

tnull requested a review from jkczyz November 10, 2025 15:22

jkczyz approved these changes Nov 10, 2025

tnull merged commit 3465c9f into lightningdevkit:main Nov 10, 2025
3 checks passed

tankyleo reviewed Nov 10, 2025
