
Conversation

@YuriyHolinko
Contributor

@YuriyHolinko commented Mar 17, 2025

Adjusted the jitter calculation to improve the randomness and distribution of delays in the exponential backoff logic.

The current implementation is not really exponential backoff: it uses randomly generated delays in the range (0, upperBound), and only the upperBound grows exponentially, so in some cases we generate relatively low delays even for later retries.
Also, the code contained an existing link to the exponential backoff implementation described in https://github.com/grpc/proposal/blob/master/A6-client-retries.md#exponential-backoff, but the actual source code was different.
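For illustration only, here is a rough sketch of the difference described above; the class, the Duration constants, and the [0.8, 1.2) jitter range are assumptions for this example, not the exact code in this PR:

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

class BackoffSketch {
  static final Duration INITIAL_BACKOFF = Duration.ofSeconds(1);
  static final Duration MAX_BACKOFF = Duration.ofSeconds(5);
  static final double MULTIPLIER = 1.5;

  // Old behavior (roughly): the delay is drawn uniformly from (0, upperBound),
  // and only the upper bound grows exponentially, so even a late retry can
  // sleep for a near-zero amount of time.
  static long oldDelayNanos(int attempt) {
    double upperBound =
        Math.min(
            INITIAL_BACKOFF.toNanos() * Math.pow(MULTIPLIER, attempt - 1),
            MAX_BACKOFF.toNanos());
    return (long) (ThreadLocalRandom.current().nextDouble() * upperBound);
  }

  // New behavior (roughly): the exponentially growing backoff itself is the
  // delay, with a jitter factor applied on top, so later attempts are
  // reliably longer.
  static long newDelayNanos(int attempt) {
    double backoff =
        Math.min(
            INITIAL_BACKOFF.toNanos() * Math.pow(MULTIPLIER, attempt - 1),
            MAX_BACKOFF.toNanos());
    double jitter = ThreadLocalRandom.current().nextDouble(0.8, 1.2); // assumed jitter range
    return (long) (jitter * backoff);
  }
}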

@YuriyHolinko requested a review from a team as a code owner March 17, 2025 15:11
@codecov

codecov bot commented Mar 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.93%. Comparing base (490173b) to head (2fe22d8).
Report is 10 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##               main    #7206     +/-   ##
=========================================
  Coverage     89.93%   89.93%
- Complexity     6676     6678      +2
=========================================
  Files           750      750
  Lines         20168    20176      +8
  Branches       1978     1978
=========================================
+ Hits          18138    18146      +8
  Misses         1435     1435
  Partials        595      595

☔ View full report in Codecov by Sentry.
@jack-berg
Member

Is this fixing the problem discussed in #7004?

@YuriyHolinko
Contributor Author

YuriyHolinko commented Mar 17, 2025

Is this fixing the problem discussed in #7004?

Yeah, looks like it's the same issue.
I didn't think anyone else suffered from it, but since a few users do, this will be a very useful feature/bug fix.

Member

@jack-berg left a comment


The new logic for computing backoff looks good. Just wondering why touch the other logic, which needs to be pretty carefully considered.

long currentBackoffNanos =
    Math.min(nextBackoffNanos, retryPolicy.getMaxBackoff().toNanos());
long backoffNanos = (long) (randomJitter.get() * currentBackoffNanos);
nextBackoffNanos = (long) (currentBackoffNanos * retryPolicy.getBackoffMultiplier());
Member


Should also update the implementation in JdkHttpSender.

I know it's not ideal that there are two implementations... Maybe worth adding a utility function to RetryPolicy that computes the backoff for a given attempt N. The signature might look like:

public long computeBackoffNanosForAttempt(int attempt, Random randomSource) {...} 

It wouldn't be as efficient as the current implementation, but...

  • it's such a tiny amount of compute that who cares
  • the compute is trivial compared to the overall cost of preparing and executing an HTTP request
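
A minimal sketch of what such a utility could look like as an instance method on RetryPolicy, assuming its existing getInitialBackoff(), getMaxBackoff(), and getBackoffMultiplier() accessors and an illustrative jitter range of [0.8, 1.2) (not necessarily what this PR ends up using):

// Hedged sketch, not the PR's final code: compute the delay for a 1-based
// attempt number directly from the policy, so the OkHttp and JdkHttpSender
// retry loops could share it. Random here is java.util.Random.
public long computeBackoffNanosForAttempt(int attempt, Random randomSource) {
  double exponentialBackoffNanos =
      getInitialBackoff().toNanos() * Math.pow(getBackoffMultiplier(), attempt - 1);
  long currentBackoffNanos =
      (long) Math.min(exponentialBackoffNanos, getMaxBackoff().toNanos());
  double jitter = 0.8 + randomSource.nextDouble() * 0.4; // assumed jitter in [0.8, 1.2)
  return (long) (jitter * currentBackoffNanos);
}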
Contributor Author

@YuriyHolinko Mar 20, 2025


Thanks, I added the code for JdkHttpSender.

I did not consider adding that method to calculate the backoff delay time.
Looking at the code, I would say we could build more abstractions for sending requests and checking responses and exceptions, but I'm not sure that's really helpful, so let's probably move on with the duplicated approach we had before.

attempt++;
try {
  response = chain.proceed(chain.request());
  if (response != null) {
Member


What's the motivation for changing this part of the logic?

Contributor Author

@YuriyHolinko Mar 20, 2025


If the response is null and no exception happened, the code fails at the throw exception; line because exception is null. Also, when the response is null it's not something transient.

Member


I suppose this is possible, but I haven't seen a null response in practice. If it does occur, we can simply add a null check immediately after response = chain.proceed(chain.request()).
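
For reference, one possible shape of that null check in an OkHttp interceptor (a hedged sketch, not the actual RetryInterceptor code; the fail-fast behavior and error message are assumptions):

import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.Response;

class NullResponseCheckSketch implements Interceptor {
  @Override
  public Response intercept(Chain chain) throws IOException {
    Response response = chain.proceed(chain.request());
    // Fail fast on a null response instead of treating it as a transient,
    // retryable condition.
    if (response == null) {
      throw new IOException("Unexpected null response from chain.proceed()");
    }
    return response;
  }
}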

Contributor Author

@YuriyHolinko Mar 20, 2025


That's exactly what I did in this PR 😃
Also, the null check was there before this change, so I intentionally kept it (but I suspect I can just drop it).

Also, the previous code had an issue: if the previous (before-last) attempt returned a retryable response but the last attempt got a retryable exception, the method still returned the previous response, which is not good, since the state of the last attempt (the exception) should be returned.
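
A hedged sketch of the ordering point made above (the loop structure, isRetryable helpers, and backoff handling are simplified placeholders, not the real RetryInterceptor): clearing the previous attempt's outcome at the start of each iteration ensures the caller sees only what the final attempt produced.

import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.Response;

class LastAttemptOutcomeSketch implements Interceptor {
  private final int maxAttempts = 5; // placeholder; would come from RetryPolicy

  @Override
  public Response intercept(Chain chain) throws IOException {
    Response response = null;
    IOException exception = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (response != null) {
        response.close(); // release the previous retryable response before retrying
      }
      // Reset the previous attempt's outcome so that, after the loop, only the
      // final attempt's response or exception is reported.
      response = null;
      exception = null;
      try {
        response = chain.proceed(chain.request());
        if (!isRetryable(response)) {
          return response;
        }
      } catch (IOException e) {
        if (!isRetryable(e)) {
          throw e;
        }
        exception = e;
      }
      // ...sleep for the computed backoff before the next attempt...
    }
    if (response != null) {
      return response; // final attempt produced a (still retryable) response
    }
    throw exception; // final attempt produced a (retryable) exception
  }

  private boolean isRetryable(Response response) {
    return response.code() == 429 || response.code() >= 502; // placeholder predicate
  }

  private boolean isRetryable(IOException e) {
    return true; // placeholder predicate
  }
}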

@YuriyHolinko
Contributor Author

YuriyHolinko commented Mar 20, 2025

The new logic for computing backoff looks good. Just wondering why touch the other logic, which needs to be pretty carefully considered.

I replied in the threads; please tell me if there are still any unanswered questions.

@jack-berg
Member

Thanks!

@jack-berg merged commit 3c12e3a into open-telemetry:main Mar 25, 2025
28 checks passed

Labels

None yet

2 participants