
Conversation

@YuriyHolinko
Contributor

@YuriyHolinko commented Mar 17, 2025

Adjusted the jitter calculation to improve the randomness and distribution of delays in the exponential backoff logic.

The current implementation is not really exponential backoff: it uses randomly generated delays in the range (0, upperBound), and only the upperBound grows exponentially, so in some cases we generate relatively low delays even for later retries.
Also, the code contained an existing link to the exponential backoff implementation described in https://github.com/grpc/proposal/blob/master/A6-client-retries.md#exponential-backoff, but the actual source code was different.
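For illustration only, here is a rough sketch of the difference described above; the class, the Duration constants, and the [0.8, 1.2) jitter range are assumptions for this example, not the exact code in this PR:

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

class BackoffSketch {
  static final Duration INITIAL_BACKOFF = Duration.ofSeconds(1);
  static final Duration MAX_BACKOFF = Duration.ofSeconds(5);
  static final double MULTIPLIER = 1.5;

  // Old behavior (roughly): the delay is drawn uniformly from (0, upperBound),
  // and only the upper bound grows exponentially, so even a late retry can
  // sleep for a near-zero amount of time.
  static long oldDelayNanos(int attempt) {
    double upperBound =
        Math.min(
            INITIAL_BACKOFF.toNanos() * Math.pow(MULTIPLIER, attempt - 1),
            MAX_BACKOFF.toNanos());
    return (long) (ThreadLocalRandom.current().nextDouble() * upperBound);
  }

  // New behavior (roughly): the exponentially growing backoff itself is the
  // delay, with a jitter factor applied on top, so later attempts are
  // reliably longer.
  static long newDelayNanos(int attempt) {
    double backoff =
        Math.min(
            INITIAL_BACKOFF.toNanos() * Math.pow(MULTIPLIER, attempt - 1),
            MAX_BACKOFF.toNanos());
    double jitter = ThreadLocalRandom.current().nextDouble(0.8, 1.2); // assumed jitter range
    return (long) (jitter * backoff);
  }
}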

@YuriyHolinko requested a review from a team as a code owner March 17, 2025 15:11
@codecov

codecov bot commented Mar 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.93%. Comparing base (490173b) to head (2fe22d8).
Report is 10 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##               main    #7206     +/-   ##
=========================================
  Coverage     89.93%   89.93%
- Complexity     6676     6678      +2
=========================================
  Files           750      750
  Lines         20168    20176      +8
  Branches       1978     1978
=========================================
+ Hits          18138    18146      +8
  Misses         1435     1435
  Partials        595      595

☔ View full report in Codecov by Sentry.
@jack-berg
Member

Is this fixing the problem discussed in #7004?

@YuriyHolinko
Contributor Author

YuriyHolinko commented Mar 17, 2025

Is this fixing the problem discussed in #7004?

Yeah, looks like it's the same issue.
I didn't think anyone else suffered from it, but since a few users do, this will be a very useful feature/bug fix.

Member

@jack-berg left a comment


The new logic for computing backoff looks good. Just wondering why touch the other logic, which needs to be pretty carefully considered.

long currentBackoffNanos =
    Math.min(nextBackoffNanos, retryPolicy.getMaxBackoff().toNanos());
long backoffNanos = (long) (randomJitter.get() * currentBackoffNanos);
nextBackoffNanos = (long) (currentBackoffNanos * retryPolicy.getBackoffMultiplier());
Member


Should also update the implementation in JdkHttpSender.

I know it's not ideal that there are two implementations... Maybe worth adding a utility function to RetryPolicy that computes the backoff for a given attempt N. The signature might look like:

public long computeBackoffNanosForAttempt(int attempt, Random randomSource) {...} 

It wouldn't be as efficient as the current implementation, but...

  • it's such a tiny amount of compute that who cares
  • the compute is trivial compared to the overall cost of preparing and executing an HTTP request
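
A minimal sketch of what such a utility could look like as an instance method on RetryPolicy, assuming its existing getInitialBackoff(), getMaxBackoff(), and getBackoffMultiplier() accessors and an illustrative jitter range of [0.8, 1.2) (not necessarily what this PR ends up using):

// Hedged sketch, not the PR's final code: compute the delay for a 1-based
// attempt number directly from the policy, so the OkHttp and JdkHttpSender
// retry loops could share it. Random here is java.util.Random.
public long computeBackoffNanosForAttempt(int attempt, Random randomSource) {
  double exponentialBackoffNanos =
      getInitialBackoff().toNanos() * Math.pow(getBackoffMultiplier(), attempt - 1);
  long currentBackoffNanos =
      (long) Math.min(exponentialBackoffNanos, getMaxBackoff().toNanos());
  double jitter = 0.8 + randomSource.nextDouble() * 0.4; // assumed jitter in [0.8, 1.2)
  return (long) (jitter * currentBackoffNanos);
}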
Contributor Author

@YuriyHolinko Mar 20, 2025


Thanks, I added the code for JdkHttpSender.

I did not consider adding that method to calculate the backoff delay time.
Looking at the code, I would say we could build more abstractions for sending requests and checking responses and exceptions, but I'm not sure that's really helpful, so let's probably move on with the duplicated approach we had before.

attempt++;
try {
  response = chain.proceed(chain.request());
  if (response != null) {
Member


What's the motivation for changing this part of the logic?

Contributor Author

@YuriyHolinko Mar 20, 2025


If the response is null and no exception happened, the code fails at the throw exception; line because exception is null. Also, when the response is null it's not something transient.

Member


I suppose this is possible, but I haven't seen a null response in practice. If it does occur, we can simply add a null check immediately after response = chain.proceed(chain.request()).
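
For reference, one possible shape of that null check in an OkHttp interceptor (a hedged sketch, not the actual RetryInterceptor code; the fail-fast behavior and error message are assumptions):

import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.Response;

class NullResponseCheckSketch implements Interceptor {
  @Override
  public Response intercept(Chain chain) throws IOException {
    Response response = chain.proceed(chain.request());
    // Fail fast on a null response instead of treating it as a transient,
    // retryable condition.
    if (response == null) {
      throw new IOException("Unexpected null response from chain.proceed()");
    }
    return response;
  }
}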

Contributor Author

@YuriyHolinko Mar 20, 2025


That's exactly what I did in this PR 😃
Also, the null check was there before this change, so I intentionally kept it (but I suspect I can just drop it).

Also, the previous code had an issue: if the previous (before-last) attempt returned a retryable response but the last attempt got a retryable exception, the method still returned the previous response, which is not good, since the state of the last attempt (the exception) should be returned.
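
A hedged sketch of the ordering point made above (the loop structure, isRetryable helpers, and backoff handling are simplified placeholders, not the real RetryInterceptor): clearing the previous attempt's outcome at the start of each iteration ensures the caller sees only what the final attempt produced.

import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.Response;

class LastAttemptOutcomeSketch implements Interceptor {
  private final int maxAttempts = 5; // placeholder; would come from RetryPolicy

  @Override
  public Response intercept(Chain chain) throws IOException {
    Response response = null;
    IOException exception = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (response != null) {
        response.close(); // release the previous retryable response before retrying
      }
      // Reset the previous attempt's outcome so that, after the loop, only the
      // final attempt's response or exception is reported.
      response = null;
      exception = null;
      try {
        response = chain.proceed(chain.request());
        if (!isRetryable(response)) {
          return response;
        }
      } catch (IOException e) {
        if (!isRetryable(e)) {
          throw e;
        }
        exception = e;
      }
      // ...sleep for the computed backoff before the next attempt...
    }
    if (response != null) {
      return response; // final attempt produced a (still retryable) response
    }
    throw exception; // final attempt produced a (retryable) exception
  }

  private boolean isRetryable(Response response) {
    return response.code() == 429 || response.code() >= 502; // placeholder predicate
  }

  private boolean isRetryable(IOException e) {
    return true; // placeholder predicate
  }
}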

@YuriyHolinko
Contributor Author

YuriyHolinko commented Mar 20, 2025

The new logic for computing backoff looks good. Just wondering why touch the other logic, which needs to be pretty carefully considered.

I replied in the threads; please tell me if there are still any unanswered questions.

@jack-berg
Member

Thanks!

@jack-berg merged commit 3c12e3a into open-telemetry:main Mar 25, 2025
28 checks passed

Labels

None yet

2 participants