Skip to content

Conversation

@izuzak
Copy link
Contributor

@izuzak izuzak commented Aug 9, 2025

👋 I believe I found two bugs in the circuit breaker logic so I opened a PR to fix them. Please let me know if I missed/misunderstood anything! 🙏 🙇

TLDR:

  1. error_threshold_timeout was unused. Instead of checking the error_threshold_timeout when rejecting expired @errors, error_timeout was used.
  2. @errors were not cleared when closing the circuit upon reaching success_threshold

Long version:

While investigating something else, I noticed that error_threshold_timeout was unused. Looking at the code, I believe it should be used when filtering errors in record_error to remove errors that are outside of the error_threshold_timeout window. Currently, error_timeout is used instead.

So, I added a test in 03d2e35 to demonstrate the problem. The test does the following:

  1. records @circuit_breaker.error_threshold - 1 errors to get the circuit breaker into the state where it's closed and only 1 error is needed to flip it to open
  2. time travels for @circuit_breaker.error_threshold_timeout - 0.01 in order to let some time pass, while keeping the previously recorded errors within the error_threshold_timeout window. To demonstrate the bug, it would have been enough to travel for any amount of time greater than error_timeout.
  3. records another error to flip the circuit breaker into the open state, and verifies that the circuit breaker is open.

Running the test without fixes results in an error, as expected:

> megatest test/redis_client/circuit_breaker_test.rb:24 Running test suite with driver: RedisClient::RubyConnection started toxiproxy-0 with pid=97126 started redis-0 with pid=97127 toxiproxy-0 ready. redis-0 ready. Running 1 test cases with --seed 62419 F 1) Failure: RedisClient::CircuitBreakerTest#test_track_errors_during_error_threshold_window Expected RedisClient::CircuitBreaker::OpenCircuitError but nothing was raised. test/redis_client/circuit_breaker_test.rb:72:in 'RedisClient::CircuitBreakerTest#assert_open' test/redis_client/circuit_breaker_test.rb:32:in 'block in RedisClient::CircuitBreakerTest#test_track_errors_during_error_threshold_window' test/support/client_test_helper.rb:79:in 'ClientTestHelper#travel' test/redis_client/circuit_breaker_test.rb:30:in 'RedisClient::CircuitBreakerTest#test_track_errors_during_error_threshold_window' megatest test/redis_client/circuit_breaker_test.rb:24 Ran 1 cases, 5 assertions, 1 failures, 0 errors, 0 retries, 0 skips 

I then switched the incorrect expiry = now - @error_timeout line to expiry = now - @error_threshold_timeout and verified that the test was now passing. 🎉

However, another tests was failing now, which revealed the second bug:

> megatest test/redis_client/circuit_breaker_test.rb:55 Running test suite with driver: RedisClient::RubyConnection started toxiproxy-0 with pid=98452 started redis-0 with pid=98453 toxiproxy-0 ready. redis-0 ready. Running 1 test cases with --seed 47002 E 1) Error: RedisClient::CircuitBreakerTest#test_close_fully_after_success_threshold_is_reached RedisClient::CircuitBreaker::OpenCircuitError: Too many connection errors happened recently lib/redis_client/circuit_breaker.rb:41:in 'RedisClient::CircuitBreaker#protect' test/redis_client/circuit_breaker_test.rb:82:in 'RedisClient::CircuitBreakerTest#assert_closed' test/redis_client/circuit_breaker_test.rb:67:in 'block in RedisClient::CircuitBreakerTest#test_close_fully_after_success_threshold_is_reached' test/support/client_test_helper.rb:79:in 'ClientTestHelper#travel' test/redis_client/circuit_breaker_test.rb:60:in 'RedisClient::CircuitBreakerTest#test_close_fully_after_success_threshold_is_reached' megatest test/redis_client/circuit_breaker_test.rb:55 Ran 1 cases, 7 assertions, 0 failures, 1 errors, 0 retries, 0 skips 

This test started to fail because the test time travels for error_timeout amount of time, which was previously enough to no longer observe errors which caused the circuit breaker to open. In other words, @errors was previously cleared not because the success_threshold was reached (this would have been expected), but because the errors were incorrectly removed by the time-based filtering.

But with the fix above, the code was now correctly observing a larger period of time error_threshold_timeout and was not incorrectly clearing @errors. As a result, the circuit breaker was kept open, which fails the test since it expected the circuit breaker to be closed (because success_threshold should have been reached).

The fix for this problem is to clear @errors when we close the circuit breaker upon reaching success_threshold. Basically, if when the circuit breaker is closed, we should wipe the @errors clean since we're starting afresh and don't care about what happened before, i.e. we don't care about old errors which flipped the circuit breaker into the open state.

After making this fix, I verified that the test was now passing. 🎉

- Use error_threshold_timeout when filtering observed errors - Clear errors when closing circuit upon reaching success_threshold
@byroot byroot force-pushed the izuzak/circuit-breaker-fixes branch from a601fd2 to 8161a38 Compare August 10, 2025 08:11
@byroot byroot merged commit 4d38b9c into redis-rb:master Aug 10, 2025
@byroot
Copy link
Member

byroot commented Aug 10, 2025

Urk. Thank you and sorry for that silly mistake 😞

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants