Two fixes for circuit breaker logic #247
Merged
Add this suggestion to a batch that can be applied as a single commit. This suggestion is invalid because no changes were made to the code. Suggestions cannot be applied while the pull request is closed. Suggestions cannot be applied while viewing a subset of changes. Only one suggestion per line can be applied in a batch. Add this suggestion to a batch that can be applied as a single commit. Applying suggestions on deleted lines is not supported. You must change the existing code in this line in order to create a valid suggestion. Outdated suggestions cannot be applied. This suggestion has been applied or marked resolved. Suggestions cannot be applied from pending reviews. Suggestions cannot be applied on multi-line comments. Suggestions cannot be applied while the pull request is queued to merge. Suggestion cannot be applied right now. Please check back later.
👋 I believe I found two bugs in the circuit breaker logic so I opened a PR to fix them. Please let me know if I missed/misunderstood anything! 🙏 🙇
TLDR:
error_threshold_timeoutwas unused. Instead of checking theerror_threshold_timeoutwhen rejecting expired@errors,error_timeoutwas used.@errorswere not cleared when closing the circuit upon reachingsuccess_thresholdLong version:
While investigating something else, I noticed that
error_threshold_timeoutwas unused. Looking at the code, I believe it should be used when filtering errors inrecord_errorto remove errors that are outside of theerror_threshold_timeoutwindow. Currently,error_timeoutis used instead.So, I added a test in 03d2e35 to demonstrate the problem. The test does the following:
@circuit_breaker.error_threshold - 1errors to get the circuit breaker into the state where it's closed and only 1 error is needed to flip it to open@circuit_breaker.error_threshold_timeout - 0.01in order to let some time pass, while keeping the previously recorded errors within theerror_threshold_timeoutwindow. To demonstrate the bug, it would have been enough to travel for any amount of time greater thanerror_timeout.Running the test without fixes results in an error, as expected:
I then switched the incorrect
expiry = now - @error_timeoutline toexpiry = now - @error_threshold_timeoutand verified that the test was now passing. 🎉However, another tests was failing now, which revealed the second bug:
This test started to fail because the test time travels for
error_timeoutamount of time, which was previously enough to no longer observe errors which caused the circuit breaker to open. In other words,@errorswas previously cleared not because the success_threshold was reached (this would have been expected), but because the errors were incorrectly removed by the time-based filtering.But with the fix above, the code was now correctly observing a larger period of time
error_threshold_timeoutand was not incorrectly clearing@errors. As a result, the circuit breaker was kept open, which fails the test since it expected the circuit breaker to be closed (becausesuccess_thresholdshould have been reached).The fix for this problem is to clear
@errorswhen we close the circuit breaker upon reachingsuccess_threshold. Basically, if when the circuit breaker is closed, we should wipe the@errorsclean since we're starting afresh and don't care about what happened before, i.e. we don't care about old errors which flipped the circuit breaker into the open state.After making this fix, I verified that the test was now passing. 🎉