What
Currently, when the error_timeout expires, the next acquisition request for a circuit will cause a transition from open to half_open. In this state, workers will attempt to access the resource with a modified timeout of half_open_resource_timeout. The motivation here is that the modified timeout is much lower than the client timeout, so if the resource is still unhealthy, it will fail faster.
In the current implementation, every available worker (subject to the bulkhead configuration) will attempt the half_open -> closed transition. This means that if the resource is still unhealthy, all the workers could potentially block for half_open_resource_timeout seconds, reducing overall node capacity.
Mathematically, this means that a fraction t[half-open] / (t[half-open] + t[error_timeout]) of worker time will be spent attempting to re-close the circuit, where t[half-open] is half_open_resource_timeout. If t[half-open] is 1.0s and t[error_timeout] is 5.0s (our MySQL defaults), then 1.0 / (1.0 + 5.0) = 16.7% of our capacity will go toward re-closing the circuit. If bulkheads are in place with a quota of 0.5, that number will be 8.3%.
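The arithmetic above can be checked with a short sketch. The function name and its parameters are hypothetical, chosen for illustration only; they are not part of any library API. It assumes the worst case described above, where every worker allowed through the bulkhead blocks for the full half-open timeout on each cycle.

```python
def half_open_capacity_fraction(half_open_timeout, error_timeout, quota=1.0):
    """Fraction of node capacity spent probing a still-unhealthy resource.

    Hypothetical helper: assumes each cycle is error_timeout seconds of
    fast failures followed by half_open_timeout seconds of blocked probes,
    and that only a `quota` share of workers can hold a ticket.
    """
    probe_share = half_open_timeout / (half_open_timeout + error_timeout)
    return quota * probe_share


# Values quoted in the text: 1.0s half-open timeout, 5.0s error timeout.
print(f"{half_open_capacity_fraction(1.0, 5.0):.1%}")             # 16.7%
print(f"{half_open_capacity_fraction(1.0, 5.0, quota=0.5):.1%}")  # 8.3%
```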
How
When a circuit opens, the number of available tickets should immediately drop to 1. This shields the rest of the workers from this unhealthy resource. Rejecting a worker at the bulkhead is marginally faster than raising the open circuit error, since bulkhead acquisition is attempted before circuit-breaker acquisition, but the difference is likely not a big deal.
When the transition happens from open to half_open, we can raise the number of available tickets to success_threshold, to allow parallel re-closing of the circuit. Once the circuit is finally re-closed, we can raise the number of available tickets back to the original tickets/quota value.
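The proposed policy amounts to a mapping from circuit state to ticket count. A minimal sketch follows; the `State` enum and `available_tickets` function are hypothetical names, not the existing implementation. Capping the half-open count at the configured ticket count is an added assumption, so that a large success_threshold never widens the bulkhead beyond its normal size.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


def available_tickets(state, configured_tickets, success_threshold):
    """Ticket count the bulkhead would expose for a given circuit state."""
    if state is State.OPEN:
        # Only one worker may reach the resource while the circuit is open.
        return 1
    if state is State.HALF_OPEN:
        # Allow parallel probes, up to success_threshold.
        # Assumption: never exceed the configured bulkhead size.
        return min(success_threshold, configured_tickets)
    # Closed: restore the original tickets/quota value.
    return configured_tickets


# Example with 8 configured tickets and success_threshold = 2.
for state in (State.CLOSED, State.OPEN, State.HALF_OPEN):
    print(state.value, available_tickets(state, 8, 2))
```

With these example values the bulkhead would expose 8 tickets while closed, 1 while open, and 2 while half-open.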