Fault injection #187
Comments
Neat idea. It could also be implemented as a Toxiproxy middleware in production, if we want to keep this feature outside of Semian. From conversations in Slack, I gather that one of the biggest pain points of setting up Semian is coming up with the right configuration, which has to be done by hand. This adds more complexity to that. Do you think this would be a concern?
I like this, a lot ❤
I agree this is a great intermediate step between using Toxiproxy in CI One thing to note is that in our setup, we instrument semian quite a
Personally, I would advocate to restricting this to production as I If having this functionality in CI was a must have, an alternative would
Very interested to hear if there are (even hacky :P) scripts getting |
I've always thought that fault injection in production is chaos engineering, not "an intermediate step". Regularly injecting failure in production only has value if teams react to these failures by improving the resilience of their jobs and workflows. I wonder if building a self-serve "shitlist" would let teams first attempt to make their logic resilient to failure and then toggle their logic on the shitlist to turn on fault injection. Once we had a shitlist, prod-eng could begin to enforce certain fault-injection rates around certain sections of the application. I'm 100% down to try this out and would be super excited to work on this over hack days. IMO tests are not the right place to inject faults. Unless you're doing something like property testing, tests are better left deterministic.
You're right @jpittis - I've just done a poor job articulating that |
This is similar to what we did back in 2014 for Resiliency, albeit through mocks, not the application library. However, it assumes that the circuits are perfectly implemented and that the client does the right thing always. I.e., you're testing the application logic, not the client driver—which is very likely to have bugs, e.g. ActiveRecord had several we found with only Toxiproxy. I've said before that those mocks covered up about as many bugs as they found. This is a bit different because we now have a nicer abstraction level to do it at than a mock—but the foundational thing stands, since you're not testing the client and we can't trust them. We found enough bugs at that layer that circumventing the client is more pure—but I think Toxiproxy is more pragmatic. |
I've been reading about fuse, a mature circuit breaker library for Erlang (a platform known for "resiliency by default").
In its circuit breaker configuration, fuse has two fuse types (you can think of them as similar to toxics in Toxiproxy):
- `{standard, MaxR, MaxT}`. These are fuses which tolerate `MaxR` melt attempts in a `MaxT` window, before they break down.
- `{fault_injection, Rate, MaxR, MaxT}`. This fuse type sets up a fault injection scheme where the fuse fails at rate `Rate`, a floating point value between `0.0` and `1.0`. If you enter, say, `1 / 500` then roughly every 500th request will see a `blown` fuse, even if the fuse is okay. This can be used to add noise to the system and verify that calling systems support the failure modes appropriately. The values `MaxR` and `MaxT` work as in a standard fuse.

IMO, the idea of injecting faults through a circuit breaker is brilliant. Not every organization has adopted chaos engineering yet, but this could be a first step towards that, at least on the application level.
We should think about adopting this idea in Semian. The biggest concern would probably be development environment vs production: do we inject faults when it's running locally or on CI? If yes, how do we prevent test flakiness? Or should we do this only in production?
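One possible answer to the environment question, sketched below, is to resolve the injection rate per environment so that it is always zero outside an explicit allow-list. The helper name `effective_fault_rate` and the `enabled_in` parameter are hypothetical and only illustrate the idea.

```ruby
# Hypothetical helper; the name and the default allow-list are assumptions.
# Returns the configured rate only in environments that opted in, and 0.0
# everywhere else, so local runs and CI stay deterministic.
def effective_fault_rate(configured_rate, environment:, enabled_in: ["production"])
  return 0.0 unless enabled_in.include?(environment)

  configured_rate
end
```

For example, `effective_fault_rate(1.0 / 500, environment: "test")` yields `0.0`, while the same call with `environment: "production"` yields the configured rate.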