
Fault injection #187

Open · kirs opened this issue Jun 9, 2018 · 6 comments

@kirs (Contributor) commented Jun 9, 2018

I've been reading about fuse, a mature circuit breaker library for Erlang (a platform known for "resiliency by default").

In the circuit breaker configuration, they have two fuse types (you can think of them as similar to toxics in Toxiproxy):

  • Standard fuses, {standard, MaxR, MaxT}. These are fuses which tolerate MaxR melt attempts in a MaxT window before they break down.
  • Fault injection fuses, {fault_injection, Rate, MaxR, MaxT}. This fuse type sets up a fault injection scheme where the fuse fails at rate Rate, a floating point value between 0.0 and 1.0. If you enter, say, 1 / 500, then roughly every 500th request will see a blown fuse, even if the fuse is okay (see the sketch after this list). This can be used to add noise to the system and verify that calling systems handle the failure modes appropriately. The values MaxR and MaxT work as in a standard fuse.
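The mechanic behind the second type is tiny. A minimal sketch of the decision in Ruby (a translation of the idea, not fuse's actual Erlang code):

```ruby
# Sketch of the fault-injection decision; `rate` is a Float between 0.0 and 1.0.
def ask(rate)
  return :blown if rand < rate # report a blown fuse even though it's healthy
  :ok
end

ask(1.0 / 500) # => :blown for roughly 1 in 500 calls, :ok otherwise
```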

IMO, the idea of injecting faults through a circuit breaker is brilliant. Not every organization has adopted chaos engineering yet, but this could be a first step towards that, at least on the application level.

We should think about adopting this idea in Semian. The biggest concern would probably be development environment vs production: do we inject faults when it's running locally or on CI? If yes, how do we prevent test flakiness? Or should we do this only in production?
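For illustration, here's roughly what I have in mind; everything except `fault_injection_rate` is Semian's existing configuration, and that option is purely hypothetical:

```ruby
require "semian"

# Hypothetical: fault_injection_rate is the made-up option this issue
# proposes; the rest is Semian's existing circuit breaker API.
Semian.register(
  :mysql_primary,
  tickets: 4,            # bulkhead capacity
  success_threshold: 1,  # successes required to close the circuit again
  error_threshold: 3,    # errors tolerated before the circuit opens
  error_timeout: 10,     # seconds the circuit stays open
  fault_injection_rate: 1.0 / 500 # roughly every 500th acquire fails on purpose
)
```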

@kirs (Contributor, Author) commented Jun 11, 2018

@BoGs @sirupsen @jpittis @mac-adam-chaieb thoughts?

@moechaieb commented

Neat idea. It could also be implemented as a Toxiproxy middleware in production, if we want to keep this feature outside of Semian.

From conversations in Slack, I gather that one of the biggest pain points of setting up Semian is coming up with the right configuration, which has to be done by hand. This would introduce more complexity to that. Do you think this would be a concern?

@jacobbednarz commented

I like this, a lot ❤

> IMO, the idea of injecting faults through a circuit breaker is brilliant. Not every organization has adopted chaos engineering yet, but this could be a first step towards that, at least on the application level.

I agree this is a great intermediate step between using Toxiproxy in CI and full-blown chaos engineering in production, which might make the transition to the latter a touch easier.

One thing to note is that in our setup, we instrument Semian quite a bit. How many times the circuit breaker has tripped and for how long are two (big) things that we're interested in watching, as that gives us insight into the underlying health of the systems we rely on and allows us to focus on particular subsystems should impacting trends emerge. If this was rolled in, we'd definitely need a way to flag the "blown fuse" as intentional and differentiate it from the real issues that have triggered the circuit breaker.
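For example (a sketch against Semian's subscribe hook; the `:circuit_open_injected` event and the StatsD metric names are made up and would have to ship with the feature):

```ruby
require "semian"
require "statsd-instrument"

# Sketch: Semian.subscribe is Semian's existing notification hook, but the
# :circuit_open_injected event is hypothetical. Emitting a distinct event
# lets dashboards separate injected trips from real ones.
Semian.subscribe do |event, resource, scope, adapter|
  case event
  when :circuit_open
    StatsD.increment("semian.circuit_open.real")
  when :circuit_open_injected
    StatsD.increment("semian.circuit_open.injected")
  end
end
```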

> The biggest concern would probably be development environment vs production: do we inject faults when it's running locally or on CI? If yes, how do we prevent test flakiness?

Personally, I would advocate restricting this to production, as I think rolling it into the CI pipeline would cause quite a bit of frustration and developer pain. Instead, rely on Toxiproxy to test understood failure points in CI, and then on the blown-fuse Semian functionality in production. You could use the two in conjunction, porting new cases found in production back to Toxiproxy CI tests.
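That restriction could be as simple as gating the rate on the environment (illustrative sketch; both variable names are made up):

```ruby
# Illustrative only: inject faults in production, keep dev and CI deterministic.
fault_injection_rate =
  if ENV["RAILS_ENV"] == "production"
    Float(ENV.fetch("SEMIAN_FAULT_INJECTION_RATE", "0.002")) # 1 / 500
  else
    0.0 # never inject outside production
  end
```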

If having this functionality in CI was a must-have, an alternative would be to break it out into its own pipeline that is not on the developer path to production but still sets off warning lights on unhandled failures. This pipeline could be built to expect failure, and perhaps allow N random blown fuses before disabling the functionality and letting the run complete, aiming for a 100% green test rate at the end.

> From conversations in Slack, I gather that one of the biggest pain points of setting up Semian is coming up with the right configuration, which has to be done by hand.

Very interested to hear if there are (even hacky :P) scripts floating around that might aid in getting people up and running with this configuration, to lower the barrier to entry.

@jpittis (Contributor) commented Jun 12, 2018

I've always thought of fault injection in production as chaos engineering itself, not an "intermediate step".

Regularly injecting failure in production only has value if teams react to these failures by improving the resilience of their jobs / workflows.

I wonder if building a self-serve "shitlist" would let teams first attempt to make their logic resilient to failure and then toggle their logic on the shitlist to turn on fault injection.

Once we had a shitlist, prod-eng could begin to enforce certain fault-injection rates around certain sections of the application.
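Roughly (an illustrative sketch; the resource names and structure are made up):

```ruby
# Illustrative self-serve shitlist: every resource starts at 0.0, and teams
# flip their entry on once their calling code is resilient to failure.
FAULT_INJECTION_SHITLIST = {
  "mysql_primary"  => 1.0 / 500, # opted in by the owning team
  "redis_sessions" => 0.0        # not yet resilient; injection disabled
}.freeze
```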

I'm 100% down to try this out and would be super excited to work on this over hack days.

IMO, tests are not the right place to inject faults. Unless you're doing something like property-based testing, tests are better left deterministic.

@jacobbednarz commented

You're right, @jpittis. I've just done a poor job articulating that my intention with that comment wasn't to suggest it isn't chaos engineering; just that it isn't a solution like Netflix's Chaos Monkey, whereby containers or entire instances are randomly terminated. Having fault injection via a circuit breaker would have less of an impact, since it's already a partial gate to another system.

@sirupsen (Contributor) commented

This is similar to what we did back in 2014 for Resiliency, albeit through mocks, not the application library. However, it assumes that the circuits are perfectly implemented and that the client always does the right thing. I.e., you're testing the application logic, not the client driver, which is very likely to have bugs; e.g. ActiveRecord had several that we found only with Toxiproxy. I've said before that those mocks covered up about as many bugs as they found. This is a bit different because we now have a nicer abstraction level to do it at than a mock, but the foundational problem stands: you're not testing the client, and we can't trust clients.

We found enough bugs at that layer that, while circumventing the client may be more pure, I think Toxiproxy is more pragmatic.
