
Fault injection #187

Open · kirs opened this issue Jun 9, 2018 · 6 comments

@kirs (Contributor) commented Jun 9, 2018

I've been reading about fuse, a mature circuit breaker library for Erlang (a platform known for "resiliency by default").

In the circuit breaker configuration, they have two fuse types (you can think of them as similar to toxics in Toxiproxy):

  • Standard fuses, {standard, MaxR, MaxT}. These are fuses which tolerate MaxR melt attempts in a MaxT window before they break down.
  • Fault injection fuses, {fault_injection, Rate, MaxR, MaxT}. This fuse type sets up a fault injection scheme where the fuse fails at rate Rate, a floating point value between 0.0 and 1.0. If you enter, say, 1 / 500, then roughly every 500th request will see a blown fuse, even if the fuse is okay (see the sketch after this list). This can be used to add noise to the system and verify that calling systems handle the failure modes appropriately. The values MaxR and MaxT work as in a standard fuse.
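The mechanic behind the second type is tiny. A minimal sketch of the decision in Ruby (a translation of the idea, not fuse's actual Erlang code):

```ruby
# Sketch of the fault-injection decision; `rate` is a Float between 0.0 and 1.0.
def ask(rate)
  return :blown if rand < rate # report a blown fuse even though it's healthy
  :ok
end

ask(1.0 / 500) # => :blown for roughly 1 in 500 calls, :ok otherwise
```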

IMO, the idea of injecting faults through a circuit breaker is brilliant. Not every organization has adopted chaos engineering yet, but this could be a first step towards that, at least on the application level.

We should think about adopting this idea in Semian. The biggest concern would probably be development environment vs production: do we inject faults when it's running locally or on CI? If yes, how do we prevent test flakiness? Or should we do this only in production?
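For illustration, here's roughly what I have in mind; everything except `fault_injection_rate` is Semian's existing configuration, and that option is purely hypothetical:

```ruby
require "semian"

# Hypothetical: fault_injection_rate is the made-up option this issue
# proposes; the rest is Semian's existing circuit breaker API.
Semian.register(
  :mysql_primary,
  tickets: 4,            # bulkhead capacity
  success_threshold: 1,  # successes required to close the circuit again
  error_threshold: 3,    # errors tolerated before the circuit opens
  error_timeout: 10,     # seconds the circuit stays open
  fault_injection_rate: 1.0 / 500 # roughly every 500th acquire fails on purpose
)
```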

@kirs (Contributor, Author) commented Jun 11, 2018

@BoGs @sirupsen @jpittis @mac-adam-chaieb thoughts?

@moechaieb commented

Neat idea. It could also be implemented as a Toxiproxy middleware in production, if we want to keep this feature outside of Semian.

From conversations in Slack, I gather that one of the biggest pain points of setting up Semian is coming up with the right configuration, which has to be done by hand. This would introduce more complexity to that. Do you think this would be a concern?

@jacobbednarz commented

I like this, a lot ❤

> IMO, the idea of injecting faults through a circuit breaker is brilliant. Not every organization has adopted chaos engineering yet, but this could be a first step towards that, at least on the application level.

I agree this is a great intermediate step between using Toxiproxy in CI and full-blown chaos engineering in production, which might make the transition to the latter a touch easier.

One thing to note is that in our setup, we instrument Semian quite a bit. How many times the circuit breaker has tripped and for how long are two (big) things that we're interested in watching, as that gives us insight into the underlying health of the systems we rely on and allows us to focus on particular subsystems should impacting trends emerge. If this was rolled in, we'd definitely need a way to flag the "blown fuse" as intentional and differentiate it from the real issues that have triggered the circuit breaker.
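For example (a sketch against Semian's subscribe hook; the `:circuit_open_injected` event and the StatsD metric names are made up and would have to ship with the feature):

```ruby
require "semian"
require "statsd-instrument"

# Sketch: Semian.subscribe is Semian's existing notification hook, but the
# :circuit_open_injected event is hypothetical. Emitting a distinct event
# lets dashboards separate injected trips from real ones.
Semian.subscribe do |event, resource, scope, adapter|
  case event
  when :circuit_open
    StatsD.increment("semian.circuit_open.real")
  when :circuit_open_injected
    StatsD.increment("semian.circuit_open.injected")
  end
end
```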

> The biggest concern would probably be development environment vs production: do we inject faults when it's running locally or on CI? If yes, how do we prevent test flakiness?

Personally, I would advocate restricting this to production, as I think rolling it into the CI pipeline would cause quite a bit of frustration and developer pain. Instead, rely on Toxiproxy to test understood failure points in CI, and then on the blown-fuse Semian functionality in production. You could use the two in conjunction, porting new cases found in production back to Toxiproxy CI tests.
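That restriction could be as simple as gating the rate on the environment (illustrative sketch; both variable names are made up):

```ruby
# Illustrative only: inject faults in production, keep dev and CI deterministic.
fault_injection_rate =
  if ENV["RAILS_ENV"] == "production"
    Float(ENV.fetch("SEMIAN_FAULT_INJECTION_RATE", "0.002")) # 1 / 500
  else
    0.0 # never inject outside production
  end
```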

If having this functionality in CI was a must-have, an alternative would be to break it out into its own pipeline that is not on the developer path to production but still sets off warning lights on unhandled failures. This pipeline could be built to expect failure, and perhaps allow N random blown fuses before disabling the functionality and letting the run complete, aiming for a 100% green test rate at the end.

> From conversations in Slack, I gather that one of the biggest pain points of setting up Semian is coming up with the right configuration, which has to be done by hand.

Very interested to hear if there are (even hacky :P) scripts floating around that might aid in getting people up and running with this configuration, to lower the barrier to entry.

@jpittis (Contributor) commented Jun 12, 2018

I've always thought of fault injection in production as chaos engineering itself, not an "intermediate step".

Regularly injecting failure in production only has value if teams react to these failures by improving the resilience of their jobs / workflows.

I wonder if building a self-serve "shitlist" would let teams first attempt to make their logic resilient to failure and then toggle their logic on the shitlist to turn on fault injection.

Once we had a shitlist, prod-eng could begin to enforce certain fault-injection rates around certain sections of the application.
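Roughly (an illustrative sketch; the resource names and structure are made up):

```ruby
# Illustrative self-serve shitlist: every resource starts at 0.0, and teams
# flip their entry on once their calling code is resilient to failure.
FAULT_INJECTION_SHITLIST = {
  "mysql_primary"  => 1.0 / 500, # opted in by the owning team
  "redis_sessions" => 0.0        # not yet resilient; injection disabled
}.freeze
```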

I'm 100% down to try this out and would be super excited to work on this over hack days.

IMO, tests are not the right place to inject faults. Unless you're doing something like property-based testing, tests are better left deterministic.

@jacobbednarz commented

You're right, @jpittis. I've just done a poor job articulating that my intention with that comment wasn't to suggest it isn't chaos engineering; just that it isn't a solution like Netflix's Chaos Monkey, whereby containers or entire instances are randomly terminated. Having fault injection via a circuit breaker would have less of an impact, since it's already a partial gate to another system.

@sirupsen (Contributor) commented

This is similar to what we did back in 2014 for Resiliency, albeit through mocks, not the application library. However, it assumes that the circuits are perfectly implemented and that the client always does the right thing. I.e., you're testing the application logic, not the client driver, which is very likely to have bugs; e.g. ActiveRecord had several that we found only with Toxiproxy. I've said before that those mocks covered up about as many bugs as they found. This is a bit different because we now have a nicer abstraction level to do it at than a mock, but the foundational problem stands: you're not testing the client, and we can't trust clients.

We found enough bugs at that layer that, while circumventing the client may be more pure, I think Toxiproxy is more pragmatic.
