move saga recovery to a background task #6063
Conversation
Feel free to ignore this comment. I just wanted to durably record these notes somewhere in case we have to revisit this.

Design notes

After #5962 it's pretty straightforward to move the saga recovery code into a background task. The complexity here is really dealing with the fact that it might run more than once and that it might find sagas that this Nexus process is already running. I wrestled with a bunch of approaches to this problem. Here's the approach I landed on:
This is trickier than it sounds because of the race between checking the database and learning about sagas created by Nexus. But with some care, I believe this is correct and well-tested and it involves no changes to the database.

Alternative: ignore the problem (let Steno deal with it)

We could just completely ignore this problem and let Steno bomb out if we try to recover a saga with the same id as one that's already running. Right now, I think that would work. But it has a few problems:
We could improve the first problem by implementing purging in Steno and making the saga recovery task responsible for it. That is: once the saga recovery task confirms that a saga is done, it purges it from Steno. Then maybe it's safe to rely on Steno bombing out if we give it a duplicate id. We still have the other two problems, and the second of those means that even in the best case, this solution isn't great.

Alternative: store SEC generation in the database

I went down this path:
The big "pro" here is that Nexus only ever lists the sagas it needs to recover, instead of listing all in-progress sagas. This would be a big benefit if we got to the point that the number of in-progress sagas was really large. It's also pretty clean -- it's hard to see this approach accidentally recovering a saga twice. But it's kind of tricky to test. This has a few cons: storing stuff in the database means we have a schema change, we're storing more data, and the information is replicated 5x. The volume of data is small enough that it probably doesn't matter (but that also undermines the upside). It bothered me about this approach that the problem feels totally internal to Nexus -- the database shouldn't need to know about this. Also, the fact that we're not listing all in-progress sagas means that if we make a mistake somehow (but still update the SEC generation), we'll never figure that out. Looking back, it feels like kind of a wash. Alternative: Steno owns more of the recovery processThis was kind of brought up last year under oxidecomputer/steno#156. You could imagine:
There are more details here, like where does the

Theoretically, this could be cleaner because some of the complexity in this PR comes from the fact that we're dispatching and recovering sagas in separate contexts and have to synchronize, whereas in principle the Steno SEC is already a single context and the source of truth for what's running. But it may not be easier in practice because the SEC is still concurrent: it can be in the middle of storing the record for a new saga at the same time it's driving a recovery pass -- might it still have the same possibility for ambiguity that we have in Nexus? There's also more overhead and boilerplate in changing the Steno SEC API. I think a few of these others are workable and have some advantages. But all of them are a lot of work, including the one I implemented, so my strong inclination is to go with what's in this PR unless there's some major problem I've missed. |
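For concreteness, here's a minimal sketch of the per-saga decision described above -- not the actual code in this PR, and the names are made up. A recovery pass reads the in-progress sagas from the database and skips any that this Nexus process is already running, which is only safe if the "already running" set was brought up to date before the database listing began.

use std::collections::BTreeSet;
use uuid::Uuid;

/// What a single recovery pass decides to do with one in-progress saga row
/// read from the database. (Sketch only; names are hypothetical.)
#[derive(Debug, PartialEq)]
enum SagaDisposition {
    /// Not running in this process: hand it to Steno for recovery.
    Recover,
    /// Created by (or previously recovered into) this Nexus: skip it.
    IgnoredAlreadyRunning,
}

fn classify(saga_id: Uuid, running_here: &BTreeSet<Uuid>) -> SagaDisposition {
    // Correctness requires that `running_here` already includes every saga
    // this process created up to the start of the database query; otherwise
    // a saga could race its way into being recovered while it's running.
    if running_here.contains(&saga_id) {
        SagaDisposition::IgnoredAlreadyRunning
    } else {
        SagaDisposition::Recover
    }
}

fn main() {
    let running: BTreeSet<Uuid> = [Uuid::new_v4()].into_iter().collect();
    let new_saga = Uuid::new_v4();
    assert_eq!(classify(new_saga, &running), SagaDisposition::Recover);
}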
Here's some example omdb output for the background task. Right after Nexus starts up, having recovered one saga:
I activated the task manually with
Here's after seeing one saga start and another finish:
(the "bad state in SEC" error is a bug in the counter that I have since fixed) |
Manual testing

I wanted to try some manual stress testing that seemed hard to do in the existing test suite. For both of these I modified omdb to add a
(note: project-blast-delete does not work because it does not delete the VPCs in the newly-created projects)

Stress test: furious recovery while launching sagas

Procedure:
This constantly activates the background task while constantly creating new sagas. I wanted to stress-test the path that keeps track of in-progress sagas. I let this run for a while. Here's how I had to tweak the Nexus config:
(Some of these seem like bugs. I'll file some more issues.)
(I elided a bunch of dots there. Each dot is a project created. There'd be an 'x' if one failed.) I also ran
Here's what's in the database:
This is all consistent:
Let's look at the Nexus log:
That's all expected. It's nice to see that 636 times we ignored a saga because it was running -- that means the task caught a bunch of sagas in the middle of things. There are no warning-or-worse log messages from this task:
Other tasks are spewing lots of warnings (will file some bugs). But if we ignore those, the only warnings we have are for having cancelled requests:
Kill Nexus with lots of sagas ongoing

Procedure:
Here's what omdb shows after starting Nexus again:
I activated the task again and checked the status and saw:
Once more:
We'd have expected anywhere from 0 - 64 sagas in-progress when it crashed. It appears we had 31. The tool spit out:
From this we can infer that at least 407 sagas were created and no more than 567. We don't know from the tool output how many were in progress when it crashed, but it shouldn't have been more than 64 (the concurrency limit). I'm not sure how much else we can infer from this. In particular, of those 132 failures, some may have created sagas and some may not have. This is complicated further because in principle all sagas that completed should have succeeded, but they didn't, due to some bugs in the sagas (for which I will file separate issues). Anyway, there are in fact 438 projects from our tool and 438 project-create sagas:
That matches the 407 known to have completed + 31 recovered. That implies that some of the ones that project-blast reported having failed did complete, and that makes sense. The project-create HTTP requests are synchronous, so if Nexus crashes during the request, the request will fail, but the saga may have been created and then recovered and then completed successfully. Let's look at what happened with the sagas:
We had four action failures and then three undo action failures! I dug into these to be sure this wasn't somehow the fault of saga recovery (e.g., replaying an action or something? but that's always supposed to be legal anyway...). If we ignore the messages from background tasks as before, we see some of these errors:
I thought that wasn't a problem, though now I'm not sure why not. I'll come back to this. Anyway, ignoring that one:
So we have three failures from node "subnet" that look the same. Looking at the implementation of svc_create_subnet() (the action I suspect is failing), I think it's not idempotent. It attempts to insert a record and then fails with a conflict if it can't do so. I'll file a bug for this. In all of these, the undo action also fails trying to delete the VPC because it already has a subnet. I'll write this up in a new ticket but I think these are basically one bug, which is that it's not atomic. We also have a similar non-idempotent-looking failure from node "route4". There are also these log messages from these sagas' execution:
I don't see anything here that makes me worried that saga recovery induced any of these failures. As far as I can tell, it recovered an expected number of sagas, exactly once each, and every saga has run to completion one way or the other. I'd welcome other things to look for! |
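As an aside on the idempotency issue above, the general fix is sketched below -- with a plain in-memory map standing in for the database, since this is not omicron's actual datastore API. The "insert" action treats finding its own previously-inserted record as success, so replaying the action after recovery doesn't turn into a conflict error.

use std::collections::BTreeMap;
use uuid::Uuid;

/// Idempotent insert: succeeds on the first attempt and on a replay of the
/// same record, and only errors on a genuine conflict (same key, different
/// contents).
fn insert_subnet_idempotent(
    table: &mut BTreeMap<Uuid, String>,
    id: Uuid,
    name: &str,
) -> Result<(), String> {
    match table.get(&id) {
        None => {
            table.insert(id, name.to_string());
            Ok(())
        }
        // Replay after recovery: the record we meant to create is already
        // there, so report success rather than a conflict.
        Some(existing) if existing == name => Ok(()),
        Some(_) => Err(format!("conflict inserting subnet {id}")),
    }
}

fn main() {
    let mut table = BTreeMap::new();
    let id = Uuid::new_v4();
    assert!(insert_subnet_idempotent(&mut table, id, "default").is_ok());
    // Running the same action again (as saga recovery may do) still succeeds.
    assert!(insert_subnet_idempotent(&mut table, id, "default").is_ok());
}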
Regarding those "unexpectedly failed to send response to SEC client" messages, I found that they're generally preceded by messages like these:
Looking at the Steno code: it looks like Steno is trying to notify the consumer that the saga has finished. The consumer is allowed to ignore this (dropping their side of the channel) -- and indeed the saga recovery process does that. So I think these just correspond to sagas that completed after having been recovered. There are 28 of them:
The last question might be: why not 31? |
The three sagas that were recovered but didn't produce this message were: 1b4580da-7bb8-4e6a-b5aa-cbb7d4e91b55

These completed normally so I'm not sure what's special about them. I dug into 84cf464b-99e3-4343-8c81-c26b7e11e643 and found that the saga finished before the recovery pass was over, so the completion future would not have been dropped yet, which probably explains why we didn't get that error. |
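To make the mechanism concrete, this is just the usual oneshot-channel pattern, sketched below in isolation (not Steno's actual code). If the consumer has dropped its receiver -- which is what the recovery path does -- the executor's completion notification fails with an error like the one quoted, and nothing else is affected.

use tokio::sync::oneshot;

fn main() {
    let (completion_tx, completion_rx) = oneshot::channel::<&str>();

    // The consumer (here, standing in for the saga recovery path) decides it
    // doesn't need the completion notification and drops its end.
    drop(completion_rx);

    // Later, the executor tries to report that the saga finished. The send
    // fails because the receiver is gone; all the executor can do is log it.
    if completion_tx.send("saga finished").is_err() {
        eprintln!("unexpectedly failed to send response to SEC client");
    }
}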
@davepacheco regarding your notes in #6063 (comment), under "Alternative: store SEC generation in the database", you wrote that:
Can you elaborate a bit on why there's a 5x replication of data here? That wasn't obvious to me from the description you've written --- I assume this is because of some detail of how Steno works internally that I'm not familiar with? Or am I just overlooking something from your comment? Thanks! And, by the way, thanks so much for the detailed design discussion here, it's very helpful! |
Thus far, I've only reviewed the nexus-saga-recovery crate, and not the background task. The implementation looks remarkably straightforward despite the complex behavior, and I appreciate the documentation! Overall, really nice work --- I'll try to get to the rest of the background task soon.
// Send some numbers and make sure we get them back.
let expected_numbers = vec![1, 7, 0, 1];
for n in &expected_numbers {
    tx.send(*n).unwrap();
}
let (numbers, disconnected) = read_all_from_channel(&mut rx);
assert_eq!(expected_numbers, numbers);
assert!(!disconnected);

// Send some more numbers and make sure we get them back.
let expected_numbers = vec![9, 7, 2, 0, 0, 6];
this kinda feels like it's within sigma of a property test...not that the behavior is so complex or input-dependent that it needs to be, though...
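For what it's worth, a property-test version of this check might look roughly like the following. This is a standalone illustration using std::sync::mpsc and a local read_all helper rather than the real read_all_from_channel, so the channel type and helper are assumptions, not the code under review.

use proptest::prelude::*;
use std::sync::mpsc;

// Drain everything currently in the channel, reporting whether the sender
// side has been dropped.
fn read_all<T>(rx: &mpsc::Receiver<T>) -> (Vec<T>, bool) {
    let mut items = Vec::new();
    loop {
        match rx.try_recv() {
            Ok(item) => items.push(item),
            Err(mpsc::TryRecvError::Empty) => return (items, false),
            Err(mpsc::TryRecvError::Disconnected) => return (items, true),
        }
    }
}

proptest! {
    #[test]
    fn sent_numbers_come_back_in_order(
        numbers in prop::collection::vec(any::<u32>(), 0..64)
    ) {
        let (tx, rx) = mpsc::channel();
        for n in &numbers {
            tx.send(*n).unwrap();
        }
        let (read, disconnected) = read_all(&rx);
        prop_assert_eq!(numbers, read);
        prop_assert!(!disconnected);
    }
}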
/// load state from the database)
Failed { message: String },
/// This pass was at least partially successful
Success(LastPassSuccess),
Should we really be calling this a "success" if it's really only sort of a success? "Planned" might be more descriptive?
I see what you're saying, though "Planned" to me suggests "not yet executed". "Completed"?
🤷♀️ Success is fine with me, honestly, but Completed is also good. Up to you!
It's just that if we store it in memory as we do right now, then we have one copy of the information and it's in-memory in Nexus. If we instead put an SEC generation into each row of the saga table, that information gets stored (and replicated) in the database along with everything else. It's really not that big a deal because it's so little information to begin with. But when I was evaluating the two options, it seemed much heavier weight. Like I said, looking back, it seems like kind of a wash. |
Oh, I see, I wasn't sure whether you were just referring to the CRDB replication factor or something else. That makes sense, thanks! |
And, for the record, I think the approach you've gone with makes sense — I like that state which is only relevant to the current Unix process lifetime of a Nexus instance just lives in that process. |
Okay, I've finished my review of the Nexus parts of the implementation. Overall, this looks great, and I have no major concerns about the overall design --- very nice work! I left some style feedback, but no hard blockers.
/// Creates a new `OpContext` just like the given one, but with a different
/// identity.
///
/// This is only intended for tests.
i wonder if this ought to be hidden behind a test_utils feature or something that the integration-test crate can enable? not a big deal though.
Fair question. I don't think we do this a lot and I'm not sure it's worth the extra compilation I'd expect it to trigger. I'm not opposed but I think it makes sense to decide this more holistically so I'm going to defer for now.
That's reasonable. We can think harder about it later.
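For reference, the gating under discussion would look roughly like this; the struct, field, and feature name here are all made up for illustration, not the real OpContext.

/// Stand-in for the real type; everything here is illustrative.
pub struct OpContext {
    identity: String,
}

impl OpContext {
    /// Only compiled for this crate's own tests, or for downstream crates
    /// (like an integration-test crate) that opt into a hypothetical
    /// `testing` cargo feature.
    #[cfg(any(test, feature = "testing"))]
    pub fn child_with_identity(&self, identity: String) -> OpContext {
        OpContext { identity }
    }
}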
let nexus_saga_recovery::LastPass::Success(last_pass_success) =
    status.last_pass
else {
    panic!("wrong last pass variant");
teeny nitpick: should we format what we actually did get in the panic message, here?
Is there an easy way to do that with let ... else?
no, i think you would need to use a match instead. it's not that important...
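For reference, the match form being suggested would look roughly like this, mirroring the test snippet quoted above and assuming LastPass derives Debug:

let last_pass_success = match status.last_pass {
    nexus_saga_recovery::LastPass::Success(success) => success,
    // Unlike `let ... else`, a match arm can bind the non-matching value
    // and include it in the panic message.
    other => panic!("wrong last pass variant: {:?}", other),
};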
@davepacheco I really appreciate the comments you put in the code and PR to describe the mechanism of saga recovery and the complexity of allowing it to occur at arbitrary times. I think all your logic is correct, although I haven't reviewed most of the test code and probably didn't look as thoroughly as @hawkw.
I think given the level of testing you have done, and Eliza's review, this is probably good to go! My only real qualm about this PR is whether it's possible to do more to guarantee the ordering checks around "ignored" sagas. I'm not sure if explicitly looking for finished sagas in the DB would help in terms of correctness, but it may add a second level of asserts around "we are no longer running these sagas" -> "they show up as finished in the db". Again, this is likely overkill, and up to you to ignore the comment :)
@@ -246,22 +252,20 @@ impl BackgroundTasksInitializer {
///
/// This function will wire up the `Activator`s in `background_tasks` to the
/// corresponding tasks once they've been started.
#[allow(clippy::too_many_arguments)] |
Nice!
// Correctness here requires that the caller has already updated the set
// of sagas that we're ignoring to include any that may have been
// created up to the beginning of the database query. (They do that by
You can mostly ignore this comment. I'm just writing it down for lack of a better place and to see if maybe it will pique your interest / spark an idea.
I've read your comments in saga background task and most of that logic, but I'm still a bit concerned that correctness in this data structure requires such specific ordering in the background saga. However, it all appears correct and the excellent testing shows so. I'm wondering if there's a better way to enforce this ordering requirement in code so that it doesn't rely on comments to remain correct. I think one way to do that would be to use some sort of typestate process. But given that if the db query fails, we proceed anyway in the background task, this may add too much complexity in terms of states to be worth it.
@andrewjstone and I discussed a bit offline. In retrospect, it may well be worth having the background task do an extra pass listing sagas from the database to resolve the ambiguity about sagas that weren't in the first listing (did they finish or did they just start?). Since this is not an issue of correctness, I'm prioritizing urgency on this for now, but I'm happy to revisit it. |
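To make the typestate idea above a bit more concrete, one shape it could take is sketched below. The names are hypothetical, and it only partially encodes the requirement, since nothing stops the token from being minted after the listing has already started.

use std::collections::BTreeSet;
use uuid::Uuid;

struct RecoveryState {
    /// Sagas this Nexus process is already running.
    sagas_started: BTreeSet<Uuid>,
}

/// Token witnessing that newly-started sagas were folded in; the planning
/// step below can't be called without one.
struct StartedSagasNoted(());

impl RecoveryState {
    /// Must run before the database listing begins.
    fn note_sagas_started(
        &mut self,
        newly_started: impl Iterator<Item = Uuid>,
    ) -> StartedSagasNoted {
        self.sagas_started.extend(newly_started);
        StartedSagasNoted(())
    }

    /// Consumes the token, so the compiler insists the step above happened.
    fn sagas_to_recover(
        &self,
        _proof: StartedSagasNoted,
        listed_in_db: BTreeSet<Uuid>,
    ) -> BTreeSet<Uuid> {
        listed_in_db
            .difference(&self.sagas_started)
            .copied()
            .collect()
    }
}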
tl;dr: This PR moves the execution of saga recovery into a background task and also fixes the implementation so that it can be safely re-run at any time.
#5136 requires that sagas assigned to an expunged Nexus be re-assigned to another Nexus. In principle, this is easy: since we know the first Nexus is dead, we should be able to just update the database for each in-progress saga assigned to that Nexus to say that it's being run by some other Nexus. The regular saga recovery path should take care of the rest.
There are a few problems here:
This turned out to be pretty much a rewrite of the existing saga recovery code. (But keep in mind that Steno handles all the details of recovery -- all this stuff was doing was loading state from the database and handing it to Steno.) There's now a bunch more code, but that's because it's factored into smaller pieces that are much easier to observe and test (and there are a lot of tests). With the improved observability, the background task exposes detailed status about sagas recovered, sagas it failed to recover, and other internal state. I went into this with the mindset of: assume this winds up broken in production -- what information will I wish I had to debug it?
The bulk of the changes are in:

- nexus/saga-recovery is a new Nexus-level crate containing all the bookkeeping, plus lots of tests. This doesn't do anything with the database. It just keeps track of which sagas we've seen, which need to be recovered, etc. and reports on what happened in each saga recovery pass and what's happened since startup.
- nexus/src/app/background/tasks/saga_recovery.rs is the new background task. It leans on nexus-saga-recovery for the bookkeeping and just adds the bits that read from the database and shove the saga into Steno.

There's a smattering of other changes needed for any new background task: tiny additions to config files, omdb support, omdb output, etc.

Fixes #5947.