vpc_create saga failed after recovery: "subnet" node not idempotent #6069
Comments
I was thinking of |
I'm going to try to knock this out now. I hope we can take the same approach we use to make IP address allocation idempotent: return either a new VPC subnet, or the previously-allocated one by looking it up via ID.
- Fixes #6069
- Add an additional CTE to the existing VPC Subnet insertion query which detects a re-insertion of the same row, and ignores conflicts on that ID. It also verifies that the contents of the IP address blocks are the same, in that case.
- Add regression test and handle new error variant where needed
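The semantics described in that fix can be sketched in memory. This is only an illustration, not the actual CTE or datastore code: `SubnetRow`, `SubnetStore`, `insert_idempotent`, and the `ConflictingReinsert` variant are hypothetical names, and address blocks are simplified to half-open integer ranges.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a VPC subnet row: an ID plus a half-open
/// address range [start, end) in place of real IP blocks.
#[derive(Clone, Debug, PartialEq)]
struct SubnetRow {
    id: u32,
    start: u32,
    end: u32,
}

#[derive(Debug, PartialEq)]
enum SubnetInsertError {
    /// The new block overlaps a *different* subnet.
    Overlap,
    /// The same ID was inserted before, but with different address blocks.
    /// (Hypothetical name for the "new error variant".)
    ConflictingReinsert,
}

/// Toy in-memory table standing in for the database.
#[derive(Default)]
struct SubnetStore {
    rows: HashMap<u32, SubnetRow>,
}

impl SubnetStore {
    /// Idempotent insert: re-inserting the same row returns the existing row
    /// instead of failing the overlap check against itself.
    fn insert_idempotent(&mut self, new: SubnetRow) -> Result<SubnetRow, SubnetInsertError> {
        // Step 1: detect a re-insertion of the same ID.
        if let Some(existing) = self.rows.get(&new.id) {
            return if *existing == new {
                Ok(existing.clone())
            } else {
                Err(SubnetInsertError::ConflictingReinsert)
            };
        }
        // Step 2: otherwise apply the usual overlap check against other rows.
        let overlaps = self
            .rows
            .values()
            .any(|r| new.start < r.end && r.start < new.end);
        if overlaps {
            return Err(SubnetInsertError::Overlap);
        }
        self.rows.insert(new.id, new.clone());
        Ok(new)
    }
}
```

With this shape, replaying the action after a crash returns the row written by the first run, while a genuinely different subnet that overlaps still fails.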
While testing #6063, I kicked off a whole bunch of `project-create` sagas, each of which contains a `vpc-create` subsaga. I restarted Nexus while a bunch of them were running. On startup, Nexus recovered 31 sagas, but 3 of them failed with this error logged:

They ultimately failed like this:
The action is implemented by `svc_create_subnet`, which basically just makes one datastore call to `vpc_create_subnet()`, which calls `vpc_create_subnet_raw()`, which does an `INSERT`:

omicron/nexus/db-queries/src/db/datastore/vpc.rs, lines 844 to 849 in fe60eb9
That INSERT is designed to fail if the subnet overlaps with an existing one.
This is not idempotent: if we successfully insert the subnet and Nexus then crashes, re-running the operation will fail (in the way we saw above, I think), because the new row overlaps with the one inserted by the first run.
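A minimal in-memory sketch of that failure mode, assuming a simplified model where a subnet is an ID plus a half-open integer range. `Subnet`, `SubnetTable`, and `insert_strict` are hypothetical names for illustration, not the datastore's API:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a subnet row.
#[derive(Clone, Debug, PartialEq)]
struct Subnet {
    id: u32,
    start: u32, // first address in the block (inclusive)
    end: u32,   // end of the block (exclusive)
}

#[derive(Debug, PartialEq)]
enum InsertError {
    Overlap,
}

/// Toy in-memory table standing in for the database.
#[derive(Default)]
struct SubnetTable {
    rows: HashMap<u32, Subnet>,
}

impl SubnetTable {
    /// Non-idempotent insert: fails if the new range overlaps ANY existing
    /// row, including a row written by an earlier run of the same action.
    fn insert_strict(&mut self, new: Subnet) -> Result<Subnet, InsertError> {
        let overlaps = self
            .rows
            .values()
            .any(|r| new.start < r.end && r.start < new.end);
        if overlaps {
            return Err(InsertError::Overlap);
        }
        self.rows.insert(new.id, new.clone());
        Ok(new)
    }
}
```

Calling `insert_strict` twice with the same row models "insert, crash, replay": the first call succeeds and the second returns `InsertError::Overlap`, even though nothing conflicting was ever created.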
Another problem reflected in the log is that these three sagas failed again when unwinding, at the `vpc` node, because they tried to delete the VPC but it still had a subnet under it (presumably from the first execution of the "subnet" node). I believe this is not a second bug. The "vpc" action can safely assume that the "subnet" action either never started or else that it ran successfully at least once and had its undo action run successfully at least once. Either way, there'd be no subnet in the database when undoing the "vpc" node.

I don't think Steno did the wrong thing, either. From its local state, it knew the action had started, but wasn't sure if it finished. The correct behavior in that case is to run it again. If that fails, it would undo the stuff before it, just as it would have without a crash. (Steno never runs undo actions for actions that themselves failed.)

There's potentially a second problem here, which is that we have tests for action idempotency and I'd have expected them to catch this. I'm not sure why they didn't.
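The recovery and unwind sequence described here can be modeled with a toy state machine. This is a sketch of the reasoning only, not Steno's API: `Db`, `create_subnet`, `undo_vpc`, and `recover` are hypothetical, and the database is reduced to two flags.

```rust
/// Toy database state: does the VPC exist, and does it have a subnet?
#[derive(Default, Debug)]
struct Db {
    vpc: bool,
    subnet: bool,
}

/// Non-idempotent "subnet" action: fails if a subnet is already present.
fn create_subnet(db: &mut Db) -> Result<(), &'static str> {
    if db.subnet {
        return Err("subnet overlaps an existing subnet");
    }
    db.subnet = true;
    Ok(())
}

/// Undo for the "vpc" node: refuses to delete a VPC that still has a subnet.
fn undo_vpc(db: &mut Db) -> Result<(), &'static str> {
    if db.subnet {
        return Err("VPC still has a subnet");
    }
    db.vpc = false;
    Ok(())
}

/// Models recovery of a saga whose "subnet" node had started but whose
/// completion was never recorded.
fn recover(db: &mut Db) -> Result<(), Vec<&'static str>> {
    let mut errors = Vec::new();
    // The executor cannot tell whether the action finished, so it reruns it.
    if let Err(e) = create_subnet(db) {
        errors.push(e);
        // The failed node's own undo is NOT run; only earlier nodes unwind.
        if let Err(e) = undo_vpc(db) {
            errors.push(e);
        }
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

Starting from `Db { vpc: true, subnet: true }` (the first run inserted the subnet before the crash), `recover` reports both failures: the replayed action fails, and the `vpc` undo then fails because the subnet is still there. Starting from `Db { vpc: true, subnet: false }`, recovery succeeds. Making the action idempotent removes the first failure, and with it the second.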