Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vpc_create saga failed after recovery: "subnet" node not idempotent #6069

Closed
davepacheco opened this issue Jul 12, 2024 · 2 comments · Fixed by #6098
Closed

vpc_create saga failed after recovery: "subnet" node not idempotent #6069

davepacheco opened this issue Jul 12, 2024 · 2 comments · Fixed by #6098
Assignees

Comments

@davepacheco
Copy link
Collaborator

While testing #6063, I kicked off a whole bunch of project-create sagas, each of which contains a vpc-create subsaga. I restarted Nexus while a bunch of them were running. On startup, Nexus recovered 31 sagas, but 3 of them failed with this error logged:

18:32:32.664Z ERRO 333518da-5601-43e8-b9bd-f675571c5797 (ServerContext): failed to create default VPC Subnet, IP address range '172.30.0.0/22' overlaps with existing
    background_task = saga_recovery
    file = nexus/src/app/sagas/vpc_create.rs:374
    ipv4_block = Ipv4Net { addr: 172.30.0.0, width: 22 }
    ipv6_block = Ipv6Net { addr: fd25:a89f:414b::, width: 64 }
    saga_id = 919ea676-a9fb-451d-9775-b3613447d874
    saga_name = project-create
    subnet_id = 5f4c24ed-2995-42b3-a220-27cbf1293d27
    vpc_id = e884754e-223a-4540-a674-9aa44fa2669e

The ultimately failed like this:

18:32:33.259Z ERRO 333518da-5601-43e8-b9bd-f675571c5797 (ServerContext): saga finished
    action_error_node_name = "subnet"
    action_error_source = ActionFailed { source_error: Object {"InternalError": Object {"internal_message": String("Failed to create default VPC Subnet, found overlapping IP address ranges")}} }
    file = /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1039
    result = failure
    saga_id = 919ea676-a9fb-451d-9775-b3613447d874
    saga_name = project-create
    sec_id = 333518da-5601-43e8-b9bd-f675571c5797
    undo_error = PermanentFailure { source_error: Object {"message": String("undo action attempt 1: Invalid Request: VPC cannot be deleted while VPC Subnets exist")} }
    undo_error_node_name = "vpc"

The action is implemented by svc_create_subnet, which basically just makes one datastore call to vpc_create_subnet(), which calls vpc_create_subnet_raw(), which does an [INSERT]:

diesel::insert_into(dsl::vpc_subnet)
.values(values)
.returning(VpcSubnet::as_returning())
.get_result_async(&*conn)
.await
.map_err(|e| SubnetError::from_diesel(e, &subnet))

That INSERT is designed to fail if the subnet overlaps with an existing one.

This is not idempotent because if we successfully insert it, then Nexus crashes, this operation will fail (in the way we saw above, I think).


Another problem reflected in the log is that these three sagas failed again when unwinding, at the vpc node, because they tried to delete the VPC but it still had a subnet under it (presumably from the first execution of the "subnet" node). I believe this is not a second bug. The "vpc" action can safely assume that the "subnet" action either never started or else that it ran successfully at least once and had its undo action run successfully at least once. Either way, there'd be no subnet in the database when undoing the "vpc" node. I don't think Steno did the wrong thing, either. From its local state, it knew the action had started, but wasn't sure if it finished. The correct behavior in that case is to run it again. If that fails, it would undo the stuff before it, just as it would have without a crash. (Steno never runs undo actions for actions that themselves failed.)


There's potentially a second problem here which is that we have tests for action idempotency and I'd have expected them to catch this. I'm not sure why they didn't.

@davepacheco
Copy link
Collaborator Author

There's potentially a second problem here which is that we have tests for action idempotency and I'd have expected them to catch this. I'm not sure why they didn't.

I was thinking of test_helpers::actions_succeed_idempotently(). It looks like we test this for the instance sagas and some of the storage ones but not other storage ones or the VPC ones. Filed #6070 for this.

@bnaecker
Copy link
Collaborator

I'm going to try to knock this out now. I hope we can take the same approach as we do to making the IP address allocation idempotent: return either a new VPC subnet, or the previously-allocated one by looking it up via ID.

@bnaecker bnaecker self-assigned this Jul 15, 2024
bnaecker added a commit that referenced this issue Jul 16, 2024
- Fixes #6069
- Add an additional CTE to the existing VPC Subnet insertion query which
  detects a re-insertion of the same row, and ignores conflicts on that
  ID. It also verifies that the contents of the IP address blocks are
  the same, in that case.
- Add regression test and handle new error variant where needed
bnaecker added a commit that referenced this issue Jul 16, 2024
- Fixes #6069
- Add an additional CTE to the existing VPC Subnet insertion query which
  detects a re-insertion of the same row, and ignores conflicts on that
  ID. It also verifies that the contents of the IP address blocks are
  the same, in that case.
- Add regression test and handle new error variant where needed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants