Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconfigurator: Move external IP allocation/deallocation into a transaction conditional on the current blueprint #6240

Merged
merged 7 commits into from
Aug 8, 2024

Conversation

jgallagher
Copy link
Contributor

See #6239 and #6207 (comment) for additional context, but briefly summarizing:

  • Prior to this PR, blueprint execution would deallocate and allocate external IP and NIC records independently
  • If two Nexus instances are trying to realize two different blueprints, this could result in them stomping on each other

Andrew suggested that we put the external IP resource set under a generation number, similar to how we mark generations for zones and DNS config. However, in both of those cases, there's an external entity (sled-agent or the DNS zone, respectively) that's ultimately responsible for rejecting updates with an old generation number. In this external networking case, there is no other entity - other parts of Nexus consume this information directly from CRDB. In chat, we discussed that we think it's equivalent to putting the set of allocation/deallocation operations inside a transaction that's conditional on the blueprint being executing still being the current target.

Most of this PR is moving the external_networking module from nexus-reconfigurator-execution into nexus-db-queries where it can be executed inside the new blueprint_ensure_external_networking_resources() method. This method opens a transaction, performs a SELECT ... FOR UPDATE to confirm that the blueprint we're ensuring is the current target, and then does the entire set of deallocation / allocation operations in that transaction. I added a test with some kinda gnarly #[cfg(test)] bits to control execution flow to show how the SELECT ... FOR UPDATE works: if the blueprint isn't the target, we immediately fail; if we are the target, any attempt to change the target will be queued by CRDB until we've finished our transaction.

Fixes #6239

Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff @jgallagher

nexus/db-queries/src/db/datastore/deployment.rs Outdated Show resolved Hide resolved
@@ -961,13 +1103,23 @@ impl DataStore {
async fn blueprint_current_target_only(
&self,
conn: &async_bb8_diesel::Connection<DbConnection>,
select_for_update: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The booleans at the call sites are somewhat confusing. I think it's probably overkill to make this two separate functions, even though that's an option. How about doing what we do elsewhere, and instead of passing true or false directly, create a local variable named select_for_update and pass that in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was already borderline on the call site booleans; thanks for the nudge. I don't love the local variable thing, so I replaced this with a two variant enum in 989c5db - how does that read?

@@ -2061,6 +2219,172 @@ mod tests {
logctx.cleanup_successful();
}

#[tokio::test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great test!

I usually do not like to add test only code to production methods. However, in this case, I think ensuring that CRDB transaction serialization works as we expect warrants it. This also serves as a useful regression in case that behavior changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if select_for_update, which should take a write lock on the target table, or at least target row, also blocks reading from that table. I'm not sure if that would be a problem, and don't think it warrants a test. I'm just kinda curious at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also curious, and it was easy enough to expand this test to see! It does indeed block reads: 7fee264

@jgallagher jgallagher merged commit e4067e1 into main Aug 8, 2024
23 checks passed
@jgallagher jgallagher deleted the john/reconfigurator-test-ip-alloc-dealloc branch August 8, 2024 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants