
blueprints: Nexus needs to update external DNS #5068

Closed
davepacheco opened this issue Feb 14, 2024 · 4 comments · Fixed by #5212

@davepacheco
Collaborator

The rack operates external DNS servers containing DNS names for each Silo. These DNS names resolve to the set of Nexus external IPs. Thus, external DNS needs to be updated both when a Silo is created or destroyed and when a Nexus is brought into or out of service.

Reconfigurator probably needs to take responsibility for this. Right now, we add and remove external DNS names when Silos are created or destroyed. I think it'll be unnecessarily complicated to keep doing that while also having Reconfigurator be responsible for adding and removing DNS names when Nexus instances come and go. Simpler would be to do what we're doing with internal DNS in #4989: the blueprint contains enough information that the executor can construct the complete contents of external DNS and then just make that the reality.
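
For concreteness, here's a minimal Rust sketch of that "construct the complete contents" step. The types, function name, and domain suffix are made up for illustration (not Omicron's real ones); the point is just that every Silo's external DNS name maps to the full set of in-service Nexus external IPs taken from the blueprint:

```rust
use std::collections::BTreeMap;
use std::net::IpAddr;

/// Hypothetical stand-in for the external DNS zone contents:
/// DNS name -> address records.
type ExternalDnsContents = BTreeMap<String, Vec<IpAddr>>;

/// Build the complete external DNS contents from the list of Silos
/// (from the database) and the external IPs of in-service Nexus zones
/// (from the blueprint).
fn external_dns_contents(
    silo_names: &[String],
    nexus_external_ips: &[IpAddr],
) -> ExternalDnsContents {
    silo_names
        .iter()
        .map(|silo| {
            // Placeholder domain; each Silo name resolves to all Nexus IPs.
            (format!("{silo}.sys.example.com"), nexus_external_ips.to_vec())
        })
        .collect()
}
```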

We could go further: after doing this, the only thing that would ever change external DNS is the blueprint system, which works by writing the correct records to the database and letting the existing DNS propagation background tasks take care of the rest. We could consider ripping out the DNS database and propagation machinery altogether and instead propagating to the DNS servers directly from blueprint execution. I'm not sure that change is really worth making now.

@davepacheco
Collaborator Author

This is trickier than I thought because we still have not automated Reconfigurator. So if Reconfigurator fully owns writing to the DNS tables, then we'd regress functionality: when you create a Silo, you wouldn't get DNS names for it unless you ran the planner/executor by hand. And I don't think we're quite ready to fully automate Reconfigurator yet.

So, starting from the constraint that the "silo create"/"silo delete" paths stay as-is and keep working, suppose we adopt a scheme similar to what we do for internal DNS:

  • during planning, we fetch the current external DNS version and store it into the blueprint
  • during execution, we:
    • list all Silos in the database
    • from the blueprint, enumerate the external IPs for all in-service Nexus instances
    • construct the new contents of the DNS zone
    • construct a diff against whatever version of the DNS zone was in the blueprint
    • attempt to apply the diff

If this succeeds, then we know there have been no changes to external DNS since the blueprint was planned, so our view of what external DNS should look like is up-to-date, and we have successfully made DNS reflect the current reality. If this fails because external DNS has changed in the meantime, we have to abort and cannot fix this until a new blueprint is planned. (This is the same behavior as for internal DNS.)
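
To illustrate the version check, here's a hedged sketch of that conditional update. The datastore struct and method names are hypothetical (not Omicron's actual datastore API); it only shows the shape of the logic: apply the new contents only if external DNS is still at the version recorded in the blueprint at planning time.

```rust
use std::collections::BTreeMap;
use std::net::IpAddr;

type ExternalDnsContents = BTreeMap<String, Vec<IpAddr>>;

/// Hypothetical, simplified stand-in for the external DNS tables.
struct DnsDatastore {
    version: u64,
    contents: ExternalDnsContents,
}

#[derive(Debug)]
enum DnsUpdateError {
    VersionChanged { expected: u64, found: u64 },
}

impl DnsDatastore {
    /// Apply `new_contents` as the next DNS version, but only if the current
    /// version still matches the one recorded in the blueprint during planning.
    fn conditional_update(
        &mut self,
        blueprint_dns_version: u64,
        new_contents: ExternalDnsContents,
    ) -> Result<(), DnsUpdateError> {
        if self.version != blueprint_dns_version {
            // Someone else (e.g. "silo create") changed external DNS since the
            // blueprint was planned; abort until a new blueprint is planned.
            return Err(DnsUpdateError::VersionChanged {
                expected: blueprint_dns_version,
                found: self.version,
            });
        }
        // In the real system this would be a database transaction applying the
        // diff; here we just install the new contents and bump the version.
        self.contents = new_contents;
        self.version += 1;
        Ok(())
    }
}
```

The key property is the same as for internal DNS: any concurrent writer that bumps the version causes the executor to fail cleanly rather than clobber newer records.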

What if a "silo create" operation happens between planning and execution? This operation commits changes to the Silo table and the DNS tables in one transaction. So one of these will be true:

  1. the "silo create" transaction commits before the blueprint executor commits its external DNS change; in that case, the executor's DNS update will fail because the version number will have changed.
  2. the "silo create" transaction commits after the blueprint executor commits its external DNS change; in that case, it is as though this happened any time after the executor finished. It's no problem. (It might be a problem if the DNS records that "silo create" wrote don't reflect the latest blueprint, but that should be fixed up by another iteration of the planner+executor. This isn't ideal but it's probably fine for now. Eventually we may want to switch up responsibility in the ways mentioned above -- have Reconfigurator own all writes to the DNS tables or even rip out the DNS tables altogether.)

@jgallagher
Contributor

2. the "silo create" transaction commits after the blueprint executor commits its external DNS change; in that case, it is as though this happened any time after the executor finished. It's no problem. (It might be a problem if the DNS records that "silo create" wrote don't reflect the latest blueprint, but that should be fixed up by another iteration of the planner+executor. This isn't ideal but it's probably fine for now. Eventually we may want to switch up responsibility in the ways mentioned above -- have Reconfigurator own all writes to the DNS tables or even rip out the DNS tables altogether.)

Just making sure I follow the bad case here: you're describing this sequence, right?

  1. During planning, we fetch the current external DNS version and add or remove a Nexus.
  2. An operator starts silo creation, which early on fetches the current external addresses for all Nexus instances.
  3. Blueprint execution runs on the plan for 1, and successfully applies the diff that adds or removes the relevant Nexus IP.
  4. The silo creation transaction runs. This updates the external DNS entries for the new silo, but it's incorrect, because it doesn't take into account the Nexus IP that was added or removed in step 3.

At this point the blueprint executor will be stuck constantly failing, right? When it lists all Silos, it will see the new one, see that it needs a DNS diff, attempt to apply it, and fail, because the external DNS version stored in the plan is now stale. Then rerunning the planner to get the latest DNS version will unwedge it and allow it to fix the new silo's incorrect DNS entries.

@davepacheco
Collaborator Author

That's all correct.

@andrewjstone
Contributor

andrewjstone commented Mar 6, 2024

@davepacheco Your new plan seems reasonable to me. It's also very unlikely, in the current state, that an operator would be creating a Silo while we were running Reconfigurator.
