
retire "services" table #4947

Closed
davepacheco opened this issue Jan 31, 2024 · 4 comments · Fixed by #5287
@davepacheco
Collaborator

The services table stores information about Omicron zones deployed by RSS. I think the intent was for Nexus to know what was supposed to exist so that, over time, it could reconfigure the set of deployed zones. Unfortunately, we discovered this table doesn't have enough information to fully specify these zones, so it cannot be used (by itself) to describe either what zones do exist or what zones should exist. But fortunately, we've already addressed this problem (see the sketch after this list):

  • A fuller specification of each Omicron zone is fetched from each sled agent as part of the inventory process. This allows Nexus to know what zones are actually out there (as of some particular collection).
  • Blueprints include this same representation of Omicron zones. This allows Nexus to know what zones are supposed to be out there.
  • We use this representation in sled agent's PUT /omicron-zones API so we know it's sufficient to fully specify these zones.
  • I think we've built enough of a blueprint planner to have some confidence that the overall representation works to evolve the set of deployed zones.
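For concreteness, here is a rough sketch of the kind of "fuller specification" that inventory and blueprints both carry. Everything below is an illustrative stand-in, not omicron's actual types:

```rust
// Hypothetical sketch of a "full" zone specification, loosely modeled on
// the information that inventory collections and blueprints both carry
// (all names and fields here are illustrative, not omicron's real types).
use std::net::SocketAddrV6;

/// One Omicron zone, specified fully enough to deploy it.
struct ZoneSpec {
    /// Stable identity of the zone (a UUID in practice).
    id: u128,
    /// Which kind of zone this is, with its kind-specific config.
    kind: ZoneKind,
    /// Underlay address the zone's service listens on.
    underlay_address: SocketAddrV6,
}

enum ZoneKind {
    Nexus {
        // external address, TLS certificates, ...
    },
    InternalDns {
        // DNS and HTTP addresses, ...
    },
    CockroachDb {
        // dataset, ...
    },
    // ... other zone kinds elided
}

/// Everything a sled should be running: the shape of data reported via
/// inventory and requested via sled agent's `PUT /omicron-zones`.
struct ZonesConfig {
    /// Generation number so sled agent can reject stale requests.
    generation: u64,
    zones: Vec<ZoneSpec>,
}
```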

Given that, I think we can just drop the services table. As far as I know, the only remaining consumer is DNS propagation, and after #4889, we should be free to drop the table altogether.

CC @smklein does this sound reasonable?

@smklein
Collaborator

smklein commented Jan 31, 2024

Makes sense to me. The combination of inventory ("what exists") and blueprints ("what we want to exist") covers what we need, and the services table is redundant with both of those.

@hawkw hawkw self-assigned this Feb 12, 2024
hawkw added a commit that referenced this issue Feb 12, 2024
Depends on #5033.

As described in #4947, we would like to remove the `services` table, as
its only remaining use is DNS propagation. PR #5033 changes DNS
propagation to no longer use the `services` table. Now that nothing
consumes the table, this commit removes the one method that queries it
(`Datastore::services_list_kind`) along with that method's two
remaining callers, the OMDB commands that display the table's contents.
I've also removed the test for inserting into and querying this table.

Note that I have *not* yet removed the code in RSS that actually
populates this table when the rack is set up. I thought it might be
desirable to still populate this data in case we have to roll back to a
previous version of Nexus that uses the `services` table. If this isn't
something we care about, I can remove that code as well, allowing us to
remove the `Datastore` methods for inserting to `services`.
hawkw added a commit that referenced this issue Feb 12, 2024
Currently, the DNS propagation background task in Nexus uses the
`services` table to enumerate the list of DNS servers that it's
responsible for keeping up to date. However, we'd really like to get rid
of said `services` table (see #4947), and the DNS propagation code is
the only remaining user of the `services` table.

Therefore, this branch changes how DNS servers are discovered for DNS
propagation. Rather than discovering DNS server addresses from the
`services` table, the `DnsWatcher` background task now discovers DNS
servers...using internal DNS. As described in #4889, this may _seem_
like a cyclical dependency, but it isn't: the initial set of internal
DNS servers operate at known addresses -- by design -- so that they can
always be discovered. They also have to be up and discoverable for
Nexus to even come up and find CockroachDB, so internal DNS can safely
be assumed to be up if Nexus has come up at all. With this change, the
`services` table is no longer used by DNS propagation and can be
dropped.
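A minimal sketch of what this discovery-via-internal-DNS might look like, using the hickory-resolver crate. The SRV service name and resolver setup here are assumptions for illustration; omicron's real implementation lives in its `internal-dns` crate:

```rust
// Sketch: discover the full set of internal DNS servers by asking the
// DNS servers at their well-known bootstrap addresses for SRV records.
// The SRV name below is an assumption for illustration.
use std::net::{IpAddr, SocketAddr};

use hickory_resolver::config::{NameServerConfigGroup, ResolverConfig, ResolverOpts};
use hickory_resolver::TokioAsyncResolver;

async fn discover_dns_servers(
    bootstrap_dns: &[IpAddr], // the by-design well-known addresses
) -> Result<Vec<SocketAddr>, Box<dyn std::error::Error>> {
    // Point the resolver at the internal DNS servers themselves.
    let group = NameServerConfigGroup::from_ips_clear(bootstrap_dns, 53, true);
    let config = ResolverConfig::from_parts(None, vec![], group);
    let resolver = TokioAsyncResolver::tokio(config, ResolverOpts::default());

    // Ask internal DNS which DNS servers exist (SRV name assumed).
    let lookup = resolver
        .srv_lookup("_nameservice._tcp.control-plane.oxide.internal")
        .await?;
    let mut servers = Vec::new();
    for srv in lookup.iter() {
        // Resolve each SRV target to its address record(s).
        for ip in resolver.lookup_ip(srv.target().clone()).await? {
            servers.push(SocketAddr::new(ip, srv.port()));
        }
    }
    Ok(servers)
}
```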

This change breaks the existing tests
`nexus::app::background::init::test_dns_propagation_basic` and
`nexus::app::background::dns_servers::test_basic`. I've rewritten the
`test_dns_propagation_basic` test to test the new expected behavior:
- creating a new internal DNS server and adding a DNS record for it to
the database's DNS config table results in that server's DNS records
being propagated to the existing DNS server
- the `DnsWatcher` background task then picks up the DNS records for the
new DNS server by querying the existing known DNS server
- the current DNS config generation is then propagated to the new DNS
server
- a subsequent generation is propagated to both the initial and new DNS
servers

The `dns_servers::test_basic` test tested the discovery of DNS server
addresses from the database. Because these no longer come from the db,
and now come from internal DNS, this test would now end up exercising
most of the functionality tested in `test_dns_propagation_basic`. I
didn't think it was necessary to have two tests for this, so I made the
judgement call to delete `dns_servers::test_basic`. If we decide that a
more isolated test exercising only the DNS watcher task (and not the
DNS propagation task) is worth having, we could put this back and
create records on the DNS server by manually hitting its API with new
configs, but I didn't think this was really worth the effort.

I've also removed the `Datastore::upsert_service` method, which was used
only for test code and is now dead. I considered deleting all code
related to querying the `services` table in this branch as well.
However, I noticed that it's still populated when initializing the rack,
and that `omdb` has commands for querying that table. I wasn't sure if
there were alternative data sources for the `omdb` debugging commands
yet, so I didn't remove them. If the data provided by those commands is
available elsewhere, or if their only real purpose is _just_ to print
the state of this table, I'm happy to delete them in this branch, as
well.

Closes #4889


![image](https://github.com/oxidecomputer/omicron/assets/2796466/c37a0d31-26f7-4a5d-9748-ef7212cde9a9)
@jgallagher
Contributor

Summarizing discussion from yesterday's update sync: there are (at least) two other clients of the service table:

  1. `omdb db network list-eips` uses it to map from an external IP to the kind of service (nexus vs ntp)
  2. `DataStore::vpc_resolve_to_sleds()` uses it to determine which sleds are running services that are using a service VPC (currently only `SERVICES_VPC_ID`, but we could imagine more fine-grained VPCs in the future)

#5202 changes both of those to also check the current target blueprint, but it doesn't remove their use of `service`. To do so, I think we would first need to guarantee that every system has a current target blueprint. This should be straightforward for deployed systems, as we could set it to a disabled blueprint based on the latest collection, but it would require work to ensure the same is true for new systems going through RSS.

Separately, there's a question of whether these clients ought to be looking at the latest inventory instead of (or in addition to?) the current target blueprint. Both inventory and blueprints are potentially misleading, but in different ways (inventory collections are lossy, blueprints are aspirational).
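Roughly, the "check both sources" approach from #5202 might look like the following sketch (all types and helpers here are illustrative stand-ins, not the actual datastore code):

```rust
// Sketch: resolve a service VPC to sleds by consulting both the legacy
// `service` table and the current target blueprint. All types here are
// stand-ins; the real logic is in `DataStore::vpc_resolve_to_sleds()`.
use std::collections::BTreeSet;

type SledId = u128; // stand-in for a sled UUID

struct Blueprint;

impl Blueprint {
    /// Sleds hosting control-plane zones, per this blueprint (placeholder).
    fn sleds_running_control_plane_zones(&self) -> Vec<SledId> {
        Vec::new()
    }
}

fn service_vpc_sleds(
    sleds_from_service_table: Vec<SledId>, // rows written by RSS
    target_blueprint: Option<&Blueprint>,  // may not be set on older systems
) -> BTreeSet<SledId> {
    let mut sleds: BTreeSet<SledId> = sleds_from_service_table.into_iter().collect();
    // Services added by Reconfigurator appear only in the blueprint, so we
    // take the union of both sources rather than trusting either alone.
    if let Some(bp) = target_blueprint {
        sleds.extend(bp.sleds_running_control_plane_zones());
    }
    sleds
}
```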

@davepacheco
Collaborator Author

I think I found another use here:

`pub async fn nexus_external_addresses(`

This is used during Silo creation to find the Nexus external IPs so that we can assemble the right DNS records for the new Silo. After #5068, Reconfigurator will be capable of doing this correctly, but it can't be used here until/unless we fully automate planning+execution.
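For context, a hedged sketch of what that silo-creation path does with these addresses (the names and record shapes here are assumptions for illustration):

```rust
// Sketch: assemble external DNS records for a new silo from the external
// IPs of all Nexus instances. Illustrative only; the real logic sits
// around `nexus_external_addresses()`.
use std::net::IpAddr;

/// Produce (name, address) pairs for the new silo's external DNS name.
fn silo_dns_records(
    silo_dns_name: &str,           // e.g. "<silo>.sys.<rack domain>" (assumed shape)
    nexus_external_ips: &[IpAddr], // today sourced from the services table
) -> Vec<(String, IpAddr)> {
    // Every Nexus instance serves the silo's external API, so the silo's
    // DNS name gets one record per Nexus external IP.
    nexus_external_ips
        .iter()
        .map(|ip| (silo_dns_name.to_string(), *ip))
        .collect()
}
```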

jgallagher added a commit that referenced this issue Mar 8, 2024
…s` about blueprints (#5202)

While working on #4886, I ran into a couple places that expect all
services to have rows in the `service` table, but Reconfigurator does
not add new rows to `service` when it spins up new services (see #4947
for rationale). This PR teaches these two users of `service` to _also_
check the current target blueprint when checking for services.

I also removed `LookupPath::service_id` and `lookup_resource! { ...
Service ... }`; the only user of these was `omdb`, and I think their
existence is misleading in the post-`service`-table world we're moving
toward.
@jgallagher
Contributor

I took a stab at smoking out any other service table consumers (by removing it from schema.rs and following the trail of broken compilation). The only consumers beyond what we've already described in this issue are two omdb commands:

  • `omdb db services list-instances`
  • `omdb db services list-by-sled`

I suspect we can drop both of these, since you can get the same information by printing the current target blueprint? Maybe we want some variant(s) of blueprint printing that shows a subset of the info or organizes it differently, but that is obviously not a blocker.

@jgallagher jgallagher assigned jgallagher and unassigned hawkw Mar 19, 2024
jgallagher added a commit that referenced this issue Apr 16, 2024
I still need to test this on madrid and I do _not_ want to merge it
before we cut the next release, but I believe this is ready for review.

Related changes / fallout also included in this PR:

* `omdb db services ...` subcommands are all gone. I believe this
functionality is covered by `omdb`'s inspection of blueprints instead.
* I removed the `SledResourceKind::{Dataset,Service,Reserved}` variants
that were unused. This isn't required, strictly speaking, but
`SledResourceKind::Service` was intended to reference the `service`
table, so I thought it was clearer to drop these for now (we can add
them back when we get to the point of using them).

There are two major pieces of functionality that now _require_ systems
to have a current target blueprint set:

* `DataStore::nexus_external_addresses()` now takes an explicit
`Blueprint` instead of an `Option<Blueprint>`. Its callers (silo
creation and reconfigurator DNS updates) fail if they cannot read the
current target blueprint.
* `DataStore::vpc_resolve_to_sleds()` now _implicitly_ requires a target
blueprint to be set, if and only if the VPC being queried involves
control plane services. (In practice today, that means the VPC ID is
exactly `SERVICES_VPC_ID`, although in the future we could have multiple
service-related VPCs.) I didn't want to make this method take an
explicit blueprint, because I think its most frequent callers are
specifying instance VPCs, which do not need to access the blueprint.

These two together mean that deploying this change to a system without a
target blueprint will result in (a) the inability to create silos or
update external DNS via reconfigurator and (b) services (including
Nexus) not getting the OPTE firewall rules they need to allow incoming
traffic. All newly-deployed systems have a (disabled) blueprint set as
of #5244, but we'll need to perform the necessary support operation to
bring already-deployed systems in line.
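A hedged sketch of the shape of the first change (signatures and names assumed; the actual method is `DataStore::nexus_external_addresses()`):

```rust
// Sketch: the blueprint is now a required input rather than an
// `Option`, so callers fail loudly when no target blueprint is set.
// All types and names here are illustrative stand-ins.
use std::net::IpAddr;

struct Blueprint;

fn nexus_external_addresses(_blueprint: &Blueprint) -> Vec<IpAddr> {
    // Placeholder: the real code reads the Nexus zones' external IPs
    // out of the blueprint.
    Vec::new()
}

fn silo_creation_dns_ips(
    target_blueprint: Option<&Blueprint>,
) -> Result<Vec<IpAddr>, String> {
    // Previously an `Option<Blueprint>` was tolerated here; now the
    // absence of a target blueprint is a hard error.
    let blueprint = target_blueprint.ok_or_else(|| "no target blueprint set".to_string())?;
    Ok(nexus_external_addresses(blueprint))
}
```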

Closes #4947.