retire "services" table #4947
Comments
Makes sense to me. The combination of inventory ("what exists") and blueprints ("what we want to exist") covers this, and the `services` table is redundant with both of those.
Depends on #5033. As described in #4947, we would like to remove the `services` table, as its only remaining use is DNS propagation. PR #5033 changes DNS propagation to no longer use the `services` table.

Now that we're no longer consuming this table, this commit removes the one method that queries it (`Datastore::services_list_kind`) and its two remaining callers, the OMDB commands that query the table. I've also removed the test for inserting into and querying this table.

Note that I have *not* yet removed the code in RSS that populates this table when the rack is set up. I thought it might be desirable to keep populating this data in case we have to roll back to a previous version of Nexus that uses the `services` table. If that isn't something we care about, I can remove that code as well, allowing us to remove the `Datastore` methods for inserting into `services`.
Currently, the DNS propagation background task in Nexus uses the `services` table to enumerate the list of DNS servers it's responsible for keeping up to date. However, we'd really like to get rid of the `services` table (see #4947), and the DNS propagation code is its only remaining user. Therefore, this branch changes how DNS servers are discovered for DNS propagation: rather than discovering DNS server addresses from the `services` table, the `DnsWatcher` background task now discovers DNS servers...using internal DNS.

As described in #4889, this may _seem_ like a cyclical dependency, but the initial set of internal DNS servers operate at known addresses -- by design -- so that they can always be discovered. They have to be up and discoverable for Nexus to even come up and find CockroachDB, so internal DNS can safely be assumed to be up if Nexus has come up at all. With this change, the `services` table is no longer used by DNS propagation.

This change breaks the existing tests `nexus::app::background::init::test_dns_propagation_basic` and `nexus::app::background::dns_servers::test_basic`. I've rewritten the `test_dns_propagation_basic` test to test the new expected behavior:

- creating a new internal DNS server and adding a DNS record for it to the database's DNS config table results in that server's DNS records being propagated to the existing DNS server
- the `DnsWatcher` background task then picks up the DNS records for the new DNS server by querying the existing known DNS server
- the current DNS config generation is then propagated to the new DNS server
- a subsequent generation is propagated to both the initial and new DNS servers

The `dns_servers::test_basic` test tested the discovery of DNS server addresses from the database. Because these no longer come from the db, and now come from internal DNS, this test would end up exercising most of the functionality tested in `test_dns_propagation_basic`. I didn't think it was necessary to have two tests for this, so I made the judgement call to delete `dns_servers::test_basic`. If we think it's worth having a more isolated test that exercises only the DNS watcher task and not the DNS propagation task, we could put this back and create records on the DNS server by manually hitting its API with new configs, but I didn't think this was really worth the effort.

I've also removed the `Datastore::upsert_service` method, which was used only by test code and is now dead.

I considered deleting all code related to querying the `services` table in this branch as well. However, I noticed that it's still populated when initializing the rack, and that `omdb` has commands for querying that table. I wasn't sure if there were alternative data sources for the `omdb` debugging commands yet, so I didn't remove them. If the data provided by those commands is available elsewhere, or if their only real purpose is _just_ to print the state of this table, I'm happy to delete them in this branch as well.

Closes #4889

![image](https://github.com/oxidecomputer/omicron/assets/2796466/c37a0d31-26f7-4a5d-9748-ef7212cde9a9)
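To make the discover-then-propagate flow concrete, here is a minimal Rust sketch. It is not the real omicron code: `InternalDnsResolver`, `DnsServerClient`, and `propagate_once` are hypothetical names, and the real task is async and uses the actual internal DNS resolver and DNS server config API clients.

```rust
// Minimal sketch, assuming hypothetical `InternalDnsResolver` and
// `DnsServerClient` abstractions (not the real omicron APIs).
use std::collections::BTreeSet;
use std::net::SocketAddr;

/// Hypothetical resolver that looks up DNS server addresses via the
/// well-known internal DNS addresses.
trait InternalDnsResolver {
    fn lookup_dns_servers(&self) -> anyhow::Result<BTreeSet<SocketAddr>>;
}

/// Hypothetical client for one DNS server's config API.
trait DnsServerClient {
    fn current_generation(&self) -> anyhow::Result<u64>;
    fn put_config(&self, generation: u64) -> anyhow::Result<()>;
}

/// One pass of the watcher/propagation loop: discover servers via internal
/// DNS, then push the latest config generation to any server that is behind.
fn propagate_once<R, C>(
    resolver: &R,
    client_for: impl Fn(SocketAddr) -> C,
    latest_generation: u64,
) -> anyhow::Result<()>
where
    R: InternalDnsResolver,
    C: DnsServerClient,
{
    for addr in resolver.lookup_dns_servers()? {
        let client = client_for(addr);
        if client.current_generation()? < latest_generation {
            client.put_config(latest_generation)?;
        }
    }
    Ok(())
}
```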
Summarizing discussion from yesterday's update sync: there are (at least) two other clients of the `services` table. #5202 changes both of those to also check the current target blueprint, but doesn't remove their use of the `services` table.

Separately, there's a question of whether these clients ought to be looking at the latest inventory instead of (or in addition to?) the current target blueprint. Both inventory and blueprints are potentially misleading, but in different ways (inventory collections are lossy, blueprints are aspirational).
I think I found another use here:
This is used during Silo creation to find the Nexus external IPs so that we can assemble the right DNS records for the new Silo. After #5068, Reconfigurator will be capable of doing this correctly, but it can't be used here until/unless we fully automate planning+execution.
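For illustration, here is a hedged sketch of the "assemble DNS records from Nexus external IPs" step. The `DnsRecord` enum and `silo_dns_records` helper are stand-ins for illustration only, not the actual Nexus types.

```rust
// Illustrative sketch: build one address record per Nexus external IP for a
// new silo's external DNS name. Names here are hypothetical.
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

enum DnsRecord {
    A(Ipv4Addr),
    Aaaa(Ipv6Addr),
}

/// Each Nexus external IP becomes an address record for the silo's DNS name.
fn silo_dns_records(nexus_external_ips: &[IpAddr]) -> Vec<DnsRecord> {
    nexus_external_ips
        .iter()
        .map(|ip| match ip {
            IpAddr::V4(v4) => DnsRecord::A(*v4),
            IpAddr::V6(v6) => DnsRecord::Aaaa(*v6),
        })
        .collect()
}
```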
…s` about blueprints (#5202)

While working on #4886, I ran into a couple places that expect all services to have rows in the `service` table, but Reconfigurator does not add new rows to `service` when it spins up new services (see #4947 for rationale). This PR teaches these two users of `service` to _also_ check the current target blueprint when checking for services. I also removed `LookupPath::service_id` and `lookup_resource! { ... Service ... }`; the only user of these was `omdb`, and I think their existence is misleading in the post-`service`-table world we're moving toward.
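The "check both sources" pattern described above might look like the following sketch. This is not the real datastore code: `Blueprint`, `zone_ids`, and `service_exists` are simplified, hypothetical stand-ins.

```rust
// Hypothetical sketch of checking both the legacy `service` table and the
// current target blueprint for a service's existence.
use std::collections::BTreeSet;
use uuid::Uuid;

/// Illustrative stand-in for a target blueprint: just the IDs of the zones
/// (services) it says should exist.
struct Blueprint {
    zone_ids: BTreeSet<Uuid>,
}

/// A service "exists" if either the legacy `service` table has a row for it
/// (RSS-deployed services) or the current target blueprint includes it
/// (Reconfigurator-deployed services).
fn service_exists(
    service_id: Uuid,
    table_ids: &BTreeSet<Uuid>,
    target_blueprint: Option<&Blueprint>,
) -> bool {
    table_ids.contains(&service_id)
        || target_blueprint
            .map(|bp| bp.zone_ids.contains(&service_id))
            .unwrap_or(false)
}
```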
I took a stab at smoking out any other uses of the `services` table:
I suspect we can drop both of these, since you can get the same information by printing the current target blueprint? Maybe we want some variant(s) of blueprint printing that shows a subset of info or a different means of organization, but that is obviously not a blocker.
I still need to test this on madrid and I do _not_ want to merge it before we cut the next release, but I believe this is ready for review.

Related changes / fallout also included in this PR:

* `omdb db services ...` subcommands are all gone. I believe this functionality is covered by `omdb`'s inspection of blueprints instead.
* I removed the `SledResourceKind::{Dataset,Service,Reserved}` variants that were unused. This isn't required, strictly speaking, but `SledResourceKind::Service` was intended to reference the `service` table, so I thought it was clearer to drop these for now (we can add them back when we get to the point of using them).

There are two major pieces of functionality that now _require_ systems to have a current target blueprint set:

* `DataStore::nexus_external_addresses()` now takes an explicit `Blueprint` instead of an `Option<Blueprint>`. Its callers (silo creation and Reconfigurator DNS updates) fail if they cannot read the current target blueprint.
* `DataStore::vpc_resolve_to_sleds()` now _implicitly_ requires a target blueprint to be set, if and only if the VPC being queried involves control plane services. (In practice today, that means the VPC ID is exactly `SERVICES_VPC_ID`, although in the future we could have multiple service-related VPCs.) I didn't want to make this method take an explicit blueprint, because I think its most frequent callers are specifying instance VPCs, which do not need to access the blueprint. (See the sketch after this list.)

These two together mean that deploying this change to a system without a target blueprint will result in (a) the inability to create silos or update external DNS via Reconfigurator and (b) services (including Nexus) not getting the OPTE firewall rules they need to allow incoming traffic. All newly-deployed systems have a (disabled) blueprint set as of #5244, but we'll need to perform the necessary support operation to bring already-deployed systems in line.

Closes #4947.
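Here is a hedged sketch of the conditional blueprint requirement in the VPC resolution path described above. It is illustrative only: the placeholder `SERVICES_VPC_ID` value, `load_target_blueprint`, and `blueprint_for_vpc` are not the real datastore code or constant.

```rust
// Sketch, assuming hypothetical helpers: only the services VPC forces a read
// of the current target blueprint; instance VPCs skip it entirely.
use uuid::Uuid;

// Placeholder value for illustration; the real well-known services VPC ID is
// defined in the nexus codebase.
const SERVICES_VPC_ID: Uuid = Uuid::nil();

struct Blueprint;

enum ResolveError {
    NoTargetBlueprint,
}

fn load_target_blueprint() -> Option<Blueprint> {
    // In the real datastore this would read the current target blueprint
    // from CockroachDB; stubbed out here.
    None
}

/// Only VPCs that involve control plane services need the blueprint.
fn blueprint_for_vpc(vpc_id: Uuid) -> Result<Option<Blueprint>, ResolveError> {
    if vpc_id == SERVICES_VPC_ID {
        // Services VPC: a target blueprint is required.
        load_target_blueprint()
            .map(Some)
            .ok_or(ResolveError::NoTargetBlueprint)
    } else {
        // Instance VPC: no blueprint needed.
        Ok(None)
    }
}
```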
The `services` table stores information about Omicron zones deployed by RSS. I think the intent was so that Nexus would know what was supposed to exist, so that over time it could reconfigure the set of zones deployed. Unfortunately, we discovered this table doesn't have enough information to fully specify these zones, so it cannot be used (by itself) to describe either what zones do exist or what zones should exist. But fortunately we've already addressed this problem with the `PUT /omicron-zones` API, so we know it's sufficient to fully specify these zones.

Given that, I think we can just drop the `services` table altogether. There is only one thing that uses it as far as I know, and that's DNS propagation. After #4889, we should be free to drop this table altogether.

CC @smklein does this sound reasonable?
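For a sense of what "fully specify these zones" means, here is a simplified, illustrative sketch. The field names and variants are made up for illustration and do not match the real sled-agent API types; the point is that a zone description must carry the zone's identity, its underlay address, and its type-specific configuration.

```rust
// Illustrative only: a simplified notion of a fully-specified zone.
use std::net::{Ipv6Addr, SocketAddrV6};
use uuid::Uuid;

enum ZoneKind {
    InternalDns { http_address: SocketAddrV6, dns_address: SocketAddrV6 },
    Nexus { internal_address: SocketAddrV6 },
    CockroachDb { address: SocketAddrV6 },
}

struct ZoneSpec {
    id: Uuid,
    underlay_address: Ipv6Addr,
    kind: ZoneKind,
}
```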