
retire "services" table #4947

Closed
davepacheco opened this issue Jan 31, 2024 · 4 comments · Fixed by #5287
@davepacheco
Collaborator

The services table stores information about Omicron zones deployed by RSS. I think the intent was for Nexus to know what was supposed to exist so that, over time, it could reconfigure the set of deployed zones. Unfortunately, we discovered this table doesn't have enough information to fully specify these zones, so it cannot be used (by itself) to describe either what zones do exist or what zones should exist. But fortunately, we've already addressed this problem (see the sketch after this list):

  • A fuller specification of each Omicron zone is fetched from each sled agent as part of the inventory process. This allows Nexus to know what zones are actually out there (as of some particular collection).
  • Blueprints include this same representation of Omicron zones. This allows Nexus to know what zones are supposed to be out there.
  • We use this representation in sled agent's PUT /omicron-zones API so we know it's sufficient to fully specify these zones.
  • I think we've built enough of a blueprint planner to have some confidence that the overall representation works to evolve the set of deployed zones.
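For concreteness, here is a rough sketch of the kind of "fuller specification" that inventory and blueprints both carry. Everything below is an illustrative stand-in, not omicron's actual types:

```rust
// Hypothetical sketch of a "full" zone specification, loosely modeled on
// the information that inventory collections and blueprints both carry
// (all names and fields here are illustrative, not omicron's real types).
use std::net::SocketAddrV6;

/// One Omicron zone, specified fully enough to deploy it.
struct ZoneSpec {
    /// Stable identity of the zone (a UUID in practice).
    id: u128,
    /// Which kind of zone this is, with its kind-specific config.
    kind: ZoneKind,
    /// Underlay address the zone's service listens on.
    underlay_address: SocketAddrV6,
}

enum ZoneKind {
    Nexus {
        // external address, TLS certificates, ...
    },
    InternalDns {
        // DNS and HTTP addresses, ...
    },
    CockroachDb {
        // dataset, ...
    },
    // ... other zone kinds elided
}

/// Everything a sled should be running: the shape of data reported via
/// inventory and requested via sled agent's `PUT /omicron-zones`.
struct ZonesConfig {
    /// Generation number so sled agent can reject stale requests.
    generation: u64,
    zones: Vec<ZoneSpec>,
}
```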

Given that, I think we can just drop the services table. As far as I know, the only remaining consumer is DNS propagation, and after #4889, we should be free to drop the table altogether.

CC @smklein does this sound reasonable?

@smklein
Collaborator

smklein commented Jan 31, 2024

Makes sense to me. The combination of inventory ("what exists") and blueprints ("what we want to exist") covers what we need, and the services table is redundant with both of those.

@hawkw hawkw self-assigned this Feb 12, 2024
hawkw added a commit that referenced this issue Feb 12, 2024
Depends on #5033.

As described in #4947, we would like to remove the `services` table, as
its only remaining use is DNS propagation. PR #5033 changes DNS
propagation to no longer use the `services` table. Now that nothing
consumes the table, this commit removes the one method that queries it
(`Datastore::services_list_kind`) along with that method's two
remaining callers, the OMDB commands that display the table's contents.
I've also removed the test for inserting into and querying this table.

Note that I have *not* yet removed the code in RSS that actually
populates this table when the rack is set up. I thought it might be
desirable to still populate this data in case we have to roll back to a
previous version of Nexus that uses the `services` table. If this isn't
something we care about, I can remove that code as well, allowing us to
remove the `Datastore` methods for inserting to `services`.
hawkw added a commit that referenced this issue Feb 12, 2024
Currently, the DNS propagation background task in Nexus uses the
`services` table to enumerate the list of DNS servers that it's
responsible for keeping up to date. However, we'd really like to get rid
of said `services` table (see #4947), and the DNS propagation code is
the only remaining user of the `services` table.

Therefore, this branch changes how DNS servers are discovered for DNS
propagation. Rather than discovering DNS server addresses from the
`services` table, the `DnsWatcher` background task now discovers DNS
servers...using internal DNS. As described in #4889, this may _seem_
like a cyclical dependency, but it isn't: the initial set of internal
DNS servers operate at known addresses -- by design -- so that they can
always be discovered. They also have to be up and discoverable for
Nexus to even come up and find CockroachDB, so internal DNS can safely
be assumed to be up if Nexus has come up at all. With this change, the
`services` table is no longer used by DNS propagation and can be
dropped.
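A minimal sketch of what this discovery-via-internal-DNS might look like, using the hickory-resolver crate. The SRV service name and resolver setup here are assumptions for illustration; omicron's real implementation lives in its `internal-dns` crate:

```rust
// Sketch: discover the full set of internal DNS servers by asking the
// DNS servers at their well-known bootstrap addresses for SRV records.
// The SRV name below is an assumption for illustration.
use std::net::{IpAddr, SocketAddr};

use hickory_resolver::config::{NameServerConfigGroup, ResolverConfig, ResolverOpts};
use hickory_resolver::TokioAsyncResolver;

async fn discover_dns_servers(
    bootstrap_dns: &[IpAddr], // the by-design well-known addresses
) -> Result<Vec<SocketAddr>, Box<dyn std::error::Error>> {
    // Point the resolver at the internal DNS servers themselves.
    let group = NameServerConfigGroup::from_ips_clear(bootstrap_dns, 53, true);
    let config = ResolverConfig::from_parts(None, vec![], group);
    let resolver = TokioAsyncResolver::tokio(config, ResolverOpts::default());

    // Ask internal DNS which DNS servers exist (SRV name assumed).
    let lookup = resolver
        .srv_lookup("_nameservice._tcp.control-plane.oxide.internal")
        .await?;
    let mut servers = Vec::new();
    for srv in lookup.iter() {
        // Resolve each SRV target to its address record(s).
        for ip in resolver.lookup_ip(srv.target().clone()).await? {
            servers.push(SocketAddr::new(ip, srv.port()));
        }
    }
    Ok(servers)
}
```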

This change breaks the existing tests
`nexus::app::background::init::test_dns_propagation_basic` and
`nexus::app::background::dns_servers::test_basic`. I've rewritten the
`test_dns_propagation_basic` test to test the new expected behavior:
- creating a new internal DNS server and adding a DNS record for it to
the database's DNS config table results in that server's DNS records
being propagated to the existing DNS server
- the `DnsWatcher` background task then picks up the DNS records for the
new DNS server by querying the existing known DNS server
- the current DNS config generation is then propagated to the new DNS
server
- a subsequent generation is propagated to both the initial and new DNS
servers

The `dns_servers::test_basic` test tested the discovery of DNS server
addresses from the database. Because these no longer come from the db,
and now come from internal DNS, this test would now end up exercising
most of the functionality tested in `test_dns_propagation_basic`. I
didn't think it was necessary to have two tests for this, so I made the
judgement call to delete `dns_servers::test_basic`. If we decide that a
more isolated test exercising only the DNS watcher task (and not the
DNS propagation task) is worth having, we could put this back and
create records on the DNS server by manually hitting its API with new
configs, but I didn't think this was really worth the effort.

I've also removed the `Datastore::upsert_service` method, which was used
only for test code and is now dead. I considered deleting all code
related to querying the `services` table in this branch as well.
However, I noticed that it's still populated when initializing the rack,
and that `omdb` has commands for querying that table. I wasn't sure if
there were alternative data sources for the `omdb` debugging commands
yet, so I didn't remove them. If the data provided by those commands is
available elsewhere, or if their only real purpose is _just_ to print
the state of this table, I'm happy to delete them in this branch, as
well.

Closes #4889


![image](https://github.com/oxidecomputer/omicron/assets/2796466/c37a0d31-26f7-4a5d-9748-ef7212cde9a9)
@jgallagher
Contributor

Summarizing discussion from yesterday's update sync: there are (at least) two other clients of the service table:

  1. `omdb db network list-eips` uses it to map from an external IP to the kind of service (nexus vs ntp)
  2. `DataStore::vpc_resolve_to_sleds()` uses it to determine which sleds are running services that are using a service VPC (currently only `SERVICES_VPC_ID`, but we could imagine more fine-grained VPCs in the future)

#5202 changes both of those to also check the current target blueprint, but it doesn't remove their use of `service`. To do so, I think we would first need to guarantee that every system has a current target blueprint. This should be straightforward for deployed systems, as we could set it to a disabled blueprint based on the latest collection, but it would require work to ensure the same is true for new systems going through RSS.

Separately, there's a question of whether these clients ought to be looking at the latest inventory instead of (or in addition to?) the current target blueprint. Both inventory and blueprints are potentially misleading, but in different ways (inventory collections are lossy, blueprints are aspirational).
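Roughly, the "check both sources" approach from #5202 might look like the following sketch (all types and helpers here are illustrative stand-ins, not the actual datastore code):

```rust
// Sketch: resolve a service VPC to sleds by consulting both the legacy
// `service` table and the current target blueprint. All types here are
// stand-ins; the real logic is in `DataStore::vpc_resolve_to_sleds()`.
use std::collections::BTreeSet;

type SledId = u128; // stand-in for a sled UUID

struct Blueprint;

impl Blueprint {
    /// Sleds hosting control-plane zones, per this blueprint (placeholder).
    fn sleds_running_control_plane_zones(&self) -> Vec<SledId> {
        Vec::new()
    }
}

fn service_vpc_sleds(
    sleds_from_service_table: Vec<SledId>, // rows written by RSS
    target_blueprint: Option<&Blueprint>,  // may not be set on older systems
) -> BTreeSet<SledId> {
    let mut sleds: BTreeSet<SledId> = sleds_from_service_table.into_iter().collect();
    // Services added by Reconfigurator appear only in the blueprint, so we
    // take the union of both sources rather than trusting either alone.
    if let Some(bp) = target_blueprint {
        sleds.extend(bp.sleds_running_control_plane_zones());
    }
    sleds
}
```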

@davepacheco
Collaborator Author

I think I found another use here:

`pub async fn nexus_external_addresses(`

This is used during Silo creation to find the Nexus external IPs so that we can assemble the right DNS records for the new Silo. After #5068, Reconfigurator will be capable of doing this correctly, but it can't be used here until/unless we fully automate planning+execution.
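For context, a hedged sketch of what that silo-creation path does with these addresses (the names and record shapes here are assumptions for illustration):

```rust
// Sketch: assemble external DNS records for a new silo from the external
// IPs of all Nexus instances. Illustrative only; the real logic sits
// around `nexus_external_addresses()`.
use std::net::IpAddr;

/// Produce (name, address) pairs for the new silo's external DNS name.
fn silo_dns_records(
    silo_dns_name: &str,           // e.g. "<silo>.sys.<rack domain>" (assumed shape)
    nexus_external_ips: &[IpAddr], // today sourced from the services table
) -> Vec<(String, IpAddr)> {
    // Every Nexus instance serves the silo's external API, so the silo's
    // DNS name gets one record per Nexus external IP.
    nexus_external_ips
        .iter()
        .map(|ip| (silo_dns_name.to_string(), *ip))
        .collect()
}
```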

jgallagher added a commit that referenced this issue Mar 8, 2024
…s` about blueprints (#5202)

While working on #4886, I ran into a couple places that expect all
services to have rows in the `service` table, but Reconfigurator does
not add new rows to `service` when it spins up new services (see #4947
for rationale). This PR teaches these two users of `service` to _also_
check the current target blueprint when checking for services.

I also removed `LookupPath::service_id` and `lookup_resource! { ...
Service ... }`; the only user of these was `omdb`, and I think their
existence is misleading in the post-`service`-table world we're moving
toward.
@jgallagher
Contributor

I took a stab at smoking out any other service table consumers (by removing it from schema.rs and following the trail of broken compilation). The only consumers beyond what we've already described in this issue are two omdb commands:

  • `omdb db services list-instances`
  • `omdb db services list-by-sled`

I suspect we can drop both of these, since you can get the same information by printing the current target blueprint? Maybe we want some variant(s) of blueprint printing that shows a subset of the info or organizes it differently, but that is obviously not a blocker.

@jgallagher jgallagher assigned jgallagher and unassigned hawkw Mar 19, 2024
jgallagher added a commit that referenced this issue Apr 16, 2024
I still need to test this on madrid and I do _not_ want to merge it
before we cut the next release, but I believe this is ready for review.

Related changes / fallout also included in this PR:

* `omdb db services ...` subcommands are all gone. I believe this
functionality is covered by `omdb`'s inspection of blueprints instead.
* I removed the `SledResourceKind::{Dataset,Service,Reserved}` variants
that were unused. This isn't required, strictly speaking, but
`SledResourceKind::Service` was intended to reference the `service`
table, so I thought it was clearer to drop these for now (we can add
them back when we get to the point of using them).

There are two major pieces of functionality that now _require_ systems
to have a current target blueprint set:

* `DataStore::nexus_external_addresses()` now takes an explicit
`Blueprint` instead of an `Option<Blueprint>`. Its callers (silo
creation and reconfigurator DNS updates) fail if they cannot read the
current target blueprint.
* `DataStore::vpc_resolve_to_sleds()` now _implicitly_ requires a target
blueprint to be set, if and only if the VPC being queried involves
control plane services. (In practice today, that means the VPC ID is
exactly `SERVICES_VPC_ID`, although in the future we could have multiple
service-related VPCs.) I didn't want to make this method take an
explicit blueprint, because I think its most frequent callers are
specifying instance VPCs, which do not need to access the blueprint.

These two together mean that deploying this change to a system without a
target blueprint will result in (a) the inability to create silos or
update external DNS via reconfigurator and (b) services (including
Nexus) not getting the OPTE firewall rules they need to allow incoming
traffic. All newly-deployed systems have a (disabled) blueprint set as
of #5244, but we'll need to perform the necessary support operation to
bring already-deployed systems in line.
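A hedged sketch of the shape of the first change (signatures and names assumed; the actual method is `DataStore::nexus_external_addresses()`):

```rust
// Sketch: the blueprint is now a required input rather than an
// `Option`, so callers fail loudly when no target blueprint is set.
// All types and names here are illustrative stand-ins.
use std::net::IpAddr;

struct Blueprint;

fn nexus_external_addresses(_blueprint: &Blueprint) -> Vec<IpAddr> {
    // Placeholder: the real code reads the Nexus zones' external IPs
    // out of the blueprint.
    Vec::new()
}

fn silo_creation_dns_ips(
    target_blueprint: Option<&Blueprint>,
) -> Result<Vec<IpAddr>, String> {
    // Previously an `Option<Blueprint>` was tolerated here; now the
    // absence of a target blueprint is a hard error.
    let blueprint = target_blueprint.ok_or_else(|| "no target blueprint set".to_string())?;
    Ok(nexus_external_addresses(blueprint))
}
```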

Closes #4947.