Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus] put DNS servers in DNS, so you can DNS while you DNS #5033

Merged
merged 18 commits into from
Feb 12, 2024

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Feb 8, 2024

Currently, the DNS propagation background task in Nexus uses the services table to enumerate the list of DNS servers that it's responsible for keeping up to date. However, we'd really like to get rid of said services table (see #4947), and the DNS propagation code is the only remaining user of the services table.

Therefore, this branch changes how DNS servers are discovered for DNS propagation. Rather than discovering DNS server addresses from the services table, the DnsWatcher background task now discovers DNS servers...using internal DNS. As described in #4889, this may seem like a cyclical dependency, but, because the initial set of internal DNS servers operate at known addresses -- by design -- so that they can always be discovered. And they have to be up and discoverable for Nexus to even come up and find CockroachDB. So, internal DNS can safely be assumed to be up if Nexus has come up at all. Now, the services table is no longer used, and

This change breaks the existing tests nexus::app::background::init::test_dns_propagation_basic and nexus::app::background::dns_servers::test_basic. I've rewritten the test_dns_propagation_basic test to test the new expected behavior:

  • creating a new internal DNS server and adding a DNS record for it to the database's DNS config table results in that server's DNS records being propagated to the existing DNS serve
  • the DnsWatcher background task then picks up the DNS records for the new DNS server by querying the existing known DNS server
  • the current DNS config generation is then propagated to the new DNS server
  • a subsequent generation is propagated to both the initial and new DNS servers

The dns_servers::test_basic test tested the discovery of DNS server addresses from the database. Because these no longer come from the db, and now come from internal DNS, this test would now end up exercising most of the functionality tested in test_dns_propagation_basic. I didn't think it was necessary to have two tests for this, so I made the judgement call to delete dns_servers::test_basic. If we think having a more isolated test that exercises only the DNS watcher task and not the DNS propagation task, we could put this back and create records on the DNS server by manually hitting its API with new configs, but I didn't think this was really worth the effort.

I've also removed the Datastore::upsert_service method, which was used only for test code and is now dead. I considered deleting all code related to querying the services table in this branch as well. However, I noticed that it's still populated when initializing the rack, and that omdb has commands for querying that table. I wasn't sure if there were alternative data sources for the omdb debugging commands yet, so I didn't remove them. If the data provided by those commands is available elsewhere, or if their only real purpose is just to print the state of this table, I'm happy to delete them in this branch, as well.

Closes #4889

image

hawkw added 11 commits February 8, 2024 15:40
after a bunch of messing with it i made the executive decision to just
get rid of that test. the test for DNS propagation exercises it, and
it's harder to test now
@hawkw hawkw changed the title we stored your DNS servers in DNS, so you can DNS while you DNS put DNS servers in DNS, so you can DNS while you DNS Feb 8, 2024
@hawkw hawkw changed the title put DNS servers in DNS, so you can DNS while you DNS [nexus] put DNS servers in DNS, so you can DNS while you DNS Feb 9, 2024
@hawkw hawkw marked this pull request as ready for review February 9, 2024 20:07
@smklein
Copy link
Collaborator

smklein commented Feb 9, 2024

Therefore, this branch changes how DNS servers are discovered for DNS propagation. Rather than discovering DNS server addresses from the services table, the DnsWatcher background task now discovers DNS servers...using internal DNS. As described in #4889, this may seem like a cyclical dependency, but, because the initial set of internal DNS servers operate at known addresses -- by design -- so that they can always be discovered. And they have to be up and discoverable for Nexus to even come up and find CockroachDB. So, internal DNS can safely be assumed to be up if Nexus has come up at all. Now, the services table is no longer used, and

I think this might have gotten cut-off!

I've also removed the Datastore::upsert_service method, which was used only for test code and is now dead. I considered deleting all code related to querying the services table in this branch as well. However, I noticed that it's still populated when initializing the rack, and that omdb has commands for querying that table. I wasn't sure if there were alternative data sources for the omdb debugging commands yet, so I didn't remove them. If the data provided by those commands is available elsewhere, or if their only real purpose is just to print the state of this table, I'm happy to delete them in this branch, as well.

I think it's wise to defer the deletion to a separate branch, just to keep things decoupled. Happy with this direction though.

nexus/src/app/background/dns_servers.rs Show resolved Hide resolved
Comment on lines +90 to +93
// TODO(eliza): it would be nicer if `Resolver` had a method
// returning an iterator instead of a `Vec`, so we didn't
// have to drain the Vec and then collect it into a new
// one...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be very doable, though it doesn't need to be part of this PR. The lookup_all_ipv6 method is a convenience function within omicron, acting on an iterator type internally, which we could totally return.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i might do that, although i didn't think it was particularly important to minimize allocation overhead in the background task --- this code doesn't seem particularly hot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lookup methods on Resolver could definitely use a fresh pass. But agreed, it's not a blocker here

nexus/src/app/background/init.rs Outdated Show resolved Hide resolved
@hawkw hawkw requested a review from luqmana February 10, 2024 01:31
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this and @luqmana and @smklein for the review. I appreciate the extra eyes on this.

@@ -467,29 +476,70 @@ pub mod test {
SocketAddr::V4(_) => panic!("expected v6 address"),
SocketAddr::V6(a) => a,
};

let update = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since DNS is now both the thing that configures us and the thing that we're configuring, I feel like this could use a comment. Maybe something like:

Suggested change
let update = {
// In order to test that DNS gets propagated to a newly-added server, we
// first need to update the source of truth about DNS (the database).
// Then we need to wait for that to get propagated (by this same
// mechanism) to the existing DNS servers. Only then would we expect
// the mechanism to see the new DNS server and then propagate
// configuration to it.
let update = {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, I'll add that!

@davepacheco
Copy link
Collaborator

In terms of #4987: I agree with not removing all of the service table stuff in this PR. I'd like to avoid doing that until we've finished with its replacement (i.e., until blueprints are far enough along that we've automated their generation so that we can always look at a blueprint when we want to know what zones should exist). Ideally I'd love to avoid building any new code that uses this table but I can't think of a great way to do that aside from a comment in the datastore file to that effect.

hawkw added a commit that referenced this pull request Feb 12, 2024
Depends on #5033.

As described in #4947, we would like to remove the `services` table, as
its only remaining use is DNS propagation. PR #5033 changes DNS
propagation to no longer use the `services` table. Now that we're no
longer consuming this table, this commit removes the one method which
queries that table (`Datastore::services_list_kind`) and its two
remaining callers, the OMDB commands that query the table. I've also
removed the test for inserting and querying this table.

Note that I have *not* yet removed the code in RSS that actually
populates this table when the rack is set up. I thought it might be
desirable to still populate this data in case we have to roll back to a
previous version of Nexus that uses the `services` table. If this isn't
something we care about, I can remove that code as well, allowing us to
remove the `Datastore` methods for inserting to `services`.
@hawkw
Copy link
Member Author

hawkw commented Feb 12, 2024

In terms of #4987: I agree with not removing all of the service table stuff in this PR. I'd like to avoid doing that until we've finished with its replacement (i.e., until blueprints are far enough along that we've automated their generation so that we can always look at a blueprint when we want to know what zones should exist). Ideally I'd love to avoid building any new code that uses this table but I can't think of a great way to do that aside from a comment in the datastore file to that effect.

One option would be removing the Datastore wrappers that actually query the table, in order to discourage its use in new code, but continuing to populate it in case we need it, until it's no longer necessary. Of course, we can still write new queries for the table, but not having a method for it should maybe make it less easy to reach for. If you think that this approach makes sense, I've opened #5044 to do that.

@hawkw hawkw enabled auto-merge (squash) February 12, 2024 20:48
@hawkw hawkw merged commit e1c3dd7 into main Feb 12, 2024
20 checks passed
@hawkw hawkw deleted the eliza/dns-in-dns branch February 12, 2024 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DNS propagation should use internal DNS to find DNS servers
4 participants