follow trust-dns to its new name: hickory #5912
Conversation
As a heads-up, when I wrote qorb, I started using hickory from the get-go (see: #5876). I ran into some issues specifically within the Hickory DNS client resolver. Although hickory-dns doesn't take a slog logger as an argument, it does use tracing, and when I manually added a tracing subscriber I got more info (this is a big motivation for my writing RFD 489). Anyway, when hickory DNS clients made requests through qorb, I saw a bunch of tracing messages that looked kinda like:
This seemed to match some of the symptoms described by hickory-dns/hickory-dns#2140 , which were triggered by the upgrade across the 0.22 -> 0.23 boundary. When I enabled EDNS, I stopped seeing the end-of-input reached messages. I'd still like to dig into the underlying root cause more here, but if you see failing DNS client requests with this upgrade, hopefully this can be a useful trail of breadcrumbs.
From the logs, I'm seeing the following from the internal-dns logs:
It also appears CockroachDB initialization isn't completing, based on this error
To follow up on this: I see this same error in successful builds. I also homed in on a similar failure. I'm trying to get some tracing information out of hickory-dns.
I see messages coming into the internal DNS server:
But the corresponding request times out:
Note that this crdb zone is
I finally put this up on a4x2 so we could debug the helios-deploy failure interactively. The problem readily reproduced: the system got stuck bringing up the CockroachDB zones:
and it's the same problem we saw in helios-deploy: dnswait is sitting there waiting to get a response from the DNS servers:
but the DNS servers are happily reporting receiving and responding to these requests:
I was immediately able to confirm that
So there's no problem with the networking stack here, and the server at least seems to be working. @ahl had already confirmed that the problem was not readily reproducible when locally running the DNS server. Of course, I was also able to reproduce this running
I also used
In particular, we see the zone querying all three servers and getting responses for the SRV queries. Around this point I noticed this message in the dnsadm output:
and wondered if that could be related. @ahl immediately noticed that he had enabled eDNS in our internal DNS resolver constructor that accepts a specific list of (our) resolvers, but had not changed the other code path. Wondering if eDNS was on the scene, I went back to the full dig output:
I do not see EDNS here. When using EDNS, dig prints something like this for the query part:
and this in the response:
At the same time, the "rcvd: 652" above shows the packet received was 652 bytes. DNS appears to have a limit of 512 bytes for normal UDP (non-EDNS) messages. I'm not sure this is definitely wrong, but at best it doesn't seem sound to expect clients to handle this well. I filed #6342 for this. I also saved a complete packet capture of the DNS traffic for a few queries and responses. It's not notable except that Wireshark also reports nothing about EDNS being used. I have not verified any of the following but here's my best guess about what was happening:
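To make the size arithmetic above concrete: classic DNS-over-UDP (RFC 1035) caps a message at 512 bytes, and a client only gets more than that by advertising a larger payload size in an EDNS(0) OPT record on the query. A tiny self-contained sketch of that rule (the 1232-byte buffer below is just a commonly recommended example value, not something our resolver is configured with):

```rust
/// RFC 1035 limit for DNS messages over UDP when no EDNS(0)
/// OPT record advertises a larger buffer.
const CLASSIC_UDP_LIMIT: usize = 512;

/// Would a response of `response_len` bytes fit within the client's
/// advertised UDP budget? `edns_buf_size` is None when the query
/// carried no OPT record (i.e., EDNS disabled).
fn fits_in_udp(response_len: usize, edns_buf_size: Option<usize>) -> bool {
    response_len <= edns_buf_size.unwrap_or(CLASSIC_UDP_LIMIT)
}

fn main() {
    // The 652-byte response observed above overflows classic UDP...
    assert!(!fits_in_udp(652, None));
    // ...but fits once the client advertises, say, a 1232-byte buffer.
    assert!(fits_in_udp(652, Some(1232)));
    println!("ok");
}
```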
I just learned about
If I'm reading this right,
internal-dns/src/resolver.rs
Outdated
```rust
/// Construct a new DNS resolver from the system configuration.
pub fn new_from_system_conf(
    log: slog::Logger,
) -> Result<Self, ResolveError> {
    let (rc, mut opts) =
        hickory_resolver::system_conf::read_system_conf()?;
    opts.edns0 = true;

    let resolver = TokioAsyncResolver::tokio(rc, opts);

    Ok(Self { log, resolver })
}
```
this is one of the few things that I think is "new"; @davepacheco (or others) please let me know if you think there's a better approach
```rust
let mut resolver_opts = ResolverOpts::default();
// Enable EDNS for potentially larger records
resolver_opts.edns0 = true;
```
This block repeats a lot and the implications have been subtle -- I just wonder if we can/should put this into a common helper with more of an explanation.
obviates #4439 (if this works)