You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
i was digging into a DNS client's view of #6342 and found some more issues. notes and repro steps in the fold. any changes mentioned are in ixi/hickory-client-testing. issues identified along the way are below.
for i in $(seq 65 90); do cargo run --bin dnsadm -- -a [::1]:80 add-a computer.oxide.test foo 49.50.51.$i; done
for i in $(seq 97 122); do cargo run --bin dnsadm -- -a [::1]:80 add-a computer.oxide.test foo 49.50.51.$i; done
with our stock DNS server settings, from dig, this results in:
> dig -6 A foo.computer.oxide.test @::1
; <<>> DiG 9.18.28-0ubuntu0.24.04.1-Ubuntu <<>> -6 A foo.computer.oxide.test @::1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58164
;; flags: qr rd; QUERY: 1, ANSWER: 52, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;foo.computer.oxide.test. IN A
;; ANSWER SECTION:
foo.computer.oxide.test. 0 IN A 49.50.51.65
[... snip ...]
foo.computer.oxide.test. 0 IN A 49.50.51.90
foo.computer.oxide.test. 0 IN A 49.50.51.97
[... snip ...]
foo.computer.oxide.test. 0 IN A 49.50.51.122
;; Query time: 0 msec
;; SERVER: ::1#53(::1) (UDP)
;; WHEN: Thu Aug 22 20:57:57 UTC 2024
;; MSG SIZE rcvd: 873
which is to say, dig receives the whole response without issue.
in [branch] i've added a command to dnsadm to query the server using hickory-resolver, like internal-dns provides. it mirrors our experiences in #5912 and #6342:
~/oxide/omicron/dns-server$ cargo run --bin dnsadm -- -a [::1]:53 query -- A foo.computer.oxide.test
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.40s
Running `/home/iximeow/oxide/omicron/target/debug/dnsadm -a '[::1]:53' query -- A foo.computer.oxide.test`
thread 'main' panicked at dns-server/src/bin/dnsadm.rs:277:70:
A lookup: ResolveError { kind: Timeout }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
this is mostly rehashing what we have seen elsewhere. there rest of this issue describes interesting observations and improvements we can make on the DNS client side.
error reporting
our DNS clients apparently silently failed (ending in timeout) when presented with larger-than-expected responses. it turns out they were trying to log, but not in a way we configured to hear. with an early tracing_subscriber::fmt::init() in dnsadm, rerunning the above query yields a bit more guidance:
~/oxide/omicron/dns-server$ cargo run --bin dnsadm -- -a [::1]:53 query -- A foo.computer.oxide.test
Compiling dns-server v0.1.0 (/home/iximeow/oxide/omicron/dns-server)
Finished `dev` profile [unoptimized + debuginfo] target(s) in 2.61s
Running `/home/iximeow/oxide/omicron/target/debug/dnsadm -a '[::1]:53' query -- A foo.computer.oxide.test`
2024-08-22T21:08:56.323660Z WARN hickory_proto::udp::udp_client_stream: dropped malformed message waiting for id: 46737 err: unexpected end of input reached
2024-08-22T21:09:01.326809Z WARN hickory_proto::udp::udp_client_stream: dropped malformed message waiting for id: 12579 err: unexpected end of input reached
2024-08-22T21:09:06.328318Z WARN hickory_proto::udp::udp_client_stream: dropped malformed message waiting for id: 47699 err: unexpected end of input reached
thread 'main' panicked at dns-server/src/bin/dnsadm.rs:277:70:
A lookup: ResolveError { kind: Timeout }
the WARN lines come from this region in hickory-dns. unfortunately, these are swallowed by send_serial_message_inner, at which point the client will retry until it times out. our options seem to be:
provide our own UdpClientStream, and DnsRequestSender impl on it, so that we can record errors on the wire, or
get hickory-dns changed in a way that allows us to inspect and handle these errors
in the case of truncated responses, it seems like no amount of retry would help, so at the very least we should bail out early...
speaking of truncation, this is the point things become really interesting.
truncation
our DNS server does not currently set the truncated bit, so we'll send oversized responses to clients. setting the bit is simple enough in the server:
diff --git a/dns-server/src/dns_server.rs b/dns-server/src/dns_server.rs
index 4ecbe382c..756ab71d8 100644
--- a/dns-server/src/dns_server.rs+++ b/dns-server/src/dns_server.rs@@ -411,6 +411,7 @@ where
async move {
let mut resp_data = Vec::new();
let mut enc = BinEncoder::new(&mut resp_data);
+ enc.set_max_size(512);
let _ = mresp
.destructive_emit(&mut enc)
.with_context(|| format!("encoding {}", label))?;
with this change, dig now reports that the response is truncated:
~/oxide/omicron/dns-server$ dig -6 A foo.computer.oxide.test @::1
;; Truncated, retrying in TCP mode.
;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached
;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached
;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached
because we are no longer sending oversized responses:
~/oxide/omicron/dns-server$ strace -f -s 4096 -e sendmmsg,recvmsg dig -6 A foo.computer.oxide.test @::1
strace: Process 181562 attached
strace: Process 181564 attached
[pid 181562] sendmmsg(17, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\30\320\1 \0\1\0\0\0\0\0\1\3foo\10computer\5oxide\4test\0\0\1\0\1\0\0)\4\320\0\0\0\0\0\f\0\n\0\10\1\33\310\4\3}f\324", iov_len=64}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=64}], 1, 0) = 1
[pid 181562] recvmsg(17, {msg_name={sa_family=AF_INET6, sin6_port=htons(53), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, msg_namelen=128 => 28, msg_iov=[{iov_base="\30\320\203\0\0\1\0\34\0\0\0\0\3foo\10computer\5oxide\4test\0\0\1\0\1\300\f\0\1\0\1\0\0\0\0\0\004123A\300\f\0\1\0\1\0\0\0\0\0\004123B\300\f\0\1\0\1\0\0\0\0\0\004123C\300\f\0\1\0\1\0\0\0\0\0\004123D\300\f\0\1\0\1\0\0\0\0\0\004123E\300\f\0\1\0\1\0\0\0\0\0\004123F\300\f\0\1\0\1\0\0\0\0\0\004123G\300\f\0\1\0\1\0\0\0\0\0\004123H\300\f\0\1\0\1\0\0\0\0\0\004123I\300\f\0\1\0\1\0\0\0\0\0\004123J\300\f\0\1\0\1\0\0\0\0\0\004123K\300\f\0\1\0\1\0\0\0\0\0\004123L\300\f\0\1\0\1\0\0\0\0\0\004123M\300\f\0\1\0\1\0\0\0\0\0\004123N\300\f\0\1\0\1\0\0\0\0\0\004123O\300\f\0\1\0\1\0\0\0\0\0\004123P\300\f\0\1\0\1\0\0\0\0\0\004123Q\300\f\0\1\0\1\0\0\0\0\0\004123R\300\f\0\1\0\1\0\0\0\0\0\004123S\300\f\0\1\0\1\0\0\0\0\0\004123T\300\f\0\1\0\1\0\0\0\0\0\004123U\300\f\0\1\0\1\0\0\0\0\0\004123V\300\f\0\1\0\1\0\0\0\0\0\004123W\300\f\0\1\0\1\0\0\0\0\0\004123X\300\f\0\1\0\1\0\0\0\0\0\004123Y\300\f\0\1\0\1\0\0\0\0\0\004123Z\300\f\0\1\0\1\0\0\0\0\0\004123a\300\f\0\1\0\1\0\0\0\0\0\004123b\3foo\10computer\5oxide\4", iov_len=1310700}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 509
;; Truncated, retrying in TCP mode.
;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached
;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached
;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached
[pid 181562] +++ exited with 0 +++
[pid 181564] +++ exited with 0 +++
+++ exited with 9 +++
dig then tries to use TCP, which we don't support, and errors. that's nice because it's another point that resolvers used by software outside omicron will probably work reasonably.
meanwhile, with the resolver we use across Omicron:
~/oxide/omicron/dns-server$ RUST_LOG=debug ../target/debug/dnsadm -a [::1]:53 query -- A foo.computer.oxide.test
2024-08-22T21:43:27.545872Z DEBUG hickory_proto::xfer::dns_handle: querying: foo.computer.oxide.test A
2024-08-22T21:43:27.545924Z DEBUG hickory_resolver::name_server::name_server_pool: sending request: [Query { name: Name("foo.computer.oxide.test"), query_type: A, query_class: IN }]
2024-08-22T21:43:27.545971Z DEBUG hickory_resolver::name_server::name_server: reconnecting: NameServerConfig { socket_addr: [::1]:53, protocol: Udp, tls_dns_name: None, trust_negative_responses: true, bind_addr: None }
2024-08-22T21:43:27.546037Z DEBUG hickory_proto::xfer: enqueueing message:QUERY:[Query { name: Name("foo.computer.oxide.test"), query_type: A, query_class: IN }] 2024-08-22T21:43:27.546253Z DEBUG hickory_proto::udp::udp_client_stream: final message: ; header 30500:QUERY:RD:NoError:QUERY:0/0/0
; query
;; foo.computer.oxide.test. IN A
2024-08-22T21:43:27.546388Z DEBUG hickory_proto::udp::udp_stream: created socket successfully
2024-08-22T21:43:27.546859Z DEBUG hickory_proto::udp::udp_client_stream: received message id: 30500
2024-08-22T21:43:27.546920Z DEBUG hickory_resolver::error: Response:; header 30500:RESPONSE:RD,TC:NoError:QUERY:28/0/0
; query
;; foo.computer.oxide.test. IN A
; answers 28
foo.computer.oxide.test. 0 IN A 49.50.51.65
[... snip ...]
foo.computer.oxide.test. 0 IN A 49.50.51.98
; nameservers 0
; additionals 0
2024-08-22T21:43:27.547063Z DEBUG hickory_resolver::name_server::name_server_pool: truncated response received, retrying over TCP
2024-08-22T21:43:27.547074Z DEBUG hickory_resolver::name_server::name_server_pool: no TCP connections available
[... retry falls back to UDP, same output as above ...]
DEBG A, response: Ipv4Lookup(Lookup { [... records for 49.50.51.65 through 49.50.51.98 ...] })
notably: hickory-resolver notices the truncated response, tries again, gets the same results, then returns the records in that truncated response to the caller of ipv4_lookup. since ipv4_loookup does not error here, i don't think there is a way we can know, as a consumer of hickory-resolver, that we got a truncated response. we would just get a partial answer, unable to detect an inconsistency.
The text was updated successfully, but these errors were encountered:
i was digging into a DNS client's view of #6342 and found some more issues. notes and repro steps in the fold. any changes mentioned are in
ixi/hickory-client-testing
. issues identified along the way are below.created ~52 records:
with our stock DNS server settings, from
dig
, this results in:which is to say,
dig
receives the whole response without issue.in [branch] i've added a command to
dnsadm
to query the server usinghickory-resolver
, likeinternal-dns
provides. it mirrors our experiences in #5912 and #6342:this is mostly rehashing what we have seen elsewhere. there rest of this issue describes interesting observations and improvements we can make on the DNS client side.
error reporting
our DNS clients apparently silently failed (ending in timeout) when presented with larger-than-expected responses. it turns out they were trying to log, but not in a way we configured to hear. with an early
tracing_subscriber::fmt::init()
indnsadm
, rerunning the above query yields a bit more guidance:the
WARN
lines come from this region inhickory-dns
. unfortunately, these are swallowed bysend_serial_message_inner
, at which point the client will retry until it times out. our options seem to be:UdpClientStream
, andDnsRequestSender
impl on it, so that we can record errors on the wire, orhickory-dns
changed in a way that allows us to inspect and handle these errorsin the case of truncated responses, it seems like no amount of retry would help, so at the very least we should bail out early...
speaking of truncation, this is the point things become really interesting.
truncation
our DNS server does not currently set the truncated bit, so we'll send oversized responses to clients. setting the bit is simple enough in the server:
with this change,
dig
now reports that the response is truncated:because we are no longer sending oversized responses:
dig
then tries to use TCP, which we don't support, and errors. that's nice because it's another point that resolvers used by software outside omicron will probably work reasonably.meanwhile, with the resolver we use across Omicron:
notably:
hickory-resolver
notices the truncated response, tries again, gets the same results, then returns the records in that truncated response to the caller ofipv4_lookup
. sinceipv4_loookup
does not error here, i don't think there is a way we can know, as a consumer ofhickory-resolver
, that we got a truncated response. we would just get a partial answer, unable to detect an inconsistency.The text was updated successfully, but these errors were encountered: