DNS client does not behave well with truncated responses #6415

iximeow · 2024-08-22T22:08:23Z

i was digging into a DNS client's view of #6342 and found some more issues. notes and repro steps in the fold. any changes mentioned are in ixi/hickory-client-testing. issues identified along the way are below.

started a standalone server:

sudo ./target/release/dns-server --config-file ./config.toml --http-address [::1]:80 --dns-address [::1]:53

created ~52 records:

for i in $(seq 65 90); do cargo run --bin dnsadm -- -a [::1]:80 add-a computer.oxide.test foo 49.50.51.$i; done
for i in $(seq 97 122); do cargo run --bin dnsadm -- -a [::1]:80 add-a computer.oxide.test foo 49.50.51.$i; done

with our stock DNS server settings, from dig, this results in:

> dig -6 A foo.computer.oxide.test @::1

; <<>> DiG 9.18.28-0ubuntu0.24.04.1-Ubuntu <<>> -6 A foo.computer.oxide.test @::1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58164
;; flags: qr rd; QUERY: 1, ANSWER: 52, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;foo.computer.oxide.test.	IN	A

;; ANSWER SECTION:
foo.computer.oxide.test. 0	IN	A	49.50.51.65
[... snip ...]
foo.computer.oxide.test. 0	IN	A	49.50.51.90
foo.computer.oxide.test. 0	IN	A	49.50.51.97
[... snip ...]
foo.computer.oxide.test. 0	IN	A	49.50.51.122

;; Query time: 0 msec
;; SERVER: ::1#53(::1) (UDP)
;; WHEN: Thu Aug 22 20:57:57 UTC 2024
;; MSG SIZE  rcvd: 873

which is to say, dig receives the whole response without issue.

in [branch] i've added a command to dnsadm to query the server using hickory-resolver, like internal-dns provides. it mirrors our experiences in #5912 and #6342:

~/oxide/omicron/dns-server$ cargo run --bin dnsadm -- -a [::1]:53 query -- A foo.computer.oxide.test
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.40s
     Running `/home/iximeow/oxide/omicron/target/debug/dnsadm -a '[::1]:53' query -- A foo.computer.oxide.test`
thread 'main' panicked at dns-server/src/bin/dnsadm.rs:277:70:
A lookup: ResolveError { kind: Timeout }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

this is mostly rehashing what we have seen elsewhere. there rest of this issue describes interesting observations and improvements we can make on the DNS client side.

error reporting

our DNS clients apparently silently failed (ending in timeout) when presented with larger-than-expected responses. it turns out they were trying to log, but not in a way we configured to hear. with an early tracing_subscriber::fmt::init() in dnsadm, rerunning the above query yields a bit more guidance:

~/oxide/omicron/dns-server$ cargo run --bin dnsadm -- -a [::1]:53 query -- A foo.computer.oxide.test
   Compiling dns-server v0.1.0 (/home/iximeow/oxide/omicron/dns-server)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 2.61s
     Running `/home/iximeow/oxide/omicron/target/debug/dnsadm -a '[::1]:53' query -- A foo.computer.oxide.test`
2024-08-22T21:08:56.323660Z  WARN hickory_proto::udp::udp_client_stream: dropped malformed message waiting for id: 46737 err: unexpected end of input reached
2024-08-22T21:09:01.326809Z  WARN hickory_proto::udp::udp_client_stream: dropped malformed message waiting for id: 12579 err: unexpected end of input reached
2024-08-22T21:09:06.328318Z  WARN hickory_proto::udp::udp_client_stream: dropped malformed message waiting for id: 47699 err: unexpected end of input reached
thread 'main' panicked at dns-server/src/bin/dnsadm.rs:277:70:
A lookup: ResolveError { kind: Timeout }

the WARN lines come from this region in hickory-dns. unfortunately, these are swallowed by send_serial_message_inner, at which point the client will retry until it times out. our options seem to be:

provide our own UdpClientStream, and DnsRequestSender impl on it, so that we can record errors on the wire, or
get hickory-dns changed in a way that allows us to inspect and handle these errors

in the case of truncated responses, it seems like no amount of retry would help, so at the very least we should bail out early...

speaking of truncation, this is the point things become really interesting.

truncation

our DNS server does not currently set the truncated bit, so we'll send oversized responses to clients. setting the bit is simple enough in the server:

diff --git a/dns-server/src/dns_server.rs b/dns-server/src/dns_server.rs
index 4ecbe382c..756ab71d8 100644
--- a/dns-server/src/dns_server.rs
+++ b/dns-server/src/dns_server.rs
@@ -411,6 +411,7 @@ where
     async move {
         let mut resp_data = Vec::new();
         let mut enc = BinEncoder::new(&mut resp_data);
+        enc.set_max_size(512);
         let _ = mresp
             .destructive_emit(&mut enc)
             .with_context(|| format!("encoding {}", label))?;

with this change, dig now reports that the response is truncated:

~/oxide/omicron/dns-server$ dig -6 A foo.computer.oxide.test @::1
;; Truncated, retrying in TCP mode.
;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached

;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached

;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached

because we are no longer sending oversized responses:

~/oxide/omicron/dns-server$ strace -f -s 4096 -e sendmmsg,recvmsg dig -6 A foo.computer.oxide.test @::1
strace: Process 181562 attached
strace: Process 181564 attached
[pid 181562] sendmmsg(17, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\30\320\1 \0\1\0\0\0\0\0\1\3foo\10computer\5oxide\4test\0\0\1\0\1\0\0)\4\320\0\0\0\0\0\f\0\n\0\10\1\33\310\4\3}f\324", iov_len=64}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=64}], 1, 0) = 1
[pid 181562] recvmsg(17, {msg_name={sa_family=AF_INET6, sin6_port=htons(53), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, msg_namelen=128 => 28, msg_iov=[{iov_base="\30\320\203\0\0\1\0\34\0\0\0\0\3foo\10computer\5oxide\4test\0\0\1\0\1\300\f\0\1\0\1\0\0\0\0\0\004123A\300\f\0\1\0\1\0\0\0\0\0\004123B\300\f\0\1\0\1\0\0\0\0\0\004123C\300\f\0\1\0\1\0\0\0\0\0\004123D\300\f\0\1\0\1\0\0\0\0\0\004123E\300\f\0\1\0\1\0\0\0\0\0\004123F\300\f\0\1\0\1\0\0\0\0\0\004123G\300\f\0\1\0\1\0\0\0\0\0\004123H\300\f\0\1\0\1\0\0\0\0\0\004123I\300\f\0\1\0\1\0\0\0\0\0\004123J\300\f\0\1\0\1\0\0\0\0\0\004123K\300\f\0\1\0\1\0\0\0\0\0\004123L\300\f\0\1\0\1\0\0\0\0\0\004123M\300\f\0\1\0\1\0\0\0\0\0\004123N\300\f\0\1\0\1\0\0\0\0\0\004123O\300\f\0\1\0\1\0\0\0\0\0\004123P\300\f\0\1\0\1\0\0\0\0\0\004123Q\300\f\0\1\0\1\0\0\0\0\0\004123R\300\f\0\1\0\1\0\0\0\0\0\004123S\300\f\0\1\0\1\0\0\0\0\0\004123T\300\f\0\1\0\1\0\0\0\0\0\004123U\300\f\0\1\0\1\0\0\0\0\0\004123V\300\f\0\1\0\1\0\0\0\0\0\004123W\300\f\0\1\0\1\0\0\0\0\0\004123X\300\f\0\1\0\1\0\0\0\0\0\004123Y\300\f\0\1\0\1\0\0\0\0\0\004123Z\300\f\0\1\0\1\0\0\0\0\0\004123a\300\f\0\1\0\1\0\0\0\0\0\004123b\3foo\10computer\5oxide\4", iov_len=1310700}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 509
;; Truncated, retrying in TCP mode.
;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached

;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached

;; Connection to ::1#53(::1) for foo.computer.oxide.test failed: connection refused.
;; no servers could be reached

[pid 181562] +++ exited with 0 +++
[pid 181564] +++ exited with 0 +++
+++ exited with 9 +++

dig then tries to use TCP, which we don't support, and errors. that's nice because it's another point that resolvers used by software outside omicron will probably work reasonably.

meanwhile, with the resolver we use across Omicron:

~/oxide/omicron/dns-server$ RUST_LOG=debug  ../target/debug/dnsadm -a [::1]:53 query -- A foo.computer.oxide.test
2024-08-22T21:43:27.545872Z DEBUG hickory_proto::xfer::dns_handle: querying: foo.computer.oxide.test A
2024-08-22T21:43:27.545924Z DEBUG hickory_resolver::name_server::name_server_pool: sending request: [Query { name: Name("foo.computer.oxide.test"), query_type: A, query_class: IN }]
2024-08-22T21:43:27.545971Z DEBUG hickory_resolver::name_server::name_server: reconnecting: NameServerConfig { socket_addr: [::1]:53, protocol: Udp, tls_dns_name: None, trust_negative_responses: true, bind_addr: None }
2024-08-22T21:43:27.546037Z DEBUG hickory_proto::xfer: enqueueing message:QUERY:[Query { name: Name("foo.computer.oxide.test"), query_type: A, query_class: IN }]                                                                                                                 2024-08-22T21:43:27.546253Z DEBUG hickory_proto::udp::udp_client_stream: final message: ; header 30500:QUERY:RD:NoError:QUERY:0/0/0
; query
;; foo.computer.oxide.test. IN A

2024-08-22T21:43:27.546388Z DEBUG hickory_proto::udp::udp_stream: created socket successfully
2024-08-22T21:43:27.546859Z DEBUG hickory_proto::udp::udp_client_stream: received message id: 30500
2024-08-22T21:43:27.546920Z DEBUG hickory_resolver::error: Response:; header 30500:RESPONSE:RD,TC:NoError:QUERY:28/0/0
; query
;; foo.computer.oxide.test. IN A
; answers 28
foo.computer.oxide.test. 0 IN A 49.50.51.65
[... snip ...]
foo.computer.oxide.test. 0 IN A 49.50.51.98
; nameservers 0
; additionals 0

2024-08-22T21:43:27.547063Z DEBUG hickory_resolver::name_server::name_server_pool: truncated response received, retrying over TCP
2024-08-22T21:43:27.547074Z DEBUG hickory_resolver::name_server::name_server_pool: no TCP connections available
[... retry falls back to UDP, same output as above ...]
DEBG A, response: Ipv4Lookup(Lookup { [... records for 49.50.51.65 through 49.50.51.98 ...] })

notably: hickory-resolver notices the truncated response, tries again, gets the same results, then returns the records in that truncated response to the caller of ipv4_lookup. since ipv4_loookup does not error here, i don't think there is a way we can know, as a consumer of hickory-resolver, that we got a truncated response. we would just get a partial answer, unable to detect an inconsistency.

The text was updated successfully, but these errors were encountered:

iximeow mentioned this issue Aug 22, 2024

dns server does not behave well with DNS size limits #6342

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DNS client does not behave well with truncated responses #6415

DNS client does not behave well with truncated responses #6415

iximeow commented Aug 22, 2024 •

edited

Loading

DNS client does not behave well with truncated responses #6415

DNS client does not behave well with truncated responses #6415

Comments

iximeow commented Aug 22, 2024 • edited Loading

error reporting

truncation

iximeow commented Aug 22, 2024 •

edited

Loading