follow trust-dns to its new name: hickory #5912
Conversation
As a heads-up, when I wrote qorb, I started using hickory from the get-go (see: #5876). I ran into some issues specifically within the Hickory DNS client resolver. Although hickory-dns doesn't take a slog logger as an argument, it does use tracing, and when I manually added a tracing subscriber I got more info (this is a big motivation for my writing RFD 489). Anyway, when hickory DNS clients made requests through qorb, I saw a bunch of tracing messages that looked kinda like:
This seemed to match some of the symptoms described by hickory-dns/hickory-dns#2140 , which were triggered by the upgrade across the 0.22 -> 0.23 boundary. When I enabled EDNS, I stopped seeing the end-of-input reached messages. I'd still like to dig into the underlying root cause more here, but if you see failing DNS client requests with this upgrade, hopefully this can be a useful trail of breadcrumbs.
From the logs, I'm seeing the following from the internal-dns logs:
It also appears CockroachDB initialization isn't completing, based on this error
To follow up on this: I see this same error in successful builds. I also homed in on a similar failure. I'm trying to get some tracing information out of hickory-dns.
I see messages coming into the internal DNS server:
But the corresponding request times out:
Note that this crdb zone is
I finally put this up on a4x2 so we could debug the helios-deploy failure interactively. The problem readily reproduced: the system got stuck bringing up the CockroachDB zones:
and it's the same problem we saw in helios-deploy: dnswait is sitting there waiting to get a response from the DNS servers:
but the DNS servers are happily reporting receiving and responding to these requests:
I was immediately able to confirm that
So there's no problem with the networking stack here, and the server at least seems to be working. @ahl had already confirmed that the problem was not readily reproducible when locally running the DNS server. Of course, I was also able to reproduce this running
I also used
In particular, we see the zone querying all three servers and getting responses for the SRV queries. Around this point I noticed this message in the dnsadm output:
and wondered if that could be related. @ahl immediately noticed that he had enabled eDNS in our internal DNS resolver constructor that accepts a specific list of (our) resolvers, but had not changed the other code path. Wondering if eDNS was on the scene, I went back to the full dig output:
I do not see EDNS here. When using EDNS, dig prints something like this for the query part:
and this in the response:
At the same time, the "rcvd: 652" above shows the packet received was 652 bytes. DNS appears to have a limit of 512 bytes for normal UDP (non-EDNS) messages. I'm not sure this is definitely wrong, but at best it doesn't seem sound to expect clients to handle this well. I filed #6342 for this. I also saved a complete packet capture of the DNS traffic for a few queries and responses. It's not notable except that Wireshark also reports nothing about EDNS being used. I have not verified any of the following but here's my best guess about what was happening:
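To make the size arithmetic above concrete: classic DNS-over-UDP (RFC 1035) caps a message at 512 bytes, and a client only gets more than that by advertising a larger payload size in an EDNS(0) OPT record on the query. A tiny self-contained sketch of that rule (the 1232-byte buffer below is just a commonly recommended example value, not something our resolver is configured with):

```rust
/// RFC 1035 limit for DNS messages over UDP when no EDNS(0)
/// OPT record advertises a larger buffer.
const CLASSIC_UDP_LIMIT: usize = 512;

/// Would a response of `response_len` bytes fit within the client's
/// advertised UDP budget? `edns_buf_size` is None when the query
/// carried no OPT record (i.e., EDNS disabled).
fn fits_in_udp(response_len: usize, edns_buf_size: Option<usize>) -> bool {
    response_len <= edns_buf_size.unwrap_or(CLASSIC_UDP_LIMIT)
}

fn main() {
    // The 652-byte response observed above overflows classic UDP...
    assert!(!fits_in_udp(652, None));
    // ...but fits once the client advertises, say, a 1232-byte buffer.
    assert!(fits_in_udp(652, Some(1232)));
    println!("ok");
}
```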
I just learned about
If I'm reading this right,
internal-dns/src/resolver.rs
Outdated
```rust
/// Construct a new DNS resolver from the system configuration.
pub fn new_from_system_conf(
    log: slog::Logger,
) -> Result<Self, ResolveError> {
    let (rc, mut opts) =
        hickory_resolver::system_conf::read_system_conf()?;
    opts.edns0 = true;

    let resolver = TokioAsyncResolver::tokio(rc, opts);

    Ok(Self { log, resolver })
}
```
this is one of the few things that I think is "new"; @davepacheco (or others) please let me know if you think there's a better approach
```rust
let mut resolver_opts = ResolverOpts::default();
// Enable EDNS for potentially larger records
resolver_opts.edns0 = true;
```
This block repeats a lot and the implications have been subtle -- I just wonder if we can/should put this into a common helper with more of an explanation.
obviates #4439 (if this works)