
external_endpoints should support more than 200 silos and TLS certificates #7291

Merged
merged 3 commits into main from dap/paginate-endpoints on Dec 20, 2024

Conversation

@davepacheco (Collaborator) commented Dec 20, 2024

Fixes #7289.

In terms of testing, there are three commits here:

  • bafa224 adds an integration test that creates 250 silos and TLS certificates
  • f138943 fixes the bug
  • 5e4153b reverts the first commit because I think this test is too expensive to run regularly (more on this below)

I have run the test twice. With the fix, the test passed. Without the fix, the test failed with:

$ cargo nextest run -p omicron-nexus --nocapture -- test_silo_certificates 
warning: this repository recommends nextest version 0.9.86, but the current version is 0.9.77
info: experimental features enabled: setup-scripts
   Compiling omicron-nexus v0.1.0 (/home/dap/omicron-fix/nexus)
    Finished `test` profile [unoptimized + debuginfo] target(s) in 2m 41s
------------
 Nextest run ID c9f71573-ba50-4d59-81e7-e9d74db7f91f with nextest profile: default
    Starting 1 test across 4 binaries (550 tests skipped)
       SETUP [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
    Finished `test` profile [unoptimized + debuginfo] target(s) in 1.30s
     Running `target/debug/crdb-seed`
Dec 20 19:29:31.454 INFO Using existing CRDB seed tarball: `/dangerzone/omicron_tmp/crdb-base-dap/345df35a854b98fb9ca917bc1faf83f3d09ca36aa592165a23be654a8eec19c1.tar`
  SETUP PASS [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
       START             omicron-nexus::test_all integration_tests::certificates::test_silo_certificates

running 1 test
log file: /dangerzone/omicron_tmp/test_all-fc85fc284cf2c5cf-test_silo_certificates.18194.0.log
note: configured to log to "/dangerzone/omicron_tmp/test_all-fc85fc284cf2c5cf-test_silo_certificates.18194.0.log"
DB URL: postgresql://root@[::1]:46355/omicron?sslmode=disable
DB address: [::1]:46355
log file: /dangerzone/omicron_tmp/test_all-fc85fc284cf2c5cf-test_silo_certificates.18194.2.log
note: configured to log to "/dangerzone/omicron_tmp/test_all-fc85fc284cf2c5cf-test_silo_certificates.18194.2.log"
log file: /dangerzone/omicron_tmp/test_all-fc85fc284cf2c5cf-test_silo_certificates.18194.3.log
note: configured to log to "/dangerzone/omicron_tmp/test_all-fc85fc284cf2c5cf-test_silo_certificates.18194.3.log"
        SLOW [> 60.000s] omicron-nexus::test_all integration_tests::certificates::test_silo_certificates
test integration_tests::certificates::test_silo_certificates has been running for over 60 seconds
        SLOW [>120.000s] omicron-nexus::test_all integration_tests::certificates::test_silo_certificates
thread 'integration_tests::certificates::test_silo_certificates' panicked at nexus/tests/integration_tests/certificates.rs:730:13:
failed to connect to silo extra-silo-5's endpoint within timeout: timed out after 30.039779006s
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: "/dangerzone/omicron_tmp/.tmpaAJUVb"
        If you would like to access the database for debugging, run the following:

        # Run the database
        cargo xtask db-dev run --no-populate --store-dir "/dangerzone/omicron_tmp/.tmpaAJUVb/data"
        # Access the database. Note the port may change if you run multiple databases.
        cockroach sql --host=localhost:32221 --insecure
WARN: dropped ClickHouse process without cleaning it up first (there may still be a child process running (PID 18201) and a temporary directory leaked, /dangerzone/omicron_tmp/test_all-fc85fc284cf2c5cf-test_silo_certificates.18194.1-clickhouse-7vTOoL)
failed to clean up ClickHouse data dir:
- /dangerzone/omicron_tmp/test_all-fc85fc284cf2c5cf-test_silo_certificates.18194.1-clickhouse-7vTOoL: File exists (os error 17)
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /dangerzone/omicron_tmp/.tmpNQsxJO
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /dangerzone/omicron_tmp/.tmp0mQs0g
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /dangerzone/omicron_tmp/.tmpoAzNOr
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /dangerzone/omicron_tmp/.tmpBpQDX6
test integration_tests::certificates::test_silo_certificates ... FAILED

failures:

failures:
    integration_tests::certificates::test_silo_certificates

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 409 filtered out; finished in 160.76s

        FAIL [ 160.850s] omicron-nexus::test_all integration_tests::certificates::test_silo_certificates
   Canceling due to test failure
------------
     Summary [ 162.408s] 1 test run: 0 passed, 1 failed, 550 skipped
        FAIL [ 160.850s] omicron-nexus::test_all integration_tests::certificates::test_silo_certificates
warning: this repository recommends nextest version 0.9.86, but the current version is 0.9.77
info: update nextest with cargo nextest self update, or bypass check with --override-version-check
error: test run failed

The key bit is:

thread 'integration_tests::certificates::test_silo_certificates' panicked at nexus/tests/integration_tests/certificates.rs:730:13:
failed to connect to silo extra-silo-5's endpoint within timeout: timed out after 30.039779006s

That's what we'd expect given the previous behavior: some silos won't have their certificates loaded. I can also see from the log file that we hit the error case from the original issue:

$ looker -lerror < /dangerzone/omicron_tmp/test_all-fc85fc284cf2c5cf-test_silo_certificates.18194.0.log
...
19:30:56.978Z ERRO 7f7bb105-2a6f-407f-b01d-8632f04c46b6 (ServerContext): reading endpoints: expected at most 200 silos, but found at least 200.  TLS may not work on some Silos' external endpoints.
    background_task = external_endpoints
19:30:56.978Z ERRO 7f7bb105-2a6f-407f-b01d-8632f04c46b6 (ServerContext): reading endpoints: expected at most 200 certificates, but found at least 200.  TLS may not work on some Silos' external endpoints.
    background_task = external_endpoints

and I can also see these messages:

19:32:11.788Z WARN 7f7bb105-2a6f-407f-b01d-8632f04c46b6 (ServerContext): failed to resolve TLS certificate
    error = unrecognized server name: extra-silo-5.sys.oxide-dev.test
    server_name = extra-silo-5.sys.oxide-dev.test
19:32:11.788Z WARN 7f7bb105-2a6f-407f-b01d-8632f04c46b6 (dropshot_external): tls accept err: unexpected error: no server certificate chain resolved
    local_addr = 127.0.0.1:38366

That's all consistent with this silo's certificate simply not being loaded, which is exactly what we'd expect if we'd hit the 200-item limit prior to this fix.

I suspected this test might take a while, and indeed the output shows it took 143s to do this extra work (in the success case). I think that's too expensive to include in the regular test suite, so I've reverted the automated test. It really shouldn't be necessary, since Paginator is extensively tested elsewhere; I just wanted to be sure I was resolving the original problem.
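
For anyone unfamiliar with the pattern, here's a minimal, self-contained sketch of marker-based pagination, the same idea the fix applies. fetch_page() and the data below are made-up stand-ins for illustration, not omicron code:

// Illustrative sketch only: fetch_page() is a hypothetical stand-in for a
// datastore query that returns at most `limit` items, sorted by name,
// starting strictly after `marker` (assumes unique sort keys).
fn fetch_page(data: &[String], marker: Option<&String>, limit: usize) -> Vec<String> {
    data.iter()
        .filter(|name| marker.map_or(true, |m| name.as_str() > m.as_str()))
        .take(limit)
        .cloned()
        .collect()
}

fn main() {
    // Pretend this is the whole table: 250 silos, more than one page's worth.
    let data: Vec<String> = (0..250).map(|i| format!("silo-{i:03}")).collect();
    let limit = 200; // the per-query cap that previously truncated the results

    let mut all = Vec::new();
    let mut marker: Option<String> = None;
    loop {
        let batch = fetch_page(&data, marker.as_ref(), limit);
        // A short page means we've reached the end of the scan (the same
        // convention documented for DataPageParams::limit below).
        let done = batch.len() < limit;
        // The sort field of the last item seen becomes the next marker.
        marker = batch.last().cloned();
        all.extend(batch);
        if done {
            break;
        }
    }
    assert_eq!(all.len(), 250); // all 250 silos, not just the first 200
}

The actual change drives this kind of loop with the Paginator helper, as shown in the review discussion below.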

@davepacheco added this to the 13 milestone Dec 20, 2024
@davepacheco force-pushed the dap/paginate-endpoints branch from 34bfee1 to f138943 on December 20, 2024 19:33
@davepacheco marked this pull request as ready for review December 20, 2024 19:42
let batch = datastore
    .dns_zones_list(opctx, DnsGroup::External, &p.current_pagparams())
    .await?;
paginator = p.found_batch(&batch, &|z: &DnsZone| z.zone_name.clone());
Contributor

This all looks correct to me, but out of curiosity: why in the other two queries do we have an explicit PaginatedBy::Id(pageparams), and for this one we seem to be implicitly paginating by zone name? Is the use of z.zone_name here relying on implementation details of dns_zones_list()?

davepacheco (Collaborator, Author)

If I recall correctly: the sort of low-level parameter for pagination is DataPageParams, which can be parametrized by basically any marker type:

/// Parameters used to request a specific page of results when listing a
/// collection of objects
///
/// This is logically analogous to Dropshot's `PageSelector` (plus the limit from
/// Dropshot's `PaginationParams`). However, this type is HTTP-agnostic. More
/// importantly, by the time this struct is generated, we know the type of the
/// sort field and we can specialize `DataPageParams` to that type. This makes
/// it considerably simpler to implement the backend for most of our paginated
/// APIs.
///
/// `NameType` is the type of the field used to sort the returned values and it's
/// usually `Name`.
#[derive(Clone, Debug)]
pub struct DataPageParams<'a, NameType> {
    /// If present, this is the value of the sort field for the last object seen
    pub marker: Option<&'a NameType>,
    /// Whether the sort is in ascending order
    pub direction: PaginationOrder,
    /// This identifies how many results should be returned on this page.
    /// Backend implementations must provide this many results unless we're at
    /// the end of the scan. Dropshot assumes that if we provide fewer results
    /// than this number, then we're done with the scan.
    pub limit: NonZeroU32,
}

For collections that only support pagination by one thing, like dns_zones_list(), their datastore functions just accept DataPageParams:

pagparams: &DataPageParams<'_, String>,

For relatively few collections we support paginating by any one of a few different fields (well, it's always "id" or "name"). The caller gets to pick. The code is basically the same and these functions accept a PaginatedBy, which is just an enum with variants for by-id or by-name, and each variant just contains the appropriate DataPageParams type:

#[derive(Debug)]
pub enum PaginatedBy<'a> {
    Id(DataPageParams<'a, Uuid>),
    Name(DataPageParams<'a, Name>),
}

So I don't think this is relying on an implementation detail. It's just that dns_zones_list() only supports paginating by the zone name, so it accepts DataPageParams directly, while the other functions support pagination by either field and so need the wrapper that lets you pick which one you want (but ultimately you're still providing a DataPageParams, and that's what the code is using).


There's a bit of an implicit detail here which is that the caller of a paginated function needs to know which field of the object is the marker. (In practice this is not super easy to get wrong because the marker type is usually kind of unique -- a uuid or a Name -- though you could pick a different field that's also a uuid or something.) This applies to all three queries though and every other use of Paginator, too (it's encoded in the closure provided to found_batch()).
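
To make the marker mechanics concrete, here's a small self-contained sketch. The types are simplified stand-ins for the ones quoted above, not the real definitions; the point is that the closure's return value for the last item in a batch becomes the marker in the next page's DataPageParams:

use std::num::NonZeroU32;

// Simplified stand-ins for the types quoted above.
#[allow(dead_code)]
enum PaginationOrder {
    Ascending,
    Descending,
}

#[allow(dead_code)]
struct DataPageParams<'a, NameType> {
    marker: Option<&'a NameType>,
    direction: PaginationOrder,
    limit: NonZeroU32,
}

struct DnsZone {
    zone_name: String,
}

fn main() {
    let batch = vec![
        DnsZone { zone_name: "oxide-dev.test".to_string() },
        DnsZone { zone_name: "oxide.example".to_string() },
    ];

    // The same shape of closure passed to found_batch() above: it must return
    // the field the underlying query sorts by -- here, the zone name.
    let marker_fn = |z: &DnsZone| z.zone_name.clone();
    let next_marker = batch.last().map(marker_fn);

    // The next request picks up strictly after the last zone name seen.
    let next_page = DataPageParams {
        marker: next_marker.as_ref(),
        direction: PaginationOrder::Ascending,
        limit: NonZeroU32::new(200).unwrap(),
    };
    println!("next marker: {:?}, limit: {}", next_page.marker, next_page.limit);
}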

@davepacheco enabled auto-merge (squash) December 20, 2024 20:08
@davepacheco merged commit 9eca6bc into main Dec 20, 2024
16 checks passed
@davepacheco deleted the dap/paginate-endpoints branch December 20, 2024 21:37
Linked issue: use pagination to support more than 200 silos and TLS certs (#7289)