Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix phased startup of zones when sled agent starts up #4588

Merged
merged 4 commits into from
Dec 7, 2023
Merged

Conversation

davepacheco
Copy link
Collaborator

When sled agent starts up, it starts up the zones it's responsible for in a specific order:

  • internal DNS
  • NTP (starting boundary NTP zones implicitly requires that internal DNS be up)
  • (wait for time sync)
  • start everything else

At least, that's what the comments say, and that appears to be what it should do. But the first call to start internal DNS also starts the NTP zones today. This appears to work by accident because the DNS zones happen to get processed first.

This was discussed on #4466:
#4466 (comment)
#4466 (comment)

This change fixes that.

@@ -88,7 +88,7 @@ impl<T: Ledgerable> Ledger<T> {
match T::read_from(log, &path).await {
Ok(ledger) => ledgers.push(ledger),
Err(err) => {
error!(log, "Failed to read ledger: {err}"; "path" => %path)
debug!(log, "Failed to read ledger: {err}"; "path" => %path)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tiny change came along for the ride. This is not an error condition. It happens during normal operation.

Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@citrus-it
Copy link
Contributor

NTP (starting boundary NTP zones implicitly requires that internal DNS be up)

What's the dependency here, being able to configure the boundary services for external network access?
(Chrony itself is not dependent on internal DNS, hence the question).

@davepacheco
Copy link
Collaborator Author

NTP (starting boundary NTP zones implicitly requires that internal DNS be up)

What's the dependency here, being able to configure the boundary services for external network access? (Chrony itself is not dependent on internal DNS, hence the question).

It's a dependency in the sled agent code that starts the boundary NTP zones, not the zones themselves. The comment from the code explains:

// Initialize internal DNS only first: we need it to look up the
// boundary switch addresses. This dependency is implicit: when we call
// `ensure_all_omicron_zones` below, we eventually land in
// `opte_ports_needed()`, which for some service types (including Ntp
// but _not_ including InternalDns), we perform internal DNS lookups.

@citrus-it
Copy link
Contributor

It's a dependency in the sled agent code that starts the boundary NTP zones, not the zones themselves. The comment from the code explains:

Got it thanks, it's a dependency needed for being able to find boundary services.

@citrus-it citrus-it self-requested a review November 30, 2023 21:48
@davepacheco
Copy link
Collaborator Author

davepacheco commented Dec 7, 2023

Besides CI, I tested the tip of this branch (3320239) on madrid:

  • mupdate all sleds
  • deploy Omicron (svcs -Zxv clear, timestamps up to date, all sleds on the right version, expected zones present)
  • provision instance
  • ssh into instance
  • cold boot (uadmin 2 1 on all four sleds)
  • wait for it to come up (svcs -Zxv clear, timestamps up to date, expected zones present)
  • provision instance
  • ssh into instance

That all worked.

@@ -184,7 +184,7 @@ pub trait Ledgerable: DeserializeOwned + Serialize + Send + Sync {
err,
})
} else {
warn!(log, "No ledger in {path}");
info!(log, "No ledger in {path}");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also happens during normal operation prior to the ledgers being initially written (during initial setup, if nothing else) so I found it misleading to make these warnings.

@davepacheco davepacheco enabled auto-merge (squash) December 7, 2023 04:32
@davepacheco davepacheco merged commit 5594eab into main Dec 7, 2023
19 of 20 checks passed
@davepacheco davepacheco deleted the dap/cleanup branch December 7, 2023 05:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants