Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RandomnWithDistinctSleds region allocation strategy #3858

Merged
merged 13 commits into from
Oct 3, 2023

Conversation

faithanalog
Copy link
Contributor

@faithanalog faithanalog commented Aug 9, 2023

I am submitting this draft now, but I am not 100% sure that the configuration location I have selected for the allocation strategy (PackageConfig toml) is the correct one. It would work for getting stuff onto dogfood/prod, but I don't know how to use a different one when compiling a build to run locally for dev purposes. That's a problem, because as devs we're all mostly running "1-sled" setups, so region allocation would just Not Work for any of us.

If you know how to configure this differently for a local dev build, then maybe we need docs for that.

If that's just not in the cards, then maybe it should go into DeploymentConfig instead.

Either way, the core logic of this is fundamentally done. The tests are in there. The integration tests pass. And configuration works right now using PackageConfig, so it is in a state where someone other than me can adjust the way configuration works or add docs if necessary. I will be gone for about a week, so I invite anyone to make changes around that config if they want to do that to get this merged while I'm out.


PR #3650 introduced the Random region allocation strategy to allocate regions randomly across the rack. This expands on that with the addition of the RandomWithDistinctSleds region allocation strategy. This strategy is the same, but requires the 3 crucible regions be allocated on 3 different sleds to improve resiliency against a whole-sled failure.

The Random strategy still exists, and does not require 3 distinct sleds. This is useful in one-sled environments such as the integration tests, and lab setups. This PR adds the ability to configure the allocation strategy in the Nexus PackageConfig toml. Anyone running in a one-sled setup will need to configure that to one-sled mode (as is done for the integration test environment).

This also fixes a shortcoming of #3650 whereby multiple datasets on a single zpool could be selected. That fix applies to both the old Random strategy and the new RandomWithDistinctSleds strategy.

smf/nexus/config-partial.toml is configured for
RandomWithDistinctSleds, as that is what we want to use on prod.

As I mentioned, the integration tests are not using the distinct sleds allocation strategy. I attempted to add 2 extra sleds to the simulated environment but found that this broke more things than I had the understanding to fix in this PR. It would be nice in the future for the sim environment to have 3 sleds in it though, not just for this but for anything else that might have different behaviors in a multi-sled setup.

In the present, I have unit tests that verify the allocation behavior works correctly with cockroachdb, and we can try it out on dogfood.

Fixes #3702

@faithanalog faithanalog requested a review from smklein August 9, 2023 04:52
@faithanalog
Copy link
Contributor Author

faithanalog commented Aug 9, 2023

I can't find it in the logs but I am assuming that the deploy job is not particularly happy with the "requiring 3 sleds" situation given i think thats using the real config and not the integration test environment config

}

#[tokio::test]
/// Note that this test is currently non-deterministic. It can be made
/// deterministic by generating deterministic *dataset* Uuids. The sled and
/// pool IDs should not matter.
async fn test_region_allocation() {
async fn test_region_allocation_strat_random() {
let logctx = dev::test_setup_log("test_region_allocation");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: log doesn't match test name

/// It should always pick datasets where no two datasets are on the same
/// zpool and no two zpools are on the same sled.
async fn test_region_allocation_strat_random_with_distinct_sleds() {
let logctx = dev::test_setup_log("test_region_allocation");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: log doesn't match test name

@@ -798,8 +863,8 @@ mod test {
assert!(disk_zpools.insert(dataset.pool_id));

// Must be 3 unique sleds
// TODO: When allocation chooses 3 distinct sleds, uncomment this.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay!

/// Ensure the [`RegionAllocationStrategy::RandomWithDistinctSleds`]
/// strategy fails when there aren't enough distinct sleds.
async fn test_region_allocation_strat_random_with_distinct_sleds_fails() {
let logctx = dev::test_setup_log("test_region_allocation");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit log name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for catching these, i hadnt even noticed the log initialization was taking this string at all

let seed_bytes = seed.to_le_bytes();

let query: Box<dyn CteQuery<SqlType = candidate_zpools::SqlType>> =
Box::new(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(unrelated to this specific spot) I just wanna say, thanks again for going so deep into making this CTE magic work. This is a non-trivial query, and I like the way you've done it. Great work.


[default_region_allocation_strategy]
# we only have one sled in the test environment, so we need to use the
# `Random` strategy, instead of `RandomWithDistinctSleds`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For most of our tests, I think this makes sense.

[default_region_allocation_strategy]
# by default, allocate across 3 distinct sleds
# seed is omitted so a new seed will be chosen with every allocation.
type = "random_with_distinct_sleds"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I chatted with @iliana about this earlier, and here's the TL;DR of our conversation:

  • Most of the "non-deploy" tests work using the config from nexus/tests/config.test.toml
  • The "real prod config" is using this file (smf/nexus/config-partial.toml), and uses the distinct sled strategy, as we want

So the problem cases are the following:

  • Developers trying to run a single sled
  • The "deploy job", which typically runs a "non-gimlet" standalone sled (similar to the developer workflow)

These cases are painful, because we want some config options to be like prod (e.g., we use internal DNS! We have an underlay, kinda!) but others we don't want to look like our prod environment (most obviously: we're only running with a single sled).

iliana and I came to the conclusion that a couple ways to solve this would be the following:

Add a new target option to omicron-package (my slight preference)

  1. Add a new feature to the omicron-package target type named cluster, which can be either single-sled or multi-sled. This would involve updating package/src/lib.rs and package/src/target.rs. There is some overlap with the gimlet vs gimlet-standalone vs non-gimlet options, but I think it's probably worth just adding a new feature explicitly.
  2. Split the Nexus config files into two: what previously was smf/nexus/config-partial.toml becomes smf/nexus/multi-sled/config-partial.toml and smf/nexus/single-sled/config-partial.toml. We can provide a different region allocation strategy in each, through the PackageConfig.
  3. Update package-manifest.toml to reference the right smf subdirectory for Nexus, based on the cluster value (you can look at how we do this for the sled-agent target using the machine target).

Pros: Plumbs Nexus-specific options directly to Nexus
Cons: Possible to select confusing combinations of targets, like "gimlet-standalone" + "multi-sled"?

Try to overload an existing target?

  1. Use the non-gimlet and gimlet-standalone options, and "infer" that these are single-sled
  2. Pass an argument from sled agent to Nexus (via DeploymentConfig) when this is the case

Pros: One fewer target?
Cons: Prevents us from effectively having a multi-sled setup without real gimlets (I think this would break the Canada cluster), kinda confusing that a Nexus-level option is plumbed through the sled agent config

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you've sold me on adding a new target option to omicron-package i think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What cluster option will I want to pass if I want to set up a multi-sled dev cluster with exactly two sleds? It seems like the single-sled option should still work for now (though it makes me wonder if there's a clearer label we could put on it); is that correct?

@faithanalog faithanalog force-pushed the artemis/allocate-regions-across-sleds branch from abc7293 to 960489f Compare August 18, 2023 05:34
@faithanalog
Copy link
Contributor Author

faithanalog commented Aug 18, 2023

@smklein I named it rack-topology before realizing you had suggested cluster as a name. I can switch it to cluster if you'd like though- sorry about that. Anyways, if CI passes (I am writing this before it finishes), how do you like the shape of this change?

@faithanalog faithanalog marked this pull request as ready for review August 18, 2023 05:40
@faithanalog
Copy link
Contributor Author

oh boy, CI failures! ili and us tried these tests locally just running omicron-package and it seemed to be generating the right files in the output, but maybe we got the CI scripts wrong?

@iliana
Copy link
Contributor

iliana commented Aug 18, 2023

The nexus_config::test::test_repo_configs_are_valid test is failing, which may or may not explain why the deploy job is failing with:

SledAgent: failed to initialize services: Failed to initialize 3 zones:
      - oxz_nexus_9b1b8d22-21f8-4b93-b487-3fbd84fe42a8: Failed to create OPTE port for service nexus: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: MacExists { port: "opte115", vni: Vni { inner: 100 }, mac: MacAddr { inner: A8:40:25:FF:E4:EE } }
      - oxz_nexus_d64d9e21-fd6e-4eb0-94a6-c5e50421aa1e: Failed to create OPTE port for service nexus: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: MacExists { port: "opte114", vni: Vni { inner: 100 }, mac: MacAddr { inner: A8:40:25:FF:A5:B5 } }
      - oxz_nexus_d5e0ff69-9935-4733-a913-70d0eeb40f3e: Failed to create OPTE port for service nexus: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: MacExists { port: "opte116", vni: Vni { inner: 100 }, mac: MacAddr { inner: A8:40:25:FF:9A:04 } }
    file = sled-agent/src/http_entrypoints.rs:321
    sled_id = 5ed96990-c0b7-4d25-9d48-513b19d67f21

@faithanalog faithanalog force-pushed the artemis/allocate-regions-across-sleds branch from 9a5c954 to 2f41c2e Compare September 2, 2023 00:20
@faithanalog faithanalog force-pushed the artemis/allocate-regions-across-sleds branch from 215a0d5 to e266136 Compare September 17, 2023 05:27
@faithanalog faithanalog requested a review from smklein September 17, 2023 21:58
@faithanalog
Copy link
Contributor Author

finally CI is happy. but at what cost

Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies on the delay for this - LGTM, thanks for working through the config issues.

faithanalog and others added 11 commits September 29, 2023 22:18
PR #3650 introduced the Random region allocation strategy to allocate
regions randomly across the rack. This expands on that with the addition
of the RandomWithDistinctSleds region allocation strategy. This strategy
is the same, but requires the 3 crucible regions be allocated on 3
different sleds to improve resiliency against a whole-sled failure.

The Random strategy still exists, and does not require 3 distinct sleds.
This is useful in one-sled environments such as the integration tests,
and lab setups. This PR adds the ability to configure the allocation
strategy in the Nexus PackageConfig toml. Anyone running in a one-sled
setup will need to configure that to one-sled mode (as is done for the
integration test environment).

This also fixes a shortcoming of #3650 whereby multiple datasets on a
single zpool could be selected. That fix applies to both the old Random
strategy and the new RandomWithDistinctSleds strategy.

`smf/nexus/config-partial.toml` is configured for
RandomWithDistinctSleds, as that is what we want to use on prod.

As I mentioned, the integration tests are not using the distinct sleds
allocation strategy. I attempted to add 2 extra sleds to the simulated
environment but found that this broke more things than I had the
understanding to fix in this PR. It would be nice in the future for the
sim environment to have 3 sleds in it though, not just for this but for
anything else that might have different behaviors in a multi-sled setup.

In the present, I have unit tests that verify the allocation behavior
works correctly with cockroachdb, and we can try it out on dogfood.
This adds the rack-topology package feature with possible values of
single-sled or multi-sled. single-sled is intended for dev/CI
deployments, while multi-sled is intended for dogfood/prod. The value
of this determines which nexus config-partial.toml is packaged.

Right now the only difference single/multi is the crucible region
allocation strategy.
@faithanalog faithanalog force-pushed the artemis/allocate-regions-across-sleds branch from e266136 to e689108 Compare September 29, 2023 23:20
@faithanalog
Copy link
Contributor Author

alright ill merge this monday

@faithanalog faithanalog merged commit 6bc5e60 into main Oct 3, 2023
22 checks passed
@faithanalog faithanalog deleted the artemis/allocate-regions-across-sleds branch October 3, 2023 02:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allocate Crucible Datasets On Separate Sleds
4 participants