RandomWithDistinctSleds region allocation strategy #3858

faithanalog · 2023-08-09T04:52:33Z

I am submitting this draft now, but I am not 100% sure that the configuration location I have selected for the allocation strategy (PackageConfig toml) is the correct one. It would work for getting stuff onto dogfood/prod, but I don't know how to use a different one when compiling a build to run locally for dev purposes. That's a problem, because as devs we're all mostly running "1-sled" setups, so region allocation would just Not Work for any of us.

If you know how to configure this differently for a local dev build, then maybe we need docs for that.

If that's just not in the cards, then maybe it should go into DeploymentConfig instead.

Either way, the core logic of this is fundamentally done. The tests are in there. The integration tests pass. And configuration works right now using PackageConfig, so it is in a state where someone other than me can adjust the way configuration works or add docs if necessary. I will be gone for about a week, so I invite anyone to make changes around that config if they want to do that to get this merged while I'm out.

PR #3650 introduced the Random region allocation strategy to allocate regions randomly across the rack. This expands on that with the addition of the RandomWithDistinctSleds region allocation strategy. This strategy is the same, but requires the 3 crucible regions be allocated on 3 different sleds to improve resiliency against a whole-sled failure.

The Random strategy still exists, and does not require 3 distinct sleds. This is useful in one-sled environments such as the integration tests, and lab setups. This PR adds the ability to configure the allocation strategy in the Nexus PackageConfig toml. Anyone running in a one-sled setup will need to configure that to one-sled mode (as is done for the integration test environment).

This also fixes a shortcoming of #3650 whereby multiple datasets on a single zpool could be selected. That fix applies to both the old Random strategy and the new RandomWithDistinctSleds strategy.
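To make the two sets of constraints concrete, here is a minimal in-memory sketch of what a valid allocation looks like under each strategy. It is not the PR's implementation, which lives in a SQL CTE; the `Dataset` struct, the integer IDs, the field-less enum variants, and `allocation_is_valid` are all hypothetical names chosen for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the strategy enum named in this PR
/// (the real variants may carry extra fields such as a seed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RegionAllocationStrategy {
    Random,
    RandomWithDistinctSleds,
}

/// A chosen dataset, identified here by illustrative integer IDs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Dataset {
    pool_id: u32,
    sled_id: u32,
}

/// Returns true if `chosen` satisfies the constraints of `strategy`.
fn allocation_is_valid(
    strategy: RegionAllocationStrategy,
    chosen: &[Dataset],
) -> bool {
    let mut pools = HashSet::new();
    let mut sleds = HashSet::new();
    for d in chosen {
        // Both strategies: never place two regions on the same zpool.
        if !pools.insert(d.pool_id) {
            return false;
        }
        sleds.insert(d.sled_id);
    }
    match strategy {
        RegionAllocationStrategy::Random => true,
        // Distinct-sleds strategy: every region lands on its own sled.
        RegionAllocationStrategy::RandomWithDistinctSleds => {
            sleds.len() == chosen.len()
        }
    }
}
```

Under this sketch, three datasets on three zpools but only two sleds would pass for `Random` and fail for `RandomWithDistinctSleds`.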

smf/nexus/config-partial.toml is configured for RandomWithDistinctSleds, as that is what we want to use on prod.

As I mentioned, the integration tests are not using the distinct sleds allocation strategy. I attempted to add 2 extra sleds to the simulated environment but found that this broke more things than I had the understanding to fix in this PR. It would be nice in the future for the sim environment to have 3 sleds in it though, not just for this but for anything else that might have different behaviors in a multi-sled setup.

In the present, I have unit tests that verify the allocation behavior works correctly with cockroachdb, and we can try it out on dogfood.

Fixes #3702

faithanalog · 2023-08-09T08:07:47Z

I can't find it in the logs, but I am assuming that the deploy job is not particularly happy with the "requiring 3 sleds" situation, given I think that's using the real config and not the integration test environment config.

smklein · 2023-08-11T01:16:57Z

nexus/db-queries/src/db/datastore/mod.rs

    }

    #[tokio::test]
    /// Note that this test is currently non-deterministic. It can be made
    /// deterministic by generating deterministic *dataset* Uuids. The sled and
    /// pool IDs should not matter.
-    async fn test_region_allocation() {
+    async fn test_region_allocation_strat_random() {
        let logctx = dev::test_setup_log("test_region_allocation");


Nit: log doesn't match test name

smklein · 2023-08-11T01:17:25Z

nexus/db-queries/src/db/datastore/mod.rs

+    /// It should always pick datasets where no two datasets are on the same
+    /// zpool and no two zpools are on the same sled.
+    async fn test_region_allocation_strat_random_with_distinct_sleds() {
+        let logctx = dev::test_setup_log("test_region_allocation");


Nit: log doesn't match test name

smklein · 2023-08-11T01:17:54Z

nexus/db-queries/src/db/datastore/mod.rs

@@ -798,8 +863,8 @@ mod test {
                assert!(disk_zpools.insert(dataset.pool_id));

                // Must be 3 unique sleds
-                // TODO: When allocation chooses 3 distinct sleds, uncomment this.


smklein · 2023-08-11T01:18:05Z

nexus/db-queries/src/db/datastore/mod.rs

+    /// Ensure the [`RegionAllocationStrategy::RandomWithDistinctSleds`]
+    /// strategy fails when there aren't enough distinct sleds.
+    async fn test_region_allocation_strat_random_with_distinct_sleds_fails() {
+        let logctx = dev::test_setup_log("test_region_allocation");


nit log name

faithanalog · 2023-08-16

thanks for catching these, I hadn't even noticed the log initialization was taking this string at all

smklein · 2023-08-11T01:24:38Z

nexus/db-queries/src/db/queries/region_allocation.rs

+            let seed_bytes = seed.to_le_bytes();
+
+            let query: Box<dyn CteQuery<SqlType = candidate_zpools::SqlType>> =
+                Box::new(


(unrelated to this specific spot) I just wanna say, thanks again for going so deep into making this CTE magic work. This is a non-trivial query, and I like the way you've done it. Great work.
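For readers following along: the snippet shows the seed being turned into little-endian bytes, but not how the query uses them. Below is a minimal sketch of one general way to get a repeatable pseudo-random order from a seed, by sorting candidates on a hash of the seed bytes and each ID. This is plain Rust rather than the CTE, and `seeded_order`, the `u64` IDs, and the `u128` seed type are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Orders candidate IDs by a hash of (seed bytes, id). The same seed
/// always yields the same order within one build of the program.
fn seeded_order(candidates: &[u64], seed: u128) -> Vec<u64> {
    let seed_bytes = seed.to_le_bytes();
    let mut ordered = candidates.to_vec();
    ordered.sort_by_key(|id| {
        // DefaultHasher::new() starts from fixed keys, so the result is
        // repeatable for a given seed and ID.
        let mut hasher = DefaultHasher::new();
        seed_bytes.hash(&mut hasher);
        id.hash(&mut hasher);
        hasher.finish()
    });
    ordered
}
```

Passing a fixed seed makes an allocation reproducible, which is handy in tests; choosing a fresh seed per call gives a different shuffle each time.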

smklein · 2023-08-11T01:25:15Z

nexus/tests/config.test.toml

+
+[default_region_allocation_strategy]
+# we only have one sled in the test environment, so we need to use the
+# `Random` strategy, instead of `RandomWithDistinctSleds`


For most of our tests, I think this makes sense.

smklein · 2023-08-11T01:36:48Z

smf/nexus/config-partial.toml

+[default_region_allocation_strategy]
+# by default, allocate across 3 distinct sleds
+# seed is omitted so a new seed will be chosen with every allocation.
+type = "random_with_distinct_sleds"


So I chatted with @iliana about this earlier, and here's the TL;DR of our conversation:

Most of the "non-deploy" tests work using the config from nexus/tests/config.test.toml

The "real prod config" is using this file (smf/nexus/config-partial.toml), and uses the distinct sled strategy, as we want

So the problem cases are the following:

Developers trying to run a single sled

The "deploy job", which typically runs a "non-gimlet" standalone sled (similar to the developer workflow)

These cases are painful, because we want some config options to be like prod (e.g., we use internal DNS! We have an underlay, kinda!) but others we don't want to look like our prod environment (most obviously: we're only running with a single sled).

iliana and I came to the conclusion that a couple ways to solve this would be the following:

Add a new target option to omicron-package (my slight preference)

Add a new feature to the omicron-package target type named cluster, which can be either single-sled or multi-sled. This would involve updating package/src/lib.rs and package/src/target.rs. There is some overlap with the gimlet vs gimlet-standalone vs non-gimlet options, but I think it's probably worth just adding a new feature explicitly.

Split the Nexus config files into two: what previously was smf/nexus/config-partial.toml becomes smf/nexus/multi-sled/config-partial.toml and smf/nexus/single-sled/config-partial.toml. We can provide a different region allocation strategy in each, through the PackageConfig.

Update package-manifest.toml to reference the right smf subdirectory for Nexus, based on the cluster value (you can look at how we do this for the sled-agent target using the machine target).

Pros: Plumbs Nexus-specific options directly to Nexus
Cons: Possible to select confusing combinations of targets, like "gimlet-standalone" + "multi-sled"?

Try to overload an existing target?

Use the non-gimlet and gimlet-standalone options, and "infer" that these are single-sled

Pass an argument from sled agent to Nexus (via DeploymentConfig) when this is the case

Pros: One fewer target?
Cons: Prevents us from effectively having a multi-sled setup without real gimlets (I think this would break the Canada cluster), kinda confusing that a Nexus-level option is plumbed through the sled agent config
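The first option above could be sketched roughly as follows: a `cluster` value selects which Nexus config subdirectory gets packaged. This is only an illustration of the idea, assuming the directory layout proposed in this comment; the `Cluster` enum and `nexus_config_path` are hypothetical names, not omicron-package APIs.

```rust
use std::path::PathBuf;

/// Hypothetical `cluster` value for the omicron-package target.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Cluster {
    SingleSled,
    MultiSled,
}

impl Cluster {
    /// Subdirectory name under smf/nexus/, following the proposed split.
    fn subdir(self) -> &'static str {
        match self {
            Cluster::SingleSled => "single-sled",
            Cluster::MultiSled => "multi-sled",
        }
    }
}

/// Picks the config-partial.toml that matches the cluster value.
fn nexus_config_path(cluster: Cluster) -> PathBuf {
    PathBuf::from("smf/nexus")
        .join(cluster.subdir())
        .join("config-partial.toml")
}
```

Keeping the choice in the package target means Nexus itself only ever sees one config file, which matches the "plumbs Nexus-specific options directly to Nexus" point.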

faithanalog · 2023-08-16

You've sold me on adding a new target option to omicron-package, I think.

gjcolombo · 2023-08-18

What cluster option will I want to pass if I want to set up a multi-sled dev cluster with exactly two sleds? It seems like the single-sled option should still work for now (though it makes me wonder if there's a clearer label we could put on it); is that correct?

faithanalog · 2023-08-18T05:38:07Z

@smklein I named it rack-topology before realizing you had suggested cluster as a name. I can switch it to cluster if you'd like, though. Sorry about that. Anyways, if CI passes (I am writing this before it finishes), how do you like the shape of this change?

faithanalog · 2023-08-18T09:10:51Z

oh boy, CI failures! ili and us tried these tests locally just running omicron-package and it seemed to be generating the right files in the output, but maybe we got the CI scripts wrong?

iliana · 2023-08-18T17:26:19Z

The nexus_config::test::test_repo_configs_are_valid test is failing, which may or may not explain why the deploy job is failing with:

SledAgent: failed to initialize services: Failed to initialize 3 zones:
      - oxz_nexus_9b1b8d22-21f8-4b93-b487-3fbd84fe42a8: Failed to create OPTE port for service nexus: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: MacExists { port: "opte115", vni: Vni { inner: 100 }, mac: MacAddr { inner: A8:40:25:FF:E4:EE } }
      - oxz_nexus_d64d9e21-fd6e-4eb0-94a6-c5e50421aa1e: Failed to create OPTE port for service nexus: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: MacExists { port: "opte114", vni: Vni { inner: 100 }, mac: MacAddr { inner: A8:40:25:FF:A5:B5 } }
      - oxz_nexus_d5e0ff69-9935-4733-a913-70d0eeb40f3e: Failed to create OPTE port for service nexus: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: MacExists { port: "opte116", vni: Vni { inner: 100 }, mac: MacAddr { inner: A8:40:25:FF:9A:04 } }
    file = sled-agent/src/http_entrypoints.rs:321
    sled_id = 5ed96990-c0b7-4d25-9d48-513b19d67f21

faithanalog · 2023-09-17T21:59:18Z

finally CI is happy. but at what cost

smklein

Apologies on the delay for this - LGTM, thanks for working through the config issues.

This adds the rack-topology package feature with possible values of single-sled or multi-sled. single-sled is intended for dev/CI deployments, while multi-sled is intended for dogfood/prod. The value of this determines which nexus config-partial.toml is packaged. Right now the only difference between single-sled and multi-sled is the crucible region allocation strategy.
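As a rough sketch of how the two values might be parsed and treated as required (one of the commits on this PR mentions making rack_topology required), the following uses only the standard library. The `RackTopology` enum, `require_rack_topology`, and the error strings are hypothetical and do not reflect the actual omicron-package code.

```rust
use std::str::FromStr;

/// Hypothetical representation of the rack-topology feature value.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RackTopology {
    SingleSled,
    MultiSled,
}

impl FromStr for RackTopology {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "single-sled" => Ok(RackTopology::SingleSled),
            "multi-sled" => Ok(RackTopology::MultiSled),
            other => Err(format!(
                "unknown rack-topology {other:?}; expected single-sled or multi-sled"
            )),
        }
    }
}

/// Treats the option as required: a missing value is an error rather
/// than silently falling back to a default.
fn require_rack_topology(value: Option<&str>) -> Result<RackTopology, String> {
    value
        .ok_or_else(|| "rack-topology is required".to_string())?
        .parse()
}
```

Failing loudly on a missing value avoids accidentally packaging the multi-sled config for a one-sled dev setup, where region allocation could not succeed.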

faithanalog · 2023-09-30T05:38:59Z

alright, I'll merge this Monday

faithanalog requested a review from smklein August 9, 2023 04:52

smklein reviewed Aug 11, 2023

faithanalog force-pushed the artemis/allocate-regions-across-sleds branch from abc7293 to 960489f Compare August 18, 2023 05:34

faithanalog marked this pull request as ready for review August 18, 2023 05:40

faithanalog force-pushed the artemis/allocate-regions-across-sleds branch from 9a5c954 to 2f41c2e Compare September 2, 2023 00:20

faithanalog force-pushed the artemis/allocate-regions-across-sleds branch from 215a0d5 to e266136 Compare September 17, 2023 05:27

faithanalog requested a review from smklein September 17, 2023 21:58

smklein approved these changes Sep 28, 2023

faithanalog and others added 11 commits September 29, 2023 22:18

fix log messages

a7aa896

fix test_repo_configs_are_valid test

82a5867

add a nice error message for you

61ae4ac

actually put the manifest.xml

e859e12

ensure newlines after nexus config-partial.toml

dc1893f

does... does this work?

75fa726

is this anything

3fa73a4

rebase

d9845ec

make rack_topology required to save me from myself

e689108

faithanalog force-pushed the artemis/allocate-regions-across-sleds branch from e266136 to e689108 Compare September 29, 2023 23:20

faithanalog added 2 commits September 29, 2023 23:34

fix the ci

ba1c53f

sorry lol

140f973

faithanalog merged commit 6bc5e60 into main Oct 3, 2023
22 checks passed

faithanalog deleted the artemis/allocate-regions-across-sleds branch October 3, 2023 02:29

askfongjojo mentioned this pull request Nov 28, 2024

Instance hitting ext4-fs read/write error after running non-disk I/O heavy workload for some time oxidecomputer/crucible#1572

Closed
