[ClickHouse] Remove single node functionality #4148

karencfv · 2023-09-27T05:50:05Z

Once replicated mode has been enabled, there will be no further use for single node ClickHouse. The init SQL is wildly different between replicated and single node mode, this means that any testing and/or development done on a single node will be unreliable on a replicated node setup.

Several parts of the codebase will need to be modified:

Init SQL files
Unit tests - [oximeter] Port all single node ClickHouse tests to a replicated set up #4372
Integration tests
Simulated omicron - Implement ClickHouse replicated cluster option for simulated omicron #4422
Oximeter cli (OG oxdb)

## Structural changes Instead of spinning up a ClickHouse server per test, now only two servers are spun up to run all tests against. One for single node mode and another for replicated testing. Even though it may look odd to this this now, there is a good reason for this. ClickHouse servers that have been configured to be part of a cluster cannot be spun up in the same server as single node servers. This means that if we spin up several servers, the replicated test becomes incredibly flaky. It's important to note that this set up will not live forever. Once we can remove single node mode and start the replicated ClickHouse cluster as part of the build process, the replicated tests will be able to run independently. All tests now run sequentially against a single server. For correctness, after each test the database is wiped out to give us a clean slate for testing. This only adds a few more seconds (<10s) to the overall testing time. Again, this will not be the final state of things, but necessary to be able to test alongside replicated clusters until we [remove](#4148) single node functionality. ## Caveat Github does some really weird things with the diffs on oximeter/db/src/client.rs. Frankly, it looks like a mess. I would probably checkout the branch itself to see the final state of the tests. ## Follow up work Given the size of this PR, I've not added all of the tests to run against the replicated cluster. There will be a follow up PR that does this. Related: #4001 Closes: #4037

…up (#4372) This is a follow up to #4149 . In this commit all single node tests are now run against a clustered set up. Additionally, a few bugs on the replicated init SQL file have been fixed. Note: All of the individual field and measurement tests all run under a single test now. The reason for this is test times. Each test needs to wipe out it's database before the next test runs. In a replicated installation, this means that all nodes must sync, which takes time. Before I merged all the field/measurement tests into a single one, the testing time for a replicated set up took about 10 minutes, which really is too much. Now it takes ~3.5 minutes. It is still a lot, but only a temporary measure until we can run these tests in parallel again. An unintended bonus of this approach is that running the tests this way means that you are testing that the data is being inserted in the correct table. Related: #4148 Related: #4001

…4422) New `--replicated` flag for the `omicron-dev ch-run` command. This will allow us to spin up a ClickHouse replicated cluster containing 2 replicas and 3 keepers. ```console $ cargo run --bin omicron-dev -- ch-run --replicated Finished dev [unoptimized + debuginfo] target(s) in 0.31s Running `target/debug/omicron-dev ch-run --replicated` omicron-dev: running ClickHouse cluster with configuration files: replicas: /home/{user}/src/omicron/oximeter/db/src/configs/replica_config.xml keepers: /home/{user}/src/omicron/oximeter/db/src/configs/keeper_config.xml omicron-dev: ClickHouse cluster is running with PIDs: 1113482, 1113681, 1113387, 1113451, 1113419 omicron-dev: ClickHouse HTTP servers listening on ports: 8123, 8124 omicron-dev: using /tmp/.tmpFH6v8h and /tmp/.tmpkUjDji for ClickHouse data storage ``` Related: #4148

karencfv · 2023-11-07T23:38:19Z

Unassigned myself from this as I have to focus on other work for the time being

bnaecker · 2024-09-17T23:58:31Z

I recommend that we close this as not planned. I believe we'll always want the ability to run single-node functionality for tests that interact with the oximeter database itself. Many of those tests do not care if their running against a replicated or single-node table, e.g., something that just inserts data and fetches it back out to verify serialization. For those tests which do care, we already have the ability to start a replicated cluster.

karencfv · 2024-09-18T01:34:14Z

Heya @bnaecker! I raise this point in the "Open Questions" section of RFD 468. Do you mind writing your concerns there so we can discuss there and keep a record of why we made a choice over another?

bnaecker · 2024-09-18T01:56:44Z

What I'm talking about is orthogonal to how we deploy ClickHouse on production racks. I don't have much insight to add to whether we should remove deployment of single-node ClickHouse zones in those environments. You and @andrewjstone have thought much more deeply about that at this point that me!

I'm asking more about test and development environments. This issue seems to say we should remove the ability to run single-node ClickHouse completely, including in unit tests, integration tests, or other development situations. I disagree. I find it extremely valuable to quickly run tests that rely on an oximeter database, and spinning up a single-node version is many times faster and more reliable than a cluster. I also think it's a non-starter to completely serialize every single Nexus integration test, as we'd have to if we want to reliably run ClickHouse clusters during those tests.

I'd be fine keeping this issue open, if you'd prefer to refine it so that it refers to production environments, e.g., the deployment of single-node ClickHouse zones by the sled-agent's service manager.

karencfv · 2024-09-18T08:02:00Z

Ah! gotcha. Yeah, I can see how spinning up an entire cluster to test a quick thing could be annoying. The only downside I see to keeping single node functionality is maintaining two different sets of SQL init files.

In the new clickward-based integration tests there's one where we only use a 1 keeper 1 server configuration. Could this be a good middle ground so we don't have to maintain two sets of SQL init files? This configuration starts up a lot faster that starting up a bunch of replicas and keepers, but I'm not sure if it's fast enough for what you need. WDYT?

bnaecker · 2024-09-18T14:32:13Z

In the new clickward-based integration tests there's one where we only use a 1 keeper 1 server configuration. Could this be a good middle ground so we don't have to maintain two sets of SQL init files?

Unless I'm missing something, that's not quite enough. That will still require us to completely serialize all the tests that deploy ClickHouse. The problem is the port numbers. I don't think we've found a way to start a cluster where the OS assigns the ports, specifically to the Keepers, right? Without that, there is no way to reliably start two clusters at once, since there will be random port conflicts that make the tests flaky.

andrewjstone · 2024-09-18T23:02:05Z

In the new clickward-based integration tests there's one where we only use a 1 keeper 1 server configuration. Could this be a good middle ground so we don't have to maintain two sets of SQL init files?

Unless I'm missing something, that's not quite enough. That will still require us to completely serialize all the tests that deploy ClickHouse. The problem is the port numbers. I don't think we've found a way to start a cluster where the OS assigns the ports, specifically to the Keepers, right? Without that, there is no way to reliably start two clusters at once, since there will be random port conflicts that make the tests flaky.

Unfortunately this is true. I had high hopes that we would be able to get rid of single node clickhouse deployments in tests for the same reason as @karencfv. Configuring clickhouse clusters requires us to write out the config files, which requires us to choose the ports. There's no good way to do this without a TOCTTOU. In a4x2, Ry and co tried to reserve ports by squatting on them via OS allocation and then releasing them back to the pool so they could be explicitly used. But any process that gets in during that release can break this scheme, and the result is very intermittent flakes.

I'd probably recommend either changing this issue to mention it being for production only or closing it altogether as well. I'm not sure it buys us anything, since we know we are following the plan in RFD468 for production purposes. And if we figure out a way to only deploy clusters and run tests in parallel, we'll almost certainly do it regardless of if there's an issue. Otherwise, this issue will stay open and inactionable forever.

karencfv · 2024-09-18T23:15:44Z

Of course! The ports, gargh. Yeah, ok, that makes sense. I think I'll close this then. When we need to keep track of the work for removing single node production, the item list will be different to what this issue has anyway. I guess it makes sense to close this one.

karencfv added database Related to database access oximeter Metrics labels Sep 27, 2023

karencfv added this to the Unscheduled milestone Sep 27, 2023

karencfv mentioned this issue Sep 27, 2023

Initial work for replicated ClickHouse testing #4149

Merged

karencfv mentioned this issue Oct 27, 2023

[oximeter] Port all single node ClickHouse tests to a replicated set up #4372

Merged

karencfv self-assigned this Nov 1, 2023

karencfv mentioned this issue Nov 3, 2023

Implement ClickHouse replicated cluster option for simulated omicron #4422

Merged

karencfv removed their assignment Nov 7, 2023

karencfv closed this as completed Sep 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ClickHouse] Remove single node functionality #4148

[ClickHouse] Remove single node functionality #4148

karencfv commented Sep 27, 2023 •

edited

Loading

karencfv commented Nov 7, 2023

bnaecker commented Sep 17, 2024

karencfv commented Sep 18, 2024

bnaecker commented Sep 18, 2024

karencfv commented Sep 18, 2024

bnaecker commented Sep 18, 2024

andrewjstone commented Sep 18, 2024 •

edited

Loading

karencfv commented Sep 18, 2024

[ClickHouse] Remove single node functionality #4148

[ClickHouse] Remove single node functionality #4148

Comments

karencfv commented Sep 27, 2023 • edited Loading

karencfv commented Nov 7, 2023

bnaecker commented Sep 17, 2024

karencfv commented Sep 18, 2024

bnaecker commented Sep 18, 2024

karencfv commented Sep 18, 2024

bnaecker commented Sep 18, 2024

andrewjstone commented Sep 18, 2024 • edited Loading

karencfv commented Sep 18, 2024

karencfv commented Sep 27, 2023 •

edited

Loading

andrewjstone commented Sep 18, 2024 •

edited

Loading