-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ClickHouse] Remove single node functionality #4148
Comments
## Structural changes Instead of spinning up a ClickHouse server per test, now only two servers are spun up to run all tests against. One for single node mode and another for replicated testing. Even though it may look odd to this this now, there is a good reason for this. ClickHouse servers that have been configured to be part of a cluster cannot be spun up in the same server as single node servers. This means that if we spin up several servers, the replicated test becomes incredibly flaky. It's important to note that this set up will not live forever. Once we can remove single node mode and start the replicated ClickHouse cluster as part of the build process, the replicated tests will be able to run independently. All tests now run sequentially against a single server. For correctness, after each test the database is wiped out to give us a clean slate for testing. This only adds a few more seconds (<10s) to the overall testing time. Again, this will not be the final state of things, but necessary to be able to test alongside replicated clusters until we [remove](#4148) single node functionality. ## Caveat Github does some really weird things with the diffs on oximeter/db/src/client.rs. Frankly, it looks like a mess. I would probably checkout the branch itself to see the final state of the tests. ## Follow up work Given the size of this PR, I've not added all of the tests to run against the replicated cluster. There will be a follow up PR that does this. Related: #4001 Closes: #4037
…up (#4372) This is a follow up to #4149 . In this commit all single node tests are now run against a clustered set up. Additionally, a few bugs on the replicated init SQL file have been fixed. Note: All of the individual field and measurement tests all run under a single test now. The reason for this is test times. Each test needs to wipe out it's database before the next test runs. In a replicated installation, this means that all nodes must sync, which takes time. Before I merged all the field/measurement tests into a single one, the testing time for a replicated set up took about 10 minutes, which really is too much. Now it takes ~3.5 minutes. It is still a lot, but only a temporary measure until we can run these tests in parallel again. An unintended bonus of this approach is that running the tests this way means that you are testing that the data is being inserted in the correct table. Related: #4148 Related: #4001
…4422) New `--replicated` flag for the `omicron-dev ch-run` command. This will allow us to spin up a ClickHouse replicated cluster containing 2 replicas and 3 keepers. ```console $ cargo run --bin omicron-dev -- ch-run --replicated Finished dev [unoptimized + debuginfo] target(s) in 0.31s Running `target/debug/omicron-dev ch-run --replicated` omicron-dev: running ClickHouse cluster with configuration files: replicas: /home/{user}/src/omicron/oximeter/db/src/configs/replica_config.xml keepers: /home/{user}/src/omicron/oximeter/db/src/configs/keeper_config.xml omicron-dev: ClickHouse cluster is running with PIDs: 1113482, 1113681, 1113387, 1113451, 1113419 omicron-dev: ClickHouse HTTP servers listening on ports: 8123, 8124 omicron-dev: using /tmp/.tmpFH6v8h and /tmp/.tmpkUjDji for ClickHouse data storage ``` Related: #4148
Unassigned myself from this as I have to focus on other work for the time being |
I recommend that we close this as not planned. I believe we'll always want the ability to run single-node functionality for tests that interact with the |
What I'm talking about is orthogonal to how we deploy ClickHouse on production racks. I don't have much insight to add to whether we should remove deployment of single-node ClickHouse zones in those environments. You and @andrewjstone have thought much more deeply about that at this point that me! I'm asking more about test and development environments. This issue seems to say we should remove the ability to run single-node ClickHouse completely, including in unit tests, integration tests, or other development situations. I disagree. I find it extremely valuable to quickly run tests that rely on an I'd be fine keeping this issue open, if you'd prefer to refine it so that it refers to production environments, e.g., the deployment of single-node ClickHouse zones by the sled-agent's service manager. |
Ah! gotcha. Yeah, I can see how spinning up an entire cluster to test a quick thing could be annoying. The only downside I see to keeping single node functionality is maintaining two different sets of SQL init files. In the new clickward-based integration tests there's one where we only use a 1 keeper 1 server configuration. Could this be a good middle ground so we don't have to maintain two sets of SQL init files? This configuration starts up a lot faster that starting up a bunch of replicas and keepers, but I'm not sure if it's fast enough for what you need. WDYT? |
Unless I'm missing something, that's not quite enough. That will still require us to completely serialize all the tests that deploy ClickHouse. The problem is the port numbers. I don't think we've found a way to start a cluster where the OS assigns the ports, specifically to the Keepers, right? Without that, there is no way to reliably start two clusters at once, since there will be random port conflicts that make the tests flaky. |
Unfortunately this is true. I had high hopes that we would be able to get rid of single node clickhouse deployments in tests for the same reason as @karencfv. Configuring clickhouse clusters requires us to write out the config files, which requires us to choose the ports. There's no good way to do this without a TOCTTOU. In I'd probably recommend either changing this issue to mention it being for production only or closing it altogether as well. I'm not sure it buys us anything, since we know we are following the plan in RFD468 for production purposes. And if we figure out a way to only deploy clusters and run tests in parallel, we'll almost certainly do it regardless of if there's an issue. Otherwise, this issue will stay open and inactionable forever. |
Of course! The ports, gargh. Yeah, ok, that makes sense. I think I'll close this then. When we need to keep track of the work for removing single node production, the item list will be different to what this issue has anyway. I guess it makes sense to close this one. |
Once replicated mode has been enabled, there will be no further use for single node ClickHouse. The init SQL is wildly different between replicated and single node mode, this means that any testing and/or development done on a single node will be unreliable on a replicated node setup.
Several parts of the codebase will need to be modified:
The text was updated successfully, but these errors were encountered: