Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ClickHouse] Remove single node functionality #4148

Closed
5 tasks
karencfv opened this issue Sep 27, 2023 · 8 comments
Closed
5 tasks

[ClickHouse] Remove single node functionality #4148

karencfv opened this issue Sep 27, 2023 · 8 comments
Labels
database Related to database access Metrics oximeter
Milestone

Comments

@karencfv
Copy link
Contributor

karencfv commented Sep 27, 2023

Once replicated mode has been enabled, there will be no further use for single node ClickHouse. The init SQL is wildly different between replicated and single node mode, this means that any testing and/or development done on a single node will be unreliable on a replicated node setup.

Several parts of the codebase will need to be modified:

@karencfv karencfv added database Related to database access oximeter Metrics labels Sep 27, 2023
@karencfv karencfv added this to the Unscheduled milestone Sep 27, 2023
karencfv added a commit that referenced this issue Oct 26, 2023
## Structural changes

Instead of spinning up a ClickHouse server per test, now only two
servers are spun up to run all tests against. One for single node mode
and another for replicated testing. Even though it may look odd to this
this now, there is a good reason for this. ClickHouse servers that have
been configured to be part of a cluster cannot be spun up in the same
server as single node servers. This means that if we spin up several
servers, the replicated test becomes incredibly flaky. It's important to
note that this set up will not live forever. Once we can remove single
node mode and start the replicated ClickHouse cluster as part of the
build process, the replicated tests will be able to run independently.

All tests now run sequentially against a single server. For correctness,
after each test the database is wiped out to give us a clean slate for
testing. This only adds a few more seconds (<10s) to the overall testing
time. Again, this will not be the final state of things, but necessary
to be able to test alongside replicated clusters until we
[remove](#4148) single
node functionality.

## Caveat

Github does some really weird things with the diffs on
oximeter/db/src/client.rs. Frankly, it looks like a mess. I would
probably checkout the branch itself to see the final state of the tests.

## Follow up work

Given the size of this PR, I've not added all of the tests to run
against the replicated cluster. There will be a follow up PR that does
this.

Related: #4001
Closes: #4037
karencfv added a commit that referenced this issue Oct 31, 2023
…up (#4372)

This is a follow up to
#4149 . In this commit all
single node tests are now run against a clustered set up. Additionally,
a few bugs on the replicated init SQL file have been fixed.

Note:

All of the individual field and measurement tests all run under a
single test now. The reason for this is test times. Each test needs to
wipe out it's database before the next test runs. In a replicated
installation, this means that all nodes must sync, which takes time.
Before I merged all the field/measurement tests into a single one, the
testing time for a replicated set up took about 10 minutes, which really
is too much. Now it takes ~3.5 minutes. It is still a lot, but only a
temporary measure until we can run these tests in parallel again. An
unintended bonus of this approach is that running the tests this way
means that you are testing that the data is being inserted in the
correct table.

Related: #4148
Related: #4001
@karencfv karencfv self-assigned this Nov 1, 2023
karencfv added a commit that referenced this issue Nov 7, 2023
…4422)

New `--replicated` flag for the `omicron-dev ch-run` command. This will
allow us to spin up a ClickHouse replicated cluster containing 2
replicas and 3 keepers.

```console
$ cargo run --bin omicron-dev -- ch-run --replicated
    Finished dev [unoptimized + debuginfo] target(s) in 0.31s
     Running `target/debug/omicron-dev ch-run --replicated`
omicron-dev: running ClickHouse cluster with configuration files:
 replicas: /home/{user}/src/omicron/oximeter/db/src/configs/replica_config.xml
 keepers: /home/{user}/src/omicron/oximeter/db/src/configs/keeper_config.xml
omicron-dev: ClickHouse cluster is running with PIDs: 1113482, 1113681, 1113387, 1113451, 1113419
omicron-dev: ClickHouse HTTP servers listening on ports: 8123, 8124
omicron-dev: using /tmp/.tmpFH6v8h and /tmp/.tmpkUjDji for ClickHouse data storage
```

Related: #4148
@karencfv karencfv removed their assignment Nov 7, 2023
@karencfv
Copy link
Contributor Author

karencfv commented Nov 7, 2023

Unassigned myself from this as I have to focus on other work for the time being

@bnaecker
Copy link
Collaborator

I recommend that we close this as not planned. I believe we'll always want the ability to run single-node functionality for tests that interact with the oximeter database itself. Many of those tests do not care if their running against a replicated or single-node table, e.g., something that just inserts data and fetches it back out to verify serialization. For those tests which do care, we already have the ability to start a replicated cluster.

@karencfv
Copy link
Contributor Author

Heya @bnaecker! I raise this point in the "Open Questions" section of RFD 468. Do you mind writing your concerns there so we can discuss there and keep a record of why we made a choice over another?

@bnaecker
Copy link
Collaborator

What I'm talking about is orthogonal to how we deploy ClickHouse on production racks. I don't have much insight to add to whether we should remove deployment of single-node ClickHouse zones in those environments. You and @andrewjstone have thought much more deeply about that at this point that me!

I'm asking more about test and development environments. This issue seems to say we should remove the ability to run single-node ClickHouse completely, including in unit tests, integration tests, or other development situations. I disagree. I find it extremely valuable to quickly run tests that rely on an oximeter database, and spinning up a single-node version is many times faster and more reliable than a cluster. I also think it's a non-starter to completely serialize every single Nexus integration test, as we'd have to if we want to reliably run ClickHouse clusters during those tests.

I'd be fine keeping this issue open, if you'd prefer to refine it so that it refers to production environments, e.g., the deployment of single-node ClickHouse zones by the sled-agent's service manager.

@karencfv
Copy link
Contributor Author

Ah! gotcha. Yeah, I can see how spinning up an entire cluster to test a quick thing could be annoying. The only downside I see to keeping single node functionality is maintaining two different sets of SQL init files.

In the new clickward-based integration tests there's one where we only use a 1 keeper 1 server configuration. Could this be a good middle ground so we don't have to maintain two sets of SQL init files? This configuration starts up a lot faster that starting up a bunch of replicas and keepers, but I'm not sure if it's fast enough for what you need. WDYT?

@bnaecker
Copy link
Collaborator

In the new clickward-based integration tests there's one where we only use a 1 keeper 1 server configuration. Could this be a good middle ground so we don't have to maintain two sets of SQL init files?

Unless I'm missing something, that's not quite enough. That will still require us to completely serialize all the tests that deploy ClickHouse. The problem is the port numbers. I don't think we've found a way to start a cluster where the OS assigns the ports, specifically to the Keepers, right? Without that, there is no way to reliably start two clusters at once, since there will be random port conflicts that make the tests flaky.

@andrewjstone
Copy link
Contributor

andrewjstone commented Sep 18, 2024

In the new clickward-based integration tests there's one where we only use a 1 keeper 1 server configuration. Could this be a good middle ground so we don't have to maintain two sets of SQL init files?

Unless I'm missing something, that's not quite enough. That will still require us to completely serialize all the tests that deploy ClickHouse. The problem is the port numbers. I don't think we've found a way to start a cluster where the OS assigns the ports, specifically to the Keepers, right? Without that, there is no way to reliably start two clusters at once, since there will be random port conflicts that make the tests flaky.

Unfortunately this is true. I had high hopes that we would be able to get rid of single node clickhouse deployments in tests for the same reason as @karencfv. Configuring clickhouse clusters requires us to write out the config files, which requires us to choose the ports. There's no good way to do this without a TOCTTOU. In a4x2, Ry and co tried to reserve ports by squatting on them via OS allocation and then releasing them back to the pool so they could be explicitly used. But any process that gets in during that release can break this scheme, and the result is very intermittent flakes.

I'd probably recommend either changing this issue to mention it being for production only or closing it altogether as well. I'm not sure it buys us anything, since we know we are following the plan in RFD468 for production purposes. And if we figure out a way to only deploy clusters and run tests in parallel, we'll almost certainly do it regardless of if there's an issue. Otherwise, this issue will stay open and inactionable forever.

@karencfv
Copy link
Contributor Author

Of course! The ports, gargh. Yeah, ok, that makes sense. I think I'll close this then. When we need to keep track of the work for removing single node production, the item list will be different to what this issue has anyway. I guess it makes sense to close this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
database Related to database access Metrics oximeter
Projects
None yet
Development

No branches or pull requests

3 participants