Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[oximeter] Port all single node ClickHouse tests to a replicated set up #4372

Merged
merged 13 commits into from
Oct 31, 2023

Conversation

karencfv
Copy link
Contributor

@karencfv karencfv commented Oct 27, 2023

This is a follow up to #4149 . In this commit all single node tests are now run against a clustered set up. Additionally, a few bugs on the replicated init SQL file have been fixed.

A few things to note:

  • All of the individual field and measurement tests all run under a single test now. The reason for this is test times. Each test needs to wipe out it's database before the next test runs. In a replicated installation, this means that all nodes must sync, which takes time. Before I merged all the field/measurement tests into a single one, the testing time for a replicated set up took about 10 minutes, which really is too much. Now it takes ~3.5 minutes. It is still a lot, but only a temporary measure until we can run these tests in parallel again. An unintended bonus of this approach is that running the tests this way means that you are testing that the data is being inserted in the correct table.

  • There is a strange behaviour that occurs when running the recall_measurement_f32_test test. This test is unique in the fact that it has to insert an f64 value to the oximeter.measurements_f32 table. For some strange reason the data is inserted into the oximeter.measurements_f64 table as well. I honestly could not find out why. It seems to be some strange ClickHouse quirk, so I have disabled the test temporarily.

    I tweaked the test to return 2 rows instead of one and these are the results. As you can see, an additional row is inserted into the oximeter.measurements_f64 at the same time the recall_measurement_f32_test test is run.

    Output from the log:

    {"msg":"executing SQL query: INSERT INTO oximeter.measurements_f32 FORMAT JSONEachRow {"datum":1.1,"timeseries_key":101,"timeseries_name":"foo:bar","timestamp":"2023-10-27 00:52:35.601821001"}","v":0,"name":"test_recall_of_all_fields","level":10,"time":"2023-10-27T00:52:35.602139961Z","hostname":"pop-os","pid":846472,"id":"47b615cf-5e40-4ff8-8700-a48cc97a4830","component":"clickhouse-client"}
    {"msg":"executing SQL query: SELECT * FROM oximeter.measurements_f32 LIMIT 2 FORMAT JSONEachRow;","v":0,"name":"test_recall_of_all_fields","level":10,"time":"2023-10-27T00:52:35.613187591Z","hostname":"pop-os","pid":846472,"id":"47b615cf-5e40-4ff8-8700-a48cc97a4830","component":"clickhouse-client"}
    {"msg":"executing SQL query: INSERT INTO oximeter.measurements_f64 FORMAT JSONEachRow {"datum":1.1,"timeseries_key":101,"timeseries_name":"foo:bar","timestamp":"2023-10-27 00:52:35.614849622"}","v":0,"name":"test_recall_of_all_fields","level":10,"time":"2023-10-27T00:52:35.614945438Z","hostname":"pop-os","pid":846472,"id":"47b615cf-5e40-4ff8-8700-a48cc97a4830","component":"clickhouse-client"}
    {"msg":"executing SQL query: SELECT * FROM oximeter.measurements_f64 LIMIT 2 FORMAT JSONEachRow;","v":0,"name":"test_recall_of_all_fields","level":10,"time":"2023-10-27T00:52:35.624995473Z","hostname":"pop-os","pid":846472,"id":"47b615cf-5e40-4ff8-8700-a48cc97a4830","component":"clickhouse-client"}

    Test output:
    Object {"datum": Number(1), "timeseries_key": Number(101), "timeseries_name": String("foo:bar"), "timestamp": String("2023-10-27 00:52:35.588811834")}
    Object {"datum": Number(1), "timeseries_key": Number(101), "timeseries_name": String("foo:bar"), "timestamp": String("2023-10-27 00:52:35.588811834")}
    {"timeseries_name":"foo:bar","timeseries_key":101,"timestamp":"2023-10-27 00:52:35.601821001","datum":1.1}

    Object {"datum": Number(1.1), "timeseries_key": Number(101), "timeseries_name": String("foo:bar"), "timestamp": String("2023-10-27 00:52:35.601821001")}
    Object {"datum": Number(1.1), "timeseries_key": Number(101), "timeseries_name": String("foo:bar"), "timestamp": String("2023-10-27 00:52:35.601821001")}
    {"timeseries_name":"foo:bar","timeseries_key":101,"timestamp":"2023-10-27 00:52:35.601821001","datum":1.100000023841858}
    {"timeseries_name":"foo:bar","timeseries_key":101,"timestamp":"2023-10-27 00:52:35.614849622","datum":1.1}

UPDATE: Tyniest of typos in the init sql file was to blame for the above bug

Related: #4148
Related: #4001

@karencfv karencfv requested a review from bnaecker October 27, 2023 02:36
oximeter/db/src/client.rs Outdated Show resolved Hide resolved
@bnaecker
Copy link
Collaborator

bnaecker commented Oct 29, 2023

There is a strange behaviour that occurs when running the recall_measurement_f32_test test. This test is unique in the fact that it has to insert an f64 value to the oximeter.measurements_f32 table. For some strange reason the data is inserted into the oximeter.measurements_f64 table as well. I honestly could not find out why. It seems to be some strange ClickHouse quirk, so I have disabled the test temporarily.

This seems like something we need to understand. The test implementation picks the right table here, and if there's an issue with either that function or the code, we should probably get to ground on it.

It may help to use either the system.query_log table, or the DTrace probes we put in a while ago, to understand exactly which queries are running. I.e., if it's really the client inserting this datum into more than one table, or something else.

@@ -252,14 +254,15 @@ impl ClickHouseInstance {
})?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// There seems to be a bug using ipv6 and localhost with a replicated
// set up when installing all servers and coordinator nodes on the same
// server. For this reason we will be using ipv4 for testing.

I can't comment on the line 242 directly, but I think it's important to understand why we can't start multiple servers using IPv6. That sounds like either some issue with the configuration that we'll need to be explicitly aware of, or possibly an upstream ClickHouse bug we will want to file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll file a bug report in the ClickHouse repo :)

@@ -32,7 +32,7 @@ pub struct ClickHouseInstance {
// The HTTP port the server is listening on
port: u16,
// The address the server is listening on
pub address: Option<SocketAddr>,
pub address: SocketAddr,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this change. Does this mean we no longer allow the server(s) to pick a port at any time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change doesn't have to do with the ports themselves, I had this as an Option before because I wasn't returning the address from the keeper nodes. Going over the code again, I realised there was no point in me doing that and it was going to be potentially useful to have the address for keeper nodes in future tests.

test-utils/src/dev/clickhouse.rs Outdated Show resolved Hide resolved
test-utils/src/dev/clickhouse.rs Outdated Show resolved Hide resolved
oximeter/db/src/client.rs Outdated Show resolved Hide resolved
oximeter/db/src/client.rs Outdated Show resolved Hide resolved
oximeter/db/src/client.rs Outdated Show resolved Hide resolved
oximeter/db/src/client.rs Outdated Show resolved Hide resolved
oximeter/db/src/client.rs Show resolved Hide resolved
oximeter/db/src/client.rs Outdated Show resolved Hide resolved
@karencfv karencfv requested a review from bnaecker October 31, 2023 00:46
@karencfv
Copy link
Contributor Author

Thanks for the review @bnaecker 🙇‍♀️ I think I've addressed all of your comments, let me know what you think!

@bnaecker
Copy link
Collaborator

bnaecker commented Oct 31, 2023 via email

@karencfv
Copy link
Contributor Author

karencfv commented Oct 31, 2023

@bnaecker After looking into this for a while I realised ClickHouse servers have a small quirk, where when setting the keeper hosts as IPv6 localhost addresses in the replica configuration file, they must be wrapped in square brackets (only the keeper hosts, the other hosts are set without square brackets 🤷‍♀️ ). Otherwise, when running any query, a "Service not found" error appears.

@karencfv
Copy link
Contributor Author

@bnaecker
Copy link
Collaborator

@karencfv Thank you for tracking that down. It looks to me like this is a ClickHouse bug. On this line, we can see that the hostname for the keeper server (a string) is concatenated directly with the port converted to a string. So for an IPv6 localhost address, provided as a numeric value, this would result in a socket address like ::1:12345. That might be correct for a symbolic hostname, like localhost:12345, but for IPv6, which separates segments with :, one needs to surround it with brackets, so that we get [::1]:12345.

This is good to have discovered. We need to make sure to provide the keeper addresses as either symbolic hostnames for the server to resolve, or surround them with brackets ourselves. I will also file a bug against CH, since this could certainly be fixed internally.

@karencfv
Copy link
Contributor Author

karencfv commented Oct 31, 2023

This is good to have discovered. We need to make sure to provide the keeper addresses as either symbolic hostnames for the server to resolve, or surround them with brackets ourselves. I will also file a bug against CH, since this could certainly be fixed internally.

Thanks @bnaecker! I've already made the changes in this commit df652c8. If everything looks good to you could I get a an approval so I can get this merged?

On the smf manifest we use the addresses retrieved from dnswait, so all good https://github.com/oxidecomputer/omicron/blob/main/smf/clickhouse/method_script.sh#L72-L73

@karencfv karencfv merged commit 1cf6a0f into oxidecomputer:main Oct 31, 2023
15 checks passed
@karencfv karencfv deleted the replicated-testing branch October 31, 2023 21:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants