[oximeter] Port all single-node ClickHouse tests to a replicated setup #4372
Conversation
This seems like something we need to understand. The test implementation picks the right table here, and if there's an issue with either that function or the code, we should probably get to ground on it. It may help to use either the …
@@ -252,14 +254,15 @@ impl ClickHouseInstance {
        })?;
// There seems to be a bug using IPv6 and localhost with a replicated
// setup when installing all server and coordinator nodes on the same
// host. For this reason we will be using IPv4 for testing.
I can't comment on line 242 directly, but I think it's important to understand why we can't start multiple servers using IPv6. That sounds like either some issue with the configuration that we'll need to be explicitly aware of, or possibly an upstream ClickHouse bug we will want to file.
I'll file a bug report in the ClickHouse repo :)
@@ -32,7 +32,7 @@ pub struct ClickHouseInstance {
     // The HTTP port the server is listening on
     port: u16,
     // The address the server is listening on
-    pub address: Option<SocketAddr>,
+    pub address: SocketAddr,
I'm not sure I understand this change. Does this mean we no longer allow the server(s) to pick a port at any time?
This change doesn't have to do with the ports themselves; I had this as an `Option` before because I wasn't returning the address from the keeper nodes. Going over the code again, I realised there was no point in doing that, and that having the address of the keeper nodes could be useful in future tests.
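For context, here is a minimal sketch of the shape of this change. The struct fields and their comments come from the diff above, but the accessor and the surrounding code are assumptions for illustration, not the actual omicron implementation:

```rust
use std::net::SocketAddr;

pub struct ClickHouseInstance {
    // The HTTP port the server is listening on
    port: u16,
    // The address the server is listening on. Previously this was
    // `Option<SocketAddr>` because keeper nodes never reported an
    // address; now every node (server or keeper) records one.
    pub address: SocketAddr,
}

impl ClickHouseInstance {
    // Hypothetical accessor: callers no longer need to unwrap an
    // `Option` or handle an "address not yet known" case.
    pub fn address(&self) -> SocketAddr {
        self.address
    }
}
```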
Thanks for the review @bnaecker 🙇♀️ I think I've addressed all of your comments, let me know what you think!
My point wasn’t that we need to file a bug, but that we need to understand the behavior. It may indeed be a bug, but I would be a bit surprised that the server programs would care about the specific addresses they’re using, as long as they can listen on them.
Can you describe exactly what behavior you see? How did you run the data and keeper servers, and what was the exact error you saw? If we do file a bug against ClickHouse then they’ll want that information as well, so I think it’ll be time well spent.
@bnaecker After looking into this for a while, I realised ClickHouse servers have a small quirk: when the keeper hosts are set as IPv6 localhost addresses in the replica configuration file, they must be wrapped in square brackets (only the keeper hosts; the other hosts are set without square brackets 🤷♀️). Otherwise, running any query produces a "Service not found" error.
It's really odd; everywhere else you use IPv6 addresses without brackets. See https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings#interserver-listen-host and https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings#listen_host
@karencfv Thank you for tracking that down. It looks to me like this is a ClickHouse bug. On this line, we can see that the hostname for the keeper server (a string) is concatenated directly with the port converted to a string. So for an IPv6 localhost address provided as a numeric value, this would result in a socket address like `::1:<port>`, which cannot be parsed unambiguously, since the trailing `:<port>` reads like just another group of the address. This is good to have discovered. We need to make sure to provide the keeper addresses either as symbolic hostnames for the server to resolve, or surround them with brackets ourselves. I will also file a bug against ClickHouse, since this could certainly be fixed internally.
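To make the failure mode concrete, here is a small Rust sketch of why the unbracketed concatenation cannot be parsed. It uses the standard library's parser rather than ClickHouse's, and `9181` only as a representative keeper port:

```rust
use std::net::{Ipv6Addr, SocketAddr, SocketAddrV6};

fn main() {
    let host = Ipv6Addr::LOCALHOST.to_string(); // "::1"
    let port = 9181_u16; // representative keeper port

    // Naive string concatenation of host and port: the result is
    // ambiguous, because the final ":9181" looks like just another
    // group of the IPv6 address, so parsing fails.
    let naive = format!("{host}:{port}");
    assert_eq!(naive, "::1:9181");
    assert!(naive.parse::<SocketAddr>().is_err());

    // Bracketing the host removes the ambiguity. SocketAddr's Display
    // impl does this automatically for IPv6 addresses.
    let addr = SocketAddr::V6(SocketAddrV6::new(Ipv6Addr::LOCALHOST, port, 0, 0));
    assert_eq!(addr.to_string(), "[::1]:9181");
    assert!("[::1]:9181".parse::<SocketAddr>().is_ok());
}
```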
Thanks @bnaecker! I've already made the changes in this commit df652c8. If everything looks good to you, could I get an approval so I can get this merged? In the SMF manifest we use the addresses retrieved from dnswait, so we're all good there: https://github.com/oxidecomputer/omicron/blob/main/smf/clickhouse/method_script.sh#L72-L73
This is a follow-up to #4149. In this commit, all single-node tests are now run against a clustered setup. Additionally, a few bugs in the replicated init SQL file have been fixed.
A few things to note:
All of the individual field and measurement tests now run under a single test. The reason for this is test times: each test needs to wipe out its database before the next test runs, and in a replicated installation that means all nodes must sync, which takes time. Before I merged the field/measurement tests into a single one, the testing time for a replicated setup was about 10 minutes, which really is too much. Now it takes ~3.5 minutes. That is still a lot, but it's only a temporary measure until we can run these tests in parallel again (see the sketch after these notes for the pattern). An unintended bonus of this approach is that running the tests this way also verifies that the data is being inserted into the correct table.
There is a strange behaviour that occurs when running the `recall_measurement_f32_test` test. This test is unique in that it has to insert an f64 value into the `oximeter.measurements_f32` table. For some strange reason the data is inserted into the `oximeter.measurements_f64` table as well. I honestly could not find out why. It seems to be some strange ClickHouse quirk, so I have disabled the test temporarily. I tweaked the test to return 2 rows instead of one, and these are the results. As you can see, an additional row is inserted into `oximeter.measurements_f64` at the same time the `recall_measurement_f32_test` test is run.

Output from the log:

```
{"msg":"executing SQL query: INSERT INTO oximeter.measurements_f32 FORMAT JSONEachRow {"datum":1.1,"timeseries_key":101,"timeseries_name":"foo:bar","timestamp":"2023-10-27 00:52:35.601821001"}","v":0,"name":"test_recall_of_all_fields","level":10,"time":"2023-10-27T00:52:35.602139961Z","hostname":"pop-os","pid":846472,"id":"47b615cf-5e40-4ff8-8700-a48cc97a4830","component":"clickhouse-client"}
{"msg":"executing SQL query: SELECT * FROM oximeter.measurements_f32 LIMIT 2 FORMAT JSONEachRow;","v":0,"name":"test_recall_of_all_fields","level":10,"time":"2023-10-27T00:52:35.613187591Z","hostname":"pop-os","pid":846472,"id":"47b615cf-5e40-4ff8-8700-a48cc97a4830","component":"clickhouse-client"}
{"msg":"executing SQL query: INSERT INTO oximeter.measurements_f64 FORMAT JSONEachRow {"datum":1.1,"timeseries_key":101,"timeseries_name":"foo:bar","timestamp":"2023-10-27 00:52:35.614849622"}","v":0,"name":"test_recall_of_all_fields","level":10,"time":"2023-10-27T00:52:35.614945438Z","hostname":"pop-os","pid":846472,"id":"47b615cf-5e40-4ff8-8700-a48cc97a4830","component":"clickhouse-client"}
{"msg":"executing SQL query: SELECT * FROM oximeter.measurements_f64 LIMIT 2 FORMAT JSONEachRow;","v":0,"name":"test_recall_of_all_fields","level":10,"time":"2023-10-27T00:52:35.624995473Z","hostname":"pop-os","pid":846472,"id":"47b615cf-5e40-4ff8-8700-a48cc97a4830","component":"clickhouse-client"}
```

Test output:

```
Object {"datum": Number(1), "timeseries_key": Number(101), "timeseries_name": String("foo:bar"), "timestamp": String("2023-10-27 00:52:35.588811834")}
Object {"datum": Number(1), "timeseries_key": Number(101), "timeseries_name": String("foo:bar"), "timestamp": String("2023-10-27 00:52:35.588811834")}
{"timeseries_name":"foo:bar","timeseries_key":101,"timestamp":"2023-10-27 00:52:35.601821001","datum":1.1}
Object {"datum": Number(1.1), "timeseries_key": Number(101), "timeseries_name": String("foo:bar"), "timestamp": String("2023-10-27 00:52:35.601821001")}
Object {"datum": Number(1.1), "timeseries_key": Number(101), "timeseries_name": String("foo:bar"), "timestamp": String("2023-10-27 00:52:35.601821001")}
{"timeseries_name":"foo:bar","timeseries_key":101,"timestamp":"2023-10-27 00:52:35.601821001","datum":1.100000023841858}
{"timeseries_name":"foo:bar","timeseries_key":101,"timestamp":"2023-10-27 00:52:35.614849622","datum":1.1}
```

UPDATE: The tiniest of typos in the init SQL file was to blame for the above bug.
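For reference, here is a minimal sketch of the test-consolidation pattern described in the first note above. Every name in it is hypothetical, not the actual oximeter test code:

```rust
// Hypothetical stand-in for the real oximeter ClickHouse client.
struct Client;

impl Client {
    fn insert_and_recall(&self, table: &str) {
        // In the real test: insert a sample, SELECT it back from
        // `table`, and assert the row round-trips exactly. Checking
        // the named table also catches data landing in the wrong one.
        println!("checking round-trip through {table}");
    }

    fn wipe(&self) {
        // In a replicated setup this is the slow part: the wipe must
        // propagate to every replica before the next case can run.
        println!("wiping database and waiting for replicas to sync");
    }
}

#[test]
fn test_recall_of_all_fields() {
    // One test drives every field/measurement case in sequence, so
    // cluster setup and teardown happen only once instead of per test.
    let client = Client;
    for table in ["measurements_bool", "measurements_f32", "measurements_f64"] {
        client.insert_and_recall(table);
        client.wipe();
    }
}
```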
Related: #4148
Related: #4001