Some tests in omicron-nexus::test_all stall out and hang #4779
I have what feels like a similar issue, but with different tests hanging. In this run, the following tests hung.
Here is a link to the CI job. In this run, the illumos integration tests passed, but the Linux ones did not. This also comes from a commit that only changed an unrelated buildomat script, and the previous commit on that branch passed CI on the Linux build-and-test job.
Just observed this failure here: https://github.com/oxidecomputer/omicron/pull/4783/checks?check_run_id=20286750269
Reading what's going on from just the Nexus log is getting difficult: the rendered log is 25,000 lines long, of which ~20,000 are from
#4780 should hopefully land very soon, just waiting on the deploy job.
We're seeing some tests in `test_all` hang forever (#4779). Set a reasonable upper bound on test duration. This will also cause stdout and stderr for failing tests to be printed. Doing so on SIGTERM in general is tracked at nextest-rs/nextest#1208. Also, bump up the required nextest version to 0.9.64 to make use of the `binary_id` predicate.
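For reference, a configuration along these lines in `.config/nextest.toml` expresses both changes; the profile name and timeout values here are illustrative assumptions, not necessarily what omicron uses:

```toml
# Require a nextest recent enough to support the `binary_id` filterset predicate.
nextest-version = "0.9.64"

[profile.ci]
# Mark a test as slow after 60 seconds and kill it after 15 periods (15 minutes),
# so a hung test fails and has its stdout/stderr printed instead of stalling CI.
slow-timeout = { period = "60s", terminate-after = 15 }
```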
Couple more cases that just showed up: #4783
(These also look wholly unrelated to the PR, but show the same symptom. Something must have merged recently that causes this timeout to trigger much more frequently across such a broad set of tests.)
Here's a post-#4780 failure: https://github.com/oxidecomputer/omicron/runs/20299512003 Link to failing stdout/stderr. Sadly, this doesn't appear to be helpful.
While looking at the logs I found this interesting issue:
The port number seems suspiciously low -- it's definitely below 1024, which is strange. Wondering if #4755 is responsible. I'm not sure exactly why, but looking at commits on main, it looks like
That's an interesting correlation, and I also don't see why that commit would cause it. For background, ClickHouse is started with a port of 0; it gets some available one from the OS; and then we fish the actual port it bound out of the log file. That machinery has all worked pretty well for a while now, but it's possible that moving the log files around introduced some problem. It also seems possible to get a port < 1024 when running with elevated privileges, so perhaps this is a red herring if the tests are run that way.
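As a rough sketch of that machinery (the function name and the "Listening for" log-line format are assumptions for illustration, not the actual omicron code):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;

/// Hypothetical helper: find the HTTP port the server actually bound (after
/// being started on port 0) by scanning its log file.
fn discover_bound_port(log_path: &Path) -> Option<u16> {
    let file = File::open(log_path).ok()?;
    for line in BufReader::new(file).lines() {
        let line = line.ok()?;
        // e.g. "... Application: Listening for http://[::1]:37041" (assumed format)
        if let Some(rest) = line.split("Listening for http://").nth(1) {
            if let Some(port) = rest.rsplit(':').next()?.trim().parse().ok() {
                return Some(port);
            }
        }
    }
    None
}
```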
We sadly don't have logs for ClickHouse itself as part of the failed test uploads -- we should probably work on uploading them.
Hmm, I'm not sure why we don't have those ClickHouse log files. It sure looks like
I don't think that's it. Buildomat uses
It looks like it's not always a port below 1024 -- in this example it was port 3981. We also don't have success logs to confirm that the ClickHouse issue is actually correlated. However, looking at the code which generates this warning, it does seem like it would get stuck here. I believe it gets invoked via
This version of nextest has a fix for nextest-rs/nextest#1208, which we encountered while attempting to diagnose #4779.
In #4779 we're tracking what appears to be a ClickHouse initialization failure during Nexus startup. Set a timeout of 60s for each step in the initialization process. These steps should usually not take more than 5 seconds each, so 60s is a really comfortable buffer.
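A minimal sketch of that pattern, assuming tokio and anyhow are in use; the step shown is a hypothetical stand-in for the real initialization steps:

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Wrap a single initialization step in a 60-second timeout so a hung step
/// surfaces as an error rather than stalling the whole test.
async fn clickhouse_init_step() -> anyhow::Result<()> {
    timeout(Duration::from_secs(60), wait_for_ready())
        .await
        .map_err(|_| anyhow::anyhow!("ClickHouse initialization step timed out after 60s"))??;
    Ok(())
}

/// Hypothetical stand-in for one real step, e.g. waiting for the server to
/// report that it is ready for connections.
async fn wait_for_ready() -> anyhow::Result<()> {
    Ok(())
}
```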
…xt (#878) While trying to diagnose oxidecomputer/omicron#4779, which only occurs in CI, I'm trying to collect ClickHouse logs. To do that we need a way to uniquely identify the ClickHouse logs per test. It is possible to derive the log prefix from `log_file_for_test`, but it seems cleaner to do it this way.
Part of trying to figure out #4779 -- we now create ClickHouse logs in a well-named directory which does not get cleaned up (and is therefore uploaded) at the end. Tested this by introducing a failure in a test, and then seeing the following files in `/tmp`:
```
% find test_all-c4d6fc05c1fc48de-test_local_users.3329283.1-clickhouse-I4Af08
test_all-c4d6fc05c1fc48de-test_local_users.3329283.1-clickhouse-I4Af08
test_all-c4d6fc05c1fc48de-test_local_users.3329283.1-clickhouse-I4Af08/log
test_all-c4d6fc05c1fc48de-test_local_users.3329283.1-clickhouse-I4Af08/clickhouse-server.log
test_all-c4d6fc05c1fc48de-test_local_users.3329283.1-clickhouse-I4Af08/clickhouse-server.errlog
```
So I haven't been able to reproduce the flakiness with #4796, in any of the 5 runs that happened on the PR, or on main (https://github.com/oxidecomputer/omicron/runs/20441805293). That is really weird -- in a well-behaved system, 4796 would not have any impact on this. So how could this happen? My current reasoning is: 4796 does two things.
If 4796 does not actually address the flakiness, we continue to dig, now with better logs. If 4796 does fix the flakiness, then it surely can't be 2, because that code only kicks in after the test has failed. So the only thing it can be is 1. What could 1 be caused by in practice? The best explanation I can imagine is something that removes those directories while the ClickHouse process is running. That may suggest a reason why #4755 caused the flakiness (it moved the cwd to inside the temp dir, and processes generally don't like their cwds being deleted). Assuming this hypothetical, what could this process be? I can imagine one of two things:
This is all really strange.
So it does look like 4796 fixed the flakiness...
We discussed this in the Control Plane sync today, and here's what we decided: I was hoping to timebox digging into this to last week. That has been successful in the sense that the flakiness has been addressed, but we haven't root-caused the issue. However, I have a bigger company priority (automated update) to work on, so I'm going to put this on the back burner for now while working on more pressing issues. If and when we come back to this in the future, a few places to start looking:
This changes the logic for parsing the server ports that ClickHouse listens on, so that it waits until we are certain that they've been completely written. This fixes a few flaky tests, currently including at least #4779, #4972, and #5180. There have been others in the past, which we've addressed with less-complete solutions that only narrow the race window. This should eliminate it.
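The gist of that change, as a sketch: treat only newline-terminated lines as fully written, and keep re-reading the log until all the expected listener lines have appeared. The function name, the expected count, and the log format below are illustrative assumptions rather than the actual implementation:

```rust
use std::path::Path;
use std::time::{Duration, Instant};

/// Poll the ClickHouse log until `expected` listener ports have been fully
/// written out, ignoring any partially flushed final line.
fn wait_for_complete_ports(log_path: &Path, expected: usize) -> anyhow::Result<Vec<u16>> {
    let deadline = Instant::now() + Duration::from_secs(30);
    loop {
        let contents = std::fs::read_to_string(log_path).unwrap_or_default();
        // Drop anything after the last newline: a line without a trailing
        // newline may still be in the middle of being written.
        let complete = match contents.rfind('\n') {
            Some(idx) => &contents[..=idx],
            None => "",
        };
        let ports: Vec<u16> = complete
            .lines()
            .filter_map(|line| line.split("Listening for ").nth(1))
            .filter_map(|rest| rest.rsplit(':').next()?.trim().parse().ok())
            .collect();
        if ports.len() >= expected {
            return Ok(ports);
        }
        if Instant::now() > deadline {
            anyhow::bail!("timed out waiting for {expected} ClickHouse listener port(s)");
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}
```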
@sunshowers I'm pretty confident the underlying issue around discovery of the correct ports will be resolved by #6655. But it seems like this issue is also about the failure to upload the log files. What's your opinion? Would you like to keep it open to track that aspect, or should we rename this issue or open a new one?
Are ClickHouse log files currently uploaded for failing tests? Sorry, it's been a while!
It sure has!
I think so. At least, one of my recent PRs that had a failed test job does appear to have uploaded the files.
Some example buildomat runs:
On main: https://github.com/oxidecomputer/omicron/runs/20227366854
On a PR: https://github.com/oxidecomputer/omicron/pull/4773/checks?check_run_id=20231856893