Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some tests in omicron-nexus::test_all stall out and hang #4779

Open
sunshowers opened this issue Jan 8, 2024 · 17 comments
Open

Some tests in omicron-nexus::test_all stall out and hang #4779

sunshowers opened this issue Jan 8, 2024 · 17 comments
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. Testing & Analysis Tests & Analyzers
Milestone

Comments

@sunshowers
Copy link
Contributor

Some example buildomat runs:

On main: https://github.com/oxidecomputer/omicron/runs/20227366854

omicron-nexus::test_all integration_tests::projects::test_project_deletion_with_vpc
omicron-nexus::test_all integration_tests::silos::test_saml_idp_metadata_data_valid

On a PR: https://github.com/oxidecomputer/omicron/pull/4773/checks?check_run_id=20231856893

omicron-nexus::test_all integration_tests::console_api::test_session_me_groups
@sunshowers sunshowers added Testing & Analysis Tests & Analyzers Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. labels Jan 8, 2024
@rcgoodfellow
Copy link
Contributor

I have what feels like a similar issue, but with different tests hanging. In this run, the following tests hung.

omicron-nexus::test_all integration_tests::rack::test_rack_initialization
omicron-nexus app::update::tests::test_update_deployments

Here is a link to the CI job. In this run, the illumos integration tests passed, but the Linux ones did not. This also comes from a commit that only changed an unrelated buildomat script and the previous commit on that branch passed CI on the Linux build-and-test job.

@smklein
Copy link
Collaborator

smklein commented Jan 9, 2024

Just observed this failure here: https://github.com/oxidecomputer/omicron/pull/4783/checks?check_run_id=20286750269

omicron-nexus::test_all integration_tests::images::test_make_disk_from_image appears to have been running non-stop for almost two hours. Looks like this happened on Ubuntu-22.04.

Reading what's going on from just the Nexus log is getting kinda difficult - the rendered log is 25,000 lines long, of which ~20,000 of those lines are from BackgroundTasks.

@sunshowers
Copy link
Contributor Author

sunshowers commented Jan 9, 2024

#4780 should hopefully land very soon, just waiting on the deploy job.

sunshowers added a commit that referenced this issue Jan 9, 2024
We're seeing some tests in `test_all` hang forever (#4779). Set a
reasonable
upper bound on test duration.

This will also cause stdout and stderr for failing tests to be printed.
Doing
so on SIGTERM in general is tracked at
nextest-rs/nextest#1208.

Also, bump up the required nextest version to 0.9.64 to make use of the
`binary_id` predicate.
@smklein
Copy link
Collaborator

smklein commented Jan 9, 2024

Couple more cases that just showed up: #4783

[5411](https://buildomat.eng.oxide.computer/wg/0/details/01HKPQ0KFEGHYZHDSCY6RC9JMB/1BHKCrXA6UrkcAbAz5lm6dq48ZkaSiQTlDYpWR0cbxtrF0bK/01HKPQ1NSF26P9RMQR6KSEPM7V#S5411)	2024-01-09T11:12:47.656Z	   Canceling due to signal: 2 tests still running
[5412](https://buildomat.eng.oxide.computer/wg/0/details/01HKPQ0KFEGHYZHDSCY6RC9JMB/1BHKCrXA6UrkcAbAz5lm6dq48ZkaSiQTlDYpWR0cbxtrF0bK/01HKPQ1NSF26P9RMQR6KSEPM7V#S5412)	2024-01-09T11:12:47.665Z	     SIGTERM [6351.638s] omicron-nexus::test_all integration_tests::oximeter::test_oximeter_reregistration
[5413](https://buildomat.eng.oxide.computer/wg/0/details/01HKPQ0KFEGHYZHDSCY6RC9JMB/1BHKCrXA6UrkcAbAz5lm6dq48ZkaSiQTlDYpWR0cbxtrF0bK/01HKPQ1NSF26P9RMQR6KSEPM7V#S5413)	2024-01-09T11:12:47.669Z	     SIGTERM [6220.087s] omicron-nexus::test_all integration_tests::role_assignments::test_role_assignments_silo

(These also look wholly unrelated to the PR, but same symptom. Something must have merged recently to cause this timeout across such a broad set of tests to trigger much more frequently)

@sunshowers
Copy link
Contributor Author

Here's a post-#4780 failure: https://github.com/oxidecomputer/omicron/runs/20299512003

Link to failing stdout/stderr. Sadly this doesn't appear to be helpful.

@sunshowers
Copy link
Contributor Author

sunshowers commented Jan 9, 2024

While looking at the logs I found this interesting issue:

2024-01-09T12:32:00.369Z	WARN	test_reject_creating_disk_from_snapshot (clickhouse-client): failed to read version
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    error = Telemetry database unavailable: error sending request for url (http://[::1]:465/?output_format_json_quote_64bit_integers=0): error trying to connect: tcp connect error: Connection refused (os error 111)
    file = oximeter/db/src/client.rs:868
    id = dcf208b1-e86c-4472-a0ac-8874804fe091
2024-01-09T12:32:00.371Z	WARN	test_reject_creating_disk_from_snapshot (oximeter): failed to create ClickHouse client
    error = Database(DatabaseUnavailable("error sending request for url (http://[::1]:465/?output_format_json_quote_64bit_integers=0): error trying to connect: tcp connect error: Connection refused (os error 111)"))
    file = oximeter/collector/src/lib.rs:213
    retry_after = 167.267012453s

The port number seems suspiciously low -- it's definitely below 1024 which is strange.

Wondering if #4755 is responsible. I'm not sure exactly why but looking at commits on main, it looks like test_all started flaking out after that commit landed.

@bnaecker
Copy link
Collaborator

bnaecker commented Jan 9, 2024

That's an interesting correlation, and I also don't see why that commit would cause it. For background, ClickHouse is started with a port of 0; it gets some available one from the OS; and then we fish the actual port it bound out of the log file. That machinery has all worked pretty well for a while now, but it's possible moving the log files around introduced some problem. It also seems possible to get a port < 1024 when run with elevated priveleges, so perhaps this is a red herring if the tests are run that way.

@sunshowers
Copy link
Contributor Author

sunshowers commented Jan 9, 2024

We sadly don't have logs for clickhouse itself as part of the failed test uploads -- we should probably work to upload them.

@bnaecker
Copy link
Collaborator

bnaecker commented Jan 9, 2024

Hmm, I'm not sure why we don't have those ClickHouse log files. It sure looks like TMPDIR is set to /var/tmp/omicron_tmp in .github/buildomat/build-and-test.sh, and the ClickHouseInstance will use Utf8TempDir for storage. Is it possible that the buildomat rules for uploading artifacts misses those? IIRC, the temporary directories for ClickHouse usually look like /tmp/.tmppEavQr, so perhaps the leading . causes them to be missed.

@bnaecker
Copy link
Collaborator

bnaecker commented Jan 9, 2024

I don't think that's it. Buildomat uses glob to find files matching the output rules, and that crate seems to find tempdirs created with a leading . just fine.

@sunshowers
Copy link
Contributor Author

sunshowers commented Jan 9, 2024

It looks like it's not always a port below 1024 -- in this example this was port 3981.

We also don't have success logs to confirm that the ClickHouse issue is actually correlated.

However, looking at the code which generates this warning it does seem like it would get stuck here. I believe it gets invoked via start_oximeter, which doesn't have an associated timeout. So the next steps are:

  1. Gather ClickHouse logs and ensure they're uploaded: [clickhouse] upload logs on test failure #4796
  2. Add a timeout to start_oximeter: [nexus-test-utils] set 60s timeouts for each init step #4789

sunshowers added a commit that referenced this issue Jan 9, 2024
This version of nextest has a fix for nextest-rs/nextest#1208, which we encountered while attempting to diagnose #4779.
sunshowers added a commit that referenced this issue Jan 10, 2024
In #4779 we're tracking what appears to be a ClickHouse initialization
failure during Nexus startup. Set a timeout of 60s for each step in the
initialization process.

These steps should usually not take more than 5 seconds each, so 60s is
a really comfortable buffer.
sunshowers added a commit to oxidecomputer/dropshot that referenced this issue Jan 12, 2024
…xt (#878)

While trying to diagnose oxidecomputer/omicron#4779,
which only occurs in CI, I'm trying to collect ClickHouse logs. To do that we
need a way to uniquely identify the ClickHouse logs per test.

It is possible to derive the log prefix from `log_file_for_test`, but it seems
cleaner to do it this way.
sunshowers added a commit that referenced this issue Jan 12, 2024
Part of trying to figure out #4779 -- we now create ClickHouse logs in a
well-named directory which gets not-cleaned up (and therefore uploaded) at the end.

Tested this by introducing a failure in a test, and then seeing the following files in `/tmp`:

```
% find test_all-c4d6fc05c1fc48de-test_local_users.3329283.1-clickhouse-I4Af08 
test_all-c4d6fc05c1fc48de-test_local_users.3329283.1-clickhouse-I4Af08
test_all-c4d6fc05c1fc48de-test_local_users.3329283.1-clickhouse-I4Af08/log
test_all-c4d6fc05c1fc48de-test_local_users.3329283.1-clickhouse-I4Af08/clickhouse-server.log
test_all-c4d6fc05c1fc48de-test_local_users.3329283.1-clickhouse-I4Af08/clickhouse-server.errlog
```
@sunshowers
Copy link
Contributor Author

sunshowers commented Jan 12, 2024

So I haven't been able to reproduce the flakiness with #4796, in any of the 5 runs that happened on the PR, or on main (https://github.com/oxidecomputer/omicron/runs/20441805293). That is really weird -- in a well-behaved system, 4796 would not have any impact on this.

So how could this happen?

My current reasoning is: 4796 does two things.

  1. Renames clickhouse temp dirs from $TMPDIR/clickhouse-... to $TMPDIR/<process-name>.<test-name>.<pid>.<id>-clickhouse-....
  2. Preserves ClickHouse logs on test failures.

If 4796 does not actually address the flakiness, we continue to dig, now with better logs.

If 4796 does fix the flakiness -- then it surely can't be 2, because that code only kicks in after the test has failed. So the only thing it can be is 1.

What could 1 be caused by in practice? The best thing I can imagine is something that removes those directories while the clickhouse process is running. That may suggest a reason why #4755 caused the flakiness (it moved the cwd to inside the temp dir, and processes generally don't like their cwds being deleted).

Assuming this hypothetical, what could this process be? I can imagine one of two things:

  1. Something checked into Omicron. But then why would the flakiness only happen on Linux and not on Helios?
  2. Some kind of tempfile cleaner. But why would that cleaner spare the cwd when it's named one way but not the other?

This is all really strange.

@sunshowers
Copy link
Contributor Author

So it does look like 4796 fixed the flakiness...

@sunshowers
Copy link
Contributor Author

We discussed this in the Control Plane sync today and here's what we decided:

I was hoping to timebox digging into this for last week. That has been successful in the sense that the flakiness has been addressed, but we haven't root caused the issue. However I have a bigger company priority (automated update) to work on, so I'm going to put this on the backburner for now while working on more pressing issues.

If and when we come back to this in the future, a few places to start looking:

  1. Trying to revert change 1 in 4796 (renaming dirs) and seeing if we get any useful logs from clickhouse thanks to change 2.
  2. If it is true that something is wiping out the cwd of the clickhouse process at the wrong time, writing an eBPF script which tracks unlinks in /var/tmp/omicron_tmp based on the process that did them.

@sunshowers sunshowers added this to the Unscheduled milestone Jan 16, 2024
bnaecker added a commit that referenced this issue Sep 24, 2024
This changes the logic for parsing the server ports that ClickHouse
listens on, so that it waits until we are certain that they've been
completely written. This fixes a few flaky tests, currently including at
least #4779, #4972, and #5180. There have been others in the past, which
we've addressed with less-complete solutions that only narrow the race
window. This should eliminate it.
bnaecker added a commit that referenced this issue Sep 24, 2024
This changes the logic for parsing the server ports that ClickHouse
listens on, so that it waits until we are certain that they've been
completely written. This fixes a few flaky tests, currently including at
least #4779, #4972, and #5180. There have been others in the past, which
we've addressed with less-complete solutions that only narrow the race
window. This should eliminate it.
bnaecker added a commit that referenced this issue Sep 25, 2024
This changes the logic for parsing the server ports that ClickHouse
listens on, so that it waits until we are certain that they've been
completely written. This fixes a few flaky tests, currently including at
least #4779, #4972, and #5180. There have been others in the past, which
we've addressed with less-complete solutions that only narrow the race
window. This should eliminate it.
@bnaecker
Copy link
Collaborator

@sunshowers I'm pretty confident the underlying issue around discovery of the correct ports will be resolved by #6655. But it seems like this issue is also about the failure to upload the log files. What's your opinion? Would you like to keep it open to track that aspect of it; or maybe we rename this issue or open a new one?

@sunshowers
Copy link
Contributor Author

Are ClickHouse log files currently uploaded for failing tests? Sorry it's been a while!

@bnaecker
Copy link
Collaborator

Sorry it's been a while!

It sure has!

Are ClickHouse log files currently uploaded for failing tests?

I think so. At least, one of my recent PRs that had a failed test job does appear to have uploaded the files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. Testing & Analysis Tests & Analyzers
Projects
None yet
Development

No branches or pull requests

4 participants