Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test flake: Failed to start oximeter during test_console_pages #4972

Closed
smklein opened this issue Feb 3, 2024 · 4 comments
Closed

Test flake: Failed to start oximeter during test_console_pages #4972

smklein opened this issue Feb 3, 2024 · 4 comments
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@smklein
Copy link
Collaborator

smklein commented Feb 3, 2024

Seen on main: https://github.com/oxidecomputer/omicron/runs/21172250627

4744	2024-02-02T22:51:52.437Z	        FAIL [  73.553s] omicron-nexus::test_all integration_tests::console_api::test_console_pages
4745	2024-02-02T22:51:52.437Z	
4746	2024-02-02T22:51:52.437Z	--- STDOUT:              omicron-nexus::test_all integration_tests::console_api::test_console_pages ---
4747	2024-02-02T22:51:52.437Z	
4748	2024-02-02T22:51:52.437Z	running 1 test
4749	2024-02-02T22:51:52.437Z	test integration_tests::console_api::test_console_pages has been running for over 60 seconds
4750	2024-02-02T22:51:52.437Z	test integration_tests::console_api::test_console_pages ... FAILED
4751	2024-02-02T22:51:52.437Z	
4752	2024-02-02T22:51:52.437Z	failures:
4753	2024-02-02T22:51:52.437Z	
4754	2024-02-02T22:51:52.437Z	failures:
4755	2024-02-02T22:51:52.437Z	    integration_tests::console_api::test_console_pages
4756	2024-02-02T22:51:52.437Z	
4757	2024-02-02T22:51:52.437Z	test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 321 filtered out; finished in 73.45s
4758	2024-02-02T22:51:52.437Z	
4759	2024-02-02T22:51:52.437Z	
4760	2024-02-02T22:51:52.437Z	--- STDERR:              omicron-nexus::test_all integration_tests::console_api::test_console_pages ---
4761	2024-02-02T22:51:52.437Z	log file: /var/tmp/omicron_tmp/test_all-1ad812dfecb2bd6b-test_console_pages.44515.0.log
4762	2024-02-02T22:51:52.437Z	note: configured to log to "/var/tmp/omicron_tmp/test_all-1ad812dfecb2bd6b-test_console_pages.44515.0.log"
4763	2024-02-02T22:51:52.437Z	DB URL: postgresql://root@[::1]:36655/omicron?sslmode=disable
4764	2024-02-02T22:51:52.437Z	DB address: [::1]:36655
4765	2024-02-02T22:51:52.437Z	log file: /var/tmp/omicron_tmp/test_all-1ad812dfecb2bd6b-test_console_pages.44515.2.log
4766	2024-02-02T22:51:52.437Z	note: configured to log to "/var/tmp/omicron_tmp/test_all-1ad812dfecb2bd6b-test_console_pages.44515.2.log"
4767	2024-02-02T22:51:52.438Z	thread 'integration_tests::console_api::test_console_pages' panicked at /work/oxidecomputer/omicron/nexus/test-utils/src/lib.rs:325:21:
4768	2024-02-02T22:51:52.438Z	Timed out after 60s while running step start_oximeter
4769	2024-02-02T22:51:52.438Z	stack backtrace:
4770	2024-02-02T22:51:52.438Z	   0: rust_begin_unwind
...
4824	2024-02-02T22:51:52.445Z	  27: test_console_pages
4825	2024-02-02T22:51:52.445Z	             at ./tests/integration_tests/console_api.rs:183:1
4826	2024-02-02T22:51:52.445Z	  28: test_all::integration_tests::console_api::test_console_pages::{{closure}}
4827	2024-02-02T22:51:52.445Z	             at ./tests/integration_tests/console_api.rs:183:1
4828	2024-02-02T22:51:52.445Z	  29: core::ops::function::FnOnce::call_once
4829	2024-02-02T22:51:52.445Z	             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/ops/function.rs:250:5
4830	2024-02-02T22:51:52.445Z	  30: core::ops::function::FnOnce::call_once
4831	2024-02-02T22:51:52.445Z	             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/ops/function.rs:250:5
4832	2024-02-02T22:51:52.445Z	note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
4833	2024-02-02T22:51:52.445Z	WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
4834	2024-02-02T22:51:52.445Z	WARN: temporary directory leaked: /var/tmp/omicron_tmp/.tmpkYxTAk
4835	2024-02-02T22:51:52.445Z	WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
4836	2024-02-02T22:51:52.445Z	WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
4837	2024-02-02T22:51:52.445Z	WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmppWH457
4838	2024-02-02T22:51:52.445Z	WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
4839	2024-02-02T22:51:52.445Z	WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmp4LaP3C
4840	2024-02-02T22:51:52.445Z	WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
4841	2024-02-02T22:51:52.445Z	WARN: mgd temporary directory leaked: /var/tmp/omicron_tmp/.tmp4SfDx2
4842	2024-02-02T22:51:52.446Z	WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
4843	2024-02-02T22:51:52.446Z	WARN: mgd temporary directory leaked: /var/tmp/omicron_tmp/.tmp2FbdRB

I think this is a failure of test setup, not the test itself.

On the list of artifacts, I'm not seeing any Clckhouse / Oximeter logs. Based on the test output, these directories are being leaked to /var/tmp, but I'm not sure if we're picking those up.

@smklein smklein added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label Feb 3, 2024
@sunshowers
Copy link
Contributor

Yeah, it's weird there are no clickhouse logs -- did clickhouse not start? We should investigate.

@sunshowers
Copy link
Contributor

Previously: #4779. We'd stopped seeing these but apparently they can still occur.

These lines show that a port was assigned (which means the log file should have been created, because that's where the port is read from I believe) but oximeter couldn't connect to it.

This is similar to the issue in #4779 (example).

So why wasn't the clickhouse log file uploaded? That's a good question and one I don't have an answer to. Will need more debugging to find out why.

@bnaecker
Copy link
Collaborator

I've been looking at these logs and the parsing code recently, and I have a theory for how this happens. It also explains basically any flake we have seen around starting up ClickHouse and oximeter.

We start ClickHouse with a port of 0 in tests, and then read the log file to find the port that was actually bound. The problem is basically that those writes can be split at any point inside the line containing the port, including right before, right after, or at any point in the middle of the port itself.

For example, in this case, oximeter is trying to connect on port 454. But it's possible that ClickHouse is actually listening on port 4545. The parsing code would "succeed", if the line it parsed the port out of ended with 454, even if ClickHouse wrote the rest of the port right after that.

I've fixed a number of issues related to this recently. But they're all band-aids, and don't really account for the underlying problem -- the code as it exists today can't really know when the port number is fully written. It splits the log file into lines, but we need to use that newline to know if the port is completed.

There are two choices, I think. We can wait until a later sentinel is written to the log file before trying to parse the port at all. Or we can not split on lines, which removes the actual newline characters, and instead try to parse a port iff that's followed by a newline. I'm leaning towards the first, since it's much simpler and will let us delete a lot of code.

bnaecker added a commit that referenced this issue Sep 24, 2024
This changes the logic for parsing the server ports that ClickHouse
listens on, so that it waits until we are certain that they've been
completely written. This fixes a few flaky tests, currently including at
least #4779, #4972, and #5180. There have been others in the past, which
we've addressed with less-complete solutions that only narrow the race
window. This should eliminate it.
bnaecker added a commit that referenced this issue Sep 24, 2024
This changes the logic for parsing the server ports that ClickHouse
listens on, so that it waits until we are certain that they've been
completely written. This fixes a few flaky tests, currently including at
least #4779, #4972, and #5180. There have been others in the past, which
we've addressed with less-complete solutions that only narrow the race
window. This should eliminate it.
bnaecker added a commit that referenced this issue Sep 25, 2024
This changes the logic for parsing the server ports that ClickHouse
listens on, so that it waits until we are certain that they've been
completely written. This fixes a few flaky tests, currently including at
least #4779, #4972, and #5180. There have been others in the past, which
we've addressed with less-complete solutions that only narrow the race
window. This should eliminate it.
@bnaecker
Copy link
Collaborator

This should be resolved by #6655.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Projects
None yet
Development

No branches or pull requests

3 participants