Test flake: Failed to start oximeter during test_console_pages
#4972
Comments
Yeah, it's weird there are no ClickHouse logs -- did ClickHouse not start? We should investigate.
Previously: #4779. We'd stopped seeing these but apparently they can still occur. These lines show that a port was assigned (which means the log file should have been created, because I believe that's where the port is read from), but oximeter couldn't connect to it. This is similar to the issue in #4779 (example). So why wasn't the ClickHouse log file uploaded? That's a good question, and one I don't have an answer to. It will need more debugging to find out why.
I've been looking at these logs and the parsing code recently, and I have a theory for how this happens. It also explains basically any flake we have seen around starting up ClickHouse and […].

We start ClickHouse with a port of 0 in tests, and then read the log file to find the port that was actually bound. The problem is that ClickHouse's writes to the log file can be split at any point inside the line containing the port, including right before, right after, or anywhere in the middle of the port itself. For example, in this case, […].

I've fixed a number of issues related to this recently. But they're all band-aids, and don't really account for the underlying problem -- the code as it exists today can't know when the port number is fully written. It splits the log file into lines, which discards the newline, but we need that newline to know whether the port is complete.

There are two choices, I think. We can wait until a later sentinel is written to the log file before trying to parse the port at all. Or we can stop splitting on lines, which removes the actual newline characters, and instead try to parse a port iff it is followed by a newline. I'm leaning towards the first, since it's much simpler and will let us delete a lot of code.
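To make the two choices concrete, here is a minimal Rust sketch of each. It is not the code in omicron. The marker and sentinel strings (`PORT_MARKER`, `SENTINEL`) and both function names are hypothetical, chosen only for illustration, and the real ClickHouse log format may differ. Both functions take the log contents read so far and return `None` whenever the port might still be only partially written.

```rust
/// Hypothetical text that precedes the port in the log (an assumption for
/// illustration, not the actual ClickHouse log format).
const PORT_MARKER: &str = "listening on port ";

/// Hypothetical sentinel that is written only after the port line.
const SENTINEL: &str = "ready for connections";

/// Option 1: do not parse anything until a later sentinel is visible.
/// Once the sentinel has appeared, everything before it is fully written.
fn port_after_sentinel(log: &str) -> Option<u16> {
    let sentinel_at = log.find(SENTINEL)?;
    // Only search the part of the log that precedes the sentinel.
    let before = &log[..sentinel_at];
    let marker_at = before.find(PORT_MARKER)?;
    let rest = &before[marker_at + PORT_MARKER.len()..];
    let digits: String =
        rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}

/// Option 2: keep the newlines, and accept the port only if its digits are
/// immediately followed by a newline.
fn port_if_newline_terminated(log: &str) -> Option<u16> {
    let marker_at = log.find(PORT_MARKER)?;
    let rest = &log[marker_at + PORT_MARKER.len()..];
    // If no non-digit follows yet, the write may have been split mid-port.
    let end = rest.find(|c: char| !c.is_ascii_digit())?;
    if !rest[end..].starts_with('\n') {
        return None;
    }
    rest[..end].parse().ok()
}
```

With a partial log such as `"listening on port 80"`, both functions return `None`, so the caller retries later instead of accepting a truncated port. The first variant needs no per-character bookkeeping, which matches the reasoning above for preferring it.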
This changes the logic for parsing the server ports that ClickHouse listens on, so that it waits until we are certain that they've been completely written. This fixes a few flaky tests, currently including at least #4779, #4972, and #5180. There have been others in the past, which we've addressed with less-complete solutions that only narrow the race window. This should eliminate it.
This should be resolved by #6655.
Seen on main: https://github.com/oxidecomputer/omicron/runs/21172250627

I think this is a failure of test setup, not the test itself.

On the list of artifacts, I'm not seeing any ClickHouse / Oximeter logs. Based on the test output, these directories are being leaked to /var/tmp, but I'm not sure if we're picking those up.