Test flake: silent failure to initialize cockroachdb? #3889
I wanted to look at the CockroachDB SMF log files, but they're all empty. I compared with dogfood and they're empty there, too. There are no rotated files in /var/svc/log. Then I found #3713 and located some useful log files on dogfood (sled 9):
(Is there a better way to find archived log files?) These do have the stuff I expected to be there (svc.startd messages, xtrace output from the start method, etc.). We don't have these files from the failed job, unfortunately. It seems like we should add those to the pattern of logs that we have buildomat save on failure.
I would expect that the "rack" objects are inserted during rack initialization at Nexus (i.e., handoff to Nexus), not when the database is initialized. It's possible that the database was initialized but handoff never happened for some reason. In the Sled Agent log, I do see a lot of failures to hand off to Nexus, including this last one:
and I see no indication after that of a successful handoff. So I think that explains the warnings and the ultimate failure after the 10-minute timeout. Next will be to look at the Nexus logs to see why these requests are failing.
It looks like all of the initialization-complete requests went to the Nexus
and here are the details of one of those 500s:
There's not much to go on here. This looks like an error path during rack initialization that's not annotating the error object it got back from the database with anything useful about what it was doing. The string "Record not found" does not appear in Omicron, so I guessed this was coming from our use of Diesel.
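If that guess is right, the "Record not found" string would be the Display output of `diesel::result::Error::NotFound`, which is what query helpers like `.first()` and `.get_result()` return when no rows match. A minimal sketch (not Omicron code) of why the message carries no context:

```rust
// Sketch only: Diesel's NotFound error is what `.first()` / `.get_result()`
// return when a query matches no rows. Its Display impl is just the bare
// string, with nothing about which query produced it.
fn main() {
    let e = diesel::result::Error::NotFound;
    // If the guess above is right, this prints "Record not found".
    println!("{}", e);
}
```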
I started looking around for calls that could produce that error, starting with the rack initialization code path, which opens here: omicron/nexus/db-queries/src/db/datastore/rack.rs (lines 398 to 416 in 496b854)
We can infer something about where the error was generated based on the log entries that we do have for one of the failed requests. I grabbed the request id from the last 500 error and filtered all the messages in the Nexus log for that req_id:
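(As an aside: the Nexus log is bunyan-style JSON, so the filtering amounts to keeping only the lines whose req_id field matches. A rough sketch of that kind of filter, using a placeholder request id rather than the real one:)

```rust
// Sketch: keep only bunyan-format JSON log lines whose "req_id" matches.
// The id below is a placeholder, not the actual request id from the 500.
use std::io::{self, BufRead};

fn main() {
    let wanted = "REQ_ID_FROM_THE_500_ERROR";
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line.expect("reading stdin");
        if let Ok(entry) = serde_json::from_str::<serde_json::Value>(&line) {
            if entry.get("req_id").and_then(|v| v.as_str()) == Some(wanted) {
                println!("{}", line);
            }
        }
    }
}
```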
We can summarize the actions and results:
That points to this code in omicron/nexus/db-queries/src/db/datastore/ip_pool.rs (lines 108 to 151 in 496b854), which is called early in rack initialization. What's responsible for setting that up? I didn't find anything in omicron/nexus/db-queries/src/db/datastore/rack.rs (lines 550 to 588 in 496b854).
In Nexus we do spawn a background task to wait for this, and that does log a message when it finishes or fails (lines 360 to 375 in 496b854):
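Roughly, the pattern is the one sketched below (illustrative names only; this is not the actual Omicron code):

```rust
// Sketch of the pattern described above: spawn a background task that runs
// the populate step and logs a message whether it succeeds or fails.
// `populate` here is a stand-in, not Omicron's real function.
use slog::{error, info, Logger};

// Called from async context (inside the Tokio runtime) during Nexus startup.
fn start_populate(log: Logger) {
    tokio::spawn(async move {
        match populate(&log).await {
            Ok(()) => info!(log, "populate complete"),
            Err(e) => error!(log, "populate failed"; "error" => %e),
        }
    });
}

async fn populate(_log: &Logger) -> anyhow::Result<()> {
    // ... insert built-in records (built-in users, IP pools, etc.) ...
    Ok(())
}
```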
I don't see that log message in this log. It seems like the "populate" step maybe got stuck? More investigation is needed here.
I just noticed that the Nexus log that I've been looking at from buildomat doesn't start at the beginning of Nexus's lifetime. Like the CockroachDB log, it must have been rotated and archived already and we don't have that earlier file. That makes it hard to say anything about which populate steps may have run, failed, finished, etc. There are several bugs/improvements that would help us debug this kind of problem the next time it happens:
We should definitely fix the buildomat job. I'm not sure if it's worth debugging this further until we do that. We might be able to find more information, but it'll probably be a lot clearer with the logs in hand.
These were archived out to the U.2s as part of #3713, I believe. They're still named by zone, so filtering by that would help. I think the dataset they end up in is unpredictable, though the archiver continues to use the same location until it runs out of space. It unfortunately does not help post-mortem, but if you're debugging a live system, the zone-bundle command can help:

$ zone-bundle --host <SLED_AGENT_IP> get --create oxz_nexus_2c05b297-cc92-4a0a-8378-946af887ec91

That command will create a tarball with all the log files, whether they're still in the zone or have already been archived out.
https://github.com/oxidecomputer/omicron/pull/3887/checks?check_run_id=15946171931 failed; the immediate symptom in the deploy job was a timeout after trying to log in for 10 minutes:
All three Nexus logs have many warnings that seem to indicate the database wasn't set up correctly, I think?
but the sled-agent logs do not report any issues with initial CRDB setup: