test failed in CI: helios/deploy CLI SendError
#6771
SendError
Figuring that kind of failure might result from a Nexus crash, I took a look at the Nexus logs. I don't see any crashes, but I do see some related-looking request failures. All the import-related request log entries seem to be in this log, and I see a bunch of messages like this one:
and a couple where the state is
I'm unable to reproduce this now, of course, though I noticed that the
I was finally able to reproduce this using the old CLI binary:
The trick was to set a
causing the same failure. In contrast, the latest CLI responds to a timeout differently:
This is almost certainly the case: Nexus reports that a client disconnected during a request to the bulk write endpoint:
Do we think updating the CLI binary is useful here, or will it similarly time out?
We probably should update it so we're testing the latest binary in general, but I suspect it will similarly time out.
I'm not sure: If I'm misreading this and it's running on AWS, are these t2.nano instances or something haha? Again, 15 seconds to write 512 KiB is huge, unless it's something like we don't have TRIM enabled and there's a big reclaim going on?
It is running on lab hardware, I'm just not sure what the SSDs in the machine are. (Also keep in mind that the ZFS pools used by the deployed control plane are file-backed.)
I'm using stock omicron bits on my bench gimlet and I can get this to happen 100% of the time:
I'm on omicron commit: 0640bb2
And, the cli:
I added a
I got another one of these: https://github.com/oxidecomputer/omicron/pull/6810/checks?check_run_id=31321074093
I updated my bench gimlet to:
And, it still fails the same way.
I'm unclear if this is the same error, but it seems like it might be related:
I saw the same variant Andrew did on https://buildomat.eng.oxide.computer/wg/0/details/01J9Y2BWQWFB0HPRRFCWC6GHAN/HzxHMKpHk2lzsmDBBvStwfjrwTZfnljEjTH4TLjd2l3KTopv/01J9Y2CEDVND9FA5DK9G0AX5FY. That run is on #6822 which does modify the pantry code path, but I think the modifications it makes (changing how Nexus chooses a pantry) would result in a different kind of failure if they were wrong.
I hit this a few times in a4x2 yesterday. I'll try Alan's workaround of a higher timeout next time I see it and report back.
Observed this again on #6879 after it merged to main. Seems like the same variant John observed in #6771 (comment), but I figured I'd drop it here anyway in case it's useful.
In the interest of getting PRs merged more readily, I've increased the timeout to 60s in d69155e. This obviously sucks, and isn't a substitute for a deeper understanding of and better solution to the issue.
On the bench gimlet setup, I captured the pantry log while reproducing the error. The full log is here: /staff/cores/omicron-6771. Of interest, we have the initial creation and setup request:
Things start up and downstairs connect:
We activate, and bulk uploads start:
The bulk-write requests continue on an expected cadence, but then, it appears as if the pantry receives a message to disconnect from the downstairs:
So, now the question is who is requesting the disconnect, and why?
In the Nexus log during the same window (also at /staff/core/omicron-6771) I see the disk created okay. I see the bulk import start:
I see bulk writes going:
Here are the last two bulk writes that Nexus sends:
Those look to line up with the pantry logs for the last two writes it receives. Next, it looks like Nexus has received a request to stop the bulk writes:
The remote addr: Eventually, we see Nexus send the detach to the pantry:
This detach lines up with when the pantry is told to detach the disk.
My CLI version is:
More observations from your Nexus log file: The CLI starts 8 upload threads. Attempting to identify those 8 threads by their remote address looks possible:
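(For anyone repeating this tally, a rough shell sketch is below. The log path, the `bulk` filter, and the `remote_addr` field name are assumptions about the request-log format here, not the exact commands used in this analysis.)

```
# Hypothetical sketch: count request-log entries per client address.
grep -i bulk nexus.log \
  | sed -n 's/.*"remote_addr":"\([^"]*\)".*/\1/p' \
  | sort | uniq -c | sort -n
```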
Two things jump out there: we have 9 remote addrs, and 2 of them have suspiciously low counts. The 1 from
The timestamps and the error message on the final request from the CLI thread at :55340 (the one with a count of 4) are both interesting:
In particular:
After some focused debugging I think we've got a much better understanding of the problem here: it appears that:
Alan was able to reliably reproduce this issue on a bench Gimlet and we did this investigation there. (@leftwo can you post here how you did that?) In this situation:
Some observations that got us here:
We went looking for any kstats that would point to large drop counts:
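(For reference, that kind of scan looks roughly like the sketch below; treat it as illustrative rather than the exact commands we ran, and note the actual counters we inspected aren't shown.)

```
# Dump all kstats in parseable form, keep anything with "drop" in the
# statistic name, and show only nonzero counters.
kstat -p | grep -i drop | awk '$2 != 0'
```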
Those don't seem large enough to explain the problem. We had only seen this in CI and on lab hardware, not on production hardware, so we suspected the softnpu data path (without much evidence at that point) and started there to look for packet loss. We went looking for the softnpu process:
and we looked for its DTrace probes:
We enabled those probes (coarsely) to look for drops:
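(For anyone following along, the shape of those steps was roughly the following. This is a sketch: the `softnpu*` provider-name glob is an assumption, so adjust it to whatever `dtrace -l` actually reports.)

```
# Find the softnpu process.
pgrep -fl softnpu

# List the USDT probes it publishes (USDT provider names are per-PID,
# hence the glob).
dtrace -ln 'softnpu*:::'

# Coarsely enable everything from that provider and count firings by probe
# name, watching in particular for any drop-related probes.
dtrace -n 'softnpu*::: { @[probename] = count(); }'
```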
It doesn't look like softnpu itself is dropping these packets (at least, its "drop" probes aren't firing). At this point, @rmustacc went looking through the kernel STREAMS data path, reverse engineering and exploring using cscope, mdb, and DTrace. We found we were going through
That's a lot of drops! What's
and:
That's one of the softnpu devices. We did an experiment to increase the max buffer size. First we found the address of the
We decided to set the cap to 16 MiB, figuring that if we've got 8 threads trying to send at most 1 MiB each, this should be plenty. Let's try it:
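(Mechanically, the experiment was of roughly this shape. The symbol name below is a placeholder for the variable/address found above, and `/W` assumes a 32-bit value.)

```
# Read the current cap (placeholder symbol name; substitute the real one
# or the address found above).
echo 'example_hiwat_cap/D' | mdb -k

# Bump it to 16 MiB (0t prefix = decimal; -kw opens the kernel writable).
echo 'example_hiwat_cap/W 0t16777216' | mdb -kw
```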
At this point, we tried several times to reproduce the problem without success, and saw no drops:
We also looked for how big the buffer got during this experiment:
We think this pretty well explains all the behavior. Assuming that's true, the proposed fix is to have softnpu use the STREAMS equivalent of a sockopt to increase this buffer size. I vaguely suspect that the recent change to the socket buffer size (#6690) made this problem more likely for two reasons:
For reproduction and analysis steps (this was in the middle of the debugging session @davepacheco mentions above): on the bench gimlet sn05, talking from the global zone to Nexus (through the softnpu zone/layer), I would upload an image that was local to the global zone like this:
The upload would fail, usually within a minute, but sometimes after as long as 3 minutes. Some initial debugging (recorded in /staff/core/omicron-6771/nexus-while-uploading.log) indicated that the connection between the CLI and the Nexus dropshot server was being dropped even before Nexus attempted to contact the pantry, or even did any local database lookups. @jgallagher added additional logging to the oxide CLI and I added some logging to Nexus to determine where the time was spent that resulted in the dropshot timeout occurring. Those logs are in:
From those logs we were able to determine a few things (some of this was in John's analysis above).
For example, the CLI log has:
Then, 15 seconds later (with other thread activity during this time):
To further determine where the 15 seconds was going, we gathered traffic from the Nexus zone using snoop and recorded it in the following files (in the same directory).
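(The captures were gathered with snoop along the lines of the sketch below; the datalink name and the host filter are placeholders for this particular setup.)

```
# Capture traffic to/from Nexus from inside the nexus zone (placeholder
# datalink name and address).
snoop -d net0 -o /tmp/omicron-6771.snoop host 192.168.1.20

# Inspect the capture later with verbose summaries.
snoop -i /tmp/omicron-6771.snoop -V | less
```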
Inspection of the snoop output showed us a large number of retries/retransmissions and what appeared to be dropped packets. This led us to look deeper into the network stack to determine what could be the cause of this issue.
I ran into this again, and after reading the very helpful analysis above I'm trying to figure out what to do with
I spent some time fishing through illumos'
I think @rmustacc and @rcgoodfellow discussed an approach here, but I'm not sure of the status of that?
This test failed on a CI run on "main":
https://github.com/oxidecomputer/omicron/runs/31088964323
Log showing the specific test failure:
https://buildomat.eng.oxide.computer/wg/0/details/01J9C1NYC3BNWDE017JZ1NT99K/A3qZeEFGXoXmvc5zGw6ECtKYko93V56qQ9uL79n9NYzLyoQK/01J9C1PBV7PARZAD3391DHDKRQ
Excerpt from the log showing the failure: