Failed to fully add a new sled on madrid #5111
The sled-agent log:
Nexus log from problem 1 is too big to attach to GitHub; it's at
Contents of
This is the sled-agent log from BRM44220001. Grepping for the Nexus source port, it looks like the request took just over 2 minutes:

    % grep :59386 BRM44220001-sled-agent.log | looker
    18:15:47.679Z INFO SledAgent (dropshot (SledAgent)): accepted connection
        file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:765
        local_addr = [fd00:1122:3344:101::1]:12345
        remote_addr = [fd00:1122:3344:102::5]:59386
    18:17:50.861Z WARN SledAgent (dropshot (SledAgent)): client disconnected before response returned
        file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:923
        local_addr = [fd00:1122:3344:101::1]:12345
        method = PUT
        remote_addr = [fd00:1122:3344:102::5]:59386
        req_id = dc8a5c33-eb99-472c-b9dc-357b82b14760
        uri = /sleds
I took a look at why we have so many inventory collections for only three Nexus instances. I think we're getting somewhat unlucky on timing. When I walked up I saw these collections:
(All collections have 8 errors related to their SPs; I believe this is known and expected.) I found that two Nexus instances always found few enough collections that there was nothing to prune. They're spitting out messages like this each time:
The last one is doing the pruning, spitting out messages like this:
I happened to catch a moment when there were only three:
By the time I'd typed the above query to print the time and ran the first one again, there were six again:
The timestamps reflect that (1) the Nexus instances are all starting and finishing their collections within a second or so of each other, and (2) my query above happened to run right when they were finishing a collection, which is when they go to insert it, which is when they prune. So I think it's simply that all three are ready to insert around the same time, one prunes it down to the expected 3, the other two check the count during this window, and then all three wind up inserting theirs. I think this is fine and not a bug. It's a little harder to be sure why there were six right away without doing more digging into the historical Nexus logs to see why they were activated.
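To make that race concrete, here is a simplified, hypothetical sketch (plain in-memory structures rather than the actual Nexus pruning query): because the prune step and the insert are separate operations, writers that all prune before any of them inserts can briefly leave more than the intended three collections behind.

```rust
use std::collections::BTreeMap;

const KEEP: usize = 3;

// Drop the oldest collections until only KEEP remain.
fn prune(collections: &mut BTreeMap<u64, &'static str>) {
    while collections.len() > KEEP {
        let oldest = *collections.keys().next().unwrap();
        collections.remove(&oldest);
    }
}

fn main() {
    // Steady state: three collections, keyed by a fake timestamp.
    let mut collections: BTreeMap<u64, &'static str> =
        (0..3u64).map(|t| (t, "old")).collect();

    // All three Nexus instances finish a collection at about the same time.
    // Each one prunes first (nothing to prune yet: len() == KEEP) ...
    for _ in 0..3 {
        prune(&mut collections);
    }
    // ... and only afterwards do the three inserts land.
    for (t, who) in [(10, "nexus-a"), (11, "nexus-b"), (12, "nexus-c")] {
        collections.insert(t, who);
    }

    // Briefly six collections; the next prune brings it back down to three.
    assert_eq!(collections.len(), 6);
    prune(&mut collections);
    assert_eq!(collections.len(), KEEP);
    println!("remaining: {:?}", collections);
}
```

This matches the observation above: nothing is lost or corrupted, the count just overshoots the target for one window until the next insert-and-prune pass.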
Small problem to add to the list: I don't see a "request completed" log entry for the sled agent request where Nexus gave up. Looking at dropshot's server.rs, it looks like that (along with USDT probe firing) happens in the context that got cancelled. That seems not great. I feel like we should move that stuff into the async context so that it always happens.
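As a rough illustration of the hazard (a toy tokio sketch, not dropshot's actual code): bookkeeping that lives inside the request future is silently skipped when that future is dropped on client disconnect, whereas bookkeeping moved onto its own spawned task runs to completion regardless.

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Completion logging inside the request future: it never runs if that future
// is dropped (e.g. the server cancels the handler when the client disconnects).
async fn handler_inline(req_id: u64) {
    sleep(Duration::from_millis(100)).await; // stand-in for the real handler work
    println!("[inline]  request {req_id} completed"); // skipped on cancellation
}

// Completion logging moved onto its own task: cancelling the outer future no
// longer cancels the bookkeeping.
async fn handler_spawned(req_id: u64) {
    let work = tokio::spawn(async move {
        sleep(Duration::from_millis(100)).await;
        println!("[spawned] request {req_id} completed"); // always runs
    });
    let _ = work.await;
}

#[tokio::main]
async fn main() {
    // Simulate a client that gives up after 10ms in both cases.
    let _ = timeout(Duration::from_millis(10), handler_inline(1)).await;
    let _ = timeout(Duration::from_millis(10), handler_spawned(2)).await;

    // Only the spawned variant's completion message shows up.
    sleep(Duration::from_millis(200)).await;
}
```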
It seems like a big part of the problem here was the 2+ minute request to add the sled.
So regarding that 2-minute window: here's a timeline of events from the Nexus log and the Sled Agent BRM44220001 log (not the new Sled):
I'm currently putting together a timeline from the new sled's log for the two-minute window to understand why that took so long. So far I've been able to determine that the bulk of the gap is between the call to
For the new sled (BRM42220004), the times are all on 1986-12-28 because it hasn't set up NTP yet. Here are some relevant log entries:
Then we have a gap of 1m55.8s. Then:
Inside … Still inside … So our gap is between those two lines of code. The only thing of substance that happens there is the
I don't see the "failed to get network config from bootstore" message anywhere. So it looks like we spent all this time in this one call: omicron/sled-agent/src/sled_agent.rs, lines 473 to 476 in 55ef8da.
That calls out to the bootstore task, but the implementation of that is synchronous, so I can only guess that the bootstore task was busy (or, more likely, blocked) doing something else for all that time? But I'm not sure what. @andrewjstone do you have any ideas here?
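For context on what that shape looks like in general (a hypothetical sketch of the request-channel-plus-oneshot pattern, not the actual omicron bootstore code): the caller sends a request and then awaits the reply, so if the task on the other end is busy or blocked, the caller just sits in that await for however long the task is occupied, with nothing in between to log or time out.

```rust
use std::time::{Duration, Instant};
use tokio::sync::{mpsc, oneshot};
use tokio::time::sleep;

// A request to the task: "give me the early network config", answered over a
// oneshot channel.
struct GetConfig {
    reply: oneshot::Sender<String>,
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<GetConfig>(8);

    // A bootstore-style task: it handles requests in order, so any
    // long-running step (here a stand-in 2s sleep) delays every queued caller.
    tokio::spawn(async move {
        sleep(Duration::from_secs(2)).await; // busy/blocked on something else
        while let Some(req) = rx.recv().await {
            let _ = req.reply.send("early network config".to_string());
        }
    });

    // The caller: from its point of view this is one opaque await that takes
    // as long as the task happens to be occupied.
    let (reply_tx, reply_rx) = oneshot::channel();
    tx.send(GetConfig { reply: reply_tx }).await.unwrap();
    let started = Instant::now();
    let config = reply_rx.await.unwrap();
    println!("got {config:?} after {:?}", started.elapsed());
}
```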
So, I don't think this is a bootstore issue. The bootstore itself, for better or worse, receives the network config well before reconfiguration starts. The entire gossip process is shown below.
The call to … Then, the bootstore node becomes a learner so it can participate in LRTQ, retrieves its key share, and persists its FSM state:
Then immediately, the node starts to retrieve key shares so that it can configure the U.2 drives. It does this for each drive. It only takes ~2ms to load the rack secret, but I'm thinking about caching it for up to a minute since all drives typically get loaded at once in the real trust quorum.
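A minimal sketch of that caching idea (hypothetical types; the real rack secret and its reconstruction are obviously not a `String` and a closure): remember when the secret was loaded and reuse it while it is younger than some TTL, so only the first disk of a batch pays the retrieval cost.

```rust
use std::time::{Duration, Instant};

/// Stand-in for the real rack secret type.
type RackSecret = String;

struct CachedSecret {
    secret: RackSecret,
    loaded_at: Instant,
}

struct SecretCache {
    ttl: Duration,
    cached: Option<CachedSecret>,
}

impl SecretCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, cached: None }
    }

    /// Return the cached secret if it is still fresh; otherwise reload it via
    /// the caller-provided closure (standing in for the real key-share
    /// retrieval and reconstruction) and cache the result.
    fn get_or_load(&mut self, load: impl FnOnce() -> RackSecret) -> &RackSecret {
        let stale = match &self.cached {
            Some(c) => c.loaded_at.elapsed() > self.ttl,
            None => true,
        };
        if stale {
            self.cached = Some(CachedSecret {
                secret: load(),
                loaded_at: Instant::now(),
            });
        }
        &self.cached.as_ref().unwrap().secret
    }
}

fn main() {
    let mut cache = SecretCache::new(Duration::from_secs(60));
    // First disk pays the ~2ms load; the next nine hit the cache.
    for disk in 0..10 {
        let _secret = cache.get_or_load(|| {
            println!("loading rack secret for disk {disk}");
            "rack-secret".to_string()
        });
    }
}
```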
The first disk, which I copied here, just happened to be the one with the problem that John reported. As far as I can tell, that's the last thing related to U.2 disk management and encryption that gets logged until the sled-agent starts.
I think it's still unclear why things are taking this long, and there doesn't appear to be any more info in the sled-agent log on the added node.
@andrewjstone pointed out in chat that the actual call to set the reservoir size is also in the gap and could also be what got stuck for so long.
We spent some time looking through the code to find other events that might tell us post hoc how long the reservoir operation took but we did not find anything. Instead, we decided to try an experiment: we bounced BRM42220004, quickly logged in on the console before sled agent started, and used DTrace to watch how long the reservoir operation took. (thanks to @rmustacc for the D here). The thinking here was: if it took a long time to do the reservoir operation when we added this sled, there's a good chance it might also take that long on a fresh reboot, since the sled presumably hadn't done all that much work prior to the first "add sled" operation. At the very least: if it did take a long time to do the reservoir operation after a fresh boot, it's pretty plausible that we saw the same behavior here.
Bingo -- that's 115s to do this operation. That's exactly the length of our gap. I also happened to catch it in the act:
Putting all that together, we have:
which of course matches the DTrace output. I think there are a few action items here:
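One illustrative mitigation for the "nothing in the log tells us how long it took" problem above (purely a sketch; not claiming this matches the elided action items): wrap the long-blocking step so its duration is always logged, whether or not anyone is still waiting on the result. `println!` stands in for real slog logging, and `set-vmm-reservoir` is just a label.

```rust
use std::time::{Duration, Instant};

/// Run an expensive, possibly long-blocking step and log how long it took, so
/// a post-hoc read of the log shows the duration even if whoever was waiting
/// on the result gave up long ago.
fn timed_step<T>(name: &str, step: impl FnOnce() -> T) -> T {
    println!("starting step: {name}");
    let started = Instant::now();
    let out = step();
    println!("step completed: {name} ({:?} elapsed)", started.elapsed());
    out
}

fn main() {
    // Stand-in for the reservoir-sizing call that took ~115s on real hardware.
    timed_step("set-vmm-reservoir", || {
        std::thread::sleep(Duration::from_millis(100));
    });
}
```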
FWIW, there were 2 periods where I ended up waiting on a collection:
In the second case, even though the sled had already started the NTP zone and successfully timesync'd, I was still waiting for an inventory collection because the planner won't add anything else to the sled until the latest collection shows the NTP zone. I'm not sure where we could trigger a collection here, though - if sled-agent told us in the response whether the
Looked through the Nexus logs, and I think this is a side effect of my workaround for problem 3 (which has already been fixed by #5113). I hot patched the auth check out and restarted each Nexus instance. Looking at just one (but similar logs are in all three), we see the SMF restart here:
Snipping down to just the relevant log lines, we then see the initial timeout-triggered (because
and then immediately afterwards we see the initial Nexus-triggered inventory collection:
These two collection IDs are two of the six that confused me, and each of the other two Nexuses did this same thing.
In #5111:

- the "add sled" Nexus external API call invokes `PUT /sleds` to some sled agent
- `PUT /sleds` itself blocks until the new sled's sled agent has started
- sled agent startup blocks on setting the reservoir
- on production hardware, setting the reservoir took 115s
- the default Progenitor (reqwest) timeout is only 15s

As a result, the "add sled" request failed, even though the operation ultimately succeeded. In this PR, I bump the timeout to 5 minutes. I do wonder if we should remove it altogether, or if we should consider the other changes mentioned in #5111 (like not blocking sled agent startup on this, or not blocking these API calls in this way). But for now, this seems like a low-risk way to improve this situation.
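For reference, the shape of that change looks roughly like the sketch below, assuming progenitor's usual generated `new_with_client` constructor; the real omicron call site may differ in its details (for example, it may also thread a logger through).

```rust
use std::time::Duration;

// Hypothetical helper: `sled_agent_client::Client` stands for the
// progenitor-generated sled-agent client.
fn long_timeout_sled_agent_client(baseurl: &str) -> sled_agent_client::Client {
    let reqwest_client = reqwest::ClientBuilder::new()
        // Setting the VMM reservoir took ~115s on real hardware, so the
        // default 15s request timeout is far too short for `PUT /sleds`.
        .timeout(Duration::from_secs(300))
        .connect_timeout(Duration::from_secs(15))
        .build()
        .expect("failed to build reqwest client");
    sled_agent_client::Client::new_with_client(baseurl, reqwest_client)
}
```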
Closing this as each problem in the top-level issue is either understood, fixed, or has its own separate issue:
I tried going through the "add sled" process today on madrid, with mixed results. I'm dropping my notes on problems I hit in this issue, and then may spawn additional issues / PRs for specific problems.
However, after another minute or two, the sled did successfully get added; it no longer appeared in `sled list-uninitialized`, and the sled-agent logs indicated successful self-reporting to Nexus. I'm not sure if we want to raise the timeout on this endpoint, or maybe split it into a start / poll-for-progress pair?

On sled-agent startup, we saw 10 U.2s and 2 M.2s as expected:
However, two of these U.2s induced "persistent errors" from sled-agent's storage manager, for different reasons:
Each of those errors appears in the sled-agent log twice (log attached).
The second of those errors seems particularly odd because when I looked at the sled later, the path it's complaining about did exist:
The specific error:
This check was added in #4989 to prevent non-saga/RPW endpoints from performing expensive database operations. Blueprint work will eventually be automated, but for now we're driving it via manual API calls. Dave suggests we change this check to only apply to the external API, for now. I hot patched the check out altogether to continue the experiment:
This was a matter of patience: the reconfigurator planner waits until the latest inventory collection shows the NTP zone, so even though the sled had already timesynced, I had to wait ~10 minutes for an inventory collection pass. I wonder if we should be able to trigger an inventory collection manually from `omdb`? Or perhaps the blueprint executor should trigger it if new zones were added? (That might require information from sled-agent about whether the `PUT` actually changed anything?)

I don't know why we got 5 here when there are only 3 Nexuses. Maybe we're triggering collections too aggressively, as even the latest one still didn't show the NTP zone's presence.
Roughly 10 minutes later, 3 new collections showed up, and the 2 oldest were purged (as expected).
I believe this is covered by item 2 above; the storage manager failed to bring in 2 disks, so sled-agent presumably only reported 8 to Nexus.
I started 10 instances after the new crucible zones were online. 3 of those instances landed on the new sled, and they all seemed to be healthy: I was able to SSH in, they had booted fine, I was able to create files, etc. However, none of the 10 new instances had disks backed on the new sled. The disks for the 10 new instances were:
`BRM42220004` is the serial of the new sled. It shows up as the owner of the propolis zone for 3 of the disks (`john-add-sled-{1,3,5}-disk`), but none of the crucible zones for any of the above. Poking around the code and CRDB, I believe this is because disk allocation looks at the `dataset` table, and there were no rows for the new crucible zones in that table:

These 29 rows match the 29 zpools from the 3 sleds used at RSS time. But in the `zpool` table, there are 37 rows for U.2s; the 29 above plus 8 from the new sled:
(sled-id `c2cfbfdd-df01-4788-abc1-c900226194f3` is the ID of `BRM42220004`, as expected by the gap in `time_created`.)
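To make that allocation observation concrete, here's a toy sketch (hypothetical in-memory structures, not the real CRDB query): if region allocation only draws candidates from zpools that have a `dataset` row, the new sled's zpools are never eligible, even though they exist in the `zpool` table.

```rust
use std::collections::HashSet;

fn main() {
    // Toy model: all zpools the control plane knows about, and the subset that
    // has a corresponding row in the `dataset` table.
    let zpools = ["old-sled-pool-1", "old-sled-pool-2", "new-sled-pool-1"];
    let dataset_rows: HashSet<&str> =
        ["old-sled-pool-1", "old-sled-pool-2"].into_iter().collect();

    // If candidates come only from zpools with dataset rows, the new sled's
    // pools never show up, matching the observed behavior (no new disks
    // backed by the new sled).
    let candidates: Vec<&str> = zpools
        .into_iter()
        .filter(|z| dataset_rows.contains(z))
        .collect();
    println!("allocation candidates: {candidates:?}");
}
```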