"add sled" needs a longer timeout #5116

davepacheco · 2024-02-22T05:52:41Z

- the "add sled" Nexus external API call invokes PUT /sleds to some sled agent
- PUT /sleds itself blocks until the new sled's sled agent has started
- sled agent startup blocks on setting the reservoir
- on production hardware, setting the reservoir took 115s
- the default Progenitor (reqwest) timeout is only 15s

As a result, the "add sled" request failed, even though the operation ultimately succeeded.

In this PR, I bump the timeout to 5 minutes. I do wonder if we should remove it altogether, or if we should consider the other changes mentioned in #5111 (like not blocking sled agent startup on this, or not blocking these API calls in this way). But for now, this seems like a low-risk way to improve this situation.
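To make the failure mode concrete, here is a minimal, std-only sketch. It is not the Nexus or sled agent code: `simulate_request` and `CallerView` are hypothetical names, and the durations are meant to be scaled down (for example 1s to 1ms) so the sketch runs quickly. It shows a caller giving up on a blocking call while the work on the other side still completes, and the same call succeeding once the caller's timeout exceeds the work time.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// What the caller (the client side of the request) observed.
#[derive(Debug, PartialEq)]
enum CallerView {
    Completed,
    TimedOut,
}

/// Hypothetical stand-in for a blocking request such as PUT /sleds.
/// The "server" thread works for `work`; the caller waits at most `timeout`.
/// Returns the caller's view and whether the server-side work finished.
fn simulate_request(work: Duration, timeout: Duration) -> (CallerView, bool) {
    let (tx, rx) = mpsc::channel();

    let server = thread::spawn(move || {
        // Stands in for sled agent startup blocking on the reservoir.
        thread::sleep(work);
        // The caller may already have given up; the work is done regardless.
        let _ = tx.send(());
        true
    });

    // The caller stops waiting once `timeout` has elapsed.
    let view = match rx.recv_timeout(timeout) {
        Ok(()) => CallerView::Completed,
        Err(_) => CallerView::TimedOut,
    };

    // Wait for the server thread so we can report whether it finished.
    let server_finished = server.join().unwrap_or(false);
    (view, server_finished)
}
```

With work scaled to 115ms, a 15ms timeout yields a timed-out caller even though the work finishes, while a 300ms timeout (the scaled equivalent of 5 minutes) lets the caller see completion.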

andrewjstone · 2024-02-22T05:57:34Z

Thanks for the quick fix @davepacheco

davepacheco · 2024-02-22T06:02:39Z

I'd like to test this, ideally on real hardware but maybe even just in the testbed just to make sure it didn't somehow break anything. I'm out for a lot of tomorrow but if it's useful for this to be landed, feel free to do it!

andrewjstone · 2024-02-22T06:08:13Z

> I'd like to test this, ideally on real hardware but maybe even just in the testbed just to make sure it didn't somehow break anything. I'm out for a lot of tomorrow but if it's useful for this to be landed, feel free to do it!

I can give it a test on testbed if John doesn't take it for a spin on madrid first. Seems relatively innocuous, but worth testing.

jgallagher · 2024-02-22T13:03:11Z

> I can give it a test on testbed if John doesn't take it for a spin on madrid first. Seems relatively innocuous, but worth testing.

I'm going to hold off on another madrid run until we get more of #5111 knocked down. 👍 on giving this a testbed spin, but agreed it looks good.

andrewjstone · 2024-02-22T18:12:01Z

Tested out on testbed. The add works fine. Node gets added to the sled table in CRDB and bootstore learned its share on the added node (which is a prereq to it showing up in CRDB).

andrewjstone · 2024-02-22T18:13:37Z

Test failure also seems like a fluke. It's definitely unrelated to this code.

| curl: (22) The requested URL returned error: 404
| cp: cannot access /tmp/opteadm

"add sled" needs a longer timeout

717e6b0

davepacheco requested review from andrewjstone and jgallagher February 22, 2024 05:52

add a comment

f30d56b

andrewjstone approved these changes Feb 22, 2024

jgallagher approved these changes Feb 22, 2024

smklein mentioned this pull request Feb 22, 2024

Setting VMM Reservoir Takes a While - How do we cope? #5121

Closed

andrewjstone merged commit a5be09f into main Feb 22, 2024
20 checks passed

andrewjstone deleted the dap/add-sled-timeout branch February 22, 2024 19:06

davepacheco mentioned this pull request Feb 23, 2024

"sled add" could be more asynchronous #5132

Open

jgallagher mentioned this pull request Feb 23, 2024

Failed to fully add a new sled on madrid #5111

Closed
