Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"add sled" needs a longer timeout #5116

Merged
merged 2 commits into from
Feb 22, 2024
Merged

"add sled" needs a longer timeout #5116

merged 2 commits into from
Feb 22, 2024

Conversation

davepacheco
Copy link
Collaborator

In #5111:

  • the "add sled" Nexus external API call invokes PUT /sleds to some sled agent
  • PUT /sleds itself blocks until the new sled's sled agent has started
  • sled agent startup blocks on setting the reservoir
  • on production hardware, setting the reservoir took 115s
  • the default Progenitor (reqwest) timeout is only 15s

So as a result, the "add sled" request failed, even though the operation ultimately succeeded.

In this PR, I bump the timeout to 5 minutes. I do wonder if we should remove it altogether, or if we should consider the other changes mentioned in #5111 (like not blocking sled agent startup on this, or not blocking these API calls in this way). But for now, this seems like a low-risk way to improve this situation.

@andrewjstone
Copy link
Contributor

Thanks for the quick fix @davepacheco

@davepacheco
Copy link
Collaborator Author

I'd like to test this, ideally on real hardware but maybe even just in the testbed just to make sure it didn't somehow break anything. I'm out for a lot of tomorrow but if it's useful for this to be landed, feel free to do it!

@andrewjstone
Copy link
Contributor

I'd like to test this, ideally on real hardware but maybe even just in the testbed just to make sure it didn't somehow break anything. I'm out for a lot of tomorrow but if it's useful for this to be landed, feel free to do it!

I can give it a test on testbed if John doesn't take it for a spin on madrid first. Seems relatively innocuous, but worth testing.

@jgallagher
Copy link
Contributor

I can give it a test on testbed if John doesn't take it for a spin on madrid first. Seems relatively innocuous, but worth testing.

I'm going to hold off on another madrid run until we get more of #5111 knocked down. 👍 on giving this a testbed spin, but agreed it looks good.

@andrewjstone
Copy link
Contributor

Tested out on testbed. The add works fine. Node gets added to the sled table in CRDB and bootstore learned its share on the added node (which is a prereq to it showing up in CRDB).

@andrewjstone
Copy link
Contributor

Test failure also seems like a fluke. It's definitely unrelated to this code.

| curl: (22) The requested URL returned error: 404
| cp: cannot access /tmp/opteadm

@andrewjstone andrewjstone merged commit a5be09f into main Feb 22, 2024
20 checks passed
@andrewjstone andrewjstone deleted the dap/add-sled-timeout branch February 22, 2024 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants