Blueprint execution is surprisingly slow #7217
On dublin with four sleds, running the same build as dogfood I see something similar:
Then, the breakdown has:
With the long poles being:
I think the sled-agent
There are calls where each of the three non-
Waiting until the sled has been up for longer, we start to see
Two questions:
I think the second question is easy to answer: the sled-agent
I'm less sure about the first question. We're ensuring roughly 40-50 datasets per sled (4 per disk for
That seems like a long time, but maybe it's reasonable? We do have to shell out to ZFS at least twice even for datasets that already exist (to confirm they exist and that they have the correct UUID property).
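To put rough numbers on that: at 40-50 datasets per sled and at least two `zfs` invocations each, that's on the order of 80-100 process spawns per sled even when nothing needs to change. A minimal sketch of that per-dataset shape; the helper and the property name are illustrative, not the actual sled-agent code:

```rust
use std::process::Command;

/// Roughly the per-dataset cost described above: even for a dataset that
/// already exists and is correct, we exec `zfs` twice just to verify it.
/// (The property name is illustrative, not necessarily what sled-agent uses.)
fn verify_existing_dataset(name: &str) -> std::io::Result<bool> {
    // 1. Does the dataset exist at all?
    let exists = Command::new("zfs")
        .args(["list", "-H", "-o", "name", name])
        .output()?
        .status
        .success();

    // 2. Does it carry the expected UUID property?
    let has_uuid = Command::new("zfs")
        .args(["get", "-H", "-o", "value", "oxide:uuid", name])
        .output()?
        .status
        .success();

    Ok(exists && has_uuid)
}
```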
Might it be possible to issue one
Major part of #7217

Optimizations made:
- `zfs get` is invoked up front, so properties are sampled for all datasets of interest ahead of time. No subsequent processes are exec'd for datasets that need no changes.
- The "dataset ensure" process is now concurrent.

These optimizations should significantly improve the latency of the "no (or few) changes necessary" cases.

Optimizations still left to be made:
- Making blueprint execution concurrent, rather than blocking one sled at a time
- Patching the illumos_utils `Zfs::ensure_filesystem` to re-use the pre-fetched properties and minimize re-querying
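A rough sketch of the shape of those first two optimizations: one batched `zfs get` covering every dataset of interest, then a concurrent ensure pass over only the datasets that still need work. The types, helper names, and property name below are stand-ins, not the actual omicron APIs:

```rust
use std::process::Command;

use futures::stream::{self, StreamExt};

/// Stand-in for the real dataset config type.
struct DatasetConfig {
    name: String,
}

/// One `zfs get` invocation covering every dataset of interest, instead of
/// one exec per dataset. (The property name is illustrative.)
fn fetch_properties(datasets: &[DatasetConfig]) -> std::io::Result<String> {
    let mut cmd = Command::new("zfs");
    cmd.args(["get", "-Hp", "-o", "name,property,value", "oxide:uuid"]);
    for d in datasets {
        cmd.arg(&d.name);
    }
    let output = cmd.output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

/// Ensure only the datasets that still need changes, with bounded concurrency.
async fn ensure_datasets_concurrently(needs_work: Vec<DatasetConfig>) {
    stream::iter(needs_work)
        .for_each_concurrent(Some(16), |d| async move {
            // Placeholder for the real per-dataset ensure; in omicron this
            // would be the Zfs::ensure_filesystem path mentioned above.
            println!("ensuring {}", d.name);
        })
        .await;
}
```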
I expect #7236 to mitigate the performance issues in the common cases, but would still be interested in keeping an eye on this in dogfood. Keeping this open in case further optimizations are warranted.
Checking in on dogfood (which has #7236), things have improved significantly; the most recent execution was ~95 seconds:
almost all of which was putting zones:
It looks like the zones handler ensures all its datasets one by one, which bypasses the batch improvements made in #7236 (see the sketch after this comment): omicron/sled-agent/src/sled_agent.rs, lines 955 to 971 in ca21fe7
My gut feeling is that this isn't worth optimizing, though; we landed-then-reverted #7006, which removes this loop entirely, and we want to put that back once we can. (Is that still R12 despite #7229?)
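For reference, this is roughly the contrast between the per-zone loop referenced above and a call into the batched path. The types and function names are placeholders, not the real sled-agent interfaces:

```rust
/// Hypothetical stand-in for a zone request that may carry a dataset.
struct ZoneRequest {
    dataset: Option<String>,
}

async fn ensure_one_dataset(_name: &str) {
    // Placeholder for the single-dataset ensure path.
}

async fn ensure_datasets(_names: &[String]) {
    // Placeholder for the batched ensure path optimized in #7236.
}

/// Roughly the loop referenced above: one ensure call per zone dataset,
/// paying the per-dataset cost on every iteration.
async fn ensure_per_zone(zones: &[ZoneRequest]) {
    for zone in zones {
        if let Some(name) = &zone.dataset {
            ensure_one_dataset(name).await;
        }
    }
}

/// Batched alternative: gather every dataset first, then make one call into
/// the bulk path so properties only need to be fetched once.
async fn ensure_batched(zones: &[ZoneRequest]) {
    let names: Vec<String> = zones
        .iter()
        .filter_map(|z| z.dataset.clone())
        .collect();
    ensure_datasets(&names).await;
}
```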
Thanks for the great teamwork here @smklein @jgallagher
Poking around on dogfood, I noted blueprint execution was reporting success but taking several minutes:
Looking at the status report for one execution, it's the three steps that send disks, datasets, and zones to each sled agent that are taking almost all of the time (steps 3-5):
Grepping out the start and stop for each `PUT` request for the "Deploy physical disks" step in one execution, we don't have one outlying sled; rather, every sled takes a handful of seconds (which is surprising!), and we don't currently parallelize this step:
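Since that step currently runs one sled at a time, one follow-up would be to fan the per-sled requests out concurrently with a bounded limit. A minimal sketch under that assumption, using placeholder names rather than the real blueprint-execution code:

```rust
use futures::stream::{self, StreamExt, TryStreamExt};

/// Placeholder for the per-sled PUT (e.g. deploying physical disks).
async fn put_disks(sled: &str) -> Result<(), String> {
    println!("PUT physical disks -> {sled}");
    Ok(())
}

/// Issue the per-sled requests concurrently instead of one sled at a time,
/// with a small cap so we don't hit every sled-agent at once.
async fn deploy_disks_to_all_sleds(sleds: &[String]) -> Result<(), String> {
    stream::iter(sleds)
        .map(|sled| put_disks(sled))
        .buffer_unordered(8)
        .try_collect::<Vec<()>>()
        .await?;
    Ok(())
}
```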