
Tracking: faster cluster troubleshooting and recovery #18058

Open
1 of 2 tasks
kwannoel opened this issue Aug 16, 2024 · 14 comments

@kwannoel
Contributor

kwannoel commented Aug 16, 2024

@kwannoel kwannoel self-assigned this Aug 16, 2024
@github-actions github-actions bot added this to the release-2.0 milestone Aug 16, 2024
@kwannoel kwannoel assigned kwannoel and unassigned kwannoel Aug 16, 2024
@kwannoel kwannoel added type/tracking Tracking issue. and removed type/feature labels Aug 16, 2024
@wjf3121

wjf3121 commented Aug 16, 2024

For 1 (Speed up docker build), we may consider moving to a hermetic build system with remote cache support, like Bazel, in CI to speed up the build (AFAIK cargo itself doesn't have good support for this). Usually for a patch build only a small portion of the code is changed, so the majority of the build can use the cache. A rough guess is that with a remote cache we can reduce the build time to within 5 minutes for most patch builds.

There are tools to auto-generate Bazel build files from cargo: https://bazelbuild.github.io/rules_rust/crate_universe.html

This could also improve developer productivity as PR checks & developer image building can be sped up.

cc @lmatz @huangjw806 @cyliu0
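
For a concrete picture of what this suggestion involves: rules_rust's crate_universe can generate Bazel targets from the existing Cargo metadata. A minimal sketch (the repository name `crates` and the cache endpoint are placeholders, and the rules_rust bootstrap setup is omitted):

```python
# WORKSPACE (sketch only)
load("@rules_rust//crate_universe:defs.bzl", "crates_repository")

# Generate one Bazel repository per crate from the checked-in Cargo files.
crates_repository(
    name = "crates",                   # placeholder name
    cargo_lockfile = "//:Cargo.lock",  # reuse the existing lockfile
    manifests = ["//:Cargo.toml"],     # workspace manifest(s)
)

load("@crates//:defs.bzl", "crate_repositories")
crate_repositories()
```

The remote cache itself is then a `.bazelrc` flag, e.g. `build --remote_cache=grpcs://cache.example.com` (endpoint hypothetical).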

@kwannoel
Contributor Author

For 1 (Speed up docker build), we may consider moving to a hermetic build system with remote cache support, like Bazel, in CI to speed up the build (AFAIK cargo itself doesn't have good support for this). Usually for a patch build only a small portion of the code is changed, so the majority of the build can use the cache. A rough guess is that with a remote cache we can reduce the build time to within 5 minutes for most patch builds.

There are tools to auto-generate Bazel build files from cargo: https://bazelbuild.github.io/rules_rust/crate_universe.html

This could also improve developer productivity as PR checks & developer image building can be sped up.

cc @lmatz @huangjw806 @cyliu0

May require a non-trivial amount of work. We use some non-standard things like crate patching for madsim.
Hermetic build systems are also more work to maintain. Will require some investigation.

Another idea that was shared is simply to add a new Buildkite workflow that uses larger machine sizes. @huangjw806 can comment on that; do you know if this work is already in progress?

@BugenZhao
Member

Usually for a patch build only a small portion of the code is changed, so the majority of the build can use the cache. A rough guess is that with a remote cache we can reduce the build time to within 5 minutes for most patch builds.

IIRC, a significant contributor to the compile time (up to ~8 min) is the final linking step (specifically LTO), which can barely be sped up by caching. That said, we currently have no idea whether LTO improves performance much in production.

Also, I believe @xxchan is highly experienced in this area. Kindly requesting your insights. 🙏

@xxchan
Member

xxchan commented Aug 19, 2024

Previous attempt to speed up docker build: #12193

But

# TODO: cargo-chef doesn't work well now, because we update Cargo.lock very often.
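
For reference, the cargo-chef pattern that attempt used looks roughly like this (a sketch, not RisingWave's actual Dockerfile; base image and version are placeholders). The expensive dependency-compilation layer is cached only while the "recipe" derived from Cargo.toml/Cargo.lock is unchanged, which is why frequent Cargo.lock updates defeat the cache:

```dockerfile
FROM rust:1.80-slim AS chef
RUN cargo install cargo-chef
WORKDIR /app

FROM chef AS planner
COPY . .
# Produce a dependency-only "recipe" from Cargo.toml / Cargo.lock.
RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder
COPY --from=planner /app/recipe.json recipe.json
# This layer compiles all dependencies; it is reused only while
# recipe.json is unchanged, so frequent Cargo.lock bumps invalidate it.
RUN cargo chef cook --release --recipe-path recipe.json
COPY . .
RUN cargo build --release
```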

@kwannoel kwannoel modified the milestones: release-2.0, release-2.1 Aug 19, 2024
@huangjw806
Contributor

huangjw806 commented Aug 19, 2024

Speed up docker build.

The docker build pipeline needs to build x86 and arm images. Our pipeline jobs run concurrently, so the image upload time depends on the slowest job. Previously, the arm job was much slower than the x86 job, but now it is basically the same after I upgraded the arm instance type.

In addition, since the cloud environment now runs on arm machines, if we need an urgent patch, maybe we only need to build the arm image.
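
If we go with the arm-only route for urgent patches, it could be as simple as restricting the buildx platform list to a single target (image tag below is a placeholder):

```shell
# Build and push only the arm64 variant instead of a multi-arch manifest.
docker buildx build --platform linux/arm64 \
  -t ghcr.io/example/risingwave:patch-arm64 --push .
```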

@kwannoel
Contributor Author

IIRC, a significant contributor to the compile time (up to ~8 min) is the final linking step (specifically LTO), which can barely be sped up by caching. That said, we currently have no idea whether LTO improves performance much in production.

Sounds good to me to remove LTO for patch builds.
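
One way to do this without touching the normal release profile is a custom Cargo profile (the profile name `patch-release` is an assumption for illustration):

```toml
# Cargo.toml: a release-like profile without LTO for faster patch builds.
[profile.patch-release]
inherits = "release"
lto = "off"   # or "thin" as a middle ground between build time and runtime perf

# build with: cargo build --profile patch-release
```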

@xxchan
Member

xxchan commented Aug 19, 2024

use larger machine sizes

This also sounds worth trying to me. We can try to benchmark building images on different instance specs.

@huangjw806
Contributor

use larger machine sizes

Maybe we can use a larger arm instance type for the arm docker build, but it is difficult to upgrade the instance type of x86, since we would need to create a new AWS CloudFormation stack. This is because Buildkite can only select machines by stack, not by instance type. The x86 stack is not only used for docker build, but also for other CI pipelines. If the instance type is upgraded, it will cost more.

@lmatz
Contributor

lmatz commented Aug 19, 2024

The x86 stack is not only used for docker build, but also for other CI pipelines. If the instance type is upgraded, it will cost more.

What is the cost of having another dedicated x86 stack for docker build only? Is it even possible?

@huangjw806
Contributor

What is the cost of having another dedicated x86 stack for docker build only? Is it even possible?

A new CloudFormation stack would need to be created, and the maintenance cost will increase a bit, but the monetary cost won't increase much. In fact, for urgent patches, is it enough to just build the arm image?

@kwannoel
Contributor Author

What is the cost of having another dedicated x86 stack for docker build only? Is it even possible?

A new CloudFormation stack would need to be created, and the maintenance cost will increase a bit, but the monetary cost won't increase much. In fact, for urgent patches, is it enough to just build the arm image?

A larger stack just for the arm image LGTM.

@kwannoel
Contributor Author

To summarize:

  • @huangjw806 provision a dedicated stack for arm image.
  • @kwannoel add a dedicated pipeline which does not use LTO.
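
A sketch of what the dedicated pipeline step might look like in Buildkite (the queue name, image tag, and build argument are all hypothetical):

```yaml
steps:
  - label: "docker-build-arm (patch, no LTO)"
    agents:
      queue: "docker-arm-large"   # hypothetical dedicated larger-instance stack
    command:
      - "docker buildx build --platform linux/arm64 -t <image>:patch --push ."
```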

@BugenZhao
Member

but
TODO: cargo-chef doesn't work well now, because we update Cargo.lock very often.

I suppose this won't be a problem regarding patching on the release branch and we can still benefit from it. 🤔

@xxchan
Member

xxchan commented Aug 30, 2024

Yes, I think it's still a direction worth exploring (to see how technologies including cargo-chef can work best), but I don't have the bandwidth for the trial and error currently. I'd be glad to help if someone else wants to try it. 😄


No branches or pull requests

6 participants