Tracking: faster cluster troubleshooting and recovery #18058
For 1: there are tools to auto-generate Bazel build files from Cargo: https://bazelbuild.github.io/rules_rust/crate_universe.html. This could also improve developer productivity, as PR checks and developer image building can be sped up.
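For reference, the WORKSPACE wiring described on that rules_rust page looks roughly like the sketch below; the repository name `crate_index` and the lockfile paths are placeholders for illustration, not anything that exists in this repo.

```starlark
# Rough sketch of crate_universe wiring, following the rules_rust docs linked above.
# `crate_index` and the lockfile paths are placeholder names.
load("@rules_rust//crate_universe:defs.bzl", "crates_repository")

crates_repository(
    name = "crate_index",
    cargo_lockfile = "//:Cargo.lock",
    # crate_universe keeps its own lockfile so repository generation stays reproducible.
    lockfile = "//:cargo-bazel-lock.json",
    manifests = ["//:Cargo.toml"],
)

load("@crate_index//:defs.bzl", "crate_repositories")

crate_repositories()
```

Bazel targets would then depend on third-party crates through labels like `@crate_index//:tokio`, which is what lets Bazel cache and parallelize builds at the crate level.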
May require a non-trivial amount of work. We use some non-standard things like crate patching for madsim. Another idea that was shared is to simply add a new Buildkite workflow that uses larger machine sizes. @huangjw806 can comment on that; do you know if this work is already in progress?
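For anyone unfamiliar, "crate patching" here refers to Cargo `[patch]` overrides like the purely illustrative one below (the crate name and URL are invented; the actual madsim-related patches in the workspace differ), which a Bazel migration would need to reproduce:

```toml
# Illustrative only: a Cargo [patch] override replacing a crates.io dependency
# with a forked/patched source. The crate name and URL here are made up.
[patch.crates-io]
some-crate = { git = "https://github.com/example/some-crate", branch = "madsim" }
```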
IIRC, a significant contributing factor to the compile time (up to ~8 min) is the final linking step (specifically LTO), which can barely be sped up with caching. That said, we currently have no idea how much LTO actually improves performance in production. Also, I believe @xxchan is highly experienced in this area; insights would be appreciated. 🙏
The docker build pipeline needs to build both x86 and arm images. Our pipeline jobs run concurrently, so the image upload time depends on the slowest job. Previously the arm job was much slower than the x86 job, but after I upgraded the arm instance type they take roughly the same time. In addition, since the cloud environment now runs on arm machines, for an urgent patch we may only need to build the arm image.
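As a rough sketch of the arm-only path for an urgent patch (assuming the pipeline uses `docker buildx`; the image tag and Dockerfile path below are placeholders):

```bash
# Build and push only the linux/arm64 image for an urgent patch.
# Tag and Dockerfile path are placeholders, not the real pipeline values.
docker buildx build \
  --platform linux/arm64 \
  -t ghcr.io/example/risingwave:urgent-patch-arm64 \
  -f docker/Dockerfile \
  --push \
  .
```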
Removing LTO for patch builds sounds good to me.
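A minimal sketch of what that could look like, assuming we add a dedicated Cargo profile for patch builds (the profile name is hypothetical; the real profiles in the workspace may be organized differently):

```toml
# Hypothetical profile for patch builds: inherit release but drop LTO to cut link time.
[profile.patch-release]
inherits = "release"
lto = "off"   # or "thin" as a middle ground between link time and runtime performance
```

The patch pipeline would then build with `cargo build --profile patch-release`, while the normal release profile stays untouched.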
Benchmarking image builds on different instance specs also sounds worth trying to me.
Maybe we can use a larger arm instance type for the arm docker build, but it is difficult to upgrade the x86 instance type: we would need to create a new AWS CloudFormation stack, because Buildkite can only select machines by stack, not by instance type. The x86 stack is used not only for the docker build but also for other CI pipelines, so upgrading its instance type would cost more.
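For context, a Buildkite step can only pick its machines through agent/queue targeting, which is what maps to a stack, e.g. (the queue name and script path below are made up):

```yaml
# Hypothetical Buildkite step pinned to a dedicated queue backed by a larger stack.
steps:
  - label: "docker-build-arm"
    command: "ci/scripts/docker-build-arm.sh"
    agents:
      queue: "arm-docker-build-large"
```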
What is the cost of having another dedicated x86 stack for the docker build only? Is it even possible?
A new CloudFormation stack would need to be created; the maintenance cost would increase a bit, but the monetary cost would not increase much. In fact, for urgent patches, is it enough to just build the arm image?
A larger stack just for the arm image LGTM.
To summarize:
I suppose this won't be a problem for patching on the release branch, and we can still benefit from it. 🤔
Yes, I think it's still a direction worth exploring (to see how technologies incl