gcDrain cost when building AOSP 13 with reproxy #26
@Ruoye-W Could you share your configs?
I am assuming the primary question from your description is:
In Android Platform builds particularly, there's
@gkousik I set the build concurrency to 96 (the buildfarm cluster has 144 cores) and set the RBE_unified_cas_ops option to true. With the compilation cache disabled and no dependency files in the remote cache, remote execution becomes increasingly slow. When the build reaches about 60%, the terminal running the AOSP build shows tasks delayed by over 50 minutes, and eventually no tasks are executed at all: the cluster receives fewer and fewer compilation requests, and ultimately none arrive from RBE (Remote Build Execution). On a second attempt, with no files left to upload remotely and the compilation cache still disabled, remote compilation is much faster and the cluster handles the tasks without any abnormal slowdown; that build completed quickly.
I added some logging for troubleshooting with the RBE_unified_cas_ops option set to true, and found that the issue was caused by some compilation tasks that sent artifact download requests after remote compilation had finished. These tasks send a download request to the local download-processor goroutine and then wait for the download response. The requests were continuously downloading artifacts, and it is unclear whether the downloads were timing out or failing. As a result, the AOSP build system kept waiting for these artifacts, the number of parallel tasks decreased over time, and eventually the build became completely stuck.
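The blocking pattern described above can be sketched as a minimal Go program. The type and function names here (`downloadReq`, `downloadProcessor`) are illustrative, not reproxy's actual identifiers; the point is that if the shared processor goroutine ever fails to write a response, the requesting task blocks indefinitely:

```go
package main

import (
	"fmt"
	"time"
)

// downloadReq models a task asking the shared download processor for an
// artifact; resp is where the processor is expected to write the result.
type downloadReq struct {
	path string
	resp chan error
}

// downloadProcessor simulates the single goroutine serving all tasks.
// If it drops a request (e.g. a failed stream with no error written back),
// the requester blocks forever on <-req.resp.
func downloadProcessor(reqs <-chan downloadReq) {
	for req := range reqs {
		if req.path == "lost.o" {
			continue // bug: no response is ever written for this request
		}
		req.resp <- nil // success
	}
}

func main() {
	reqs := make(chan downloadReq)
	go downloadProcessor(reqs)

	ok := downloadReq{path: "a.o", resp: make(chan error, 1)}
	reqs <- ok
	fmt.Println("a.o:", <-ok.resp)

	lost := downloadReq{path: "lost.o", resp: make(chan error, 1)}
	reqs <- lost
	select {
	case err := <-lost.resp:
		fmt.Println("lost.o:", err)
	case <-time.After(200 * time.Millisecond):
		fmt.Println("lost.o: still waiting (the task would hang)")
	}
}
```

The `select` in `main` is only there so the demo terminates; a task waiting with a bare `<-lost.resp` would never return, matching the observed hang.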
Regarding the issue above: when the build got stuck, the corresponding tasks were found to be waiting for compilation artifacts to be downloaded. This could be caused by batch-download or streaming-download timeouts, or by other errors that prevented the downloadProcessor from writing a downloadResponse. Each task waits for all of its download responses in a loop, decrementing a counter; if the counter never reaches zero, the loop never exits. Since every task waits for all of its artifacts to finish downloading, perhaps a timeout limit could be set to avoid looping forever in for count > 0.
I am using the open-source reproxy (v0.117.1) on a 96-core machine to build AOSP 13; network bandwidth is 200 Mbps and local cloud-disk IO is 180 MB/s. Bazel-remote is used as the remote cache and buildfarm as the remote execution service. However, I have encountered a problem where the build progress gets stuck at 3% for a long time, causing a delay of over 20 minutes with no change in progress. Monitoring of the remote cache shows that reads/writes of files larger than 1 MB become slower and slower, eventually taking more than 20 seconds per write, and writing the clang compiler toolchain binary took 49 s! This likely causes the client to retry uploads on timeout, and because file writes take so long, the client also retries FindMissingBlobs calls on timeout. It seems that too many goroutines are waiting on the cas semaphore.
From reproxy.INFO when a cpp task has been stuck for over 30 minutes:
Resource Usage: map[CPU_pct:0 MEM_RES_mbs:4957 MEM_VIRT_mbs:15485 MEM_pct:1 PEAK_NUM_ACTIOINS:0]
To address this issue I set unified_cas_ops to true, but the problem still occurs intermittently. When the build was able to continue, I took a pprof sample and found that runtime.gcDrain accounts for about 60% of the CPU samples at around 60% build completion. This does not seem normal. Has anyone else encountered this issue?
In addition, AOSP seems to default to an RBE (Remote Build Execution) concurrency level of 500, and setting "m -j32" doesn't seem to have any effect on it. Do you know of any other way to reduce the task traffic reproxy receives? Thanks very much!