TensorFlow's CI tests sometimes fail for no apparent reason and have to be rerun, and each round takes approximately 4 hours. Are there any improvements we could make to optimize this process?
The main reason is that a few CI nodes run too many unit tests at once, which causes the ROCm driver to drop.
I think we could take the same approach as the XLA CI and split TF's unit tests across two parallel pipelines.
Right now TF's CI (pycpp) triggers the job via a bazelrc file.
We need a good way to separate the unit tests across bazelrc configs (maybe create another one?), so that each individual CI pipeline can read its own config and run its share of the job in parallel (see the sketch below).
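For illustration, here is a minimal shell sketch of one way to do the split, using a hypothetical run_pycpp_shard.sh script. The query pattern, the --config name, and the split-by-count approach are assumptions, not the actual pycpp setup:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: split the pycpp test targets into two halves so that
# two CI pipelines can each run one half in parallel. The query pattern and
# --config name are assumptions; the real filters would come from the
# existing pycpp bazelrc.
set -euo pipefail

SHARD="${1:?usage: run_pycpp_shard.sh <1|2>}"

# Collect every test target the pycpp job would normally run.
bazel query 'tests(//tensorflow/...)' > /tmp/all_tests.txt

TOTAL=$(wc -l < /tmp/all_tests.txt)
HALF=$(( (TOTAL + 1) / 2 ))

if [[ "$SHARD" == "1" ]]; then
  head -n "$HALF" /tmp/all_tests.txt > /tmp/shard_tests.txt
else
  tail -n +"$(( HALF + 1 ))" /tmp/all_tests.txt > /tmp/shard_tests.txt
fi

# Each pipeline runs only its half of the targets.
bazel test --config=rocm --target_pattern_file=/tmp/shard_tests.txt
```

Splitting by target count won't balance wall-clock time perfectly, but it keeps each pipeline's bazelrc/config untouched.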
I'm not sure that will help. We may be able to throttle the test rate. At the moment the jobs use --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute, and the env vars TF_TESTS_PER_GPU, N_TEST_JOBS, and TF_GPU_COUNT are used there. The idea is to run one test per GPU.
It checks rocm-smi to get the GPU count. There have been changes in the rocm-smi output lately, but I have already fixed the scripts to handle them.
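For context, here is a heavily simplified sketch of what such a --run_under throttling wrapper looks like. This is not the actual parallel_gpu_execute script; the lock-file paths and default values are assumptions for illustration only:

```bash
#!/usr/bin/env bash
# Heavily simplified sketch of a --run_under throttling wrapper.
# NOT the actual parallel_gpu_execute script: lock-file paths and the
# default values below are assumptions.
set -euo pipefail

# In the real scripts TF_GPU_COUNT is derived from rocm-smi (whose output
# format has changed across versions); here we just fall back to a default.
TF_GPU_COUNT="${TF_GPU_COUNT:-4}"
TF_TESTS_PER_GPU="${TF_TESTS_PER_GPU:-1}"
TOTAL_SLOTS=$(( TF_GPU_COUNT * TF_TESTS_PER_GPU ))

slot=0
while true; do
  lockfile="/tmp/tf_gpu_test_lock_${slot}"
  exec {fd}>"$lockfile"
  if flock -n "$fd"; then
    gpu=$(( slot / TF_TESTS_PER_GPU ))
    export HIP_VISIBLE_DEVICES="$gpu"
    # Run the actual test pinned to a single GPU; the inherited lock fd
    # keeps the slot reserved until the test process exits.
    exec "$@"
  fi
  exec {fd}>&-   # slot busy: release the fd and try the next one
  slot=$(( (slot + 1) % TOTAL_SLOTS ))
  sleep 1
done
```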
There is another issue: our TensorFlow CI takes a long time to finish all the unit tests. We think we could split the unit tests in half across two parallel CI jobs to reduce the CI time, as shown below.
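If we go that route, the two pipelines would just invoke different halves, e.g. with the hypothetical shard script sketched above:

```bash
# Pipeline 1 (hypothetical CI step)
./run_pycpp_shard.sh 1

# Pipeline 2 (hypothetical CI step)
./run_pycpp_shard.sh 2
```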