TensorFlow's CI tests sometimes fail for no apparent reason and have to be rerun, and each round takes approximately 4 hours. Are there any improvements we could make to optimize this process?
The main reason is that a few CI nodes run too many unit tests at once, which causes the ROCm driver to drop.
I think we could take the same approach as the XLA CI and split TF's unit tests across two parallel pipelines.
Right now TF's CI (pycpp) triggers the job via a bazelrc file.
We need a good way to separate the unit tests across bazelrc configs (maybe create another one?), so that each individual CI pipeline can read its own config and run its share of the job in parallel (see the sketch below).
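For illustration, here is a minimal shell sketch of one way to do the split, using a hypothetical run_pycpp_shard.sh script. The query pattern, the --config name, and the split-by-count approach are assumptions, not the actual pycpp setup:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: split the pycpp test targets into two halves so that
# two CI pipelines can each run one half in parallel. The query pattern and
# --config name are assumptions; the real filters would come from the
# existing pycpp bazelrc.
set -euo pipefail

SHARD="${1:?usage: run_pycpp_shard.sh <1|2>}"

# Collect every test target the pycpp job would normally run.
bazel query 'tests(//tensorflow/...)' > /tmp/all_tests.txt

TOTAL=$(wc -l < /tmp/all_tests.txt)
HALF=$(( (TOTAL + 1) / 2 ))

if [[ "$SHARD" == "1" ]]; then
  head -n "$HALF" /tmp/all_tests.txt > /tmp/shard_tests.txt
else
  tail -n +"$(( HALF + 1 ))" /tmp/all_tests.txt > /tmp/shard_tests.txt
fi

# Each pipeline runs only its half of the targets.
bazel test --config=rocm --target_pattern_file=/tmp/shard_tests.txt
```

Splitting by target count won't balance wall-clock time perfectly, but it keeps each pipeline's bazelrc/config untouched.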
I'm not sure that will help. We may be able to throttle the test rate. At the moment the jobs use --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute, and the env vars TF_TESTS_PER_GPU, N_TEST_JOBS, and TF_GPU_COUNT are used there. The idea is to run one test per GPU.
It checks rocm-smi to get the GPU count. There have been changes in the rocm-smi output lately, but I have already fixed the scripts to handle them.
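For context, here is a heavily simplified sketch of what such a --run_under throttling wrapper looks like. This is not the actual parallel_gpu_execute script; the lock-file paths and default values are assumptions for illustration only:

```bash
#!/usr/bin/env bash
# Heavily simplified sketch of a --run_under throttling wrapper.
# NOT the actual parallel_gpu_execute script: lock-file paths and the
# default values below are assumptions.
set -euo pipefail

# In the real scripts TF_GPU_COUNT is derived from rocm-smi (whose output
# format has changed across versions); here we just fall back to a default.
TF_GPU_COUNT="${TF_GPU_COUNT:-4}"
TF_TESTS_PER_GPU="${TF_TESTS_PER_GPU:-1}"
TOTAL_SLOTS=$(( TF_GPU_COUNT * TF_TESTS_PER_GPU ))

slot=0
while true; do
  lockfile="/tmp/tf_gpu_test_lock_${slot}"
  exec {fd}>"$lockfile"
  if flock -n "$fd"; then
    gpu=$(( slot / TF_TESTS_PER_GPU ))
    export HIP_VISIBLE_DEVICES="$gpu"
    # Run the actual test pinned to a single GPU; the inherited lock fd
    # keeps the slot reserved until the test process exits.
    exec "$@"
  fi
  exec {fd}>&-   # slot busy: release the fd and try the next one
  slot=$(( (slot + 1) % TOTAL_SLOTS ))
  sleep 1
done
```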
There is another issue: our TensorFlow CI takes a long time to finish all the unit tests. We think we could split the unit tests in half across two parallel CI jobs to reduce the CI time, as shown below.
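If we go that route, the two pipelines would just invoke different halves, e.g. with the hypothetical shard script sketched above:

```bash
# Pipeline 1 (hypothetical CI step)
./run_pycpp_shard.sh 1

# Pipeline 2 (hypothetical CI step)
./run_pycpp_shard.sh 2
```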