
Enhancing TensorFlow CI Tests #2698

Open
ScXfjiang opened this issue Oct 3, 2024 · 4 comments
@ScXfjiang

TensorFlow's CI tests sometimes fail for no apparent reason and need a rerun, and each round takes approximately 4 hours. Are there any potential improvements we can make to optimize this process?

ScXfjiang added the enhancement label on Oct 3, 2024
@i-chaochen

The main reason is that a few CI nodes are running too many unit tests at once, which causes the ROCm driver to drop.

I think we could do the same thing as in the XLA CI and parallelize TF's unit tests across two pipelines.

Right now TF's CI (pycpp) triggers the job via a bazelrc file.

We need to think of a good way to split all the unit tests across bazelrc files (maybe create another one?), so that each individual CI pipeline can read its own file and run its jobs in parallel.
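A minimal sketch of what such a split could look like, assuming a hypothetical shard tag (ci_shard_a / ci_shard_b are made-up names, not existing TF configs):

    # Hypothetical .bazelrc fragment: two configs that partition the suite.
    # The ci_shard_a tag would need to be applied to roughly half of the
    # test targets in the BUILD files (or generated).
    test:ci_shard_a --test_tag_filters=ci_shard_a
    test:ci_shard_b --test_tag_filters=-ci_shard_a

Each pipeline would then run bazel test --config=ci_shard_a //tensorflow/... (or --config=ci_shard_b), so the two halves execute in parallel.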

@jayfurmanek

We need to think of a good way to split all the unit tests across bazelrc files (maybe create another one?), so that each individual CI pipeline can read its own file and run its jobs in parallel.

I'm not sure that will help. We may be able to throttle the test rate instead. At the moment the jobs use

    --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute

and the env vars TF_TESTS_PER_GPU, N_TEST_JOBS, and TF_GPU_COUNT are used there. The idea is to run one test per GPU. It checks rocm-smi to get the GPU count; there have been changes in rocm-smi's output lately, but I've fixed the scripts to handle them.
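For context, here is roughly how such a one-test-per-GPU wrapper works. This is a simplified sketch in the spirit of parallel_gpu_execute, not the actual script; the lock-file paths and default values are illustrative:

    #!/bin/bash
    # Each wrapped test tries to grab a free GPU slot via a lock file; the
    # lock is inherited through exec and held until the test exits.
    TF_GPU_COUNT=${TF_GPU_COUNT:-4}          # normally derived from rocm-smi
    TF_TESTS_PER_GPU=${TF_TESTS_PER_GPU:-1}  # concurrent tests per GPU
    while true; do
      for ((slot = 0; slot < TF_GPU_COUNT * TF_TESTS_PER_GPU; slot++)); do
        exec 200>"/tmp/tf_gpu_lock_${slot}"
        if flock -n 200; then
          export HIP_VISIBLE_DEVICES=$((slot % TF_GPU_COUNT))
          exec "$@"                          # run the wrapped test binary
        fi
      done
      sleep 1                                # all slots busy; retry
    done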

@jayfurmanek

Maybe we could throttle it to use fewer GPUs?
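If so, presumably that just means lowering those knobs in the CI invocation, e.g. (illustrative values, not the current settings):

    # Hypothetical throttled run: advertise fewer GPUs to the wrapper and
    # cap bazel's concurrent test jobs to match.
    bazel test \
      --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute \
      --test_env=TF_GPU_COUNT=4 \
      --test_env=TF_TESTS_PER_GPU=1 \
      --local_test_jobs=4 \
      //tensorflow/...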

@i-chaochen

i-chaochen commented Oct 3, 2024

There is another issue: our TensorFlow CI takes a long time to finish all the unit tests. We think we could split the unit tests in half across two parallel CI jobs to reduce the overall CI time.
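One way to halve the list without maintaining tags would be to enumerate and split the test targets at job time (a sketch; the query pattern and file names are illustrative):

    # Hypothetical split: enumerate all test targets, cut the list in two,
    # and feed each half to a separate CI pipeline.
    bazel query 'tests(//tensorflow/...)' > all_tests.txt
    split -n l/2 all_tests.txt shard_        # -> shard_aa, shard_ab
    # Pipeline 1:
    bazel test $(cat shard_aa)
    # Pipeline 2 (separate job):
    bazel test $(cat shard_ab)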
