Add XLA unit tests to pre-submit CI #2545

Closed

vanbasten23 opened this issue Mar 11, 2024 · 14 comments

Comments

@vanbasten23
Contributor

vanbasten23 commented Mar 11, 2024

System Info

Hi team,

Can we add some unit tests for XLA (GPU, TPU v4, TPU v2 or v3) to the pre-submit CI? The test can be as simple as accelerate test. The reason for the request is that we have recently observed a few changes in accelerate that broke accelerate test on TPU (such as #2319 and #2176). It takes the PyTorch/XLA team longer to fix them because they are not familiar with the change, and it would be great if the PR author could fix the issue before the PR is merged, as they have the most context, so that users won't see the regression. Thanks!

cc @will-cromar, @JackCaoG, @muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

config:

compute_environment: LOCAL_MACHINE
distributed_type: XLA
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

export PJRT_DEVICE=TPU
accelerate test

Expected behavior

na

@muellerzr
Collaborator

muellerzr commented Mar 11, 2024

The issue is that we don't run runners on pre-submit CI, nor do we have any TPUs to run on merge CI, so we have no way of testing TPUs ourselves with accelerate outside of running it manually in Colab.

(We could maybe look at adding GPU XLA tests in there though post-submit)

Note: we also don't run GPU runners on pre-submit; only the main CI and the nightlies have access to those.

@vanbasten23
Contributor Author

Thanks for the response. In that case, can we add GPU XLA tests post-submit? That would help catch issues earlier.

@muellerzr
Collaborator

Certainly.

IIUC, all that's needed to get this going is to install torch_xla alongside the current state of accelerate, correct? If so, then we just need to set up a new xla-gpu-specific Docker image and run the tests on a CI runner similar to the ones we have now.

@vanbasten23
Contributor Author

Certainly.

IIUC, all that's needed to get this going is to install torch_xla alongside the current state of accelerate, correct? If so, then we just need to set up a new xla-gpu-specific Docker image and run the tests on a CI runner similar to the ones we have now.

That's correct. Thanks!

@muellerzr
Collaborator

@vanbasten23 do you have a good "hello world" test that can be run on the GPU Docker images to check that everything works okay? I'm hitting a few snags just doing accelerate test, and I can't seem to get things working despite setting PJRT_DEVICE=CUDA.

Dockerfile I'm testing:

FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.2.0_3.10_cuda_12.1
RUN python3 -m pip install --no-cache-dir \
    git+https://github.com/huggingface/accelerate#egg=accelerate[test_prod,test_integrations] \
    --extra-index-url https://download.pytorch.org/whl/cu117

# Start an interactive shell
CMD ["/bin/bash"]

@vanbasten23
Contributor Author

Yes. You can use this:
PJRT_DEVICE=CUDA python

import torch, torch_xla
import torch_xla.core.xla_model as xm

# Build two tensors on the host.
t1 = torch.randn(1, 128, device='cpu')
t2 = torch.randn(1, 128, device='cpu')

# Move them to the XLA (GPU) device.
xt1 = t1.to(xm.xla_device())
xt2 = t2.to(xm.xla_device())

# The XLA result, copied back to the host, should match the CPU result.
expected = t1 + t2
actual = (xt1 + xt2).cpu()
assert torch.allclose(expected, actual)

@muellerzr
Collaborator

BTW, just noticing this: we should eventually change the logic so that PJRT_DEVICE is auto-set if multi-GPU is enabled through the config file and torch_xla is available.
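As a rough illustration (a sketch only, not accelerate's actual implementation; the helper name and its call site are hypothetical), such auto-detection could look like this:

# Hypothetical helper: default PJRT_DEVICE to CUDA when the launch config
# requests multiple GPU processes and torch_xla is importable.
import importlib.util
import os

import torch


def maybe_set_pjrt_device(num_processes: int) -> None:
    if "PJRT_DEVICE" in os.environ:
        return  # respect an explicit user setting
    has_torch_xla = importlib.util.find_spec("torch_xla") is not None
    if has_torch_xla and num_processes > 1 and torch.cuda.is_available():
        os.environ["PJRT_DEVICE"] = "CUDA"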

Running PJRT_DEVICE=CUDA accelerate test eventually leaves me with this trace:

2024-03-13 17:12:13.441463: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.
2024-03-13 17:12:13.441551: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2716] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
2024-03-13 17:12:13.441641: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.
2024-03-13 17:12:13.441712: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2716] Execution of replica 1 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
Traceback (most recent call last):
  File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 716, in <module>
Traceback (most recent call last):
  File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 716, in <module>
    main()
  File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 664, in main
    main()
  File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 664, in main
    state.wait_for_everyone()
  File "/mnt/accelerate/src/accelerate/state.py", line 934, in wait_for_everyone
    state.wait_for_everyone()
  File "/mnt/accelerate/src/accelerate/state.py", line 934, in wait_for_everyone
    PartialState().wait_for_everyone()
  File "/mnt/accelerate/src/accelerate/state.py", line 423, in wait_for_everyone
    xm.rendezvous("accelerate.utils.wait_for_everyone")
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1166, in rendezvous
    PartialState().wait_for_everyone()
  File "/mnt/accelerate/src/accelerate/state.py", line 423, in wait_for_everyone
    xm.rendezvous("accelerate.utils.wait_for_everyone")
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1166, in rendezvous
    return xla_rendezvous(payload, replicas or None, tag=tag)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1133, in xla_rendezvous
    return xla_rendezvous(payload, replicas or None, tag=tag)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1133, in xla_rendezvous
    if max_size.item() < 1:
RuntimeError: Bad StatusOr access: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.

Any clue what's going on there? It certainly shouldn't be running out of memory with 2x 24GB GPUs, and I set --shm-size="48gb".

@muellerzr
Collaborator

If we can get to a point where I can run them locally via Docker and things make sense on a CUDA runtime, then we can integrate it into a CI.

@vanbasten23
Contributor Author

BTW, just noticing this: we should eventually change the logic so that PJRT_DEVICE is auto-set if multi-GPU is enabled through the config file and torch_xla is available.

Completely agreed.

Any clue what's going on there?

Do you know your CUDA runtime version (nvcc --version)? I'm using CUDA 12.1 and I got a different error, which suggests the XLA devices were accessed before calling the spawn.
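For context, here is a minimal sketch of the pattern in question, where the XLA device is touched only inside the spawned worker and never in the parent process (the script is illustrative, not code from accelerate):

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Create the XLA device only inside the spawned worker; initializing the
    # runtime in the parent process before xmp.spawn() is what breaks things.
    device = xm.xla_device()
    t = torch.ones(2, 2, device=device)
    xm.mark_step()
    print(index, t.cpu())


if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=())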

@vanbasten23
Contributor Author

I rebased my codebase to get the latest code on the main branch, and here is the new error that I got. It fails at training_check(use_seedable_sampler=True). It looks like it failed at a later place in test_script.py than the one in #2545 (comment).

@muellerzr
Collaborator

Yes, that's the random sampler part I mentioned could be problematic in this PR 😉 #2542 (comment)

@vanbasten23
Contributor Author

@muellerzr
Collaborator

Hmm, okay, I'll try giving it a look tomorrow.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
