Add XLA unit tests to pre-submit CI #2545

Closed

vanbasten23 opened this issue Mar 11, 2024 · 14 comments

Comments

@vanbasten23
Contributor

vanbasten23 commented Mar 11, 2024

System Info

Hi team,

Can we add some unit tests for XLA (GPU, TPU v4, TPU v2 or v3) to the pre-submit CI? The test can be as simple as accelerate test. The reason for the request is that we have recently observed a few changes in accelerate that broke accelerate test on TPU (such as #2319 and #2176). It takes the PyTorch/XLA team longer to fix them because they are not familiar with the change, and it would be great if the PR author could fix the issue before the PR is merged, as they have the most context, so that users won't see the regression. Thanks!

cc @will-cromar, @JackCaoG, @muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

config:

compute_environment: LOCAL_MACHINE
distributed_type: XLA
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

export PJRT_DEVICE=TPU
accelerate test

Expected behavior

na

@muellerzr
Collaborator

muellerzr commented Mar 11, 2024

The issue is that we don't run runners on pre-submit CI, nor do we have any TPUs to run on merge CI, so we have no way of testing TPUs ourselves with accelerate outside of running it manually in Colab.

(We could maybe look at adding GPU XLA tests in there though post-submit)

Note: we also don't run GPU runners on pre-submit; only the main CI and the nightlies have access to those.

@vanbasten23
Contributor Author

Thanks for the response. In that case, can we add GPU XLA tests post-submit? That would help catch issues earlier.

@muellerzr
Collaborator

Certainly.

IIUC, all that's needed to get this going is to install torch_xla alongside the current state of accelerate, correct? If so, then we just need to set up a new xla-gpu-specific Docker image and run the tests on a CI runner similar to the ones we have now.

@vanbasten23
Contributor Author

Certainly.

IIUC, all that's needed to get this going is to install torch_xla alongside the current state of accelerate, correct? If so, then we just need to set up a new xla-gpu-specific Docker image and run the tests on a CI runner similar to the ones we have now.

That's correct. Thanks!

@muellerzr
Collaborator

@vanbasten23 do you have a good "hello world" test that can be run on the GPU Docker images to check that everything works okay? I'm hitting a few snags just doing accelerate test, and I can't seem to get things working despite setting PJRT_DEVICE=CUDA.

Dockerfile I'm testing:

FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.2.0_3.10_cuda_12.1
RUN python3 -m pip install --no-cache-dir \
    git+https://github.com/huggingface/accelerate#egg=accelerate[test_prod,test_integrations] \
    --extra-index-url https://download.pytorch.org/whl/cu117

# Start an interactive shell
CMD ["/bin/bash"]

@vanbasten23
Contributor Author

Yes. You can use this:
PJRT_DEVICE=CUDA python

import torch, torch_xla
import torch_xla.core.xla_model as xm

# Build two tensors on the host.
t1 = torch.randn(1, 128, device='cpu')
t2 = torch.randn(1, 128, device='cpu')

# Move them to the XLA (GPU) device.
xt1 = t1.to(xm.xla_device())
xt2 = t2.to(xm.xla_device())

# The XLA result, copied back to the host, should match the CPU result.
expected = t1 + t2
actual = (xt1 + xt2).cpu()
assert torch.allclose(expected, actual)

@muellerzr
Collaborator

BTW, just noticing this: we should eventually change the logic so that PJRT_DEVICE is auto-set if multi-GPU is enabled through the config file and torch_xla is available.
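As a rough illustration (a sketch only, not accelerate's actual implementation; the helper name and its call site are hypothetical), such auto-detection could look like this:

# Hypothetical helper: default PJRT_DEVICE to CUDA when the launch config
# requests multiple GPU processes and torch_xla is importable.
import importlib.util
import os

import torch


def maybe_set_pjrt_device(num_processes: int) -> None:
    if "PJRT_DEVICE" in os.environ:
        return  # respect an explicit user setting
    has_torch_xla = importlib.util.find_spec("torch_xla") is not None
    if has_torch_xla and num_processes > 1 and torch.cuda.is_available():
        os.environ["PJRT_DEVICE"] = "CUDA"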

Running PJRT_DEVICE=CUDA accelerate test eventually leaves me with this trace:

2024-03-13 17:12:13.441463: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.
2024-03-13 17:12:13.441551: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2716] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
2024-03-13 17:12:13.441641: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.
2024-03-13 17:12:13.441712: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2716] Execution of replica 1 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
Traceback (most recent call last):
  File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 716, in <module>
Traceback (most recent call last):
  File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 716, in <module>
    main()
  File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 664, in main
    main()
  File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 664, in main
    state.wait_for_everyone()
  File "/mnt/accelerate/src/accelerate/state.py", line 934, in wait_for_everyone
    state.wait_for_everyone()
  File "/mnt/accelerate/src/accelerate/state.py", line 934, in wait_for_everyone
    PartialState().wait_for_everyone()
  File "/mnt/accelerate/src/accelerate/state.py", line 423, in wait_for_everyone
    xm.rendezvous("accelerate.utils.wait_for_everyone")
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1166, in rendezvous
    PartialState().wait_for_everyone()
  File "/mnt/accelerate/src/accelerate/state.py", line 423, in wait_for_everyone
    xm.rendezvous("accelerate.utils.wait_for_everyone")
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1166, in rendezvous
    return xla_rendezvous(payload, replicas or None, tag=tag)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1133, in xla_rendezvous
    return xla_rendezvous(payload, replicas or None, tag=tag)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1133, in xla_rendezvous
    if max_size.item() < 1:
RuntimeError: Bad StatusOr access: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.

Any clue what's going on there? It certainly shouldn't be running out of memory with 2x 24GB GPUs, and I set --shm-size="48gb".

@muellerzr
Collaborator

If we can get to a point where I can run them locally via Docker and things make sense on a CUDA runtime, then we can integrate it into a CI.

@vanbasten23
Contributor Author

BTW, just noticing this: we should eventually change the logic so that PJRT_DEVICE is auto-set if multi-GPU is enabled through the config file and torch_xla is available.

Completely agreed.

Any clue what's going on there?

Do you know your CUDA runtime version (nvcc --version)? I'm using CUDA 12.1 and I got a different error, which suggests the XLA devices were accessed before calling the spawn.
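For context, here is a minimal sketch of the pattern in question, where the XLA device is touched only inside the spawned worker and never in the parent process (the script is illustrative, not code from accelerate):

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Create the XLA device only inside the spawned worker; initializing the
    # runtime in the parent process before xmp.spawn() is what breaks things.
    device = xm.xla_device()
    t = torch.ones(2, 2, device=device)
    xm.mark_step()
    print(index, t.cpu())


if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=())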

@vanbasten23
Contributor Author

I rebased my codebase to get the latest code on the main branch, and here is the new error that I got. It fails at training_check(use_seedable_sampler=True). It looks like it failed at a later place in test_script.py than the one in #2545 (comment).

@muellerzr
Collaborator

Yes, that's the random sampler part I mentioned could be problematic in this PR 😉 #2542 (comment)

@vanbasten23
Contributor Author

@muellerzr
Collaborator

Hmm, okay, I'll try giving it a look tomorrow.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
