
Add CI workflow for tests that require pytorch CUDA. #7073

Closed
vanbasten23 wants to merge 34 commits

Conversation


@vanbasten23 vanbasten23 commented May 16, 2024

This PR adds a new CI workflow that builds PyTorch from source with CUDA enabled, builds pytorch/xla from source with CUDA enabled, and then runs tests. The intention is to run the tests that require PyTorch with CUDA.

In detail, this PR adds 2 more jobs to .github/workflows/build_and_test.yml:

  1. build-torch-with-cuda-xla-with-cuda: builds PyTorch with CUDA and builds pytorch/xla. (It's OK to build torch_xla with CUDA enabled now, because building PyTorch takes 2537s while building torch_xla takes only 352s.)
  2. test-cuda-with-pytorch-cuda-enabled: runs only the tests that require PyTorch CUDA.
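A hedged sketch of how the two jobs could be wired together in .github/workflows/build_and_test.yml (the job names come from this PR; the reusable-workflow paths and input names are assumptions for illustration only):

```yaml
# Sketch only — the actual reusable workflows and inputs in the repo may differ.
jobs:
  build-torch-with-cuda-xla-with-cuda:
    uses: ./.github/workflows/_build.yml   # hypothetical reusable build workflow
    with:
      cuda: true

  test-cuda-with-pytorch-cuda-enabled:
    needs: build-torch-with-cuda-xla-with-cuda
    uses: ./.github/workflows/_test.yml    # _test.yml is discussed later in this thread
    with:
      requires-torch-cuda: true            # hypothetical input
```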

@vanbasten23 vanbasten23 mentioned this pull request May 16, 2024
@vanbasten23 vanbasten23 force-pushed the xiowei/addCIWorkflow branch 3 times, most recently from 5f27b23 to f79aee7 Compare May 21, 2024 04:50

vanbasten23 commented May 21, 2024

Note to myself: it seems that the BUILD XLA CUDA plugin step requires the env var USE_CUDA=0. Otherwise, it fails with an error:

      [5,777 / 9,258] Compiling tsl/platform/default/logging.cc; 0s local, remote-cache ... (17 actions, 2 running)
      ERROR: /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/xla/xla/service/gpu/BUILD:4486:13: Compiling xla/service/gpu/buffer_comparator.cu.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (from target @xla//xla/service/gpu:buffer_comparator_kernel) external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF bazel-out/k8-opt/bin/external/xla/xla/service/gpu/_objs/buffer_comparator_kernel/buffer_comparator.cu.pic.d ... (remaining 65 arguments skipped)
      nvcc fatal   : Unsupported gpu architecture 'compute_35'
      Target @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so failed to build
      Use --verbose_failures to see the command lines of failed build steps.
      INFO: Elapsed time: 40.997s, Critical Path: 1.96s
      INFO: 5806 processes: 949 remote cache hit, 4857 internal.
      FAILED: Build did NOT complete successfully
      FAILED: Build did NOT complete successfully
      INFO: Streaming build results to: https://source.cloud.google.com/results/invocations/bba75106-7be5-4ad8-9a83-b3dd6222c3e1
      Traceback (most recent call last):
        File "/usr/local/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/usr/local/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/usr/local/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/tmp/pip-build-env-e97mw0k_/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "/tmp/pip-build-env-e97mw0k_/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-e97mw0k_/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 11, in <module>
        File "/__w/xla/xla/pytorch/xla/plugins/cuda/../../build_util.py", line 67, in bazel_build
          subprocess.check_call(bazel_argv, stdout=sys.stdout, stderr=sys.stderr)
        File "/usr/local/lib/python3.10/subprocess.py", line 369, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['bazel', 'build', '@xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so', '--symlink_prefix=/__w/xla/xla/pytorch/xla/plugins/cuda/bazel-', '--config=remote_cache', '--jobs=16', '--config=cuda', '--remote_default_exec_properties=cache-silo-key=dev']' returned non-zero exit status 1.
      error: subprocess-exited-with-error
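A minimal sketch of the workaround (only the USE_CUDA=0 requirement comes from the note above; the plugin install command is an assumption):

```shell
# Make sure the XLA CUDA plugin build does not inherit USE_CUDA=1 from the
# earlier PyTorch build step; per the log above, building the plugin with
# USE_CUDA set made nvcc target the unsupported compute_35 architecture.
export USE_CUDA=0
echo "USE_CUDA=$USE_CUDA"
# The plugin build itself would then be something like:
#   pip install ./pytorch/xla/plugins/cuda -v
```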

@vanbasten23 vanbasten23 force-pushed the xiowei/addCIWorkflow branch 5 times, most recently from 9a251c7 to 91abb38 Compare May 22, 2024 23:43
@vanbasten23 (Collaborator, Author) commented:

It seems that without installing the CUDA plugin, the tests fail with this error: https://gist.github.com/vanbasten23/7dd6ddeaad93843e57653990c43cf476

@vanbasten23 vanbasten23 force-pushed the xiowei/addCIWorkflow branch 2 times, most recently from 9c59552 to 723f9d1 Compare May 23, 2024 17:06
@vanbasten23 vanbasten23 marked this pull request as ready for review May 23, 2024 17:07
@vanbasten23 vanbasten23 requested a review from ysiraichi May 23, 2024 17:12
@@ -2574,15 +2574,16 @@ def test_dlpack_non_default_layout(self):
cuda_t = torch.arange(25, device=torch.device('cuda')).reshape(5, 5)

t1 = cuda_t.t()
print('xw32 t1.device=', t1.device)
vanbasten23 (Collaborator, Author) commented on the diff:
remove

BAZEL_REMOTE_CACHE: 1
BUILD_CPP_TESTS: 1
steps:
- name: Setup gcloud
A reviewer (Collaborator) commented:
This setup is repetitive. I was already on the fence about encapsulating it in a new action (since it already appears in the build action, test actions, and the docs push). If we add another copy, it really should be encapsulated so we don't have to update a bunch of places at once.

name: torch-with-cuda-xla-with-cuda-wheels
path: /tmp/wheels/
pattern: torch-*.whl
- name: Fetch CUDA plugin
A reviewer (Collaborator) commented:
This setup is repetitive. I was already on the fence about encapsulating it in a new action (since it already appears in the build action, test actions, and the docs push). If we add another copy that's more-or-less identical, it really should be encapsulated so we don't have to update a bunch of places at once.

Stepping back, is there a way to merge this with _test.yml? You would need to add a parameter for some of the test groups to install the torch CUDA build.
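The merge-with-_test.yml idea above might look like a boolean input on the reusable test workflow (the input name and description are assumptions):

```yaml
# Hypothetical parameter on _test.yml; actual names in the repo may differ.
on:
  workflow_call:
    inputs:
      install-cuda-torch:
        description: Install the CUDA-enabled torch wheel before running these test groups
        type: boolean
        required: false
        default: false
```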

vanbasten23 (Collaborator, Author) replied:
I can see your point. Let me give it a try.

shell: bash
run: |
cd pytorch/xla/infra/ansible
ansible-playbook playbook.yaml -vvv -e "stage=build arch=amd64 accelerator=cuda cuda_compute_capabilities=5.2,7.5 src_root=${GITHUB_WORKSPACE} build_cpp_tests=1 git_versioned_xla_build=1 cache_suffix=-ci build_pytorch_with_cuda=1" --skip-tags=fetch_srcs,install_deps
A reviewer (Collaborator) commented:
I know this goes against my philosophy of "put everything in ansible", but what if you just build the PyTorch CUDA wheel directly here? I don't think we should build multiple copies of torch_xla and torchvision.

Building PyTorch with CUDA support is only part of our CI workflow, and it will never be part of our release workflow. It's okay in my mind to just directly USE_CUDA=1 python setup.py bdist_wheel here and upload only the torch GPU wheel as an artifact.

The test workflow can then use the same torch-xla, torch-xla-cuda-plugin, and torchvision.
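One detail of the suggestion above worth noting: prefixing USE_CUDA=1 to the single build command keeps it from conflicting with the USE_CUDA=0 requirement noted earlier, because a per-command variable does not leak into the rest of the job. A runnable sketch of that scoping (the actual setup.py invocation is shown only as a comment):

```shell
# USE_CUDA=1 prefixed to one command is visible only to that command...
USE_CUDA=1 sh -c 'echo "torch build sees USE_CUDA=$USE_CUDA"'
# ...while the rest of the job still sees it unset, so a later XLA CUDA
# plugin build (which needs USE_CUDA=0) is unaffected.
echo "rest of job sees USE_CUDA=${USE_CUDA:-unset}"
# In CI this would wrap the actual build, e.g.:
#   (cd pytorch && USE_CUDA=1 python setup.py bdist_wheel)
```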

@vanbasten23 vanbasten23 force-pushed the xiowei/addCIWorkflow branch 2 times, most recently from 7576407 to bdb37d7 Compare May 23, 2024 23:27
@bhavya01 bhavya01 self-requested a review May 25, 2024 00:27
@bhavya01 (Collaborator) commented:
Thanks for working on this. Added myself as a reviewer as I also need this for Triton tests.

@vanbasten23 vanbasten23 force-pushed the xiowei/addCIWorkflow branch from cbd190a to 1e910fb Compare May 29, 2024 00:14
@vanbasten23 vanbasten23 force-pushed the xiowei/addCIWorkflow branch from 1e910fb to bcd007c Compare May 29, 2024 13:56
@vanbasten23 (Collaborator, Author) commented:
Closing this in favor of #7140.
