
Fix CI on master #6663

Closed
vanbasten23 opened this issue Mar 4, 2024 · 9 comments


🐛 Bug

To Reproduce

N/A

Expected behavior

CI needs to be green.

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: all
  • torch_xla version: nightly

Additional context


vanbasten23 commented Mar 4, 2024

Currently, a bunch of CPU tests in the CI are failing.

[screenshot of failing CI runs]

If you look at the most recent few PRs, they fail for various reasons.


vanbasten23 commented Mar 4, 2024

The most recent PR fails with this error:

+ mkdir lcov
+ cp .coverage lcov/
+ coverage-lcov --data_file_path lcov/.coverage
Traceback (most recent call last):
  File "/opt/conda/bin/coverage-lcov", line 5, in <module>
    from coverage_lcov.cli import main
  File "/opt/conda/lib/python3.8/site-packages/coverage_lcov/__init__.py", line 3, in <module>
    import toml
ModuleNotFoundError: No module named 'toml'
Error: Process completed with exit code 1.

PR #6664 added the missing package.
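As a sanity check, one can confirm whether the environment that runs coverage-lcov actually has the dependency. This is a minimal hypothetical check for illustration only, not the content of PR #6664:

# Hypothetical sanity check: coverage_lcov imports toml at module import time,
# so 'toml' must be installed wherever `coverage-lcov` is invoked.
import importlib.util

for pkg in ("toml", "coverage_lcov"):
    print(pkg, "available:", importlib.util.find_spec(pkg) is not None)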

@vanbasten23

The error still persists. Upon further inspection, it looks like the CI uses the base image gcr.io/tpu-pytorch/xla_base:dev-3.8_cuda_12.1 during the build phase, and that base image is built from .circleci/docker/Dockerfile, which is based off of the PyTorch/XLA CUDA development container.


vanbasten23 commented Mar 4, 2024

The PR correctly installs the toml package into the dev docker image:

(base) xiowei@xiowei-gpu:~$ sudo docker run --shm-size=16g --net=host --gpus all -it -d us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.8_cuda_12.1
37ce9b04aa447f702cf096bed3d7a3133fe6d8a37f51daca8117b788781c685e
(base) xiowei@xiowei-gpu:~$ sudo docker exec -it 37ce9b04aa44 /bin/bash
root@xiowei-gpu:/ansible# pip freeze | grep toml
toml==0.10.2

Strangely, the dev image build doc doesn't say which pip packages are installed.

@vanbasten23

#6670 should be able to fix it.


vanbasten23 commented Mar 5, 2024

Verified that the post-submit CI failure on CPU has been fixed. All that's left is the TPU CI failure.


vanbasten23 commented Mar 5, 2024

The current TPU CI failure is due to:

Step #4 - "run_e2e_tests": ======================================================================
Step #4 - "run_e2e_tests": FAIL: test_metadata (__main__.TestHloMetaData)
Step #4 - "run_e2e_tests": ----------------------------------------------------------------------
Step #4 - "run_e2e_tests": Traceback (most recent call last):
Step #4 - "run_e2e_tests":   File "/src/pytorch/xla/test/test_hlo_metadata.py", line 158, in test_metadata
Step #4 - "run_e2e_tests":     assert v, f"Keyword {k} was not found as expected in HLO metadata for simple test"
Step #4 - "run_e2e_tests": AssertionError: Keyword aten__addmm was not found as expected in HLO metadata for simple test
Step #4 - "run_e2e_tests": 
Step #4 - "run_e2e_tests": ----------------------------------------------------------------------
Step #4 - "run_e2e_tests": Ran 1 test in 3.125s
Step #4 - "run_e2e_tests": 
Step #4 - "run_e2e_tests": FAILED (failures=1)
Step #4 - "run_e2e_tests": {'opType': 'aten__tan_', 'opName': 'TestProgram[.1]/TextTestRunner[testRunner]/TestSuite[test]/TestSuite[_tests.0]/TestHloMetaData[_tests.0]/TestHloMetaData[pre_test_ir_debug.errors.0.0]/Sequential[model]/Tanh[3]/aten__tan_', 'sourceFile': '/usr/local/lib/python3.10/site-packages/torch/nn/modules/activation.py', 'sourceLine': 357, 'stackFrameId': 27}
Step #4 - "run_e2e_tests": {}
Step #4 - "run_e2e_tests": I0000 00:00:1709669224.981972  163477 cpu_client.cc:407] TfrtCpuClient destroyed.
Step #4 - "run_e2e_tests": ++ kubectl get pod/xla-test-job-8mccs -o 'jsonpath={.status.containerStatuses[?(@.name=="xla-test")].state.terminated.exitCode}'
Step #4 - "run_e2e_tests": + exit 1
Finished Step #4 - "run_e2e_tests"
ERROR

Hi @mrnikwaws , could you take a look at the above test failure? It's failing PyTorch/XLA's TPU CI.
cc @JackCaoG @miladm @yeounoh

Edit: Sorry, @mrnikwaws, it seems the test was added a long time ago. Perhaps some other change broke it. Let me check more.
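For context, here is a rough sketch of what that check does (an assumed illustration, not the actual test/test_hlo_metadata.py): lower a small model on the XLA device, dump the HLO for its output, and look for the expected aten op names (e.g. the addmm behind nn.Linear) in the HLO metadata.

# Rough sketch, not the real test: inspect the HLO text for expected op names.
# Assumption: XLA_HLO_DEBUG=1 makes torch_xla attach op metadata to HLO dumps.
import os
os.environ["XLA_HLO_DEBUG"] = "1"

import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = nn.Sequential(nn.Linear(4, 4), nn.Tanh()).to(device)
out = model(torch.randn(2, 4, device=device))

# _get_xla_tensors_hlo returns the HLO text for the pending computation.
hlo_text = torch_xla._XLAC._get_xla_tensors_hlo([out])
for keyword in ("addmm", "tanh"):
    print(keyword, "found in HLO:", keyword in hlo_text)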

@vanbasten23

Another fix, #6672, is required to make the TPU CI green.

@vanbasten23

Our post-submit CI is green now: https://screenshot.googleplex.com/8ckvnUWydME9euU
