
Fix CI on master #6663

Closed
vanbasten23 opened this issue Mar 4, 2024 · 9 comments


🐛 Bug

To Reproduce

N/A

Expected behavior

CI needs to be green.

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: all
  • torch_xla version: nightly

Additional context


vanbasten23 commented Mar 4, 2024

Currently, a bunch of CPU tests in the CI are failing.

[screenshot of failing CI runs]

If you look at the most recent few PRs, they fail for various reasons.


vanbasten23 commented Mar 4, 2024

The most recent PR fails with this error:

+ mkdir lcov
+ cp .coverage lcov/
+ coverage-lcov --data_file_path lcov/.coverage
Traceback (most recent call last):
  File "/opt/conda/bin/coverage-lcov", line 5, in <module>
    from coverage_lcov.cli import main
  File "/opt/conda/lib/python3.8/site-packages/coverage_lcov/__init__.py", line 3, in <module>
    import toml
ModuleNotFoundError: No module named 'toml'
Error: Process completed with exit code 1.

PR #6664 added the missing package.
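As a sanity check, one can confirm whether the environment that runs coverage-lcov actually has the dependency. This is a minimal hypothetical check for illustration only, not the content of PR #6664:

# Hypothetical sanity check: coverage_lcov imports toml at module import time,
# so 'toml' must be installed wherever `coverage-lcov` is invoked.
import importlib.util

for pkg in ("toml", "coverage_lcov"):
    print(pkg, "available:", importlib.util.find_spec(pkg) is not None)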

@vanbasten23

The error still persists. Upon further inspection, it looks like the CI uses the base image gcr.io/tpu-pytorch/xla_base:dev-3.8_cuda_12.1 during the build phase, and that base image is built from .circleci/docker/Dockerfile, which is based off of the PyTorch/XLA CUDA development container.


vanbasten23 commented Mar 4, 2024

The PR correctly installs the toml package into the dev docker image:

(base) xiowei@xiowei-gpu:~$ sudo docker run --shm-size=16g --net=host --gpus all -it -d us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.8_cuda_12.1
37ce9b04aa447f702cf096bed3d7a3133fe6d8a37f51daca8117b788781c685e
(base) xiowei@xiowei-gpu:~$ sudo docker exec -it 37ce9b04aa44 /bin/bash
root@xiowei-gpu:/ansible# pip freeze | grep toml
toml==0.10.2

Strangely, the dev image build doc doesn't say which pip packages are installed.

@vanbasten23

#6670 should be able to fix it.


vanbasten23 commented Mar 5, 2024

Verified that the post-submit CI failure on CPU has been fixed. All that's left is the TPU CI failure.


vanbasten23 commented Mar 5, 2024

The current TPU CI failure is due to:

Step #4 - "run_e2e_tests": ======================================================================
Step #4 - "run_e2e_tests": FAIL: test_metadata (__main__.TestHloMetaData)
Step #4 - "run_e2e_tests": ----------------------------------------------------------------------
Step #4 - "run_e2e_tests": Traceback (most recent call last):
Step #4 - "run_e2e_tests":   File "/src/pytorch/xla/test/test_hlo_metadata.py", line 158, in test_metadata
Step #4 - "run_e2e_tests":     assert v, f"Keyword {k} was not found as expected in HLO metadata for simple test"
Step #4 - "run_e2e_tests": AssertionError: Keyword aten__addmm was not found as expected in HLO metadata for simple test
Step #4 - "run_e2e_tests": 
Step #4 - "run_e2e_tests": ----------------------------------------------------------------------
Step #4 - "run_e2e_tests": Ran 1 test in 3.125s
Step #4 - "run_e2e_tests": 
Step #4 - "run_e2e_tests": FAILED (failures=1)
Step #4 - "run_e2e_tests": {'opType': 'aten__tan_', 'opName': 'TestProgram[.1]/TextTestRunner[testRunner]/TestSuite[test]/TestSuite[_tests.0]/TestHloMetaData[_tests.0]/TestHloMetaData[pre_test_ir_debug.errors.0.0]/Sequential[model]/Tanh[3]/aten__tan_', 'sourceFile': '/usr/local/lib/python3.10/site-packages/torch/nn/modules/activation.py', 'sourceLine': 357, 'stackFrameId': 27}
Step #4 - "run_e2e_tests": {}
Step #4 - "run_e2e_tests": I0000 00:00:1709669224.981972  163477 cpu_client.cc:407] TfrtCpuClient destroyed.
Step #4 - "run_e2e_tests": ++ kubectl get pod/xla-test-job-8mccs -o 'jsonpath={.status.containerStatuses[?(@.name=="xla-test")].state.terminated.exitCode}'
Step #4 - "run_e2e_tests": + exit 1
Finished Step #4 - "run_e2e_tests"
ERROR

Hi @mrnikwaws , could you take a look at the above test failure? It's failing PyTorch/XLA's TPU CI.
cc @JackCaoG @miladm @yeounoh

Edit: Sorry, @mrnikwaws, it seems the test was added a long time ago. Perhaps some other change broke it. Let me check more.
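For context, here is a rough sketch of what that check does (an assumed illustration, not the actual test/test_hlo_metadata.py): lower a small model on the XLA device, dump the HLO for its output, and look for the expected aten op names (e.g. the addmm behind nn.Linear) in the HLO metadata.

# Rough sketch, not the real test: inspect the HLO text for expected op names.
# Assumption: XLA_HLO_DEBUG=1 makes torch_xla attach op metadata to HLO dumps.
import os
os.environ["XLA_HLO_DEBUG"] = "1"

import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = nn.Sequential(nn.Linear(4, 4), nn.Tanh()).to(device)
out = model(torch.randn(2, 4, device=device))

# _get_xla_tensors_hlo returns the HLO text for the pending computation.
hlo_text = torch_xla._XLAC._get_xla_tensors_hlo([out])
for keyword in ("addmm", "tanh"):
    print(keyword, "found in HLO:", keyword in hlo_text)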

@vanbasten23

Another fix, #6672, is required to make the TPU CI green.

@vanbasten23

Our post-submit CI is green now: https://screenshot.googleplex.com/8ckvnUWydME9euU
