Upstream PyTorch 2.5 release branch's XLA CI is failing #8065

Closed
JackCaoG opened this issue Sep 24, 2024 · 5 comments

@JackCaoG
Collaborator

🐛 Bug

Link to the CI HUD: https://hud2.pytorch.org/hud/pytorch/pytorch/release%2F2.5/1?per_page=50&name_filter=inux-focal-py3_9-clang9-xla

Error message:

No module named 'torch.version'

Most likely just a setup issue. @ManfeiBai, can you take a look?

@ManfeiBai
Collaborator

ManfeiBai commented Sep 25, 2024

Hi, I'm looking at the log now and found these two logs.

In these logs, the tests appear to start failing with this error:

+ run_test /var/lib/jenkins/workspace/xla/test/test_scan.py
2024-09-24T22:37:24.6276294Z + echo 'Running in PjRt runtime: /var/lib/jenkins/workspace/xla/test/test_scan.py'
2024-09-24T22:37:24.6277153Z Running in PjRt runtime: /var/lib/jenkins/workspace/xla/test/test_scan.py
2024-09-24T22:37:24.6277793Z ++ command -v nvidia-smi
2024-09-24T22:37:24.6278246Z + '[' -x '' ']'
2024-09-24T22:37:24.6278539Z + PJRT_DEVICE=CPU
2024-09-24T22:37:24.6278841Z + CPU_NUM_DEVICES=1
2024-09-24T22:37:24.6279283Z + run_coverage /var/lib/jenkins/workspace/xla/test/test_scan.py
2024-09-24T22:37:24.6279842Z + '[' 0 '!=' 0 ']'
2024-09-24T22:37:24.6280240Z + python3 /var/lib/jenkins/workspace/xla/test/test_scan.py
2024-09-24T22:37:27.3463750Z Traceback (most recent call last):
2024-09-24T22:37:27.3464695Z   File "/var/lib/jenkins/workspace/xla/test/test_scan.py", line 5, in <module>
2024-09-24T22:37:27.3465394Z     from torch_xla.experimental.scan import scan
2024-09-24T22:37:27.3466846Z   File "/opt/conda/lib/python3.8/site-packages/torch_xla-2.5.0+git3c7daa2-py3.8-linux-x86_64.egg/torch_xla/experimental/scan.py", line 18, in <module>
2024-09-24T22:37:27.3467944Z     fn: Callable[[Carry, X], tuple[Carry, Y]],
2024-09-24T22:37:27.3468495Z TypeError: 'type' object is not subscriptable
2024-09-24T22:37:27.7669148Z + cleanup_workspace
2024-09-24T22:37:27.7670734Z + echo 'sudo may print the following warning message that can be ignored. The chown command will still run.'
2024-09-24T22:37:27.7671864Z sudo may print the following warning message that can be ignored. The chown command will still run.
2024-09-24T22:37:27.7672814Z + echo '    sudo: setrlimit(RLIMIT_STACK): Operation not permitted'
2024-09-24T22:37:27.7673452Z     sudo: setrlimit(RLIMIT_STACK): Operation not permitted
2024-09-24T22:37:27.7674270Z + echo 'For more details refer to https://github.com/sudo-project/sudo/issues/42'
2024-09-24T22:37:27.7675173Z For more details refer to https://github.com/sudo-project/sudo/issues/42
2024-09-24T22:37:27.7676069Z + sudo chown -R 1000 /var/lib/jenkins/workspace
2024-09-24T22:37:28.2387478Z + sccache_epilogue
2024-09-24T22:37:28.2388291Z + echo '::group::Sccache Compilation Log'
2024-09-24T22:37:28.2389316Z ##[group]Sccache Compilation Log
2024-09-24T22:37:28.2389882Z + echo '=================== sccache compilation log ==================='
2024-09-24T22:37:28.2390472Z =================== sccache compilation log ===================
2024-09-24T22:37:28.2391305Z + python /var/lib/jenkins/workspace/.ci/pytorch/print_sccache_log.py /home/jenkins/sccache_error.log
2024-09-24T22:37:28.2573113Z + echo '=========== If your build fails, please take a look at the log above for possible reasons ==========='
2024-09-24T22:37:28.2574141Z =========== If your build fails, please take a look at the log above for possible reasons ===========
2024-09-24T22:37:28.2575081Z + sccache --show-stats

I haven't found No module named 'torch.version' in the logs yet.

Hi @JackCaoG, would you mind pointing me to the detailed log that contains this message?
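
For context on the `TypeError` in the traceback above: the egg path shows the test running under Python 3.8, and subscripting the builtin `tuple` (as in `tuple[Carry, Y]`) is only supported from Python 3.9 onward (PEP 585). The actual change in the fix PR is not shown in this thread, so the following is only a minimal sketch of one common way to keep such an annotation working on 3.8, using `typing.Tuple`. The names `StepFn` and `builtin_tuple_is_subscriptable` are hypothetical and are not taken from `torch_xla`.

```python
import sys
from typing import Callable, Tuple, TypeVar

# Type variables mirroring the names visible in the traceback.
Carry = TypeVar("Carry")
X = TypeVar("X")
Y = TypeVar("Y")

# typing.Tuple can be subscripted on Python 3.8, whereas builtin
# tuple[...] raises "TypeError: 'type' object is not subscriptable" there.
# StepFn is a hypothetical alias used only for this illustration.
StepFn = Callable[[Carry, X], Tuple[Carry, Y]]


def builtin_tuple_is_subscriptable() -> bool:
    """Return True if `tuple[int, str]` evaluates on this interpreter."""
    try:
        tuple[int, str]
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    print(sys.version_info[:2], builtin_tuple_is_subscriptable())
```

Another option that is sometimes used is `from __future__ import annotations`, which defers evaluation of function annotations so that `tuple[...]` in a signature is never executed at import time. Whether either approach matches the real fix is an assumption.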

@JackCaoG
Collaborator Author

I am looking at the 10:58 one, and this is what I saw:
image

@ManfeiBai
Collaborator

ManfeiBai commented Sep 25, 2024

Thanks. I checked the raw log of an older passing run, https://ossci-raw-job-status.s3.amazonaws.com/log/29851185194, and found the same error message there too, so No module named 'torch.version' does not appear to be what is blocking the job now:
Screenshot 2024-09-24 at 5 36 44 PM
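
The comparison described above (an error that also appears in a passing run is unlikely to be the blocker) can be sketched as a small script. This is only an illustration under stated assumptions: the timestamp pattern is inferred from the log excerpt earlier in this thread, and the `_ERROR` heuristic and the function names `error_lines` and `new_errors` are hypothetical, not part of any CI tooling.

```python
import re

# Leading timestamp as seen in the log excerpt, e.g. "2024-09-24T22:37:27.3468495Z ".
_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z\s")

# Hypothetical heuristic for lines that look like errors; adjust as needed.
_ERROR = re.compile(r"Error|No module named")


def error_lines(log_text):
    """Return the set of error-looking lines with timestamps stripped."""
    found = set()
    for line in log_text.splitlines():
        body = _TIMESTAMP.sub("", line).strip()
        if _ERROR.search(body):
            found.add(body)
    return found


def new_errors(failing_log, passing_log):
    """Return error lines present in the failing log but not the passing one."""
    return sorted(error_lines(failing_log) - error_lines(passing_log))
```

Calling `new_errors(failing_text, passing_text)` on the two raw logs would leave only the messages unique to the failing run, which is where triage should focus first.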

@ManfeiBai
Collaborator

ManfeiBai commented Sep 25, 2024

@yifei pushed fix PR #8067 to the r2.5 branch. Let's wait for the result of the next run.

@ManfeiBai ManfeiBai self-assigned this Sep 25, 2024
@ManfeiBai
Collaborator

ManfeiBai commented Sep 25, 2024

It looks like today's newest run passed with the hotfix PR: Screenshot 2024-09-25 at 2 18 47 PM

I will close this issue.
