[Pin Update] Version 20230826 #5527
This is just to test the xla pin update on pytorch/xla#5527
For ResNet MP on V4-8:
Before the change:
For ResNet SPMD on V4-8:
Before the change:
No performance regression. Will do a LLaMA2 training test later.
LGTM
Looks like one of the GPU tests has failed. There could be a regression on GPU. Will need to set up a GPU env to debug further.
Yeah, I need to set up a GPU env to repro.
@ManfeiBai, wondering if you have any GPUs that we can use to repro this?
Yes, we have some GPUs to repro this. Let me pull this onto those GPU instances so that we can repro it.
Thanks @ManfeiBai, let me know if you need any help.
This might be the GPU CI issue log: https://source.cloud.google.com/results/invocations/a2d48ef1-784f-43d8-b4e5-69caa341e906/targets/%2F%2Ftest%2Fcpp:test_tensor/log
So you are suggesting the timeout could be unrelated, and that we should increase the test timeout limit?
yes, agree with the |
The issue is consistent. I have run it twice.
then I would agree with the |
Until we can debug the GPU CI for certain, we should probably not cherry-pick this into the r2.1 branch. That means we'll need to re-open your PR to fix the FSDP GPU unit tests, @ManfeiBai. Let me know if you think otherwise, @alanwaketan.
Let me set the timeout now. If it doesn't work, then let's re-open Manfei's PR.
a24488e to 55a0823
It seems we need a longer time limit. @wonjoolee95 @JackCaoG @ManfeiBai Please take a look and see whether the current approach is preferable.
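For anyone reproducing this locally, here is a minimal sketch of how a timeout can be told apart from a genuine test failure. It is not the change made in this PR: `run_with_timeout` is a hypothetical helper, and the command and limits are placeholders rather than the actual CI settings.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome.

    Returns (status, returncode), where status is "passed", "failed",
    or "timeout". Hypothetical helper for illustration only.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child is killed once the limit is hit; no return code exists.
        return "timeout", None
    status = "passed" if proc.returncode == 0 else "failed"
    return status, proc.returncode


if __name__ == "__main__":
    # Placeholder command: a script that sleeps briefly, standing in for a slow test.
    slow_test = [sys.executable, "-c", "import time; time.sleep(2)"]
    print(run_with_timeout(slow_test, timeout_s=0.5))  # limit too short
    print(run_with_timeout(slow_test, timeout_s=30))   # longer limit
```

If the same test reports "timeout" under the short limit but "passed" under the longer one, the failure is more likely a time budget problem than a regression in the test itself.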
LGTM
Thanks all for reviewing!
Co-authored-by: Jiewen Tan <[email protected]>
Summary:
This change bumps libTPU.so to 20230826 and updates the corresponding openxla/xla pin.
Test Plan:
CI.