[Pin Update] Version 20230826 #5527

Merged: 7 commits merged into master on Sep 11, 2023

Conversation

@alanwaketan (Collaborator)

Summary:
This change bumps libTPU.so to 20230826 and updates the corresponding openxla/xla pin.

Test Plan:
CI.
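
For context, a pin update of this kind amounts to bumping two pinned values: the libtpu nightly date and the matching openxla/xla commit. The sketch below is illustrative only; the variable names and version format are assumptions, not the actual pytorch/xla build files, and the commit hash is deliberately left as a placeholder.

```python
# Illustrative sketch only: names and version scheme are assumptions,
# not the real pytorch/xla configuration.

LIBTPU_DATE = "20230826"                  # nightly libtpu build being pinned
LIBTPU_VERSION = f"0.1.dev{LIBTPU_DATE}"  # assumed nightly version scheme

# openxla/xla commit that should match the 20230826 libtpu build
# (hash intentionally left as a placeholder).
OPENXLA_XLA_COMMIT = "<matching openxla/xla commit>"

if __name__ == "__main__":
    print(f"libtpu pin:      {LIBTPU_VERSION}")
    print(f"openxla/xla pin: {OPENXLA_XLA_COMMIT}")
```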

alanwaketan requested review from JackCaoG and lsy323 on Aug 31, 2023
alanwaketan self-assigned this on Aug 31, 2023
alanwaketan added a commit to pytorch/pytorch that referenced this pull request Aug 31, 2023
This is just to test the xla pin update on pytorch/xla#5527
@alanwaketan (Collaborator, Author) commented Aug 31, 2023

For ResNet MP on V4-8:
After the change:

| Training Device=xla:0/1 Epoch=1 Step=2340 Loss=0.00135 Rate=1776.80 GlobalRate=993.25 Time=02:13:37

Before the change:

| Training Device=xla:0/0 Epoch=1 Step=260 Loss=0.01169 Rate=1773.55 GlobalRate=221.43 Time=02:31:15

For ResNet SPMD on V4-8:
After the change:

| Training Device=xla:0/0 Epoch=1 Step=4680 Loss=0.00135 Rate=1882.09 GlobalRate=1552.79 Time=02:22:06

Before the change:

| Training Device=xla:0/0 Epoch=1 Step=1020 Loss=0.00385 Rate=1881.47 GlobalRate=951.80 Time=02:27:09

No performance regression. Will do a LLaMA2 training test later.
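
As a side note, here is a minimal sketch of how the Rate/GlobalRate fields in log lines like the ones above can be pulled out and compared. This is a standalone helper written for illustration, not part of the pytorch/xla test scripts.

```python
import re

# Matches metric fields in training log lines such as:
# | Training Device=xla:0/0 Epoch=1 Step=2340 Loss=0.00135 Rate=1776.80 GlobalRate=993.25 Time=02:13:37
_METRIC_RE = re.compile(r"(Step|Loss|Rate|GlobalRate)=([0-9.]+)")

def parse_metrics(line: str) -> dict:
    """Extract Step/Loss/Rate/GlobalRate from one training log line."""
    return {key: float(value) for key, value in _METRIC_RE.findall(line)}

before = parse_metrics(
    "| Training Device=xla:0/0 Epoch=1 Step=260 Loss=0.01169 Rate=1773.55 GlobalRate=221.43 Time=02:31:15")
after = parse_metrics(
    "| Training Device=xla:0/1 Epoch=1 Step=2340 Loss=0.00135 Rate=1776.80 GlobalRate=993.25 Time=02:13:37")

# Rate is the steady-state per-step throughput, which is the regression signal;
# GlobalRate also depends on how far into the run the sample was taken.
delta_pct = (after["Rate"] - before["Rate"]) / before["Rate"] * 100
print(f"Rate: {before['Rate']:.2f} -> {after['Rate']:.2f} ({delta_pct:+.2f}%)")
```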

@lsy323 (Collaborator) left a comment

LGTM

@alanwaketan (Collaborator, Author)

Looks like one of the GPU tests has failed. There could be a regression on GPU. I will need to set up a GPU env to debug further.

@JackCaoG (Collaborator) commented Sep 5, 2023

//test/cpp:test_tensor timed out; I wonder whether it is a real error or some flakiness. I bumped the test timeout for GPU to 240 minutes in 76495f0, so I think it is not the workflow itself that timed out but that the specified test has its own timeout threshold.
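
To make the distinction above concrete (an overall workflow budget versus a per-test timeout), here is a generic Python sketch. It is not the repository's Bazel or CI configuration; the test command and time budgets are made-up values.

```python
import subprocess

# Hypothetical budgets, for illustration only.
WORKFLOW_TIMEOUT_MIN = 240   # overall GPU workflow budget (minutes)
PER_TEST_TIMEOUT_MIN = 45    # budget for a single test binary (minutes)

def run_test(cmd, timeout_minutes):
    """Run one test binary and report failure if it exceeds its own budget."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_minutes * 60)
        return completed.returncode == 0
    except subprocess.TimeoutExpired:
        # The individual test timed out even though the overall workflow
        # budget (WORKFLOW_TIMEOUT_MIN) may not have been exhausted.
        return False

if __name__ == "__main__":
    ok = run_test(["./test_tensor"], PER_TEST_TIMEOUT_MIN)
    print("test_tensor passed" if ok else "test_tensor failed or timed out")
```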

@alanwaketan (Collaborator, Author)

> //test/cpp:test_tensor timed out; I wonder whether it is a real error or some flakiness. I bumped the test timeout for GPU to 240 minutes in 76495f0, so I think it is not the workflow itself that timed out but that the specified test has its own timeout threshold.

Yeah, I need to set up a GPU env to repro.

@wonjoolee95 (Collaborator)

@ManfeiBai, wondering if you have any GPU that we can use to repro this?

@ManfeiBai (Collaborator) commented Sep 6, 2023

> @ManfeiBai, wondering if you have any GPU that we can use to repro this?

Yes, we have some GPUs to repro this. Let me pull this onto those GPU instances so that we can repro it.

@wonjoolee95 (Collaborator)

Thanks @ManfeiBai, let me know if you need any help.

@alanwaketan (Collaborator, Author)

> 76495f0

This might be the GPU CI issue log: https://source.cloud.google.com/results/invocations/a2d48ef1-784f-43d8-b4e5-69caa341e906/targets/%2F%2Ftest%2Fcpp:test_tensor/log

So you are suggesting the timeout could be unrelated, and that we should increase the test timeout limit?

@ManfeiBai (Collaborator)

Yes, I agree with the plan to increase the test timeout limit. Let's also trigger the GPU test again to confirm the issue is consistent.

@alanwaketan (Collaborator, Author)

> Yes, I agree with the plan to increase the test timeout limit. Let's also trigger the GPU test again to confirm the issue is consistent.

The issue is consistent. I have run it twice.

@ManfeiBai (Collaborator)

> The issue is consistent. I have run it twice.

Then I would agree with the plan to increase the test timeout limit; it may give us more info.

@wonjoolee95 (Collaborator)

Until we can debug the GPU CI for certain, we should probably not cherry-pick this into the r2.1 branch. That means we'll need to re-open your PR to fix the FSDP GPU unit tests, @ManfeiBai. Let me know if you think otherwise, @alanwaketan.

@alanwaketan (Collaborator, Author)

> Until we can debug the GPU CI for certain, we should probably not cherry-pick this into the r2.1 branch. That means we'll need to re-open your PR to fix the FSDP GPU unit tests, @ManfeiBai. Let me know if you think otherwise, @alanwaketan.

Let me set the timeout now. If it doesn't work, then let's re-open Manfei's PR.

@alanwaketan (Collaborator, Author)

It seems we need a longer time limit. @wonjoolee95 @JackCaoG @ManfeiBai, please take a look and see whether the current approach is preferable.

@ManfeiBai (Collaborator) left a comment

LGTM

@alanwaketan (Collaborator, Author)

Thanks all for reviewing!

alanwaketan merged commit 9732daf into master on Sep 11, 2023
ManfeiBai pushed a commit that referenced this pull request Sep 12, 2023
ManfeiBai added a commit that referenced this pull request Sep 14, 2023