[Pin Update] Version 20230826 #5527

Merged: 7 commits merged into master on Sep 11, 2023

Conversation

@alanwaketan (Collaborator)

Summary:
This change bumps libTPU.so to 20230826 and updates the corresponding openxla/xla pin.

Test Plan:
CI.
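
For context, a pin update of this kind amounts to bumping two pinned values: the libtpu nightly date and the matching openxla/xla commit. The sketch below is illustrative only; the variable names and version format are assumptions, not the actual pytorch/xla build files, and the commit hash is deliberately left as a placeholder.

```python
# Illustrative sketch only: names and version scheme are assumptions,
# not the real pytorch/xla configuration.

LIBTPU_DATE = "20230826"                  # nightly libtpu build being pinned
LIBTPU_VERSION = f"0.1.dev{LIBTPU_DATE}"  # assumed nightly version scheme

# openxla/xla commit that should match the 20230826 libtpu build
# (hash intentionally left as a placeholder).
OPENXLA_XLA_COMMIT = "<matching openxla/xla commit>"

if __name__ == "__main__":
    print(f"libtpu pin:      {LIBTPU_VERSION}")
    print(f"openxla/xla pin: {OPENXLA_XLA_COMMIT}")
```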

alanwaketan requested review from JackCaoG and lsy323 on Aug 31, 2023
alanwaketan self-assigned this on Aug 31, 2023
alanwaketan added a commit to pytorch/pytorch that referenced this pull request Aug 31, 2023
This is just to test the xla pin update on pytorch/xla#5527
@alanwaketan (Collaborator, Author) commented Aug 31, 2023

For ResNet MP on V4-8:
After the change:

| Training Device=xla:0/1 Epoch=1 Step=2340 Loss=0.00135 Rate=1776.80 GlobalRate=993.25 Time=02:13:37

Before the change:

| Training Device=xla:0/0 Epoch=1 Step=260 Loss=0.01169 Rate=1773.55 GlobalRate=221.43 Time=02:31:15

For ResNet SPMD on V4-8:
After the change:

| Training Device=xla:0/0 Epoch=1 Step=4680 Loss=0.00135 Rate=1882.09 GlobalRate=1552.79 Time=02:22:06

Before the change:

| Training Device=xla:0/0 Epoch=1 Step=1020 Loss=0.00385 Rate=1881.47 GlobalRate=951.80 Time=02:27:09

No performance regression. Will do a LLaMA2 training test later.
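
As a side note, here is a minimal sketch of how the Rate/GlobalRate fields in log lines like the ones above can be pulled out and compared. This is a standalone helper written for illustration, not part of the pytorch/xla test scripts.

```python
import re

# Matches metric fields in training log lines such as:
# | Training Device=xla:0/0 Epoch=1 Step=2340 Loss=0.00135 Rate=1776.80 GlobalRate=993.25 Time=02:13:37
_METRIC_RE = re.compile(r"(Step|Loss|Rate|GlobalRate)=([0-9.]+)")

def parse_metrics(line: str) -> dict:
    """Extract Step/Loss/Rate/GlobalRate from one training log line."""
    return {key: float(value) for key, value in _METRIC_RE.findall(line)}

before = parse_metrics(
    "| Training Device=xla:0/0 Epoch=1 Step=260 Loss=0.01169 Rate=1773.55 GlobalRate=221.43 Time=02:31:15")
after = parse_metrics(
    "| Training Device=xla:0/1 Epoch=1 Step=2340 Loss=0.00135 Rate=1776.80 GlobalRate=993.25 Time=02:13:37")

# Rate is the steady-state per-step throughput, which is the regression signal;
# GlobalRate also depends on how far into the run the sample was taken.
delta_pct = (after["Rate"] - before["Rate"]) / before["Rate"] * 100
print(f"Rate: {before['Rate']:.2f} -> {after['Rate']:.2f} ({delta_pct:+.2f}%)")
```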

@lsy323 (Collaborator) left a comment

LGTM

@alanwaketan (Collaborator, Author)

Looks like one of the GPU tests has failed. There could be a regression on GPU. I will need to set up a GPU env to debug further.

@JackCaoG (Collaborator) commented Sep 5, 2023

//test/cpp:test_tensor timed out; I wonder whether it is a real error or some flakiness. I bumped the test timeout for GPU to 240 minutes in 76495f0, so I think it is not the workflow itself that timed out but that the specified test has its own timeout threshold.
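
To make the distinction above concrete (an overall workflow budget versus a per-test timeout), here is a generic Python sketch. It is not the repository's Bazel or CI configuration; the test command and time budgets are made-up values.

```python
import subprocess

# Hypothetical budgets, for illustration only.
WORKFLOW_TIMEOUT_MIN = 240   # overall GPU workflow budget (minutes)
PER_TEST_TIMEOUT_MIN = 45    # budget for a single test binary (minutes)

def run_test(cmd, timeout_minutes):
    """Run one test binary and report failure if it exceeds its own budget."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_minutes * 60)
        return completed.returncode == 0
    except subprocess.TimeoutExpired:
        # The individual test timed out even though the overall workflow
        # budget (WORKFLOW_TIMEOUT_MIN) may not have been exhausted.
        return False

if __name__ == "__main__":
    ok = run_test(["./test_tensor"], PER_TEST_TIMEOUT_MIN)
    print("test_tensor passed" if ok else "test_tensor failed or timed out")
```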

@alanwaketan (Collaborator, Author)

> //test/cpp:test_tensor timed out; I wonder whether it is a real error or some flakiness. I bumped the test timeout for GPU to 240 minutes in 76495f0, so I think it is not the workflow itself that timed out but that the specified test has its own timeout threshold.

Yeah, I need to set up a GPU env to repro.

@wonjoolee95 (Collaborator)

@ManfeiBai, wondering if you have any GPU that we can use to repro this?

@ManfeiBai (Collaborator) commented Sep 6, 2023

> @ManfeiBai, wondering if you have any GPU that we can use to repro this?

Yes, we have some GPUs to repro this. Let me pull this onto those GPU instances so that we can repro it.

@wonjoolee95 (Collaborator)

Thanks @ManfeiBai, let me know if you need any help.

@alanwaketan (Collaborator, Author)

> 76495f0

This might be the GPU CI issue log: https://source.cloud.google.com/results/invocations/a2d48ef1-784f-43d8-b4e5-69caa341e906/targets/%2F%2Ftest%2Fcpp:test_tensor/log

So you are suggesting the timeout could be unrelated, and that we should increase the test timeout limit?

@ManfeiBai (Collaborator)

Yes, I agree with the plan to increase the test timeout limit. Let's also trigger the GPU test again to confirm the issue is consistent.

@alanwaketan (Collaborator, Author)

> Yes, I agree with the plan to increase the test timeout limit. Let's also trigger the GPU test again to confirm the issue is consistent.

The issue is consistent. I have run it twice.

@ManfeiBai (Collaborator)

> The issue is consistent. I have run it twice.

Then I would agree with the plan to increase the test timeout limit; it may give us more info.

@wonjoolee95 (Collaborator)

Until we can debug the GPU CI for certain, we should probably not cherry-pick this into the r2.1 branch. That means we'll need to re-open your PR to fix the FSDP GPU unit tests, @ManfeiBai. Let me know if you think otherwise, @alanwaketan.

@alanwaketan (Collaborator, Author)

> Until we can debug the GPU CI for certain, we should probably not cherry-pick this into the r2.1 branch. That means we'll need to re-open your PR to fix the FSDP GPU unit tests, @ManfeiBai. Let me know if you think otherwise, @alanwaketan.

Let me set the timeout now. If it doesn't work, then let's re-open Manfei's PR.

@alanwaketan (Collaborator, Author)

It seems we need a longer time limit. @wonjoolee95 @JackCaoG @ManfeiBai, please take a look and see whether the current approach is preferable.

@ManfeiBai (Collaborator) left a comment

LGTM

@alanwaketan (Collaborator, Author)

Thanks all for reviewing!

alanwaketan merged commit 9732daf into master on Sep 11, 2023
ManfeiBai pushed a commit that referenced this pull request Sep 12, 2023
ManfeiBai added a commit that referenced this pull request Sep 14, 2023