
Open XLA pin update #5675

Merged: 7 commits into master, Oct 11, 2023

Conversation


@qihqi (Collaborator) commented Oct 4, 2023

No description provided.

On a deleted patch file (hunk @@ -1,19 +0,0 @@, first line: "upstream CI will fail without this"):
Collaborator:

Do you know why we were able to remove this patch? Is it because we updated the compiler in the CI?

Collaborator:

I think we need to kick off an upstream CI build targeting this branch and see whether CI passes.

Collaborator (author):

Yeah, turns out I do still need those patches... otherwise the training job hangs.

On a deleted patch file (hunk @@ -1,14 +0,0 @@, first line: diff --git a/xla/service/gpu/gpu_executable.cc b/xla/service/gpu/gpu_executable.cc):
Collaborator:

Same question as above.

WORKSPACE (outdated)

],
- strip_prefix = "xla-97a5f819faf9ff793b7ba68ff1f31f74f9459c18",
+ strip_prefix = "xla-7a19856d74569fd1f765cd03bdee84e3b1fdc579",
Collaborator:

Can you also update the libtpu dependency in setup.py to the same date as this commit?

Collaborator (author):

done


qihqi commented Oct 5, 2023

Tested on v4-8 with the following command:

LD_LIBRARY_PATH=/home/hanq/miniconda3/envs/py310/lib python3 test/test_train_mp_imagenet.py --model=resnet50 --fake_data --num_epochs=10 --log_steps=300 --profile --use_optimized_kwargs=tpuv4 --drop_last

Result:

Old:
| Training Device=xla:0/3 Epoch=1 Step=1800 Loss=0.00135 Rate=1833.71 GlobalRate=918.89 Time=17:20:14
| Training Device=xla:0/1 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.82 GlobalRate=986.79 Time=17:20:35
| Training Device=xla:0/3 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.82 GlobalRate=990.06 Time=17:20:35
| Training Device=xla:0/0 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.81 GlobalRate=982.20 Time=17:20:35
| Training Device=xla:0/2 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.79 GlobalRate=989.61 Time=17:20:35

===
New:
| Training Device=xla:0/3 Epoch=1 Step=1500 Loss=0.00138 Rate=1803.73 GlobalRate=822.80 Time=18:09:52
| Training Device=xla:0/2 Epoch=1 Step=1500 Loss=0.00138 Rate=1803.72 GlobalRate=821.27 Time=18:09:52
| Training Device=xla:0/1 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=911.50 Time=18:10:12
| Training Device=xla:0/3 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=906.47 Time=18:10:12
| Training Device=xla:0/0 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=910.19 Time=18:10:12
| Training Device=xla:0/2 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.63 GlobalRate=904.92 Time=18:10:12
| Training Device=xla:0/3 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.96 GlobalRate=977.43 Time=18:10:33
| Training Device=xla:0/0 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.97 GlobalRate=981.14 Time=18:10:33
| Training Device=xla:0/2 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.97 GlobalRate=975.89 Time=18:10:33
| Training Device=xla:0/1 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.96 GlobalRate=982.45 Time=18:10:33

"@tsl//tsl/platform:casts",
"@tsl//tsl/platform:errors",
- ] + if_cuda([
+ ] + if_cuda_or_rocm([
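(To illustrate the pattern this patch switches to: a minimal Starlark sketch, assuming a cc_library target; the load path comes from the review comment below, and the target name and GPU-only dependency are placeholders, not copied from the real BUILD file.)

# Hypothetical sketch -- target name and GPU-only deps are placeholders.
load("//xla/stream_executor:build_defs.bzl", "if_cuda_or_rocm")

cc_library(
    name = "gpu_executable",  # placeholder name
    deps = [
        "@tsl//tsl/platform:casts",
        "@tsl//tsl/platform:errors",
    ] + if_cuda_or_rocm([
        # Dependencies listed here are pulled in for both CUDA and ROCm
        # builds, instead of CUDA-only builds as with if_cuda().
        "//some:gpu_only_dep",  # placeholder
    ]),
)
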
@ManfeiBai (Collaborator) commented Oct 5, 2023:

Thanks!

This patch looks like it corresponds to openxla/xla@9938bdb, so I'm curious why the change to load("//xla/stream_executor:build_defs.bzl", "if_cuda_or_rocm", "if_gpu_is_configured") was skipped?

Also, since GPU CI failed with the same issue (RuntimeError: torch_xla/csrc/device.cpp:72 : Invalid device specification: CUDA:0), are they related too?

Collaborator (author):

No particular reason.

I started the import on Oct 3, and this change landed on Oct 4.

@qihqi force-pushed the hanq/pin_update branch 5 times, most recently from 6c59c2c to 3f57cd1, October 6, 2023 02:46
@alanwaketan self-requested a review October 9, 2023 17:52
@qihqi force-pushed the hanq/pin_update branch 2 times, most recently from b97aa10 to 2dc72ab, October 10, 2023 20:24
WORKSPACE

],
- strip_prefix = "xla-97a5f819faf9ff793b7ba68ff1f31f74f9459c18",
+ strip_prefix = "xla-51b59cfb1999c6f1b3ec59851675044b2c502aae",
Collaborator:

Thanks for moving the head to this commit!

setup.py (outdated)
@@ -72,7 +72,7 @@

base_dir = os.path.dirname(os.path.abspath(__file__))

- _libtpu_version = '0.1.dev20230825'
+ _libtpu_version = '0.1.dev20231009'
Collaborator:

I suspect this should be 0.1.dev20231010 in order to include the open xla commit you specified.

Collaborator (author):

done.
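
(For reference, a minimal sketch of how this version string typically feeds an installable TPU extra; the extras_require wiring is an assumption about setup.py, and 0.1.dev20231010 is the value the thread settles on.)

from setuptools import setup

# Hypothetical sketch; the real setup.py defines much more metadata and may
# wire the version differently. The value matches the final pin in this PR.
_libtpu_version = '0.1.dev20231010'

setup(
    name='torch_xla',  # placeholder metadata
    extras_require={
        # Keep libtpu-nightly on the same nightly date as the Open XLA commit
        # so the TPU runtime and the compiler are built from matching sources.
        'tpu': [f'libtpu-nightly=={_libtpu_version}'],
    },
)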

@alanwaketan (Collaborator) left a comment:

LGTM. Let me enable TPU CI and wait until it finishes.

@qihqi merged commit 418c751 into master Oct 11, 2023
18 checks passed
zpcore pushed a commit that referenced this pull request Oct 19, 2023
Open XLA pin update - updated to 20231010
ghpvnist pushed a commit to ghpvnist/xla that referenced this pull request Oct 31, 2023
Open XLA pin update - updated to 20231010
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
Open XLA pin update - updated to 20231010
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
Open XLA pin update - updated to 20231010
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
Open XLA pin update - updated to 20231010
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
Open XLA pin update - updated to 20231010
@qihqi deleted the hanq/pin_update branch April 29, 2024 21:18