Add API to assemble CPU shards to a sharded tensor #5681

Merged: jonb377 merged 6 commits into master from jonbolin-assemble-shards on Oct 9, 2023


jonb377 (Collaborator) commented Oct 5, 2023

This PR reintroduces #5630, which was reverted in #5680 due to failing CI on master.

The following patch shows the difference between this and the original PR:

```diff
diff --git a/torch_xla/csrc/init_python_bindings.cpp b/torch_xla/csrc/init_python_bindings.cpp
index fe18f9508..421066ba7 100644
--- a/torch_xla/csrc/init_python_bindings.cpp
+++ b/torch_xla/csrc/init_python_bindings.cpp
@@ -1720,8 +1720,8 @@ void InitXlaModuleBindings(py::module m) {
               << " vs " << expected_shard_shape;
         }
 
-        auto data_handle = WrapXlaData(ShardingUtil::CreateShardedData(
-            shards, local_devices, sharding_spec));
+        auto data_handle = ShardingUtil::CreateShardedData(
+            shards, local_devices, sharding_spec);
         XLATensorPtr xla_tensor = XLATensor::Create(std::move(data_handle));
         xla_tensor->SetShardingSpec(*sharding_spec);
         auto tensor = bridge::AtenFromXlaTensor(std::move(xla_tensor));
```
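The hunk's context opens on a validation that reports a CPU shard's shape `vs` `expected_shard_shape` before `ShardingUtil::CreateShardedData` is called. As a minimal, hypothetical illustration of that kind of check (not the torch_xla implementation), the sketch below derives an expected shard shape for a tiled sharding and compares every shard against it. The helper names `ExpectedShardShape` and `AllShardsMatch` are assumptions made for this example, and the ceiling division assumes uneven dimensions are padded up, as XLA does for tiled shardings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a torch_xla API): computes the per-device shard
// shape for a tiled sharding, assuming dimension i of `global_shape` is split
// into `tiles[i]` pieces and uneven splits are padded up (ceiling division).
std::vector<int64_t> ExpectedShardShape(const std::vector<int64_t>& global_shape,
                                        const std::vector<int64_t>& tiles) {
  assert(global_shape.size() == tiles.size() && "rank mismatch");
  std::vector<int64_t> shard_shape;
  shard_shape.reserve(global_shape.size());
  for (std::size_t i = 0; i < global_shape.size(); ++i) {
    assert(tiles[i] > 0 && "tile count must be positive");
    // Ceiling division: a dimension of 10 split 4 ways yields shards of 3.
    shard_shape.push_back((global_shape[i] + tiles[i] - 1) / tiles[i]);
  }
  return shard_shape;
}

// Hypothetical check in the spirit of the error in the diff: every CPU shard
// must have exactly the expected shard shape before sharded data is created.
bool AllShardsMatch(const std::vector<std::vector<int64_t>>& shard_shapes,
                    const std::vector<int64_t>& expected) {
  for (const auto& shape : shard_shapes) {
    if (shape != expected) return false;
  }
  return true;
}
```

With every tile count set to 1 (the replicated case), the expected shard shape equals the global shape, so each shard must be a full copy of the tensor.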

jonb377 requested a review from alanwaketan on October 5, 2023 22:02

alanwaketan (Collaborator) left a comment:

LGTM.

jonb377 (Collaborator, Author) commented Oct 6, 2023

Kokoro failure is due to a dependency issue:

```
ERROR: Could not find a version that satisfies the requirement tf-nightly (from versions: none)
ERROR: No matching distribution found for tf-nightly
```

I'll merge after TPU CI. Thanks, Jiewen!

jonb377 (Collaborator, Author) commented Oct 9, 2023

Looking into the TPU CI failure, which is new since the rebase. The test passes locally on v4; it may be that it breaks with 8 devices.

jonb377 (Collaborator, Author) commented Oct 9, 2023

Surprisingly, the test actually failed on the original PR, but TPU CI still passed: https://github.com/pytorch/xla/runs/17442958115

jonb377 merged commit d9a9049 into master on Oct 9, 2023
jonb377 deleted the jonbolin-assemble-shards branch on October 9, 2023 21:15
qihqi pushed a commit that referenced this pull request Oct 10, 2023
* Add API to assemble CPU shards to a sharded tensor

* Handle replicated sharding

* Move validations into get_op_sharding

* Improve tests and error handling

* Don't WrapXlaData

* Fix test for v3
zpcore pushed a commit that referenced this pull request Oct 19, 2023
ghpvnist pushed a commit to ghpvnist/xla that referenced this pull request Oct 31, 2023
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024