Distribute Literal->Tensor copies across thread pool #5825
Conversation
Force-pushed from b9d8a7c to 8992668
@@ -796,13 +796,18 @@ std::vector<xla::Literal> ReleaseGilAndTransferData(

 std::vector<at::Tensor> XlaDataToTensors(
     absl::Span<const torch::lazy::BackendDataPtr> xla_data,
-    at::ScalarType dest_element_type) {
+    absl::Span<const at::ScalarType> dest_element_type) {
What's the reason for this change?
It seems like we never really call XlaDataToTensors with different dest_element_type values. @jonb377, are you introducing a new use case? If not, can we keep it as a singleton?
The reason to make dest_element_type a vector is actually the next change: I'm batching the local shard transfers for many tensors into a single XlaDataToTensors call. I probably should have kept this refactor with the upcoming change, but keeping it here makes that PR slightly smaller.
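For illustration, a hypothetical call site under the new span-based signature (shard_handles, shard_dtypes, and CollectShardHandles are illustrative names, not identifiers from this PR): the caller passes one destination dtype per data handle, so many shard transfers collapse into a single XlaDataToTensors call.

// Hypothetical call site; ATen/XLA includes omitted. Only XlaDataToTensors
// is from this PR -- the other names are placeholders.
std::vector<torch::lazy::BackendDataPtr> shard_handles = CollectShardHandles();
// One destination dtype per handle; here every shard happens to be float.
std::vector<at::ScalarType> shard_dtypes(shard_handles.size(),
                                         at::ScalarType::Float);
// dest_element_type[i] gives the destination type for xla_data[i], so a
// single call now covers every shard.
std::vector<at::Tensor> shards = XlaDataToTensors(shard_handles, shard_dtypes);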
 std::vector<at::Tensor> tensors(literals.size());
 absl::BlockingCounter counter(literals.size());
 for (size_t i = 0; i < tensors.size(); ++i) {
   auto copy_fn = [&, i]() {
Can you capture the variables you need explicitly?
Actually, you need just about every variable in this scope since it's pretty narrow. I take that back.
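For reference, a minimal sketch of the pattern under discussion, assuming a generic Schedule(closure) thread-pool helper and a LiteralToTensor conversion helper (both are stand-ins, not the actual torch_xla APIs): each Literal->Tensor copy becomes an independent task, and an absl::BlockingCounter blocks the caller until every copy has finished.

#include <utility>
#include <vector>

#include "absl/synchronization/blocking_counter.h"
#include "absl/types/span.h"
// ATen and XLA headers omitted for brevity.

// Sketch only: Schedule() and LiteralToTensor() are placeholders for the
// real thread-pool and literal-conversion helpers.
std::vector<at::Tensor> CopyLiteralsInParallel(
    std::vector<xla::Literal>& literals,
    absl::Span<const at::ScalarType> dest_element_type) {
  std::vector<at::Tensor> tensors(literals.size());
  absl::BlockingCounter counter(literals.size());
  for (size_t i = 0; i < tensors.size(); ++i) {
    auto copy_fn = [&, i]() {
      // Each task only touches literals[i] and tensors[i], so the copies are
      // independent and safe to run concurrently.
      tensors[i] = LiteralToTensor(literals[i], dest_element_type[i]);
      counter.DecrementCount();
    };
    Schedule(std::move(copy_fn));  // hand the copy off to the thread pool
  }
  counter.Wait();  // block until every copy has completed
  return tensors;
}

Using a blocking counter keeps the function's synchronous contract: callers still receive fully materialized tensors, only the per-tensor copies run concurrently.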
@@ -796,13 +796,18 @@ std::vector<xla::Literal> ReleaseGilAndTransferData(

 std::vector<at::Tensor> XlaDataToTensors(
     absl::Span<const torch::lazy::BackendDataPtr> xla_data,
-    at::ScalarType dest_element_type) {
+    absl::Span<const at::ScalarType> dest_element_type) {
   std::vector<xla::Literal> literals = ReleaseGilAndTransferData(xla_data);
I wonder if we should just be returning Tensors here.
I'm interested in making TransferFromServer return at::Tensor and cutting out the xla::Literal middleman, but that's still in the idea phase. For now I opted to keep this change smaller and just distribute the copy work over more cores.
Thanks for the reviews @will-cromar and @JackCaoG!
LGTM
Force-pushed from 8992668 to fa65980
Force-pushed from fa65980 to c8f7315
* Distribute Literal->Tensor copies across thread pool
* Update for #5799
After an xla::Literal has been created in TransferFromServer, it must be copied into an at::Tensor. This copy incurs significant overhead (up to 3x the transfer overhead after #5824), because the copies still occur synchronously on a single thread.
This change dispatches the copies to a thread pool. When checkpointing a 2B-parameter model, the copy overhead decreases from ~5000ms to ~611ms.*
*Note: these benchmarks were taken prior to #5799 and used the old threading library.