Use upstream XLA concurrency utilities #5799
Conversation
LGTM.
test/cpp/test_replication.cpp (outdated diff)

```diff
@@ -57,7 +57,7 @@ void TestSingleReplication(
   std::vector<std::vector<torch_xla::runtime::ComputationClient::DataPtr>>
       results(device_strings.size());
-  torch_xla::runtime::util::MultiWait mwait(device_strings.size());
+  absl::BlockingCounter mwait(device_strings.size());
```
let's rename it to `bc` or something, `mwait` would be confusing if the type is not `MultiWait`.
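For context, a minimal sketch of the `absl::BlockingCounter` interface with the suggested `bc` name; the surrounding loop and helper names are illustrative, not the actual test code:

```cpp
#include <string>
#include <thread>
#include <vector>

#include "absl/synchronization/blocking_counter.h"

// Illustrative only: one decrement per device thread, one blocking wait.
void WaitForAllDevices(const std::vector<std::string>& device_strings) {
  absl::BlockingCounter bc(device_strings.size());  // `bc`, not `mwait`
  std::vector<std::thread> threads;
  for (const std::string& device : device_strings) {
    threads.emplace_back([&bc, &device]() {
      // ... per-device work (e.g. transfer or execution) goes here ...
      bc.DecrementCount();  // atomic decrement; no lock on the common path
    });
  }
  bc.Wait();  // returns once the count reaches zero
  for (std::thread& t : threads) t.join();
}
```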
lgtm, minor nits
* Use TSL threadpool
* remove multiwait
* fix test build
* Move threadpool namespace
* formatting
* fix test build
* Use BlockingCounter
* Distribute Literal->Tensor copies across thread pool
* Update for #5799
@JackCaoG and I have both run into cases where our existing concurrency utilities add significant overhead (e.g. waiting >1ms for a lock when completing a `MultiWait` task), which functionally limits the number of threads we can spawn. This PR replaces two custom implementations of common utilities (thread pool and latch/`MultiWait`) with more optimized upstream equivalents:

* Replace our custom thread pool with `tsl::thread::ThreadPool`. The underlying implementation more carefully reuses threads and handles NUMA affinity to reduce context-switching costs.
  Before: (image)
  After: (image)
* Replace `MultiWait` with `absl::BlockingCounter`. Completing a task requires acquiring a lock, which can cost >1ms in practice with enough threads. `BlockingCounter` instead uses a lockless atomic counter, significantly reducing the latency to decrement the remaining task count. `MultiWait` still exists in upstream PyTorch if we need it for anything; I left usage of the upstream `MultiWait` alone.

We already depend on absl and TSL through OpenXLA, so this PR adds no new dependencies. A brief usage sketch of both utilities is shown below.
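For illustration, a minimal sketch (not the PR's actual code) of the pattern these upstream utilities enable: schedule tasks on a `tsl::thread::ThreadPool` and wait for them with an `absl::BlockingCounter`. The pool name, thread count, and task body are hypothetical, and TSL include paths can differ between versions:

```cpp
#include "absl/synchronization/blocking_counter.h"
#include "tsl/platform/env.h"
#include "tsl/platform/threadpool.h"

int main() {
  // TSL's pool reuses its worker threads across Schedule() calls.
  tsl::thread::ThreadPool pool(tsl::Env::Default(), "example_pool",
                               /*num_threads=*/8);

  constexpr int kNumTasks = 32;
  absl::BlockingCounter counter(kNumTasks);  // plays the role MultiWait used to

  for (int i = 0; i < kNumTasks; ++i) {
    pool.Schedule([&counter] {
      // ... one unit of work (e.g. one replica's execution) goes here ...
      counter.DecrementCount();  // atomic; no lock on the common path
    });
  }
  counter.Wait();  // block until every scheduled task has finished
  return 0;
}
```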
Tested with SPMD llama inference to confirm no regression in performance:
Baseline: (image)
With this PR: (image)
I found these performance benefits while working on #5737, which spawns significantly more threads and potentially reduces the synchronous time spent in `ExecuteReplicated`; see my comment about performance. But I don't expect this PR to provide significant benefits on its own.

Benefits of this PR alone: (image)

Future work:
* Check whether `low_latency_hint` makes a difference in performance. Operations through the thread pool are assumed to be low latency by default (see the sketch after this list).
* Remove the `tf_` prefix from thread names.
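For reference, a hedged sketch of how the pool could be built with the low-latency hint disabled, assuming the TSL `ThreadPool` constructor overload that takes `ThreadOptions` and a `low_latency_hint` flag; the function name, pool name, and include paths here are illustrative and may vary by TSL version:

```cpp
#include <memory>

#include "tsl/platform/env.h"
#include "tsl/platform/threadpool.h"

// Sketch only: construct the pool with low_latency_hint = false so TSL may
// trade some wakeup latency for less spinning. Whether this helps in practice
// is exactly the open question noted in the future-work item above.
std::unique_ptr<tsl::thread::ThreadPool> MakePool(int num_threads) {
  tsl::ThreadOptions options;  // default stack/guard sizes
  return std::make_unique<tsl::thread::ThreadPool>(
      tsl::Env::Default(), options, "example_pool", num_threads,
      /*low_latency_hint=*/false);
}
```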