Support PJRT C API create_options
#6289
Conversation
Our resnet example actually works on 4 v100s with the plugin now!
"allocator": "cuda_async" if xu.getenv_as("PJRT_ALLOCATOR_CUDA_ASYNC", bool, False) else "default", | ||
"memory_fraction": xu.getenv_as("PJRT_ALLOCATOR_FRACTION", float, .75), | ||
"preallocate": xu.getenv_as("PJRT_ALLOCATOR_PREALLOCATE", bool, True), |
Are these env vars new?
These all already exist in env_vars.cc/h.
Do you want to add any tests for the GPU plugin?
I don't actually have a good build process set up yet for this plugin. I'll work on that this week. Unless we find an issue, I'm inclined to just move the whole CI over to use the plugin once the build is automatable. WDYT?
    return {
        "platform_name": "gpu",
        # TODO(wcromar): make this configurable
        "allocator": "cuda_async" if xu.getenv_as("PJRT_ALLOCATOR_CUDA_ASYNC", bool, False) else "default",
For these 3 settings, is it possible not to use the hardcoded defaults (False, .75, True)? For example:
xla/torch_xla/csrc/runtime/pjrt_registry.cc
Lines 22 to 27 in 4bf8d44
auto allocator_config = xla::GpuAllocatorConfig{};
if (sys_util::GetEnvString(env::kEnvPjrtAllocatorCudaAsync, "").empty() &&
    sys_util::GetEnvString(env::kEnvPjrtAllocatorPreallocate, "").empty() &&
    sys_util::GetEnvString(env::kEnvPjrtAllocatorFraction, "").empty()) {
  return allocator_config;
}
Good catch. Removed some options entirely when the environment variable is not set.
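For context, a minimal sketch of the omit-when-unset approach, consistent with the dict comprehension that appears later in this diff. The helper name `_gpu_client_create_options` is made up for illustration, and it assumes `xu.getenv_as` returns the given default (here `None`) when the variable is unset:

```python
import torch_xla.utils.utils as xu

def _gpu_client_create_options() -> dict:
  options = {
      "platform_name": "gpu",
      "allocator": "cuda_async" if xu.getenv_as(
          "PJRT_ALLOCATOR_CUDA_ASYNC", bool, False) else "default",
      # These return None when the env var is unset, so the plugin keeps
      # its own default instead of a hardcoded one.
      "memory_fraction": xu.getenv_as("PJRT_ALLOCATOR_FRACTION", float, None),
      "preallocate": xu.getenv_as("PJRT_ALLOCATOR_PREALLOCATE", bool, None),
  }
  # Drop any option whose environment variable was not provided.
  return {k: v for k, v in options.items() if v is not None}
```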
@will-cromar What will the UX be (for both building from source and installing from the whl)? I want to make sure we don't require users to do additional steps to use the plugin after we make it the default.
For installing, we can add an extra requirement like we have for …. For building, you may have to build both the plugin and ….
Note: it may be possible to use the torch.distributed store here directly instead.

We briefly chatted about this offline, but perhaps we should push XlaCoordinator's distributed KV store up into the Python layer to replace TCPStore in our XLA backend implementation, instead of dropping XlaCoordinator. Autocheckpointing will still require XlaCoordinator even if we move to torch.distributed's KV store.
Just raising this for discussion. Overall this looks great, thanks Will!
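To make the idea above concrete, here is a rough sketch of backing a PJRT-style key-value store with torch.distributed's TCPStore from Python. The helper names and callback shape are assumptions for illustration, not the API in this PR:

```python
import datetime
import os

import torch.distributed as dist

def make_kv_store() -> dist.TCPStore:
  rank = int(os.environ.get("RANK", "0"))
  world_size = int(os.environ.get("WORLD_SIZE", "1"))
  # Positional args: host, port, world size, whether this process hosts the store.
  return dist.TCPStore(
      os.environ.get("MASTER_ADDR", "localhost"),
      int(os.environ.get("MASTER_PORT", "29500")),
      world_size,
      rank == 0,
      timeout=datetime.timedelta(minutes=5),
  )

# PJRT-style callbacks backed by the store; values are raw bytes.
def kv_get(store: dist.TCPStore, key: str) -> bytes:
  return bytes(store.get(key))  # blocks until some process sets `key`

def kv_put(store: dist.TCPStore, key: str, value: bytes) -> None:
  store.set(key, value)
```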
struct PluginEntry {
  std::string library_path;
  absl::flat_hash_map<std::string, xla::PjRtValueType> create_options;
  bool init_coordinator;
I wonder if we should drop init_coordinator and instead always initialize the coordinator when the distributed env vars are set, even on TPU where it's not strictly necessary. As long as torchrun launches the training in a distributed context, the env vars should be set, which I believe covers all GPU use cases, since we plan to use torchrun for GPU SPMD (cc @vanbasten23).
On TPU it's not currently required, but if that changes we can always detect the environment from the GCE metadata and set the env vars automatically for the user in a distributed context, since we don't require torchrun for multicontroller execution.
I wonder if we should drop init_coordinator and instead always initialize the coordinator when the distributed env vars are set, even on TPU where it's not strictly necessary.
In this case, I think we still want to keep the requires_xla_coordinator option. We would just throw an error immediately if we don't have enough information to start the coordinator.

My other idea initially was that we could ask the plugin for the master IP, local rank, global rank, and world size, perhaps just asking torch.distributed for those values in the default implementation.
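A hedged sketch of that alternative, where a default implementation simply reads the torchrun-style environment variables. The `ProcessInfo` type and the `process_info` method name are made up for illustration:

```python
import os
from typing import NamedTuple

class ProcessInfo(NamedTuple):
  master_addr: str
  local_rank: int
  global_rank: int
  world_size: int

class DevicePlugin:
  def process_info(self) -> ProcessInfo:
    # Default: rely on the env vars torchrun (or the user) sets.
    return ProcessInfo(
        master_addr=os.environ.get("MASTER_ADDR", "localhost"),
        local_rank=int(os.environ.get("LOCAL_RANK", "0")),
        global_rank=int(os.environ.get("RANK", "0")),
        world_size=int(os.environ.get("WORLD_SIZE", "1")),
    )
```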
I see - that makes sense. Just for context on why I brought this up: JAX recently started requiring the coordinator to be initialized before the backend can be used (even on TPUs), but I'm not sure of the reason.

Keeping init_coordinator sounds fine to me. Thanks Will!
When debugging the PJRT C API for GPU, right now I can change pytorch/xla/WORKSPACE to use a local OpenXLA, add ….
The GPU plugin binary still uses the same bazel workspace, so any changes you make there will apply to both the plugin and ….
Registering the plugins right away is going to be inevitably flaky, because they will register their create options immediately. This doesn't allow the user to change settings at all after …. I'll go learn how to define the plugin base class in C++ so the ….
I started work on defining the interface in C++ in #6360. Originally, I wanted to wait to merge this PR until #6022 goes in, but that is apparently blocked by an XLA bug. What do you all think of merging this PR now to unblock other work? It does not alter the default behavior at all, so it will not be a breaking change.
  std::shared_ptr<xla::KeyValueStoreInterface> kv_store = nullptr;
  if (plugin->init_coordinator) {
    int global_process_rank = sys_util::GetEnvInt("RANK", 0);
    int global_world_size = sys_util::GetEnvInt("WORLD_SIZE", 1);
Didn't we already get it in plugin->create_options in Python?
create_options doesn't include the rank and world size. In the future, we should actually just be creating the XlaCoordinator in Python and using it as torch.distributed's Store. See #6289 (review).
-    return int(os.getenv('GPU_NUM_DEVICES', '1'))
+    return xu.getenv_as('GPU_NUM_DEVICES', int, 1)

   def client_create_options(self) -> dict:
Who will call this client_create_options method here?
plugins.py will call it during plugin registration in this PR. In the follow-up, this will be called when the client is created.
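Roughly, the flow looks like this. It is only a sketch: `register_plugin` and the module-level dict are assumptions standing in for the real `plugins.py` internals:

```python
from typing import Dict

class DevicePlugin:
  """Simplified view of the plugin interface."""

  def library_path(self) -> str:
    raise NotImplementedError

  def client_create_options(self) -> Dict:
    return {}  # no extra options by default

_plugins: Dict[str, DevicePlugin] = {}

def register_plugin(name: str, plugin: DevicePlugin) -> None:
  # In this PR, the options are read here at registration time; the
  # follow-up defers this to client creation.
  _plugins[name] = plugin
  options = plugin.client_create_options()
  # ...forward name, plugin.library_path(), and options to the C++ registry.
```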
@@ -10,4 +11,37 @@ def library_path(self) -> str:

   def physical_chip_count(self) -> int:
     # TODO: default to actual device count
-    return int(os.getenv('GPU_NUM_DEVICES', '1'))
+    return xu.getenv_as('GPU_NUM_DEVICES', int, 1)
Perhaps irrelevant to this PR, but I just want to confirm that the # TODO: default to actual device count still holds, since GPU_NUM_DEVICES is not always set and the default value may not be 1.
I see it's only used in run_multiprocess. So it sounds like using GPU_NUM_DEVICES preserves the current behavior. Looks good to me then.
Yeah, this is the same as the current behavior. Ideally this should check the PCI device IDs like we do for TPUs.
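For reference, a sketch of what checking PCI devices could look like for NVIDIA GPUs, mirroring the TPU-style vendor-ID scan. The sysfs path, vendor ID, and filtering are assumptions and may over-count non-GPU functions:

```python
import glob

_NVIDIA_PCI_VENDOR_ID = "0x10de"  # assumed NVIDIA PCI vendor ID

def physical_gpu_count() -> int:
  count = 0
  for vendor_path in glob.glob("/sys/bus/pci/devices/*/vendor"):
    try:
      with open(vendor_path) as f:
        vendor = f.read().strip()
    except OSError:
      continue
    # Counts every NVIDIA PCI function, including e.g. audio controllers
    # on some cards; a real implementation should also check the device class.
    if vendor == _NVIDIA_PCI_VENDOR_ID:
      count += 1
  return count
```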
    return {k: v for k, v in options.items() if v is not None}

  def requires_xla_coordinator(self) -> bool:
    return True
I wonder why it always returns True. For single-process runs, we probably don't need the coordinator? So should it depend on whether it is a single process?
I think in a previous draft I caught this case in InitializePjrt. But as @jonb377 says, a plugin may always require the coordinator. I'll go ahead and update this to return global_world_size > 1.
  std::shared_ptr<xla::KeyValueStoreInterface> kv_store = nullptr;
  if (plugin->init_coordinator) {
    int global_process_rank = sys_util::GetEnvInt("RANK", 0);
    int global_world_size = sys_util::GetEnvInt("WORLD_SIZE", 1);
Sorry I didn't catch it earlier. What if the users start single-host training in a non-torchrun way, such as PJRT_DEVICE=CUDA GPU_NUM_DEVICES=4 python3 xla/test/test_train_mp_imagenet.py? Then global_world_size should default to local_world_size? Similar reasoning for global_process_rank.
This case is slightly wrong. The precedence should be $PJRT_LOCAL_PROCESS_COUNT, then $WORLD_SIZE, then 1 (default). I'll fix it.
Shouldn't it be WORLD_SIZE -> PJRT_LOCAL_PROCESS_COUNT or LOCAL_WORLD_SIZE -> 1? If we use torchrun, then WORLD_SIZE will be set. If we use GPU_NUM_DEVICES=4 python3 (non-torchrun, single-host multi-process), then WORLD_SIZE is not set and we rely on PJRT_LOCAL_PROCESS_COUNT or LOCAL_WORLD_SIZE.
Yeah, you're right. I tripped over this while testing as well.
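In Python terms, the agreed precedence reads roughly like this (sketch only; the actual fix lives in the C++ runtime):

```python
import os

def global_world_size() -> int:
  # WORLD_SIZE is set by torchrun; the other two cover non-torchrun
  # single-host multi-process launches; fall back to 1 otherwise.
  for var in ("WORLD_SIZE", "PJRT_LOCAL_PROCESS_COUNT", "LOCAL_WORLD_SIZE"):
    value = os.environ.get(var)
    if value is not None:
      return int(value)
  return 1
```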
LGTM. Thanks!
LGTM!
- Pass the DevicePlugin implementation to GetPjRtCApiClient. Pybind does most of the hard work here, luckily.
- Initialize the XlaCoordinator before the client if required for client initialization. Note: it may be possible to use the torch.distributed store here directly instead.
- See #6242 for broader context.