Make the pjrt gpu allocator configurable #5759

anw90 · 2023-11-02T01:46:49Z

This PR makes the PJRT GPU allocator configurable.
The default memory allocator fraction is 0.75, which is too small.

vanbasten23 · 2023-11-02T22:31:28Z

torch_xla/csrc/runtime/pjrt_computation_client.cc

+      sys_util::GetEnvBool(env::kEnvPjrtAllocatorPreallocate, true);
+  allocator_config.memory_fraction =
+      sys_util::GetEnvDouble(env::kEnvPjrtAllocatorFraction, 0.9);
+  return allocator_config;


This seems to change the default behavior. If none of kEnvPjrtAllocatorCudaAsync, kEnvPjrtAllocatorPreallocate, kEnvPjrtAllocatorFraction is set, could you just return xla::GpuAllocatorConfig{} as before?

If kEnvPjrtAllocatorFraction is set to 0.75, it is the same as xla::GpuAllocatorConfig{}. Should the default value be changed from 0.9 to 0.75?

In our previous experience with GPUs, 0.9 is considered a reasonable value for the memory fraction.

Yeah, let's use the default value from xla:gpu https://github.com/openxla/xla/blob/7ab5df624ff1d98804999b03b21abecd14ec57a6/xla/pjrt/gpu/gpu_helpers.h#L41-L60.
With the new flags, it gives you the flexibility to choose the best configuration that suits your needs.

Actually, if none of kEnvPjrtAllocatorCudaAsync, kEnvPjrtAllocatorPreallocate, kEnvPjrtAllocatorFraction is set, could you just return xla::GpuAllocatorConfig{} as before?

The reason is that, with the current implementation, if xla:GPU changes its default values, we have to update ours as well, and we may forget to do so or not know that the values have changed. By just returning xla::GpuAllocatorConfig{}, we pick up whatever default values xla:GPU has set.

ok. When none of kEnvPjrtAllocatorCudaAsync, kEnvPjrtAllocatorPreallocate, kEnvPjrtAllocatorFraction is set, just return xla::GpuAllocatorConfig{}.
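The agreed behavior can be sketched roughly as follows. This is a minimal, standalone illustration, not the merged code: `AllocatorConfig` is a simplified stand-in for `xla::GpuAllocatorConfig` (the real struct is not reproduced here), the environment variable strings are assumed names for the `kEnvPjrtAllocator*` constants, and `IsSet`, `EnvBool`, and `EnvDouble` are hypothetical helpers standing in for the `sys_util` functions.

```cpp
#include <cstdlib>
#include <string>

// Simplified stand-in for xla::GpuAllocatorConfig. The field names and the
// default values here are assumptions made for this sketch.
struct AllocatorConfig {
  bool cuda_async = false;
  bool preallocate = true;
  double memory_fraction = 0.75;
};

// Assumed environment variable names (hypothetical for this sketch).
constexpr const char* kCudaAsyncEnv = "PJRT_ALLOCATOR_CUDA_ASYNC";
constexpr const char* kPreallocateEnv = "PJRT_ALLOCATOR_PREALLOCATE";
constexpr const char* kFractionEnv = "PJRT_ALLOCATOR_FRACTION";

// Returns true if the variable is present in the environment.
bool IsSet(const char* name) { return std::getenv(name) != nullptr; }

// Reads a boolean variable; "1" and "true" count as true.
bool EnvBool(const char* name, bool fallback) {
  const char* value = std::getenv(name);
  if (value == nullptr) return fallback;
  const std::string s(value);
  return s == "1" || s == "true";
}

// Reads a floating-point variable, or returns the fallback if it is unset.
double EnvDouble(const char* name, double fallback) {
  const char* value = std::getenv(name);
  return value == nullptr ? fallback : std::strtod(value, nullptr);
}

AllocatorConfig GetAllocatorConfig() {
  AllocatorConfig config;  // Starts from the library's own defaults.
  if (!IsSet(kCudaAsyncEnv) && !IsSet(kPreallocateEnv) &&
      !IsSet(kFractionEnv)) {
    // Nothing was overridden: return the defaults untouched, so a change
    // to the upstream defaults is picked up automatically.
    return config;
  }
  // At least one variable is set: override only what the user specified and
  // keep the library default for everything else.
  config.cuda_async = EnvBool(kCudaAsyncEnv, config.cuda_async);
  config.preallocate = EnvBool(kPreallocateEnv, config.preallocate);
  config.memory_fraction = EnvDouble(kFractionEnv, config.memory_fraction);
  return config;
}
```

Using the default-constructed struct as the fallback for each field means no default value is hard-coded a second time on the caller's side, which is the concern raised in the review.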

vanbasten23 · 2023-11-02T22:32:45Z

I wonder what problem you run into when memory allocator fraction is set to 0.75.

anw90 · 2023-11-03T02:49:50Z

> I wonder what problem you run into when memory allocator fraction is set to 0.75.

On an 80 GB H100/A100, 20 GB of memory is wasted when the memory allocator fraction is set to 0.75. For distributed training jobs, NCCL does not require that much memory.

GPU memory is particularly constrained, especially during LLM training. Therefore, setting the memory allocator fraction to 0.9 makes memory usage more efficient.

vanbasten23

LGTM

* Make the pjrt gpu allocator configurable
* the default value changed from 0.9 to 0.75
* return default GpuAllocatorConfig

---------

Co-authored-by: wangang.wa <[email protected]>

anw90 force-pushed the mem_config branch from 2579ced to 82163e9 Compare November 2, 2023 02:04

Make the pjrt gpu allocator configurable

5f81944

anw90 force-pushed the mem_config branch from 82163e9 to 5f81944 Compare November 2, 2023 02:08

JackCaoG requested review from will-cromar and vanbasten23 November 2, 2023 17:33

vanbasten23 reviewed Nov 2, 2023

anw90 added 2 commits November 6, 2023 10:12

the default value changed from 0.9 to 0.75

5395901

return default GpuAllocatorConfig

1cb36aa

vanbasten23 approved these changes Nov 8, 2023

vanbasten23 merged commit 56733fb into pytorch:master Nov 13, 2023
17 checks passed

mbzomowski mentioned this pull request Nov 16, 2023

tpu ci module refactor mbzomowski-test-org/xla#7

Merged


